CN107992469A - A method and system for detecting phishing URLs based on word sequences

Info

Publication number: CN107992469A
Application number: CN201710952360.5A
Authority: CN
Inventors: 亚静; 柳厅文; 时金桥; 张盼盼; 张振宇; 王玉斌; 李全刚
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2017-10-13
Filing date: 2017-10-13
Publication date: 2018-05-04

Abstract

The present invention provides a method and system for detecting phishing URLs based on word sequences. The URL string is segmented into words to obtain a vector representation of the word sequence, and a deep learning model then automatically learns the contextual information and features in the word sequence, so the word-related text features in the URL need not be extracted manually. The trained model is used to detect phishing URLs, which addresses the problems encountered by existing phishing URL detection based on word features.

Description

A method and system for detecting phishing URLs based on word sequences

Technical Field

The invention relates to the field of information security, and in particular to a method and system for detecting phishing URLs based on word sequences.

Background

A phishing URL is a form of phishing that masquerades as a reputable corporate website in order to obtain sensitive user information such as usernames, passwords, and credit card details. Phishing URLs usually claim to come from popular social networking sites (including YouTube, Facebook, and Twitter), auction sites (eBay), online shopping sites (PayPal, Alibaba, etc.), or network administrators (Google, Yahoo, Internet service providers), in order to win the victim's trust. A common deception is to embed keywords in the URL that confuse users; for example, attackers use URLs of the form "login.mydomain.tld/paypal" to deceive PayPal users.

At present, both in research and in commercial products, there are many phishing URL detection methods and security products. Most of them rely on manually extracting features from URL-related data, building a classification model, and classifying URLs in order to detect phishing URLs. Depending on the data analyzed, existing detection methods fall into two categories: methods based on multi-source information and methods based on the URL itself.

Detection methods based on multi-source information need to collect several kinds of URL-related data, including Alexa rankings, WHOIS information, and web page content, and construct complex models trained on labeled data to detect whether an unknown URL is a phishing URL. Such methods usually achieve relatively high accuracy, but collecting the data incurs substantial additional overhead in resources and time, so they are not suitable for real-time detection in high-speed networks.

Detection methods based on the URL itself analyze only the text features of the URL string to build a classification model. They are lightweight and suitable for real-time detection.

Specifically, a phishing detection method based on the URL itself extracts the text features of the URL string and trains a classification model to detect phishing URLs. The text features of the URL string fall into two classes: character features and word features. Character features describe the characters that make up the URL string, including its length in characters, the ratio of vowels to consonants, the number of digits, the number of special symbols, and the entropy of the character distribution. Word features describe the semantically meaningful words contained in the URL and their frequency of occurrence, such as the words login and update that are common in URLs and popular well-known brands such as paypal and alibaba.

Lightweight phishing detection based on the URL itself better meets the need for real-time response in high-speed networks. Character-based features, however, ignore the semantic information contained in URLs. URLs are meant to be easy for people to remember, so they are usually readable and memorable and contain several meaningful common words. Moreover, in phishing attacks, attackers often use keywords to confuse users.

Most existing phishing URL detection methods based on word features use words and their frequency of occurrence as features and do not consider the word sequence features contained in the URL. These features are also designed by hand, which has several limitations. First, manual feature extraction requires a great deal of manpower and resources to analyze statistics and verify the effectiveness of the features. Second, manually extracted features are usually effective only for a certain type of data and are not robust. Third, the keywords that attackers use in phishing URLs are usually similar to those in normal URLs, which is what allows them to confuse users, and this similarity reduces the detection effectiveness of the classification model.

Summary of the Invention

In view of the above deficiencies of the prior art, the object of the present invention is to provide a method and system for detecting phishing URLs based on word sequences. The URL string is segmented into words to obtain a vector representation of the word sequence, and a deep learning model then automatically learns the contextual information and features in the word sequence, so the word-related text features in the URL need not be extracted manually; the trained model is used to detect phishing URLs. This solves the problems, described above, of existing phishing URL detection based on word features.

To achieve the above object, the present invention adopts the following technical solution:

A method for detecting phishing URLs based on word sequences, comprising the following steps:

converting labeled URLs into word sequence vectors as training data;

training a classification model with the training data;

converting an unknown URL into a word sequence vector and inputting it into the trained classification model for labeling.

Further, converting a labeled URL or an unknown URL into a word sequence vector includes:

filtering out the protocol and the generic top-level domain (gTLD) from the labeled URL or unknown URL;

splitting the part that remains after filtering into segments, and segmenting the character string of each segment into words using a dictionary and forward maximum matching, to obtain a word sequence;

numbering all the words in the dictionary starting from 1, so that each word has a unique number, and converting the word sequence of each labeled URL or unknown URL into a fixed-length vector of numbers.

Further, the protocols include http, https, ftp, ftps, and gopher; the gTLDs include com, org, net, edu, and gov.

Further, segmenting into words using a dictionary and forward maximum matching includes:

determining whether the entire string is in the dictionary, and if so, performing no further segmentation;

if not, removing the last character and determining whether the remaining string is in the dictionary;

repeating this check until a word in the dictionary is matched, and then removing the matched word from the string;

continuing the above steps on the remainder of the string until the whole string has been processed;

if the string contains no word in the dictionary, splitting it into single characters.

Further, the dictionary is the Google English word corpus published by Peter Norvig.

Further, a bidirectional LSTM model based on word sequences is selected as the classification model trained with the training data.

Further, training the classification model with the training data includes:

randomly dividing the training data into a training part and a validation part, and training the bidirectional LSTM model by setting parameters of the neural network model such as its hyperparameters and activation functions.

Further, the bidirectional LSTM model is a four-layer neural network comprising an embedding layer, a bidirectional LSTM layer, a dropout layer, and a sigmoid layer, and training the classification model with the training data also includes applying dropout to the output of the bidirectional LSTM layer to prevent overfitting.

A system for detecting phishing URLs based on word sequences, including:

a conversion module and a classification model to be trained;

the conversion module is used to convert labeled URLs into word sequence vectors as training data for the classification model, and to convert an unknown URL into a word sequence vector and input it into the trained classification model for labeling.

As described above, the method and system provided by the present invention do not require any features to be extracted manually. The URL only needs to be converted into a word sequence vector representation, and a deep neural network (a bidirectional LSTM model) automatically learns the contextual information and features in the word sequence in order to detect phishing URLs.

Compared with traditional techniques for detecting phishing URLs, the invention has the following advantages:

First, there is no need to collect additional URL-related data or to manually extract text features from the URL. A deep learning model automatically learns the contextual information and features of the URL's word sequence and uses them to detect phishing URLs, which significantly reduces overhead.

In addition, by deeply mining the contextual information and features contained in the URL's word sequence, the method achieves better detection results on the same dataset than both machine learning models based on manually extracted word features and a deep learning model based on character sequences.

Finally, with the method and system of the present invention, the trained model running on an ordinary server predicts no fewer than 600 URLs per second in a single thread. The method therefore improves detection accuracy while also meeting the needs of real-time detection.

Brief Description of the Drawings

FIG. 1 is a schematic flowchart of a method for detecting phishing URLs based on word sequences in an embodiment of the present invention.

FIG. 2 is a schematic structural diagram of the bidirectional LSTM model used in a method for detecting phishing URLs based on word sequences in an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention.

In one embodiment of the present invention, a method and system for detecting phishing URLs based on word sequences are provided. The main steps of the method are:

(1) Word sequence vector representation: a dictionary-matching method is first used to obtain the keyword sequence contained in the URL, and the vector representation of the URL's word sequence is then obtained by dictionary encoding;

(2) Model training: using the word sequence vectors obtained in the previous step, the labeled training data is used to train a bidirectional LSTM model based on word sequences;

(3) Phishing URL detection: the trained bidirectional LSTM model based on word sequences is used to detect whether an unknown URL is a phishing URL.

The system includes a conversion module and a classification model to be trained.

The conversion module is used to convert labeled URLs into word sequence vector representations as training data for the classification model, and to convert an unknown URL into a word sequence vector representation and input it into the trained classification model for labeling.

The word sequence vector representation step of the method obtains the vector representation of the URL's word sequence. It consists of the following steps:

i) First, the well-known protocol and the generic top-level domain (gTLD) are filtered out of the URL. Common protocols include http, https, ftp, ftps, and gopher; there are 14 gTLDs, including com, org, net, edu, and gov.

ii) The remaining part is first split on symbols, and each segment is then segmented into words using a pre-prepared dictionary and forward maximum matching. With reference to the pseudocode shown in Algorithm 1, the segmentation proceeds as follows: first determine whether the entire string is in the dictionary; if it is, no further segmentation is needed. If it is not, remove the last character and determine whether the remaining string is in the dictionary, repeating until a word in the dictionary is matched. Then remove the matched word and continue the same steps on the remainder of the string until the whole string has been processed. If the string contains no word in the dictionary, it is split into single characters.

The dictionary used in this segmentation is the Google English word corpus published by Peter Norvig (containing 333,333 English words). Other English word dictionaries are not used, because this dictionary was compiled by Peter Norvig from the words commonly used in web pages and therefore better matches the way URLs are named.

iii) Then, all the words in the dictionary are numbered starting from 1, so that each word has a unique number, and the word sequence of each URL is converted into a fixed-length vector of numbers.
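As an illustration of steps i) and ii), the following Python sketch strips the protocol and gTLD, splits the remainder on symbols, and applies forward maximum matching. It is a minimal sketch under stated assumptions rather than the patented implementation: the function names and the tiny `TOY_DICTIONARY` are hypothetical stand-ins (the text uses Peter Norvig's 333,333-word corpus), only the five gTLDs actually named in the text are listed, and lower-casing and splitting on every non-alphanumeric character are choices made here for simplicity.

```python
import re

# Protocols and gTLDs named in the text. The text mentions 14 gTLDs but
# lists only these five, so the set below is incomplete by construction.
PROTOCOLS = ("http", "https", "ftp", "ftps", "gopher")
GTLDS = {"com", "org", "net", "edu", "gov"}

# Hypothetical toy dictionary standing in for the real word corpus.
TOY_DICTIONARY = {"shen", "mansell", "tripod", "games", "game", "boy", "html"}


def strip_protocol_and_gtld(url):
    """Remove a leading 'scheme://' and a gTLD at the end of the host."""
    pattern = "^(?:" + "|".join(PROTOCOLS) + ")://"
    rest = re.sub(pattern, "", url, flags=re.IGNORECASE)
    host, sep, path = rest.partition("/")
    labels = host.split(".")
    if labels[-1].lower() in GTLDS:
        labels = labels[:-1]
    return ".".join(labels) + sep + path


def forward_max_match(segment, dictionary):
    """Greedy longest-prefix segmentation; unmatched text becomes single characters."""
    words = []
    start = 0
    while start < len(segment):
        end = len(segment)
        # Drop the last character until the candidate is a dictionary word
        # or only one character is left.
        while end > start + 1 and segment[start:end] not in dictionary:
            end -= 1
        words.append(segment[start:end])
        start = end
    return words


def url_to_words(url, dictionary):
    """Filter, split on symbols (assumed: any non-alphanumeric run), then segment."""
    remainder = strip_protocol_and_gtld(url).lower()
    segments = [s for s in re.split(r"[^a-z0-9]+", remainder) if s]
    words = []
    for segment in segments:
        words.extend(forward_max_match(segment, dictionary))
    return words


print(url_to_words("http://shen.mansell.tripod.com/games/gameboy.html", TOY_DICTIONARY))
# ['shen', 'mansell', 'tripod', 'games', 'game', 'boy', 'html']
```

With the real corpus the segmentation of a given URL may differ from this toy output, since forward maximum matching always prefers the longest dictionary word at each position.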

In the model training step of the method, the labeled vectors obtained in the previous step are used as training data to train the bidirectional LSTM model based on word sequences. The training sample set is randomly divided into a training part and a validation part (about 80% and 20% of all labeled data, respectively), and the bidirectional LSTM model is trained by setting parameters of the neural network model such as its hyperparameters (the output dimension of each layer, etc.) and activation functions. The deep learning model is a multi-layer neural network with four layers: an embedding layer, a bidirectional LSTM layer, a dropout layer, and a sigmoid layer. Dropout is applied to the output of the bidirectional LSTM layer to prevent overfitting.
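The random 80%/20% division of the labeled data could look like the following Python sketch. The function name `split_labeled`, the fixed seed, and the `(vector, label)` tuple layout are assumptions made for illustration; the text only states that the split is random and gives the approximate proportions.

```python
import random


def split_labeled(samples, train_fraction=0.8, seed=0):
    """Shuffle labeled samples and split them into training and validation parts."""
    shuffled = list(samples)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded only to make the sketch repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Ten hypothetical (word sequence vector, label) pairs of length 13.
samples = [([i] + [0] * 12, i % 2) for i in range(1, 11)]
train_part, validation_part = split_labeled(samples)
print(len(train_part), len(validation_part))  # 8 2
```

A fixed seed keeps the two parts stable between runs, which makes validation scores comparable while hyperparameters are being adjusted.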

The phishing URL detection step of the method determines whether unlabeled data, that is, an unknown URL, is a phishing URL. The word sequence vector of the unknown URL is input into the trained bidirectional LSTM model for labeling. An output of 1 indicates a phishing URL; otherwise the URL is normal.

The method is further illustrated with an example. The overall flow of the method for detecting phishing URLs based on word sequences is shown in FIG. 1, and the structure of the bidirectional LSTM model based on word sequences is shown in FIG. 2.

Take the phishing URL http://shen.mansell.tripod.com/games/gameboy.html, which is labeled 1, as an example. The URL is converted into a fixed-length word sequence vector and used to train the bidirectional LSTM model, and the trained model is then used to check the unknown URL http://fly-project.net//yahoo.link/Yah/T/Y.html.

1) First, the input URLs are converted into word sequence vectors. The URLs are first segmented into words using the pre-prepared dictionary.

The words in the dictionary are then numbered, and each word sequence is expressed as a fixed-length vector of length N. The value of N can be chosen from statistics: more than 90% of URLs were found to contain 13 words or fewer, so N = 13 is used. The two URLs then yield the vectors (1,4,5,6,7,11,13,0,0,0,0,0,0) and (2,19,3,9,12,8,14,0,0,0,0,0,0), respectively.
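The numbering and fixed-length conversion can be sketched in Python as follows. This is an illustration under assumptions, not the patent's code: the names `build_vocab` and `encode` are hypothetical, zeros are appended at the end to match the trailing zeros in the example vectors, and the handling of tokens outside the dictionary (a single reserved ID) and of sequences longer than N (keeping the first N words) is not specified in the text and is chosen here only to make the sketch complete.

```python
SEQ_LEN = 13  # N = 13, as chosen in the text


def build_vocab(dictionary_words):
    """Number dictionary words from 1; 0 is left free for padding."""
    return {word: number for number, word in enumerate(dictionary_words, start=1)}


def encode(words, vocab, seq_len=SEQ_LEN):
    """Map words to their numbers and pad with zeros (or truncate) to seq_len."""
    unknown_id = len(vocab) + 1  # assumption: one shared ID for out-of-dictionary tokens
    ids = [vocab.get(word, unknown_id) for word in words]
    ids = ids[:seq_len]          # assumption: keep the first seq_len words
    return ids + [0] * (seq_len - len(ids))


# Hypothetical three-word dictionary, for illustration only.
vocab = build_vocab(["shen", "mansell", "tripod"])
print(encode(["shen", "tripod", "zzz"], vocab))
# [1, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Reserving 0 for padding is what allows numbering to start at 1, as the text requires, without a padded position being mistaken for a real word.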

The word sequence vector representations of all URLs in the sample set are obtained in the same way. The sample set contains labeled normal URLs and labeled phishing URLs.

2) The labeled data in the set of word sequence vectors is input as training data into the bidirectional LSTM model based on word sequences shown in FIG. 2. The word sequence vector of a URL is first input to the embedding layer for dimensionality reduction and then to the bidirectional LSTM layer for learning; the result is passed to the dropout layer to prevent overfitting, and the final sigmoid layer outputs the detection result. A label of 1 denotes a phishing URL and a label of 0 a normal URL, so this is a binary classification problem and the model output uses the sigmoid function for 0-1 classification.
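For readers who want the four layers written out, the block below gives the standard textbook formulation of an embedding lookup, an LSTM step, and a sigmoid output. It is not taken from the patent: the symbols (E, W, U, b, w) are introduced here for illustration, and combining the two directions by concatenating their final hidden states is an assumption, since the text does not say how the forward and backward outputs are merged.

```latex
\begin{align}
% Embedding: word number w_t at position t = 1, ..., N is mapped to a dense vector
x_t &= E_{w_t} \\
% One LSTM step (forward direction; the backward direction uses h_{t+1}, c_{t+1}
% and its own parameters)
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t) \\
% Assumed merge of the two directions, followed by dropout and the sigmoid output
h &= \bigl[\overrightarrow{h}_N \,;\, \overleftarrow{h}_1\bigr] \\
\hat{y} &= \sigma\bigl(w^{\top}\,\mathrm{dropout}(h) + b\bigr)
\end{align}
```

Here \sigma is the logistic function, \odot is element-wise multiplication, and \hat{y} lies in (0, 1), which is why a sigmoid output suits the 0-1 labeling described above.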

All labeled data is input into the model for training, and the trained model is output.

3) For unlabeled data, the vector is input into the trained model and the labeling result is output. An output of 1 indicates a phishing URL; otherwise the URL is normal.
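Because a sigmoid layer produces a value between 0 and 1 rather than exactly 0 or 1, some cut-off is needed to arrive at the final label. The Python sketch below shows one way to do this; the 0.5 threshold and the function names are assumptions, as the text does not state how the sigmoid output is converted to a label.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def label_from_logit(z, threshold=0.5):
    """Return 1 (phishing) if the sigmoid score reaches the threshold, else 0 (normal)."""
    return 1 if sigmoid(z) >= threshold else 0


print(label_from_logit(2.0), label_from_logit(-2.0))  # 1 0
```

Raising the threshold would trade recall for precision, so in practice the value would be chosen on the validation part of the labeled data.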

The above example shows that the method does not require any features to be extracted manually. The URL only needs to be converted into a word sequence vector representation, and a deep neural network (a bidirectional LSTM model) automatically learns the contextual information and features in the word sequence in order to detect phishing URLs.

Its main steps are: 1) Word sequence vector representation. The URLs, both labeled and unknown, are first segmented into words. All URLs are converted into vectors, and the labeled data is then used to train the model. A sequence padding method is used to obtain a fixed-length vector representation; "fixed-length" means that the word sequence vector obtained for every URL has the same length. Sequence padding converts vectors of different lengths to the same length.

2) Model training. Using the vectors obtained in the previous step, the labeled training data is used to train the bidirectional LSTM model.

3) Phishing URL detection. For an unlabeled URL, its vector representation is input into the trained bidirectional LSTM model for labeling; URLs labeled 1 are phishing URLs.

Step 1) first obtains a fixed-length vector representation of the URL string through the word sequence vector representation; the method trains on and analyzes this vector representation of the URL;

Step 2) uses the labeled, preprocessed data to train the bidirectional LSTM model based on word sequences;

Step 3) inputs the vector representation of the unknown URL into the trained bidirectional LSTM model for labeling, to detect whether it is a phishing URL.

Detecting phishing URLs with the above method makes it possible to deeply mine the contextual information and features contained in the URL's word sequence. The method achieves better results than both machine learning models based on manually extracted word features and a deep learning model based on character sequences; the detection results on the same dataset are shown in Table 1.

The method is also a lightweight phishing URL detection method. The trained model running on an ordinary server predicts no fewer than 600 URLs per second in a single thread, so the method improves detection accuracy while meeting the needs of real-time detection.

Table 1. Comparison of detection results of four different detection models

| Model | Precision | Recall | F1 |
| --- | --- | --- | --- |
| Decision tree model based on word features | 0.8803 | 0.8700 | 0.8751 |
| Random forest model based on word features | 0.8981 | 0.8965 | 0.8973 |
| Bidirectional LSTM model based on character sequences | 0.9553 | 0.9474 | 0.9513 |
| Bidirectional LSTM model based on word sequences | 0.9808 | 0.9716 | 0.9762 |
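The F1 column can be reproduced from the other two, assuming the usual definition of F1 as the harmonic mean of precision and recall (the text does not define it). The Python sketch below recomputes F1 for each row of Table 1; the dictionary keys are shortened labels introduced here.

```python
# (precision, recall, reported F1) copied from Table 1.
TABLE_1 = {
    "decision tree, word features": (0.8803, 0.8700, 0.8751),
    "random forest, word features": (0.8981, 0.8965, 0.8973),
    "bidirectional LSTM, character sequences": (0.9553, 0.9474, 0.9513),
    "bidirectional LSTM, word sequences": (0.9808, 0.9716, 0.9762),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for model, (precision, recall, reported) in TABLE_1.items():
    computed = f1_score(precision, recall)
    print(f"{model}: computed F1 = {computed:.4f}, reported F1 = {reported:.4f}")
```

All four recomputed values agree with the reported ones to four decimal places, which suggests the table is internally consistent.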

Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by persons of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

Claims

1. A method for detecting phishing URLs based on word sequences, comprising the following steps:

converting labeled URLs into word sequence vectors as training data;

training a classification model with the training data;

converting an unknown URL into a word sequence vector and inputting it into the trained classification model for labeling.

2. The method for detecting phishing URLs based on word sequences according to claim 1, characterized in that converting a labeled URL or an unknown URL into a word sequence vector comprises:

filtering out the protocol and the generic top-level domain (gTLD) from the labeled URL or unknown URL;

splitting the part that remains after filtering into segments, and segmenting the character string of each segment into words using a dictionary and forward maximum matching, to obtain a word sequence;

numbering all the words in the dictionary starting from 1, so that each word has a unique number, and converting the word sequence of each labeled URL or unknown URL into a fixed-length vector of numbers.

3. The method for detecting phishing URLs based on word sequences according to claim 2, characterized in that the protocols comprise http, https, ftp, ftps, and gopher, and the gTLDs comprise com, org, net, edu, and gov.

4. The method for detecting phishing URLs based on word sequences according to claim 2, characterized in that segmenting into words using a dictionary and forward maximum matching comprises:

determining whether the entire string is in the dictionary, and if so, performing no further segmentation;

if not, removing the last character and determining whether the remaining string is in the dictionary;

repeating this check until a word in the dictionary is matched, and then removing the matched word from the string;

continuing the above steps on the remainder of the string until the whole string has been processed;

if the string contains no word in the dictionary, splitting it into single characters.

5. The method for detecting phishing URLs based on word sequences according to claim 4, characterized in that the dictionary is the Google English word corpus published by Peter Norvig.

6. The method for detecting phishing URLs based on word sequences according to claim 2, characterized in that a bidirectional LSTM model based on word sequences is selected as the classification model trained with the training data.

7. The method for detecting phishing URLs based on word sequences according to claim 1, characterized in that training the classification model with the training data comprises:

randomly dividing the training data into a training part and a validation part, and training the bidirectional LSTM model by setting parameters of the neural network model such as its hyperparameters and activation functions.

8. The method for detecting phishing URLs based on word sequences according to claim 7, characterized in that the bidirectional LSTM model is a four-layer neural network comprising an embedding layer, a bidirectional LSTM layer, a dropout layer, and a sigmoid layer, and training the classification model with the training data further comprises applying dropout to the output of the bidirectional LSTM layer to prevent overfitting.

9. A system for detecting phishing URLs based on word sequences, comprising:

a conversion module and a classification model to be trained;

the conversion module is used to convert labeled URLs into word sequence vectors as training data for the classification model, and to convert an unknown URL into a word sequence vector and input it into the trained classification model for labeling.