CN107402959A

CN107402959A - URL matching process, URL matching units and storage medium

Info

Publication number: CN107402959A
Application number: CN201710451043.5A
Authority: CN
Inventors: 卢毓海; 张春燕; 刘燕兵; 谭建龙; 郭莉
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2017-06-15
Filing date: 2017-06-15
Publication date: 2017-11-28
Anticipated expiration: 2037-06-15
Also published as: CN107402959B

Abstract

The invention provides a URL matching method, a URL matching device and a storage medium. In the method, the input original URL data to be matched is checked in the URL result cache whether its corresponding information has been stored, and if so, the flag bit and the hit rule are checked. The table gets the information of whether it matched before, and outputs the matching result; otherwise, it matches in the pattern matching engine and stores the URL data. This method deduplicates a large amount of repeated URL data to reduce the number of repeated URL data matches and improve the matching speed. At the same time, the present invention adopts a polynomial hash algorithm with a lower complexity than the actual string matching, but it is not limited to the polynomial hash algorithm. After the data deduplication operation is added, the matching times of duplicate URL data can be reduced, so as to achieve the purpose of deduplication and reduce the matching time.

Description

URL matching method, URL matching device and storage medium

技术领域technical field

本发明涉及内容过滤、信息安全等领域，尤其涉及URL匹配方法、URL匹配设备及存储介质。The invention relates to the fields of content filtering, information security and the like, and in particular relates to a URL matching method, a URL matching device and a storage medium.

背景技术Background technique

随着计算机网络的发展，通过网络进行娱乐、工作、休闲等已经占据着生活的大部分时间，万维网的应用已占因特网通信量的90％。在这种情况下，需要对访问的URL(Uniform Resource Locator，统一资源定位符)进行匹配，进而对访问的URL进行控制，提高工作效率。With the development of computer networks, entertainment, work, and leisure through the network have occupied most of the time in life, and the application of the World Wide Web has accounted for 90% of Internet traffic. In this case, it is necessary to match the accessed URL (Uniform Resource Locator, uniform resource locator), and then control the accessed URL to improve work efficiency.

一种传统的URL匹配方法主要利用串匹配的算法针对URL规则集合做处理，如根据URL“黑名单”的过滤特点，利用AC(Aho-Corasick)算法，采取启发式策略从URL规则中选取定长、具有特点的子串，并利用双数组进行压缩。另一种传统的URL匹配方法主要利用路由器与缓存引擎之间的通信协议(WCCP)完成URL匹配。A traditional URL matching method mainly uses a string matching algorithm to process URL rule sets. For example, according to the filtering characteristics of the URL "blacklist", the AC (Aho-Corasick) algorithm is used to select certain URL rules from the URL rules using a heuristic strategy. Long, characteristic substrings are compressed using double arrays. Another traditional URL matching method mainly utilizes a communication protocol (WCCP) between a router and a cache engine to complete URL matching.

上述两种传统的URL匹配方法主要从常规性的内容过滤系统的角度来实现对URL的匹配过程，但是，在实际进行URL匹配时，待匹配的URL数据中存在大量的重复URL数据，这些重复的URL数据使得匹配过程重复匹配了多次数据，从而降低了传统URL匹配方法的匹配效率。The above two traditional URL matching methods mainly implement the URL matching process from the perspective of a conventional content filtering system. However, when actually performing URL matching, there are a large number of repeated URL data in the URL data to be matched. The URL data makes the matching process repeatedly match the data, thereby reducing the matching efficiency of the traditional URL matching method.

发明内容Contents of the invention

由于真实网络流量下URL数据的重复率比较高，对网络流量匹配时往往有大量数据重复匹配，这样会降低匹配速度。本发明旨在提供URL匹配方法、URL匹配设备及存储介质，该方法能够针对网络流量中的文本匹配问题快速匹配规则，并且能够将大规模的URL数据去重，以达到避免重复匹配数据的目的。Due to the relatively high repetition rate of URL data under real network traffic, a large amount of data is often matched repeatedly when matching network traffic, which will reduce the matching speed. The present invention aims to provide a URL matching method, a URL matching device, and a storage medium. The method can quickly match rules for text matching problems in network traffic, and can deduplicate large-scale URL data to achieve the purpose of avoiding repeated matching data .

针对上述目的，本发明所采用的技术方案为：For above-mentioned purpose, the technical scheme that the present invention adopts is:

一种URL匹配方法，其步骤为：A URL matching method, the steps of which are:

1)将待匹配的原始URL数据进行双哈希运算，得到双哈希值hash₁、hash₂及其对应的哈希值h；1) Perform a double hash operation on the original URL data to be matched to obtain double hash values hash ₁ , hash ₂ and their corresponding hash values h;

2)若在哈希表中查询到哈希值为h的哈希节点，则查询该哈希节点的双哈希值与上述待匹配的原始URL数据的双哈希值是否都相等；2) If a hash node with a hash value h is found in the hash table, then query whether the double hash value of the hash node is equal to the double hash value of the above-mentioned original URL data to be matched;

3)若相等，则判断上述哈希节点的标志位start和end是否满足预设条件；3) If they are equal, it is judged whether the flags start and end of the above-mentioned hash nodes meet the preset conditions;

4)若上述哈希节点的标志位start和end满足预设条件，则上述待匹配的原始URL数据匹配已命中规则表中的一条规则。4) If the flags start and end of the above-mentioned hash node meet the preset conditions, then the above-mentioned original URL data to be matched matches a rule in the matched rule table.

进一步地，步骤2)中所述哈希表与步骤4)中所述已命中规则表位于URL结果缓存中。Further, the hash table in step 2) and the hit rule table in step 4) are located in the URL result cache.

进一步地，步骤2)还包括：若在哈希表中未查询到哈希值为h的哈希节点，则直接结束，转到模式匹配引擎匹配所述待匹配的原始URL数据，并返回结果。Further, step 2) also includes: if no hash node with a hash value of h is found in the hash table, then directly end, go to the pattern matching engine to match the original URL data to be matched, and return the result .

进一步地，步骤3)中所述预设条件是指哈希节点的标志位start和end不相等。Further, the preset condition in step 3) means that the flag bits start and end of the hash node are not equal.

更进一步地，步骤3)还包括：若上述哈希节点的双哈希值与上述待匹配的原始URL数据的双哈希值不相等，则在上述哈希表中查询是否还有其它哈希值为h的哈希节点；若没有则利用模式匹配引擎匹配所述待匹配的原始URL数据，并返回结果。Further, step 3) also includes: if the double hash value of the above-mentioned hash node is not equal to the double hash value of the original URL data to be matched, then query whether there are other hashes in the above-mentioned hash table A hash node whose value is h; if not, use a pattern matching engine to match the original URL data to be matched, and return the result.

更进一步地，当模式匹配引擎匹配所述待匹配的原始URL数据，并返回结果时，将该待匹配的原始URL数据存入URL结果缓存，其步骤包括：Furthermore, when the pattern matching engine matches the original URL data to be matched and returns a result, the original URL data to be matched is stored in the URL result cache, and the steps include:

a)判断URL结果缓存中的URL数据数目J是否超出阈值M，若未超出，则将该待匹配的原始URL数据的双哈希值hash₁、hash₂记录到当前哈希节点hash_node[J]中；若超出，则对该块URL结果缓存采取替换策略；其中M为能缓存的URL数据大小；a) Determine whether the number of URL data J in the URL result cache exceeds the threshold M, if not, record the double hash values hash ₁ and hash ₂ of the original URL data to be matched to the current hash node hash_node[J] Medium; if it exceeds, a replacement strategy will be adopted for the URL result cache of the block; where M is the size of the URL data that can be cached;

b)当模式匹配引擎返回结果为0时，则该哈希节点hash_node[J]中的标志位hash_node[J].start＝hash_node[J].end＝K；当模式匹配引擎返回结果为1时，则该哈希节点hash_node[J]中的标志位hash_node[J].start＝K、hash_node[J].end＝K+1，并在matched_rules[]中存储从模式匹配引擎返回的匹配规则号matched_rules[K]＝RuleId，此时K＝K+1；其中0代表该待匹配的原始URL数据不匹配规则，1代表该待匹配的原始URL数据匹配规则，K为已匹配的规则数目，matched_rules[]代表已命中规则表；b) When the pattern matching engine returns a result of 0, the flag bit hash_node[J].start=hash_node[J].end=K in the hash node hash_node[J]; when the pattern matching engine returns a result of 1 , then the flags hash_node[J].start=K, hash_node[J].end=K+1 in the hash node hash_node[J], and store the matching rule number returned from the pattern matching engine in matched_rules[]. matched_rules[K]=RuleId, at this time K=K+1; where 0 means that the original URL data to be matched does not match a rule, 1 means that the original URL data to be matched matches a rule, K is the number of matched rules, matched_rules [] represents the hit rule table;

c)获取上述待匹配的原始URL数据的哈希值h，存储哈希值相同的哈希节点到next中，且存储的URL数据数目J＝J+1。c) Obtain the hash value h of the original URL data to be matched, store the hash nodes with the same hash value in next, and the number of stored URL data is J=J+1.

更进一步地，步骤a)中所述替换策略是指：对该块URL结果缓存清零，重新分配空间存储新的URL数据。Furthermore, the replacement strategy in step a) refers to: clearing the URL result cache of the block, and reallocating space to store new URL data.

更进一步地，所述待匹配的原始URL数据的哈希值h根据hash₁与(N-1)做与运算得到，其中N为哈希表大小。Furthermore, the hash value h of the original URL data to be matched is obtained by ANDing hash ₁ and (N-1), where N is the size of the hash table.

进一步地，步骤4)还包括：若上述哈希节点的标志位start和end不满足预设条件，则上述待匹配的原始URL数据不匹配规则。Further, step 4) further includes: if the flag bits start and end of the above-mentioned hash node do not meet the preset condition, then the above-mentioned original URL data to be matched does not match the rule.

进一步地，步骤4)中所述规则是指与上述哈希节点的哈希值所对应数据匹配的规则。Further, the rule in step 4) refers to a rule that matches the data corresponding to the hash value of the above-mentioned hash node.

进一步地，所述双哈希运算采用多项式散列算法，并采用多线程的方法并行对URL数据进行匹配。Further, the double hash operation adopts a polynomial hash algorithm, and uses a multi-thread method to match URL data in parallel.

一种URL匹配设备，包括处理器和存储器，所述处理器与所述存储器通过总线连接；A URL matching device, including a processor and a memory, the processor and the memory are connected through a bus;

所述存储器用于存储上述任一所述方法对应的计算机执行指令；The memory is used to store computer-executed instructions corresponding to any of the methods described above;

当所述URL匹配设备运行时，所述处理器读取所述存储器中的所述计算机执行指令，以使所述URL匹配设备执行上述任一所述方法的步骤。When the URL matching device is running, the processor reads the computer-executable instructions in the memory, so that the URL matching device executes the steps of any one of the methods described above.

一种存储有计算机程序的非易失性计算机可读存储介质，当计算机执行所述计算机程序时，所述计算机执行上述任一所述方法的步骤。A non-volatile computer-readable storage medium storing a computer program. When a computer executes the computer program, the computer executes the steps of any one of the methods described above.

本发明的有益效果在于：The beneficial effects of the present invention are:

1、本发明方法通过将输入的待匹配的原始URL数据在URL结果缓存中查询是否已存储过其对应的信息，若是则查询标志位及已命中规则表得到以前是否匹配的信息，输出匹配结果；否则在模式匹配引擎中进行匹配，并进行存储URL数据。1. The method of the present invention queries whether its corresponding information has been stored in the URL result cache by querying the input original URL data to be matched, and if so, queries the flag position and the hit rule table to obtain information on whether it has matched before, and outputs the matching result ; Otherwise, match in the pattern matching engine and store the URL data.

2、本发明方法及设备对大量重复的URL数据去重，以降低URL数据重复匹配次数，提高匹配速度。2. The method and device of the present invention deduplicate a large amount of repeated URL data, so as to reduce the number of repeated matching of URL data and improve the matching speed.

3、本发明在URL结果缓存中的URL数据数目超出阈值后，对该块URL结果缓存采取替换策略，以保证当前URL数据的存储过程顺利进行。3. After the number of URL data in the URL result cache exceeds the threshold, the present invention adopts a replacement strategy for the URL result cache to ensure the smooth progress of the current URL data storage process.

4、本发明采用比实际串匹配复杂度低的多项式散列算法，但不仅限于多项式散列算法。在增添数据去重操作后可以减少重复的URL数据的匹配次数，达到去重和降低匹配时间的目的。4. The present invention uses a polynomial hash algorithm with a lower complexity than the actual string matching, but it is not limited to the polynomial hash algorithm. After the data deduplication operation is added, the matching times of duplicate URL data can be reduced, so as to achieve the purpose of deduplication and reduce the matching time.

5、本发明提供的存储介质可以应用在入侵检测系统、拼写检查、网络流量监测等领域，还可以在计算机硬件单机及多台服务器中运行。5. The storage medium provided by the present invention can be applied in the fields of intrusion detection system, spell checking, network traffic monitoring, etc., and can also run on a single computer hardware or multiple servers.

附图说明Description of drawings

图1为本发明提供的URL匹配设备结构图。Fig. 1 is a structural diagram of a URL matching device provided by the present invention.

图2为本发明提供的URL匹配的基本数据类型示意图。Fig. 2 is a schematic diagram of basic data types of URL matching provided by the present invention.

图3为本发明提供的URL结果缓存中匹配阶段流程图。Fig. 3 is a flow chart of the matching phase in the URL result cache provided by the present invention.

图4为本发明提供的URL结果缓存中存储阶段流程图。FIG. 4 is a flow chart of the storage stage in the URL result cache provided by the present invention.

图5为本发明一实施例的规则存储示意图。Fig. 5 is a schematic diagram of rule storage according to an embodiment of the present invention.

图6为本发明一实施例的url1存储时，各表存储变化示意图。FIG. 6 is a schematic diagram of storage changes in each table when url1 is stored according to an embodiment of the present invention.

图7为本发明一实施例的url2存储时，各表存储变化示意图。FIG. 7 is a schematic diagram of storage changes in each table when url2 is stored according to an embodiment of the present invention.

图8为本发明一实施例的url3存储时，各表存储变化示意图。FIG. 8 is a schematic diagram of storage changes in each table when url3 is stored according to an embodiment of the present invention.

图9为本发明一实施例的url5存储时，各表存储变化示意图。FIG. 9 is a schematic diagram of storage changes in each table when url5 is stored according to an embodiment of the present invention.

图10为本发明一实施例的url8存储时，各表存储变化示意图。FIG. 10 is a schematic diagram of storage changes in each table when url8 is stored according to an embodiment of the present invention.

具体实施方式detailed description

针对多种散列方法，本发明从骨干网络中采集了千万条真实URL数据，数据大小为2.06GB。不同哈希算法在URL数据上的计算速度有所不同，其效果见表1。由于最终要提高匹配速度，需要对URL数据进行去重处理，但是如果去重处理的复杂度太高，散列速度很慢，最终会影响最后的串匹配速度。所以，为了提高其散列效率，需要选取散列计算速度最快的一种作为发明的散列函数，即多项式散列算法(poly_hash)。For various hashing methods, the present invention collects tens of millions of pieces of real URL data from the backbone network, and the data size is 2.06GB. Different hash algorithms have different calculation speeds on URL data, and their effects are shown in Table 1. To improve the matching speed in the end, it is necessary to de-duplicate the URL data, but if the complexity of the de-duplication processing is too high, the hashing speed is very slow, which will eventually affect the final string matching speed. Therefore, in order to improve its hash efficiency, it is necessary to select the hash function with the fastest hash calculation speed as the invented hash function, that is, polynomial hash algorithm (poly_hash).

表1：不同哈希算法在URL数据上的计算速度Table 1: Calculation speed of different hash algorithms on URL data

哈希算法hash algorithm 具体哈希算法Specific hash algorithm 计算速度(MB/s)Calculation speed (MB/s) 查表散列算法Lookup table hash algorithm crc_hashcrc_hash 274274 多项式散列算法polynomial hashing algorithm poly_hashpoly_hash 552552 位运算散列算法bitwise hash algorithm xor_hashxor_hash 370370 ELFELF elf_hashelf_hash 274274 HfIpHf HfIp_hashHfIp_hash 515515

经过实验验证，为了去重所采用的散列算法的复杂度比实际串匹配的复杂度要低，所以，增添了去重数据操作后，其匹配速度可以大大提高。面向千万规模的URL串匹配利用本发明的操作可以大大提高匹配速度和性能。It has been verified by experiments that the complexity of the hash algorithm used for deduplication is lower than that of actual string matching. Therefore, after adding deduplication data operations, the matching speed can be greatly improved. The matching speed and performance can be greatly improved by using the operation of the present invention for URL string matching on a scale of tens of millions.

假使URL数据中，不重复的URL比例为p，一次数据去重的时间为C₁，一次串匹配的时间为C₂，一次ULR数据插入URL结果缓存的时间为C₃，则采用数据去重技术后，期望的时间为：Assuming that in URL data, the proportion of unique URLs is p, the time for one data deduplication is C ₁ , the time for one string matching is C ₂ , and the time for one ULR data insertion into URL result cache is C ₃ , then data deduplication is used After technology, the expected time is:

T＝(1-p)*C₁+p*(C₂+C₃)T＝(1-p)*C ₁ +p*(C ₂ +C ₃ )

所以，只要数据去重做的足够快，则T<C₂。实验验证，数据去重所需要的时间很短，最终的实验结果证明T<C₂，所以本发明可以减少重复的URL数据的匹配次数，达到去重和降低匹配时间的目的。Therefore, as long as the data is redone fast enough, T<C ₂ . Experiments verify that the time required for deduplication of data is very short, and the final experimental result proves that T<C ₂ , so the present invention can reduce the matching times of repeated URL data, achieve the purpose of deduplication and reduce matching time.

本发明中的哈希函数采用多项式散列算法，并采用多线程的方法并行对URL数据进行匹配，针对千万级别的URL规则集合时，URL数据去重的结果效率较高。The hash function in the present invention adopts a polynomial hash algorithm, and adopts a multi-thread method to match URL data in parallel. When aiming at tens of millions of URL rule sets, the deduplication result of URL data is more efficient.

请参考图1，为简述方便，仅针对单线程模型，其主要思想包括URL结果缓存阶段和模式匹配引擎阶段。Please refer to Figure 1. For the convenience of brief description, only for the single-threaded model, the main idea includes the URL result cache stage and the pattern matching engine stage.

所述URL结果缓存阶段是指：首先为URL结果缓存开辟一块内存。将待匹配的原始URL数据经过双哈希运算散列到哈希表中，查询当前哈希表中是否已存储对应的数据(即所述待匹配的原始URL数据)。如果有，再查询已命中规则表中是否有对应的规则号。若有对应的规则号，则返回对应的结果即返回规则号；若没有对应的规则号，则返回该待匹配的原始URL数据不匹配任何规则；上述过程为数据去重过程。如果没有，则利用模式匹配引擎对上述待匹配的原始URL数据进行匹配。The URL result caching stage refers to: firstly, a memory is allocated for URL result caching. The original URL data to be matched is hashed into the hash table through a double hash operation, and whether the corresponding data (that is, the original URL data to be matched) has been stored in the current hash table is inquired. If so, check whether there is a corresponding rule number in the hit rule table. If there is a corresponding rule number, the corresponding result is returned, that is, the rule number is returned; if there is no corresponding rule number, the original URL data to be matched does not match any rule; the above process is a data deduplication process. If not, a pattern matching engine is used to match the above-mentioned original URL data to be matched.

所述模式匹配引擎阶段是指：当在哈希表中未查询到待匹配的原始URL数据时，即待匹配的原始URL数据第一次进行查询，也就是说没有相同的URL数据进入过哈希表中，首先在模式匹配引擎中进行匹配，并返回结果，然后在哈希表中记录待匹配的原始URL数据的哈希值，并将如若匹配上的规则号在已命中规则表中进行记录。The pattern matching engine stage refers to: when the original URL data to be matched is not queried in the hash table, that is, the original URL data to be matched is queried for the first time, that is to say, no same URL data has entered the hash table In the Hashtable, the pattern matching engine is firstly matched and the result is returned, and then the hash value of the original URL data to be matched is recorded in the Hashtable, and the rule number if matched is recorded in the hit rule table. Record.

本发明还提供一种URL匹配设备，该设备包括处理器和存储器，所述处理器与所述存储器通过总线连接；The present invention also provides a URL matching device, which includes a processor and a memory, the processor and the memory are connected through a bus;

所述存储器包括内存和外存，用于存储本发明任一所述方法对应的计算机执行指令；The memory includes internal memory and external storage, and is used to store computer-executed instructions corresponding to any of the methods described in the present invention;

当所述URL匹配设备运行时，所述处理器读取所述存储器中的所述计算机执行指令，以使所述URL匹配设备执行本发明任一所述方法的步骤。When the URL matching device is running, the processor reads the computer-executable instructions in the memory to cause the URL matching device to perform the steps of any one of the methods of the present invention.

为了解决哈希冲突引起的问题，本发明采用了双哈希函数，并设定链表的形式将哈希值相同但数据本身不同的URL进行存储，保证了散列的均衡性。请参考图2，为下文叙述方便，现定义有关符号用来描述下面的哈希去重过程。In order to solve the problems caused by hash conflicts, the present invention adopts a double hash function, and sets the form of a linked list to store URLs with the same hash value but different data itself, so as to ensure the balance of the hash. Please refer to Figure 2. For the convenience of the following description, the relevant symbols are defined to describe the following hash deduplication process.

M：能缓存的URL数据大小，一般为2的倍数，方便后面操作；M: The size of the URL data that can be cached, generally a multiple of 2, which is convenient for subsequent operations;

N：哈希表大小，本发明设定N为M的8倍，以保证散列的均衡性；N: the size of the hash table, the present invention sets N to be 8 times of M to ensure the balance of the hash;

K：已匹配的规则数目，初始值为0，记录matched_rules[]中最后的下标；K: the number of matched rules, the initial value is 0, record the last subscript in matched_rules[];

J：取值为输入的URL数据数目，初始值为0；J: The value is the number of input URL data, and the initial value is 0;

data[]：某条URL数据；data[]: a certain URL data;

A1，A2：散列常数，固定值，用来多项式迭代算法的常数；A1, A2: Hash constants, fixed values, constants used for polynomial iterative algorithms;

H1，H2：随机算子，固定值，用来初始化哈希函数；H1, H2: random operator, fixed value, used to initialize the hash function;

hash₁，hash₂：泛指每条URL数据经过哈希函数之后得到的哈希值；hash ₁ , hash ₂ : generally refers to the hash value obtained after each URL data passes through the hash function;

matched_rules[]：已命中规则表，存储匹配数据的规则号；matched_rules[]: the rule table has been hit, storing the rule number of the matching data;

hash_table[]：哈希表，存储数据哈希值所对应的hash_node[]编号，与J值相关，初始值哈希表中所有值设置为-1；hash_table[]: hash table, which stores the number of hash_node[] corresponding to the hash value of the data, which is related to the J value, and all values in the initial value hash table are set to -1;

hash_node[]：某个哈希节点，是一个结构体，其内容包含h₁、h₂哈希值，start、end标志位以及下一个相同哈希值的哈希节点next；hash_node[]: a hash node, which is a structure, its content includes h ₁ , h ₂ hash values, start, end flag bits and the next hash node next with the same hash value;

RuleId：本次匹配过程中的某个规则号。RuleId: A rule number in this matching process.

本发明中A1＝6364136223846793005、A2＝2862933555777941757、H1＝1442695040888963407、H2＝3037000493，选取了C++内核中的常采用的散列常数和随机算子，A1、A2和H1、H2可以是其他常数。本发明主要针对URL结果缓存阶段，模式匹配引擎阶段主要提供数据匹配结果以备后面存储，故只介绍URL结果缓存阶段，此阶段可以分为匹配阶段和存储阶段。Among the present invention, A1=6364136223846793005, A2=2862933555777941757, H1=1442695040888963407, H2=3037000493, have chosen the hash constant and random operator that often adopt in the C++ kernel, A1, A2 and H1, H2 can be other constants. The present invention is mainly aimed at the URL result caching stage, and the pattern matching engine stage mainly provides data matching results for later storage, so only the URL result caching stage is introduced, which can be divided into a matching stage and a storage stage.

请参考图3，所述匹配阶段包括以下步骤：Please refer to Figure 3, the matching stage includes the following steps:

1)将URL数据进行双哈希操作，利用URL数据的长度，哈希函数为：1) Double hash the URL data, using the length of the URL data, the hash function is:

hash₁＝A1*hash₁+data[i]hash ₁ ＝A1*hash ₁ +data[i]

hash₂＝A2*hash₂+data[i]hash ₂ ＝A2*hash ₂ +data[i]

其中i从URL数据起始第一个字符开始取值，到最后一个字符截止。hash₁、hash₂初始值为hash₁＝H1、hash₂＝H2。Among them, i starts to take the value from the first character of the URL data to the last character. The initial values of hash ₁ and hash ₂ are hash ₁ =H1 and hash ₂ =H2.

上述哈希函数中当URL数据取到最后一个字符时，对hash₁与(N-1)做与运算得到的结果为h，以保证哈希值在哈希表hash_table[]能落到下标范围内。In the above hash function, when the last character of the URL data is fetched, the result of the AND operation of hash ₁ and (N-1) is h, so as to ensure that the hash value can fall into the subscript in the hash table hash_table[] within range.

2)再从hash_table[h]取值开始，设为j＝hash_table[h]，访问此哈希节点，首先确定双哈希值是否都相等，即本次匹配URL数据的hash₁与hash_node[j].h₁是否相等，hash₂与hash_node[j].h₂是否相等，如果同时满足相等，说明哈希校验成功，该条URL数据曾经访问过且在URL结果缓存中存储过对应的信息，可以转到步骤3)对该条URL数据是否匹配规则做出判断，否则该条URL数据没有被访问过，返回匹配失败，让模式匹配引擎处理当前数据情况。2) Start from the value of hash_table[h], set j=hash_table[h], access this hash node, first determine whether the double hash values are equal, that is, the hash ₁ and hash_node[j of the matching URL data this time ].h ₁ is equal, whether hash ₂ is equal to hash_node[j].h ₂ , if they are equal at the same time, it means that the hash verification is successful, the URL data has been accessed and the corresponding information has been stored in the URL result cache , you can go to step 3) to make a judgment on whether the URL data matches the rule, otherwise the URL data has not been accessed, return matching failure, and let the pattern matching engine process the current data situation.

3)当hash_node[j]中存储的结构体中标志位start和end相等时，表示该条URL数据原来存储过但是不能匹配任何规则，返回不匹配任何规则；若start和end不相等，表示该条URL数据原来存储过并且与某条规则匹配，start的值表示已命中规则表中该规则存储的下标，即RuleId＝matched_rules[hash_node[j].start]，则返回匹配的规则号RuleId，匹配成功，结束匹配过程。3) When the flags start and end in the structure stored in hash_node[j] are equal, it means that the URL data was originally stored but cannot match any rules, and the return does not match any rules; if start and end are not equal, it means that the A piece of URL data was originally stored and matched with a certain rule. The value of start indicates that the subscript stored in the rule table has been hit, that is, RuleId=matched_rules[hash_node[j].start], then the matched rule number RuleId will be returned. If the matching is successful, the matching process ends.

4)然后选取hash_node[j]中相同哈希值的哈希节点hash_node[j].next，令j＝hash_node[j].next，转到步骤2)执行，不断取hash_node[j]中的next访问下一个哈希节点，直到j＝-1(已访问过所有哈希值hash₁相同的数据)或者已经匹配出结果为止。4) Then select the hash node hash_node[j].next with the same hash value in hash_node[j], let j=hash_node[j].next, go to step 2) and continue to take the next in hash_node[j] Visit the next hash node until j=-1 (all data with the same hash value hash ₁ has been visited) or the result has been matched.

请参考图4，所述存储阶段包括以下步骤：Please refer to Figure 4, the storage stage includes the following steps:

当模式匹配引擎阶段匹配一条未被存入URL结果缓存的URL数据，并且返回结果时，需要将此条URL数据存入URL结果缓存中，便于以后处理重复数据时，可以直接从URL结果缓存中利用匹配阶段输出匹配结果。初始化时，首先将哈希表hash_table[]中的每一项置为-1。When the pattern matching engine stage matches a piece of URL data that has not been stored in the URL result cache and returns the result, it needs to store the URL data in the URL result cache so that it can be directly retrieved from the URL result cache when processing duplicate data in the future. Use the matching stage to output matching results. When initializing, first set each item in the hash table hash_table[] to -1.

1)判断URL结果缓存中的URL数据数目J是否超出当前的阈值M，如果超出，则对该块URL结果缓存清零，重新分配空间，计数器K和J重设为0，存入新的URL数据；如果没有超出，则对每次处理的模式匹配引擎中的URL数据进行存储。1) Determine whether the number of URL data J in the URL result cache exceeds the current threshold M, if so, clear the URL result cache for this block, reallocate the space, reset the counters K and J to 0, and store the new URL data; if not exceeded, store the URL data in the pattern matching engine for each processing.

2)存储URL数据时，假设现在存储到第J个不重复的URL数据，首先对本次的两个哈希值hash₁、hash₂记录到当前的哈希节点hash_node[J]中，即hash_node[J].h₁＝hash₁、hash_node[J].h₂＝hash₂，以便用于匹配阶段的哈希校验。2) When storing URL data, assuming that the Jth unique URL data is now stored, first record the two hash values hash ₁ and hash ₂ in the current hash node hash_node[J], namely hash_node [J].h ₁ =hash ₁ , hash_node[J].h ₂ =hash ₂ , so as to be used for hash verification in the matching stage.

3)当模式匹配引擎阶段中返回结果为0，即代表该条URL数据并不能匹配任何规则，存储该条URL数据时，将哈希节点hash_node[J]中的标志位设为相等，即令hash_node[J].start＝hash_node[J].end＝K；3) When the result returned in the pattern matching engine stage is 0, it means that the URL data cannot match any rules. When storing the URL data, set the flag bits in the hash node hash_node[J] to be equal, that is, hash_node [J].start=hash_node[J].end=K;

当模式匹配引擎阶段中返回结果为1，即代表该条URL数据能匹配规则集中的某条规则，存储该条URL数据时，将哈希节点hash_node[J]中的标志位设为不等，即令hash_node[J].start＝K，hash_node[J].end＝K+1，其中K为已匹配的规则数目，同时在matched_rules[]中存储从模式匹配引擎阶段返回的匹配规则号RuleId，即matched_rules[K]＝RuleId。由于已经增添了一条已匹配的URL数据信息，已匹配的规则数目加1，即K＝K+1。When the result returned in the pattern matching engine stage is 1, it means that the URL data can match a certain rule in the rule set. When storing the URL data, set the flag in the hash node hash_node[J] to unequal. That is, hash_node[J].start=K, hash_node[J].end=K+1, where K is the number of matched rules, and at the same time store the matching rule number RuleId returned from the pattern matching engine stage in matched_rules[], namely matched_rules[K]=RuleId. Since a piece of matched URL data information has been added, the number of matched rules is increased by 1, that is, K=K+1.

4)对hash₁与(N-1)做与运算得到的结果为h，将哈希节点hash_node[J]的下一个哈希节点值设为哈希表hash_table[h]的值，即hash_node[J].next＝hash_table[h]。将哈希值h存储在哈希表hash_table[h]中，并设定值为J，即hash_table[h]＝J。由于增加了一条URL数据，故J＝J+1。4) The result of the AND operation of hash ₁ and (N-1) is h, and the value of the next hash node of the hash node hash_node[J] is set to the value of the hash table hash_table[h], that is, hash_node[ J].next=hash_table[h]. Store the hash value h in the hash table hash_table[h], and set the value to J, that is, hash_table[h]=J. Since a piece of URL data is added, J=J+1.

下面主要通过一个具体实例来展现具体的实施方式，主要从存储阶段和匹配阶段介绍其过程。The following mainly uses a specific example to show the specific implementation manner, and mainly introduces the process from the storage stage and the matching stage.

表2：URL数据示例Table 2: Example of URL data

URL数据编号URL data number URL具体内容URL specific content url1url1 www.abcd.comwww.abcd.com url2url2 www.ccef.cnwww.ccef.cn url3url3 www.cdef.cnwww.cdef.cn url4url4 www.abcd.comwww.abcd.com url5url5 www.mmmm.cnwww.mmmm.cn url6url6 www.cdef.cnwww.cdef.cn url7url7 www.mmmm.cnwww.mmmm.cn url8url8 www.zhihu.comwww.zhihu.com

表3：规则示例Table 3: Example rules

规则编号rule number 内容content r1r1 www.abcd.comwww.abcd.com r2r2 www.ccef.cnwww.ccef.cn r3r3 www.baidu.comwww.baidu.com

请参考表2、表3及图5，假设规则共有三条即r1，r2，r3；url1与url4内容完全相同；url3与url6内容完全相同；url2与url3内容不同，但哈希值h＝hash₁&(N-1)相同；url5与url7内容完全相同，其中，url1与r1匹配，url2与r2匹配，其他url数据均不相同且不匹配规则。Please refer to Table 2, Table 3 and Figure 5, assuming that there are three rules, r1, r2, r3; url1 and url4 have the same content; url3 and url6 have the same content; url2 and url3 have different content, but the hash value h=hash ₁ &(N-1) is the same; the contents of url5 and url7 are exactly the same, among which, url1 matches r1, url2 matches r2, and other url data are different and do not match the rules.

下面开始具体的操作过程，因为一开始，URL结果缓存中没有数据，首要的过程从存储阶段开始。首先定义M＝4，则N＝32。data[i-1]＝urli，i＝1……8。Let's start the specific operation process, because at the beginning, there is no data in the URL result cache, and the first process starts from the storage stage. First define M=4, then N=32. data[i-1]=urli, i=1...8.

匹配第一条URL数据url1时，由于未在URL结果缓存中得到此数据信息，先经过模式匹配引擎进行匹配，匹配结果返回规则号r1，则存储时，其各表存储变化如图6所示。When matching the first URL data url1, because the data information is not obtained in the URL result cache, the pattern matching engine is first used for matching, and the matching result returns the rule number r1. When storing, the storage changes of each table are shown in Figure 6 .

url1经过哈希函数，hash₁＝5，hash₁与31(即N-1)做与运算后得到h＝5，hash₂＝4，由于匹配规则r1，start＝0，end＝1，next节点为-1。After url1 passes through the hash function, hash ₁ = 5, hash ₁ and 31 (ie N-1) do AND operation to get h = 5, hash ₂ = 4, due to the matching rule r1, start = 0, end = 1, next node is -1.

请参考图7，同理，由于url2未在URL结果缓存中有结果信息，则经过模式匹配引擎后，url2经过哈希函数，hash₁＝287，hash₁与31(即N-1)做与运算后得到h＝11，hash₂＝359，由于匹配规则r2，start＝1，end＝2，next节点为-1。Please refer to Figure 7. Similarly, since url2 does not have result information in the URL result cache, after passing through the pattern matching engine, url2 passes through the hash function, hash ₁ = 287, and hash ₁ and 31 (ie N-1) are combined After the operation, h=11, hash ₂ =359, because of the matching rule r2, start=1, end=2, and the next node is -1.

请参考图8，由于url3未在URL结果缓存中有结果信息，则经过模式匹配引擎后，url3经过哈希函数，hash₁＝287，hash₁与31(即N-1)做与运算后得到h＝11，与url2中的h值一致，则本次哈希节点next中存储的是url2数据存储的哈希节点下标1，hash₂＝344，由于不匹配任何规则，start＝2，end＝2，next节点为1。Please refer to Figure 8, since url3 does not have result information in the URL result cache, after passing through the pattern matching engine, url3 passes through the hash function, hash ₁ = 287, hash ₁ and 31 (ie N-1) are obtained after AND operation h=11, which is consistent with the h value in url2, then the hash node next stores the hash node subscript 1 of url2 data storage, hash ₂ =344, since it does not match any rules, start=2, end =2, the next node is 1.

url4数据进行匹配时，由于在URL结果缓存中已经存在url4数据的信息(与url1内容完全相同)，则可以直接匹配结果信息，由于url4的hash₁＝5，hash₂＝4，h＝hash₁&31＝5，查询到hash_table[5]＝0，则查询hash_node[0]节点，发现url4的hash₁＝hash_node[0].h₁，hash₂＝hash_node[0].h₂，且hash_node[0].start与hash_node[0].end不相等，说明匹配规则，查询matched_rules[hash_node[0].start]＝r1，返回规则号r1。When the url4 data is matched, since the information of the url4 data already exists in the URL result cache (exactly the same as the content of the url1), the result information can be directly matched, because the hash ₁ of url4 = 5, hash ₂ = 4, h = hash ₁ &31＝5, query hash_table[5]＝0, query hash_node[0] node, find url4's hash ₁ =hash_node[0].h ₁ , hash ₂ =hash_node[0].h ₂ , and hash_node[0 ].start is not equal to hash_node[0].end, indicating the matching rules, query matched_rules[hash_node[0].start]=r1, and return the rule number r1.

请参考图9，由于url5未在URL结果缓存中有结果信息，则经过模式匹配引擎后，url5经过哈希函数，hash₁＝520，hash₁与31(即N-1)做与运算后得到h＝8，hash₂＝9，由于不匹配任何规则，start＝2，end＝2，next节点为-1。Please refer to Figure 9, since url5 does not have result information in the URL result cache, after passing through the pattern matching engine, url5 passes through the hash function, hash ₁ = 520, hash ₁ and 31 (ie N-1) are obtained after AND operation h=8, hash ₂ =9, since no rule is matched, start=2, end=2, and the next node is -1.

url6数据进行匹配时，由于在URL结果缓存中已经存在url6数据的信息(与url3内容完全相同)，则可以直接匹配结果信息，由于url6的hash₁＝287，hash₂＝344，h＝hash₁&31＝11，查询到hash_table[11]＝2，则查询hash_node[2]节点，发现url6的hash₁＝hash_node[2].h₁，hash₂＝hash_node[2].h₂，且hash_node[2].start与hash_node[2].end相等，说明不匹配任何规则，虽然next节点不为-1，由于当前哈希校验已经完成，说明数据相同，不再匹配hash_node[hash_node[2]next](即hash_node[0])节点，直接返回不匹配任何规则。When the url6 data is matched, since the information of the url6 data already exists in the URL result cache (identical to the content of the url3), the result information can be directly matched, because the hash ₁ of url6 = 287, hash ₂ = 344, h = hash ₁ &31＝11, query hash_table[11]＝2, query hash_node[2] node, find url6 hash ₁ =hash_node[2].h ₁ , hash ₂ =hash_node[2].h ₂ , and hash_node[2 ].start is equal to hash_node[2].end, which means that no rule is matched. Although the next node is not -1, since the current hash verification has been completed, it means that the data is the same and no longer matches hash_node[hash_node[2]next] (i.e. hash_node[0]) node, directly return does not match any rules.

url7数据进行匹配时，由于在URL结果缓存中已经存在url7数据的信息(与url5内容完全相同)，则可以直接匹配结果信息，由于url7的hash₁＝520，hash₂＝9，h＝hash₁&31＝8，查询到hash_table[8]＝3，则查询hash_node[3]节点，发现url7的hash₁＝hash_node[3].h₁，hash₂＝hash_node[3].h₂，且hash_node[3].start与hash_node[3].end相等，说明不匹配，直接返回不匹配任何规则结果。When the url7 data is matched, since the information of the url7 data already exists in the URL result cache (which is exactly the same as the content of the url5), the result information can be directly matched, because the hash ₁ of url7 = 520, hash ₂ = 9, h = hash ₁ &31＝8, query hash_table[8]＝3, then query hash_node[3] node, find url7's hash ₁ =hash_node[3].h ₁ , hash ₂ =hash_node[3].h ₂ , and hash_node[3 ].start is equal to hash_node[3].end, indicating that it does not match, and directly returns the result that does not match any rules.

请参考图10，由于url8未在URL结果缓存中有结果信息，则经过模式匹配引擎后，需要进行存储，但是此时URL结果缓存区已经满了(J＝M)，则需要对URL结果缓存区进行清零操作。url8经过双哈希函数，hash₁＝514，hash₁与31(即N-1)做与运算后得到h＝2，hash₂＝469，由于不匹配任何规则，start＝0，end＝1，next节点为-1。Please refer to Figure 10, since url8 does not have result information in the URL result cache, it needs to be stored after passing through the pattern matching engine, but at this time the URL result cache area is full (J=M), then the URL result cache needs to be The area is cleared. url8 undergoes a double hash function, hash ₁ = 514, hash ₁ and 31 (ie N-1) do AND operation to get h = 2, hash ₂ = 469, because it does not match any rules, start = 0, end = 1, The next node is -1.

本发明将上述所叙述的URL匹配方法称之为DDB算法，选取两种经典字符串匹配算法Aho-Corasick(AC)算法和Karp-Rabin(KR)算法作为对比算法，将实施数据去重策略前后的匹配方法的性能进行对比测试。The present invention refers to the URL matching method described above as the DDB algorithm, and selects two classic character string matching algorithms Aho-Corasick (AC) algorithm and Karp-Rabin (KR) algorithm as the comparison algorithm, and implements the data before and after the deduplication strategy. The performance of the matching method is compared and tested.

所述实验的软硬件环境为：The hardware and software environment of the experiment is:

CPU：Intel(R)Core(TM)i7-3820 CPU@3.60GHz；内存：32GB；操作系统：Red HatCentOS Linux release 7.2.1511(Core)；内核版本：3.10.0。CPU: Intel(R) Core(TM) i7-3820 CPU@3.60GHz; memory: 32GB; operating system: Red Hat CentOS Linux release 7.2.1511(Core); kernel version: 3.10.0.

URL数据集：从骨干网络上采集下来的真实URL数据作为待匹配的输入文本。URL data set: The real URL data collected from the backbone network is used as the input text to be matched.

规则集：模拟真实URL特点生成的500万URL规则。Rule set: 5 million URL rules generated by simulating real URL characteristics.

表4：真实数据集上AC、KR、DDB算法性能对比(规则集固定为500万)Table 4: Performance comparison of AC, KR, and DDB algorithms on real data sets (the rule set is fixed at 5 million)

由表4可知，实施数据去重策略之后的算法相比去重前，匹配效率提高，在真实数据集上效果更加明显。对于处理数百万甚至千万级别以上的大规模URL规则，匹配加速效果显著。It can be seen from Table 4 that the matching efficiency of the algorithm after the data deduplication strategy is improved compared with that before deduplication, and the effect is more obvious on the real data set. For processing large-scale URL rules of millions or even tens of millions, the matching acceleration effect is remarkable.

本发明还提供一种存储有计算机程序的非易失性计算机可读存储介质，当计算机执行所述计算机程序时，所述计算机执行本发明任一所述方法的步骤。且该存储介质可以应用在入侵检测系统、拼写检查、网络流量监测等领域，还可以在计算机硬件单机及多台服务器中运行。The present invention also provides a non-volatile computer-readable storage medium storing a computer program. When a computer executes the computer program, the computer executes the steps of any one of the methods described in the present invention. Moreover, the storage medium can be applied in fields such as intrusion detection systems, spell checking, network flow monitoring, etc., and can also be run on a computer hardware stand-alone or multiple servers.

以上实施仅用以说明本发明的技术方案而非对其进行限制，本领域的普通技术人员可以对本发明的技术方案进行修改或者等同替换，而不脱离本发明的精神和范围，本发明的保护范围应以权利要求书所述为准。The above implementation is only used to illustrate the technical solution of the present invention and not to limit it. Those skilled in the art can modify or equivalently replace the technical solution of the present invention without departing from the spirit and scope of the present invention. Protection of the present invention The scope should be defined by the claims.

Claims

1. A URL matching method, the steps of which are:

1) Perform a double hash operation on the original URL data to be matched to obtain double hash values hash ₁ , hash ₂ and their corresponding hash values h;

2) If a hash node with a hash value h is found in the hash table, then query whether the double hash value of the hash node is equal to the double hash value of the above-mentioned original URL data to be matched;

3) If they are equal, it is judged whether the flags start and end of the above-mentioned hash nodes meet the preset conditions;

4) If the flags start and end of the above-mentioned hash node meet the preset conditions, then the above-mentioned original URL data to be matched matches a rule in the matched rule table.

2. The method according to claim 1, wherein the hash table in step 2) and the matched rule table in step 4) are located in the URL result cache.

3. The method according to claim 1, characterized in that, step 2) also includes: if the hash node with a hash value of h is not found in the hash table, then directly end, and go to the pattern matching engine to match The original URL data to be matched, and return the result.

4. The method according to claim 1, wherein the preset condition in step 3) means that the flag bits start and end of the hash node are not equal.

5. The method according to claim 2, wherein step 3) further comprises: if the double hash value of the above-mentioned hash node is not equal to the double hash value of the original URL data to be matched, then in the above-mentioned Query whether there are other hash nodes with a hash value of h in the hash table; if not, use the pattern matching engine to match the original URL data to be matched, and return the result.

6. method as claimed in claim 5, is characterized in that, when pattern matching engine matches described original URL data to be matched, and when returning result, this original URL data to be matched is stored in URL result cache, its step include:

a) Determine whether the number of URL data J in the URL result cache exceeds the threshold M, if not, record the double hash values hash ₁ and hash ₂ of the original URL data to be matched to the current hash node hash_node[J] Medium; if it exceeds, a replacement strategy will be adopted for the URL result cache of the block; where M is the size of the URL data that can be cached;

b) When the pattern matching engine returns a result of 0, the flag bit hash_node[J].start=hash_node[J].end=K in the hash node hash_node[J]; when the pattern matching engine returns a result of 1 , then the flags hash_node[J].start=K, hash_node[J].end=K+1 in the hash node hash_node[J], and store the matching rule number returned from the pattern matching engine in matched_rules[]. matched_rules[K]=RuleId, at this time K=K+1; where 0 means that the original URL data to be matched does not match the rules, 1 means that the original URL data to be matched matches the rules, K is the number of matched rules, matched_rules [] represents the hit rule table;

c) Obtain the hash value h of the original URL data to be matched, store the hash nodes with the same hash value in next, and the number of stored URL data is J=J+1.

7 . The method according to claim 6 , wherein the replacement strategy in step a) refers to: clearing the URL result cache of the block, and reallocating space to store new URL data. 8 .

8. The method according to claim 1 or 6, wherein the hash value h of the original URL data to be matched is obtained according to an AND operation of hash ₁ and (N-1), wherein N is a hash table size.

9. The method according to claim 1, wherein step 4) further comprises: if the flags start and end of the hash node do not meet the preset conditions, then the original URL data to be matched does not match the rule.

10. The method according to claim 1, wherein the rule in step 4) refers to a rule matching the data corresponding to the hash value of the hash node.

11. The method according to claim 1, wherein the double hash operation adopts a polynomial hash algorithm, and uses a multi-thread method to match URL data in parallel.

12. A URL matching device, comprising a processor and a memory, the processor and the memory are connected through a bus;

The memory is used to store computer-executed instructions corresponding to the method in any one of claims 1 to 11;

When the URL matching device is running, the processor reads the computer-implemented instructions in the memory to cause the URL matching device to perform the steps of any one of the methods of claims 1-11.

13. A non-volatile computer-readable storage medium storing a computer program, characterized in that, when a computer executes the computer program, the computer executes the steps of the method according to any one of claims 1 to 11.