CN104392002B

CN104392002B - A kind of the approximate of extensive collections of web pages repeats lookup method

Info

Publication number: CN104392002B
Application number: CN201410779353.6A
Authority: CN
Inventors: 张鹏; 熊翠文; 刘庆云; 杨嵘; 郑超; 刘俊朋; 李舒
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2014-12-15
Filing date: 2014-12-15
Publication date: 2017-09-26
Anticipated expiration: 2034-12-15
Also published as: CN104392002A

Abstract

The invention relates to a method for approximate duplicate search of a large-scale webpage collection, which uses point signatures of documents to filter web page content noise, and combines partitioning and inverted index pruning to complete approximate duplicate search, making approximate duplicate search efficient and only calculating points The Jaccard similarity of signatures keeps the complexity of the method low.

Description

A Approximate Duplicate Finding Method for Large-Scale Web Collections

技术领域technical field

本发明涉及信息检索技术，特别是一种大规模网页集合中核心内容近似重复查找方法。The invention relates to an information retrieval technology, in particular to a method for searching for approximate repetition of core content in a large-scale webpage collection.

背景技术Background technique

随着互联网的飞速发展，互联网上存储的数据规模不断扩大，要想快速获得想要的数据，搜索引擎成为主要的手段。虽然搜索引擎在返回搜索结果中去除了重复网页，但是近似重复网页不断出现在语料库中，要获得更精确有价值的结果，需要查找这些近似重复的网页，并加以去除。然而，传统的近似重复查找方法有很多不足之处，例如shingling方法基于shingles或n-gram子集计算Jaccard重叠复杂度太高，指纹模式方法基于文档偏移量来选择位串带来很大额外开销，局部敏感哈希方法将多个签名映射单个哈希值增加了签名提取和哈希的额外开销。With the rapid development of the Internet, the scale of data stored on the Internet continues to expand. In order to quickly obtain the desired data, search engines have become the main means. Although search engines remove duplicate web pages in the returned search results, near-duplicate web pages continue to appear in the corpus. To obtain more accurate and valuable results, it is necessary to find and remove these near-duplicate web pages. However, traditional approximate duplicate search methods have many deficiencies. For example, the shingling method based on shingles or n-gram subsets is too complex to calculate Jaccard overlap. Overhead, locality-sensitive hashing methods map multiple signatures to a single hash value adding additional overhead for signature extraction and hashing.

发明内容Contents of the invention

本发明技术解决问题：克服现有技术的不足，提供一种大规模网页集合近似重复查找的方法，使用文档的点签名来过滤网页内容噪声，结合分区和倒排索引剪枝来完成近似重复查找，使得近似重复查找效率高，仅计算点签名的Jaccard相似度使得方法的复杂度很低。The technical solution of the present invention is to overcome the deficiencies of the prior art and provide a method for approximate duplicate search of a large-scale webpage collection, using document point signatures to filter web page content noise, and combining partitioning and inverted index pruning to complete approximate duplicate search , so that the approximate duplicate search efficiency is high, and only the Jaccard similarity of point signatures is calculated so that the complexity of the method is very low.

本发明技术解决方案：一种大规模网页集合的近似重复查找方法，过程如图1所示：The technical solution of the present invention: an approximate duplicate search method for a large-scale webpage collection, the process is shown in Figure 1:

步骤0中，开始执行本发明，转向步骤1执行；In step 0, start to execute the present invention, and turn to step 1 for execution;

步骤1中，输入带有加权点签名的文档向量、带有边界[p_k,p_k+1)的分区k和倒排列表(此时倒排列表为空)，转向步骤2执行；[p_k,p_k+1)为文档长度的范围，是一个半开半闭区间，p_k为区间下边界，p_k+1为区间上边界；In step 1, input document vector with weighted point signature, partition k with boundary [p _k ,p _k+1 ) and inverted list (the inverted list is empty at this time), turn to step 2 to execute; [p _k , p _k+1 ) is the range of document length, which is a half-open and half-closed interval, p _k is the lower boundary of the interval, and p _k+1 is the upper boundary of the interval;

步骤2中，存放结果的对集初始化为空，转向步骤3执行；In step 2, the pair set for storing the result is initialized to be empty, and then turn to step 3 for execution;

步骤3中，判断文档长度随机排序序列中是否存在没有被计算过的文档向量d_i，有则转向步骤4执行，否则转向步骤22执行；In step 3, it is judged whether there is an uncalculated document vector d _i in the document length random sorting sequence, and if so, turn to step 4 for execution, otherwise turn to step 22 for execution;

步骤4中，分区k的区间下边界p_k设置文档向量d_i的长度，转向步骤5执行；In step 4, the interval lower boundary p _k of the partition k sets the length of the document vector d _i , and turns to step 5 for execution;

步骤5中，将分区k中d_i所有的点签名按照频率做一次升序排列，将点签名频率最小的点签名置于第一个，转向步骤6执行；In step 5, arrange all point signatures of d _i in partition k in ascending order of frequency, put the point signature with the smallest point signature frequency first, and turn to step 6 to execute;

步骤6中，将第一辅助边界delta₁置为0，将已检测文档向量集置为空，转向步骤7执行；In step 6, set the first auxiliary boundary delta ₁ to 0, set the detected document vector set to empty, and turn to step 7 for execution;

步骤7中，判断d_i的点签名是否存在没有被计算的点签名s_ij，其中，j为点签名在d_i中的位置，若存在，则，转向步骤8执行，否则转向步骤20执行；In step 7, it is judged whether there is a point signature s _ij that has not been calculated in the point signature of d _i , where j is the position of the point signature in d _i , if it exists, then go to step 8, otherwise go to step 20;

步骤8中，分区k中与s_ij相关文档向量置入倒排列表list_kj中，且按文档长度降序排列，转向步骤9执行；In step 8, the document vectors related to s _ij in partition k are put into the inverted list list _kj , and arranged in descending order according to the document length, and then turn to step 9 for execution;

步骤9中，将第二辅助边界delta₂置为0，转向步骤10执行；In step 9, set the second auxiliary boundary delta ₂ to 0, and turn to step 10 for execution;

步骤10中，判断list_kj中是否存在按文档长度降序排列的文档向量d_i’，若存在，则转向步骤11执行，否则，转向步骤18执行；In step 10, it is judged whether there is a document vector d _i ' in descending order of document length in the list _kj , if it exists, then go to step 11, otherwise, go to step 18;

步骤11中，delta₂设置为文档向量d_i和d_i’长度之差，转向步骤12执行；In step 11, delta ₂ is set to be the difference between the lengths of document vectors d _i and d _i' , and turn to step 12 for execution;

步骤12中，判断文档向量d_i和d_i’是否相等或文档向量d_i’已经被检测过，若满足，则转向步骤10执行，否则转向步骤13执行；In step 12, it is judged whether the document vector d _i and d _i' are equal or the document vector d _i' has been detected, if satisfied, then go to step 10, otherwise go to step 13;

步骤13中，判断delta₂小于0且delta₁-delta₂大于d_i’长度的1-t倍，若满足，则转向步骤10执行，否则转向步骤14执行；In step 13, it is judged that delta ₂ is less than 0 and delta ₁ -delta ₂ is greater than 1-t times the length of d _i' , if satisfied, then go to step 10, otherwise go to step 14;

步骤14中，判断delta₂大于等于0且delta₁+delta₂大于d_i长度的1-t倍，若满足，则转向步骤18执行，否则转向步骤15执行；In step 14, it is judged that delta ₂ is greater than or equal to 0 and delta ₁ + delta ₂ is greater than 1-t times of the length of d _i , if satisfied, then turn to step 18 for execution, otherwise turn to step 15 for execution;

步骤15中，判断文档向量d_i和d_i’的Jaccard相似度大于等于t，若满足，则转向步骤16执行，否则转向步骤10执行；In step 15, it is judged that the Jaccard similarity between the document vectors d _i and d _i' is greater than or equal to t, if satisfied, then turn to step 16 for execution, otherwise turn to step 10 for execution;

步骤16中，将<d_i,d_i’>添加到结果对集中，转向步骤17执行；<d_i,d_i’>由文档向量d_i和d_i’组成的文档对，文档向量d_i和d_i’的Jaccard相似度大于等于阈值时则匹配，以<d_i,d_i’>形式加入对集，否则不加入对集。In step 16, add <d _i , d _i' > to the result pair set, and turn to step 17 for execution; <d _i , d _i' > is a document pair composed of document vector d _i and d _i ', document vector d _i When the Jaccard similarity with d _i' is greater than or equal to the threshold, it matches, and joins the pair set in the form of <d _i , d _i' >, otherwise it does not join the pair set.

步骤17中，将d_i’添加到已检测文档向量集，转向步骤10执行；In step 17, add d _i' to the detected document vector set, and turn to step 10 for execution;

步骤18中，delta₁的值增加文档向量d_i中点签名s_ij的频率，转向步骤19执行；In step 18, the value of delta ₁ increases the frequency of point signature s _ij in the document vector d _i , and turns to step 19 for execution;

步骤19中，判断delta₁大于分区k中没有被检测文档向量的最大长度的1-t倍，若满足，则转向步骤20执行，否则转向步骤7执行；In step 19, it is judged that delta ₁ is greater than 1-t times of the maximum length of the undetected document vector in partition k, if satisfied, then turn to step 20 to execute, otherwise turn to step 7 to execute;

步骤20中，判断分区上界与文档向量d_i的长度之差小于等于分区上界的1-t倍，若满足，则转向步骤21执行，否则转向步骤3执行；t是Jaccard相似度的临界值比例，取值根据实际情况而定，一般在0.5到1之间；In step 20, it is judged that the difference between the upper bound of the partition and the length of the document vector d _i is less than or equal to 1-t times the upper bound of the partition, if it is satisfied, then go to step 21, otherwise go to step 3; t is the criticality of Jaccard similarity Value ratio, the value depends on the actual situation, generally between 0.5 and 1;

步骤21中，将区间上边界p_k+1置为区间下边界p_k，迭代至下一个分区，转向步骤6执行；In step 21, set the interval upper boundary p _k+1 as the interval lower boundary p _k , iterate to the next partition, and turn to step 6 for execution;

步骤22中，返回结果对集，转向步骤23执行；In step 22, return the result pair set, and turn to step 23 for execution;

步骤23中，结束整个程序。In step 23, the whole program ends.

本发明与现有技术相比的优点在于：The advantage of the present invention compared with prior art is:

(1)现有的近似重复网页检测方法主要有shingling算法、指纹模式、局部敏感哈希三种。shingling算法基于shingles或n-grams计算Jaccard重叠，其复杂度太高，在此基础上引进"super shingles"的改进也只能小幅度降低结果的精度；指纹模式提取文档中有代表的单词或整个句子的哈希值连接位串生成指纹特征以判断文档相似性，然而位串选择产生很大的额外开销；局部敏感哈希连接来自每个数据对象到单个哈希值，通过独立的哈希函数，额外增加了签名提取和多个哈希的开销。(1) Existing near-duplicate web page detection methods mainly include shingling algorithm, fingerprint mode, and locality-sensitive hashing. The shingling algorithm calculates Jaccard overlap based on shingles or n-grams, and its complexity is too high. On this basis, the improvement of introducing "super shingles" can only slightly reduce the accuracy of the result; the fingerprint mode extracts representative words or entire Sentence hashes concatenate bitstrings to generate fingerprint features for judging document similarity, however bitstring selection incurs significant overhead; locality-sensitive hashing concatenates from each data object to a single hash, via a separate hash function , with the additional overhead of signature extraction and multiple hashes.

(2)本发明采用停用先行词与相邻内容项连接成的短链作为文档签名，签名提取速度快、开销小；计算Jaccard相似度时，结合集合分区缩小计算范围，避免计算明显不相似的文档向量之间的相似度，引入倒排索引剪枝去除了冗余计算过程加快计算进程。本发明的方法能够达到精确、快速、计算复杂度低的目的。(2) The present invention uses a short chain formed by disabling antecedents and adjacent content items as a document signature, and the signature extraction speed is fast and the overhead is small; when calculating the Jaccard similarity, the calculation range is reduced in combination with the set partition, and the calculation is obviously not similar The similarity between the document vectors, the introduction of inverted index pruning removes the redundant calculation process and speeds up the calculation process. The method of the invention can achieve the goals of accuracy, speed and low computational complexity.

附图说明Description of drawings

图1为本发明方法实现流程图；Fig. 1 is the realization flow chart of the method of the present invention;

图2为本发明中点签名近似重复查找方法执行图。Fig. 2 is an execution diagram of the midpoint signature approximate duplicate search method in the present invention.

具体实施方式detailed description

在阐述本发明之前，先对相关概念进行一下解释和说明。Before elaborating the present invention, the relevant concepts are firstly explained and illustrated.

点(p)：在文档经常出现的词，在自然语言文本中被称作停用词，例如is,the,do,have等。Point (p): Words that often appear in documents are called stop words in natural language texts, such as is, the, do, have, etc.

普通词汇：文档中不是点的单词。Common vocabulary: words in the document that are not dots.

点距(d)：文档中普通词汇与它之前选定的点之间或普通词汇与它之前的普通词汇之间间隔的词汇数，点不在此计数中，例如"a rally to kick"，词汇"a"和"to"在文档中都是经常出现的，都可以作为点，"rally"与"a"之间的点距为1，"kick"与"a"之间的点距为2，"kick"与"rally"之间的点距为1(因为"to"作为点不在计数中)，"kick"与"to"之间的点距为1。Dot distance (d): the number of words in the document between the common word and the dot selected before it or between the common word and the common word before it, dots are not included in this count, such as "a rally to kick", the word " Both a" and "to" appear frequently in documents, and both can be used as points. The point distance between "rally" and "a" is 1, and the point distance between "kick" and "a" is 2. The distance between "kick" and "rally" is 1 (since "to" is not counted as a point), and the distance between "kick" and "to" is 1.

链长(c)：规定的文档中相对于某个点满足点距的普通词汇个数，例如"a rallyto kick"，相对于点"a"点距为1的链长为2。Chain length (c): The number of common words in the specified document that satisfy the dot distance relative to a certain point, for example, "a rallyto kick", the chain length is 2 when the dot distance is 1 relative to point "a".

点签名(s)：用于查找一个连续的链长为c的普通词汇链表，链表中的普通词汇之间的间隔为点距d，其中链表中的第一个普通词汇与点之间间隔为d，表示形式为p(d,c)，即点签名＝点(点距：链长)，例如"a rally to kick"，选定"a"为点，点距d为1，链长c为2，点签名s＝"a"(1,2)值为{"a":"rally":"kick"}。Point signature (s): It is used to find a continuous list of common words with a chain length of c. The distance between common words in the linked list is the dot distance d, and the distance between the first common word in the linked list and the dot is d, the representation form is p(d,c), that is, point signature = point (point distance: chain length), such as "a rally to kick", select "a" as point, point distance d is 1, and chain length c is 2, point signature s="a"(1,2) is {"a":"rally":"kick"}.

点签名频率：在文档中点签名出现的次数，例如"a rally to kick off aweeklong campaign"，点签名s＝"a"(1,2)出现的次数为2，即频率为2。Point signature frequency: the number of times the point signature appears in the document, for example, "a rally to kick off aweeklong campaign", the number of times the point signature s="a"(1,2) appears is 2, that is, the frequency is 2.

点签名权重：即点签名频率。Point signature weight: the point signature frequency.

文档长度(|d|)：文档中所有的点签名频率之和，例如"a rally to kick off aweeklong campaign"，点签名s₁＝"a"(1,2)出现的次数为2，点签名s₂＝"to"(1,2)出现的次数为1，文档长度为2+1＝3。Document length (|d|): the sum of the frequency of all point signatures in the document, for example, "a rally to kick off aweeklong campaign", point signature s ₁ = "a" (1,2) occurs twice, point signature s ₂ = "to"(1,2) appears for 1 times, and the document length is 2+1=3.

文档向量(d)：文档表示成带有频率的点签名的序列，即文档向量＝{点签名1：点签名1频率，点签名2：点签名2频率，…}，例如d＝"a rally to kick off a weeklongcampaign"，文档向量表示成d＝{s₁:2,s₂:1}，其中s₁＝"a"(1,2)，s₂＝"to"(1,2)，文档长度|d|＝3。Document vector (d): The document is expressed as a sequence of point signatures with frequencies, that is, document vector = {point signature 1: point signature 1 frequency, point signature 2: point signature 2 frequency, ...}, for example d="a rally to kick off a weeklongcampaign", the document vector is expressed as d={s ₁ :2, s ₂ :1}, where s ₁ = "a"(1,2), s ₂ = "to"(1,2), Document length |d|=3.

分区k：文档长度大小的范围，是一个半闭半开区间，区间的下界固定，上界不固定。[p_k,p_k+1)为文档长度的范围，是一个半开半闭区间，p_k为区间下边界，p_k+1为区间上边界。Partition k: the range of document length, which is a half-closed and half-open interval. The lower bound of the interval is fixed, and the upper bound is not fixed. [p _k ,p _k+1 ) is the range of document length, which is a half-open and half-closed interval, p _k is the lower boundary of the interval, and p _k+1 is the upper boundary of the interval.

倒排列表(list)：具有相同点签名的文档向量组成列表，即倒排列表＝{文档向量1，文档向量2，…}，列表中每个文档向量带有该点签名频率，例如s₁＝"a"(1,2)，s₂＝"to"(1,2)，s₃＝"the"(2,1)，d₁＝"a rally to kick off a weeklong campaign"，d₁＝{s₁:2,s₂:1}，d₂＝"a biological argument for the cougar phenomenon"，d₂＝{s₁:1,s₃:1}，d₃＝"the past can be corrected"，d₃＝{s₃:1}，list₁：{d₁:2,d₂:1}(点签名s₁＝"a"(1,2)对应的倒排列表)，list₂：{d₁:1}(点签名s₂＝"to"(1,2)对应的倒排列表)，list₃：{d₂:1,d₃:1}(点签名s₃＝"the"(2,1)对应的倒排列表)。Inverted list (list): A list of document vectors with the same point signature, that is, inverted list = {document vector 1, document vector 2, ...}, each document vector in the list has the frequency of the point signature, for example, s ₁ ="a"(1,2), s ₂ ="to"(1,2), s ₃ ="the"(2,1), d ₁ ="a rally to kick off a weeklong campaign", d ₁ ＝{s ₁ :2, s ₂ :1}, d ₂ ="a biological argument for the cougar phenomenon", d ₂ ＝{s ₁ :1, s ₃ :1}, d ₃ ="the past can be corrected ", d ₃ ={s ₃ :1}, list ₁ : {d ₁ :2,d ₂ :1} (point signature s ₁ = "a" (1,2) corresponding inverted list), list ₂ : {d ₁ :1} (point signature s ₂ = "to" (1,2) corresponding inverted list), list ₃ : {d ₂ :1, d ₃ :1} (point signature s ₃ = "the" (2,1) corresponds to the inverted list).

Jaccard相似度sim：两个文档向量中对应的点签名频率的较小值之和除以两文档向量中对应的点签名频率的较大值之和，sim(d₁,d₂)＝(1+0+0)/(2+1+1)＝0.25。Jaccard similarity sim: the sum of the smaller values of the corresponding point signature frequencies in the two document vectors divided by the larger value of the corresponding point signature frequencies in the two document vectors, sim(d ₁ ,d ₂ )=(1 +0+0)/(2+1+1)=0.25.

阈值(t)：Jaccard相似度的临界值比例，取值根据实际情况而定，一般在0.5到1之间。Threshold (t): The critical value ratio of Jaccard similarity, the value depends on the actual situation, generally between 0.5 and 1.

辅助边界(delta)：用于临时记录数值的变量。Auxiliary boundary (delta): A variable used to temporarily record values.

对集{<d_i,d_i’>}：即文档对集合，两文档向量d_i和d_i’的Jaccard相似度大于等于阈值时则匹配，两文档向量以"<文档向量1，文档向量2>"，即"<d_i,d_i’>"，形式加入对集中，否则不匹配，不加入对集。Pair set {<d _i ,d _i' >}: document pair set, when the Jaccard similarity of two document vectors d _i and d _i' is greater than or equal to the threshold, the two document vectors are matched with "<document vector 1, document vector 2>", that is, "<d _i ,d _i' >", the form will be added to the pair set, otherwise it will not match and will not be added to the pair set.

以下结合上述对本发明进行详细描述。The present invention will be described in detail below in conjunction with the above.

网页集合近似去重的执行图如图2所示。图2中计算的网页文档向量为d₁＝{s₁:5,s₂:4,s₃:4}、d₂＝{s₁:8,s₂:4}和d₃＝{s₁:4,s₂:5,s₃:5}，文档向量d₁、d₂和d₃的长度分别为13、12和14，阈值t设为0.8执行过程描述如下：Figure 2 shows the execution diagram of the approximate deduplication of the web page set. The web document vectors calculated in Figure 2 are d ₁ ={s ₁ :5, s ₂ :4, s ₃ :4}, d ₂ ={s ₁ :8,s ₂ :4} and d ₃ ={s ₁ :4,s ₂ :5,s ₃ :5}, the lengths of document vectors d ₁ , d ₂ and d ₃ are 13, 12 and 14 respectively, and the threshold t is set to 0.8. The execution process is described as follows:

(1)输入带有加权点签名的文档向量d₁、d₂和d₃，输入边界为[p_k,p_k+1)的分区和倒排列表；(1) Input document vectors d ₁ , d ₂ and d ₃ with weighted point signatures, and input partitions and inverted lists whose boundaries are [p _k ,p _k+1 );

(2)对集设为空，文档向量按长度随机排列，随机选择d₁与其他文档向量比较；(2) The pair set is set to be empty, the document vectors are randomly arranged according to the length, and d ₁ is randomly selected to compare with other document vectors;

(3)将分区下边界置为文档向量d₁的长度13，点签名s₁、s₂和s₃的频率分别为17、13、9，点签名按照频率的大小做一次升序排序，选择计算点签名频率最小的点签名即s₃下的文档列表，辅助边界delta₁设为0，已检测文档向量集设为空；(3) Set the lower boundary of the partition to the length 13 of the document vector d ₁ , and the frequencies of the point signatures s ₁ , s ₂ and s ₃ are 17, 13, and 9 respectively, and sort the point signatures in ascending order according to the frequency, and choose to calculate The point signature with the smallest point signature frequency is the document list under s ₃ , the auxiliary boundary delta ₁ is set to 0, and the detected document vector set is set to empty;

(4)在每个点签名下的文档向量按照文档长度降序排列，首先遍历点签名s₃的文档向量集放入倒排列表中，检测文档向量d₃，辅助边界delta₂＝d₁的长度-d₃的长度＝-1<0，文档向量d₁与d₃不相等且d₃未被检测，delta₁-delta₂(＝1)小于(1-0.8)*14(＝2.8)，delta₁+delta₂(＝-1)小于(1-0.8)*13(＝2.6)，所以计算文档向量d₁与点签名s₃中的文档向量d₃的Jaccard相似度sim(d₁,d₃)＝(4+4+4)/(5+5+5)＝0.8≥t，将<d₁,d₃>加入对集；(4) The document vectors under each point signature are arranged in descending order according to the length of the document. Firstly, traverse the document vector set of point signature s ₃ and put it into the inverted list, detect the document vector d ₃ , and the auxiliary boundary delta ₂ = the length of d ₁ - length of d ₃ = -1<0, document vector d ₁ is not equal to d ₃ and d ₃ is not detected, delta ₁ -delta ₂ (=1) is less than (1-0.8)*14 (=2.8), delta ₁ +delta ₂ (=-1) is less than (1-0.8)*13(=2.6), so calculate the Jaccard similarity sim(d ₁ ,d ₃ of document vector d ₁ and document vector d ₃ in point signature s ₃ )=(4+4+4)/(5+5+5)=0.8≥t, add <d ₁ ,d ₃ > to the pair set;

(5)检测点签名s₃的倒排列表中文档向量d₁，文档向量d₁与d₁相等，不计算，delta₁增加文档向量d₁的点签名s₃的频率，delta₁＝4，并结束本次遍历，遍历第二个点签名s₁的文档向量集，delta₁设为0，将s₁的文档向量集放入列表中，检测文档向量d₂，辅助边界delta₂＝d₁的长度-d₂的长度＝1>0，文档向量d₁与d₂不相等且d₂未被检测，delta₁-delta₂(＝-1)小于(1-0.8)*12(＝2.4)，delta₁+delta₂(＝1)小于(1-0.8)*13(＝2.6)，所以计算文档向量d₁与点签名s₃中的文档向量的Jaccard相似度，sim(d1,d2)＝(5+4)/(8+4+4)＝0.5625<t，<d₁,d₃>不加入对集；(5) Detect document vector d ₁ in the inverted list of point signature s ₃ , document vector d ₁ is equal to d ₁ , not calculated, delta ₁ increases the frequency of point signature s ₃ of document vector d ₁ , delta ₁ = 4, And end this traversal, traverse the document vector set of the second point signature s ₁ , set delta ₁ to 0, put the document vector set of s ₁ into the list, detect document vector d ₂ , auxiliary boundary delta ₂ = d ₁ The length of -d ₂ = 1>0, the document vector d ₁ is not equal to d ₂ and d ₂ has not been detected, delta ₁ -delta ₂ (=-1) is less than (1-0.8)*12 (=2.4) , delta ₁ +delta ₂ (=1) is less than (1-0.8)*13 (=2.6), so calculate the Jaccard similarity between the document vector d ₁ and the document vector in the point signature s ₃ , sim(d1,d2)= (5+4)/(8+4+4)=0.5625<t, <d ₁ , d ₃ >does not join the pair set;

(6)继续遍历倒排列表，列表中文档向量d₁与d₁相等，文档向量d₃已检测，delta₁增加文档向量d₁的点签名s₁的频率，delta₁＝5，结束本次遍历，遍历第三个点签名的文档集s₂，delta₁设为0，列表中文档向量d₁与d₁相等，文档向量d₃、d₂已检测，结束本次遍历，返回对集{<d₁,d₃>}，结束程序。(6) Continue traversing the inverted list, the document vector d ₁ and d ₁ in the list are equal, the document vector d ₃ has been detected, delta ₁ increases the frequency of the point signature s ₁ of the document vector d ₁ , delta ₁ = 5, end this time Traverse, traverse the document set s ₂ signed by the third point, set delta ₁ to 0, document vector d ₁ and d ₁ in the list are equal, document vector d ₃ and d ₂ have been detected, end this traversal, and return the pair set{ <d ₁ ,d ₃ >}, end the program.

提供以上实施例仅仅是为了描述本发明的目的，而并非要限制本发明的范围。本发明的范围由所附权利要求限定。不脱离本发明的精神和原理而做出的各种等同替换和修改，均应涵盖在本发明的范围之内。The above embodiments are provided only for the purpose of describing the present invention, not to limit the scope of the present invention. The scope of the invention is defined by the appended claims. Various equivalent replacements and modifications made without departing from the spirit and principle of the present invention shall fall within the scope of the present invention.

Claims

1. A method for approximately repeated search of a large-scale webpage collection, characterized in that the steps of realization are as follows:

Step 1, input document vector with weighted point signature, partition k with boundary [p _k ,p _k+1 ) and inverted list, turn to step 2 for execution; [p _k ,p _k+1 ) is document length The range of is a semi-open and semi-closed interval, p _k is the lower boundary of the interval, and p _k+1 is the upper boundary of the interval;

In step 2, the pair set storing the result is initialized to be empty, and then turn to step 3 for execution; in step 3, judge whether there is an uncalculated document vector d _i in the document length random sorting sequence, and if it exists, turn to step 4 for execution, otherwise turn to step 22 execution;

In step 4, the interval lower boundary p _k of the partition k sets the length of the document vector d _i , and turns to step 5 for execution;

In step 5, arrange all point signatures of document vector d _i in partition k in ascending order of frequency, put the point signature with the smallest point signature frequency first, and turn to step 6 to execute;

In step 6, the first auxiliary boundary delta ₁ is set to 0, and the detected document vector set is set to empty, and the execution turns to step 7;

In step 7, judge whether the point signature of document vector d _i has point signature s _ij that has not been calculated, where j is the position of point signature in document vector d _i , if it exists, go to step 8, otherwise go to step 20 execution;

In step 8, the document vectors related to point signature s _ij in partition k are put into the inverted list list _kj , and arranged in descending order according to the document length, and then turn to step 9 for execution;

In step 9, set the second auxiliary boundary delta ₂ to 0, and turn to step 10 for execution;

In step 10, it is judged whether there is a document vector d _i' arranged in descending order of document length in the inverted list list _kj , if it exists, then go to step 11, otherwise, go to step 18;

In step 11, set the second auxiliary boundary delta ₂ as the difference between the lengths of document vectors d _i and d _i' , and turn to step 12 for execution;

In step 12, it is judged whether the document vector d _i and d _i' are equal or the document vector d _i' has been detected, if satisfied, then go to step 10, otherwise go to step 13;

In step 13, it is judged that delta ₂ is less than 0 and delta ₁ -delta ₂ is greater than 1-t times the length of d _i' , if satisfied, then go to step 10, otherwise go to step 14;

In step 14, it is judged that the second auxiliary boundary delta ₂ is greater than or equal to 0 and delta ₁ + delta ₂ is greater than 1-t times the length of d _i , if satisfied, then go to step 18, otherwise go to step 15;

In step 15, it is judged that the Jaccard similarity between the document vectors d _i and d _i' is greater than or equal to t, if satisfied, then turn to step 16 for execution, otherwise turn to step 10 for execution;

In step 16, add <d _i , d _i' > to the result pair set, and turn to step 17 for execution; <d _i , d _i' > is a document pair composed of document vectors d _i and d _i' , document vector d _i When the Jaccard similarity with d _i' is greater than or equal to the threshold, it matches, and joins the pair set in the form of <d _i , d _i' >, otherwise it does not join the pair set;

In step 17, add d _i' to the detected document vector set, and turn to step 10 for execution;

In step 18, the value of delta ₁ increases the frequency of point signature s _ij in the document vector d _i , and turns to step 19 for execution;

In step 19, it is judged that delta ₁ is greater than 1-t times of the maximum length of the undetected document vector in partition k, if satisfied, then turn to step 20 to execute, otherwise turn to step 7 to execute;

In step 20, it is judged that the difference between the upper bound of the partition and the length of the document vector d _i is less than or equal to 1-t times the upper bound of the partition, if it is satisfied, then go to step 21, otherwise go to step 3; t is the criticality of Jaccard similarity Value ratio, the value depends on the actual situation, and the value range is 0.5-1;

In step 21, set the interval upper boundary p _k+1 as the interval lower boundary p _k , iterate to the next partition, and turn to step 6 for execution;

In step 22, return the result pair set, and turn to step 23 for execution;

In step 23, the whole program ends.