CN103559259A

CN103559259A - Method for eliminating similar-duplicate webpage on the basis of cloud platform

Info

Publication number: CN103559259A
Application number: CN201310537406.9A
Authority: CN
Inventors: 向阳; 陈佑雄; 张依杨; 平宇; 张波; 袁书寒
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2013-11-04
Filing date: 2013-11-04
Publication date: 2014-02-05

Abstract

The invention discloses a method for eliminating near-duplicate webpages based on a cloud platform. The method comprises the following steps: preprocessing a webpage and extracting its body text; extracting feature items from the text to represent its content; computing fingerprints of the feature items, thereby compressing or reducing the dimensionality of the feature items so that they can be stored and searched conveniently; and judging whether the original webpages are near duplicates according to a similarity computed from the feature fingerprints. The method minimizes the number of near-duplicate webpages that go undetected and supports similarity computation across different webpage structures.

Description

Method for eliminating near-duplicate webpages based on a cloud platform

Technical field

The present invention relates to a method for eliminating near-duplicate webpages based on a cloud platform.

Background art

People who use search engines, blogs, or RSS readers to retrieve news often run into information overload and repetition: after an event occurs, the pages covering it frequently have essentially the same content and express the same viewpoint. With so much repeated information, users spend a great deal of time reading material they have already seen.

Take Google as an example. The Googlebot crawler fetches about 20 billion webpages every day and, in total, tracks about 30 billion unique URLs. With such a huge volume of data, webpage quality is inevitably uneven, the results returned for a user's query contain a great deal of repetition, and users often cannot find the information they need. Current search engines do not solve this problem well. For example, when searching Google for "iPad mini released with Apple A5 dual-core, no Retina", 9 of the 10 entries on the first page of results have repeated content, and the first page of results is precisely what most people care about.

The biggest problems in search come from three directions: search quality, the search user experience, and the search ecosystem as a whole. Search quality is the key to competition among search engines, and a large number of duplicate webpages is fatal to search quality. Duplicate pages waste both crawling time and storage space. In particular, when the index is built, index entries must be created for a large number of duplicate pages, which makes the inverted file huge and slows the response of the query service. If these duplicate pages can be found and removed from the webpage database, part of the storage space is saved and can be used to store more useful webpage content and to perform incremental collection, and retrieval quality improves at the same time. The problem to be solved is therefore how to remove duplicate webpages efficiently and accurately, improve retrieval efficiency, and improve the user's retrieval experience.

As the era of big data quietly arrives, the commercial value behind data and information should be mined in the course of processing them. The large number of duplicate webpages on the network brings problems such as wasted resources, a heavier indexing burden, and degraded service to internet applications such as search engines. Removing duplicate pages effectively and accurately is of great significance for saving bandwidth during mining, improving mining speed, and mining highly time-sensitive resources.

Summary of the invention

The technical problem to be solved by the present invention is to provide a cloud-platform-based method for eliminating near-duplicate webpages that reduces near-duplicate webpages to the greatest possible extent.

In order to solve the above technical problem, the present invention provides a method for eliminating near-duplicate webpages based on a cloud platform, characterized in that:

The method comprises the following steps:

(1) preprocessing the webpage and extracting its body text;

(2) extracting feature items from the body text to characterize its content;

(3) computing fingerprints of the feature items, thereby compressing or reducing the dimensionality of the feature items to facilitate storage and retrieval;

(4) computing a similarity based on the feature fingerprints and judging whether the original webpages are near duplicates.

Steps (1), (2), and (3) above preprocess a given webpage and represent its body text as a set of block fingerprints. The algorithm for this stage is as follows:

Step 1: preprocess the webpage, remove webpage noise, and extract the body text;

The body text is extracted with a webpage main-content extraction algorithm based on a weighted DOM tree.

Step 2: take the next natural block of the body text to be processed and split it into sentences, using punctuation marks such as ",", ".", "?", and "!" as separators. From these sentences, extract the long sentences of at least k words as the feature items of the block, and compute a weight for each extracted long sentence from the topic semantics of the webpage and the characteristics of the long sentence itself;

Step 3: compute the block fingerprint from the weighted long sentences of the block, and add it to the block fingerprint set of the text;

Step 4: if there is another block, return to Step 2 and continue with long-sentence extraction and fingerprint computation for the new block; if the end of the text has been reached, this stage ends and the block fingerprint set of the full text has been obtained.
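The block fingerprint stage described in Steps 2 to 4 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the text does not give the weighting formula or the fingerprint function, so `sentence_weight` (length scaled by topic-term hits) and the Simhash-style accumulation over MD5 hashes are assumptions, as are the function names, the use of newlines as block boundaries, and counting k in characters rather than words.

```python
import hashlib
import re

BITS = 64
# Sentence separators; ASCII and full-width forms are both assumed here.
SEPARATORS = re.compile(r"[,.?!，。？！]")


def long_sentences(block, k):
    """Split a block on punctuation and keep sentences of at least k characters."""
    parts = (p.strip() for p in SEPARATORS.split(block))
    return [p for p in parts if len(p) >= k]


def sentence_weight(sentence, topic_terms):
    """Hypothetical weight: sentence length, scaled up for each topic term it contains."""
    hits = sum(1 for term in topic_terms if term in sentence)
    return len(sentence) * (1 + hits)


def block_fingerprint(block, k, topic_terms):
    """Simhash-style 64-bit fingerprint built from the weighted long sentences."""
    sentences = long_sentences(block, k)
    if not sentences:
        return None  # the block has no feature item
    totals = [0.0] * BITS
    for sentence in sentences:
        digest = hashlib.md5(sentence.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        w = sentence_weight(sentence, topic_terms)
        for i in range(BITS):
            totals[i] += w if (h >> i) & 1 else -w
    return sum(1 << i for i in range(BITS) if totals[i] > 0)


def fingerprint_set(text, k=8, topic_terms=frozenset()):
    """Fingerprint every non-empty block (here: line) of the body text."""
    fingerprints = set()
    for block in text.split("\n"):
        fp = block_fingerprint(block, k, topic_terms)
        if fp is not None:
            fingerprints.add(fp)
    return fingerprints
```

Blocks that contain no sentence of at least k characters contribute nothing to the set, which mirrors the text's restriction of feature items to long sentences.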

Step (4) above computes the similarity of the block fingerprint sets already obtained and uses it to judge whether the corresponding texts are near duplicates, as follows:

Step 1: set the near-duplicate threshold to r;

Step 2: compute the similarity; if the similarity > r, the webpages are judged to be near duplicates; otherwise they are not.

Accurate body-text extraction is an important prerequisite for near-duplicate detection. The method integrates well with existing web crawling systems, has been shown experimentally to be effective and accurate, and significantly improves processing efficiency.

Compared with prior art, the present invention has the following advantages:

(1) For near-duplicate webpage removal, webpage blocking and topic-information extraction techniques are used to extract the feature vector of a webpage. Webpage blocking merges content nodes into coarse-grained webpage blocks, mainly on the basis of the DOM syntax tree. Topic content blocks are extracted mainly by comparing text similarity, with the posterior probability computed by a Bayesian method as an optimization. This processing removes the webpage noise that interferes with duplicate removal, segments the text content into terms, and extracts the webpage feature vector, with a clear performance improvement over previous methods;

(2) The basic principle of the traditional Shingle-based near-duplicate removal method is analyzed and, on that basis, an improved algorithm based on the MapReduce programming model is proposed. In a distributed environment the improved algorithm processes data more efficiently and scales out well;

(3) Shingle and Simhash are combined to improve the traditional webpage fingerprint computation. The resulting fingerprint has the property that the more similar the content of two webpages, the smaller the Hamming distance between their fingerprints, and it strikes a good balance between processing efficiency and accuracy. For the fingerprints obtained in this way, Manku's fast similar-fingerprint detection algorithm is analyzed; it trades space for time and greatly reduces the time cost;

(4) The working principle of a GFS-like distributed computing platform and the MapReduce model for distributed processing of massive data are introduced, and a deployment scheme for a distributed computing cluster based on the open-source framework Hadoop is proposed. On this platform the fingerprint comparison task is distributed and parallelized: each task-executing unit compares fingerprints with Manku's fast similar-fingerprint detection technique, and the system then merges the results of all units. Once the entries in the original fingerprint library that are similar to the fingerprint being added have been obtained, the corresponding fingerprints in the original fingerprint library and the corresponding webpages in the webpage library can be deleted.
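To make advantage (3) concrete, the following sketch combines word shingles with a Simhash fingerprint and measures the Hamming distance between two fingerprints. It is only an illustration under assumptions: the text does not specify the shingle width, the hash function, or the weighting, so the 3-token shingles, the SHA-1 based hash, and the frequency weights used here are hypothetical choices.

```python
import hashlib
from collections import Counter

BITS = 64


def shingles(tokens, w=3):
    """Return all contiguous w-token shingles (the whole sequence if it is shorter)."""
    if len(tokens) <= w:
        return [" ".join(tokens)]
    return [" ".join(tokens[i:i + w]) for i in range(len(tokens) - w + 1)]


def simhash(text, w=3):
    """64-bit Simhash over shingles, each weighted by its frequency in the text."""
    counts = Counter(shingles(text.split(), w))
    totals = [0] * BITS
    for shingle, weight in counts.items():
        digest = hashlib.sha1(shingle.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            totals[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(BITS) if totals[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two texts that share most of their shingles push most bit totals in the same direction, so their fingerprints tend to differ in only a few bits, while identical texts always give a distance of 0.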

Brief description of the drawings

Fig. 1 is a flowchart of the present invention;

Fig. 2 is an architecture diagram of the web crawling system of the present invention;

Fig. 3 is a flowchart of representing the webpage body text as a set of block fingerprints according to the present invention;

Fig. 4 is a table comparing the improved SimHash algorithm of the present invention with the traditional Manku algorithm;

Fig. 5 is a table comparing the running time of the improved SimHash algorithm of the present invention with that of the traditional Manku algorithm.

Detailed description of the embodiments

The present invention is described in detail below with reference to the drawings and specific embodiments.

As shown in Fig. 1, the present invention provides a method for eliminating near-duplicate webpages based on a cloud platform, characterized in that:

The method comprises the following steps:

(1) preprocessing the webpage and extracting its body text;

(2) extracting feature items from the body text to characterize its content;

As shown in Fig. 3, steps (1), (2), and (3) above preprocess the given webpages A and B and represent their body text as sets of block fingerprints. The algorithm for this stage is as follows:

Let n be the number of paragraphs of the given webpage A; in the figure, S is the block fingerprint set of webpage A.

Let $S_a$ and $S_b$ be the block fingerprint sets of webpages A and B, define their similarity as $\mathrm{sim}(S_a, S_b)$, and set the near-duplicate threshold to r:

Step 1: compute the similarity $\mathrm{sim}(S_a, S_b)$ of $S_a$ and $S_b$.

Step 2: if $\mathrm{sim}(S_a, S_b) > r$, webpages A and B are judged to be near duplicates; otherwise they are not.

This concludes the near-duplicate decision stage.
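The decision stage can be illustrated with the short sketch below. The text does not define how the similarity of two block fingerprint sets is computed, so the measure used here is an assumption: the fraction of fingerprints, counted over both sets, that have a counterpart in the other set within a small Hamming distance. The names `sim`, `is_near_duplicate`, and `max_distance`, and the default values, are hypothetical.

```python
def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def sim(s_a, s_b, max_distance=3):
    """Assumed similarity: share of fingerprints with a near match in the other set."""
    if not s_a or not s_b:
        return 0.0

    def matched(src, dst):
        return sum(
            1 for x in src if any(hamming(x, y) <= max_distance for y in dst)
        )

    return (matched(s_a, s_b) + matched(s_b, s_a)) / (len(s_a) + len(s_b))


def is_near_duplicate(s_a, s_b, r=0.8):
    """Step 2 of the decision stage: near duplicates if sim(S_a, S_b) > r."""
    return sim(s_a, s_b) > r
```

The pairwise comparison is quadratic in the set sizes, which is acceptable for the handful of block fingerprints on one page but not for matching against a whole fingerprint library.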

As shown in Fig. 2, the web crawling system architecture of the present invention comprises the following steps:

(1) Asynchronously crawling original webpages

Downloading pages from the internet usually involves DNS address resolution, which becomes a major bottleneck under high concurrency if no measures are taken. The speed of DNS resolution is improved by a multi-threaded concurrent DNS client, and the results of DNS resolution are cached in a DNS cache. In addition, the DNS client should obtain results over the UDP network protocol, and the retrieval should be asynchronous and non-blocking to improve performance.
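The caching and non-blocking aspects of this step can be sketched as follows. This is an assumption-laden illustration rather than the DNS client the text describes: it delegates to the operating system resolver through `asyncio`'s `loop.getaddrinfo` instead of speaking DNS over UDP itself, it ignores record TTLs, and the class name `CachingResolver` and the injectable `lookup` parameter are hypothetical.

```python
import asyncio
import socket


class CachingResolver:
    """Asynchronous resolver with an in-memory cache (no TTL handling in this sketch)."""

    def __init__(self, lookup=None):
        # `lookup` is an injectable coroutine function, useful for testing.
        self._lookup = lookup or self._system_lookup
        self._cache = {}     # host -> list of IP address strings
        self._pending = {}   # host -> in-flight lookup shared by concurrent callers

    @staticmethod
    async def _system_lookup(host):
        loop = asyncio.get_running_loop()
        infos = await loop.getaddrinfo(host, None, type=socket.SOCK_STREAM)
        return sorted({info[4][0] for info in infos})

    async def resolve(self, host):
        host = host.lower()
        if host in self._cache:
            return self._cache[host]
        task = self._pending.get(host)
        if task is None:
            # First request for this host: start one lookup for all callers.
            task = asyncio.ensure_future(self._lookup(host))
            self._pending[host] = task
        try:
            addresses = await task
        finally:
            self._pending.pop(host, None)
        self._cache[host] = addresses
        return addresses


async def resolve_all(resolver, hosts):
    """Resolve many hosts concurrently without blocking the event loop."""
    return await asyncio.gather(*(resolver.resolve(h) for h in hosts))
```

Sharing one in-flight lookup per host means that a burst of links to the same site triggers a single resolution, after which every later request is served from the cache.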

(2) Parsing webpages

Webpage content is parsed; the parsing method and the content to be parsed can be configured flexibly through a configuration file. The functions implemented include extracting the webpage feature vector and extracting links.

(3) Extracting feature vectors and removing duplicate webpages

Both functions are integrated into the web crawling system as a near-duplicate webpage processing subsystem, and webpage fingerprints are processed according to the working mode. In online mode, every downloaded webpage must be analyzed and its feature fingerprint extracted; the improved Simhash algorithm is then used for similar-webpage detection to determine the set of webpages in the original webpage library that are near duplicates of the newly downloaded page, so that they can be deleted.

(4) Extracting links from webpages

The link extraction module extracts all links from the content of a webpage and normalizes non-standard links found on the network, for example by completing the canonical name of the link, adding an explicit port number, and converting letter case.
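The normalization operations named above could look roughly like this with Python's standard `urllib.parse`. The function name `normalize_link`, the restriction to HTTP and HTTPS, and the decision to drop fragments are assumptions made for the sketch; IPv6 literal hosts are not handled.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_link(base_url, href):
    """Return a normalized absolute URL, or None for links that are not HTTP(S)."""
    absolute = urljoin(base_url, href)        # complete relative links
    parts = urlsplit(absolute)
    scheme = parts.scheme.lower()
    if scheme not in DEFAULT_PORTS or not parts.hostname:
        return None
    host = parts.hostname                     # urlsplit lower-cases the host name
    port = parts.port or DEFAULT_PORTS[scheme]  # make the port explicit
    path = parts.path or "/"
    # The fragment is dropped because it does not identify a different page.
    return urlunsplit((scheme, f"{host}:{port}", path, parts.query, ""))
```

For example, `normalize_link("HTTP://Example.COM/a/b.html", "../c?x=1#frag")` yields `http://example.com:80/c?x=1`, so differently written links to the same page compare equal before fetching.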

(5) Link filtering and Robots control

Link filtering and Robots control operate on links and cover the following aspects. First, some websites use robots.txt to give advisory control over crawler behavior, and a web crawling system should strictly observe this protocol. Second, this module allows the links to be filtered to be predefined, mainly by means of regular expressions. Third, to deal with crawler traps, this module should define suitable rules, for example limiting the depth of crawled links and filtering links produced by URL rewriting. Finally, because the web crawling system is parallelized, each node must hash every extracted link to determine which host is responsible for fetching it.
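A minimal sketch of these four checks, using the standard `urllib.robotparser` and `re` modules, is given below. It is simplified under stated assumptions: one filter instance holds the robots.txt rules of a single site, the blocked patterns and depth limit are supplied by the caller, and `owner_node` assigns links to crawler nodes by hashing the host name with MD5, which the text does not specify. All class, function, and parameter names are hypothetical.

```python
import hashlib
import re
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class LinkFilter:
    """Applies robots.txt rules, regex filters, and a depth limit to links."""

    def __init__(self, robots_lines, blocked_patterns, max_depth, user_agent="*"):
        self._robots = RobotFileParser()
        self._robots.parse(robots_lines)   # lines of an already downloaded robots.txt
        self._blocked = [re.compile(p) for p in blocked_patterns]
        self._max_depth = max_depth
        self._agent = user_agent

    def allows(self, url, depth):
        if depth > self._max_depth:        # guard against crawler traps
            return False
        if any(p.search(url) for p in self._blocked):
            return False
        return self._robots.can_fetch(self._agent, url)


def owner_node(url, node_count):
    """Map a link to a crawler node by hashing its host (assumed scheme)."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % node_count
```

Hashing the host rather than the full URL keeps all links of one site on the same node, which also makes per-host politeness limits easier to enforce.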

(6) Queue load monitoring

The control strategy has two levels: links to be crawled are first divided by priority and then divided by the host they access, which limits frequent access to any single host.
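The two-level strategy can be sketched with a priority heap plus a per-host access timestamp. The details are assumptions: lower numbers mean higher priority, `min_interval` is a hypothetical minimum delay between two accesses to the same host, and the caller passes the current time explicitly so that the behavior is easy to follow.

```python
import heapq
import itertools
from urllib.parse import urlsplit


class CrawlFrontier:
    """Priority queue of links that also rate-limits access per host."""

    def __init__(self, min_interval=1.0):
        self._min_interval = min_interval
        self._heap = []                    # entries: (priority, sequence, url)
        self._sequence = itertools.count() # keeps insertion order among equal priorities
        self._next_allowed = {}            # host -> earliest time of the next access

    def push(self, url, priority):
        heapq.heappush(self._heap, (priority, next(self._sequence), url))

    def pop(self, now):
        """Return the best link whose host may be accessed at time `now`, or None."""
        deferred = []
        chosen = None
        while self._heap:
            entry = heapq.heappop(self._heap)
            host = urlsplit(entry[2]).hostname
            if self._next_allowed.get(host, 0.0) <= now:
                self._next_allowed[host] = now + self._min_interval
                chosen = entry[2]
                break
            deferred.append(entry)         # host was accessed too recently
        for entry in deferred:
            heapq.heappush(self._heap, entry)
        return chosen
```

A link to a busy host is skipped rather than dropped: it goes back on the heap and is returned once the host's waiting interval has passed.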

The technical solution of the present invention is further illustrated below with specific test data. The experimental environment is as follows: 1 machine serves as the NameNode and 24 machines serve as DataNodes. The hardware configuration of each node is: Lenovo Yangtian T4600v; CPU: Intel Celeron E3200 (2.6 GHz); memory: 2 GB; hard disk: 250 GB, 7200 rpm. The operating system is RedHat 5.3, the Java environment is JDK 1.6, and the Hadoop version is 0.20.2. Each machine is configured with 4 slots and processes 4 map/reduce tasks simultaneously.

The experiment verifies the effectiveness of the improved SimHash-based duplicate-removal algorithm mainly by comparison with the original algorithm. About 1.5 million webpages were crawled as the experimental data set. To simulate and test the duplicate-removal process, 5,000 mutually non-duplicate webpage texts were selected at random according to text length, and a small number of random insertions, deletions, and modifications were applied to them to generate 10,000 near-duplicate webpage texts, which were inserted into the data set. Finally, a text file of about 5 GB storing the relevant information of the webpage texts was extracted. The data set was processed with Manku's algorithm and with the improved algorithm; the results are shown in Fig. 4.

As can be seen from the figure, the results differ mainly in the identification of shorter webpages: the recall of the improved algorithm is clearly better than that of the traditional Manku algorithm, chiefly for short texts.

To speed up the algorithm, Manku proposes the following: (1) split the 64-bit fingerprint into blocks; (2) permute and replicate the fingerprints, trading redundant space for reduced time. The improved algorithm instead reduces the comparison over the global range to a comparison over a local range, trading a small loss of precision for lower time and space complexity. In the improved algorithm, what was originally a single Job is decomposed into 5 Jobs; the computation time shown in Fig. 5 is the total time for the algorithm to complete. As can be seen from the figure, the running time of the improved algorithm remains better than that of the traditional algorithm as the file set grows.
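The block idea behind Manku's speed-up can be illustrated with a simplified single-machine sketch. If a 64-bit fingerprint is split into 4 blocks of 16 bits and two fingerprints differ in at most 3 bits, at least one block must be identical, so indexing every fingerprint under each of its blocks yields a small candidate set that is then checked exactly. This is not Manku's permuted-table layout and not the 5-Job Hadoop decomposition mentioned above; the block count, the class name `FingerprintIndex`, and the distance limit are assumptions.

```python
from collections import defaultdict

BLOCKS = 4
BLOCK_BITS = 16
MASK = (1 << BLOCK_BITS) - 1


def block_keys(fp):
    """Split a 64-bit fingerprint into (block index, block value) keys."""
    return [(i, (fp >> (i * BLOCK_BITS)) & MASK) for i in range(BLOCKS)]


class FingerprintIndex:
    """Finds stored fingerprints within a small Hamming distance of a query."""

    def __init__(self, max_distance=3):
        if max_distance >= BLOCKS:
            raise ValueError("max_distance must be smaller than the block count")
        self._max_distance = max_distance
        self._tables = defaultdict(set)    # block key -> fingerprints with that block

    def add(self, fp):
        # Each fingerprint is stored once per block: space traded for time.
        for key in block_keys(fp):
            self._tables[key].add(fp)

    def near(self, fp):
        candidates = set()
        for key in block_keys(fp):
            candidates |= self._tables.get(key, set())
        # Exact check only on the candidates that share at least one block.
        return {
            c for c in candidates
            if bin(c ^ fp).count("1") <= self._max_distance
        }
```

Because the candidate lookup depends only on block keys, the keys also make natural partitioning keys if the comparison is distributed across workers, although how the patent's Jobs partition the work is not stated in the text.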

The running time of the algorithm was also compared on data sets of different sizes to verify whether the algorithm scales well. The experiment compared the computation time for data sets of 300,000 to 1,500,000 webpages. As the figure shows, the time cost grows roughly linearly, so the improved algorithm scales well in a parallel computing environment.

Claims

1. A method for eliminating near-duplicate webpages based on a cloud platform, characterized in that:

The method comprises the following steps:

(1) preprocessing the webpage and extracting its body text;

(2) extracting feature items from the body text to characterize its content;

2. The method for eliminating near-duplicate webpages based on a cloud platform according to claim 1, characterized in that: steps (1), (2), and (3) preprocess a given webpage and represent its body text as a set of block fingerprints, and the algorithm for this stage is as follows:

Step 2: take the next natural block of the body text to be processed and split it into sentences, using punctuation marks as separators; from these sentences, extract the long sentences of at least k words as the feature items of the block, and compute a weight for each extracted long sentence from the topic semantics of the webpage and the characteristics of the long sentence itself;

3. The method for eliminating near-duplicate webpages based on a cloud platform according to claim 1, characterized in that: step (4) computes the similarity of the block fingerprint sets already obtained and uses it to judge whether the corresponding texts are near duplicates, as follows:

Step 1: set the near-duplicate threshold to r;