CN103559259A - Method for eliminating similar-duplicate webpage on the basis of cloud platform - Google Patents
Method for eliminating similar-duplicate webpage on the basis of cloud platform
- Publication number
- CN103559259A (application number CN201310537406.9)
- Authority
- CN
- China
- Prior art keywords
- webpage
- piecemeal
- fingerprint
- text
- web page
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a method for eliminating near-duplicate web pages on the basis of a cloud platform. The method comprises the following steps: preprocessing a web page and extracting its body text; extracting feature items from the text to represent its content; computing fingerprints of the feature items, compressing them or reducing their dimensionality so that they are convenient to store and search; and computing a similarity from the feature fingerprints to judge whether the original web pages are near-duplicates. The method minimizes the number of near-duplicate pages that go undetected and supports similarity computation across pages with different structures.
Description
Technical field
The present invention relates to a method for eliminating near-duplicate web pages based on a cloud platform.
Background technology
People who use search engines, blogs, or RSS readers to follow news often run into information overload and repetition: after an event occurs, many web pages carry essentially the same content and the same viewpoint. With so much repeated information, users spend a great deal of time reading material they have already seen.
Take Google as an example. The Googlebot spider crawls about 20 billion web pages every day, and in total it tracks about 30 billion unique URLs. With such a huge volume of data, web page quality is inevitably uneven, the results returned for a query contain a large amount of repetition, and users often cannot find the information they need. Current search engines do not solve this problem well. For example, when the query "iPad mini released with Apple A5 dual-core, no Retina" is entered in Google, 9 of the 10 entries on the first page of results have repeated content, and the first page is exactly what most people look at.
The biggest problems in search come from three directions: search quality, the search user experience, and the search ecosystem as a whole. Search quality is the key to competition among search engines, and large numbers of duplicate pages are fatal to it. Duplicate pages waste both crawl time and storage space. In particular, when the index is built, every duplicate page must be indexed, which bloats the inverted file and slows the response to queries. If these duplicate pages can be found and removed from the page database, storage space is saved, the freed space can hold more useful page content and support incremental crawling, and retrieval quality improves as well. The problem to be solved is therefore how to remove duplicate pages efficiently and accurately so as to improve retrieval efficiency and the user's search experience.
With the big data era quietly arriving, commercial value should be mined from the processing of data and information. The large number of duplicated pages on the network burdens Internet applications such as search engines with wasted resources, heavier indexes, and reduced service effectiveness. Removing duplicate pages effectively and accurately is of great significance for saving bandwidth during mining, speeding up mining, and collecting timely resources.
Summary of the invention
The technical problem to be solved by the present invention is to provide a cloud-platform-based method for eliminating near-duplicate web pages that reduces near-duplicate pages to the greatest possible extent.
To solve the above technical problem, the invention provides a cloud-platform-based method for eliminating near-duplicate web pages, characterized in that:
The method comprises the following steps:
(1) preprocess the web page and extract its body text;
(2) extract feature items from the text to characterize the body content;
(3) compute fingerprints of the feature items, compressing them or reducing their dimensionality to facilitate storage and retrieval;
(4) compute a similarity from the feature fingerprints and judge whether the original web pages are near-duplicates.
Steps (1), (2), and (3) above preprocess a given web page and represent its body text as a set of block fingerprints. The algorithm for this stage is as follows:
Step ①: Preprocess the web page, remove page noise, and extract the body text;
The body text is extracted with a main-content extraction algorithm based on a weighted DOM tree.
Step ②: For each natural block (paragraph) of the body text still to be processed, split the block into sentences using punctuation marks such as ",", ".", "?", and "!" as separators. From these sentences, extract the long sentences containing no fewer than k characters as the feature items of the block. For each extracted long sentence, compute its weight from the topic semantics of the page and the characteristics of the sentence itself;
Step ③: Compute the block fingerprint from the block's weighted long sentences, and add it to the block fingerprint set of the text;
Step ④: If another block remains, return to step ② and continue long-sentence extraction and fingerprint computation on it; if the end of the text has been reached, this stage ends and the block fingerprint set of the full text is obtained.
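Step ② can be illustrated with a minimal sketch. The separator set below adds full-width punctuation to the marks named in the text, and the function name `long_sentences` is hypothetical; both are assumptions rather than part of the disclosed method.

```python
import re

# Separators named in the text, plus full-width forms (an assumption).
_SEPARATORS = re.compile(r"[,.?!，。？！]")


def long_sentences(block: str, k: int) -> list[str]:
    """Split one block on punctuation and keep sentences of at least k characters."""
    parts = (p.strip() for p in _SEPARATORS.split(block))
    return [p for p in parts if len(p) >= k]


block = ("Short one. This sentence is clearly long enough, ok? "
         "Another fairly long sentence here!")
print(long_sentences(block, k=20))
```

Short fragments such as "ok" are dropped, so small edits to boilerplate phrases do not change the feature items of a block.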
Step (4) above computes the similarity of the block fingerprint sets obtained and judges from it whether the corresponding texts are near-duplicates. The steps are as follows:
Step ①: Let the near-duplicate threshold be r;
Step ②: Compute the similarity; if similarity > r, the pages are judged to be near-duplicates; otherwise they are not.
Accurate body-text extraction is an important prerequisite for near-duplicate detection. The method integrates well with existing web crawling systems; experiments verify that it achieves high validity and accuracy while significantly improving processing efficiency.
Compared with the prior art, the present invention has the following advantages:
(1) For similar-page deduplication, page segmentation and topic-information extraction techniques are used to extract the feature vector of a page. Page segmentation merges content nodes of the DOM syntax tree into coarse-grained page blocks. Topic content blocks are extracted mainly by comparing text similarity, with posterior probabilities computed by the Bayes method to refine the result. This processing removes the page noise that interferes with deduplication and segments the text into terms; extraction of the page feature vector performs markedly better than with previous methods;
(2) The basic principle of the traditional Shingle-based similar-page deduplication method is analyzed, and on that basis an improved algorithm built on the map/reduce programming model is proposed. In a distributed runtime environment the improved algorithm increases processing efficiency and scales out well;
(3) Combining the Shingle and Simhash algorithms, the traditional page fingerprint computation is improved. The resulting fingerprints have the property that the more similar the content of two pages, the smaller the Hamming distance between their fingerprints, and the approach strikes a good balance between processing efficiency and accuracy. For the fingerprints so obtained, Manku's fast similar-fingerprint detection algorithm is analyzed; by trading space for time it greatly reduces time consumption;
(4) The working principle of GFS-like distributed computing platforms and the map/reduce model for distributed processing of massive data are introduced, and a deployment scheme for a distributed computing cluster based on the open-source framework Hadoop is proposed. On this platform the fingerprint comparison task is distributed and parallelized: each task-executing unit compares fingerprints using Manku's fast similar-fingerprint detection technique, and the system then merges the results of all units. Once the entries in the original fingerprint database that are similar to the fingerprint being added have been obtained, the corresponding fingerprints in the fingerprint database and the corresponding pages in the page database can be deleted.
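The property named in advantage (3), that similar content yields fingerprints with a small Hamming distance, can be illustrated with a textbook weighted SimHash over word shingles. This is a sketch of the general technique only; the patent does not disclose the details of its improved algorithm, and the helper names, the use of MD5 as the 64-bit hash, and the shingle size are assumptions.

```python
import hashlib
from collections import Counter


def shingles(words, n=3):
    """Return the contiguous n-word shingles of a word list."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def hash64(token: str) -> int:
    """64-bit hash of a token (first 8 bytes of MD5; an arbitrary choice)."""
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")


def simhash(weighted_features: dict[str, float]) -> int:
    """Weighted SimHash: each feature votes +weight or -weight on every bit."""
    totals = [0.0] * 64
    for feature, weight in weighted_features.items():
        h = hash64(feature)
        for bit in range(64):
            totals[bit] += weight if (h >> bit) & 1 else -weight
    fingerprint = 0
    for bit, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


words = "the crawler removes near duplicate pages from the index".split()
features = Counter(shingles(words))  # shingle -> weight (its frequency)
print(hex(simhash(features)))
```

Because every feature contributes to every bit, changing a few shingles usually flips only a few bits, which is what makes a Hamming-distance threshold usable as a similarity test.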
Brief description of the drawings
Fig. 1 is a flowchart of the present invention;
Fig. 2 is the architecture diagram of the web crawling system of the present invention;
Fig. 3 is a flowchart of representing web page body text as a set of block fingerprints according to the present invention;
Fig. 4 is a comparison table of the improved SimHash algorithm of the present invention and the traditional Manku algorithm;
Fig. 5 is a comparison table of the running times of the improved SimHash algorithm of the present invention and the traditional Manku algorithm.
Embodiment
The present invention is described in detail below with reference to the drawings and specific embodiments.
As shown in Fig. 1, the invention provides a cloud-platform-based method for eliminating near-duplicate web pages, characterized in that:
The method comprises the following steps:
(1) preprocess the web page and extract its body text;
(2) extract feature items from the text to characterize the body content;
(3) compute fingerprints of the feature items, compressing them or reducing their dimensionality to facilitate storage and retrieval;
(4) compute a similarity from the feature fingerprints and judge whether the original web pages are near-duplicates.
As shown in Fig. 3, steps (1), (2), and (3) above preprocess the given web pages A and B and represent their body text as sets of block fingerprints. The algorithm for this stage is as follows:
Step ①: Preprocess the web page, remove page noise, and extract the body text;
The body text is extracted with a main-content extraction algorithm based on a weighted DOM tree.
Step ②: For each natural block (paragraph) of the body text still to be processed, split the block into sentences using punctuation marks such as ",", ".", "?", and "!" as separators. From these sentences, extract the long sentences containing no fewer than k characters as the feature items of the block. For each extracted long sentence, compute its weight from the topic semantics of the page and the characteristics of the sentence itself;
Step ③: Compute the block fingerprint from the block's weighted long sentences, and add it to the block fingerprint set of the text;
Step ④: If another block remains, return to step ② and continue long-sentence extraction and fingerprint computation on it; if the end of the text has been reached, this stage ends and the block fingerprint set of the full text is obtained.
Let n be the number of paragraphs of the given web page A; in the figure, S is the block fingerprint set of web page A.
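The loop over the n paragraphs that builds S might look like the sketch below. The text does not say how a long sentence is weighted or how the weighted sentences are turned into a fingerprint, so both are stand-ins here: `sentence_weight` (length scaled by overlap with title terms) and the choice of hashing the `top_m` highest-weighted sentences are hypothetical.

```python
import hashlib
import re

SEPARATORS = re.compile(r"[,.?!，。？！]")


def sentence_weight(sentence, title_terms):
    """Hypothetical weight: sentence length scaled by title-term overlap."""
    hits = sum(1 for term in title_terms if term in sentence)
    return len(sentence) * (1 + hits)


def block_fingerprint(block, title_terms, k=10, top_m=2):
    """Fingerprint one block from its top_m weighted long sentences."""
    sentences = [s.strip() for s in SEPARATORS.split(block)]
    long_ones = [s for s in sentences if len(s) >= k]
    if not long_ones:
        return None  # block has no feature item
    ranked = sorted(long_ones,
                    key=lambda s: (-sentence_weight(s, title_terms), s))
    key = "\n".join(ranked[:top_m])
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


def fingerprint_set(paragraphs, title_terms, k=10, top_m=2):
    """Build the block fingerprint set S of a page, one paragraph at a time."""
    S = set()
    for block in paragraphs:
        fp = block_fingerprint(block, title_terms, k, top_m)
        if fp is not None:
            S.add(fp)
    return S


a = "The crawler removed duplicate pages quickly. Ok."
b = "Yes! The crawler removed duplicate pages quickly."
print(fingerprint_set([a, b, "Hi."], title_terms=["crawler"]))
```

In this sketch two paragraphs that differ only in their short sentences map to the same block fingerprint, which is the intended effect of using long sentences as feature items.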
Step (4) above computes the similarity of the block fingerprint sets obtained and judges from it whether the corresponding texts are near-duplicates. The steps are as follows:
Let S_A and S_B be the block fingerprint sets of web pages A and B, define their similarity as sim(S_A, S_B), and let the near-duplicate threshold be r:
Step ①: Compute the similarity sim(S_A, S_B) of S_A and S_B.
Step ②: If sim(S_A, S_B) > r, web pages A and B are judged to be near-duplicates; otherwise they are not.
This completes the near-duplicate decision stage.
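The decision stage can be sketched as follows. The text does not give a formula for sim; the Jaccard coefficient of the two fingerprint sets is assumed here as one common choice, and the function names are hypothetical.

```python
def sim(s_a: set, s_b: set) -> float:
    """Jaccard similarity of two block fingerprint sets (an assumed definition)."""
    if not s_a and not s_b:
        return 0.0  # pages without any feature item are not treated as similar
    return len(s_a & s_b) / len(s_a | s_b)


def is_near_duplicate(s_a: set, s_b: set, r: float) -> bool:
    """Pages A and B are near-duplicates when sim(S_A, S_B) > r."""
    return sim(s_a, s_b) > r


print(sim({1, 2, 3}, {2, 3, 4}))                     # 2 shared of 4 distinct
print(is_near_duplicate({1, 2, 3}, {2, 3, 4}, 0.8))
```

With this definition the threshold r lies between 0 and 1, and a higher r demands that a larger share of the blocks be identical.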
Fig. 2 shows the architecture of the web crawling system of the present invention, which comprises the following steps:
(1) Asynchronously crawl original web pages
Downloading pages from the Internet usually involves resolving DNS addresses. Under high concurrency, DNS resolution becomes a major bottleneck if no measures are taken. Resolution is therefore sped up with a multi-threaded concurrent DNS client, and the results of DNS resolution are cached in a DNS cache. In addition, the DNS client should obtain results over the UDP protocol, and the lookup itself should be performed asynchronously and without blocking to improve performance.
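A non-blocking lookup with a result cache might be sketched as below. The class name `DnsCache` and the injectable `resolver` parameter are hypothetical. The default resolver uses `loop.getaddrinfo`, which runs the system resolver in a worker thread; it stands in for the dedicated UDP client described above and is not one itself.

```python
import asyncio
import socket


class DnsCache:
    """Caches resolved addresses and shares one in-flight lookup per host."""

    def __init__(self, resolver=None):
        self._resolver = resolver or self._system_resolve
        self._cache = {}    # host -> IP address string
        self._pending = {}  # host -> lookup currently in flight

    @staticmethod
    async def _system_resolve(host):
        # Runs the blocking system resolver without blocking the event loop.
        loop = asyncio.get_running_loop()
        infos = await loop.getaddrinfo(host, None, type=socket.SOCK_STREAM)
        return infos[0][4][0]

    async def resolve(self, host):
        host = host.lower()
        if host in self._cache:
            return self._cache[host]
        task = self._pending.get(host)
        if task is None:
            # First request for this host: start a single lookup.
            task = asyncio.ensure_future(self._resolver(host))
            self._pending[host] = task
        try:
            address = await task
        finally:
            self._pending.pop(host, None)
        self._cache[host] = address
        return address
```

A crawler coroutine would call `await cache.resolve(host)` before opening a connection; concurrent requests for the same host wait on one lookup instead of issuing several.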
(2) Parse web pages
Web page content is parsed; the parsing method and the content to be parsed can be configured flexibly through a configuration file. The functions implemented include extracting the page feature vector and extracting links.
(3) Extract feature vectors and deduplicate web pages
These two functions are integrated into the web crawling system as a similar-page deduplication subsystem, which processes page fingerprints according to the operating mode. In online mode, every downloaded page is analyzed and its feature fingerprint extracted; the improved Simhash algorithm is then used for similar-page detection to determine the set of pages in the existing page database that are near-duplicates of the newly downloaded page, so that they can be deleted.
(4) Extract links from web pages
The link extraction module extracts all links from the page content and normalizes non-standard links found on the network, for example by completing the canonical name of the link, adding an explicit port number, and converting letter case.
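Link extraction with normalization can be sketched with the Python standard library. The normalization rules below (resolve relative links, lower-case scheme and host, add an explicit port, drop the fragment) are one reading of the examples in the text; `normalize_link` and `LinkExtractor` are hypothetical names, and IPv6 host literals are not handled.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_link(href, base):
    """Return a normalized absolute URL, or None for non-HTTP links."""
    parts = urlsplit(urljoin(base, href))  # complete relative links
    if parts.scheme not in DEFAULT_PORTS or not parts.hostname:
        return None
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    netloc = f"{parts.hostname}:{port}"    # hostname is already lower-cased
    return urlunsplit((parts.scheme, netloc, parts.path or "/", parts.query, ""))


class LinkExtractor(HTMLParser):
    """Collects normalized links from the href attributes of <a> tags."""

    def __init__(self, base):
        super().__init__()
        self.base = base
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        link = normalize_link(href, self.base) if href else None
        if link:
            self.links.append(link)


extractor = LinkExtractor("http://example.com/dir/page.html")
extractor.feed('<a href="other.html">o</a> <a href="HTTP://Example.COM/A#top">a</a>')
print(extractor.links)
```

Normalizing before links enter the queue means that two spellings of the same URL are recognized as one and are not fetched twice.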
(5) Link filtering and robots control
Link filtering and robots control operate on links and cover the following aspects. First, some websites use robots.txt to advise crawlers on how to behave, and a web crawling system should strictly observe this protocol. Second, this module can predefine the links to be filtered, mainly by means of regular expressions. Third, to deal with crawler traps, the module should set suitable rules, such as limiting the depth of crawled links and filtering URL-rewriting links. Finally, because the crawling system runs in parallel, each node must hash every extracted link to determine which host is responsible for fetching it.
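The four aspects above can be combined in one small filter, sketched here under simplifying assumptions: a single site's robots.txt is supplied as lines of text, the crawl depth is tracked by the caller, and the responsible node is chosen by hashing the host name. The class `LinkFilter` and its parameters are hypothetical.

```python
import hashlib
import re
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class LinkFilter:
    def __init__(self, robots_lines, blocked_patterns, max_depth, num_nodes,
                 agent="*"):
        self.robots = RobotFileParser()
        self.robots.parse(robots_lines)          # rules from robots.txt
        self.blocked = [re.compile(p) for p in blocked_patterns]
        self.max_depth = max_depth               # guard against crawler traps
        self.num_nodes = num_nodes               # size of the crawl cluster
        self.agent = agent

    def allowed(self, url, depth):
        """Apply the depth limit, the regex filters, then robots.txt."""
        if depth > self.max_depth:
            return False
        if any(p.search(url) for p in self.blocked):
            return False
        return self.robots.can_fetch(self.agent, url)

    def node_for(self, url):
        """Hash the host so that every link of a host goes to one node."""
        host = urlsplit(url).hostname or ""
        digest = hashlib.md5(host.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % self.num_nodes


link_filter = LinkFilter(
    robots_lines=["User-agent: *", "Disallow: /private/"],
    blocked_patterns=[r"[?&]sessionid="],        # example URL-rewriting filter
    max_depth=3,
    num_nodes=4,
)
print(link_filter.allowed("http://example.com/index.html", depth=1))
```

Hashing the host rather than the full URL keeps all links of one site on the same node, which makes per-host rate limiting a local decision.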
(6) Queue load monitoring
The control strategy has two levels: links to be crawled are first divided by priority and then divided by the host they access, which limits frequent access to any single host.
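A two-level queue of this kind can be sketched as a priority heap combined with a per-host record of the earliest permitted next access. The class name `Frontier`, the `min_delay` parameter, and the convention that a lower number means higher priority are assumptions.

```python
import heapq
import itertools
from urllib.parse import urlsplit


class Frontier:
    def __init__(self, min_delay):
        self.min_delay = min_delay       # seconds between hits on one host
        self._heap = []                  # (priority, insertion order, url)
        self._order = itertools.count()  # tie-breaker keeps FIFO order
        self._next_ok = {}               # host -> earliest next access time

    def push(self, url, priority):
        heapq.heappush(self._heap, (priority, next(self._order), url))

    def pop(self, now):
        """Return the best URL whose host may be accessed at time `now`."""
        deferred, chosen = [], None
        while self._heap:
            item = heapq.heappop(self._heap)
            host = urlsplit(item[2]).hostname
            if self._next_ok.get(host, 0.0) <= now:
                self._next_ok[host] = now + self.min_delay
                chosen = item[2]
                break
            deferred.append(item)        # host is still cooling down
        for item in deferred:
            heapq.heappush(self._heap, item)
        return chosen


frontier = Frontier(min_delay=10.0)
frontier.push("http://a.example/1", priority=0)
frontier.push("http://a.example/2", priority=0)
frontier.push("http://b.example/1", priority=1)
print(frontier.pop(now=0.0))
```

The time is passed in explicitly so that the scheduling logic can be exercised without waiting; a real crawler would pass `time.monotonic()`.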
The technical scheme of the present invention is further explained below using specific test data as an example. The experimental environment is as follows: 1 machine serves as the NameNode and 24 machines serve as DataNodes. Each node has the following hardware configuration: Lenovo Yangtian T4600v; CPU: Intel Celeron E3200 (2.6 GHz); memory: 2 GB; hard disk: 250 GB, 7200 rpm. The operating system is RedHat 5.3, the Java environment is JDK 1.6, and the Hadoop version is 0.20.2. Each machine is configured with 4 slots and processes 4 map/reduce tasks simultaneously.
The case verifies the effectiveness of the improved SimHash-based deduplication algorithm mainly through a comparative experiment against the original algorithm. About 1.5 million web pages were crawled as the experimental data set. To simulate the deduplication process, 5000 mutually non-duplicate page texts were selected at random according to text length; a small number of random insertions, deletions, and modifications were applied to these texts to generate 10000 near-duplicate page texts, which were inserted into the data set. Finally, a text file of about 5 GB storing the relevant information of the page texts was extracted. The data set was processed with Manku's algorithm and with the improved algorithm; the results are shown in Fig. 4.
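The generation of near-duplicate test texts by random small edits could be reproduced along the following lines. The text does not state the number of edits per text, the edit granularity, or the random source, so the character-level edits, the `n_edits` parameter, and the function name `perturb` are assumptions.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def perturb(text: str, n_edits: int, rng: random.Random) -> str:
    """Apply n_edits random character insertions, deletions, or modifications."""
    chars = list(text)
    for _ in range(n_edits):
        op = rng.choice(("insert", "delete", "modify"))
        if op == "insert" or not chars:
            chars.insert(rng.randrange(len(chars) + 1), rng.choice(ALPHABET))
        elif op == "delete":
            del chars[rng.randrange(len(chars))]
        else:
            chars[rng.randrange(len(chars))] = rng.choice(ALPHABET)
    return "".join(chars)


source = "near duplicate detection on a cloud platform"
rng = random.Random(7)  # fixed seed so the generated set is reproducible
variants = [perturb(source, 3, rng) for _ in range(2)]
print(variants)
```

Keeping the source of each variant gives the ground truth needed to compute recall for the algorithms being compared.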
As the figure shows, the results differ mainly in the recognition of shorter pages: the recall of the improved algorithm is clearly better than that of the traditional Manku algorithm, chiefly on short texts.
To speed up the algorithm, Manku proposes the following: (1) divide the 64-bit fingerprint into blocks; (2) permute and expand the fingerprints, trading space redundancy for time. The improved algorithm instead reduces the comparison over the global range to a comparison within a local range, trading a small loss of precision for lower time and space complexity. In the improved algorithm, what was originally one Job is decomposed into 5 Jobs; the computation time in Fig. 5 is the total time for the algorithm to complete. As the figure shows, as the file set grows, the running time of the improved algorithm also remains better than that of the traditional algorithm.
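The block idea in (1) rests on a pigeonhole argument: if two 64-bit fingerprints differ in at most d bits and each is cut into d + 1 blocks, at least one block must be identical in both. The sketch below indexes fingerprints by exact block value and then verifies candidates by Hamming distance. It is a simplified hash-table variant for illustration, not the permuted sorted tables of Manku's method and not the improved algorithm of this document; `BlockIndex` is a hypothetical name.

```python
from collections import defaultdict


class BlockIndex:
    def __init__(self, max_distance=3, bits=64):
        self.d = max_distance
        self.n_blocks = max_distance + 1        # pigeonhole: d + 1 blocks
        self.width = bits // self.n_blocks      # 16 bits per block for d = 3
        # One table per block position: block value -> fingerprints.
        self.tables = [defaultdict(set) for _ in range(self.n_blocks)]

    def _blocks(self, fp):
        mask = (1 << self.width) - 1
        return [(fp >> (i * self.width)) & mask for i in range(self.n_blocks)]

    def add(self, fp):
        """Store the fingerprint once per block (space traded for time)."""
        for table, block in zip(self.tables, self._blocks(fp)):
            table[block].add(fp)

    def query(self, fp):
        """Return stored fingerprints within Hamming distance d of fp."""
        candidates = set()
        for table, block in zip(self.tables, self._blocks(fp)):
            candidates |= table.get(block, set())
        return {c for c in candidates if bin(c ^ fp).count("1") <= self.d}


index = BlockIndex(max_distance=3)
base = 0x0123456789ABCDEF
near = base ^ 0b111                        # 3 bits flipped
index.add(near)
print(index.query(base) == {near})
```

Only fingerprints that share at least one block with the query are compared bit by bit, so most of the stored fingerprints are never touched.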
Running times were compared at different data set scales to verify whether the algorithm scales well. The experiment compared computation times for data sets of 300,000 to 1,500,000 web pages. As the figure shows, time consumption grows roughly linearly, so the improved algorithm is highly scalable in a parallel computing environment.
Claims (3)
1. A cloud-platform-based method for eliminating near-duplicate web pages, characterized in that:
The method comprises the following steps:
(1) preprocess the web page and extract its body text;
(2) extract feature items from the text to characterize the body content;
(3) compute fingerprints of the feature items, compressing them or reducing their dimensionality to facilitate storage and retrieval;
(4) compute a similarity from the feature fingerprints and judge whether the original web pages are near-duplicates.
2. The cloud-platform-based method for eliminating near-duplicate web pages according to claim 1, characterized in that: said steps (1), (2), and (3) preprocess a given web page and represent its body text as a set of block fingerprints, and the algorithm for this stage is as follows:
Step ①: Preprocess the web page, remove page noise, and extract the body text;
Step ②: For each natural block of the body text still to be processed, split the block into sentences using punctuation marks as separators; from these sentences, extract the long sentences containing no fewer than k characters as the feature items of the block; for each extracted long sentence, compute its weight from the topic semantics of the page and the characteristics of the sentence itself;
Step ③: Compute the block fingerprint from the block's weighted long sentences, and add it to the block fingerprint set of the text;
Step ④: If another block remains, return to step ② and continue long-sentence extraction and fingerprint computation on it; if the end of the text has been reached, this stage ends and the block fingerprint set of the full text is obtained.
3. The cloud-platform-based method for eliminating near-duplicate web pages according to claim 1, characterized in that: said step (4) computes the similarity of the block fingerprint sets obtained and judges from it whether the corresponding texts are near-duplicates, with the following steps:
Step ①: Let the near-duplicate threshold be r;
Step ②: Compute the similarity; if similarity > r, the pages are judged to be near-duplicates; otherwise they are not.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310537406.9A CN103559259A (en) | 2013-11-04 | 2013-11-04 | Method for eliminating similar-duplicate webpage on the basis of cloud platform |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310537406.9A CN103559259A (en) | 2013-11-04 | 2013-11-04 | Method for eliminating similar-duplicate webpage on the basis of cloud platform |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103559259A true CN103559259A (en) | 2014-02-05 |
Family
ID=50013506
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310537406.9A Pending CN103559259A (en) | 2013-11-04 | 2013-11-04 | Method for eliminating similar-duplicate webpage on the basis of cloud platform |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103559259A (en) |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103955529A (en) * | 2014-05-12 | 2014-07-30 | 中国科学院计算机网络信息中心 | Internet information searching and aggregating presentation method |
CN104079559A (en) * | 2014-06-05 | 2014-10-01 | 腾讯科技(深圳)有限公司 | Web address security detecting method and device and server |
CN104615714A (en) * | 2015-02-05 | 2015-05-13 | 北京中搜网络技术股份有限公司 | Blog duplicate removal method based on text similarities and microblog channel features |
CN105630802A (en) * | 2014-10-30 | 2016-06-01 | 阿里巴巴集团控股有限公司 | Webpage duplication removal method and apparatus |
CN105653668A (en) * | 2015-12-29 | 2016-06-08 | 武汉理工大学 | Webpage content analysis and extraction optimization method based on DOM Tree in cloud environment |
CN106302202A (en) * | 2015-05-15 | 2017-01-04 | 阿里巴巴集团控股有限公司 | Data current-limiting method and device |
CN106407195A (en) * | 2015-07-28 | 2017-02-15 | 北京京东尚科信息技术有限公司 | Method and system for eliminating duplication of webpage |
CN106528508A (en) * | 2016-10-27 | 2017-03-22 | 乐视控股(北京)有限公司 | Repeated text judgment method and apparatus |
CN106897271A (en) * | 2017-03-17 | 2017-06-27 | 北京搜狐新媒体信息技术有限公司 | Body noise removal method and system |
CN108416221A (en) * | 2018-01-22 | 2018-08-17 | 西安电子科技大学 | Safe set of metadata of similar data possesses proof scheme in a kind of cloud environment |
CN109002517A (en) * | 2018-07-06 | 2018-12-14 | 佛山市灏金赢科技有限公司 | A kind of webpage content display method and system |
CN109189824A (en) * | 2018-08-10 | 2019-01-11 | 阿里巴巴集团控股有限公司 | A kind of method and device for retrieving similar article |
CN110322692A (en) * | 2019-07-09 | 2019-10-11 | 广东工业大学 | A kind of detection method, device and equipment repeating traffic flow data |
CN110442679A (en) * | 2019-08-01 | 2019-11-12 | 信雅达系统工程股份有限公司 | A kind of text De-weight method based on Fusion Model algorithm |
CN112307303A (en) * | 2020-10-29 | 2021-02-02 | 扆亮海 | Efficient and accurate network page duplicate removal system based on cloud computing |
CN113408660A (en) * | 2021-07-15 | 2021-09-17 | 北京百度网讯科技有限公司 | Book clustering method, device, equipment and storage medium |
CN119203976A (en) * | 2024-11-25 | 2024-12-27 | 国家计算机网络与信息安全管理中心江苏分中心 | A web page deduplication method based on structure perception |
CN119249020A (en) * | 2024-09-19 | 2025-01-03 | 广州盈风网络科技有限公司 | Artificial intelligence-based website map generation method, system, device and medium |
CN119249020B (en) * | 2024-09-19 | 2025-09-16 | 广州盈风网络科技有限公司 | Sitemap generation method, system, equipment and medium based on artificial intelligence |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080235163A1 (en) * | 2007-03-22 | 2008-09-25 | Srinivasan Balasubramanian | System and method for online duplicate detection and elimination in a web crawler |
CN101645082A (en) * | 2009-04-17 | 2010-02-10 | 华中科技大学 | Similar web page duplicate-removing system based on parallel programming mode |
CN102024065A (en) * | 2011-01-18 | 2011-04-20 | 中南大学 | SIMD optimization-based webpage duplication elimination and concurrency method |
CN102375813A (en) * | 2010-08-09 | 2012-03-14 | 腾讯科技(深圳)有限公司 | Duplicate detection system and method for search engines |
US8140505B1 (en) * | 2005-03-31 | 2012-03-20 | Google Inc. | Near-duplicate document detection for web crawling |
-
2013
- 2013-11-04 CN CN201310537406.9A patent/CN103559259A/en active Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8140505B1 (en) * | 2005-03-31 | 2012-03-20 | Google Inc. | Near-duplicate document detection for web crawling |
US20080235163A1 (en) * | 2007-03-22 | 2008-09-25 | Srinivasan Balasubramanian | System and method for online duplicate detection and elimination in a web crawler |
CN101645082A (en) * | 2009-04-17 | 2010-02-10 | 华中科技大学 | Similar web page duplicate-removing system based on parallel programming mode |
CN102375813A (en) * | 2010-08-09 | 2012-03-14 | 腾讯科技(深圳)有限公司 | Duplicate detection system and method for search engines |
CN102024065A (en) * | 2011-01-18 | 2011-04-20 | 中南大学 | SIMD optimization-based webpage duplication elimination and concurrency method |
Non-Patent Citations (3)
Title |
---|
GURMEET SINGH MANKU 等: "Detecting Near-Duplicates for Web Crawling", 《WWW2007/TRACK:DATA MINING》 * |
徐娜 等: "基于Bloom Filter的网页去重算法", 《微型电脑应用》 * |
黄仁 等: "基于正文结构和长句提取的网页去重算法", 《计算机应用研究》 * |
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103955529A (en) * | 2014-05-12 | 2014-07-30 | 中国科学院计算机网络信息中心 | Internet information searching and aggregating presentation method |
CN104079559B (en) * | 2014-06-05 | 2017-07-25 | 腾讯科技(深圳)有限公司 | A kind of website safety detection method, device and server |
CN104079559A (en) * | 2014-06-05 | 2014-10-01 | 腾讯科技(深圳)有限公司 | Web address security detecting method and device and server |
CN105630802A (en) * | 2014-10-30 | 2016-06-01 | 阿里巴巴集团控股有限公司 | Webpage duplication removal method and apparatus |
US10691769B2 (en) | 2014-10-30 | 2020-06-23 | Alibaba Group Holding Limited | Methods and apparatus for removing a duplicated web page |
CN104615714A (en) * | 2015-02-05 | 2015-05-13 | 北京中搜网络技术股份有限公司 | Blog duplicate removal method based on text similarities and microblog channel features |
CN106302202A (en) * | 2015-05-15 | 2017-01-04 | 阿里巴巴集团控股有限公司 | Data current-limiting method and device |
CN106302202B (en) * | 2015-05-15 | 2020-07-28 | 阿里巴巴集团控股有限公司 | Data current limiting method and device |
CN106407195A (en) * | 2015-07-28 | 2017-02-15 | 北京京东尚科信息技术有限公司 | Method and system for eliminating duplication of webpage |
CN105653668A (en) * | 2015-12-29 | 2016-06-08 | 武汉理工大学 | Webpage content analysis and extraction optimization method based on DOM Tree in cloud environment |
CN106528508A (en) * | 2016-10-27 | 2017-03-22 | 乐视控股(北京)有限公司 | Repeated text judgment method and apparatus |
CN106897271A (en) * | 2017-03-17 | 2017-06-27 | 北京搜狐新媒体信息技术有限公司 | Body noise removal method and system |
CN106897271B (en) * | 2017-03-17 | 2020-05-22 | 北京搜狐新媒体信息技术有限公司 | News text noise removal method and system |
CN108416221A (en) * | 2018-01-22 | 2018-08-17 | 西安电子科技大学 | Safe set of metadata of similar data possesses proof scheme in a kind of cloud environment |
CN109002517A (en) * | 2018-07-06 | 2018-12-14 | 佛山市灏金赢科技有限公司 | A kind of webpage content display method and system |
CN109189824A (en) * | 2018-08-10 | 2019-01-11 | 阿里巴巴集团控股有限公司 | A kind of method and device for retrieving similar article |
CN109189824B (en) * | 2018-08-10 | 2022-04-26 | 创新先进技术有限公司 | Method and device for retrieving similar articles |
CN110322692A (en) * | 2019-07-09 | 2019-10-11 | 广东工业大学 | A kind of detection method, device and equipment repeating traffic flow data |
CN110322692B (en) * | 2019-07-09 | 2020-10-23 | 广东工业大学 | Method, device and equipment for detecting repeated traffic flow data |
CN110442679A (en) * | 2019-08-01 | 2019-11-12 | 信雅达系统工程股份有限公司 | A kind of text De-weight method based on Fusion Model algorithm |
CN112307303A (en) * | 2020-10-29 | 2021-02-02 | 扆亮海 | Efficient and accurate network page duplicate removal system based on cloud computing |
CN113408660A (en) * | 2021-07-15 | 2021-09-17 | 北京百度网讯科技有限公司 | Book clustering method, device, equipment and storage medium |
CN113408660B (en) * | 2021-07-15 | 2024-05-24 | 北京百度网讯科技有限公司 | Book clustering method, device, equipment and storage medium |
CN119249020A (en) * | 2024-09-19 | 2025-01-03 | 广州盈风网络科技有限公司 | Artificial intelligence-based website map generation method, system, device and medium |
CN119249020B (en) * | 2024-09-19 | 2025-09-16 | 广州盈风网络科技有限公司 | Sitemap generation method, system, equipment and medium based on artificial intelligence |
CN119203976A (en) * | 2024-11-25 | 2024-12-27 | 国家计算机网络与信息安全管理中心江苏分中心 | A web page deduplication method based on structure perception |
CN119203976B (en) * | 2024-11-25 | 2025-02-11 | 国家计算机网络与信息安全管理中心江苏分中心 | A web page deduplication method based on structure perception |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103559259A (en) | Method for eliminating similar-duplicate webpage on the basis of cloud platform | |
CN106844640B (en) | Webpage data analysis processing method | |
US10579661B2 (en) | System and method for machine learning and classifying data | |
CN110543595B (en) | In-station searching system and method | |
CN108132929A (en) | A kind of similarity calculation method of magnanimity non-structured text | |
CN110309446A (en) | The quick De-weight method of content of text, device, computer equipment and storage medium | |
US20140068768A1 (en) | Apparatus and Method for Identifying Related Code Variants in Binaries | |
JP2009104591A (en) | Web document clustering method and system | |
CN103544255A (en) | Text semantic relativity based network public opinion information analysis method | |
CN110929145A (en) | Public opinion analysis method, public opinion analysis device, computer device and storage medium | |
CN106815307A (en) | Public Culture knowledge mapping platform and its use method | |
CN105389349A (en) | Dictionary update method and device | |
CN105653668A (en) | Webpage content analysis and extraction optimization method based on DOM Tree in cloud environment | |
CN108363686A (en) | A kind of character string segmenting method, device, terminal device and storage medium | |
CN112307303A (en) | Efficient and accurate network page duplicate removal system based on cloud computing | |
CN110532389B (en) | Text clustering method and device and computing equipment | |
CN105550359B (en) | Webpage sorting method and device based on vertical search and server | |
CN111444411A (en) | Network data increment acquisition method, device, equipment and storage medium | |
CN110008419A (en) | Removing duplicate webpages method, device and equipment | |
CN103593454A (en) | Mining method and system for microblog text classification | |
CN106649557A (en) | Semantic association mining method for defect report and mail list | |
CN106844550A (en) | Method and device is recommended in a kind of virtual platform operation | |
Mallick et al. | Incremental mining of sequential patterns: Progress and challenges | |
CN106202552A (en) | Data search method based on cloud computing | |
JP6749865B2 (en) | INFORMATION COLLECTION DEVICE AND INFORMATION COLLECTION METHOD |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20140205 |