
CN103559259A - Method for eliminating similar-duplicate webpage on the basis of cloud platform

Info

Publication number
CN103559259A
Authority
CN
China
Prior art keywords
webpage
piecemeal
fingerprint
text
web page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310537406.9A
Other languages
Chinese (zh)
Inventor
向阳
陈佑雄
张依杨
平宇
张波
袁书寒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University
Priority to CN201310537406.9A
Publication of CN103559259A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method for eliminating near-duplicate web pages based on a cloud platform. The method comprises the following steps: preprocessing a web page and extracting its body text; extracting feature items from the text to represent its content; computing fingerprints of the feature items, compressing or reducing the dimensionality of the feature items so that they are convenient to store and search; and judging whether the original web pages are near-duplicates according to the similarity computed from the feature fingerprints. The method minimizes the number of near-duplicate web pages that go undetected and supports similarity computation across different web page structures well.

Description

Method for eliminating near-duplicate web pages based on a cloud platform
Technical field
The present invention relates to a method for eliminating near-duplicate web pages based on a cloud platform.
Background technology
People who use search engines to retrieve news, blogs, or RSS feeds often run into information overload and repetition: after an event occurs, many web pages carry essentially the same content and the same viewpoint. With so much repeated information, users spend a great deal of time reading duplicates.
Take Google as an example. The Googlebot spider crawls about 20 billion web pages every day and, in total, tracks about 30 billion distinct URLs. With such a large volume of data, page quality is inevitably uneven, the results returned for a query contain a great deal of repetition, and users often cannot find the information they need. Current search engines do not solve this problem well. For example, a Google search for the news headline "iPad mini released with Apple A5 dual-core chip and no Retina display" returns a first page on which 9 of the 10 entries are duplicates, and the first page of results is exactly what most users care about.
The biggest problems in search come from three directions: search quality, the search user experience, and the search ecosystem as a whole. Search quality is the key to competition among search engines, and a large number of duplicate pages is fatal to it. Duplicate pages waste both crawl time and storage space. In particular, the duplicates must also be indexed when the index is built, which bloats the inverted file and slows query responses. If these duplicate pages can be found and removed from the web page database, part of the storage space is saved and can be used to store more useful page content and to support incremental collection, and retrieval quality improves as well. The problem to be solved is therefore how to remove duplicate web pages efficiently and accurately, so as to improve retrieval efficiency and the user's retrieval experience.
As the era of big data arrives, commercial value should be mined from the process of handling data and information. The large number of duplicate web pages on the network wastes resources, burdens indexes, and degrades service for Internet applications such as search engines. Removing duplicate pages effectively and accurately is important for saving bandwidth during mining, speeding up mining, and mining timely resources.
Summary of the invention
The technical problem to be solved by the invention is to provide a cloud-platform-based method for eliminating near-duplicate web pages that reduces near-duplicate pages as far as possible.
To solve the above technical problem, the invention provides a method for eliminating near-duplicate web pages based on a cloud platform, characterized in that:
The method comprises the following steps:
(1) preprocess the web page and extract its body text;
(2) extract feature items from the body text to characterize its content;
(3) compute fingerprints of the feature items, compressing or reducing the dimensionality of the feature items to facilitate storage and retrieval;
(4) compute similarity based on the feature fingerprints and judge whether the original web pages are near-duplicates.
Steps (1), (2), and (3) above preprocess a given web page and represent its body text as a set of block fingerprints. The algorithm for this stage is as follows:
Step ①: preprocess the web page, remove page noise, and extract the body text;
The body text is extracted with a main-content extraction algorithm based on a weighted DOM tree.
Step ②: for each natural block of the body text still to be processed, split the block into sentences using punctuation marks such as ",", ".", "?", and "!" as separators; from these sentences, extract the long sentences of at least k characters as the feature items of the block; for each extracted long sentence, compute its weight from the topic semantics of the web page and the characteristics of the long sentence itself;
Step ③: compute the block fingerprint from the weighted long sentences of the block, and add the block fingerprint to the block fingerprint set of the text;
Step ④: if there is another block, return to step ② and carry out long-sentence extraction and fingerprint computation on it; if the full text has been processed, this stage ends and yields the block fingerprint set of the full text.
Step (4) above computes the similarity of the block fingerprint sets obtained and judges from it whether the texts are near-duplicates, as follows:
Step ①: let the near-duplicate threshold be r;
Step ②: compute the similarity; if similarity > r, the web pages are judged to be near-duplicates; otherwise they are not.
Accurate body text extraction is an important prerequisite for near-duplicate detection. The method integrates well with existing web crawling systems; experiments show that it achieves high validity and accuracy, with a significant improvement in processing efficiency.
Compared with the prior art, the invention has the following advantages:
(1) For near-duplicate web page removal, web page blocking and topic information extraction are used to extract the feature vector of a web page. Web page blocking merges content nodes into coarse-grained page blocks, mainly on the basis of the DOM syntax tree. Topic content blocks are extracted mainly by comparing text similarity, with posterior probabilities computed by the Bayes method as an optimization. This processing removes the page noise that interferes with deduplication and segments the text content into terms, and the extraction of web page feature vectors performs noticeably better than earlier methods;
(2) The basic principle of the traditional Shingle-based near-duplicate removal method is analyzed and, on that basis, an improved algorithm based on the MapReduce programming model is proposed. In a distributed environment, the improved algorithm raises processing efficiency and scales out well;
(3) Shingle and Simhash are combined to improve traditional web page fingerprinting. The resulting fingerprint has the property that the more similar the content of two pages, the smaller the Hamming distance between their fingerprints, and it strikes a good balance between processing efficiency and accuracy. For the fingerprints so obtained, Manku's fast similar-fingerprint detection algorithm is analyzed; by trading space for time, it greatly reduces time consumption;
(4) The working principle of GFS-like distributed computing platforms and of MapReduce, a distributed processing model for massive data, is introduced, and a deployment scheme for a distributed computing cluster based on the open-source framework Hadoop is proposed. On this platform, the fingerprint comparison task is distributed and parallelized: each task execution unit compares fingerprints using Manku's fast similar-fingerprint detection technique, and the system then merges the results of all units. Once the entries in the original fingerprint database that are similar to the fingerprint being added have been obtained, the corresponding fingerprints in the fingerprint database and the corresponding pages in the web page database can be deleted.
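The property named in advantage (3), that more similar content gives a smaller Hamming distance, can be illustrated with a minimal sketch. It assumes character shingles of length 4 as features, shingle counts as weights, and an MD5-derived 64-bit hash; none of these choices is stated in the patent, and the function names are hypothetical.

```python
import hashlib
from collections import Counter

def shingles(text: str, k: int = 4) -> Counter:
    """Count the overlapping k-character shingles of text (k = 4 is an assumed value)."""
    text = "".join(text.split())  # drop whitespace so spacing changes do not matter
    return Counter(text[i:i + k] for i in range(max(len(text) - k + 1, 1)))

def hash64(feature: str) -> int:
    """Stable 64-bit hash of a feature (MD5 prefix; the patent names no hash function)."""
    return int.from_bytes(hashlib.md5(feature.encode("utf-8")).digest()[:8], "big")

def simhash(features: Counter, bits: int = 64) -> int:
    """Weighted Simhash: every feature votes +w or -w on each bit; the sign sets the bit."""
    votes = [0] * bits
    for feature, weight in features.items():
        h = hash64(feature)
        for i in range(bits):
            votes[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two pages would then be compared with `hamming(simhash(shingles(a)), simhash(shingles(b)))`; a small edit changes only a few shingles, so most bit votes, and therefore most fingerprint bits, tend to stay the same.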
Brief description of the drawings
Fig. 1 is the flow chart of the invention;
Fig. 2 is the system architecture diagram of the web crawling system of the invention;
Fig. 3 is the flow chart for representing the body text of a web page as a block fingerprint set;
Fig. 4 is a comparison table of the improved SimHash algorithm of the invention and Manku's traditional algorithm;
Fig. 5 is a comparison table of the running times of the improved SimHash algorithm of the invention and Manku's traditional algorithm.
Embodiment
The invention is described in detail below with reference to the drawings and specific embodiments.
As shown in Fig. 1, the invention provides a method for eliminating near-duplicate web pages based on a cloud platform, characterized in that:
The method comprises the following steps:
(1) preprocess the web page and extract its body text;
(2) extract feature items from the body text to characterize its content;
(3) compute fingerprints of the feature items, compressing or reducing the dimensionality of the feature items to facilitate storage and retrieval;
(4) compute similarity based on the feature fingerprints and judge whether the original web pages are near-duplicates.
As shown in Fig. 3, steps (1), (2), and (3) above preprocess the given web pages A and B and represent their body text as sets of block fingerprints. The algorithm for this stage is as follows:
Step ①: preprocess the web page, remove page noise, and extract the body text;
The body text is extracted with a main-content extraction algorithm based on a weighted DOM tree.
Step ②: for each natural block of the body text still to be processed, split the block into sentences using punctuation marks such as ",", ".", "?", and "!" as separators; from these sentences, extract the long sentences of at least k characters as the feature items of the block; for each extracted long sentence, compute its weight from the topic semantics of the web page and the characteristics of the long sentence itself;
Step ③: compute the block fingerprint from the weighted long sentences of the block, and add the block fingerprint to the block fingerprint set of the text;
Step ④: if there is another block, return to step ② and carry out long-sentence extraction and fingerprint computation on it; if the full text has been processed, this stage ends and yields the block fingerprint set of the full text.
Let n be the number of paragraphs of the given web page A; in the figure, S is the block fingerprint set of web page A.
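Steps ② to ④ can be sketched as follows. This is only an illustration under stated assumptions: a natural block is taken to be one line of the extracted body text, the weight is a placeholder (sentence length plus a bonus per topic term) because the patent gives no formula, and the block fingerprint is an MD5-derived 64-bit value over the heaviest long sentences. All function names and the defaults `k=8` and `top=2` are hypothetical.

```python
import hashlib
import re
from typing import FrozenSet, List, Optional, Set

# ASCII and full-width forms of the separators named in step 2.
SEPARATORS = re.compile(r"[,.?!，。？！]")

def long_sentences(block: str, k: int) -> List[str]:
    """Split a block on punctuation and keep the sentences of at least k characters."""
    parts = (s.strip() for s in SEPARATORS.split(block))
    return [s for s in parts if len(s) >= k]

def weight(sentence: str, topic_terms: FrozenSet[str]) -> float:
    """Placeholder weight: sentence length plus a bonus for each topic term it contains."""
    return len(sentence) + 10.0 * sum(term in sentence for term in topic_terms)

def block_fingerprint(block: str, k: int, topic_terms: FrozenSet[str],
                      top: int = 2) -> Optional[int]:
    """Hash the `top` heaviest long sentences of a block; None if it has no long sentence."""
    sentences = long_sentences(block, k)
    if not sentences:
        return None
    chosen = sorted(sentences, key=lambda s: (-weight(s, topic_terms), s))[:top]
    digest = hashlib.md5("\n".join(chosen).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def fingerprint_set(text: str, k: int = 8,
                    topic_terms: FrozenSet[str] = frozenset()) -> Set[int]:
    """Loop over the blocks of the body text and collect one fingerprint per block."""
    fingerprints = set()
    for block in text.split("\n"):
        fp = block_fingerprint(block, k, topic_terms)
        if fp is not None:
            fingerprints.add(fp)
    return fingerprints
```

Because only long sentences feed the fingerprint, edits confined to short fragments of a block leave its fingerprint unchanged in this sketch.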
Step (4) above computes the similarity of the block fingerprint sets obtained and judges from it whether the texts are near-duplicates, as follows:
Let S_a and S_b be the block fingerprint sets of web pages A and B, define their similarity as sim(S_a, S_b), and let the near-duplicate threshold be r:
Step ①: compute the similarity sim(S_a, S_b) of S_a and S_b.
Step ②: if sim(S_a, S_b) > r, web pages A and B are judged to be near-duplicates; otherwise they are not.
This completes the near-duplicate decision stage.
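The text does not define sim(S_a, S_b). A minimal sketch of the decision step, assuming the Jaccard coefficient of the two fingerprint sets as the similarity and a hypothetical default threshold r = 0.6, might look like this:

```python
def sim(s_a: set, s_b: set) -> float:
    """Jaccard similarity of two block fingerprint sets (an assumed definition)."""
    if not s_a and not s_b:
        return 0.0  # pages without any fingerprint are never judged near-duplicates
    return len(s_a & s_b) / len(s_a | s_b)

def is_near_duplicate(s_a: set, s_b: set, r: float = 0.6) -> bool:
    """Pages A and B are near-duplicates only when sim(S_a, S_b) > r (strictly)."""
    return sim(s_a, s_b) > r
```

The strict inequality follows the wording of step ②; a similarity exactly equal to r does not count as a near-duplicate.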
Fig. 2 shows the system architecture of the web crawling system of the invention, which comprises the following steps:
(1) Asynchronously crawl original web pages
Downloading pages from the Internet usually involves resolving DNS addresses. Without countermeasures, DNS resolution becomes a major bottleneck under high concurrency. A multi-threaded concurrent DNS client is therefore used to speed up resolution, and a DNS cache stores the resolution results. In addition, the DNS client should obtain results over UDP, and the lookup itself should be asynchronous and non-blocking to improve performance.
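The combination of concurrent lookups and a DNS cache can be sketched as below. This is a simplified illustration rather than the system's client: it uses a thread pool around the blocking `socket.gethostbyname` call instead of a hand-written asynchronous UDP client, it ignores record TTLs, and the class name `CachingResolver` is hypothetical.

```python
import socket
import threading
from concurrent.futures import ThreadPoolExecutor

class CachingResolver:
    """Resolve host names concurrently and cache the answers (no TTL handling)."""

    def __init__(self, lookup=socket.gethostbyname, workers=16):
        self._lookup = lookup            # injectable, so tests can avoid the network
        self._cache = {}
        self._lock = threading.Lock()
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def resolve(self, host):
        """Return the cached address, or perform one blocking lookup and cache it."""
        with self._lock:
            if host in self._cache:
                return self._cache[host]
        address = self._lookup(host)
        with self._lock:
            self._cache[host] = address
        return address

    def resolve_many(self, hosts):
        """Resolve the distinct hosts in parallel worker threads; returns {host: address}."""
        unique = list(dict.fromkeys(hosts))
        return dict(zip(unique, self._pool.map(self.resolve, unique)))
```

A crawler would call `resolve_many` on the hosts of a batch of URLs before downloading, so that later requests to the same hosts hit the cache.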
(2) Parse web pages
Web page content is parsed; the parsing mode and the content to parse can be configured flexibly through a configuration file. The functions implemented include extracting web page feature vectors and extracting links.
(3) Extract feature vectors and deduplicate web pages
These two functions are integrated into the web crawling system as a near-duplicate web page processing subsystem, which handles web page fingerprints according to the working mode. In online mode, every downloaded page must be analyzed and its feature fingerprint extracted; the improved Simhash algorithm is then used for near-duplicate detection, to identify and delete the set of pages in the existing web page database that are near-duplicates of the newly downloaded page.
(4) Extract the links in web pages
The link extraction module extracts all links from the content of a web page and normalizes non-standard links found on the network, for example by completing the canonical form of the link, adding an explicit port number, and converting letter case.
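A minimal sketch of this kind of link normalization, using only the standard `urllib.parse` module, is shown below. The function name `normalize_link` and the exact rules (resolve relative links, lowercase scheme and host, add the default port, drop the fragment) are assumptions chosen to match the examples in the text; the sketch ignores userinfo and IPv6 literals.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize_link(base_url: str, href: str) -> str:
    """Resolve href against base_url and return a canonical absolute URL."""
    parts = urlsplit(urljoin(base_url, href))
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()          # hostname is already lowercased
    port = parts.port or DEFAULT_PORTS.get(scheme)  # make the port explicit
    netloc = f"{host}:{port}" if port else host
    path = parts.path or "/"                        # an empty path means the root
    return urlunsplit((scheme, netloc, path, parts.query, ""))
```

Normalizing before deduplicating the URL queue keeps links that differ only in case, default port, or fragment from being crawled more than once.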
(5) Link filtering and Robots control
Link filtering and Robots control operate on links and cover the following aspects. First, some websites use robots.txt to advise crawlers on how to behave, and the web crawling system should strictly observe this protocol. Second, the module allows the links to be filtered to be predefined, mainly by means of regular expressions. Third, for crawler traps, the module should define suitable rules, for example limiting the depth of crawled links and filtering links produced by URL rewriting. Finally, because the web crawling system is parallelized, each node hashes every extracted link to determine the host responsible for fetching it.
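The four aspects above can be sketched together as follows. The class `LinkFilter`, the user agent string, and the MD5-based host hash are hypothetical; for brevity the sketch applies a single set of robots.txt rules supplied by the caller, whereas a real crawler would keep one parsed robots.txt per host.

```python
import hashlib
import re
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

class LinkFilter:
    def __init__(self, robots_lines, deny_patterns, max_depth, num_nodes,
                 agent="example-crawler"):
        self.robots = RobotFileParser()
        self.robots.parse(robots_lines)      # rules are passed in; no network access
        self.deny = [re.compile(p) for p in deny_patterns]
        self.max_depth = max_depth
        self.num_nodes = num_nodes
        self.agent = agent

    def allowed(self, url: str, depth: int) -> bool:
        """Apply the depth limit, the regular-expression filters, then robots.txt."""
        if depth > self.max_depth:
            return False
        if any(p.search(url) for p in self.deny):
            return False
        return self.robots.can_fetch(self.agent, url)

    def node_for(self, url: str) -> int:
        """Hash the host name so that all links of one host go to the same crawl node."""
        host = (urlsplit(url).hostname or "").encode("utf-8")
        return int.from_bytes(hashlib.md5(host).digest()[:4], "big") % self.num_nodes
```

Hashing the host rather than the full URL keeps per-host politeness state on a single node, which is one plausible reading of the last aspect.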
(6) Queue load monitoring
The control strategy has two levels: links to be crawled are first divided by priority and then by the host they access, which limits frequent access to any single host.
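One way to picture the two-level strategy is a priority heap combined with a per-host cool-down. The sketch below is an assumption about how the two levels could interact; the class `CrawlQueue`, the `delay` parameter, and the convention that a lower number means higher priority are all hypothetical.

```python
import heapq
from urllib.parse import urlsplit

class CrawlQueue:
    """Pop the highest-priority URL whose host was not accessed in the last `delay` seconds."""

    def __init__(self, delay: float = 1.0):
        self.delay = delay
        self._heap = []       # entries: (priority, sequence, url); lower value pops first
        self._seq = 0         # tie-breaker that keeps insertion order within a priority
        self._next_ok = {}    # host -> earliest time of the next allowed request

    def push(self, url: str, priority: int) -> None:
        heapq.heappush(self._heap, (priority, self._seq, url))
        self._seq += 1

    def pop(self, now: float):
        """Return a URL that may be fetched at time `now`, or None if every host is cooling down."""
        deferred, result = [], None
        while self._heap:
            entry = heapq.heappop(self._heap)
            host = urlsplit(entry[2]).hostname
            if self._next_ok.get(host, 0.0) <= now:
                self._next_ok[host] = now + self.delay
                result = entry[2]
                break
            deferred.append(entry)
        for entry in deferred:    # put the skipped entries back
            heapq.heappush(self._heap, entry)
        return result
```

A lower-priority link to an idle host can therefore overtake a higher-priority link to a host that was just accessed, which is what limits frequent access to one host.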
The technical solution of the invention is further illustrated below with test data. The experimental environment is as follows: 1 machine serves as the NameNode and 24 machines serve as DataNodes. Each node has the following hardware: Lenovo Yangtian T4600v; CPU: Intel Celeron E3200 (2.6 GHz); memory: 2 GB; hard disk: 250 GB, 7200 rpm. The operating system is RedHat 5.3, the Java environment is JDK 1.6, and the Hadoop version is 0.20.2. Each machine is configured with 4 slots and runs 4 map/reduce tasks simultaneously.
The experiment verifies the effectiveness of the improved SimHash-based deduplication algorithm by comparing it with the original algorithm. About 1.5 million web pages were crawled as the experimental data set. To simulate the deduplication process, 5000 mutually non-duplicate body texts were selected at random according to text length, a small number of random insertions, deletions, and modifications were applied to them to generate 10000 near-duplicate body texts, and these were inserted into the data set. Finally, a text file of about 5 GB storing the relevant information of the body texts was extracted. The data set was processed with Manku's algorithm and with the improved algorithm; the results are shown in Fig. 4.
As the figure shows, the results differ mainly in the identification of shorter pages: the recall of the improved algorithm is clearly better than that of the traditional Manku algorithm, chiefly for short texts.
To speed up the algorithm, Manku proposes the following: (1) divide the 64-bit fingerprint into blocks; (2) permute and replicate the fingerprints, trading redundant space for time. The improved algorithm instead reduces the comparison over the global range to comparisons within local ranges, trading a small loss of precision for lower time and space complexity. In the improved algorithm, what was originally one Job is decomposed into 5 Jobs, so the computing time in Fig. 5 is the total time for the algorithm to complete. As the figure shows, the running time of the improved algorithm is also better than that of the traditional algorithm as the size of the file set increases.
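The block idea in (1) rests on a pigeonhole argument: if a 64-bit fingerprint is cut into 4 blocks of 16 bits and two fingerprints differ in at most 3 bits, at least one block must be identical. The sketch below is a simplified variant, not Manku's permuted sorted tables and not the improved algorithm of the invention: it keeps one hash table per block position (this is where space is traded for time) and verifies candidates with the full Hamming distance. The class name and the default distance of 3 are assumptions.

```python
from collections import defaultdict

BITS, BLOCKS = 64, 4
WIDTH = BITS // BLOCKS   # 16 bits per block

def blocks(fp: int):
    """Return (position, value) for each 16-bit block of a 64-bit fingerprint."""
    mask = (1 << WIDTH) - 1
    return [(i, (fp >> (i * WIDTH)) & mask) for i in range(BLOCKS)]

class FingerprintIndex:
    def __init__(self, max_distance: int = 3):
        if max_distance >= BLOCKS:
            raise ValueError("the pigeonhole argument needs max_distance < BLOCKS")
        self.max_distance = max_distance
        self._tables = defaultdict(set)   # (position, value) -> fingerprints

    def add(self, fp: int) -> None:
        for key in blocks(fp):
            self._tables[key].add(fp)     # each fingerprint is stored once per block

    def near(self, fp: int) -> set:
        """Fingerprints within max_distance bits of fp; only shared-block candidates are checked."""
        candidates = set()
        for key in blocks(fp):
            candidates |= self._tables.get(key, set())
        return {c for c in candidates if bin(c ^ fp).count("1") <= self.max_distance}
```

Only fingerprints that share a block with the query are compared in full, so the lookup avoids scanning the whole fingerprint database.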
Running times were also compared at different data set sizes to verify whether the algorithm scales. The experiment compared the computing times for data sets of 300,000 to 1.5 million web pages. As the figure shows, time consumption grows roughly linearly, so the improved algorithm scales well in a parallel computing environment.

Claims (3)

1. A method for eliminating near-duplicate web pages based on a cloud platform, characterized in that:
The method comprises the following steps:
(1) preprocess the web page and extract its body text;
(2) extract feature items from the body text to characterize its content;
(3) compute fingerprints of the feature items, compressing or reducing the dimensionality of the feature items to facilitate storage and retrieval;
(4) compute similarity based on the feature fingerprints and judge whether the original web pages are near-duplicates.
2. The method for eliminating near-duplicate web pages based on a cloud platform according to claim 1, characterized in that: steps (1), (2), and (3) preprocess a given web page and represent its body text as a set of block fingerprints, and the algorithm for this stage is as follows:
Step ①: preprocess the web page, remove page noise, and extract the body text;
Step ②: for each natural block of the body text still to be processed, split the block into sentences using punctuation marks as separators; from these sentences, extract the long sentences of at least k characters as the feature items of the block; for each extracted long sentence, compute its weight from the topic semantics of the web page and the characteristics of the long sentence itself;
Step ③: compute the block fingerprint from the weighted long sentences of the block, and add the block fingerprint to the block fingerprint set of the text;
Step ④: if there is another block, return to step ② and carry out long-sentence extraction and fingerprint computation on it; if the full text has been processed, this stage ends and yields the block fingerprint set of the full text.
3. The method for eliminating near-duplicate web pages based on a cloud platform according to claim 1, characterized in that: step (4) computes the similarity of the block fingerprint sets obtained and judges from it whether the texts are near-duplicates, as follows:
Step ①: let the near-duplicate threshold be r;
Step ②: compute the similarity; if similarity > r, the web pages are judged to be near-duplicates; otherwise they are not.
CN201310537406.9A 2013-11-04 2013-11-04 Method for eliminating similar-duplicate webpage on the basis of cloud platform Pending CN103559259A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310537406.9A CN103559259A (en) 2013-11-04 2013-11-04 Method for eliminating similar-duplicate webpage on the basis of cloud platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310537406.9A CN103559259A (en) 2013-11-04 2013-11-04 Method for eliminating similar-duplicate webpage on the basis of cloud platform

Publications (1)

Publication Number Publication Date
CN103559259A true CN103559259A (en) 2014-02-05

Family

ID=50013506

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310537406.9A Pending CN103559259A (en) 2013-11-04 2013-11-04 Method for eliminating similar-duplicate webpage on the basis of cloud platform

Country Status (1)

Country Link
CN (1) CN103559259A (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103955529A (en) * 2014-05-12 2014-07-30 中国科学院计算机网络信息中心 Internet information searching and aggregating presentation method
CN104079559A (en) * 2014-06-05 2014-10-01 腾讯科技(深圳)有限公司 Web address security detecting method and device and server
CN104615714A (en) * 2015-02-05 2015-05-13 北京中搜网络技术股份有限公司 Blog duplicate removal method based on text similarities and microblog channel features
CN105630802A (en) * 2014-10-30 2016-06-01 阿里巴巴集团控股有限公司 Webpage duplication removal method and apparatus
CN105653668A (en) * 2015-12-29 2016-06-08 武汉理工大学 Webpage content analysis and extraction optimization method based on DOM Tree in cloud environment
CN106302202A (en) * 2015-05-15 2017-01-04 阿里巴巴集团控股有限公司 Data current-limiting method and device
CN106407195A (en) * 2015-07-28 2017-02-15 北京京东尚科信息技术有限公司 Method and system for eliminating duplication of webpage
CN106528508A (en) * 2016-10-27 2017-03-22 乐视控股(北京)有限公司 Repeated text judgment method and apparatus
CN106897271A (en) * 2017-03-17 2017-06-27 北京搜狐新媒体信息技术有限公司 Body noise removal method and system
CN108416221A (en) * 2018-01-22 2018-08-17 西安电子科技大学 Safe set of metadata of similar data possesses proof scheme in a kind of cloud environment
CN109002517A (en) * 2018-07-06 2018-12-14 佛山市灏金赢科技有限公司 A kind of webpage content display method and system
CN109189824A (en) * 2018-08-10 2019-01-11 阿里巴巴集团控股有限公司 A kind of method and device for retrieving similar article
CN110322692A (en) * 2019-07-09 2019-10-11 广东工业大学 A kind of detection method, device and equipment repeating traffic flow data
CN110442679A (en) * 2019-08-01 2019-11-12 信雅达系统工程股份有限公司 A kind of text De-weight method based on Fusion Model algorithm
CN112307303A (en) * 2020-10-29 2021-02-02 扆亮海 Efficient and accurate network page duplicate removal system based on cloud computing
CN113408660A (en) * 2021-07-15 2021-09-17 北京百度网讯科技有限公司 Book clustering method, device, equipment and storage medium
CN119203976A (en) * 2024-11-25 2024-12-27 国家计算机网络与信息安全管理中心江苏分中心 A web page deduplication method based on structure perception
CN119249020A (en) * 2024-09-19 2025-01-03 广州盈风网络科技有限公司 Artificial intelligence-based website map generation method, system, device and medium
CN119249020B (en) * 2024-09-19 2025-09-16 广州盈风网络科技有限公司 Sitemap generation method, system, equipment and medium based on artificial intelligence

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080235163A1 (en) * 2007-03-22 2008-09-25 Srinivasan Balasubramanian System and method for online duplicate detection and elimination in a web crawler
CN101645082A (en) * 2009-04-17 2010-02-10 华中科技大学 Similar web page duplicate-removing system based on parallel programming mode
CN102024065A (en) * 2011-01-18 2011-04-20 中南大学 SIMD optimization-based webpage duplication elimination and concurrency method
CN102375813A (en) * 2010-08-09 2012-03-14 腾讯科技(深圳)有限公司 Duplicate detection system and method for search engines
US8140505B1 (en) * 2005-03-31 2012-03-20 Google Inc. Near-duplicate document detection for web crawling

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8140505B1 (en) * 2005-03-31 2012-03-20 Google Inc. Near-duplicate document detection for web crawling
US20080235163A1 (en) * 2007-03-22 2008-09-25 Srinivasan Balasubramanian System and method for online duplicate detection and elimination in a web crawler
CN101645082A (en) * 2009-04-17 2010-02-10 华中科技大学 Similar web page duplicate-removing system based on parallel programming mode
CN102375813A (en) * 2010-08-09 2012-03-14 腾讯科技(深圳)有限公司 Duplicate detection system and method for search engines
CN102024065A (en) * 2011-01-18 2011-04-20 中南大学 SIMD optimization-based webpage duplication elimination and concurrency method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
GURMEET SINGH MANKU 等: "Detecting Near-Duplicates for Web Crawling", 《WWW2007/TRACK:DATA MINING》 *
徐娜 等: "基于Bloom Filter的网页去重算法", 《微型电脑应用》 *
黄仁 等: "基于正文结构和长句提取的网页去重算法", 《计算机应用研究》 *

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103955529A (en) * 2014-05-12 2014-07-30 中国科学院计算机网络信息中心 Internet information searching and aggregating presentation method
CN104079559B (en) * 2014-06-05 2017-07-25 腾讯科技(深圳)有限公司 A kind of website safety detection method, device and server
CN104079559A (en) * 2014-06-05 2014-10-01 腾讯科技(深圳)有限公司 Web address security detecting method and device and server
CN105630802A (en) * 2014-10-30 2016-06-01 阿里巴巴集团控股有限公司 Webpage duplication removal method and apparatus
US10691769B2 (en) 2014-10-30 2020-06-23 Alibaba Group Holding Limited Methods and apparatus for removing a duplicated web page
CN104615714A (en) * 2015-02-05 2015-05-13 北京中搜网络技术股份有限公司 Blog duplicate removal method based on text similarities and microblog channel features
CN106302202A (en) * 2015-05-15 2017-01-04 阿里巴巴集团控股有限公司 Data current-limiting method and device
CN106302202B (en) * 2015-05-15 2020-07-28 阿里巴巴集团控股有限公司 Data current limiting method and device
CN106407195A (en) * 2015-07-28 2017-02-15 北京京东尚科信息技术有限公司 Method and system for eliminating duplication of webpage
CN105653668A (en) * 2015-12-29 2016-06-08 武汉理工大学 Webpage content analysis and extraction optimization method based on DOM Tree in cloud environment
CN106528508A (en) * 2016-10-27 2017-03-22 乐视控股(北京)有限公司 Repeated text judgment method and apparatus
CN106897271A (en) * 2017-03-17 2017-06-27 北京搜狐新媒体信息技术有限公司 Body noise removal method and system
CN106897271B (en) * 2017-03-17 2020-05-22 北京搜狐新媒体信息技术有限公司 News text noise removal method and system
CN108416221A (en) * 2018-01-22 2018-08-17 西安电子科技大学 Safe set of metadata of similar data possesses proof scheme in a kind of cloud environment
CN109002517A (en) * 2018-07-06 2018-12-14 佛山市灏金赢科技有限公司 A kind of webpage content display method and system
CN109189824A (en) * 2018-08-10 2019-01-11 阿里巴巴集团控股有限公司 A kind of method and device for retrieving similar article
CN109189824B (en) * 2018-08-10 2022-04-26 创新先进技术有限公司 Method and device for retrieving similar articles
CN110322692A (en) * 2019-07-09 2019-10-11 广东工业大学 A kind of detection method, device and equipment repeating traffic flow data
CN110322692B (en) * 2019-07-09 2020-10-23 广东工业大学 Method, device and equipment for detecting repeated traffic flow data
CN110442679A (en) * 2019-08-01 2019-11-12 信雅达系统工程股份有限公司 A kind of text De-weight method based on Fusion Model algorithm
CN112307303A (en) * 2020-10-29 2021-02-02 扆亮海 Efficient and accurate network page duplicate removal system based on cloud computing
CN113408660A (en) * 2021-07-15 2021-09-17 北京百度网讯科技有限公司 Book clustering method, device, equipment and storage medium
CN113408660B (en) * 2021-07-15 2024-05-24 北京百度网讯科技有限公司 Book clustering method, device, equipment and storage medium
CN119249020A (en) * 2024-09-19 2025-01-03 广州盈风网络科技有限公司 Artificial intelligence-based website map generation method, system, device and medium
CN119249020B (en) * 2024-09-19 2025-09-16 广州盈风网络科技有限公司 Sitemap generation method, system, equipment and medium based on artificial intelligence
CN119203976A (en) * 2024-11-25 2024-12-27 国家计算机网络与信息安全管理中心江苏分中心 A web page deduplication method based on structure perception
CN119203976B (en) * 2024-11-25 2025-02-11 国家计算机网络与信息安全管理中心江苏分中心 A web page deduplication method based on structure perception

Similar Documents

Publication Publication Date Title
CN103559259A (en) Method for eliminating similar-duplicate webpage on the basis of cloud platform
CN106844640B (en) Webpage data analysis processing method
US10579661B2 (en) System and method for machine learning and classifying data
CN110543595B (en) In-station searching system and method
CN108132929A (en) A kind of similarity calculation method of magnanimity non-structured text
CN110309446A (en) The quick De-weight method of content of text, device, computer equipment and storage medium
US20140068768A1 (en) Apparatus and Method for Identifying Related Code Variants in Binaries
JP2009104591A (en) Web document clustering method and system
CN103544255A (en) Text semantic relativity based network public opinion information analysis method
CN110929145A (en) Public opinion analysis method, public opinion analysis device, computer device and storage medium
CN106815307A (en) Public Culture knowledge mapping platform and its use method
CN105389349A (en) Dictionary update method and device
CN105653668A (en) Webpage content analysis and extraction optimization method based on DOM Tree in cloud environment
CN108363686A (en) A kind of character string segmenting method, device, terminal device and storage medium
CN112307303A (en) Efficient and accurate network page duplicate removal system based on cloud computing
CN110532389B (en) Text clustering method and device and computing equipment
CN105550359B (en) Webpage sorting method and device based on vertical search and server
CN111444411A (en) Network data increment acquisition method, device, equipment and storage medium
CN110008419A (en) Removing duplicate webpages method, device and equipment
CN103593454A (en) Mining method and system for microblog text classification
CN106649557A (en) Semantic association mining method for defect report and mail list
CN106844550A (en) Method and device is recommended in a kind of virtual platform operation
Mallick et al. Incremental mining of sequential patterns: Progress and challenges
CN106202552A (en) Data search method based on cloud computing
JP6749865B2 (en) INFORMATION COLLECTION DEVICE AND INFORMATION COLLECTION METHOD

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140205
