WO2008030568A3 - Feed crawling system and method and spam feed filter - Google Patents
- Publication number
- WO2008030568A3 (application PCT/US2007/019558)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- feed
- crawling
- spam
- urls
- database
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
- Computer And Data Communications (AREA)
Abstract
A feed crawling system, method, and computer program product. A spam filter and method for filtering. A system and method for feed crawling with spam filtering. A computer system and associated method and computer program product for crawling content feeds, the computer system comprising: at least one processor for executing at least one process; a database providing storage for location information or uniform resource locators (URLs); a first process for prioritizing a list of URLs to be crawled; a parallelized crawler process for crawling the URLs and storing the results in the database; and an indexing process for indexing the database for a user to search.
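The pipeline named in the abstract (prioritize a URL list, crawl it in parallel, store the results in a database, index the database for search) can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration and is not taken from the patent: the numeric priority scores, the `feeds` table layout, the word-level inverted index, and the `fake_fetch` stand-in that replaces a real network fetch. The spam filter is not sketched because the abstract gives no detail about it.

```python
import heapq
import sqlite3
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def prioritize(urls_with_scores):
    """Return URLs ordered from highest to lowest priority score (scores are hypothetical)."""
    # heapq is a min-heap, so negate the score to pop the highest score first.
    heap = [(-score, url) for url, score in urls_with_scores]
    heapq.heapify(heap)
    ordered = []
    while heap:
        _, url = heapq.heappop(heap)
        ordered.append(url)
    return ordered


def make_db():
    """Create an in-memory database with a hypothetical one-table schema."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE feeds (url TEXT PRIMARY KEY, content TEXT)")
    return db


def crawl(urls, fetch, db, workers=4):
    """Fetch the URLs in parallel worker threads and store each result in the database."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() runs fetch concurrently and returns results in input order.
        results = list(pool.map(fetch, urls))
    # Write from the calling thread, since the connection was created here.
    db.executemany(
        "INSERT OR REPLACE INTO feeds (url, content) VALUES (?, ?)",
        zip(urls, results),
    )
    db.commit()


def build_index(db):
    """Build a simple inverted index: lowercase word -> set of URLs containing it."""
    index = defaultdict(set)
    for url, content in db.execute("SELECT url, content FROM feeds"):
        for word in content.lower().split():
            index[word].add(url)
    return index


def search(index, word):
    """Return the sorted list of URLs whose stored content contains the word."""
    return sorted(index.get(word.lower(), set()))


# Hypothetical feed contents standing in for real HTTP responses.
FAKE_FEEDS = {
    "http://a.example/feed": "notes on feed crawling",
    "http://b.example/feed": "parallel crawling of feeds",
}


def fake_fetch(url):
    """Stand-in for a network fetch so the sketch runs offline."""
    return FAKE_FEEDS[url]


if __name__ == "__main__":
    db = make_db()
    order = prioritize([("http://a.example/feed", 1.0), ("http://b.example/feed", 5.0)])
    crawl(order, fake_fetch, db)
    print(search(build_index(db), "crawling"))
```

The fetch function is passed in as a parameter so that a real HTTP client could be substituted without touching the prioritizing, storing, or indexing steps.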
Applications Claiming Priority (8)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US82490306P | 2006-09-07 | 2006-09-07 | |
US60/824,903 | 2006-09-07 | | |
US82511406P | 2006-09-08 | 2006-09-08 | |
US60/825,114 | 2006-09-08 | | |
US85057707A | 2007-09-05 | 2007-09-05 | |
US85059207A | 2007-09-05 | 2007-09-05 | |
US11/850,577 | 2007-09-05 | | |
US11/850,592 | 2007-09-05 | | |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2008030568A2 (en) | 2008-03-13 |
WO2008030568A3 (en) | 2008-10-16 |
Family
ID=39157869
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2007/019558 (WO2008030568A2) | Feed crawling system and method and spam feed filter | 2006-09-07 | 2007-09-07 |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2008030568A2 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108491438A (en) * | 2018-02-12 | 2018-09-04 | 陆夏根 | A technology policy retrieval and analysis method |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6182085B1 (en) * | 1998-05-28 | 2001-01-30 | International Business Machines Corporation | Collaborative team crawling: Large scale information gathering over the internet |
US6266664B1 (en) * | 1997-10-01 | 2001-07-24 | Rulespace, Inc. | Method for scanning, analyzing and rating digital information content |
US6377984B1 (en) * | 1999-11-02 | 2002-04-23 | Alta Vista Company | Web crawler system using parallel queues for queing data sets having common address and concurrently downloading data associated with data set in each queue |
US20020188841A1 (en) * | 1995-07-27 | 2002-12-12 | Jones Kevin C. | Digital asset management and linking media signals with related data using watermarks |
US20020194161A1 (en) * | 2001-04-12 | 2002-12-19 | Mcnamee J. Paul | Directed web crawler with machine learning |
US6631369B1 (en) * | 1999-06-30 | 2003-10-07 | Microsoft Corporation | Method and system for incremental web crawling |
US6738767B1 (en) * | 2000-03-20 | 2004-05-18 | International Business Machines Corporation | System and method for discovering schematic structure in hypertext documents |
US20050086206A1 (en) * | 2003-10-15 | 2005-04-21 | International Business Machines Corporation | System, Method, and service for collaborative focused crawling of documents on a network |
US20050102259A1 (en) * | 2003-11-12 | 2005-05-12 | Yahoo! Inc. | Systems and methods for search query processing using trend analysis |
US20050192936A1 (en) * | 2004-02-12 | 2005-09-01 | Meek Christopher A. | Decision-theoretic web-crawling and predicting web-page change |
US20050262062A1 (en) * | 2004-05-08 | 2005-11-24 | Xiongwu Xia | Methods and apparatus providing local search engine |
US20060136420A1 (en) * | 2004-12-20 | 2006-06-22 | Yahoo!, Inc. | System and method for providing improved access to a search tool in electronic mail-enabled applications |
- 2007-09-07: WO application PCT/US2007/019558 filed (WO2008030568A2); status: active, Application Filing
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108710672A (en) * | 2018-05-17 | 2018-10-26 | 南京大学 | A topic crawler method based on an incremental Bayesian algorithm |
CN108710672B (en) * | 2018-05-17 | 2020-04-14 | 南京大学 | A Topic Crawler Method Based on Incremental Bayesian Algorithm |
Also Published As
Publication number | Publication date |
---|---|
WO2008030568A2 (en) | 2008-03-13 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
WO2008011029A3 (en) | Method and system for creating a concept-object database | |
WO2007047252A3 (en) | System, method & computer program product for concept based searching & analysis | |
WO2007081681A3 (en) | Search system with query refinement and search method | |
WO2008088721A3 (en) | Querying data and an associated ontology in a database management system | |
WO2007065947A3 (en) | System and method of implementing an e-mail interface for a content management system | |
WO2005098591A3 (en) | Methods and systems for structuring event data in a database for location and retrieval | |
WO2008021832A3 (en) | Harvesting data from page | |
WO2008070866A3 (en) | Interleaving search results | |
WO2007108788A3 (en) | Method and system for answer extraction | |
WO2007103191A3 (en) | Comparative web search | |
WO2009123866A3 (en) | Method and system for organizing information | |
WO2006116196A3 (en) | Media object metadata association and ranking | |
WO2007144853A3 (en) | Method and apparatus for performing customized paring on a xml document based on application | |
WO2007059216A3 (en) | Methods and apparatus for rank-based response set clustering | |
Sutherland et al. | Equilibrium modeling of Cu (II) biosorption onto untreated and treated forest macro-fungus Fomes fasciatus. | |
WO2008030568A3 (en) | Feed crawling system and method and spam feed filter | |
WO2007115219A3 (en) | Item management systems and associated methods | |
WO2008009995A3 (en) | System and method for indexing stored electronic data using a b-tree | |
ATE496474T1 (en) | MULTI-LAYER ENVELOPE PROCESS AND CONTENT DELIVERY SYSTEM | |
WO2004107204A3 (en) | Data processing method and system for combining database tables | |
Khosla et al. | Efficacy of insecticidal dusts on natural infestation of Trogoderma granarium (Everts) on wheat seeds | |
WO2009120329A3 (en) | Online analytic processing cube with time stamping | |
Wang JiaHong et al. | Adsorption of Cr (VI) from aqueous solution onto short-chain polyaniline/palygorskite composites. | |
Lu HongTao et al. | In situ oxidation and efficient simultaneous adsorption of arsenite and arsenate by Mg-Fe-LDH with persulfate intercalation. | |
Fazeli et al. | Effect of Environmental Parameters on Economically Important Copepods in Chabahar Bay in 2007 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 07811709; Country of ref document: EP; Kind code of ref document: A2 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | EP: PCT application non-entry in European phase | Ref document number: 07811709; Country of ref document: EP; Kind code of ref document: A2 |