WO2008030568A3 - Feed crawling system and method and spam feed filter - Google Patents

Feed crawling system and method and spam feed filter

Info

Publication number
WO2008030568A3
Authority
WO
WIPO (PCT)
Prior art keywords
feed
crawling
spam
urls
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2007/019558
Other languages
French (fr)
Other versions
WO2008030568A2 (en)
Inventor
James Ruga
Rebecca Berrigan
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
FEEDSTER Inc
Original Assignee
FEEDSTER Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by FEEDSTER Inc
Publication of WO2008030568A2
Publication of WO2008030568A3
Anticipated expiration
Legal status: Ceased

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Computer And Data Communications (AREA)

Abstract

A feed crawling system, method, and computer program product. A spam filter and method for filtering. A system and method for feed crawling with spam filtering. A computer system and associated method and computer program product for crawling content feeds, the computer system comprising: at least one processor for executing at least one process; a database for storing location information or uniform resource locators (URLs); a first process for prioritizing a list of URLs to be crawled; a parallelized crawler process for crawling the URLs and storing the results in the database; and an indexing process for indexing the database for a user to search.
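
The pipeline named in the abstract (prioritize, crawl in parallel, store, index) can be illustrated with a minimal Python sketch. This is not the applicant's implementation: the scoring values, the `feeds` table layout, the word-level inverted index, and every function name below (`prioritize`, `crawl`, `build_index`, `fake_fetch`) are assumptions made for illustration, and the fetcher is a stub that reads from an in-memory dictionary instead of the network. The spam filter mentioned in the title is not modeled.

```python
import heapq
import sqlite3
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for remote feeds, so the sketch runs without a network.
FAKE_FEEDS = {
    "http://a.example/feed": "weekly notes on gardening",
    "http://b.example/feed": "building a feed crawler in parallel",
}


def fake_fetch(url):
    """Stub fetcher (assumption): returns the stored body for a URL."""
    return FAKE_FEEDS[url]


def open_db():
    """Create an in-memory database with a hypothetical 'feeds' table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE feeds (url TEXT PRIMARY KEY, body TEXT)")
    return db


def prioritize(urls_with_scores):
    """Return URLs ordered from highest to lowest priority score."""
    # heapq is a min-heap, so negate each score to pop the highest first.
    heap = [(-score, url) for url, score in urls_with_scores]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


def crawl(urls, fetch, db, workers=4):
    """Fetch the URLs concurrently, then store the results in the database."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map runs fetches concurrently but yields results in input order.
        rows = list(zip(urls, pool.map(fetch, urls)))
    # Write from the calling thread only; the connection is not shared with workers.
    db.executemany("INSERT OR REPLACE INTO feeds (url, body) VALUES (?, ?)", rows)
    db.commit()


def build_index(db):
    """Build a word -> set-of-URLs inverted index over the stored bodies."""
    index = {}
    for url, body in db.execute("SELECT url, body FROM feeds"):
        for word in set(body.lower().split()):
            index.setdefault(word, set()).add(url)
    return index


if __name__ == "__main__":
    db = open_db()
    order = prioritize([("http://a.example/feed", 1.0),
                        ("http://b.example/feed", 5.0)])
    crawl(order, fake_fetch, db)
    print(build_index(db).get("crawler", set()))
```

Keeping database writes on the calling thread is a design choice of this sketch, made because a `sqlite3` connection is by default bound to the thread that created it; the abstract does not say how the crawler processes share the database.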

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
US82490306P 2006-09-07 2006-09-07
US60/824,903 2006-09-07
US82511406P 2006-09-08 2006-09-08
US60/825,114 2006-09-08
US85057707A 2007-09-05 2007-09-05
US85059207A 2007-09-05 2007-09-05
US11/850,577 2007-09-05
US11/850,592 2007-09-05

Publications (2)

Publication Number Publication Date
WO2008030568A2 (en) 2008-03-13
WO2008030568A3 (en) 2008-10-16

Family

ID=39157869

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2007/019558 Ceased WO2008030568A2 (en) 2006-09-07 2007-09-07 Feed crawling system and method and spam feed filter

Country Status (1)

Country Link
WO (1) WO2008030568A2 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108491438A (en) * 2018-02-12 2018-09-04 陆夏根 A kind of technology policy retrieval analysis method

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020188841A1 (en) * 1995-07-27 2002-12-12 Jones Kevin C. Digital asset management and linking media signals with related data using watermarks
US6266664B1 (en) * 1997-10-01 2001-07-24 Rulespace, Inc. Method for scanning, analyzing and rating digital information content
US6182085B1 (en) * 1998-05-28 2001-01-30 International Business Machines Corporation Collaborative team crawling:Large scale information gathering over the internet
US6631369B1 (en) * 1999-06-30 2003-10-07 Microsoft Corporation Method and system for incremental web crawling
US6377984B1 (en) * 1999-11-02 2002-04-23 Alta Vista Company Web crawler system using parallel queues for queing data sets having common address and concurrently downloading data associated with data set in each queue
US6738767B1 (en) * 2000-03-20 2004-05-18 International Business Machines Corporation System and method for discovering schematic structure in hypertext documents
US20020194161A1 (en) * 2001-04-12 2002-12-19 Mcnamee J. Paul Directed web crawler with machine learning
US20050086206A1 (en) * 2003-10-15 2005-04-21 International Business Machines Corporation System, Method, and service for collaborative focused crawling of documents on a network
US20050102259A1 (en) * 2003-11-12 2005-05-12 Yahoo! Inc. Systems and methods for search query processing using trend analysis
US20050192936A1 (en) * 2004-02-12 2005-09-01 Meek Christopher A. Decision-theoretic web-crawling and predicting web-page change
US20050262062A1 (en) * 2004-05-08 2005-11-24 Xiongwu Xia Methods and apparatus providing local search engine
US20060136420A1 (en) * 2004-12-20 2006-06-22 Yahoo!, Inc. System and method for providing improved access to a search tool in electronic mail-enabled applications

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710672A (en) * 2018-05-17 2018-10-26 南京大学 A kind of Theme Crawler of Content method based on increment bayesian algorithm
CN108710672B (en) * 2018-05-17 2020-04-14 南京大学 A Topic Crawler Method Based on Incremental Bayesian Algorithm

Also Published As

Publication number Publication date
WO2008030568A2 (en) 2008-03-13

Similar Documents

Publication Publication Date Title
WO2008011029A3 (en) Method and system for creating a concept-object database
WO2007081681A3 (en) Search system with query refinement and search method
WO2008088722A3 (en) Querying data and an associated ontology in a database management system
WO2008088721A3 (en) Querying data and an associated ontology in a database management system
WO2008070744A3 (en) Centralized web-based software solution for search engine optimization
WO2008070866A3 (en) Interleaving search results
WO2008034057A3 (en) Methods and systems for dynamically rearranging search results into hierarchically organized concept clusters
WO2007064640A3 (en) Detecting repeating content in broadcast media
WO2007103191A3 (en) Comparative web search
WO2009123866A3 (en) Method and system for organizing information
WO2007144853A3 (en) Method and apparatus for performing customized paring on a xml document based on application
WO2006116196A3 (en) Media object metadata association and ranking
WO2005060684A3 (en) Method and system for obtaining solutions to contradictional problems from a semantically indexed database
WO2005101186A3 (en) System, method and computer program product for extracting metadata faster than real-time
WO2007059216A3 (en) Methods and apparatus for rank-based response set clustering
WO2008084501A3 (en) Method for enhancing the relevance of search results with user judgment
EP1962241A4 (en) Content search device, content search system, server device for content search system, content searching method, and computer program and content output apparatus with search function
WO2008030568A3 (en) Feed crawling system and method and spam feed filter
WO2006122106A3 (en) Processing information from selected sources via a single website
Chojnacka et al. Using wood and bone ash to remove metal ions from solutions.
WO2007056656A3 (en) Methods and apparatus for processing business objects, electronic forms, and workflows
ATE400135T1 (en) MULTI-LAYER WRAPPED METHOD AND SYSTEM FOR PUSH CONTENT DATA METADATA
WO2009120329A3 (en) Online analytic processing cube with time stamping
Sulaymon et al. Competitive adsorption of cadmium lead and mercury ions onto activated carbon in batch adsorber.
Caumul The role of surfactants and their intermediates in environmental chemistry.

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07811709

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07811709

Country of ref document: EP

Kind code of ref document: A2