
WO2000002141A1 - A system for crawling the web and extracting designated data and the method therefor i.e. webharvester - Google Patents

Info

Publication number
WO2000002141A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
web
computer system
database
pages
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN1998/000117
Other languages
French (fr)
Inventor
Fujun Bi
Shaun Bliss
Hong Yan
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to AU81008/98A (AU8100898A)
Priority to PCT/CN1998/000117 (WO2000002141A1)
Publication of WO2000002141A1

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577 Optimising the visualization of content, e.g. distillation of HTML documents

Definitions

  • The present invention relates to a system for crawling the Web and extracting designated data and the method therefor, i.e. WebHarvester, and in particular to a system that allows a user to fetch, strip and store necessary information from different web sites in an organised and manageable way, and the method therefor.
  • The current technology for visiting web sites is not efficient and impedes quick communication of the necessary information to the targeted information receiver, which is favored by neither the web site service provider nor the user or surfer of the web.
  • The present invention provides a system for crawling the Web and extracting designated data, comprising: a computer system; a database configured in the computer system; templates residing in the computer system for mapping information in the target page for each web site; fetch means for fetching web pages from said web sites and transferring the fetched pages to said computer system; filter means for scanning the fetched pages to extract necessary information from the fetched pages from said web sites according to the corresponding one of said templates, respectively; format & post means for converting the extracted information into a standard format, and storing the formatted information in said database.
  • The present invention further provides a method for crawling the Web and extracting designated data in a computer system configured with a database, comprising the steps of: creating and keeping templates for mapping information in the target page for each web site in the computer system; fetching web pages from said web sites and transferring the fetched pages to said computer system; filtering the fetched pages to extract the necessary information from the fetched pages from said web sites according to the corresponding one of the templates, respectively; formatting the extracted information into a standard format, and storing the formatted information in said database.
  • This inventive technology is also called "WebHarvester", which reflects the function of the present invention.
  • This WebHarvester allows a user to fetch, strip and store necessary information from different web sites in an organized and manageable way, and is able to recognize and distinguish desired information in various web sites of the same category amongst all other unnecessary data, codes and programming tags. Its final step is to store the extracted information in the database for future use. Stored data will be fully manageable and free of unwanted information.
  • The uniqueness of WebHarvester is that it performs these tasks automatically and, once defined, retrieves the targeted information reliably with no need for further control.
  • Fig. 1 shows the conventional way of surfing and getting results by looking at different web sites.
  • Fig. 2 shows how the steps needed for getting the same result are shorter with the system according to the present invention
  • Fig. 3 shows the schematic diagram of the basic elements of this invention
  • Fig. 4 shows the configuration of the webharvester system according to the present invention
  • Fig. 5 shows the operations involved when retrieving information with the webharvester system according to the present invention
  • Fig. 6 shows a web page containing information to be imported to the local database
  • Fig. 7 shows the regular HTML version of the page shown in Fig. 6
  • Fig. 8 shows the HTML coded template page in correspondence to the page shown in Fig. 7, added tags are underlined;
  • Fig. 9 shows mapping the information contained in the web page of a web site into a target page
  • Fig. 10 shows the structure of the entities of the database in the system according to an embodiment of the present invention.
  • Fig. 11 shows the relationship between the entities of the database in the system according to the embodiment of the present invention.
  • Fig. 12 shows the relationship in more detail between the entities involved in the system according to the embodiment of the present invention.
  • Fig. 13 shows the basic architecture of the system according to an embodiment of the present invention
  • Fig. 14 shows the detail architecture and configuration of the system according to the embodiment of the present invention
  • Fig. 15 shows how the information is passed between systems and the steps taken to get the data to input into the database and expose it to the clients in the system according to the embodiment of the present invention
  • Fig. 16 shows the operation for local server to return the results which contain information the user is interested in to the client when a client inputs request.
  • Fig. 3 shows the schematic diagram of the basic elements of this invention.
  • First, the system connects to one of the sites and gets the web page from the site, then manipulates the page in a predetermined manner, then formats and posts the page to change it into a standard page specified in the local site.
  • Fig. 4 shows the configuration of the WebHarvester system according to the present invention.
  • the WebHarvester system comprises a computer system; a database configured in the computer system; templates residing in the computer system for mapping information in target page for each web site; fetch means for connecting to said web sites, performing query statements to retrieve respective pages of said web sites, and transferring the retrieved pages to said computer system; filter means for scanning the retrieved pages, extracting the necessary information from the respective retrieved page of each site according to the corresponding template for each site; format and post means for changing the format of the taken information to a standard format, and storing the formatted information in said database.
  • This system provides information from a plurality of web sites to the web users automatically.
  • As shown in Fig. 5, there are three basic elements in the WebHarvester system, which also form the steps taken to achieve the objectives mentioned above, according to the method of the present invention. These three elements are Fetch, Filter, and Format & Post.
  • The Fetch element is the first element in the WebHarvester system. Its duty is to find the site and the proper page and transfer the information to the local host. This is done based on specific information available from the site and according to the format and rules of that site. On most sites, information is not stored as static pages but is formed dynamically at query time. To get the necessary information from these sites, the Fetch element should first contact the target site, execute the query statements, and produce the dynamic page containing the result. To do this, the system should send accurate parameters specific to the search engine, attached to the URL of the target site. These parameters have been examined beforehand and stored in the database, and are able to generate results. Only then can the Fetch operation transfer useful data to the local host.
  • The basic syntax of the Fetch command is as follows:
  • Connection Command + Web Page Address + Parameters. "Connection Command" is the keyword for connecting to the target site, "Web Page Address" is the URL of the target page, and "Parameters" is the set of coding necessary to generate the page or pages with the result. The detailed syntax of the command line depends mostly on the characteristics of the target site.
  • Once the data is on the host system, the Filter element should scan it and take unnecessary parts, such as HTML tags, unrelated information and coding, out of the main part.
  • The remaining information will supposedly be the main body of the Resume or Job Opportunity posted to the other site, based on the template. Separating desired information from unwanted information should be done precisely and accurately, to avoid putting garbage into the database on the one hand and losing necessary information on the other.
  • The way WebHarvester deals with web sites is to keep a template for every single site in the system. The format of this template is identical to the related page plus some native tags that are recognizable by the element and show exactly where in the target page the necessary information resides. Each template is produced in advance according to the features and rules of the respective web site. By compiling these tags, the Filter element is able to distinguish and extract data from the page, specify its type, and pass it to the next element.
  • Format & Post element: after receiving data from the Filter element, the Format & Post element changes the format of the imported information and shapes it according to the rules and standards of the local system. In this way, the naming and conventions of different documents are changed to the ones applied in the Job Plant.
  • The Fetch element is responsible for contacting the target site, finding the proper pages and transferring them to the local server. To do this, the element should know the behavior of the target site and how to retrieve the pages containing the right information. This requires that the system keep the necessary information and send it at retrieval time. Basically, this information consists of the commands and directions, attached to the basic URL, that are needed to produce the dynamic pages of the target site with actual data in them. It changes from site to site depending on the site's structure and underlying technology, so each site should be studied and resolved separately. Fig. 6 shows an example of a web page containing information one would want to import into the local database.
  • Fig. 7 shows the regular HTML version of the page shown in Fig. 6.
  • The system stores an HTML coded template page corresponding to this page.
  • The content of this page is the exact HTML version of the page plus special tags, which are added to mark the areas of the page that contain the required information.
  • This template page is shown in Fig. 8; added tags are underlined.
  • The template page is the HTML version of the real page with some extra tags known to the WebHarvester system. These tags are coded like <IMS_JOB_XXX>, where "XXX" is any title related to the subject and "IMS_JOB" is the code that makes the tag obvious to the system and distinguishable from ordinary tags.
  • The coding between the starting and ending WebHarvester tags varies with the format of the original page.
  • <IMS_JOB_TITLE>\(.*\)</P></IMS_JOB_TITLE> tells the system to expect, after the opening tag, anything of any type until it hits the </P> tag, and then to stop collecting information.
  • <IMS_JOB_DATE>\([0-9]*/[0-… The starting tag shows that the following information is expected to be in date format and will be stored as a date. The rest of the coding says that the incoming data may be any digits between 0 and 9 separated by a slash "/" for month, day and year, followed by spaces or tabs until </P> is found. The ending tag pairs with the opening tag, with an extra "/".
  • </P> in the coding part is specific to this page; in other pages or on other sites it could be anything else. It can only be determined by studying the target sites and pages and finding out which tag or symbol follows the actual information to be captured. This method is used to map the information in target pages, so there is no need to understand the whole page: everything except the known tags is ignored.
  • Fig. 9 shows the mapping relation of the information contained in the web page of a web site into a target page in the database.
  • The WebHarvester system sees only the marked areas and ignores all other parts. This way, the system does not have to deal with any other piece of information or code on the page.
  • The coding of the template is as follows:
  • Fig. 10 shows the data structure of the database in the system according to an embodiment of the present invention.
  • This invention is general-purpose; it is not based on a particular subject or limited to any sector.
  • The data model disclosed below supports job opportunity purposes but is convertible to other systems.
  • The relationships between entities are shown in Fig. 11. Each user has a record in the User entity, and there is no more than one entry per user. There can be zero, one, or more than one resume or job opportunity for each user; these are tied to the user's entry. Each target site has one entry in the Site entity and one entry in Template. Each site can have zero, one, or more than one entry in the Resume entity. Every resume can have one or more entries in the Location and Category entities. Each job can have one entry in the Location and Category entities.
  • Fig. 12 shows the relationship in more detail between the entities involved in the system according to the embodiment of the present invention. Relationships may vary for some of the entities depending on the type of business they are used for. Once again, this implementation is designed to serve a job opportunities system, which is only one of the applications of the WebHarvester system according to the present invention.
  • Fig. 13 shows the basic architecture of the system according to an embodiment of the present invention.
  • WebHarvester system: there is no proprietary configuration for the WebHarvester system according to the present invention. An Internet connection on each side of the system is the only hardware configuration needed. Configurations may vary slightly under different conditions, but the environments supporting this technology have more or less the same characteristics. The following specification is the environment most commonly used in the market and is based on open system architecture.
  • Hardware PC MAC
  • MAC MAC
  • Workstations 32 MB or more RAM 40 MB or more Free space of Hard Disk 28.8 Kbps or faster communication line
  • The system according to the present invention basically consists of the WebHarvester technology residing on the local server, which executes the WebHarvester procedures regularly and periodically. It gathers information from the target servers and stores it in the local database, where it is accessible to clients. All connections in this system are Internet-based, so they can be universal and transparent throughout the system.
  • Fig. 14 shows the detailed architecture and configuration of the system according to the embodiment of the present invention. It shows how these systems access each other and which elements each of them requires to make that happen.
  • The clients are configured with the application software (the Job Plant), Web Browser, Operating System, and Network Services programs.
  • The target servers are configured with Database Management System, Web Server, Operating System, and Network Services programs.
  • The local server, which implements the present invention, is configured with the WebHarvester, Database Management System, Web Server, Operating System, and Network Services programs.
  • Fig. 15 shows how the information is passed between systems and the steps taken to get the data to input into the database and expose it to the clients in the system according to the embodiment of the present invention.
  • In the local server implementing the WebHarvester of the present invention, there are templates and database tables.
  • The local server sends connection commands and query statements to the target server of a specified web site to generate a target page on the target server. The generated target page is sent back to the local server.
  • In the filter step, the necessary information is extracted from the generated target page according to the predetermined template residing in the local server for this site. In the format & post step, a target page is created in the local server according to a predetermined standard format, and the page containing the necessary information in standard format is stored in the database tables.
  • The necessary information is then available to web users.
  • The local server can return to the client the results containing the information the user is interested in.
  • Industrial applicability: Fig. 2 shows how the steps needed to get the same result as in Fig. 1 are shorter with the system according to the present invention.
  • The user connects to the Internet and calls the URL of the WebHarvester system according to the present invention, then executes a query of the system and gets results A, B, and C automatically, and can then print out the results.
  • The operation of the WebHarvester system according to the present invention is much simpler than that involved in the conventional technology, and the web user or surfer need not move across different web sites by himself.
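
The entity relationships described above for Fig. 11 can be sketched as Python dataclasses. This is only an illustrative reading of the stated cardinalities; the class and field names are assumptions of the sketch, since the actual table layout of Fig. 10 is not reproduced in the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Template:
    body: str  # HTML of the target page plus the WebHarvester marker tags


@dataclass
class Resume:
    title: str
    locations: List[str]   # one or more Location entries
    categories: List[str]  # one or more Category entries

    def __post_init__(self):
        # Enforce the "one or more" cardinality stated for resumes.
        if not self.locations or not self.categories:
            raise ValueError("a resume needs at least one location and one category")


@dataclass
class Job:
    title: str
    location: str  # one Location entry
    category: str  # one Category entry


@dataclass
class Site:
    name: str
    template: Template  # one template entry per target site
    resumes: List[Resume] = field(default_factory=list)  # zero or more


@dataclass
class User:
    name: str  # one record per user
    resumes: List[Resume] = field(default_factory=list)  # zero or more
    jobs: List[Job] = field(default_factory=list)        # zero or more
```

A relational schema would express the same constraints with foreign keys; dataclasses are used here only to keep the sketch short.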

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • Development Economics (AREA)
  • Finance (AREA)
  • Databases & Information Systems (AREA)
  • Strategic Management (AREA)
  • General Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Game Theory and Decision Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention discloses a system for crawling the Web and extracting designated data and the method therefor, i.e. WebHarvester, said system comprises: a computer system; a database configured in the computer system; templates residing in the computer system for mapping information in target page for each web site; fetch means for fetching web pages from said web sites and transferring the fetched pages to said computer system; filter means for scanning the fetched pages to extract necessary information from the fetched pages from said web sites according to corresponding one of said templates, respectively; format and post means for converting the extracted information into a standard format, and storing the formatted information in said database. Said computer system is a server connected to Internet.

Description

A system for crawling the Web and extracting designated data and the method therefor i.e. WebHarvester
Field of the invention
The present invention relates to a system for crawling the Web and extracting designated data and the method therefor, i.e. WebHarvester, and in particular to a system that allows a user to fetch, strip and store necessary information from different web sites in an organised and manageable way, and the method therefor.
Technical background
Currently, in the Job Hunting and Career Opportunities industry and other supply/demand businesses on the World Wide Web, the key is to store as many titles as possible in the database, to give a universal, more complete environment to both job seekers and providers. It benefits both sides to be exposed to more targeted viewers. None of the related parties will object to the chance of being seen by more Web surfers as long as their rights are reserved and there is no risk of losing a particular business.
However, when visiting existing web sites, a web surfer has difficulty getting the information he is interested in from different web sites. Specifically, as shown in Fig. 1, assume a user wants to get necessary information from web sites A, B and C of the same category. First, he must connect to the Internet and call the URL of system A, then execute the query of system A and get result A; then he must call the URL of system B and execute the query of system B to get result B; after that, he has to call the URL of system C, execute the query of system C, and get result C. Thereafter, the results from sites A, B, and C must be combined and printed out by means of a print device. One can imagine that if the user wants to get more information from more web sites, the operation becomes more complicated and time-consuming.
Take the Real Estate business as another example. Assume one is looking for a home to buy and wants to find it through the WWW. First he should find the URLs of real estate companies. Say he finds five addresses; he should then visit all five sites and go through their interfaces to find the kind of house he wants. Some sites are divided into areas and let him search in each area. Some have categorized homes. So he has to deal with different interfaces and find his way through five different logics. He will have to search throughout five sites and print out the lists separately. It is not easy for this user to get what he really wants.
Therefore, the current technology for visiting web sites is not efficient and impedes quick communication of the necessary information to the targeted information receiver, which is favored by neither the web site service provider nor the user or surfer of the web.
Object of the invention
Therefore, there is a need to provide such an environment for both the information provider and the targeted viewers by gathering information from some or all of the related Web sites in a regular and up-to-date manner, and providing the gathered information to the targeted information receivers. It is an object of the present invention to provide a system/method for crawling the Web and extracting designated data, which can meet the ever-present desire of people in charge of providing and maintaining information for web sites to keep their content up-to-date and accurate, especially for sites whose content has to change on a daily or even hourly basis. These people need a way to automate the updating process, so that content gets updated with no or minimal manual operation.
It is another object of the present invention to provide a system/method for crawling the Web and extracting designated data, which can look at other sites similar to the local one and automatically gather recent and useful data from them, so that the local site can reflect the information in many other sites, integrated and without the hassle of having to move across different sites.
Summary of the invention
To achieve the above objects of the invention, the present invention provides a system for crawling the Web and extracting designated data, comprising: a computer system; a database configured in the computer system; templates residing in the computer system for mapping information in the target page for each web site; fetch means for fetching web pages from said web sites and transferring the fetched pages to said computer system; filter means for scanning the fetched pages to extract necessary information from the fetched pages from said web sites according to the corresponding one of said templates, respectively; format & post means for converting the extracted information into a standard format, and storing the formatted information in said database.
The present invention further provides a method for crawling the Web and extracting designated data in a computer system configured with a database, comprising the steps of: creating and keeping templates for mapping information in the target page for each web site in the computer system; fetching web pages from said web sites and transferring the fetched pages to said computer system; filtering the fetched pages to extract the necessary information from the fetched pages from said web sites according to the corresponding one of the templates, respectively; formatting the extracted information into a standard format, and storing the formatted information in said database.
This inventive technology is also called "WebHarvester", which reflects the function of the present invention. This WebHarvester allows a user to fetch, strip and store necessary information from different web sites in an organized and manageable way, and is able to recognize and distinguish desired information in various web sites of the same category amongst all other unnecessary data, codes and programming tags. Its final step is to store the extracted information in the database for future use. Stored data will be fully manageable and free of unwanted information. The uniqueness of WebHarvester is that it performs these tasks automatically and, once defined, retrieves the targeted information reliably with no need for further control.
The foregoing and other objects, features and advantages of the invention will become more readily apparent from the following detailed description of preferred embodiments of the invention which proceeds with reference to the accompanying drawings.
Brief description of the drawings
Fig. 1 shows the conventional way of surfing and getting results by looking at different web sites;
Fig. 2 shows how the steps needed for getting the same result are shorter with the system according to the present invention;
Fig. 3 shows the schematic diagram of the basic elements of this invention;
Fig. 4 shows the configuration of the WebHarvester system according to the present invention;
Fig. 5 shows the operations involved when retrieving information with the WebHarvester system according to the present invention;
Fig. 6 shows a web page containing information to be imported to the local database;
Fig. 7 shows the regular HTML version of the page shown in Fig. 6;
Fig. 8 shows the HTML coded template page in correspondence to the page shown in Fig. 7, added tags are underlined;
Fig. 9 shows mapping the information contained in the web page of a web site into a target page;
Fig. 10 shows the structure of the entities of the database in the system according to an embodiment of the present invention;
Fig. 11 shows the relationship between the entities of the database in the system according to the embodiment of the present invention;
Fig. 12 shows the relationship in more detail between the entities involved in the system according to the embodiment of the present invention;
Fig. 13 shows the basic architecture of the system according to an embodiment of the present invention;
Fig. 14 shows the detail architecture and configuration of the system according to the embodiment of the present invention;
Fig. 15 shows how the information is passed between systems and the steps taken to get the data to input into the database and expose it to the clients in the system according to the embodiment of the present invention;
Fig. 16 shows the operation for local server to return the results which contain information the user is interested in to the client when a client inputs request.
Preferred embodiments of the invention
The fundamental need of the WebHarvester system is to retrieve information of the same kind from other sites, and to gather information from some or all of the related Web sites in a regular and up-to-date manner. Fig. 3 shows the schematic diagram of the basic elements of this invention. First, the system connects to one of the sites and gets the web page from the site, then manipulates the page in a predetermined manner, then formats and posts the page to change it into a standard page specified in the local site.
Fig. 4 shows the configuration of the WebHarvester system according to the present invention. The WebHarvester system comprises a computer system; a database configured in the computer system; templates residing in the computer system for mapping information in target page for each web site; fetch means for connecting to said web sites, performing query statements to retrieve respective pages of said web sites, and transferring the retrieved pages to said computer system; filter means for scanning the retrieved pages, extracting the necessary information from the respective retrieved page of each site according to the corresponding template for each site; format and post means for changing the format of the taken information to a standard format, and storing the formatted information in said database. This system provides information from a plurality of web sites to the web users automatically.
To give a clear picture of the invention, we will concentrate on Job Opportunities as an example. The objective is to be able to distinguish needed information from unwanted information and store it correctly in the database. To do this, the WebHarvester system needs enough information about the site from which it is going to get the information. This information, which lets the system understand the target site and is unique to each site, is called a template. The template shows the specific behavior of the site. Based on that and on special meaningful tags in the template, the system knows where and how to extract the requested piece of information from the site. The detailed operations involved in the WebHarvester system for providing information from a plurality of web sites to web users according to the present invention are shown in Fig. 5.
As shown in Fig. 5, there are three basic elements in the WebHarvester system, which also form the steps taken to achieve the objectives mentioned above, according to the method of the present invention. These three elements are Fetch, Filter, and Format & Post.
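
A minimal Python sketch of how the three elements could be chained is given here for illustration only. The function names, the `sites` dictionaries, the `job` table and the in-memory SQLite database are assumptions of this sketch and do not appear in the specification; the page source is passed in as a callable so that no real site is contacted.

```python
import re
import sqlite3


def fetch(url, get_page):
    """Fetch: obtain the raw page of a target site."""
    return get_page(url)


def filter_page(raw_html, pattern):
    """Filter: keep only the field captured by the site's pattern."""
    match = re.search(pattern, raw_html)
    return {"title": match.group(1)} if match else None


def format_and_post(record, site_name, db):
    """Format & Post: normalise whitespace and store the record locally."""
    title = " ".join(record["title"].split())
    db.execute("INSERT INTO job (site, title) VALUES (?, ?)", (site_name, title))


def harvest(sites, get_page, db):
    """Run Fetch, Filter, and Format & Post for every configured site."""
    db.execute("CREATE TABLE IF NOT EXISTS job (site TEXT, title TEXT)")
    for site in sites:
        raw_html = fetch(site["url"], get_page)
        record = filter_page(raw_html, site["pattern"])
        if record is not None:
            format_and_post(record, site["name"], db)
    db.commit()
```

Each site carries its own pattern, which stands in for the per-site template discussed in the text.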
Fetch element is the first element in Webharvester system, its duty is to find the site and proper page and transfer information to local host. This action is done based on specific information available from the site and according to the format and rules of this site. In most of the sites, information is not stored as static pages and is formed dynamically in time of querry. In order to get necessary information from these sites, Fetch element should first contact to the target site, execute the querry statements, and produce the dynamic page including the result. To do this system should send accurate parameters specific to the search engine, attached to URL of the target site. These parameters are previously examined and stored in database and are able to generate results. Only after this Fetch operation can transfer useful data to the local host. The basic syntax for Fetch command is like bellow:
Connection Command + Web Page Address + Parameters
"Connection Command" is the keyword for connecting to the target site, "Web Page Address" is the URL of the target page, and "Parameters" is the set of codes necessary to generate the page or pages containing the result. The detailed syntax of the command line depends mostly on the characteristics of the target site.
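The "Web Page Address + Parameters" part of the Fetch command can be illustrated with a minimal Python sketch. The host name, the parameter names, and the helper `build_fetch_url` are hypothetical and are not taken from the specification; the sketch only shows how stored, site-specific parameters might be attached to the URL of a target page before the page is requested.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_fetch_url(page_address: str, parameters: dict) -> str:
    """Attach site-specific query parameters to the URL of the target page."""
    parts = urlsplit(page_address)
    extra = urlencode(parameters)
    # Keep any query string already present in the address, then add ours.
    query = "&".join(q for q in (parts.query, extra) if q)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


# Hypothetical target site and parameters, for illustration only.
url = build_fetch_url(
    "http://jobs.example.com/search.cgi",
    {"category": "engineering", "location": "Beijing"},
)
print(url)
```

The resulting URL would then be handed to whatever plays the role of the connection command, for example `urllib.request.urlopen` in Python; that step is left out here so the sketch runs without network access.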
Once the data is on the host system, the Filter element scans it and removes the unnecessary parts, such as HTML tags, unrelated information, and codings, from the main part. The remaining information should be the main body of the resume or job opportunity posted on the other site, as determined by the template. Separating desired information from unwanted information must be done precisely and accurately, to avoid putting garbage into the database on the one hand and losing necessary information on the other. WebHarvester deals with web sites by keeping a template for every single site in the system. The format of this template is identical to that of the related page, plus some native tags that are recognizable by the Filter element and show exactly where in the target page the necessary information resides. Each template is produced in advance according to the features and rules of the respective web site. By compiling these tags, the Filter element is able to distinguish and extract data from the page, specify its type, and pass it to the next element.
After receiving data from the Filter element, the Format & Post element changes the format of the imported information to conform to the rules and standards of the local system. In this way, the naming and conventions of the different documents are changed to those applied in the Job Plant.
The Fetch element is responsible for contacting the target site, finding the proper pages, and transferring them to the local server. To do this, the element must know the behavior of the target site and how to retrieve the pages containing the right information. This requires the system to keep the necessary information and send it at retrieval time. Basically, this information consists of the commands and directions, attached to the basic URL, that are necessary to produce the dynamic pages of the target site with actual data in them. It changes from site to site depending on the site's structure and underlying technology, so each site must be studied and resolved separately. Fig. 6 shows an example of a web page containing information that one would want to import into the local database.
Fig. 7 shows the regular HTML version of the page shown in Fig. 6.
The system stores an HTML-coded template page corresponding to this page. The content of the template page is the exact HTML version of the real page plus special tags, known to the WebHarvester system, which are added to mark the areas of the page that contain required information. This template page is shown in Fig. 8, with the added tags underlined. These tags are coded like <IMS_JOB_XXXX>, in which "XXXX" is any title related to the subject and "IMS_JOB" is the code that makes the tag obvious to the system and distinguishable from ordinary tags. The coding between the starting and ending WebHarvester tags varies with the format of the original page. Given this template page, the WebHarvester system scans the page looking for the special tags that it knows and ignoring everything else. These special tags indicate the kind of information that follows; for example, <IMS_JOB_TITLE> shows that the information following is the title of the job. After finding the tag, the system compiles the coding that follows it up to the closing tag, which, as a general rule for HTML tags, starts with a slash "/" after the opening bracket (i.e., </IMS_JOB_TITLE>). The coding between the two tags basically describes the type and format of the information expected right after the tags. For example:
<IMS_JOB_TITLE>\(.*\)</P></IMS_JOB_TITLE>
tells the system to expect, after the tag, anything of any type until it hits the </P> tag, and then to stop collecting information. In the example
<IMS_JOB_DATE>\([0-9]*/[0- [the remainder of this expression appears only as an image, Figure imgf000012_0001]
the starting tag shows that the following information is expected to be in date format and will be stored as a date. The rest of the coding says that the incoming data may be any digits between 0 and 9, separated by slashes "/" for month, day, and year, followed by spaces or tabs until </P> is found. The ending tag pairs with the opening tag by adding a "/". The </P> in the coding part is specific to this page; in other pages or on other sites it could be anything else. It can only be determined by studying the target sites and pages and finding out which tag or symbol follows the actual information to be captured. This method is used to map the information in target pages, so there is no need to understand every part of a page: everything except the known tags is ignored. Fig. 9 shows how the information contained in a web page of a web site is mapped into a target page in the database.
The WebHarvester system sees only the marked areas and ignores all other parts. In this way, the system does not have to deal with any other piece of information or code on the page.
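The template mechanism described above can be sketched in Python as follows. This is a simplified illustration under several assumptions, not the implementation of the specification: the ordinary HTML of the template is matched literally against the fetched page, the `\( ... \)` group notation is translated to Python's named groups, and the date coding is written out from the prose description (digits and slashes for month, day, and year, then spaces or tabs, then </P>) because the printed expression is truncated. The sample template, sample page, and function names are hypothetical.

```python
import re

# A WebHarvester-style tag pair such as <IMS_JOB_TITLE> ... </IMS_JOB_TITLE>.
IMS_TAG = re.compile(r"<(IMS_JOB_[A-Z]+)>(.*?)</\1>", re.DOTALL)


def template_to_regex(template: str) -> re.Pattern:
    """Turn a template page into one regular expression for the real page."""
    parts, pos = [], 0
    for m in IMS_TAG.finditer(template):
        # Ordinary HTML between the special tags is matched literally.
        parts.append(re.escape(template[pos:m.start()]))
        name, coding = m.group(1), m.group(2)
        # Translate the \( ... \) notation into a Python group named after the tag.
        coding = coding.replace(r"\(", f"(?P<{name}>", 1).replace(r"\)", ")", 1)
        parts.append(coding)
        pos = m.end()
    parts.append(re.escape(template[pos:]))
    return re.compile("".join(parts), re.DOTALL)


def extract(template: str, page: str) -> dict:
    """Return the tagged fields found in the page, or an empty dict."""
    match = template_to_regex(template).search(page)
    return match.groupdict() if match else {}


# Hypothetical template and page, for illustration only.
TEMPLATE = (
    "<HTML><BODY>\n"
    r"<P><IMS_JOB_TITLE>\(.*\)</P></IMS_JOB_TITLE>" "\n"
    r"<P><IMS_JOB_DATE>\([0-9]*/[0-9]*/[0-9]*\)[ \t]*</P></IMS_JOB_DATE>" "\n"
    "</BODY></HTML>"
)
PAGE = (
    "<HTML><BODY>\n"
    "<P>System Engineer</P>\n"
    "<P>07/03/1998 </P>\n"
    "</BODY></HTML>"
)

fields = extract(TEMPLATE, PAGE)
print(fields)
```

Matching the surrounding HTML literally keeps the sketch short, but it makes the template brittle: any change in the page layout outside the tagged areas would cause the match to fail, which is consistent with the requirement that each site be studied and its template prepared in advance.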
Unlike the above example, most pages have formatting attributes together with the text, such as size, color, font, and many others. When importing such pages into the local system, we want to carry the information with its format and show it on our page as it appears on the original site, as well as store the pure data in the database. To serve this purpose, we need to encapsulate those formats with the text and carry them over to the local system. To do this, we have to map the formatting attributes exactly as they are. The example below shows how this can be done.
Assuming that the original HTML code of the page looks like this:
<FONT face="palatino" SIZE=+0> <B>System Engineer</B></FONT>
The coding of the template will be as follows:
<IMS_JOB_TITLE> <FONT face="palatino" SIZE=-+[0-9]>\(.*\) </FONT></IMS_JOB_TITLE>
Because many pages are inconsistent in their use of uppercase and lowercase keywords, it is safer to code the template as below:
<IMS_JOB_TITLE> <[Ff][Oo][Nn][Tt] [Ff][Aa][Cc][Ee]="palatino" [Ss][Ii][Zz][Ee]=-+[0-9]>\(.*\) </FONT></IMS_JOB_TITLE>
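The case-tolerant coding can be tried out with a short Python sketch. The helper `either_case` is hypothetical and simply generates the bracketed form used in the template. One assumption is made about the size attribute: it is written here as `[-+]?[0-9]` (an optional sign followed by one digit), which is a guess at the intent of the printed `-+[0-9]` rather than a transcription of it. Unlike the printed template, the closing FONT tag is also made case-tolerant.

```python
import re


def either_case(keyword: str) -> str:
    """Expand 'FONT' into '[Ff][Oo][Nn][Tt]', as in the template coding."""
    return "".join(
        f"[{c.upper()}{c.lower()}]" if c.isalpha() else re.escape(c)
        for c in keyword
    )


# Assumption: the size is an optional sign followed by a single digit.
pattern = re.compile(
    "<" + either_case("FONT") + " " + either_case("FACE") + '="palatino" '
    + either_case("SIZE") + "=[-+]?[0-9]>(.*)</" + either_case("FONT") + ">"
)

samples = [
    '<FONT face="palatino" SIZE=+0> <B>System Engineer</B></FONT>',
    '<font FACE="palatino" size=+0> <B>System Engineer</B></font>',
]
# The captured text keeps the inner <B> markup, so the format travels with the data.
captured = [pattern.search(s).group(1) for s in samples]
print(captured)
```

In Python the same tolerance could be obtained by compiling the plain pattern with the `re.IGNORECASE` flag; the bracketed form is shown because it is the one the template uses and works in regular expression dialects without such a flag.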
Fig. 10 shows the data structure of the database in the system according to an embodiment of the present invention. As described above, the invention is general-purpose; it is not based on a particular subject or limited to any sector. The data model disclosed below supports job opportunities, but it is convertible to other systems. For job opportunities, as shown in Fig. 10, seven major entities support the system of the invention: the site entity, template entity, resume entity, job entity, user entity, job category entity, and location entity. All the entities are listed in Table 1 below.
Table 1 [available only as images: Figure imgf000014_0001, Figure imgf000015_0001]
The attributes of the entities are shown in Tables 2-8.
Table 2. Entity Name: USER [available only as an image: Figure imgf000015_0002]
Table 3. Entity Name: SITE [available only as images: Figure imgf000016_0002, Figure imgf000017_0001]
Table 4. Entity Name: TEMPLATE [available only as an image: Figure imgf000017_0002]
Table 5. Entity Name: RESUME [available only as images: Figure imgf000017_0003, Figure imgf000018_0001]
Table 6. Entity Name: JOB [available only as images: Figure imgf000018_0002, Figure imgf000019_0001]
Table 7. Entity Name: CATEGORY [available only as an image: Figure imgf000019_0002]
Table 8. Entity Name: LOCATION [available only as images: Figure imgf000019_0003, Figure imgf000020_0001]
The relationships between entities are shown in Fig. 11. As shown in Fig. 11, each user has a record in the user entity, with no more than one entry per user. Each user may have zero, one, or more than one resume or job opportunity, each tied to the user's entry. Each target site has one entry in the Site entity and one entry in the Template entity. Each site can have zero, one, or more than one entry in the Resume entity. Every resume can have one or more entries in the Location and Category entities. Each job can have one entry in each of the Location and Category entities.
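The cardinalities described above can be expressed as a small relational schema. The following Python sketch uses the standard `sqlite3` module and is only one possible reading of the text: the entity attributes are not available here, so every column name below is hypothetical, and the two link tables are an assumption about how a resume could be tied to more than one location and category.

```python
import sqlite3

# All column names are hypothetical; only the relationships follow the text.
SCHEMA = """
CREATE TABLE user     (user_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE site     (site_id INTEGER PRIMARY KEY, url TEXT NOT NULL);
-- One template per site: the site key is also the template's primary key.
CREATE TABLE template (site_id INTEGER PRIMARY KEY REFERENCES site(site_id),
                       body TEXT NOT NULL);
CREATE TABLE category (category_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE location (location_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- A job belongs to one user and has one category and one location.
CREATE TABLE job (
    job_id      INTEGER PRIMARY KEY,
    user_id     INTEGER NOT NULL REFERENCES user(user_id),
    category_id INTEGER NOT NULL REFERENCES category(category_id),
    location_id INTEGER NOT NULL REFERENCES location(location_id),
    title       TEXT NOT NULL
);
-- A resume belongs to one user and may come from a site.
CREATE TABLE resume (
    resume_id INTEGER PRIMARY KEY,
    user_id   INTEGER NOT NULL REFERENCES user(user_id),
    site_id   INTEGER REFERENCES site(site_id)
);
-- A resume may be linked to several locations and categories.
CREATE TABLE resume_location (
    resume_id   INTEGER NOT NULL REFERENCES resume(resume_id),
    location_id INTEGER NOT NULL REFERENCES location(location_id),
    PRIMARY KEY (resume_id, location_id)
);
CREATE TABLE resume_category (
    resume_id   INTEGER NOT NULL REFERENCES resume(resume_id),
    category_id INTEGER NOT NULL REFERENCES category(category_id),
    PRIMARY KEY (resume_id, category_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the REFERENCES clauses
conn.executescript(SCHEMA)

conn.execute("INSERT INTO user VALUES (1, 'A. Seeker')")
conn.execute("INSERT INTO location VALUES (1, 'Beijing')")
conn.execute("INSERT INTO location VALUES (2, 'Shanghai')")
conn.execute("INSERT INTO resume VALUES (1, 1, NULL)")
conn.executemany("INSERT INTO resume_location VALUES (?, ?)", [(1, 1), (1, 2)])

# One resume tied to two locations.
location_count = conn.execute(
    "SELECT COUNT(*) FROM resume_location WHERE resume_id = 1"
).fetchone()[0]

# A job that refers to a user who does not exist is rejected.
try:
    conn.execute("INSERT INTO job VALUES (1, 99, 1, 1, 'System Engineer')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

Using the site key as the primary key of the template table is one simple way to guarantee at most one template per site; a separate template identifier with a unique constraint on the site column would serve equally well.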
Fig. 12 shows in more detail the relationships between the entities involved in the system according to the embodiment of the present invention. The relationships between some of the entities may vary with the type of business in which they are used. Once again, this implementation is designed to serve a job opportunities system, which is only one of the applications of the WebHarvester system according to the present invention.
Fig. 13 shows the basic architecture of the system according to an embodiment of the present invention. The WebHarvester system according to the present invention requires no proprietary configuration. An Internet connection on each side of the system is the only hardware configuration needed. Configurations may vary slightly under different conditions, but the environments supporting this technology have more or less the same characteristics. The following specification describes the environment most commonly used in the market and is based on an open system architecture.
The environment used for the WebHarvester system according to the embodiment is as follows:
Client
• Interface: Web-based application program (The Job Plant)
• Service Programs: Web browser (Microsoft Internet Explorer, Netscape Browser); Network & Internet service programs (ISP's interface programs)
• Operating System: PC or workstation based OS (Windows 95, Windows NT Client, Mac/OS, Unix or other Unix-based OSs)
• Hardware: PC, Mac, or workstation; 32 MB or more RAM; 40 MB or more free hard disk space; 28.8 Kbps or faster communication line
Server
• Interface: Application program based on WebHarvester (The Job Plant back end)
• Service Programs: Network & Internet service programs (Netscape Web Server); Database Management System (Informix, Oracle)
• Operating System: Server based OS (Windows NT Server, Unix or other Unix-based OSs)
• Hardware: Network connection (leased line connection with at least 1.5 MB/S baud rate); System (any hardware with enough power to run the operating system and database management software with acceptable performance)
The system according to the present invention basically consists of the WebHarvester technology residing on a local server, which executes the WebHarvester procedures regularly and periodically. These procedures gather information from target servers and store it in the local database, where it is accessible to clients. All connections in this system are made over the Internet, so they can be universal and transparent throughout the system. Fig. 14 shows the detailed architecture and configuration of the system according to the embodiment of the present invention. It shows how these systems access each other and which elements each requires to make that happen.
As shown in Fig. 14, the clients are configured with the application software (the Job Plant), a Web browser, an operating system, and network services programs. The target servers are configured with a database management system, a Web server, an operating system, and network services programs. The local server, which implements the present invention, is configured with the WebHarvester, a database management system, a Web server, an operating system, and network services programs.
Fig. 15 shows how information is passed between systems and the steps taken to get the data into the database and expose it to the clients in the system according to the embodiment of the present invention. As shown in Fig. 15, the local server, which implements the WebHarvester of the present invention, holds the templates and database tables. First, in the fetch step, the local server sends connection commands and query statements to the target server of a specified web site, which generates a target page. The generated target page is sent back to the local server. Then, in the filter step, the necessary information is extracted from the generated target page according to the predetermined template for this site residing in the local server. Finally, in the format & post step, a page containing the necessary information is created on the local server in a predetermined standard format and stored in the database tables. At this stage, the necessary information is available to web users.
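The three steps can be strung together in a minimal end-to-end Python sketch. Everything specific in it is an assumption made for illustration: the fetch step is replaced by a stub that returns a canned page so the sketch runs without a network, the template is reduced to one regular expression per field, the standard format is taken to be a trimmed title and an ISO date, and the table and column names are hypothetical.

```python
import re
import sqlite3
from datetime import datetime


def fetch(url: str) -> str:
    """Stand-in for the fetch step: returns a canned page, no network access."""
    return "<P>System Engineer</P>\n<P>07/03/1998 </P>\n"


def filter_page(page: str, template: dict) -> dict:
    """Filter step: pull out each field with the pattern the template gives for it."""
    return {name: re.search(rx, page).group(1) for name, rx in template.items()}


def format_and_post(fields: dict, conn: sqlite3.Connection) -> None:
    """Format & post step: normalise the values and store them in the database."""
    # The source date is month/day/year; store it as an ISO date string.
    posted = datetime.strptime(fields["date"], "%m/%d/%Y").date().isoformat()
    conn.execute(
        "INSERT INTO job (title, posted_on) VALUES (?, ?)",
        (fields["title"].strip(), posted),
    )
    conn.commit()


# Hypothetical per-site template: one pattern per field of interest.
TEMPLATE = {
    "title": r"<P>(.*)</P>",
    "date": r"<P>([0-9]*/[0-9]*/[0-9]*)[ \t]*</P>",
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job (title TEXT NOT NULL, posted_on TEXT NOT NULL)")

page = fetch("http://jobs.example.com/search.cgi?category=engineering")
format_and_post(filter_page(page, TEMPLATE), conn)

rows = conn.execute("SELECT title, posted_on FROM job").fetchall()
print(rows)
```

Keeping the three steps as separate functions mirrors the division into Fetch, Filter, and Format & Post, so that a new site would only need a new template and, where necessary, new fetch parameters.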
As shown in Fig. 16, when a client sends the local server a request via the Internet to browse job opportunities or resumes from different web sites, the local server returns to the client the results containing the information the user is interested in.
Industrial applicability
Fig. 2 shows that the steps needed to get the same result as in Fig. 1 are fewer with the system according to the present invention. First, the user connects to the Internet and calls the URL of the WebHarvester system according to the present invention; then the user executes a query on the system and gets results A, B, and C automatically, and can then print out the results. The operation of the WebHarvester system according to the present invention is evidently much simpler than that of the conventional technology, and the web user or surfer need not move across different web sites himself.
From the above description, one can see that the advantage of the WebHarvester system according to the present invention is that it eliminates the need to visit different sites of the same kind throughout cyberspace in order to bring job providers and job seekers together. The system will be the only address its users need to look at, ideally throughout the world, to meet any need in the area of job and career opportunities. The system functions as a gateway to a wealth of information across different servers, since it can retrieve and import information from remote databases. To summarize, the system:
- Keeps the information up to date, to the last minute;
- Needs minimum effort to collect information;
- Removes the need to visit different sites of the same type, by importing data to the local site;
- Reduces the cost of content maintenance;
- Will get support from the owners of the imported data by widening their target audience.
With this WebHarvester system, using only one URL (that of the WebHarvester system) and dealing with only one interface, the user can get the same or even better results, and more easily than before. With the present invention, the user has all the information in one place and can perform operations such as importing or printing once, which improves efficiency.
Having described and illustrated the principles of the invention in preferred embodiments thereof, it should be apparent that the invention can be modified in arrangement and detail without departing from such principles. All modifications and variations coming within the spirit and scope of the invention are covered in the appended claims.

Claims

What is claimed is:
1. A system for crawling the Web and extracting designated data, comprising: a computer system; a database configured in the computer system; templates residing in the computer system for mapping information in target page for each web site; fetch means for fetching web pages from said web sites and transferring the fetched pages to said computer system; filter means for scanning the fetched pages to extract necessary information from the fetched pages from said web sites according to corresponding one of said templates, respectively; format & post means for converting the extracted information into a standard format, and storing the formatted information in said database.
2. The system according to claim 1, wherein said computer system is a server connected to Internet.
3. The system according to claim 1 or 2, wherein said templates form an entity in said database.
4. The system according to claim 3, wherein said fetching means connects to the target web sites, performs query statements to retrieve target pages.
5. The system according to claim 4, wherein said filter means adds starting and ending tags to each interested information contained in the HTML version of the target page according to the predetermined template for each site, and extracts the coding between each pair of starting tag and ending tag.
6. The system according to claim 5, wherein said format & post means compiles the coding between said starting tag and ending tag into a page in a predetermined format.
7. The system according to claim 6, wherein when a user sends a request for browsing specific contents in said web sites, said computer system sends corresponding information stored in said database to said user.
8. A method for crawling the Web and extracting designated data in a computer system configured with a database, comprising the steps of: creating and keeping templates for mapping information in target page for each web site in the computer system; fetching web pages from said web sites and transferring the fetched pages to said computer system; filtering the fetched pages to extract the necessary information from the fetched pages from said web sites according to corresponding one of the templates, respectively; formatting the extracted information into a standard format, and storing the formatted information in said database.
9. The method according to claim 8, wherein said templates form an entity in said database.
10. The method according to claim 8 or 9, wherein said fetching step comprises the substeps of: connecting to the target sites, executing query statements, and retrieving the target pages.
11. The method according to claim 10, wherein said filtering step comprises the substeps of adding starting and ending tags to each interested information contained in the HTML version of the target page according to the predetermined template for each site, and extracting the coding between each pair of starting tag and ending tag.
12. The method according to claim 11, wherein said formatting step comprises the substep of compiling the coding between said starting tag and ending tag into a page in a predetermined format.
13. The method according to claim 12, wherein when a user sends a request for browsing specific contents in said web sites, the computer system sends information stored in said database to said user.
14. The method according to claim 13, wherein said computer system is a server connected to Internet.
15. The system according to any one of the claims 1-7, wherein said computer system includes network & Internet service programs, database management system, and server based operating system.
PCT/CN1998/000117 1998-07-03 1998-07-03 A system for crawling the web and extracting designated data and the method therefor i.e. webharvester Ceased WO2000002141A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
AU81008/98A AU8100898A (en) 1998-07-03 1998-07-03 A system for crawling the web and extracting designated data and the method therefor i.e. webharvester
PCT/CN1998/000117 WO2000002141A1 (en) 1998-07-03 1998-07-03 A system for crawling the web and extracting designated data and the method therefor i.e. webharvester

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN1998/000117 WO2000002141A1 (en) 1998-07-03 1998-07-03 A system for crawling the web and extracting designated data and the method therefor i.e. webharvester

Publications (1)

Publication Number Publication Date
WO2000002141A1 true WO2000002141A1 (en) 2000-01-13

Family

ID=4575063

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN1998/000117 Ceased WO2000002141A1 (en) 1998-07-03 1998-07-03 A system for crawling the web and extracting designated data and the method therefor i.e. webharvester

Country Status (2)

Country Link
AU (1) AU8100898A (en)
WO (1) WO2000002141A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0747840A1 (en) * 1995-06-07 1996-12-11 International Business Machines Corporation A method for fulfilling requests of a web browser
EP0749081A1 (en) * 1995-06-12 1996-12-18 Pointcast Inc. Information and advertising distribution system and method
WO1997042584A1 (en) * 1996-05-03 1997-11-13 Webmate Technologies, Inc. Client-server application development and deployment system and methods
CN1175035A (en) * 1996-05-06 1998-03-04 微软件公司 Supermedia guiding using soft ultraconnection

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2353614A (en) * 1998-08-26 2001-02-28 Symtec Ltd Building a database of WEB connection data
KR20000054312A (en) * 2000-06-01 2000-09-05 최우석 Establishing provide Method for ordered web information
KR100448177B1 (en) * 2001-03-15 2004-09-10 주식회사 오픈테크 Method for web scraping, and computer readable record medium relating to the same
KR20010067844A (en) * 2001-04-02 2001-07-13 박병준 Method and system for objecting and operating of web contents
KR20020084435A (en) * 2001-05-02 2002-11-09 (주)인포캐스트 Method to collect information automatically on Internet and Media recording computer program to carry out the method
KR20010069940A (en) * 2001-05-21 2001-07-25 주형순 Apparatus and Method for managing public information using internet
FR2825870A1 (en) * 2001-06-06 2002-12-13 Canon Europa Nv Document creation system in gateway uses profile to spread Internet access in time
KR20020030057A (en) * 2002-03-20 2002-04-22 조근식 Service Delivery Agent System for Mobile Devices
KR20030094967A (en) * 2002-06-11 2003-12-18 주식회사 코스모정보통신 Internet document crawling method
KR100463397B1 (en) * 2002-10-30 2004-12-23 한국과학기술정보연구원 Service system and Service method for solving many difficulties in an enterprises, and a storage media for having program source thereof
WO2004079595A2 (en) 2003-03-03 2004-09-16 Raytheon Company System and method for processing electronic data from multiple data sources
WO2004079595A3 (en) * 2003-03-03 2005-01-13 Raytheon Co System and method for processing electronic data from multiple data sources
US7328219B2 (en) 2003-03-03 2008-02-05 Raytheon Company System and method for processing electronic data from multiple data sources
US20130110818A1 (en) * 2011-10-28 2013-05-02 Eamonn O'Brien-Strain Profile driven extraction
US9818141B2 (en) 2014-01-13 2017-11-14 International Business Machines Corporation Pricing data according to provenance-based use in a query
US10453104B2 (en) 2014-01-13 2019-10-22 International Business Machines Corporation Pricing data according to contribution in a query
US9858585B2 (en) 2014-11-11 2018-01-02 International Business Machines Corporation Enhancing data cubes
CN110851690A (en) * 2019-11-14 2020-02-28 北京计算机技术及应用研究所 Method and device for collecting network information of monitoring website

Also Published As

Publication number Publication date
AU8100898A (en) 2000-01-24

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DE DK EE ES FI GB GE GH GM GW HR HU ID IL IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG US UZ VN YU ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW SD SZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: CA