WO2000002141A1

WO2000002141A1 - A system for crawling the web and extracting designated data and the method therefor i.e. webharvester

Info

Publication number: WO2000002141A1
Application number: PCT/CN1998/000117
Authority: WO
Inventors: Fujun Bi; Shaun Bliss; Hong Yan
Original assignee: Individual
Current assignee: Individual
Priority date: 1998-07-03
Filing date: 1998-07-03
Publication date: 2000-01-13
Anticipated expiration: 2001-01-03
Also published as: AU8100898A

Abstract

The present invention discloses a system for crawling the Web and extracting designated data and the method therefor, i.e. WebHarvester, said system comprises: a computer system; a database configured in the computer system; templates residing in the computer system for mapping information in target page for each web site; fetch means for fetching web pages from said web sites and transferring the fetched pages to said computer system; filter means for scanning the fetched pages to extract necessary information from the fetched pages from said web sites according to corresponding one of said templates, respectively; format and post means for converting the extracted information into a standard format, and storing the formatted information in said database. Said computer system is a server connected to Internet.

Description

A system for crawling the Web and extracting designated data and the method therefor i.e. WebHarvester

Field of the invention

The present invention relates to a system for crawling the Web and extracting designated data and the method therefor, i.e. WebHarvester, and in particular, relates to a system that allows user to fetch, strip and store necessary information from different web sites in an organised and manageable way, and the method therefor.

Technical background

Currently, in the job hunting and career opportunities industry, and in other supply/demand businesses on the World Wide Web, the key is to store as many titles as possible in the database so as to give a universal, more complete environment to both job seekers and providers. It is to the benefit of both sides to be exposed to more targeted viewers. None of the related parties will object to the chance of being seen by more Web surfers as long as their rights are reserved and there is no risk of losing a particular business.

However, when visiting the existing web sites, a web surfer has difficulty getting the information he is interested in from different web sites. Specifically, as shown in Fig. 1, assume a user wants to get the necessary information from web sites A, B and C of the same category. First, he must connect to the Internet and call the URL of system A; then he executes a query of system A and gets result A. Then he must call the URL of system B and execute a query of system B, thereby getting result B. After that, he has to call the URL of system C, execute a query of system C, and get result C. Thereafter, the results from sites A, B, and C must be combined and printed out by means of a print device. One can imagine that, if the user wants to get more information from still more web sites, the operation becomes even more complicated and time consuming.

Take the real estate business as another example. Assume someone is looking for a home to buy and wants to find his target home through the WWW. First he should find the URLs of real estate companies. Say he finds five addresses; he should then visit all five sites and go through their interfaces to find the kind of houses he wants. Some sites are divided into areas and let him search within each area. Some have categorized homes. So he has to deal with different interfaces and find his way through five different logics. He will have to make search efforts throughout five sites and print out the lists separately. It is not easy for this user to get what he really wants.

Therefore, the current technology for visiting web sites is not efficient, and it impedes quick communication of the necessary information to the targeted information receiver, which is favored by neither the web site service provider nor the user or surfer of the web.

Object of the invention

Therefore, there is a need to provide such an environment for both the information provider and the targeted viewers by gathering information from some or all of the related Web sites in a regular and up-to-date manner, and providing the gathered information to the targeted information receivers. It is an object of the present invention to provide a system/method for crawling the Web and extracting designated data, which can meet the ever-present desire of people who are in charge of providing and maintaining information for web sites to keep their content up-to-date and accurate, especially for sites whose content has to change on a daily or even hourly basis. These people need a way to automate the process of updating, so that content gets updated with no manual operation or with a minimum of it.

It is another object of the present invention to provide a system/method for crawling the Web and extracting designated data, which can look at other sites similar to the local one and automatically gather recent and useful data from them, so that the local site can reflect the information in many other sites in integrated form, without the hassle of having to move across different sites.

Summary of the invention

To achieve the above objects of the invention, the present invention provides a system for crawling the Web and extracting designated data, comprising: a computer system; a database configured in the computer system; templates residing in the computer system for mapping information in target page for each web site; fetch means for fetching web pages from said web sites and transferring the fetched pages to said computer system; filter means for scanning the fetched pages to extract necessary information from the fetched pages from said web sites according to corresponding one of said templates, respectively; format & post means for converting the extracted information into a standard format, and storing the formatted information in said database.

The present invention further provides a method for crawling the Web and extracting designated data in a computer system configured with a database, comprising the steps of: creating and keeping templates for mapping information in target page for each web site in the computer system; fetching web pages from said web sites and transferring the fetched pages to said computer system; filtering the fetched pages to extract the necessary information from the fetched pages from said web sites according to corresponding one of the templates, respectively; formatting the extracted information into a standard format, and storing the formatted information in said database.

This inventive technology is also called "WebHarvester", which reflects the function of the present invention. WebHarvester allows the user to fetch, strip and store necessary information from different web sites in an organized and manageable way, and is able to recognize and distinguish the desired information in various web sites of the same category amongst all other unnecessary data, codes and programming tags. Its final step is to store the information in the database for future use. The stored data will be fully manageable and free of unwanted information. The uniqueness of WebHarvester is that it does the mentioned tasks automatically and, once defined, gets the targeted information reliably with no need for further control.

The foregoing and other objects, features and advantages of the invention will become more readily apparent from the following detailed description of preferred embodiments of the invention which proceeds with reference to the accompanying drawings.

Brief description of the drawings

Fig. 1 shows the conventional way of surfing and getting results by looking at different web sites;

Fig. 2 shows how the steps needed for getting the same result are shorter with the system according to the present invention;

Fig. 3 shows the schematic diagram of the basic elements of this invention;

Fig. 4 shows the configuration of the WebHarvester system according to the present invention;

Fig. 5 shows the operations involved when retrieving information with the WebHarvester system according to the present invention;

Fig. 6 shows a web page containing information to be imported to the local database;

Fig. 7 shows the regular HTML version of the page shown in Fig. 6;

Fig. 8 shows the HTML coded template page in correspondence to the page shown in Fig. 7, in which the added tags are underlined;

Fig. 9 shows mapping the information contained in the web page of a web site into a target page;

Fig. 10 shows the structure of the entities of the database in the system according to an embodiment of the present invention;

Fig. 11 shows the relationship between the entities of the database in the system according to the embodiment of the present invention;

Fig. 12 shows the relationship in more detail between the entities involved in the system according to the embodiment of the present invention;

Fig. 13 shows the basic architecture of the system according to an embodiment of the present invention;

Fig. 14 shows the detailed architecture and configuration of the system according to the embodiment of the present invention;

Fig. 15 shows how the information is passed between systems and the steps taken to get the data, input it into the database and expose it to the clients in the system according to the embodiment of the present invention;

Fig. 16 shows the operation by which the local server returns the results containing the information the user is interested in to the client when a client inputs a request.

Preferred embodiments of the invention

The fundamental need of the WebHarvester system is to retrieve information of the same kind from other sites, and to gather information from some or all of the related Web sites in a regular and up-to-date manner. Fig. 3 shows the schematic diagram of the basic elements of this invention. First, the system connects to one of the sites and gets the web page from the site; it then manipulates the page in a predetermined manner; finally, it formats and posts the page so as to change it into a standard page specified in the local site.

Fig. 4 shows the configuration of the WebHarvester system according to the present invention. The WebHarvester system comprises a computer system; a database configured in the computer system; templates residing in the computer system for mapping information in target page for each web site; fetch means for connecting to said web sites, performing query statements to retrieve respective pages of said web sites, and transferring the retrieved pages to said computer system; filter means for scanning the retrieved pages, extracting the necessary information from the respective retrieved page of each site according to the corresponding template for each site; format and post means for changing the format of the taken information to a standard format, and storing the formatted information in said database. This system provides information from a plurality of web sites to the web users automatically.

To create a clear picture of the invention we will concentrate on job opportunities as an example. The objective is to be able to distinguish the needed information from the unwanted, and to store it correctly in the database. To do this, the WebHarvester system needs to have enough information about the site from which it is going to get the information. Using this information, the system will be able to understand the target site. This information is unique to each site, and we call it a template. The template shows the specific behavior of the site. Based on that and on special meaningful tags in the template, the system will know where and how to extract the requested piece of information from the site. The detailed operations involved in the WebHarvester system for providing information from a plurality of web sites to the web users according to the present invention are shown in Fig. 5.

As shown in Fig. 5, there are three basic elements in the WebHarvester system, which also form the steps taken to achieve the objectives mentioned above, according to the method of the present invention. These three elements are Fetch, Filter, and Format & Post.
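The three elements can be read as stages of a simple pipeline. The following is only a minimal sketch of that idea, not the implementation disclosed here: the function names (`fetch`, `filter_page`, `format_and_post`, `harvest`), the in-memory `FAKE_SITES` pages that stand in for real target sites, and the one-expression templates are all assumptions made for illustration.

```python
import re
from typing import Dict, List

# Hypothetical stand-in for the target sites: site name -> page text.
# A real Fetch element would retrieve these pages over the Internet.
FAKE_SITES: Dict[str, str] = {
    "site_a": "<html><b>Title:</b> System Engineer<br></html>",
    "site_b": "<html><h1>Job: Database Administrator</h1></html>",
}

# Hypothetical per-site templates, reduced here to one regular expression each.
TEMPLATES: Dict[str, str] = {
    "site_a": r"<b>Title:</b>\s*(.*?)<br>",
    "site_b": r"<h1>Job:\s*(.*?)</h1>",
}


def fetch(site: str) -> str:
    """Fetch step: return the raw page of a site (simulated, no network)."""
    return FAKE_SITES[site]


def filter_page(site: str, page: str) -> Dict[str, str]:
    """Filter step: keep only what the template of this site marks as needed."""
    match = re.search(TEMPLATES[site], page)
    return {"title": match.group(1)} if match else {}


def format_and_post(site: str, record: Dict[str, str], database: List[dict]) -> None:
    """Format & Post step: put the record into one local layout and store it."""
    if record:
        database.append({"source": site, "title": record["title"].strip()})


def harvest(sites: List[str]) -> List[dict]:
    """Run Fetch, Filter, and Format & Post for every listed site."""
    database: List[dict] = []
    for site in sites:
        page = fetch(site)
        record = filter_page(site, page)
        format_and_post(site, record, database)
    return database


for row in harvest(["site_a", "site_b"]):
    print(row)
```

Keeping each stage behind its own function mirrors the separation described above: only `fetch` and the template table depend on the target site, while the storage step sees one uniform record layout.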

The Fetch element is the first element in the WebHarvester system. Its duty is to find the site and the proper page and to transfer the information to the local host. This action is done based on specific information available from the site and according to the format and rules of this site. In most sites, information is not stored as static pages but is formed dynamically at the time of the query. In order to get the necessary information from these sites, the Fetch element should first contact the target site, execute the query statements, and produce the dynamic page including the result. To do this, the system should send accurate parameters specific to the search engine, attached to the URL of the target site. These parameters are examined in advance and stored in the database, and are able to generate results. Only after this can the Fetch operation transfer useful data to the local host. The basic syntax of the Fetch command is as below:

Connection Command + Web Page Address + Parameters

"Connection Command" is the keyword for connecting to the target site, "Web Page Address" is the URL of the target page, and "Parameters" is the set of coding necessary to generate the page or pages with the result. The detailed syntax of the command line depends mostly on the characteristics of the target site.
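As an illustration of attaching parameters to the URL of the target page, the sketch below builds such a request string with the Python standard library. The address `http://jobs.example.com/search.cgi`, the parameter names, and the helper `build_fetch_request` are hypothetical; in this reading, the "Connection Command" would correspond to issuing an HTTP request for the resulting URL, which the sketch deliberately does not perform.

```python
from urllib.parse import urlencode


def build_fetch_request(page_address: str, parameters: dict) -> str:
    """Attach the site-specific query parameters to the target page URL."""
    query = urlencode(parameters)
    if not query:
        return page_address
    # Append to an existing query string if the address already has one.
    separator = "&" if "?" in page_address else "?"
    return f"{page_address}{separator}{query}"


url = build_fetch_request(
    "http://jobs.example.com/search.cgi",            # hypothetical Web Page Address
    {"category": "engineering", "city": "Beijing"},  # hypothetical Parameters
)
print(url)
# A real Fetch element would now open this URL, for example with
# urllib.request.urlopen(url), and hand the returned page to the Filter element.
```

Storing the address and the parameter set per site, as the text describes, means that only this small piece of data has to change when a new target site is added.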

Once the data is in the host system, the Filter element should scan it and take the unnecessary parts, such as HTML tags, unrelated information, and codings, out of the main part. The remaining information is supposed to be the main body of the resume or job opportunity posted on the other site, as identified by the template. Stripping the desired information from the unwanted must be done very precisely and accurately, in order to avoid inputting garbage to the database on the one hand and losing necessary information on the other. The way WebHarvester deals with web sites is to keep a template for every single site in the system. The format of this template is identical to the related page plus some native tags which are recognizable by the Filter element and show exactly where in the target page the necessary information resides. Each of the templates is produced in advance according to the features and rules of the respective web site. By compiling these tags, the Filter element is able to distinguish and extract data from the page, specify its type, and pass it to the next element.

After receiving data from Filter element, Format & Post element will change the format of imported information and form it according to the rules and standards of the local system. In this way, the naming and conventions of different documents should be changed to ones applied to the Job Plant.
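A possible reading of the Format & Post step is sketched below: fields extracted from different sites are renamed to one local naming convention, dates are converted to a single layout, and the result is stored in a database table. The field maps, the date layouts, the `job` table, and the use of an in-memory SQLite database are all assumptions for illustration; the text does not prescribe them.

```python
import sqlite3
from datetime import datetime

# Hypothetical mapping from the field names of each source site to local names.
FIELD_MAP = {
    "site_a": {"JobTitle": "title", "Posted": "posted_on"},
    "site_b": {"position": "title", "date": "posted_on"},
}
# Hypothetical date layouts used by each source site.
DATE_FORMATS = {"site_a": "%m/%d/%Y", "site_b": "%d-%m-%Y"}


def to_standard(site: str, extracted: dict) -> dict:
    """Rename fields and convert dates to one local standard (ISO 8601)."""
    record = {FIELD_MAP[site][key]: value.strip() for key, value in extracted.items()}
    parsed = datetime.strptime(record["posted_on"], DATE_FORMATS[site])
    record["posted_on"] = parsed.date().isoformat()
    record["source"] = site
    return record


def post(connection: sqlite3.Connection, record: dict) -> None:
    """Store one standardised record in the local database."""
    connection.execute(
        "INSERT INTO job (source, title, posted_on) "
        "VALUES (:source, :title, :posted_on)",
        record,
    )


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE job (source TEXT, title TEXT, posted_on TEXT)")
post(connection, to_standard("site_a", {"JobTitle": " System Engineer ", "Posted": "07/03/1998"}))
post(connection, to_standard("site_b", {"position": "DBA", "date": "03-07-1998"}))
rows = connection.execute(
    "SELECT source, title, posted_on FROM job ORDER BY source"
).fetchall()
print(rows)
```

After this step both records carry the same field names and the same date layout, even though the two source sites used different conventions.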

The Fetch element is responsible for contacting the target site, finding the proper pages and transferring them to the local server. In order to do this, the element should know about the behavior of the target site and how it can retrieve the pages that include the right information. This requires that the system keep the necessary information and send it at the time of retrieval. Basically, this information consists of the commands and directions, attached to the basic URL, that are necessary to produce the dynamic pages of the target site with actual data in them. It changes from site to site according to the structure of the site and the technology on which it resides, so each site should be studied and resolved separately. The web page shown in Fig. 6 is an example of a page containing information one would want to import to the local database.

Fig. 7 shows the regular HTML version of the page shown in Fig. 6.

The system stores an HTML coded template page in correspondence to this page. The content of the template page is the exact HTML version of the real page plus special tags, known to the WebHarvester system, which are added in order to mark the areas in the page that include the required information. This template page is shown in Fig. 8, where the added tags are underlined. These tags are coded like <IMS_JOB_XXXX>, where "XXXX" is any title related to the subject and "IMS_JOB" is the code that makes the tag recognizable to the system and distinct from ordinary tags. The coding between the starting and ending WebHarvester tags varies with the format of the original page. Given this template page, the WebHarvester system scans the page looking for the special tags that it knows and ignores anything else. These special tags indicate the kind of information that comes after them; for example, <IMS_JOB_TITLE> shows that the information following is the title of the job. After finding the tag, the system compiles the coding after it until the closing tag which, as a general rule for HTML tags, starts with a slash "/" after the opening bracket (i.e., </IMS_JOB_TITLE>). The coding between the two tags basically describes the type and format of the information expected right after the tags. For example:

<IMS_JOB_TITLE>\(.*\)</IMS_JOB_TITLE>

tells the system: after the tags, expect anything of any type until you hit the terminating tag, then stop getting information. In this example:

<IMS_JOB_DATE>\([0-9]*/[0-

the starting tag shows that the following information is supposed to be in date format and will be stored as a date. The rest of the coding says that the incoming data could be any digits between 0 and 9 followed by a slash "/" for month, day and year, followed by a space or Tab until it finds the terminating tag. The ending tag pairs with the opening tag, with an extra "/". The terminating tag in the coding part is specific to this page, and in other pages or other sites it could be anything else. It can only be known by studying the target sites and pages and finding out which tag or symbol follows the actual information required to be captured. This method is used to map the information in target pages, so there is no need to understand all parts of them; everything except the known tags is ignored. Fig. 9 shows the mapping of the information contained in the web page of a web site into a target page in the database.

The WebHarvester system sees only the areas which are marked and ignores all other parts. This way, the system does not have to deal with any other piece of information or code on the page.
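The tag-based mapping can be sketched as follows. This is a simplified interpretation, not the disclosed implementation: it collects the coding found between each pair of IMS_JOB tags in a template, rewrites the `\(` and `\)` group markers into the `(` and `)` that Python's `re` module expects, and applies each coding to the fetched page while ignoring everything outside the tags. The template, the page, and the complete date expression used here are hypothetical, and other differences between regular expression dialects are not handled.

```python
import re
from typing import Dict

# An opening WebHarvester tag, the coding inside it, and the paired closing tag.
TAG = re.compile(r"<IMS_JOB_(\w+)>(.*?)</IMS_JOB_\1>", re.DOTALL)


def field_patterns(template: str) -> Dict[str, "re.Pattern[str]"]:
    """Collect the coding of every tagged area; the rest of the template is ignored."""
    patterns = {}
    for tag in TAG.finditer(template):
        # The template writes groups as \( \); Python's re module writes ( ).
        coding = tag.group(2).strip().replace(r"\(", "(").replace(r"\)", ")")
        patterns[tag.group(1).lower()] = re.compile(coding)
    return patterns


def extract(template: str, page: str) -> Dict[str, str]:
    """Apply the coding of each field to the fetched page and keep the captured text."""
    result = {}
    for name, pattern in field_patterns(template).items():
        match = pattern.search(page)
        if match:
            result[name] = match.group(1).strip()
    return result


# Hypothetical template: the page's HTML plus IMS_JOB tags around the wanted areas.
TEMPLATE = r"""<HTML><BODY>
<IMS_JOB_TITLE><H2>\(.*\)</H2></IMS_JOB_TITLE>
<P><IMS_JOB_DATE>Posted: \([0-9]*/[0-9]*/[0-9]*\)</IMS_JOB_DATE></P>
</BODY></HTML>"""

# Hypothetical fetched page from the same site.
PAGE = """<HTML><BODY>
<H2>System Engineer</H2>
<P>Posted: 07/03/1998</P>
<P>Unrelated advertisement text</P>
</BODY></HTML>"""

print(extract(TEMPLATE, PAGE))
```

Because each field is searched for independently, text that is not covered by a tag, such as the advertisement paragraph in the sample page, never reaches the result.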

Unlike the above example, most pages have formatting attributes together with the text, such as size, color, font and many others. While importing them to the local system, we want to carry the information with its format and show it on our page as it appears in the original site, as well as storing pure data in the database. To serve this purpose we need to encapsulate those formats with the text and carry them over to the local system. In order to do this we have to map the formatting attributes exactly as they are. The example below shows how we can do it.

Assuming that the original HTML code of the page looks like this:

System Engineer

The coding of the template will be as follows:

<IMS_JOB_TITLE> \(.*\) </IMS_JOB_TITLE>

Because in many pages there is no consistency in the use of uppercase or lowercase keywords, it is safer to code the template as below:

<IMS_JOB_TITLE> <[Ff][Oo][Nn][Tt] [Ff][Aa][Cc][Ee]="palatino" [Ss][Ii][Zz][Ee]=-+[0-9]>\(.*\) </IMS_JOB_TITLE>
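The bracket-class spelling of keywords can be generated mechanically. The sketch below is an illustration under stated assumptions: the helper `any_case` is hypothetical, the size attribute is read as "a plus or minus sign followed by one digit" and written `[-+][0-9]`, and a closing font tag is assumed so that the captured text stops before it. In Python the same effect is usually obtained with the `re.IGNORECASE` flag, which differs only in that it also makes the attribute value "palatino" case-insensitive.

```python
import re


def any_case(keyword: str) -> str:
    """Expand a keyword into bracket classes, e.g. 'font' -> '[Ff][Oo][Nn][Tt]'."""
    return "".join(
        f"[{ch.upper()}{ch.lower()}]" if ch.isalpha() else re.escape(ch)
        for ch in keyword
    )


# Pattern built from bracket classes, in the spirit of the template coding above.
explicit = re.compile(
    "<" + any_case("font") + " " + any_case("face") + '="palatino" '
    + any_case("size") + r"=[-+][0-9]>(.*)</" + any_case("font") + ">"
)

# The same intent expressed with a flag instead of bracket classes.
flagged = re.compile(r'<font face="palatino" size=[-+][0-9]>(.*)</font>', re.IGNORECASE)

samples = [
    '<FONT FACE="palatino" SIZE=+1>System Engineer</FONT>',
    '<font face="palatino" size=+1>System Engineer</font>',
    '<Font Face="palatino" Size=-2>System Engineer</Font>',
]
for html in samples:
    print(explicit.search(html).group(1), "|", flagged.search(html).group(1))
```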

Fig. 10 shows the data structure of the database in the system according to an embodiment of the present invention. As described above, this invention is for general purposes, so it is not based on a certain subject or limited to any sector. The data model disclosed below supports job opportunities purposes, but is convertible to other systems. For job opportunities purposes, as shown in Fig. 10, there are seven major entities in charge of supporting the system of the invention for the opportunities model: site entity, template entity, resume entity, job entity, user entity, job category entity, and location entity. All the entities are listed in Table 1 below.

Table 1

The attributes of the entities are shown in Tables 2-8.

Table 2

Entity Name : USER

Table 3

Entity Name : SITE

Table 4

Entity Name : TEMPLATE

Table 5

Entity Name : RESUME

Table 6

Entity Name : JOB

Table 7

Entity Name : CATEGORY

Table 8

Entity Name : LOCATION

The relationships between the entities are shown in Fig. 11. As shown in Fig. 11, each user has a record in the user entity, and there is no more than one entry for each user. There can be zero, one, or more than one resume or job opportunity for each user, and these are tied to the user's entry. Each target site has one entry in the Site entity and one entry in the Template entity. Each site can have zero, one, or more than one entry in the Resume entity. Every resume can have one or more than one entry in the Location and Category entities. Each job can have one entry in the Location and Category entities.
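The cardinalities described for Fig. 11 can be written down as a small object model. The sketch below uses Python dataclasses and is only an approximation: the attribute names (`name`, `url`, `coded_page`, `text`, `title`) are hypothetical, since the attribute tables are not reproduced in this text, and only the relationships stated above are encoded.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Location:
    name: str


@dataclass
class Category:
    name: str


@dataclass
class Template:
    coded_page: str  # HTML of the target page plus the added tags


@dataclass
class Site:
    url: str
    template: Template  # each target site has one Template entry


@dataclass
class Resume:
    text: str
    locations: List[Location]    # one or more entries
    categories: List[Category]   # one or more entries
    site: Optional[Site] = None  # set when the resume was imported from a site

    def __post_init__(self) -> None:
        if not self.locations or not self.categories:
            raise ValueError("a resume needs at least one location and one category")


@dataclass
class Job:
    title: str
    location: Location  # one entry
    category: Category  # one entry


@dataclass
class User:
    name: str
    resumes: List[Resume] = field(default_factory=list)  # zero or more
    jobs: List[Job] = field(default_factory=list)        # zero or more


site = Site(url="http://jobs.example.com", template=Template(coded_page="<HTML>...</HTML>"))
user = User(name="example user")
user.resumes.append(
    Resume(text="resume body", locations=[Location("Beijing")],
           categories=[Category("Engineering")], site=site)
)
user.jobs.append(Job("System Engineer", Location("Beijing"), Category("Engineering")))
print(len(user.resumes), len(user.jobs))
```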

Fig. 12 shows the relationships in more detail between the entities involved in the system according to the embodiment of the present invention. The relationships between some of the entities may vary with the type of business for which they are used. Once again, this implementation is designed to serve a job opportunities system, which is only one of the applications of the WebHarvester system according to the present invention.

Fig. 13 shows the basic architecture of the system according to an embodiment of the present invention. There is no proprietary configuration for WebHarvester system according to the present invention. Internet connection for each side of this system is the only hardware configuration needed. Configurations may slightly vary in different conditions, but environments supporting this technology are more or less with the same characteristics. The following specification is the most commonly used environment in the market and based on open system architecture.

Environment used for the WebHarvester system according to the embodiment is as follows:

Client

• Interface
    Web based application program: The Job Plant

• Service Programs
    Web browser: Microsoft Internet Explorer, Netscape Browser
    Network & Internet service programs: ISP's interface programs

• Operating System
    PC or Workstation based OS: Windows 95, Windows NT Client, Mac/OS, Unix or Unix based other OSs

• Hardware
    PC, MAC, or Workstations
    32 MB or more RAM
    40 MB or more free space of hard disk
    28.8 Kbps or faster communication line

Server

• Interface
    Application program based on WebHarvester: The Job Plant back end

• Service Programs
    Network & Internet service programs: Netscape Web Server
    Database Management System: Informix, Oracle

• Operating System
    Server based OS: Windows NT Server, Unix or Unix based other OSs

• Hardware
    Network Connection: Leased line connection with at least 1.5 MB/S baud rate
    System: Any hardware with enough power to run the operating system and database management software with accepted performance

The system according to the present invention basically consists of the WebHarvester technology residing on the local server, which provides regular and periodical execution of the procedures of WebHarvester. This gathers information from the target servers and stores it in the local database to be accessible by clients. All connections in this system are based on the Internet, so they can be universal and transparent throughout the system.

Fig. 14 shows the detailed architecture and configuration of the system according to the embodiment of the present invention. It shows how these systems have access to each other and which elements each requires to make that happen.

As shown in Fig. 14, the clients are configured with the application software (the Job Plant), Web Browser, Operating System, and Network Services programs. The target servers are configured with Database Management System, Web Server, Operating System, and Network Services programs. The local server, which implements the present invention, is configured with the WebHarvester, Database Management System, Web Server, Operating System, and Network Services programs.

Fig. 15 shows how the information is passed between systems and the steps taken to get the data, input it into the database and expose it to the clients in the system according to the embodiment of the present invention. As shown in Fig. 15, the local server, which implements the WebHarvester of the present invention, holds the templates and the database tables. First, in the fetch step, the local server sends connection commands and query statements to the target server of a specified web site to generate a target page from the target server. The generated target page is sent back to the local server. Then, in the filter step, the necessary information is extracted from the generated target page according to the predetermined template residing in the local server for this site. Finally, in the format & post step, a target page is created in the local server according to a predetermined standard format, and the page containing the necessary information in standard format is stored in the database tables. At this stage, the necessary information is available to the web users.

As shown in Fig. 16, when a client sends a request via the Internet to the local server for browsing job opportunities or resumes of different web sites, the local server returns to the client the results containing the information the user is interested in.
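Since all harvested records already sit in the local database, answering a client request reduces to a single local query. The sketch below illustrates this with an in-memory SQLite table; the `job` table, its columns, the sample rows, and the `search_jobs` helper are assumptions for illustration only.

```python
import sqlite3
from typing import List, Tuple


def search_jobs(connection: sqlite3.Connection, keyword: str) -> List[Tuple[str, str]]:
    """Return (source site, title) for every stored job whose title contains keyword."""
    cursor = connection.execute(
        "SELECT source, title FROM job WHERE title LIKE ? ORDER BY source",
        (f"%{keyword}%",),
    )
    return cursor.fetchall()


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE job (source TEXT, title TEXT)")
# Hypothetical records previously harvested from three different sites.
connection.executemany(
    "INSERT INTO job VALUES (?, ?)",
    [
        ("site_a", "System Engineer"),
        ("site_b", "Software Engineer"),
        ("site_c", "Accountant"),
    ],
)
results = search_jobs(connection, "engineer")
print(results)
```

The client issues one request and receives matches that originated on several sites, which is the behavior attributed to the local server above.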

Industrial applicability

Fig. 2 shows how the steps needed for getting the same result as that in Fig. 1 are shorter with the system according to the present invention. First, the user connects to the Internet and calls the URL of the WebHarvester system according to the present invention; then he executes a query of the system and gets results A, B, and C automatically, after which he can print out the results. Clearly, the operation of the WebHarvester system according to the present invention is much simpler than that involved in the conventional technology, and the web user or surfer need not move across different web sites by himself.

From the above description, one can see that the advantage of the WebHarvester system according to the present invention is that it eliminates the need to visit different sites of the same kind throughout cyberspace in order to bring job providers and seekers together. This system will be the only address for its users to look at, ideally throughout the world, to meet any needs in the job and career opportunities area. The system functions as a gateway to a wealth of information across different servers, since it can retrieve and import information from remote databases. To summarize the benefits of the system:

- Keeps the information up to the minute and up to date;

- Needs minimum effort to collect information;

- Removes the need to visit different sites of the same type, by importing data to the local site;

- Reduces the cost of content maintenance;

- Will get support from the owners of the imported data by widening their target viewers.

With this WebHarvester system, using only one URL, which is the URL of the WebHarvester system, and dealing with only one interface, the user can get the same or even better results, and more easily than before. With the present invention, the user can have all the information in one place and do everything, such as importing or printing, once; therefore the efficiency is improved.

Having described and illustrated the principles of the invention in preferred embodiments thereof, it should be apparent that the invention can be modified in arrangement and detail without departing from such principles. All modifications and variations coming within the spirit and scope of the invention are covered in the appended claims.

Claims

What is claimed is:

1. A system for crawling the Web and extracting designated data, comprising: a computer system; a database configured in the computer system; templates residing in the computer system for mapping information in target page for each web site; fetch means for fetching web pages from said web sites and transferring the fetched pages to said computer system; filter means for scanning the fetched pages to extract necessary information from the fetched pages from said web sites according to corresponding one of said templates, respectively; format & post means for converting the extracted information into a standard format, and storing the formatted information in said database.

2. The system according to claim 1, wherein said computer system is a server connected to Internet.

3. The system according to claim 1 or 2, wherein said templates form an entity in said database.

4. The system according to claim 3, wherein said fetching means connects to the target web sites, performs query statements to retrieve target pages.

5. The system according to claim 4, wherein said filter means adds starting and ending tags to each interested information contained in the HTML version of the target page according to the predetermined template for each site, and extracts the coding between each pair of starting tag and ending tag.

6. The system according to claim 5, wherein said format & post means compiles the coding between said starting tag and ending tag into a page in a predetermined format.

7. The system according to claim 6, wherein when a user sends a request for browsing specific contents in said web sites, said computer system sends corresponding information stored in said database to said user.

8. A method for crawling the Web and extracting designated data in a computer system configured with a database, comprising the steps of: creating and keeping templates for mapping information in target page for each web site in the computer system; fetching web pages from said web sites and transferring the fetched pages to said computer system; filtering the fetched pages to extract the necessary information from the fetched pages from said web sites according to corresponding one of the templates, respectively; formatting the extracted information into a standard format, and storing the formatted information in said database.

9. The method according to claim 8, wherein said templates form an entity in said database.

10. The method according to claim 8 or 9, wherein said fetching step comprises the substeps of: connecting to the target sites, executing query statements, and retrieving the target pages.

11. The method according to claim 10, wherein said filtering step comprises the substeps of adding starting and ending tags to each interested information contained in the HTML version of the target page according to the predetermined template for each site, and extracting the coding between each pair of starting tag and ending tag.

12. The method according to claim 11, wherein said formatting step comprises the substep of compiling the coding between said starting tag and ending tag into a page in a predetermined format.

13. The method according to claim 12, wherein when a user sends a request for browsing specific contents in said web sites, the computer system sends information stored in said database to said user.

14. The method according to claim 13, wherein said computer system is a server connected to Internet.

15. The system according to any one of the claims 1-7, wherein said computer system includes network & Internet service programs, database management system, and server based operating system.