CN106202348A - Method for extracting web page table information

Info

Publication number: CN106202348A
Application number: CN201610524342.2A
Authority: CN
Inventors: 胡生辉; 龙冬阳; 衣杨; 袁野; 杨洋
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2016-07-04
Filing date: 2016-07-04
Publication date: 2016-12-07

Abstract

The invention discloses a method for extracting web page table information, comprising: a user configures an operation file in advance; the user inputs the URL of the web page to be crawled, and a Java system crawls the web page; the Java system preprocesses the crawled web page; the Java system parses the web page and locates the target data according to information manually input by the user and, at the same time, generates extraction rules and stores them in a rule base for maintenance; the Java system extracts the required data from the page according to the extraction rules and stores the extracted data in a MySQL database; and the Java system, according to the configuration of the operation file, uses a JSP page to display the extracted data to the user as a dynamic page for the user to operate on. The invention reduces the consumption of system resources during the extraction of web page table information, speeds up the extraction, makes it easier for users to perform secondary processing on the extracted table data, and improves the efficiency of that secondary processing.

Description

A method for extracting web page table information

Technical field

The invention relates to the field of web page information extraction, and in particular to a method for extracting web page table information.

Background

With the rapid progress of web technology, web pages can hold massive amounts of data. However, pages are also filled with information that users do not care about, such as advertising images and promotional links, which can even be mixed in with the main content of the page, making it difficult for users to obtain the important information quickly. In addition, a user who wants to reuse the target information elsewhere must first extract it manually, strip the HTML tags and other noise, reorganize it, and only then present it as desired. This is inaccurate, time-consuming, labor-intensive and inefficient.

Tables are widely used in web pages in every field because they express relational information concisely and effectively. Most Table tags are used to present relational data to users, for example in train timetables, online shopping, online banking and management information systems. At present, however, web information extraction technology is applied to tables only to a limited extent. Most approaches first process the page through a DOM (Document Object Model) tree and then extract the data. This approach loads all the information of the page into memory first, so it consumes a lot of memory when many pages are processed. Moreover, most applications stop at extracting the data from the table; they do not abstract the common operations that are likely to be performed on that data or provide any further processing, which reduces the efficiency of secondary processing of the data.

In view of this, there is an urgent need for a new method of extracting web page table information that solves the problems of existing extraction techniques, namely their high consumption of system resources and the low efficiency of secondary processing of web page data.

Summary of the invention

The technical problem to be solved by the present invention is that existing techniques for extracting web page table information consume considerable system resources and make secondary processing of web page data inefficient.

To solve the above technical problem, the technical solution adopted by the present invention is to provide a method for extracting web page table information, comprising the following steps:

the user configures an operation file in advance;

the user inputs the URL of the web page to be crawled, and the Java system crawls the web page;

the Java system preprocesses the crawled web page;

the Java system parses the web page and locates the target data according to information manually input by the user and, at the same time, generates extraction rules and stores them in a rule base for maintenance;

the Java system extracts the required data from the page according to the extraction rules and stores the extracted data in a MySQL database;

the Java system, according to the configuration of the operation file, uses a JSP page to display the extracted data to the user as a dynamic page for the user to operate on.

In the above technical solution, the user adds the operations to be performed to the operation file in a predefined format, including operations for confirming, modifying and deleting web page table data.

In the above technical solution, the predefined format is the same as the format the Java system uses when parsing the operation file.

In the above technical solution, the preprocessing of the crawled web page by the Java system includes removing irrelevant pictures, videos, music, large blocks of text and navigation bars from the web page, and formatting the HTML file.

In the above technical solution, when the web page contains a large amount of content, the web page is denoised.

In the above technical solution, the operation file is manipulate.conf.

In the above technical solution, the extraction rules include information indicating in which table on the page, and in which column, the required data is located.

In the above technical solution, the Java system uses the open-source Java toolkit Jsoup to parse the web page and locate the target data.

The invention reduces the consumption of system resources during the extraction of web page table information, speeds up the extraction, makes it easier for users to perform secondary processing on the extracted table data, and improves the efficiency of that secondary processing.

Description of the drawings

Fig. 1 is a flow chart of a method for extracting web page table information provided by an embodiment of the present invention;

Fig. 2 is a data flow chart, provided by the embodiment of the present invention, for reprocessing the extracted web page table data.

Detailed description

The present invention covers configuration of the operation file manipulate.conf, web page crawling, web page preprocessing, Jsoup parsing, table location based on a sample table row, maintenance of the rule knowledge base, data extraction, data normalization and data persistence, and the display of data and operations. The target table of a web page is located from an example entered by the user, the data of the target table is stored in a database, and the data is then displayed as the user wishes, so that the target data can be reused. This reduces the consumption of system resources during the extraction of web page table information, speeds up the extraction, makes it easier for users to perform secondary processing on the extracted table data, and improves the efficiency of that secondary processing.

The present invention is described in detail below with reference to the accompanying drawings and specific embodiments.

An embodiment of the present invention provides a method for extracting web page table information which, as shown in Fig. 1, comprises the following steps:

S1. The user configures the operation file manipulate.conf in advance according to requirements.

According to the operations that will ultimately be performed on the extracted web page table data, the user configures the operation file manipulate.conf in advance, adding the desired operations to it in a predefined format. These include operations such as confirming, modifying and deleting web page table data. The predefined format only needs to match the format the Java system uses when parsing manipulate.conf.

S2. The user inputs the URL (Uniform Resource Locator) of the web page to be crawled, and the Java system crawls the web page.

To improve the stability of the Java system, the URL entered by the user must conform to the URL standard; if it does not, a corresponding prompt window pops up on submission.
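The check in step S2 can be sketched as follows. This is only an illustration written with Python's standard library rather than the Java the patent names; the function name `is_valid_url` and the restriction to absolute http/https URLs with a host are assumptions, since the text only says the address must conform to the URL standard.

```python
from urllib.parse import urlsplit


def is_valid_url(url: str) -> bool:
    """Return True for an absolute http(s) URL that has a host name.

    Assumption: only http and https pages are crawled; the patent does
    not say which schemes are accepted.
    """
    try:
        parts = urlsplit(url.strip())
        return parts.scheme in ("http", "https") and bool(parts.hostname)
    except ValueError:  # raised for malformed input such as a broken IPv6 host
        return False
```

A front end could call this before submitting the address and show the prompt window whenever it returns False.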

S3. The Java system preprocesses the crawled web page.

To optimize the subsequent extraction, the present invention preprocesses the crawled web page. This includes removing irrelevant pictures, videos, music, large blocks of text and navigation bars from the page, and formatting the HTML file (that is, serializing the HTML page as XML). In addition, if the page contains a large amount of content, it can be denoised.
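A minimal sketch of this preprocessing step is given below, using Python's standard `html.parser` as a stand-in for the Java system. The class name `Preprocessor`, the function `preprocess` and the exact sets of dropped tags are assumptions made for illustration. The sketch drops media, navigation and script content and writes the remaining tags in an XML-like form with self-closed void elements; it does not repair unclosed tags and does not attempt the removal of large blocks of text.

```python
from html import escape
from html.parser import HTMLParser

# Assumed tag sets; the patent only lists pictures, videos, music and navigation bars.
DROPPED = {"video", "audio", "nav", "script", "style"}  # removed together with their content
VOID = {"img", "br", "hr", "input", "meta", "link", "source",
        "embed", "area", "base", "col", "track", "wbr"}  # elements without an end tag


class Preprocessor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip_tag = None   # name of the dropped element currently being skipped
        self._skip_depth = 0    # nesting depth of that same element

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            # Images are dropped; other void elements are written self-closed.
            if tag != "img" and self._skip_tag is None:
                self.out.append(f"<{tag}/>")
            return
        if self._skip_tag is not None:
            if tag == self._skip_tag:
                self._skip_depth += 1
            return
        if tag in DROPPED:
            self._skip_tag, self._skip_depth = tag, 1
            return
        attr_text = "".join(f' {k}="{escape(v or "")}"' for k, v in attrs)
        self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self._skip_tag is not None:
            if tag == self._skip_tag:
                self._skip_depth -= 1
                if self._skip_depth == 0:
                    self._skip_tag = None
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self._skip_tag is None:
            self.out.append(escape(data, quote=False))


def preprocess(html: str) -> str:
    """Return the page with dropped elements removed and text re-escaped."""
    parser = Preprocessor()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Because the parser is event-driven, the page is processed as a stream and no full document tree is built, which is in the spirit of the memory concern raised in the background section.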

S4. Based on the information manually input by the user, the Java system uses the open-source Java toolkit Jsoup to parse the web page and locate the target data; at the same time, it generates extraction rules and stores them in the rule base for maintenance.

The user manually enters the content of one row of interest, for example one row of ticket data from a train ticketing site, including the trip number, price, origin, destination and number of remaining tickets, so that the Java system can quickly locate the data the user needs.

The extraction rules include information such as which table on the page, and which column, holds the required data. Storing them in the rule base allows information to be extracted directly from similar pages of the same site later on, which saves time.
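Locating a table from a sample row and deriving a rule from it might look like the sketch below. It uses Python's standard `html.parser` instead of Jsoup, and the names `TableCollector`, `parse_tables` and `locate`, the 0-based indexes and the dictionary shape of the rule are all assumptions. It also assumes flat (non-nested) tables whose cells are explicitly closed, as they would be after preprocessing.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tables = []
        self._cell = None  # text pieces of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.tables[-1][-1].append("".join(self._cell).strip())
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def parse_tables(html):
    collector = TableCollector()
    collector.feed(html)
    collector.close()
    return collector.tables


def locate(html, sample_row):
    """Find the first table row containing every sample value.

    Returns a rule {"table": index, "columns": [indexes]} or None if no row matches.
    """
    for t_index, table in enumerate(parse_tables(html)):
        for row in table:
            if all(value in row for value in sample_row):
                return {"table": t_index,
                        "columns": [row.index(value) for value in sample_row]}
    return None
```

For a page whose second table has a row `G101 | 553 | Beijing`, the sample `["G101", "Beijing"]` yields the rule `{"table": 1, "columns": [0, 2]}`, which is the "which table, which column" information described above.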

S5. The Java system extracts the required data from the page according to the extraction rules and stores the extracted data in the MySQL database, to facilitate secondary processing of the data later.
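Step S5 can be illustrated with the sketch below. It assumes the page has already been parsed into a list of tables (each a list of rows of cell strings) and that a rule has the hypothetical form `{"table": i, "columns": [...]}` with 0-based indexes. Python's built-in `sqlite3` stands in for the MySQL database named in the text, and the table name `extracted` and column names `c0`, `c1`, ... are placeholders.

```python
import sqlite3


def extract_rows(tables, rule):
    """Pick the rule's columns from every sufficiently wide row of the rule's table."""
    width = max(rule["columns"]) + 1
    return [tuple(row[c] for c in rule["columns"])
            for row in tables[rule["table"]]
            if len(row) >= width]


def store_rows(conn, rows, n_columns):
    """Persist the extracted rows; sqlite3 is used here in place of MySQL."""
    columns = ", ".join(f"c{i} TEXT" for i in range(n_columns))
    conn.execute(f"CREATE TABLE IF NOT EXISTS extracted ({columns})")
    marks = ", ".join("?" * n_columns)
    conn.executemany(f"INSERT INTO extracted VALUES ({marks})", rows)
    conn.commit()
```

Note that a header row is extracted like any other row in this sketch; a real system would probably record in the rule whether the first row is a header.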

Fig. 2 shows the data flow chart, provided by the embodiment of the present invention, for reprocessing the extracted web page table data.

S6. According to the configuration of the operation file manipulate.conf, the Java system uses JSP (JavaServer Pages) pages to display the extracted data to the user as a dynamic page for the user to operate on.
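The text does not spell out how the operation file shapes the page. One possible reading, sketched below in Python instead of JSP, is that each configured operation contributes extra columns that are appended to the extracted data. The function `render_page` and its parameters are hypothetical.

```python
from html import escape


def render_page(headers, rows, extra_columns):
    """Render extracted rows as an HTML table with one empty cell per extra column.

    Assumption: extra_columns holds the column names taken from the
    configured operation, for example ["sign-in", "operation"].
    """
    head = "".join(f"<th>{escape(h)}</th>"
                   for h in list(headers) + list(extra_columns))
    lines = [f"<tr>{head}</tr>"]
    for row in rows:
        cells = "".join(f"<td>{escape(str(v))}</td>" for v in row)
        lines.append(f"<tr>{cells}{'<td></td>' * len(extra_columns)}</tr>")
    return "<table>\n" + "\n".join(lines) + "\n</table>"
```

In a real deployment the empty cells would hold the sign-in status and the control that triggers the operation.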

Example 1.

First, the configuration format of the operation file manipulate.conf is defined as follows:

1: dump and browse, 0, none;

2: single sign-in, 2, sign-in, operation;

3: dual sign-in, 4, employee operation, employee sign-in, supervisor operation, supervisor sign-in.
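Read as "operation name, number of fields, field names", the three entries above could be parsed as in the following sketch. This interpretation is an assumption, as is treating the leading "1:", "2:", "3:" as list numbering rather than file content (step S10 adds the statement without a number). The function name `parse_operation` is hypothetical, and Python is used in place of the Java system.

```python
def parse_operation(line):
    """Parse 'name, count, field1, ..., fieldN' into a dictionary.

    Accepts ASCII or full-width commas and an optional trailing ';' or '.'.
    Raises ValueError if the count is not a number or does not match the fields.
    """
    text = line.strip().replace("，", ",").rstrip(";.；。")
    parts = [p.strip() for p in text.split(",")]
    if len(parts) < 2:
        raise ValueError(f"expected 'name, count, ...': {line!r}")
    name, count = parts[0], int(parts[1])
    fields = [] if count == 0 else parts[2:]  # a count of 0 is followed by 'none'
    if len(fields) != count:
        raise ValueError(f"{name}: expected {count} fields, got {len(fields)}")
    return {"name": name, "fields": fields}
```

For example, `parse_operation("single sign-in, 2, sign-in, operation")` returns `{"name": "single sign-in", "fields": ["sign-in", "operation"]}`.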

Suppose the manager of a department in some company needs every member of the team to sign in to confirm the team's expense statement. Because enterprise applications are structurally complex and involve many external resources, employees cannot obtain an interface to this data and can only copy the table information by hand and then publish it. The sign-in confirmation cannot be automated either. This wastes a great deal of labor and time and cannot guarantee that the data is accurate and usable. With the scheme designed according to the present invention, the department manager first configures the operation file manipulate.conf; enters the URL of the web page containing the team accounts, so that the page is crawled; the crawled page is serialized as XML and, if it contains a lot of noise, denoised; the department manager enters all the data of one person's bill, from which the extraction rules are generated and added to the rule base for maintenance; the page is processed according to the extraction rules and the required data is extracted; the extracted data is stored in the database; finally, the extracted data and the operations configured in manipulate.conf are displayed to the department manager. On that page, the members of the manager's department can operate on the data, and the manager can see whether each colleague in the department has signed in.

The above method specifically comprises the following steps:

S10. The department manager adds the following statement to the operation file manipulate.conf in the defined format: single sign-in, 2, sign-in, operation.

S11. The department manager logs in to the service web page and enters the URL of the web page to be crawled.

S12. If the crawled web page is larger than 4 MB, then, since bill data is what is to be extracted, all audio, images and similar content can be removed from the page to speed up table location and identification.

S13. The department manager enters one row of expense data in the data format of the original web page.

S14. Based on the bill data entered by the department manager, the Java system uses the open-source Java toolkit Jsoup to parse the web page and locate the target data. It then stores the resulting extraction rules, such as which table and which column hold the data, in the rule base, so that information can later be extracted directly from similar pages of the same site, which saves time.
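A rule base that lets similar pages of the same site reuse a rule could be as simple as the sketch below. Keying rules by host name is an assumption about what "similar pages of the same site" means, and the class `RuleBase`, its methods and the example URLs are hypothetical. Python's standard library is used in place of the Java system.

```python
import json
from urllib.parse import urlsplit


class RuleBase:
    """In-memory rule base keyed by host name (an assumed notion of 'same site')."""

    def __init__(self):
        self._rules = {}

    def save(self, url, rule):
        self._rules[urlsplit(url).hostname] = rule

    def lookup(self, url):
        """Return the stored rule for the URL's site, or None if there is none."""
        return self._rules.get(urlsplit(url).hostname)

    def dumps(self):
        """Serialize the rule base, for example to write it to disk."""
        return json.dumps(self._rules, sort_keys=True)
```

After `save("http://intranet.example.com/bills/2016-06.html", rule)`, a later `lookup("http://intranet.example.com/bills/2016-07.html")` returns the same rule, so the manager does not need to enter a sample row again.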

S15. The Java system stores the data extracted on the basis of the bill data entered by the department manager in the MySQL database, to facilitate secondary processing of the data later.

S16. According to the configuration of the operation file manipulate.conf and the content stored in the MySQL database, the Java system uses a JSP page to display the extracted data to the department manager through the browser. On this page, the employees of the manager's department can sign in to confirm, and the manager can see whether a given employee has done so.

The present invention is not limited to the preferred embodiment described above. Any structural change made by anyone in light of the present invention, and any technical solution identical or similar to that of the present invention, falls within the scope of protection of the present invention.

Claims

1. A method for extracting web page table information, characterized in that it comprises the following steps:

the user configures an operation file in advance;

the user inputs the URL of the web page to be crawled, and the Java system crawls the web page;

the Java system preprocesses the crawled web page;

the Java system parses the web page and locates the target data according to information manually input by the user and, at the same time, generates extraction rules and stores them in a rule base for maintenance;

the Java system extracts the required data from the page according to the extraction rules and stores the extracted data in a MySQL database;

the Java system, according to the configuration of the operation file, uses a JSP page to display the extracted data to the user as a dynamic page for the user to operate on.

2. The method according to claim 1, wherein the user adds the operations to be performed to the operation file in a predefined format, including operations for confirming, modifying and deleting web page table data.

3. The method according to claim 2, wherein the predefined format is the same as the format the Java system uses when parsing the operation file.

4. The method according to claim 1, wherein the preprocessing of the crawled web page by the Java system includes removing irrelevant pictures, videos, music, large blocks of text and navigation bars from the web page, and formatting the HTML file.

5. The method according to claim 4, characterized in that, when the web page contains a large amount of content, the web page is denoised.

6. The method according to claim 1, wherein the operation file is manipulate.conf.

7. The method according to claim 1, wherein the extraction rules include information indicating in which table on the page, and in which column, the required data is located.

8. The method according to claim 1, wherein the Java system uses the open-source Java toolkit Jsoup to parse the web page and locate the target data.