CN114491360B - Data extraction method, device, electronic device and storage medium - Google Patents

Info

Publication number
CN114491360B
Authority
CN
China
Prior art keywords
target
header
tbody
thead
tag
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111647593.7A
Other languages
Chinese (zh)
Other versions
CN114491360A (en)
Inventor
蒋晓海
刘玉芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Testin Information Technology Co Ltd
Original Assignee
Beijing Testin Information Technology Co Ltd
Application filed by Beijing Testin Information Technology Co Ltd
Priority to CN202111647593.7A
Publication of CN114491360A
Application granted
Publication of CN114491360B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application discloses a data extraction method, a data extraction device, an electronic device and a storage medium. The method includes: obtaining a target element according to first attribute information of a first cell, where the first attribute information is obtained in response to a user's selection operation on the first cell in a specified page and the specified page is the page where the target element is located; searching, based on the element identifier of the target element, for the target table corresponding to the target element in the specified page; and extracting multiple types of data from the target table.

Description

Data extraction method and device, electronic equipment and storage medium
Technical Field
The application belongs to the field of Internet technology and relates in particular to a data extraction method, a data extraction device, an electronic device and a storage medium.
Background
A table is defined by a tag and consists of a header and a body. Each table has a number of rows, each row is divided into a number of cells, and a cell can contain text, pictures, lists, attributes and the like.
Hypertext Markup Language (HTML) defines the meaning and structure of web page content and uses elements to mark up text, pictures and other content. Given the data of a cell in an HTML page, the related art is subject to many limitations when determining the corresponding target table, so in some scenarios the target table cannot be determined and the full-table data cannot be extracted.
Disclosure of Invention
The embodiments of the application provide a data extraction method, a data extraction device, an electronic device and a storage medium, which can solve the problem that a target table cannot be determined and full-table data cannot be extracted.
In a first aspect, an embodiment of the present application provides a data extraction method, including:
acquiring a target element according to first attribute information of a first cell, wherein the first attribute information is acquired in response to a user's selection operation on the first cell in a designated page, and the designated page is the page where the target element is located; searching, based on the element identifier of the target element, for the target table corresponding to the target element in the designated page; and extracting multiple types of data from the target table.
In a second aspect, an embodiment of the present application provides a data extraction apparatus, including:
an acquisition module, configured to acquire a target element according to first attribute information of a first cell, wherein the first attribute information is acquired in response to a user's selection operation on the first cell in a designated page, and the designated page is the page where the target element is located; a search module, configured to search, based on the element identifier of the target element, for the target table corresponding to the target element in the designated page; and an extraction module, configured to extract multiple types of data from the target table.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a memory, and a program or instruction stored on the memory and executable on the processor, the program or instruction implementing the steps of the method according to the first aspect when executed by the processor.
In a fourth aspect, embodiments of the present application provide a readable storage medium having stored thereon a program or instructions which when executed by a processor perform the steps of the method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and where the processor is configured to execute a program or instructions to implement a method according to the first aspect.
In the embodiments of the application, a target element is acquired according to first attribute information of a first cell, where the first attribute information is acquired in response to a user's selection operation on the first cell in a designated page and the designated page is the page where the target element is located; a target table corresponding to the target element is searched for in the designated page based on the element identifier of the target element; and multiple types of data are extracted from the target table. The target table can therefore be determined from the attribute information of a given cell and the element identifier of the corresponding target element, and the multiple types of data extracted from it are rich enough to meet data analysis requirements.
Drawings
Fig. 1 is a first flowchart of a data extraction method according to an embodiment of the present application;
Fig. 2 is a second flowchart of a data extraction method according to an embodiment of the present application;
Fig. 3 is a third flowchart of a data extraction method according to an embodiment of the present application;
Fig. 4 is a fourth flowchart of a data extraction method according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a data extraction device according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The data extraction method, the device, the electronic equipment and the storage medium provided by the embodiment of the application are described in detail through specific embodiments and application scenes thereof with reference to the accompanying drawings.
Fig. 1 shows a data extraction method provided by an embodiment of the present invention, which may be performed by an electronic device, for example, a terminal device, that is, the above-mentioned data extraction method may be performed by hardware or software installed in the terminal device, and the method includes the steps of:
Step S101, acquiring a target element according to first attribute information of a first cell, wherein the first attribute information is acquired by responding to a selection operation of a user on the first cell in a designated page, and the designated page is the page where the target element is located.
Specifically, the designated page is the page containing the target element that corresponds to the first attribute information of the first cell. The first cell is at least one cell selected by the user in the designated page. The selection operation may be a box selection of an area containing at least one cell, in which case the terminal device, in response to the selection operation, treats the cells in that area as the first cell. The first attribute information is the data in the first cell, and that data has a corresponding markup element.
For example, the designated page may be a World Wide Web (Web) page, whose elements may include menu bars, link windows, input boxes, icons, buttons, text boxes, text, pictures, Web tables and the like. A Web table is divided into a header and a body and is formed by a combination of the elements <table>, <thead> and <tbody>: <table> defines the HTML table and supports the HTML global attributes, <thead> defines the header, and <tbody> defines the body.
In a Web table, the element <tr> (table row) defines a row, the element <td> (table data) defines a cell, and the element <th> (table header) defines a header cell. The cell is the smallest content container: content is added to a cell through <td>, so the content of each cell should be written inside a <td> element. A row is defined by <tr> and may contain several cells; for example, four <td> elements written inside one <tr> represent a row of four cells. The <th> element defines a header cell and usually appears at the beginning of a row or column.
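The structure described above can be made concrete with a short sketch. It uses Python's standard `xml.etree.ElementTree` module; the sample markup, the `SAMPLE` constant and the `read_table` helper are illustrative assumptions and are not taken from the application. The sketch also assumes well-formed markup, which real Web pages often are not.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample table; real pages are rarely well-formed XML,
# so a tolerant HTML parser would be needed in practice.
SAMPLE = """
<table id="orders">
  <thead>
    <tr><th>Order</th><th>Customer</th><th>Amount</th><th>Status</th></tr>
  </thead>
  <tbody>
    <tr><td>1001</td><td>Alice</td><td>30</td><td>paid</td></tr>
    <tr><td>1002</td><td>Bob</td><td>45</td><td>open</td></tr>
  </tbody>
</table>
"""

def read_table(markup):
    """Return (header_cells, body_rows) as lists of cell text."""
    table = ET.fromstring(markup)
    # <th> cells inside <thead> form the header.
    header = [th.text for th in table.findall("./thead/tr/th")]
    # Each <tr> inside <tbody> is a row; each <td> is one cell.
    body = [[td.text for td in tr.findall("td")]
            for tr in table.findall("./tbody/tr")]
    return header, body

header, body = read_table(SAMPLE)
print(header)
print(body)
```

Each body row here contains four <td> cells, matching the four-cell row mentioned in the text.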
More specifically, the data types in a cell include, but are not limited to, at least one of an identifier (ID), a name (Name), a tag (Tag), an extensible markup language path (Xpath), text (Text), a style (ClassName), a link (Link), a location (rect) and an internal element (child). The internal element is a child element, for example at least one of a block container tag (div), a label tag (label) and an inline tag (span).
The identifier (ID) specifies a unique identifier of the element; the name (Name) specifies the name of the element; the style (ClassName) specifies the class of the element; the tag (Tag) specifies the tag of the element; the link (Link) specifies the link address of the element; and the text (Text) specifies the text inside the element. The extensible markup language path (Xpath) can be used to test whether a node in a document matches a given pattern. XPath provides a rich set of functions and flexibly supports matching on multiple attributes; here it serves as a locating value derived from the attributes of the element.
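As a rough illustration of these data types, the sketch below collects them from parsed elements with Python's standard `xml.etree.ElementTree`. The sample page and the helpers `xpath_of` and `describe` are hypothetical; the positional path is only one possible way to derive an Xpath, and the location (rect) is left out because it requires a rendering engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of a page, assumed to be well-formed.
PAGE = """
<table>
  <tbody>
    <tr>
      <td id="c1" class="name">Alice</td>
      <td><a href="/orders/1001" name="order-link">1001</a></td>
    </tr>
  </tbody>
</table>
"""

def xpath_of(element, parents):
    """Build a positional path such as /table/tbody/tr/td[2] (illustrative only)."""
    steps = []
    node = element
    while node is not None:
        parent = parents.get(node)
        if parent is None:
            steps.append(node.tag)
        else:
            same_tag = [c for c in parent if c.tag == node.tag]
            index = same_tag.index(node) + 1
            # Add a position only when the tag is ambiguous among siblings.
            steps.append(f"{node.tag}[{index}]" if len(same_tag) > 1 else node.tag)
        node = parent
    return "/" + "/".join(reversed(steps))

def describe(element, parents):
    """Collect the attribute types named in the text for one element."""
    return {
        "ID": element.get("id"),
        "Name": element.get("name"),
        "Tag": element.tag,
        "ClassName": element.get("class"),
        "Link": element.get("href"),
        "Text": "".join(element.itertext()).strip(),
        "Xpath": xpath_of(element, parents),
    }

root = ET.fromstring(PAGE)
# ElementTree keeps no parent pointers, so build a child -> parent map once.
parents = {child: parent for parent in root.iter() for child in parent}
cells = root.findall(".//td")
first = describe(cells[0], parents)
link = describe(cells[1][0], parents)
print(first)
print(link)
```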
Notably, the Web page can be a page of a Web application opened in a browser, where the browser can be Google Chrome, Internet Explorer (IE), Firefox, or the like.
Step S103, searching a target table corresponding to the target element in the designated page based on the element identification of the target element.
Specifically, each element carries a tag, and this tag identifies the type of the current element.
Step S105, extracting a plurality of types of data from the target table.
Specifically, extracting multiple types of data from the target table includes extracting at least one row or column of data from it: the full-table data may be extracted, or one or more columns, or one or more rows. The types of data may include, but are not limited to, identifier (ID), name (Name), style (ClassName), extensible markup language path (Xpath), tag (Tag), link (Link) and text (Text). Other types of data are also possible, and the embodiments of the present application are not limited in this respect.
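A minimal sketch of such an extraction follows, assuming the target table has already been found and parsed with Python's standard `xml.etree.ElementTree`. The sample table and the helpers `cell_record` and `extract` are hypothetical names chosen for illustration; only a subset of the data types is collected.

```python
import xml.etree.ElementTree as ET

# Hypothetical target table, assumed to be well-formed.
TABLE = """
<table>
  <thead><tr><th>Name</th><th>Profile</th></tr></thead>
  <tbody>
    <tr><td id="n1" class="name">Alice</td><td><a href="/u/alice">view</a></td></tr>
    <tr><td id="n2" class="name">Bob</td><td><a href="/u/bob">view</a></td></tr>
  </tbody>
</table>
"""

def cell_record(cell):
    """Collect several types of data from one cell."""
    link = cell.find(".//a")
    return {
        "ID": cell.get("id"),
        "Tag": cell.tag,
        "ClassName": cell.get("class"),
        "Text": "".join(cell.itertext()).strip(),
        "Link": link.get("href") if link is not None else None,
    }

def extract(table, columns=None):
    """Return one list of cell records per row; `columns` limits the columns kept."""
    rows = []
    for tr in table.iter("tr"):
        cells = [c for c in tr if c.tag in ("th", "td")]
        if columns is not None:
            cells = [cells[i] for i in columns if i < len(cells)]
        rows.append([cell_record(c) for c in cells])
    return rows

table = ET.fromstring(TABLE)
full = extract(table)                       # full-table data
second_column = extract(table, columns=[1])  # a single column
print(full[1][0])
print(second_column[2][0])
```

Selecting rows instead of columns would simply slice the returned list.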
According to the data extraction method provided by the embodiment of the invention, the target table can be determined according to the attribute information of the given cell and the element identifier of the target element corresponding to that attribute information, and multiple types of data are extracted from the target table, so the extracted data types are rich and can meet data analysis requirements.
In one implementation, the first attribute information includes at least one of an identifier (ID), a tag (Tag), an extensible markup language path (Xpath) and text (Text), and acquiring the target element according to the first attribute information of the first cell includes one of the following: acquiring the target element according to the identifier (ID); when the browser in which the designated page is opened is an IE browser, acquiring the target element according to the combined information of the Tag and the Text; and when the browser is not an IE browser, acquiring the target element according to the extensible markup language path (Xpath) or according to the combined information of the Tag and the Text.
Specifically, acquiring the target element according to the combined information of the Tag and the Text includes finding the elements in the designated page that match the Tag and the Text and taking the first of them, in order, as the target element. The matching elements may be ordered according to marking time and type priority; other ways of selecting the target element are also possible.
More specifically, when the first attribute information includes exactly one of the identifier (ID), tag (Tag), extensible markup language path (Xpath) and text (Text), the corresponding step above is used to acquire the target element. When it includes at least two of them, the step to use is determined by assigning priorities to the attribute information, for example in descending order: identifier (ID), tag (Tag), extensible markup language path (Xpath), text (Text). If the first attribute information includes all four, the target element is preferentially acquired according to the identifier (ID). If it includes the Tag, the Xpath and the Text but not the ID, and the browser in which the designated page is opened is an IE browser, the target element is preferentially acquired according to the combined information of the Tag and the Text; if the browser is not an IE browser, the target element is acquired according to the Xpath or, when the first attribute information does not include the Xpath, according to the combined information of the Tag and the Text. The target element can thus be acquired in several ways, which improves the reliability of acquiring it and, in turn, of extracting data from the target table.
In another possible implementation, when the first attribute information includes at least two of the identifier (ID), tag (Tag), extensible markup language path (Xpath) and text (Text), and the step corresponding to the higher-priority attribute information fails to acquire the target element, the step corresponding to the attribute information with the next priority may be used. For example, when the first attribute information includes the ID, Tag, Xpath and Text and acquisition according to the ID fails, then, if the browser in which the designated page is opened is an IE browser, the step of acquiring the target element according to the combined information of the Tag and the Text is used next, and so on. Because a failed attempt can be followed by the remaining methods, reliability is high, which further improves the reliability of extracting data from the target table.
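The fallback behaviour described above can be sketched as an ordered chain of locators. This is a minimal sketch under stated assumptions, not the application's implementation: the page is assumed to be well-formed and parsed with Python's standard `xml.etree.ElementTree`, the function names (`by_id`, `by_tag_and_text`, `by_xpath`, `locate`) are hypothetical, "first in order" is simplified to document order, and the Xpath is restricted to the relative subset that ElementTree supports.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment.
PAGE = """
<div>
  <table>
    <tbody>
      <tr><td id="c1">Alice</td><td>30</td></tr>
    </tbody>
  </table>
</div>
"""

def by_id(root, attrs):
    ident = attrs.get("ID")
    if ident is None:
        return None
    return root.find(f".//*[@id='{ident}']")

def by_tag_and_text(root, attrs):
    tag, text = attrs.get("Tag"), attrs.get("Text")
    if tag is None or text is None:
        return None
    for element in root.iter(tag):
        if "".join(element.itertext()).strip() == text:
            return element  # first match in document order (a simplification)
    return None

def by_xpath(root, attrs):
    path = attrs.get("Xpath")
    if path is None:
        return None
    return root.find(path)

def locate(root, attrs, is_ie):
    """Try locators in priority order; fall through to the next on failure."""
    if is_ie:
        chain = [by_id, by_tag_and_text]
    else:
        chain = [by_id, by_xpath, by_tag_and_text]
    for locator in chain:
        found = locator(root, attrs)
        if found is not None:
            return found
    return None

root = ET.fromstring(PAGE)
# The ID is stale, so the chain has to fall through. Xpath and Text deliberately
# point at different cells so that the locator actually used is visible.
attrs = {"ID": "missing", "Tag": "td", "Xpath": "./table/tbody/tr/td[2]", "Text": "Alice"}
ie_result = locate(root, attrs, is_ie=True)
other_result = locate(root, attrs, is_ie=False)
print(ie_result.text, other_result.text)
```

Keeping each locator as a separate function makes it easy to reorder the chain if a different priority is configured.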
In one possible implementation manner, as shown in fig. 2, based on the element identifier of the target element, searching the target table corresponding to the target element in the specified page includes the following steps:
Step S1031, if the element identifier is a first table tag, checking whether the first table tag contains at least one of a header (thead) and a body (tbody).
Step S1032, when the first table tag contains at least one of the header (thead) and the body (tbody), determining the target table according to at least one of the header (thead) and the body (tbody); when the first table tag contains neither the header (thead) nor the body (tbody), determining that the target table corresponding to the element identifier is an empty table.
Specifically, in the case that the first table tag includes a header (thead) and a body (tbody), the target table is determined by a combination of the header (thead) and the body (tbody), that is, a table consisting of the header (thead) and the body (tbody) is a target table to be searched for by the terminal device.
When the first table tag contains only a header (thead), a third table tag is searched for downwards at the same level as the target element according to the page structure of the designated page, the body (tbody) is determined according to the third table tag, and the target table is determined by the combination of the header (thead) and the body (tbody).
Specifically, when the first table tag contains the header (thead) but no body (tbody), the next table tag (the third table tag) is searched for downwards, in the order in which the page is written, within the nesting level where the target element is located. If the third table tag is found and contains a body (the first table body), the first table body is taken as the body (tbody) for the target element; if the third table tag contains only a header (the first table header) and no body, the first table header is taken as the body (tbody). Once the header (thead) and the body (tbody) have been found, the table composed of them is used as the target table.
When the first table tag contains only a body (tbody), a fourth table tag is searched for upwards at the same level as the target element according to the page structure of the designated page, the header (thead) is determined according to the fourth table tag, and the target table is determined by the combination of the header (thead) and the body (tbody).
Specifically, when the first table tag contains only the body (tbody) and no header (thead), the previous table tag (the fourth table tag) is searched for upwards, in the order in which the page is written, within the nesting level where the body (tbody) is located. If the fourth table tag contains a header (the second table header), the second table header is taken as the header (thead); if the fourth table tag contains only a body (the second table body) and no header, the second table body is taken as the header (thead). The table composed of the header (thead) and the body (tbody) obtained in this way is used as the target table.
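The sibling lookup described above can be sketched as follows. This is an illustrative sketch only: it assumes a well-formed page parsed with Python's standard `xml.etree.ElementTree`, it treats "the same level" as siblings under the same parent, and the helper names (`nearest_sibling_table`, `first_present`, `resolve_table`) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: the header and the body live in two sibling <table> tags.
PAGE = """
<div>
  <table id="head"><thead><tr><th>Name</th><th>Age</th></tr></thead></table>
  <table id="body"><tbody><tr><td>Alice</td><td>30</td></tr></tbody></table>
</div>
"""

def nearest_sibling_table(table, parents, step):
    """Return the nearest sibling <table> below (step=+1) or above (step=-1)."""
    parent = parents.get(table)
    if parent is None:
        return None
    siblings = list(parent)
    i = siblings.index(table) + step
    while 0 <= i < len(siblings):
        if siblings[i].tag == "table":
            return siblings[i]
        i += step
    return None

def first_present(table, names):
    """Return the first child of `table` found among `names`, else None."""
    if table is None:
        return None
    for name in names:
        found = table.find(name)
        if found is not None:
            return found
    return None

def resolve_table(table, parents):
    """Return (header, body); (None, None) stands for an empty table."""
    thead, tbody = table.find("thead"), table.find("tbody")
    if thead is not None and tbody is None:
        # Header only: search downwards; prefer the neighbour's tbody,
        # otherwise fall back to its thead as the body.
        below = nearest_sibling_table(table, parents, +1)
        tbody = first_present(below, ["tbody", "thead"])
    elif tbody is not None and thead is None:
        # Body only: search upwards; prefer the neighbour's thead,
        # otherwise fall back to its tbody as the header.
        above = nearest_sibling_table(table, parents, -1)
        thead = first_present(above, ["thead", "tbody"])
    return thead, tbody

root = ET.fromstring(PAGE)
# ElementTree keeps no parent pointers, so build a child -> parent map once.
parents = {child: parent for parent in root.iter() for child in parent}
header, body = resolve_table(root.find("./table[@id='head']"), parents)
print([th.text for th in header.iter("th")])
print([td.text for td in body.iter("td")])
```

When no neighbouring table exists, the sketch returns the part that was found together with `None`, leaving it to the caller to decide how to treat a partial table.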
Step S1033, when the element identifier is not the first table tag, acquiring a parent node element from the level above the target element according to the page structure of the designated page, and, when the element identifier of the parent node element is a second table tag, checking whether the second table tag contains at least one of a header (thead) and a body (tbody).
Specifically, if the element identifier is not a table tag, the parent node element is looked up in the nesting level directly above the target element; if the parent node element is a table tag, the target table is determined from that table tag. If the parent node element is not a table tag and is not the root node, the lookup continues upwards, one parent at a time, until a table tag is reached or the parent node is the root node.
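The upward search can be sketched as a simple loop over ancestors. The sketch assumes a well-formed page parsed with Python's standard `xml.etree.ElementTree`, which keeps no parent pointers, so a child-to-parent map is built first; the sample page and the `enclosing_table` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: one element nested inside a table, one outside any table.
PAGE = """
<html><body>
  <div><table id="t1"><tbody><tr><td><span id="cell">42</span></td></tr></tbody></table></div>
  <p id="loose">not in a table</p>
</body></html>
"""

def enclosing_table(element, parents):
    """Walk upwards from `element` until a <table> is found or the root is passed."""
    node = element
    while node is not None:
        if node.tag == "table":
            return node
        node = parents.get(node)  # None once the root node has been passed
    return None

root = ET.fromstring(PAGE)
parents = {child: parent for parent in root.iter() for child in parent}

inside = enclosing_table(root.find(".//*[@id='cell']"), parents)
outside = enclosing_table(root.find(".//*[@id='loose']"), parents)
print(inside.get("id") if inside is not None else None)
print(outside)
```

The loop also covers the case where the starting element is itself a table tag, since the tag is checked before moving up.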
Step S1034, when the second table tag contains at least one of the header (thead) and the body (tbody), determining the target table according to at least one of the header (thead) and the body (tbody); when the second table tag contains neither the header (thead) nor the body (tbody), determining that the target table corresponding to the element identifier is an empty table.
Specifically, when the second table tag contains at least one of the header (thead) and the body (tbody), the target table is determined in the same way as described above for the case in which the first table tag contains at least one of the header (thead) and the body (tbody); the two descriptions may be referred to each other.
According to the technical solution provided by the embodiment of the application, the table tag is searched for iteratively and the target table is assembled from the header (thead) and the body (tbody) in the table tag, so the target table can be obtained more reliably and the data in it can be extracted.
Fig. 3 shows a data extraction method provided by an embodiment of the present invention, which may be performed by an electronic device, for example, a terminal device, that is, the above-mentioned data extraction method may be performed by hardware or software installed in the terminal device, and the method includes the steps of:
Step S301, obtaining a target element according to the first attribute information of the first cell. The first attribute information is obtained by responding to a selection operation of a first cell in a designated page, wherein the designated page is the page where the target element is located.
Step S302, if acquisition of the target element fails, acquiring the target element through second attribute information given by the iframe element of the designated page.
Specifically, the iframe element is an HTML tag that acts as a document within a document: it creates an inline frame containing another document. When the target element cannot be acquired through the first attribute information, the iframe element gives the second attribute information of a second cell associated with the first cell, and the target element is acquired through it. The second attribute information includes at least one of an identifier (ID), a tag (Tag), an extensible markup language path (Xpath), text (Text), a style (ClassName), a link (Link) and a location (rect).
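One way to picture this fallback is sketched below. The text does not spell out how the iframe supplies the second attribute information, so the sketch rests on loud assumptions: each iframe's document is modelled as a separately parsed tree held in a hypothetical `Page` object, both attribute sets are reduced to an ID, and the second attribute information is resolved inside the iframe documents. Parsing uses Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

@dataclass
class Page:
    """Hypothetical page model: a parsed document plus the documents of its iframes."""
    root: ET.Element
    frames: list = field(default_factory=list)

def find_by_id(root, ident):
    """Return the descendant with the given id, or None."""
    return None if ident is None else root.find(f".//*[@id='{ident}']")

def locate_with_iframe_fallback(page, first_attrs, second_attrs):
    # First attempt: the first attribute information in the page itself.
    found = find_by_id(page.root, first_attrs.get("ID"))
    if found is not None:
        return found
    # Fallback (assumption): the second attribute information is looked up
    # in each embedded iframe document in turn.
    for frame in page.frames:
        found = find_by_id(frame.root, second_attrs.get("ID"))
        if found is not None:
            return found
    return None

outer = ET.fromstring('<body><iframe src="inner.html"/></body>')
inner = ET.fromstring(
    '<body><table><tbody><tr><td id="c7">ok</td></tr></tbody></table></body>'
)
page = Page(outer, [Page(inner)])
cell = locate_with_iframe_fallback(page, {"ID": "c1"}, {"ID": "c7"})
print(cell.text if cell is not None else None)
```

In a real browser automation setting the frame documents would come from the browser rather than from separately parsed strings.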
Step S303, searching a target table corresponding to the target element in the designated page based on the element identification of the target element.
After the target element is successfully acquired according to the second attribute information, the target table is searched according to the element identification of the target element acquired according to the second attribute information. The element identifier of the target element obtained according to the second attribute information may be referred to in the above embodiment, which is not described herein.
Step S305, extracting multiple types of data from the target table.
It should be noted that steps S301, S303 and S305 in this embodiment are implemented in the same or a similar way as the corresponding steps of the above embodiment; the descriptions may be referred to each other and are not repeated here.
According to the data extraction method provided by this embodiment, when the target element cannot be acquired according to the first attribute information, it can be acquired according to the second attribute information, which further improves the reliability of acquiring the target table.
By way of example, an embodiment of the application is described below in connection with fig. 4, in which the following steps are included:
Step S401, acquiring first attribute information of a first cell in a designated page.
Specifically, in the case where the user selects the first cell, the terminal device may obtain the first attribute information of the first cell, and the first attribute information may be referred to the description of the foregoing embodiment, which is not described herein.
Step S402, determining whether the first attribute information contains an identifier (ID); if so, proceeding to step S403; if not, proceeding to step S404.
Step S403, acquiring the target element according to the identifier (ID).
Step S404, determining whether the browser is an IE browser; if so, proceeding to step S405; if not, proceeding to step S406.
Step S405, acquiring the target element according to the combined information of the Tag and the Text.
Step S406, determining whether the first attribute information contains an extensible markup language path (Xpath); if so, proceeding to step S407; if not, proceeding to step S408.
Step S407, acquiring the target element according to the extensible markup language path (Xpath).
Step S408, acquiring the target element according to an Xpath formed from the Tag and the Text.
Step S409, determining whether the target element has been acquired; if so, proceeding to step S410; if not, proceeding to step S411.
Step S410, determining whether the target element is a first table tag; if so, proceeding to step S412; if not, proceeding to step S413.
Step S411, acquiring the target element through the second attribute information given by the iframe element of the designated page; if the target element is acquired, proceeding to step S410; if not, ending.
Step S412, determining whether the target element contains both a header (thead) and a body (tbody); if so, proceeding to step S418; if not, proceeding to step S417.
Step S413, obtaining the parent node element from the previous level of the target element.
Acquiring the parent node element from the level above the target element means searching upwards from the target element, one parent at a time, until a parent node element that is a table tag is found or the parent node element is the root node.
Step S414, determining whether the parent node element is a second table tag; if so, proceeding to step S412; if not, proceeding to step S416.
Step S416, determining whether the parent node is the root node; if so, ending; if not, returning to step S413.
Step S417, determining whether the target element contains only a header (thead); if so, proceeding to step S419; if not, proceeding to step S420.
Step S418, forming the target table from the header (thead) and the body (tbody), extracting data from the target table, and ending.
Step S419, searching downwards for a third table tag at the same level as the target element.
Step S420, determining whether the target element contains only a body (tbody); if so, proceeding to step S421; if not, proceeding to step S422.
Step S421, searching upwards for a fourth table tag at the same level as the target element.
Step S422, the table is empty; ending.
Step S423, determining whether both a header (thead) and a body (tbody) are now available; if so, proceeding to step S418; if not, proceeding to step S424 and step S425.
That is, if the third table tag contains a body and the fourth table tag contains a header, the body in the third table tag is used as the body (tbody) and the header in the fourth table tag is used as the header (thead).
If only a header (thead) is available, the search for a table is repeated upwards; if a new header is found, the original header (thead) is used as the body, and the process proceeds to step S418.
If no table is found, the header (thead) is returned directly as the table data.
If only a body (tbody) is available, the search for a table is repeated downwards; if table data is found, the original body (tbody) is used as the header, and the process proceeds to step S418.
If no table is found, the body (tbody) is returned directly as the table data.
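The two branching stages of the flow can be summarised as small decision functions. This is a sketch for orientation only: the function names and return labels are hypothetical, and each function merely reports which step the flow described above would take next, without locating any element.

```python
def choose_locator(attrs, is_ie):
    """Mirror steps S402 to S408: pick how the target element is acquired."""
    if attrs.get("ID") is not None:      # S402 -> S403
        return "id"
    if is_ie:                            # S404 -> S405
        return "tag+text"
    if attrs.get("Xpath") is not None:   # S406 -> S407
        return "xpath"
    return "xpath-from-tag+text"         # S408

def next_table_step(has_thead, has_tbody):
    """Mirror steps S412, S417, S420 and S422 for a table tag."""
    if has_thead and has_tbody:
        return "S418"  # build the target table and extract data
    if has_thead:
        return "S419"  # header only: search downwards for a third table tag
    if has_tbody:
        return "S421"  # body only: search upwards for a fourth table tag
    return "S422"      # neither: the table is empty

print(choose_locator({"ID": "c1", "Xpath": "./td"}, is_ie=False))
print(choose_locator({"Tag": "td", "Text": "Alice"}, is_ie=False))
print(next_table_step(has_thead=True, has_tbody=False))
```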
According to the technical solution provided by the embodiment of the application, the target table can be determined according to the attribute information of the given cell and the element identifier of the target element corresponding to that attribute information, and multiple types of data are extracted from the target table, so the extracted data types are rich and can meet data analysis requirements. In addition, the target element can be acquired in several ways, and when one way fails the remaining ways can be tried, so reliability is high and the reliability of extracting data from the target table is further improved. Further, the table tag is searched for iteratively and the target table is assembled from the header (thead) and the body (tbody) in the table tag, so the target table can be obtained more reliably and the data in it can be extracted. Finally, when the target element cannot be acquired according to the first attribute information, it can be acquired according to the second attribute information, which further improves the reliability of acquiring the target table.
It should be noted that, in the data extraction method provided in the embodiment of the present application, the execution body may be a data extraction device, or a control module in the data extraction device for executing the data extraction method. In the embodiment of the present application, a data extraction device is used as an example to execute a data extraction method, and the data extraction device provided in the embodiment of the present application is described.
Fig. 5 is a schematic structural diagram of a data extraction device according to an embodiment of the present invention. As shown in fig. 5, the data extraction device 500 includes:
an acquisition module 501, configured to acquire a target element according to first attribute information of a first cell, wherein the first attribute information is acquired in response to a user's selection operation on the first cell in a designated page, and the designated page is the page where the target element is located;
a search module 502, configured to search the designated page for a target table corresponding to the target element based on an element identification of the target element; and
an extraction module 503, configured to extract multiple types of data from the target table.
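The three-module structure described above can be pictured as a simple pipeline. The following is only a minimal sketch: the class name `DataExtractionDevice`, the callable signatures, and the choice to return an empty list on failure are all assumptions made for illustration and do not come from the patent.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class DataExtractionDevice:
    """Hypothetical composition of modules 501, 502 and 503."""

    acquire: Callable[[dict], Optional[Any]]  # stands in for acquisition module 501
    search: Callable[[Any], Optional[Any]]    # stands in for search module 502
    extract: Callable[[Any], list]            # stands in for extraction module 503

    def run(self, first_attribute_info: dict) -> list:
        # Step 1: acquire the target element from the selected cell's attributes.
        element = self.acquire(first_attribute_info)
        if element is None:
            return []  # assumption: no target element means nothing to extract
        # Step 2: locate the target table for that element.
        table = self.search(element)
        if table is None:
            return []  # assumption: a missing table is treated as empty
        # Step 3: extract the data from the target table.
        return self.extract(table)
```

Each module is injected as a plain callable, so the acquisition, search, and extraction strategies can be swapped or tested independently.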
According to the technical scheme disclosed by the embodiment of the application, the target table can be determined according to the attribute information of the given cell and the element identification of the target element corresponding to the attribute information, and a plurality of types of data are extracted from the target table, so that the data types are rich, and the data analysis requirement can be met.
In one possible implementation, the first attribute information includes at least one of an identification ID, a Tag, an extensible markup language path Xpath, and a Text, and the acquisition module 501 is further configured to: acquire the target element according to the identification ID; when the browser in which the designated page is located is an IE browser, acquire the target element according to the combined information of the Tag and the Text; and when the browser in which the designated page is located is not an IE browser, acquire the target element according to the extensible markup language path Xpath or according to the combined information of the Tag and the Text.
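The acquisition strategies above can be sketched as an ordered chain of fallbacks. This is an illustrative sketch only, not the patented implementation: the function name `acquire_target_element`, the dictionary keys (`id`, `tag`, `text`, `xpath`), the `is_ie` flag, and the exact ordering of attempts are assumptions. It uses Python's standard `xml.etree.ElementTree`, which supports only a limited XPath subset and works on well-formed markup rather than a live browser page.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def acquire_target_element(root: ET.Element, info: dict, is_ie: bool) -> Optional[ET.Element]:
    """Try ID first, then browser-dependent fallbacks (hypothetical ordering)."""
    # Strategy 1: look the element up by its identification ID.
    element_id = info.get("id")
    if element_id:
        found = root.find(f".//*[@id='{element_id}']")
        if found is not None:
            return found

    def by_tag_and_text() -> Optional[ET.Element]:
        # Strategy: match on the combination of Tag and Text.
        tag, text = info.get("tag"), info.get("text")
        if not tag or text is None:
            return None
        for candidate in root.iter(tag):
            if (candidate.text or "").strip() == text:
                return candidate
        return None

    if is_ie:
        # Assumption: on IE only the Tag + Text combination is used.
        return by_tag_and_text()

    # Non-IE: try the XPath first, then fall back to Tag + Text.
    xpath = info.get("xpath")
    if xpath:
        found = root.find(xpath)
        if found is not None:
            return found
    return by_tag_and_text()
```

Returning `None` when every strategy fails leaves room for a caller to attempt a further fallback, such as one based on second attribute information.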
In one possible implementation, the acquisition module 501 is further configured to, when acquisition of the target element fails, acquire the target element through the second attribute information given by the iframe element of the designated page.
In one possible implementation, the search module 502 is further configured to:
when the element identification is a first table tag, check whether the first table tag contains at least one of a header (thead) and a table body (tbody);
when the first table tag contains at least one of a header (thead) and a table body (tbody), determine the target table according to at least one of the header (thead) and the table body (tbody);
when the first table tag contains neither a header (thead) nor a table body (tbody), determine that the target table corresponding to the element identification is an empty table;
when the element identification is not the first table tag, acquire the parent node element one level above the target element according to the page structure of the designated page;
when the element identification of the parent node element is a second table tag, check whether the second table tag contains at least one of a header (thead) and a table body (tbody);
when the second table tag contains at least one of a header (thead) and a table body (tbody), determine the target table according to at least one of the header (thead) and the table body (tbody); and
when the second table tag contains neither a header (thead) nor a table body (tbody), determine that the target table corresponding to the element identification is an empty table.
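The table lookup described above amounts to walking from the target element toward the root until a table tag is reached, then checking for thead and tbody. The sketch below is one possible reading, under stated assumptions: the function name `find_table_sections` is hypothetical, the walk is repeated for every ancestor rather than a single level, and `(None, None)` is used to signal both an empty table and the absence of any enclosing table. `xml.etree.ElementTree` keeps no parent pointers, so a parent map is built first.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple

Sections = Tuple[Optional[ET.Element], Optional[ET.Element]]


def find_table_sections(root: ET.Element, target: ET.Element) -> Sections:
    """Return (thead, tbody) of the nearest enclosing table, or (None, None)."""
    # ElementTree elements do not know their parent, so index them once.
    parents = {child: parent for parent in root.iter() for child in parent}

    node: Optional[ET.Element] = target
    while node is not None:
        if node.tag == "table":
            # Either section may be missing; both missing means an empty table.
            return node.find("thead"), node.find("tbody")
        # Not a table tag: move one level up to the parent node element.
        node = parents.get(node)
    return None, None  # assumption: no enclosing table is treated like an empty table
```

Unlike a browser, `ElementTree` does not insert an implicit tbody, so a table written without thead or tbody is reported here as empty.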
In one possible implementation, the search module 502 is further configured to:
when the first table tag or the second table tag contains both a header (thead) and a table body (tbody), determine the target table from the combination of the header (thead) and the table body (tbody);
when the first table tag or the second table tag contains only a header (thead), search downward for a third table tag at the same level as the target element according to the page structure of the designated page, determine a table body (tbody) according to the third table tag, and determine the target table from the combination of the header (thead) and the table body (tbody); and
when the first table tag or the second table tag contains only a table body (tbody), search upward for a fourth table tag at the same level as the target element according to the page structure of the designated page, determine a header (thead) according to the fourth table tag, and determine the target table from the combination of the header (thead) and the table body (tbody).
In one possible implementation, the search module 502 is further configured to determine that the first table body is the table body (tbody) if the third table tag contains a first table body, and to determine that the first header is the table body (tbody) if the third table tag contains a first header.
In one possible implementation, the search module 502 is further configured to determine that the second header is the header (thead) if the fourth table tag contains a second header, and to determine that the second table body is the header (thead) if the fourth table tag contains a second table body.
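The header/body pairing across neighbouring tables can be sketched as follows. This is a hedged illustration rather than the patented implementation: the names `resolve_target_table` and `section` are hypothetical, "same level" is assumed to mean sibling table tags under a common parent, and when no neighbouring table is found the lone section is returned on its own.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple


def section(table: ET.Element, prefer: str, fallback: str) -> Optional[ET.Element]:
    """Return the preferred section of a table, or the fallback section if absent."""
    found = table.find(prefer)
    return found if found is not None else table.find(fallback)


def resolve_target_table(
    parent: ET.Element, table: ET.Element
) -> Tuple[Optional[ET.Element], Optional[ET.Element]]:
    """Return (header, body) for a table, borrowing from sibling tables if needed."""
    thead, tbody = table.find("thead"), table.find("tbody")
    if thead is not None and tbody is not None:
        return thead, tbody  # both present: combine them directly

    siblings = [el for el in parent if el.tag == "table"]
    index = siblings.index(table)

    if thead is not None:
        # Only a header: search downward for a "third" table tag.
        for third in siblings[index + 1:]:
            # Its tbody is used as the body; failing that, its thead is.
            body = section(third, "tbody", "thead")
            if body is not None:
                return thead, body
        return thead, None  # nothing found: the header alone is the table data

    if tbody is not None:
        # Only a body: search upward for a "fourth" table tag.
        for fourth in reversed(siblings[:index]):
            # Its thead is used as the header; failing that, its tbody is.
            header = section(fourth, "thead", "tbody")
            if header is not None:
                return header, tbody
        return None, tbody  # nothing found: the body alone is the table data

    return None, None  # neither section: empty table
```

This layout matches pages that render a fixed header and a scrollable body as two separate table tags placed next to each other.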
In one possible implementation, the extraction module 503 is further configured to extract at least one row or at least one column of data in the target table, where the data includes at least one of an identification ID, a Tag, an extensible markup language path Xpath, a Text, a style ClassName, a Link, and a position rect.
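Row and column extraction with several data types per cell might look like the sketch below. The names `CellData`, `extract_rows`, and `extract_column` are hypothetical, and the XPath is built positionally from a caller-supplied base path. The position rect is deliberately omitted, because it depends on a rendering engine that the standard-library parser used here does not provide.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CellData:
    id: Optional[str]          # identification ID
    tag: str                   # Tag
    xpath: str                 # positional Xpath relative to base_xpath
    text: str                  # Text
    class_name: Optional[str]  # style ClassName
    link: Optional[str]        # Link (href of the first anchor, if any)


def extract_rows(section: ET.Element, base_xpath: str) -> List[List[CellData]]:
    """Collect per-cell data for every tr directly under a thead or tbody."""
    rows: List[List[CellData]] = []
    for r, row in enumerate(section.findall("tr"), start=1):
        cells: List[CellData] = []
        position: Dict[str, int] = {}  # 1-based index per cell tag within the row
        for cell in row:
            if cell.tag not in ("td", "th"):
                continue
            position[cell.tag] = position.get(cell.tag, 0) + 1
            anchor = cell.find(".//a")
            cells.append(CellData(
                id=cell.get("id"),
                tag=cell.tag,
                xpath=f"{base_xpath}/tr[{r}]/{cell.tag}[{position[cell.tag]}]",
                text="".join(cell.itertext()).strip(),
                class_name=cell.get("class"),
                link=anchor.get("href") if anchor is not None else None,
            ))
        rows.append(cells)
    return rows


def extract_column(rows: List[List[CellData]], index: int) -> List[CellData]:
    """Pick one column, skipping rows that are too short."""
    return [row[index] for row in rows if index < len(row)]
```

Extracting a whole row is then `rows[i]`, and a column is obtained with `extract_column(rows, j)`.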
The data extraction device 500 in the embodiment of the present application may be a device, or may be a component, an integrated circuit, or a chip in a terminal. The device may be a mobile electronic device or a non-mobile electronic device. By way of example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a Personal Digital Assistant (PDA), etc., and the non-mobile electronic device may be a server, a network attached storage (Network Attached Storage, NAS), a personal computer (personal computer, PC), a Television (TV), a teller machine, a self-service machine, etc., and the embodiments of the present application are not limited in particular.
The data extraction device 500 in the embodiment of the present application may be a device having an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system, and the embodiment of the present application is not specifically limited in this respect.
The data extraction device 500 provided in the embodiment of the present application can implement each process implemented in the method embodiments of fig. 1 to 4, and in order to avoid repetition, a description is omitted here.
Optionally, as shown in fig. 6, the embodiment of the present application further provides an electronic device 600, including a processor 602, a memory 601, and a program or an instruction stored in the memory 601 and capable of being executed on the processor 602, where the program or the instruction implements each process of the embodiment of the data extraction method when executed by the processor 602, and the process can achieve the same technical effect, and for avoiding repetition, a description is omitted herein.
The electronic device in the embodiment of the application includes the mobile electronic device and the non-mobile electronic device.
The present embodiment can implement the processes of the above embodiments of the data extraction method, and achieve the same technical effects, and in order to avoid repetition, a detailed description is omitted here.
The embodiment of the application also provides a readable storage medium, and the readable storage medium stores a program or an instruction, which when executed by a processor, implements each process of the above embodiment of the data extraction method, and can achieve the same technical effects, so that repetition is avoided, and no further description is given here.
The processor is a processor in the electronic device in the above embodiment. Readable storage media include computer readable storage media such as Read-Only Memory (ROM), random access Memory (Random Access Memory, RAM), magnetic or optical disks, and the like.
The embodiment of the application further provides a chip, the chip comprises a processor and a communication interface, the communication interface is coupled with the processor, the processor is used for running programs or instructions, the processes of the embodiment of the data extraction method can be realized, the same technical effects can be achieved, and the repetition is avoided, and the description is omitted here.
It should be understood that the chips referred to in the embodiments of the present application may also be referred to as system-level chips, system chips, chip systems, or system-on-chip chips, etc.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Furthermore, it should be noted that the scope of the methods and apparatus in the embodiments of the present application is not limited to performing the functions in the order shown or discussed, but may also include performing the functions in a substantially simultaneous manner or in an opposite order depending on the functions involved, e.g., the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are to be protected by the present application.

Claims (9)

1. A data extraction method, comprising:
Acquiring a target element according to first attribute information of a first cell, wherein the first attribute information is acquired by responding to a selection operation of a user on the first cell in a designated page, and the designated page is the page where the target element is located;
searching a target table corresponding to the target element in the designated page based on the element identification of the target element;
extracting a plurality of types of data from the target table;
wherein the searching a target table corresponding to the target element in the designated page based on the element identification of the target element comprises:
in the case that the element identification is a first table tag, searching whether the first table tag contains at least one of a header thead and a table body tbody;
in the case that at least one of the header thead and the table body tbody is included in the first table tag, determining the target table according to at least one of the header thead and the table body tbody;
in the case that at least one of the header thead and the table body tbody is not included in the first table tag, determining that the target table corresponding to the element identification is an empty table;
in the case that the element identification is not the first table tag, acquiring a parent node element one level above the target element according to the page structure of the designated page;
in the case that the element identification of the parent node element is a second table tag, searching whether the second table tag contains at least one of a header thead and a table body tbody;
in the case that at least one of the header thead and the table body tbody is included in the second table tag, determining the target table according to at least one of the header thead and the table body tbody;
in the case that at least one of the header thead and the table body tbody is not included in the second table tag, determining that the target table corresponding to the element identification is an empty table.
2. The data extraction method according to claim 1, wherein the first attribute information includes at least one of an identification ID, a Tag, an extensible markup language path Xpath, and a Text, and the acquiring a target element according to first attribute information of a first cell comprises one of:
acquiring the target element according to the identification ID;
in the case that the browser where the designated page is located is an IE browser, acquiring the target element according to the combined information of the Tag and the Text;
in the case that the browser where the designated page is located is not the IE browser, acquiring the target element according to the extensible markup language path Xpath or acquiring the target element according to the combined information of the Tag and the Text.
3. The data extraction method according to claim 1, wherein after the target element is acquired from the first attribute information of the first cell, the method further comprises:
in the case that the acquisition of the target element fails, acquiring the target element through the second attribute information given by the iframe element of the designated page.
4. The method of claim 1, wherein said determining the target table from at least one of the header thead and the body tbody comprises:
in the case that the first table tag or the second table tag includes a header thead and a table body tbody, determining the target table by a combination of the header thead and the table body tbody;
in the case that the first table tag or the second table tag includes a header thead, searching a third table tag downwards at the same level of the target element according to the page structure of the designated page, determining a table body tbody according to the third table tag, and determining the target table through the combination of the header thead and the table body tbody;
in the case that the first table tag or the second table tag includes a table body tbody, searching a fourth table tag at the same level of the target element according to the page structure of the designated page, determining a header thead according to the fourth table tag, and determining the target table through a combination of the header thead and the table body tbody.
5. The data extraction method according to claim 4, wherein the determining the table tbody according to the third table label includes:
determining that the first table body is the table body tbody when the third table tag contains a first table body, and determining that the first header is the table body tbody when the third table tag contains a first header;
and the determining the header thead according to the fourth table tag includes:
determining that the second header is the header thead when the fourth table tag contains a second header, and determining that the second table body is the header thead when the fourth table tag contains a second table body.
6. The method of claim 1, wherein the extracting a plurality of types of data from the target table comprises:
extracting at least one row or at least one column of data in the target table, wherein the types of the data comprise an identification ID, a Tag, an extensible markup language path Xpath, a Text, a style ClassName, a Link and a position rect.
7. A data extraction apparatus, comprising:
the acquisition module is used for acquiring a target element according to first attribute information of a first cell, wherein the first attribute information is acquired by responding to a selection operation of a user on the first cell in a designated page, and the designated page is the page where the target element is located;
The searching module is used for searching a target table corresponding to the target element in the designated page based on the element identification of the target element;
An extraction module for extracting a plurality of types of data from the target table;
the searching module is further configured to:
in the case that the element identification is a first table tag, search whether the first table tag contains at least one of a header thead and a table body tbody;
in the case that at least one of the header thead and the table body tbody is included in the first table tag, determine the target table according to at least one of the header thead and the table body tbody;
in the case that at least one of the header thead and the table body tbody is not included in the first table tag, determine that the target table corresponding to the element identification is an empty table;
in the case that the element identification is not the first table tag, acquire a parent node element one level above the target element according to the page structure of the designated page;
in the case that the element identification of the parent node element is a second table tag, search whether the second table tag contains at least one of a header thead and a table body tbody;
in the case that at least one of the header thead and the table body tbody is included in the second table tag, determine the target table according to at least one of the header thead and the table body tbody;
in the case that at least one of the header thead and the table body tbody is not included in the second table tag, determine that the target table corresponding to the element identification is an empty table.
8. An electronic device comprising a processor, a memory and a program or instruction stored on the memory and executable on the processor, which program or instruction when executed by the processor implements the steps of the method of any of claims 1-6.
9. A readable storage medium having stored thereon a program or instructions which when executed by a processor perform the steps of the method according to any of claims 1-6.
CN202111647593.7A 2021-12-29 2021-12-29 Data extraction method, device, electronic device and storage medium Active CN114491360B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111647593.7A CN114491360B (en) 2021-12-29 2021-12-29 Data extraction method, device, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN114491360A CN114491360A (en) 2022-05-13
CN114491360B true CN114491360B (en) 2024-12-03

Family

ID=81507539

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111647593.7A Active CN114491360B (en) 2021-12-29 2021-12-29 Data extraction method, device, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN114491360B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116127229A (en) * 2023-01-31 2023-05-16 江苏银承网络科技股份有限公司 JSOUP-based data processing method, medium and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108874977A (en) * 2018-06-08 2018-11-23 东软集团股份有限公司 Page data extracting method, device, storage medium and electronic equipment

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1430420A2 (en) * 2001-05-31 2004-06-23 Lixto Software GmbH Visual and interactive wrapper generation, automated information extraction from web pages, and translation into xml
CN112052368B (en) * 2020-08-11 2024-04-19 北京新橙科技有限公司 Method, system, storage medium and electronic device for automatically extracting list data
CN112395418B (en) * 2020-11-26 2021-09-03 上海携宁计算机科技股份有限公司 Method and device for extracting target object in webpage and electronic equipment

Also Published As

Publication number Publication date
CN114491360A (en) 2022-05-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant