Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are some, but not all, of the embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
The data extraction method, device, electronic equipment, and storage medium provided by the embodiments of the application are described in detail below through specific embodiments and their application scenarios, with reference to the accompanying drawings.
Fig. 1 shows a data extraction method provided by an embodiment of the present invention, which may be performed by an electronic device, for example, a terminal device, that is, the above-mentioned data extraction method may be performed by hardware or software installed in the terminal device, and the method includes the steps of:
Step S101, acquiring a target element according to first attribute information of a first cell, wherein the first attribute information is acquired by responding to a selection operation of a user on the first cell in a designated page, and the designated page is the page where the target element is located.
Specifically, the designated page is the page where the target element corresponding to the first attribute information of the first cell is located, and the first cell is at least one cell selected by the user in the designated page. The selection operation may be a box selection of an area that includes at least one cell; in this case, the terminal device may, in response to the selection operation, determine the cells in the area as the first cell. The first attribute information is the data in the first cell, and the data in the first cell has a corresponding markup element.
For example, the designated page may be a World Wide Web (Web) page, and the elements of the Web page may include menu bars, link windows, input boxes, icons, buttons, text boxes, text, pictures, Web tables, and the like. A Web table is divided into a header and a body, and one table is formed by a combination of the elements "<table>", "<thead>", and "<tbody>". Here, "<table>" defines the HTML table and supports the HTML global attributes, "<thead>" defines the header, and "<tbody>" defines the body.
In a Web table, the element "<tr>" (table row) defines a row of the table, the element "<td>" (table data) defines a cell, and the element "<th>" (table header) defines a header cell. The smallest content container in a Web table is the cell: content is added to a cell through the element "<td>", that is, the content of each cell is written inside a "<td>" element. A row is defined by the element "<tr>" and may contain a plurality of cells; for example, four "<td>" elements written inside one "<tr>" element represent a row of four cells. The element "<th>" defines a header cell in the table, usually at the beginning of a row or column.
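The relationship between these elements can be illustrated with a minimal sketch. It assumes a small, well-formed table snippet and uses Python's standard xml.etree.ElementTree parser in place of a browser; the sample markup, its values, and the function name read_table are hypothetical and are not part of the described method.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup; the values are illustrative only.
SAMPLE_TABLE = """
<table id="orders">
  <thead>
    <tr><th>ID</th><th>Name</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>Alice</td></tr>
    <tr><td>2</td><td>Bob</td></tr>
  </tbody>
</table>
""".strip()


def read_table(markup):
    """Return (header_cells, body_rows) for a well-formed table snippet."""
    table = ET.fromstring(markup)
    # <th> cells inside <thead> form the header.
    header_cells = [th.text for th in table.findall("./thead/tr/th")]
    # Each <tr> inside <tbody> is one row; each <td> is one cell.
    body_rows = [
        [td.text for td in tr.findall("td")]
        for tr in table.findall("./tbody/tr")
    ]
    return header_cells, body_rows


header_cells, body_rows = read_table(SAMPLE_TABLE)
```

Here the header and body are read separately, which mirrors the division of a Web table into "<thead>" and "<tbody>".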
More specifically, the data types in the cells include, but are not limited to, at least one of an identifier (ID), a name (Name), a tag (Tag), an extensible markup language path (Xpath), text (Text), a style (ClassName), a link (Link), a location (rect), and an internal element (child). An internal element is a child element, such as at least one of a layer tag (div), a label tag (label), and an inline tag (span).
The identifier (ID) specifies a unique identifier of the element; the name (Name) specifies the name of the element; the style (ClassName) specifies the style class of the element; the tag (Tag) specifies the tag content of the element; the link (Link) specifies the link address of the element; and the text (Text) specifies the text inside the element. The extensible markup language path (Xpath) can be used to test whether a node in a document matches a given pattern. XPath provides a rich set of functions and flexibly supports matching on multiple attributes; here it serves as a positioning feature value derived from the attributes of the element.
Notably, the Web page may be a page of a Web application opened in a browser, where the browser may be Google Chrome, Internet Explorer (IE), Firefox, or the like.
Step S103, searching a target table corresponding to the target element in the designated page based on the element identification of the target element.
Specifically, each element carries a tag that determines the type of the element.
Step S105, extracting a plurality of types of data from the target table.
In particular, extracting the plurality of types of data from the target table includes extracting at least one row or column of data from the target table, that is, full table data may be extracted from the target table, or one column to multiple columns or one row to multiple rows of data may be extracted from the target table. For the type of data, it may include, but is not limited to, identification number (ID), name (Name), style (ClassName), extensible markup language path (Xpath), tag (Tag), link (Link), and Text (Text). It should be noted that the type of data may be other types of data, and embodiments of the present application are not limited herein.
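As an illustration of extracting the full table, one or more rows, or one or more columns, the following minimal sketch reads every row of a target table and then selects rows or columns by index. It assumes the target table is available as well-formed markup parsed with Python's standard library; the sample data and the helper names table_rows and select are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical target table; the values are illustrative only.
TARGET_TABLE = ET.fromstring(
    "<table>"
    "<thead><tr><th>ID</th><th>Name</th><th>Link</th></tr></thead>"
    "<tbody>"
    "<tr><td>1</td><td>Alice</td><td>/users/1</td></tr>"
    "<tr><td>2</td><td>Bob</td><td>/users/2</td></tr>"
    "</tbody>"
    "</table>"
)


def table_rows(table):
    """Return every row of the table as a list of cell texts."""
    rows = []
    for tr in table.iter("tr"):
        cells = [c for c in tr if c.tag in ("td", "th")]
        rows.append([(c.text or "").strip() for c in cells])
    return rows


def select(rows, row_indexes=None, column_indexes=None):
    """Keep only the requested rows and columns; None keeps everything."""
    picked = rows if row_indexes is None else [rows[i] for i in row_indexes]
    if column_indexes is None:
        return picked
    return [[row[j] for j in column_indexes] for row in picked]


all_rows = table_rows(TARGET_TABLE)                  # full table data
name_column = select(all_rows, column_indexes=[1])   # one column
first_body_row = select(all_rows, row_indexes=[1])   # one row
```

Only the cell text is collected in this sketch; other data types such as the link or the style would be read from the corresponding element attributes in the same loop.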
According to the data extraction method provided by the embodiment of the invention, the target table can be determined according to the attribute information of the given unit cell and the element identification of the target element corresponding to the attribute information, and a plurality of types of data are extracted from the target table, so that the data types are rich, and the data analysis requirement can be met.
In one implementation, the first attribute information includes at least one of an identifier (ID), a tag (Tag), an extensible markup language path (Xpath), and a text (Text), and acquiring the target element according to the first attribute information of the first cell includes one of the following: acquiring the target element according to the identifier (ID); acquiring the target element according to combined information of the Tag and the Text when the browser in which the designated page is opened is an IE browser; and, when the browser in which the designated page is opened is not an IE browser, acquiring the target element according to the extensible markup language path (Xpath) or according to combined information of the Tag and the Text.
Specifically, acquiring the target element according to the combined information of the Tag and the Text includes obtaining the elements in the designated page that match the combination of the Tag and the Text, and taking the first-ranked of these elements as the target element. The matching elements may be ordered according to marking time and type priority; of course, other ways of selecting the target element are also possible.
More specifically, when the first attribute information includes only one of the identifier (ID), the tag (Tag), the extensible markup language path (Xpath), and the text (Text), the corresponding step above may be applied to acquire the target element. When the first attribute information includes at least two of them, the step to be adopted is determined by setting a priority for the attribute information. For example, the priorities of the identifier (ID), the tag (Tag), the extensible markup language path (Xpath), and the text (Text) are set in descending order. In that case, when the first attribute information includes all four items, the target element is preferentially acquired according to the identifier (ID). When the first attribute information includes the tag (Tag), the extensible markup language path (Xpath), and the text (Text) but not the identifier (ID), the target element is acquired according to the combined information of the Tag and the Text if the browser in which the designated page is opened is an IE browser, and according to the extensible markup language path (Xpath) otherwise. Because the target element can be acquired in several ways, reliability is high, which further improves the reliability of extracting data from the target table.
In another possible implementation, when the first attribute information includes at least two of the identifier (ID), the tag (Tag), the extensible markup language path (Xpath), and the text (Text), if the step corresponding to the attribute information with the higher priority fails to acquire the target element, the step corresponding to the attribute information with the next priority may be adopted. For example, when the first attribute information includes the identifier (ID), the tag (Tag), the extensible markup language path (Xpath), and the text (Text), if acquiring the target element according to the identifier (ID) fails and the browser in which the designated page is opened is an IE browser, the step of "acquiring the target element according to the combined information of the Tag and the Text" is then adopted, and so on. In this way, the target element is acquired in several ways, and when one way fails, the target element can be acquired again in another way, so reliability is high, which further improves the reliability of extracting data from the target table.
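The priority order with fallback described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the page is a small tree parsed with Python's standard library, the limited XPath subset of xml.etree.ElementTree stands in for browser XPath, the attribute information is a plain dictionary, and all function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical page content; the values are illustrative only.
PAGE = ET.fromstring(
    "<body><div><table><tbody>"
    "<tr><td id='c1'>alpha</td><td>beta</td></tr>"
    "</tbody></table></div></body>"
)


def by_id(page, info):
    """Look the element up by its ID, if the ID is given."""
    ident = info.get("id")
    if ident is None:
        return None
    return page.find(f".//*[@id='{ident}']")


def by_tag_and_text(page, info):
    """Return the first element whose tag and text both match."""
    tag, text = info.get("tag"), info.get("text")
    if tag is None or text is None:
        return None
    for element in page.iter(tag):
        if (element.text or "").strip() == text:
            return element
    return None


def by_xpath(page, info):
    """Look the element up by its XPath, if one is given."""
    xpath = info.get("xpath")
    return page.find(xpath) if xpath else None


def acquire_target(page, info, is_ie):
    """Try each strategy in priority order until one finds an element."""
    strategies = [by_id]
    if is_ie:
        strategies.append(by_tag_and_text)
    else:
        strategies += [by_xpath, by_tag_and_text]
    for strategy in strategies:
        element = strategy(page, info)
        if element is not None:
            return element
    return None


info = {"id": "missing", "tag": "td", "text": "beta", "xpath": ".//tr/td[1]"}
found_in_ie = acquire_target(PAGE, info, is_ie=True)
found_elsewhere = acquire_target(PAGE, info, is_ie=False)
```

In this sample the ID lookup fails, so the IE path falls back to the Tag and Text combination, while the non-IE path falls back to the XPath first.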
In one possible implementation manner, as shown in fig. 2, based on the element identifier of the target element, searching the target table corresponding to the target element in the specified page includes the following steps:
Step S1031, if the element identifier is the first table label, determining whether the first table label contains at least one of a table header (thead) and a table body (tbody).
Step S1032, when the first table label includes at least one of the header (thead) and the body (tbody), determining the target table according to the at least one of the header (thead) and the body (tbody); and when the first table label includes neither the header (thead) nor the body (tbody), determining that the target table corresponding to the element identifier is an empty table.
Specifically, in the case that the first table tag includes a header (thead) and a body (tbody), the target table is determined by a combination of the header (thead) and the body (tbody), that is, a table consisting of the header (thead) and the body (tbody) is a target table to be searched for by the terminal device.
When the first table label includes only a header (thead), a third table label is searched for downward at the same level as the target element according to the page structure of the designated page, a table body (tbody) is determined according to the third table label, and the target table is determined by the combination of the header (thead) and the table body (tbody).
Specifically, when the first table label contains the header (thead) but not the table body (tbody), the next table label (the third table label) is searched for downward, following the writing logic of the page, within the same nesting layer as the target element. If the third table label is found and contains a table body (the first table body), the first table body is determined to be the table body (tbody) of the target table; if the third table label contains only a header (the first header) and no table body, the first header is determined to be the table body (tbody) of the target table. After the header (thead) and the body (tbody) are found, the table consisting of the header (thead) and the body (tbody) is used as the target table.
When the first table label includes only a table body (tbody), a fourth table label is searched for upward at the same level as the target element according to the page structure of the designated page, a header (thead) is determined according to the fourth table label, and the target table is determined by the combination of the header (thead) and the table body (tbody).
Specifically, when the first table label contains only the table body (tbody) and not the header (thead), the next table label upward (the fourth table label) is searched for, following the writing logic of the page, within the same nesting layer as the table body (tbody). When the fourth table label contains a header (the second header), the second header is determined to be the header (thead); when the fourth table label contains only a table body (the second table body) and no header, the second table body is determined to be the header (thead). The table composed of the header (thead) and the body (tbody) thus obtained is used as the target table.
Step S1033, when the element identifier is not the first table label, acquiring a parent node element from the level above the target element according to the page structure of the designated page, and when the element identifier of the parent node element is the second table label, determining whether the second table label contains at least one of a table header (thead) and a table body (tbody).
Specifically, if the element identifier is not a table label, a parent node element is searched for in the nesting layer one level above the target element; if the parent node element is a table label (the second table label), the target table is determined according to that table label. If the parent node element is not a table label and is not the root node, the next parent node is acquired from the level above, and this is repeated until a table label is acquired or the parent node is the root node.
Step S1034, when the second table label includes at least one of the header (thead) and the body (tbody), determining the target table according to the at least one of the header (thead) and the body (tbody); and when the second table label includes neither the header (thead) nor the body (tbody), determining that the target table corresponding to the element identifier is an empty table.
Specifically, for determining the target table according to at least one of the header (thead) and the body (tbody) when the second table label includes at least one of them, reference may be made to the corresponding description for the first table label above.
According to the technical scheme provided by the embodiment of the application, the target table can be obtained more reliably by circularly searching the table label and acquiring the target table by the table header (thead) and the table body (tbody) in the table label, so that the data in the target table can be extracted.
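The search for the table label and the determination of the header and body described above can be sketched as follows. This is a minimal illustration assuming a well-formed page parsed with Python's standard library; because xml.etree.ElementTree elements do not store a parent pointer, a parent map is built first. The sample page and all function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_parent_map(root):
    """Map every element to its parent (ElementTree keeps no parent link)."""
    return {child: parent for parent in root.iter() for child in parent}


def nearest_table(target, parent_of):
    """Climb from the target element until a <table> or the root is passed."""
    node = target
    while node is not None and node.tag != "table":
        node = parent_of.get(node)
    return node  # None means no table label was found


def sibling_tables(table, parent_of, downward):
    """Tables at the same level, nearest first, below or above `table`."""
    parent = parent_of.get(table)
    if parent is None:
        return []
    siblings = list(parent)
    index = siblings.index(table)
    candidates = siblings[index + 1:] if downward else siblings[:index][::-1]
    return [s for s in candidates if s.tag == "table"]


def resolve_table(table, parent_of):
    """Return (header, body), or None when the table is empty."""
    thead, tbody = table.find("thead"), table.find("tbody")
    if thead is None and tbody is None:
        return None  # empty table
    if tbody is None:
        # Header only: take the body from the next table below.
        for other in sibling_tables(table, parent_of, downward=True):
            part = other.find("tbody")
            tbody = part if part is not None else other.find("thead")
            break
    elif thead is None:
        # Body only: take the header from the next table above.
        for other in sibling_tables(table, parent_of, downward=False):
            part = other.find("thead")
            thead = part if part is not None else other.find("tbody")
            break
    return thead, tbody


# Hypothetical page: header and body are written as two sibling tables.
PAGE = ET.fromstring(
    "<body><div>"
    "<table id='h'><thead><tr><th>A</th></tr></thead></table>"
    "<table id='b'><tbody><tr><td>1</td></tr></tbody></table>"
    "</div><p>x</p></body>"
)
parents = build_parent_map(PAGE)
table = nearest_table(PAGE.find(".//td"), parents)
header, body = resolve_table(table, parents)
```

When no sibling table is found, this sketch returns the lone header or body with None for the missing part; how that case is reported is a design choice of the sketch.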
Fig. 3 shows a data extraction method provided by an embodiment of the present invention, which may be performed by an electronic device, for example, a terminal device, that is, the above-mentioned data extraction method may be performed by hardware or software installed in the terminal device, and the method includes the steps of:
Step S301, obtaining a target element according to the first attribute information of the first cell. The first attribute information is obtained by responding to a selection operation of a first cell in a designated page, wherein the designated page is the page where the target element is located.
Step S302, in the case of failure in acquisition of the target element, acquiring the target element through the second attribute information given by the iframe element of the designated page.
Specifically, the iframe element is an HTML tag that acts as a document within a document: it creates an inline frame containing another document. When the target element cannot be acquired through the first attribute information, the target element is acquired through the second attribute information, given by the iframe element, of a second cell associated with the first cell. The second attribute information includes at least one of an identifier (ID), a tag (Tag), an extensible markup language path (Xpath), a text (Text), a style (ClassName), a link (Link), and a location (rect).
Step S303, searching a target table corresponding to the target element in the designated page based on the element identification of the target element.
After the target element is successfully acquired according to the second attribute information, the target table is searched for according to the element identifier of that target element. For the element identifier of the target element acquired according to the second attribute information, reference may be made to the above embodiment, and details are not repeated here.
Step S305, extracting a plurality of types of data from the target table.
It should be noted that, in this embodiment, steps S301, S303, and S305 are implemented in the same or a similar manner as the corresponding steps of the above embodiment; reference may be made to that description, which is not repeated here.
According to the data extraction method provided by the embodiment of the invention, when the target element cannot be acquired according to the first attribute information, the target element can be acquired according to the second attribute information, which further improves the reliability of acquiring the target table.
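The fallback to the second attribute information can be sketched as follows. The description does not detail how the iframe element supplies the second attribute information, so this minimal sketch makes several assumptions: the documents inside the iframes are already parsed and passed in as a list, both sets of attribute information are plain dictionaries, and lookup is by ID only. All names and sample values are hypothetical.

```python
import xml.etree.ElementTree as ET


def find_by_id(document, info):
    """Look an element up by ID in one parsed document."""
    ident = info.get("id")
    if ident is None:
        return None
    return document.find(f".//*[@id='{ident}']")


def acquire_with_fallback(top_document, frame_documents, first_info, second_info):
    """Try the first attribute information, then the second one in each frame."""
    element = find_by_id(top_document, first_info)
    if element is not None:
        return element
    # Assumption: the second attribute information refers to content that
    # lives inside an iframe document rather than in the top-level page.
    for frame_document in frame_documents:
        element = find_by_id(frame_document, second_info)
        if element is not None:
            return element
    return None


# Hypothetical documents; the values are illustrative only.
TOP = ET.fromstring("<body><iframe src='inner.html'/></body>")
INNER = ET.fromstring(
    "<body><table><tbody><tr><td id='cell-7'>42</td></tr></tbody></table></body>"
)
found = acquire_with_fallback(TOP, [INNER], {"id": "cell-1"}, {"id": "cell-7"})
```

Once an element is returned by the fallback, the table search proceeds from it in the same way as for an element found through the first attribute information.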
By way of example, an embodiment of the application is described below in connection with fig. 4, in which the following steps are included:
Step S401, acquiring first attribute information of a first cell in a designated page.
Specifically, in the case where the user selects the first cell, the terminal device may obtain the first attribute information of the first cell, and the first attribute information may be referred to the description of the foregoing embodiment, which is not described herein.
Step S402, determining whether the first attribute information contains the identifier (ID); if so, proceeding to step S403, and if not, proceeding to step S404.
Step S403, acquiring the target element according to the identifier (ID).
Step S404, determining whether the browser is an IE browser; if so, proceeding to step S405, and if not, proceeding to step S406.
Step S405, acquiring the target element according to the combined information of the Tag and the Text.
Step S406, determining whether the first attribute information contains an extensible markup language path (Xpath); if so, proceeding to step S407, and if not, proceeding to step S408.
Step S407, obtaining the target element according to the extensible markup language path (Xpath).
Step S408, acquiring the target element according to an Xpath formed from the Tag and the Text.
Step S409, determining whether the target element has been acquired; if so, proceeding to step S410, and if not, proceeding to step S411.
Step S410, determining whether the target element is the first table label; if so, proceeding to step S412, and if not, proceeding to step S413.
Step S411, acquiring the target element through the second attribute information given by the iframe element of the designated page; when the target element is acquired, proceeding to step S410, and when the target element is not acquired, ending the process.
Step S412, determining whether the target element contains both the header (thead) and the body (tbody); if so, proceeding to step S418, and if not, proceeding to step S417.
Step S413, obtaining the parent node element from the previous level of the target element.
Acquiring the parent node element from the level above the target element means searching upward from the target element, level by level, for a parent node element; the loop ends when the parent node element is the root node or when a parent node element that is a table label is found.
Step S414, if the parent node element is the second table label, the step S412 is entered, if not, the step S416 is entered.
Step S416, determining whether the parent node is the root node; if so, ending the process, and if not, returning to step S413.
Step S417, determining whether the target element contains only a header (thead); if so, proceeding to step S419, and if not, proceeding to step S420.
Step S418, forming a target table by the table head (thead) and the table body (tbody), extracting data from the target table and ending.
Step S419, looking up a third table label downwards at the same level of the target element.
Step S420, if only the table body (tbody) exists in the target element, the step S421 is entered, and if not, the step S422 is entered.
Step S421, searching for a fourth table label upward at the same level as the target element.
Step S422, the table is empty and ends.
Step S423, if both the header (thead) and the body (tbody) are present, the process proceeds to step S418, and if not, the process proceeds to step S424 and step S425.
That is, if the third table tag contains a table body and the fourth table tag contains a table header, the table body in the third table tag is set as a table body (tbody), and the table header in the fourth table tag is set as a table header (thead).
If only a header (thead) is present, a table is searched for upward again; if a new header is found, the original header (thead) is used as the table body, and the process proceeds to step S418.
If no table is found, the header (thead) is returned directly as the table data.
If only a table body (tbody) is present, a table is searched for downward again; if table data is found, the original table body (tbody) is used as the header, and the process proceeds to step S418.
If no table is found, the table body (tbody) is returned directly as the table data.
According to the technical scheme provided by the embodiment of the application, the target table can be determined according to the attribute information of the given cell and the element identifier of the target element corresponding to the attribute information, and a plurality of types of data are extracted from the target table, so that the data types are rich and the data analysis requirement can be met. In addition, the target element is acquired in several ways, and when one way fails, the target element can be acquired again in another way, so reliability is high, which further improves the reliability of extracting data from the target table. Further, the table label is searched for in a loop, and the target table is acquired from the header (thead) and the body (tbody) in the table label, so that the target table can be acquired more reliably and the data in the target table can be extracted. Finally, when the target element cannot be acquired according to the first attribute information, the target element can be acquired according to the second attribute information, which further improves the reliability of acquiring the target table.
It should be noted that, in the data extraction method provided in the embodiment of the present application, the execution body may be a data extraction device, or a control module in the data extraction device for executing the data extraction method. In the embodiment of the present application, a data extraction device is used as an example to execute a data extraction method, and the data extraction device provided in the embodiment of the present application is described.
Fig. 5 is a schematic structural diagram of a data extraction device according to an embodiment of the present invention. As shown in fig. 5, the data extraction device 500 includes:
An acquisition module 501, configured to acquire a target element according to first attribute information of a first cell, where the first attribute information is acquired in response to a user's selection operation on the first cell in a designated page, and the designated page is the page where the target element is located; a search module 502, configured to search the designated page for a target table corresponding to the target element based on an element identifier of the target element; and an extraction module 503, configured to extract multiple types of data from the target table.
According to the technical scheme disclosed by the embodiment of the application, the target table can be determined according to the attribute information of the given cell and the element identification of the target element corresponding to the attribute information, and a plurality of types of data are extracted from the target table, so that the data types are rich, and the data analysis requirement can be met.
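The division of the device into three modules can be illustrated with a minimal sketch in which each module is a callable supplied from outside. The class name, field names, and stub functions below are hypothetical and serve only to show how the modules hand results to one another.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class DataExtractionDevice:
    """Three modules chained in order: acquire, then search, then extract."""
    acquire: Callable[[dict], Optional[Any]]   # plays the role of module 501
    search: Callable[[Any], Optional[Any]]     # plays the role of module 502
    extract: Callable[[Any], list]             # plays the role of module 503

    def run(self, first_attribute_info):
        target = self.acquire(first_attribute_info)
        if target is None:
            return []  # no target element was acquired
        table = self.search(target)
        if table is None:
            return []  # no target table was found
        return self.extract(table)


# Stub modules for illustration only: the "page" is a dictionary that maps
# a cell ID straight to the rows of its table.
FAKE_TABLES = {"c1": [["1", "Alice"], ["2", "Bob"]]}
device = DataExtractionDevice(
    acquire=lambda info: info.get("id"),
    search=lambda target: FAKE_TABLES.get(target),
    extract=lambda table: table,
)
rows = device.run({"id": "c1"})
```

Keeping the modules as separate callables means any one of them can be replaced, for example to add a fallback acquisition strategy, without changing the other two.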
In one possible implementation, the first attribute information includes at least one of an identifier (ID), a tag (Tag), an extensible markup language path (Xpath), and a text (Text), and the acquisition module 501 is further configured to perform one of the following: acquire the target element according to the identifier (ID); acquire the target element according to combined information of the Tag and the Text when the browser in which the designated page is opened is an IE browser; and, when the browser in which the designated page is opened is not an IE browser, acquire the target element according to the extensible markup language path (Xpath) or according to combined information of the Tag and the Text.
In a possible implementation, the acquisition module 501 is further configured to, when acquisition of the target element fails, acquire the target element through the second attribute information given by the iframe element of the designated page.
In one possible implementation, the lookup module 502 is further configured to: when the element identifier is the first table label, determine whether the first table label includes at least one of a header (thead) and a body (tbody); when the first table label includes at least one of the header (thead) and the body (tbody), determine the target table according to the at least one of the header (thead) and the body (tbody); when the first table label includes neither the header (thead) nor the body (tbody), determine that the target table corresponding to the element identifier is an empty table; when the element identifier is not the first table label, acquire a parent node element from the level above the target element according to the page structure of the designated page; when the element identifier of the parent node element is the second table label, determine whether the second table label includes at least one of a header (thead) and a body (tbody); when the second table label includes at least one of the header (thead) and the body (tbody), determine the target table according to the at least one of the header (thead) and the body (tbody); and when the second table label includes neither the header (thead) nor the body (tbody), determine that the target table corresponding to the element identifier is an empty table.
In one possible implementation, the lookup module 502 is further configured to determine, in a case where the first table tag or the second table tag includes a header (thead) and a body (tbody), a target table by a combination of the header (thead) and the body (tbody), in a case where the first table tag or the second table tag includes a header (thead), look up a third table tag downward at the same level of the target element according to a page structure of the specified page, determine a body (tbody) according to the third table tag, determine a target table by a combination of the header (thead) and the body (tbody), and in a case where the first table tag or the second table tag includes a body (tbody), look up a fourth table tag upward at the same level of the target element according to a page structure of the specified page, determine a header (thead) according to the fourth table tag, and determine a target table by a combination of the header (thead) and the body (tbody).
In a possible implementation manner, the lookup module 502 is further configured to determine that the first table body is the table body (tbody) if the first table body is included in the third table label, and determine that the first table header is the table body (tbody) if the first table header is included in the third table label.
In a possible implementation manner, the lookup module 502 is further configured to determine that the second header is the header (thead) if the second header is included in the fourth table label, and determine that the second table body is the header (thead) if the second table body is included in the fourth table label.
In one possible implementation, the extracting module 503 is further configured to extract at least one row or at least one column of data in the target table, where the data includes at least one of an identification ID, a Tag, an extensible markup language path Xpath, a Text, a style ClassName, a Link, and a location rect.
The data extraction device 500 in the embodiment of the present application may be a device, or may be a component, an integrated circuit, or a chip in a terminal. The device may be a mobile electronic device or a non-mobile electronic device. By way of example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a Personal Digital Assistant (PDA), etc., and the non-mobile electronic device may be a server, a network attached storage (Network Attached Storage, NAS), a personal computer (personal computer, PC), a Television (TV), a teller machine, a self-service machine, etc., and the embodiments of the present application are not limited in particular.
The data extraction device 500 in the embodiment of the present application may be a device having an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system, and the embodiment of the present application is not specifically limited in this respect.
The data extraction device 500 provided in the embodiment of the present application can implement each process implemented in the method embodiments of Figs. 1 to 4; to avoid repetition, the description is not repeated here.
Optionally, as shown in Fig. 6, the embodiment of the present application further provides an electronic device 600, including a processor 602, a memory 601, and a program or instructions stored in the memory 601 and executable on the processor 602. When executed by the processor 602, the program or instructions implement each process of the above data extraction method embodiment and achieve the same technical effects; to avoid repetition, the description is not repeated here.
The electronic device in the embodiment of the present application includes the mobile electronic devices and the non-mobile electronic devices described above.
The present embodiment can implement the processes of the above embodiments of the data extraction method, and achieve the same technical effects, and in order to avoid repetition, a detailed description is omitted here.
The embodiment of the present application also provides a readable storage medium storing a program or instructions which, when executed by a processor, implement each process of the above data extraction method embodiment and achieve the same technical effects; to avoid repetition, the description is not repeated here.
The processor is the processor in the electronic device of the above embodiment. The readable storage medium includes a computer-readable storage medium, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The embodiment of the present application further provides a chip. The chip includes a processor and a communication interface coupled to the processor, and the processor is configured to run a program or instructions to implement each process of the above data extraction method embodiment and achieve the same technical effects; to avoid repetition, the description is not repeated here.
It should be understood that the chip referred to in the embodiments of the present application may also be referred to as a system-level chip, a system chip, a chip system, or a system-on-chip (SoC).
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element. Furthermore, it should be noted that the scope of the methods and apparatus in the embodiments of the present application is not limited to performing the functions in the order shown or discussed, but may also include performing the functions in a substantially simultaneous manner or in the reverse order, depending on the functions involved. For example, the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or alternatively by means of hardware, although in many cases the former is the preferred implementation. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and comprising instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to perform the methods according to the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above specific embodiments, which are merely illustrative and not restrictive. In light of the present application, those of ordinary skill in the art may devise many other forms without departing from the spirit of the present application and the scope protected by the claims, all of which fall within the protection of the present application.