Detailed Description
Exemplary embodiments of the present application are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present application to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In one example, FIG. 1 is a flowchart of a method for generating a regular expression according to an embodiment of the present application. This embodiment is applicable to the case of automatically generating a regular expression. The method may be performed by an apparatus for generating a regular expression; the apparatus may be implemented in software and/or hardware and may generally be integrated in an electronic device, such as a computer device. Accordingly, as shown in FIG. 1, the method includes the following operations:
S110, acquiring a sample data list; the sample data list includes a plurality of sample data.
The sample data list may be a list composed of the sample data needed to generate a regular expression. Optionally, the data type of the sample data may be a character string type or a Chinese character type, which is not limited in the embodiments of the present application.
In an embodiment of the present application, before the regular expression is generated, a sample data list for generating the regular expression may first be acquired. For example, data may be screened from batch data, and the sample data obtained by screening may be assembled into a sample data list; for instance, a plurality of website data items may be screened from the batch data to construct the sample data list. Alternatively, the sample data may be constructed independently according to the data screening requirement, and the sample data list formed from the constructed sample data; for instance, according to a requirement to screen special web page link character strings, corresponding special web page link string samples may be constructed or acquired as sample data to build the sample data list. The embodiments of the present application do not limit the specific manner of acquiring the sample data list.
S120, generating a common data tree corresponding to the sample data list according to each sample data.
The common data tree may record the common data sequences among the sample data. A common data sequence is data that is identical across the sample data.
Accordingly, after the sample data list is obtained, each sample data in the list may be analyzed to determine the common data sequences of the sample data, and a common data tree corresponding to the sample data list may be generated from those common data sequences.
For example, for the sample data "www.cbidu.com.cn" and "www.za.com", the common data sequences are "www." and ".com", so the common data tree of the sample data list ["www.cbidu.com.cn", "www.za.com"] can be constructed from the common data sequences "www." and ".com". Each node in the common data tree may be a common data sequence. For example, the common data tree for the sample data list ["www.cbidu.com.cn", "www.za.com"] may be: the root node is "www." and the child node is ".com".
S130, generating a data type list according to the common data tree.
The data type list may be used to record the data types of the relevant data in the sample data list, and the data types may be used to judge the variable characteristics of the relevant data of each sample data.
Correspondingly, after the common data tree corresponding to the sample data list is generated, a data type list may further be generated according to the common data tree, and the variable characteristics of the relevant data of the sample data may be judged by means of the data type list.
S140, generating a plurality of regular expressions matched with the sample data list according to the data type list.
In the embodiment of the invention, after the data type list is generated for the sample data list, a plurality of regular expressions matched with the sample data list can be generated according to the data type list. Optionally, according to the judgment result of the variable characteristics of the related data of the sample data by the data type list and the specific data content of each related data, a component part of the regular expression corresponding to each related data can be generated, and then the regular expression matched with the sample data list can be automatically generated.
In the embodiments of the present application, a common data tree corresponding to the sample data list is generated according to each sample data in the acquired sample data list, a data type list is generated according to the common data tree, and a plurality of regular expressions matching the sample data list are generated according to the data type list. Regular expressions are thereby generated automatically, which improves the efficiency of generating regular expressions.
In an example, FIG. 2 is a flowchart of a method for generating a regular expression according to an embodiment of the present application. This embodiment refines the technical solutions of the foregoing embodiments and gives several specific optional implementations of generating the common data tree corresponding to the sample data list according to each sample data, and of generating the data type list according to the common data tree.
The method for generating a regular expression shown in FIG. 2 includes:
S210, acquiring a sample data list.
S220, generating a common data tree corresponding to the sample data list according to each sample data.
In an optional embodiment of the present application, generating the common data tree corresponding to the sample data list according to each sample data may include: taking the sample data list as a current data list; generating, through a suffix tree data structure, a current target common continuous subsequence of the sample data in the current data list; taking the current target common continuous subsequence as the root node of a current common data tree, and determining in turn the temporary child nodes of each sample data according to the target common continuous subsequence and each sample data in the current data list, the temporary child nodes including a first temporary child node and a second temporary child node; constructing a target data list according to the temporary child nodes, and updating the current data list according to the target data list; and returning to the operation of generating, through the suffix tree data structure, the current target common continuous subsequence of the sample data in the current data list, and updating the child nodes of the current common data tree according to the root node of the target data list, until the current target common continuous subsequence is empty.
The current target common continuous subsequence may be the longest common continuous subsequence of the sample data in the current data list. A temporary child node may be data obtained by splitting sample data at the root node, and may be used to construct a new data list. The first temporary child node may be a left child node and the second temporary child node a right child node; alternatively, the first temporary child node may be a right child node and the second temporary child node a left child node. The embodiments of the present application do not limit the node types of the first and second temporary child nodes. The target data list may be a new data list constructed from the temporary child nodes, and the sample data in that list are parts of the original sample data.
In an optional embodiment of the present application, constructing the target data list according to the temporary child nodes may include: constructing a first target data list according to the first temporary child nodes; and constructing a second target data list according to the second temporary child nodes. Updating the child nodes of the current common data tree according to the root node of the target data list may include: taking the root node of the first target data list as a first child node of the current common data tree; and taking the root node of the second target data list as a second child node of the current common data tree.
Wherein the first target data list may be a new data list generated from the first temporary child node and the second target data list may be a new data list generated from the second temporary child node. When the first temporary child node is a left child node and the second temporary child node is a right child node, the first child node may be a left child node and the second child node may be a right child node. When the first temporary child node is a right child node and the second temporary child node is a left child node, the first child node may be the right child node and the second child node may be the left child node.
The embodiments of the present application generate each sub common data tree recursively. Specifically, the sample data list may be taken as the current data list, and the current target common continuous subsequence of the sample data in the current data list may be generated through a suffix tree data structure. The generated subsequence is then taken as the root node of the current common data tree, and the temporary child nodes of each sample data are determined in turn according to the target common continuous subsequence and each sample data in the current data list. A target data list may then be constructed from the temporary child nodes, and the current data list updated according to the target data list. The structure of the current common data tree may be: a root node, a left child node and a right child node. Accordingly, after the current common data tree is generated for the current data list, the target data lists may be constructed according to the root node of the current common data tree and each sample data. That is, a first target data list is constructed from the first temporary child nodes and a second target data list from the second temporary child nodes; the first and second target data lists are then each taken as the current data list, and the operation of generating the current target common continuous subsequence through the suffix tree data structure is repeated until the current target common continuous subsequence is empty. It should be noted that, after the current target common continuous subsequence is generated for the first target data list, that subsequence may be used as the first child node of the current common data tree.
Similarly, after the current target common continuous subsequence is generated for the second target data list, that subsequence may be used as the second child node of the current common data tree. In this way, a complete common data tree can ultimately be generated for the sample data list.
In one illustrative example, assume the sample data list is ["http://www.cbidu.com", "https://www.za.com", "http://www.alucaaa.com"], where the sample data are "http://www.cbidu.com", "https://www.za.com" and "http://www.alucaaa.com", respectively. The sample data list is taken as the current data list, and the current target common continuous subsequence of the sample data in the current data list is generated through a suffix tree data structure: "://www.". Then "://www." is taken as the root node of the current common data tree, and each sample data in the current data list is split according to the root node into three parts: a first temporary child node, the root node and a second temporary child node. In the current data list, for the sample data "http://www.cbidu.com", the first temporary child node is "http", the root node is "://www." and the second temporary child node is "cbidu.com". Similarly, for the sample data "https://www.za.com", the first temporary child node is "https", the root node is "://www." and the second temporary child node is "za.com". For the sample data "http://www.alucaaa.com", the first temporary child node is "http", the root node is "://www." and the second temporary child node is "alucaaa.com". The first temporary child nodes of the sample data then form the first target data list, that is, "http", "https" and "http"; the second temporary child nodes form the second target data list, that is, "cbidu.com", "za.com" and "alucaaa.com". After the first and second target data lists are generated, each is taken as the current data list, and the operation of generating the current target common continuous subsequence of the sample data in the current data list through the suffix tree data structure is invoked again.
The root node of the first target data list, that is, its current target common continuous subsequence, may be used as the left child node of the current common data tree, and the root node of the second target data list, that is, its current target common continuous subsequence, may be used as the right child node of the current common data tree. That is, ["http", "https", "http"] is taken as the first target data list and ["cbidu.com", "za.com", "alucaaa.com"] as the second target data list, and each target data list is taken as the current data list to generate its own current target common continuous subsequence. Here, the current target common continuous subsequence of the first target data list is empty, and that of the second target data list is ".com". The current target common continuous subsequence of the first target data list is then taken as the first child node of the current common data tree, and that of the second target data list as the second child node, giving the final common data tree: the root node is "://www.", the first child node is null, and the second child node is ".com".
By proceeding recursively in this way, the common data of the sample data can be acquired in turn and the common data tree constructed from the acquired common data.
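The recursive construction described above can be sketched as follows. This is only an illustration under stated assumptions: the names `Node`, `longest_common_substring` and `build_tree` are hypothetical, and the suffix tree search of the embodiment is replaced here by a plain brute-force longest-common-substring search, which is simpler but slower.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """One node of the common data tree (hypothetical structure)."""
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def longest_common_substring(items: List[str]) -> str:
    """Brute-force stand-in for the suffix tree search (assumption)."""
    if not items:
        return ""
    shortest = min(items, key=len)
    # Try candidate lengths from longest to shortest, left to right.
    for size in range(len(shortest), 0, -1):
        for start in range(len(shortest) - size + 1):
            candidate = shortest[start:start + size]
            if all(candidate in s for s in items):
                return candidate
    return ""


def build_tree(items: List[str]) -> Optional[Node]:
    """Recursively build the tree; stop when no common substring is left."""
    root = longest_common_substring(items)
    if not root:
        return None
    lefts, rights = [], []
    for s in items:
        # Split each item at the first occurrence of the root.
        left, _, right = s.partition(root)
        lefts.append(left)
        rights.append(right)
    return Node(root, build_tree(lefts), build_tree(rights))


samples = ["http://www.cbidu.com", "https://www.za.com", "http://www.alucaaa.com"]
tree = build_tree(samples)
print(tree.value, tree.right.value)
```

For the example list, this sketch yields the root "://www." and the right child ".com", as in the text. For the left list ["http", "https", "http"], however, a plain longest-common-substring search returns "http", whereas the example above records an empty first child; the described procedure presumably applies a condition that this sketch does not model.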
S230, generating a common data full list according to the common data tree and the sample data list.
S240, generating the data type list according to the common data full list.
The common data full list may be a list, generated for the sample data of the sample data list, that reflects the identical feature data and the differing feature data among the sample data.
In the embodiments of the present application, when the data type list is generated according to the common data tree, the common data full list is first generated according to the common data tree and the sample data list, and the data type list is then generated according to the common data full list.
In the above scheme, the common data full list reflects the identical feature data and the differing feature data among the sample data, so that corresponding data types can be determined for the identical feature data and the differing feature data respectively, and the final data type list generated.
In an optional embodiment of the present application, generating the common data full list according to the common data tree and the sample data list may include: traversing the common data tree, and constructing a common data intermediate list according to the traversal result; forming corresponding sub-data lists from the non-common data included in the sample data list; and expanding the common data intermediate list with the sub-data lists to obtain the common data full list.
The common data intermediate list may be a list of the identical feature data, that is, the common data. Correspondingly, the differing feature data are the non-common data. A sub-data list may be a data list formed by abstracting the non-common data.
Specifically, the common data tree may be traversed, and the common data intermediate list constructed according to the traversal result. Optionally, the common data tree may be traversed in-order. After the common data intermediate list is obtained, the non-common data included in the sample data list may be used to form the corresponding sub-data lists, and the sub-data lists may then be used to expand the common data intermediate list to obtain the final common data full list.
In one illustrative example, assume that the sample data list is list1, with the sample data ["http://www.cbidu.com", "https://www.za.com", "http://www.alucaaa.com"], and that the corresponding common data tree is: the root node is "://www.", the first child node is null, and the second child node is ".com". First, an in-order traversal of the common data tree gives the ordered common data intermediate list list2: ["://www.", ".com"]. Then, the non-common data included in the sample data list are used to form the corresponding sub-data lists list4 and list5, where list4 is ["http", "https", "http"] and list5 is ["cbidu", "za", "alucaaa"]. Finally, list2 is expanded with the sub-data lists list4 and list5 to obtain the final common data full list list3: [list4, "://www.", list5, ".com"].
In this scheme, the common data intermediate list is constructed first and is then expanded with the sub-data lists formed from the non-common data, so that a common data full list containing nested lists is obtained, each sub-data list being a nested list. The common data full list clearly separates the common data from the non-common data, so that the data type of each piece of data can be judged.
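The traversal and expansion steps above might be sketched as follows. The tuple layout `(left, value, right)` for a tree node and the helper names `inorder` and `build_full_list` are assumptions made for illustration; the text does not prescribe how the non-common data are located, so this sketch simply searches each sample for the common sequences in order and collects the gaps between them.

```python
def inorder(node):
    """In-order traversal of a (left, value, right) tuple tree."""
    if node is None:
        return []
    left, value, right = node
    return inorder(left) + [value] + inorder(right)


def build_full_list(samples, commons):
    """Interleave the common sequences with lists of the non-common gaps."""
    gaps = [[] for _ in range(len(commons) + 1)]
    for s in samples:
        pos = 0
        for k, c in enumerate(commons):
            idx = s.find(c, pos)
            if idx < 0:
                raise ValueError(f"{c!r} not found in {s!r}")
            gaps[k].append(s[pos:idx])   # non-common data before c
            pos = idx + len(c)
        gaps[-1].append(s[pos:])         # non-common data after the last c
    full = []
    for k, c in enumerate(commons):
        if any(gaps[k]):                 # skip gaps that are empty everywhere
            full.append(gaps[k])
        full.append(c)
    if any(gaps[-1]):
        full.append(gaps[-1])
    return full


# Tree from the example: root "://www.", null first child, ".com" second child.
tree = (None, "://www.", (None, ".com", None))
list1 = ["http://www.cbidu.com", "https://www.za.com", "http://www.alucaaa.com"]
list2 = inorder(tree)
list3 = build_full_list(list1, list2)
print(list3)
```

With the example data, `list2` is ["://www.", ".com"] and `list3` has the nested form [list4, "://www.", list5, ".com"] given in the text.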
In an optional embodiment of the present application, generating the data type list according to the common data full list may include: determining the common data of the common data full list to be of a first data type; calculating the length information entropy of each sub-data list of the common data full list; and determining the data type of each sub-data list according to the numerical relationship between its length information entropy and a first set threshold.
In an optional embodiment of the present application, determining the data type of each sub-data list according to the numerical relationship between its length information entropy and the first set threshold may include: determining the data type of the sub-data list to be a second data type when the length information entropy of the sub-data list is greater than the first set threshold; and determining the data type of the sub-data list to be the first data type when the length information entropy of the sub-data list is less than or equal to the first set threshold.
The first data type may be a constant type, that is, a fixed constant. The second data type may be a variable type, that is, a non-fixed variable. The length information entropy may be an information entropy calculated for each sub-data list, reflecting the uncertainty as to whether the data are of the constant type or the variable type. The first set threshold may be set according to actual requirements or specified in advance, for example 1.3 or 2.4; the embodiments of the present application do not limit the specific value of the first set threshold.
In the embodiments of the present application, generating the data type list according to the common data full list mainly requires determining the corresponding data type for each part of the data in the common data full list. Specifically, since the common data in the common data full list are the identical feature data of the sample data, the data type of the common data can be determined directly as the first data type, that is, the constant type. For the non-common data, that is, each sub-data list, the data type may be determined by means of the length information entropy.
The length information entropy may be defined as $H = -\sum_{i} p_i \log p_i$, where $p_i$ is the number of data items having the i-th distinct data length divided by the total number of data items. Illustratively, the data of the list ["cbidu", "za", "alucaaa"] correspond to the data length list [5, 2, 7], so the $p_i$ of this list are [1/3, 1/3, 1/3]. The length information entropy measures the uncertainty among the data.
Specifically, when the data type of each sub data list is determined by means of the length information entropy, the length information entropy of each sub data list can be calculated, and the length information entropy is compared with the first set threshold value. If the length information entropy of the sub data list is greater than the first set threshold, the uncertainty of the sub data list is greater than a preset value, and the data type of the sub data list can be determined to be the second data type, namely, the variable type. If the entropy of the length information of the sub data list is smaller than or equal to a first set threshold, the uncertainty of the sub data list is smaller than or equal to a preset value, and the data type of the sub data list can be determined to be a first data type, namely a constant type.
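The entropy calculation and threshold comparison can be illustrated with a short sketch. It assumes the usual Shannon form of information entropy and a base-2 logarithm; the text does not state the base, so this choice, like the function names `length_entropy` and `classify`, is an assumption. The threshold 1.3 is one of the example values mentioned above.

```python
from collections import Counter
from math import log2


def length_entropy(items):
    """Entropy of the distribution of item lengths (base 2 is an assumption)."""
    counts = Counter(len(s) for s in items)
    total = len(items)
    return sum(-(c / total) * log2(c / total) for c in counts.values())


def classify(items, first_threshold):
    """Variable type if the entropy exceeds the threshold, else constant type."""
    return "variable" if length_entropy(items) > first_threshold else "constant"


list4 = ["http", "https", "http"]      # lengths [4, 5, 4]
list5 = ["cbidu", "za", "alucaaa"]     # lengths [5, 2, 7]
print(length_entropy(list4), classify(list4, 1.3))
print(length_entropy(list5), classify(list5, 1.3))
```

Under these assumptions, list5 has three equally likely lengths and an entropy of log2(3), which exceeds 1.3, so it is classified as the variable type, while list4 falls below the threshold and is classified as the constant type. A different logarithm base or threshold would shift these outcomes.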
It should be noted that the data type of a sub-data list need not be determined according to the length information entropy. That is, the data types of the sub-data lists may be defined directly as the first data type or the second data type, and then, for each data type, the corresponding regular expression result may be determined according to the length information entropy.
In this technical scheme, the specific data type of the non-common data of the sample data is determined by means of the length information entropy, so that the regular expression corresponding to the sample data can be determined according to the actual data abstraction requirement. When the first set threshold used for the non-common data differs, the data type determined for the non-common data differs as well, so that the required regular expression is generated automatically according to the actual data abstraction requirement.
S250, generating a plurality of regular expressions matched with the sample data list according to the data type list.
In this technical scheme, the common data tree corresponding to the sample data list is generated according to each sample data, so that the common data content in the sample data is extracted in turn and represented in the form of a common data tree. The common data tree and the sample data are then used to generate the common data full list, and the data type list is generated by combining the common data full list with the length information entropy. This can improve the efficiency of generating regular expressions and enrich the ways in which they are generated.
In an example, FIG. 3 is a flowchart of a method for generating a regular expression according to an embodiment of the present application. This embodiment refines the technical solutions of the foregoing embodiments and gives several specific optional implementations of generating a plurality of regular expressions matched with the sample data list according to the data type list.
The method for generating a regular expression shown in FIG. 3 includes:
S310, acquiring a sample data list.
S320, generating a common data tree corresponding to the sample data list according to each sample data.
S330, generating a data type list according to the common data tree.
S340, generating a plurality of regular expressions matched with the sample data list according to the data type list and the common data full list.
In the embodiments of the present application, a plurality of regular expressions matched with the sample data list may be generated according to the data type list in combination with the common data full list.
Accordingly, S340 may specifically include the following operations:
S341, acquiring the current data to be processed from the common data full list according to the data sorting order.
The data sorting order may be the order in which the data in the common data full list are arranged. The current data to be processed are the data in the common data full list for which regular expression content currently needs to be generated.
Taking the common data full list list3 of the above example, [list4, "://www.", list5, ".com"], as an example: when processing of list3 begins, the data list4 of list3 is obtained as the current data to be processed according to the data sorting order; when processing of list4 is completed, the data "://www." of list3 is obtained as the current data to be processed according to the data sorting order; and so on until all the data have been processed.
S342, judging whether the data type of the current data to be processed is a first data type, if so, executing S343; otherwise, S344 is performed.
S343, generating a sub-regular expression matched with the current data to be processed according to the quantity of the data to be processed included in the current data to be processed.
The number of data to be processed is the number of data items included in the current data to be processed. A sub-regular expression may be the part of the regular expression generated for a corresponding piece of data.
In this embodiment of the present application, if the data type of the current data to be processed is a first data type, that is, a constant type, a sub-regular expression matching the current data to be processed needs to be generated according to the number of data to be processed included in the current data to be processed.
In an optional embodiment of the present application, generating the sub-regular expression matched with the current data to be processed according to the number of data to be processed included in the current data to be processed may include: when the number of data to be processed is determined to be a first number, using the current data to be processed directly as the sub-regular expression matched with the current data to be processed; and when the number of data to be processed is determined not to be the first number, combining the data of the current data to be processed as the sub-regular expression matched with the current data to be processed.
Wherein the first number may be 1. The corresponding non-first number is a positive integer greater than 1.
Optionally, if the number of data to be processed is the first number, the current data to be processed can be directly used as a sub-regular expression matched with the current data to be processed; otherwise, each data combination of the current data to be processed is used as a sub-regular expression matched with the current data to be processed.
Consider again the common data full list list3 of the above example, [list4, "://www.", list5, ".com"], where list4 is ["http", "https", "http"] and list5 is ["cbidu", "za", "alucaaa"]. Assume "://www." is the current data to be processed and its data type is the first data type. Since the number of data in the current data to be processed is 1, the sub-regular expression corresponding to "://www." is "://www." itself. Now assume list4 is the current data to be processed and its data type is the first data type. Since the data to be processed included in list4 are "http" and "https", the number of data is 2, that is, greater than 1. The sub-regular expression corresponding to list4 may therefore be "http|https", where the symbol "|" means OR.
In the above scheme, the matched sub-regular expression is generated for current data to be processed of the first data type according to the number of data to be processed that it includes, so that the common data are retained to the greatest extent, that is, the common characteristics of the data are extracted.
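A minimal sketch of this constant-type rule follows. The function name `constant_sub_regex` is hypothetical, and counting only distinct values is an assumption inferred from the list4 example, where three entries give a data count of 2. The sketch reproduces the fragments literally as the text does; in a usable pattern, metacharacters such as "." would additionally need escaping.

```python
def constant_sub_regex(data):
    """Sub-regular expression for data of the first (constant) data type."""
    if isinstance(data, str):
        items = [data]
    else:
        # Keep distinct values in first-seen order (assumption from the example).
        items = list(dict.fromkeys(data))
    if len(items) == 1:
        return items[0]          # a single item is used as-is
    return "|".join(items)       # several items are combined with OR


print(constant_sub_regex("://www."))
print(constant_sub_regex(["http", "https", "http"]))
```

With the example data, the first call returns "://www." unchanged and the second returns "http|https".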
S344, generating a sub-regular expression matched with the current data to be processed according to the length information entropy of the current data to be processed.
Correspondingly, if the data type of the current data to be processed is the second data type, a sub-regular expression matched with the current data to be processed is generated according to the length information entropy of the current data to be processed.
In an optional embodiment of the present application, generating the sub-regular expression matched with the current data to be processed according to the length information entropy of the current data to be processed may include: when the length information entropy of the current data to be processed is greater than or equal to a second set threshold, using a preset character as the sub-regular expression matched with the current data to be processed; when the length information entropy of the current data to be processed is less than the second set threshold and greater than a third set threshold, using first length information and second length information of the data of the current data to be processed as the sub-regular expression matched with the current data to be processed; and when the length information entropy of the current data to be processed is less than or equal to the third set threshold, using third length information of the data of the current data to be processed as the sub-regular expression matched with the current data to be processed.
The second set threshold may be set according to an actual requirement, or may be specified in advance, for example, the value is 1, 2, or 2.5, which is not limited by a specific numerical value of the second set threshold in the embodiment of the present application. The third set threshold may be 0. The preset characters may be set according to actual requirements, such as "+", "-" or ".", etc., and the embodiment of the present application does not limit the specific character content of the preset characters. The first length information may be a minimum data length of each data, the second length information may be a maximum data length of each data, and the third length information may be a data length of data having the same data length.
In the embodiment of the present application, if the data type of the current data to be processed is the second data type, that is, the variable type, a sub-regular expression matched with the current data to be processed needs to be generated according to the length information entropy of the current data to be processed. Optionally, if the length information entropy of the current data to be processed is greater than or equal to the second set threshold, the preset character may be used as a sub-regular expression matched with the current data to be processed. If the length information entropy of the current data to be processed is smaller than the second set threshold and larger than the third set threshold, the first length information and the second length information of each data of the current data to be processed can be used as sub-regular expressions matched with the current data to be processed, so that the characteristics of the current data to be processed on the data length can be abstracted, namely, the range interval of the data length of the current data to be processed is embodied. If the length information entropy of the current data to be processed is smaller than or equal to the third set threshold value, the third length information of each data of the current data to be processed can be used as a sub-regular expression matched with the current data to be processed, so that the characteristics of the current data to be processed on the data length can be abstracted, namely specific numerical values of each data length in the current data to be processed are embodied.
Take the public data full list list3 in the above example, [list4, "://www.", list5, ".com"], where list4 is ["http", "https", "http"] and list5 is ["cbidu", "za", "alucaaa"]. Let list5 be the current data to be processed, with the second data type. The length information entropy of list5 is calculated using the formula and is then compared with the second set threshold and with zero. If the length information entropy of list5 is greater than or equal to the second set threshold, the uncertainty of list5 is large, and list5 may be classified as MAX. If the length information entropy of list5 is smaller than the second set threshold and greater than zero, the uncertainty of list5 is relatively small, and list5 may be classified as MID. If the length information entropy of list5 is less than or equal to 0, every data item in list5 has the same data length, and list5 may be classified as MIN.
The length information entropy of list5 is fixed, whereas the second set threshold can be set according to actual requirements. Therefore, for different values of the second set threshold, the class finally assigned to list5 may also differ. Accordingly, if the final class of list5 is MAX, list5 may be translated into the preset character "+", that is, "+" is used as the sub-regular expression matched with list5. If the final class of list5 is MID, list5 may be translated into {minlen,maxlen}, where minlen is the first length information, namely the minimum length, and maxlen is the second length information, namely the maximum length; that is, {2,7} is used as the sub-regular expression matched with list5. If the final class of list5 is MIN, list5 may be translated into {len}, where len is the data length shared by every data item in list5. For example, assuming that list5 is ["cbidu", "zauca", "aluca"], the data length of every data item in list5 is 5, so {5} may be used as the sub-regular expression matched with list5.
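The MAX/MID/MIN decision described above can be sketched as follows. This is a minimal illustration only: it assumes that the length information entropy is the Shannon entropy (base 2) of the distribution of data lengths, and the names length_entropy and quantifier_for, as well as the default threshold values, are hypothetical and not taken from the present application.

```python
import math
from collections import Counter


def length_entropy(items):
    """Assumed definition: Shannon entropy (base 2) of the string-length distribution."""
    counts = Counter(len(s) for s in items)
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def quantifier_for(items, second_threshold=1.0, third_threshold=0.0, preset_char="+"):
    """Translate a variable-type sub-data list into a length quantifier."""
    entropy = length_entropy(items)
    lengths = [len(s) for s in items]
    if entropy >= second_threshold:
        # MAX class: high uncertainty, fall back to the preset character
        return preset_char
    if entropy > third_threshold:
        # MID class: keep the range between minimum and maximum length
        return "{%d,%d}" % (min(lengths), max(lengths))
    # MIN class: every item has the same length
    return "{%d}" % lengths[0]


list5 = ["cbidu", "za", "alucaaa"]
print(quantifier_for(list5, second_threshold=2))  # {2,7}
print(quantifier_for(list5, second_threshold=1))  # +
```

Under this assumed entropy definition, list5 has three distinct lengths and an entropy of about 1.58, so the same list falls into the MID class with a second set threshold of 2 and into the MAX class with a second set threshold of 1.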
In the scheme, the current data to be processed of the second data type is subjected to deterministic judgment by setting the second set threshold value and the third set threshold value, so that the sub-regular expression corresponding to the current data to be processed is generated according to the judgment result, and the required regular expression can be automatically generated according to the actual data abstraction requirement.
It should be noted that, in the embodiment of the present application, the first setting threshold and the second setting threshold may be set according to actual requirements, that is, different values may be set for different types of sample data, and accordingly different types of regular expressions may be obtained.
S345, judging whether all the data to be processed are processed, if yes, executing S347; otherwise, S346 is performed.
S346, acquiring next data to be processed according to the data sorting sequence, and returning to execute S342 after updating the current data to be processed according to the next data to be processed.
Correspondingly, after the sub-regular expression corresponding to the current data to be processed is generated, the next data to be processed can be obtained according to the data sorting order and taken as the current data to be processed. The operation of generating the sub-regular expression matched with the current data to be processed is then executed again, until all the data to be processed of the public data full list have been processed, that is, until a sub-regular expression has been generated for every data to be processed of the public data full list.
S347, generating a plurality of regular expressions matched with the sample data list according to each sub-regular expression.
Correspondingly, after all the data to be processed of the public data full list are correspondingly generated into sub-regular expressions, all the sub-regular expressions can be spliced according to the data sequencing order to generate a plurality of regular expressions matched with the sample data list.
Take the public data full list list3 in the above example, [list4, "://www.", list5, ".com"], where list4 is ["http", "https", "http"] and list5 is ["cbidu", "za", "alucaaa"]. By setting the first set threshold and the second set threshold, a corresponding sub-regular expression can be generated for each data item in list3. Correspondingly, by splicing the sub-regular expressions according to the data sorting order, the regular expressions matched with list1 may take forms such as (https|http)(://www.)+(.com), (https|http)(://www.){2,7}(.com), or ({4,5})(://www.){2,7}(.com).
According to the technical scheme, the sub-regular expressions corresponding to the data are determined according to the data types of each data of the public data full list, so that common characteristics among the sample data can be effectively reserved, and the required regular expressions can be automatically generated according to actual data abstraction requirements aiming at different characteristic parts.
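The traversal and splicing of S341 to S347 can be sketched as follows. This is a sketch under stated assumptions rather than the implementation of the present application: the labels FIRST_TYPE and SECOND_TYPE, the function names, and the variable_sub_regex callback are hypothetical. The literal parts are escaped with re.escape so that "." matches only a dot, which the example expressions in the text do not do, and the callback already returns a character class plus quantifier so that the spliced result is a valid pattern.

```python
import re

FIRST_TYPE = "first"    # hypothetical label for the first data type
SECOND_TYPE = "second"  # hypothetical label for the second (variable) data type


def sub_regex_first_type(item):
    """A single datum is used directly; several data are merged into an alternation."""
    if isinstance(item, str):
        return "(" + re.escape(item) + ")"
    # Remove duplicates, keep longer alternatives first so "https" is tried before "http"
    unique = sorted(dict.fromkeys(item), key=len, reverse=True)
    return "(" + "|".join(re.escape(u) for u in unique) + ")"


def splice(full_list, type_list, variable_sub_regex):
    """Generate one sub-regular expression per entry and join them in list order."""
    parts = []
    for item, data_type in zip(full_list, type_list):
        if data_type == FIRST_TYPE:
            parts.append(sub_regex_first_type(item))
        else:
            parts.append(variable_sub_regex(item))
    return "".join(parts)


list3 = [["http", "https", "http"], "://www.", ["cbidu", "za", "alucaaa"], ".com"]
types = [FIRST_TYPE, FIRST_TYPE, SECOND_TYPE, FIRST_TYPE]
pattern = splice(list3, types, lambda items: r"\w{2,7}")
print(bool(re.fullmatch(pattern, "https://www.za.com")))  # True
```

Changing the data type assigned to an entry, or the callback used for the second data type, yields the different candidate expressions listed in the example above.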
S348, obtaining the expression composition type of the target sub-regular expression in each regular expression.
S349, marking each target sub regular expression by using a preset mark according to the expression composition type.
The target sub-regular expression may be a sub-regular expression that needs to be additionally identified in each sub-regular expression. The preset identifier may be set according to actual requirements, and exemplary preset identifiers may be "\w" or "\d", etc., and the embodiment of the present application does not limit the specific identifier type of the preset identifier.
In the embodiment of the application, after the sub-regular expressions of each data are generated for the public data full list, some of the sub-regular expressions can be used as target sub-regular expressions to further identify the sub-regular expressions. Alternatively, the expression composition type of the target sub-regular expression may be first determined, and then each of the target sub-regular expressions may be identified by a preset identifier according to the determined expression composition type. The expression composition type of the target sub-regular expression may be, for example: the data is all alphabetical in composition, or the data is not all alphabetical in composition, etc.
Take the public data full list list3, [list4, "://www.", list5, ".com"], as an example. If the regular expressions generated from list3 for list1 are (https|http)(://www.)+(.com), (https|http)(://www.){2,7}(.com), or ({4,5})(://www.){2,7}(.com), the sub-regular expressions generated for data of the second data type may be identified with a preset identifier. For example, "\w" identifies data in which every character is a letter, and "\d" identifies data in which not every character is a letter. Accordingly, the regular expressions that list1 finally matches may be: (https|http)(://www.)\w+(.com), (https|http)(://www.)\w{2,7}(.com), or (\w{4,5})(://www.)\w{2,7}(.com).
By identifying each target sub-regular expression by using a preset identification according to the expression composition type, similar characteristics can be extracted from the data of the second data type, so that the finally generated regular expression can reflect the characteristics of the sample data to the greatest extent.
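The identification step of S348 and S349 can be sketched as follows. The function name mark_sub_regex is hypothetical, and the mapping simply follows the example in the text ("\w" when every datum consists only of letters, "\d" otherwise); other preset identifiers could be substituted.

```python
def mark_sub_regex(items, quantifier):
    """Prefix a length quantifier with a preset identifier chosen by composition type."""
    # Composition type: does every datum consist only of letters?
    all_letters = all(s.isalpha() for s in items)
    identifier = r"\w" if all_letters else r"\d"
    return identifier + quantifier


print(mark_sub_regex(["cbidu", "za", "alucaaa"], "{2,7}"))  # \w{2,7}
```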
S350, calculating the intimacy of each regular expression by using the intimacy function.
Wherein the affinity function may be used to calculate the affinity of each regular expression.
S360, determining the regular expression corresponding to the target affinity as a target regular expression.
Wherein the target affinity may be a maximum numerical affinity.
In the embodiment of the application, after a plurality of matched regular expressions are generated for the sample data list, in order to further screen the regular expressions meeting the requirements, the affinity of each regular expression can be calculated by using the affinity function, so that the regular expression corresponding to the affinity with the largest value is screened out and is determined as the target regular expression. The screened target regular expression can meet the abstract requirement of sample data to the greatest extent.
For example, an affinity function f(regex) may be defined over the generated regular expression regex and the sample data in the sample data list, where i denotes the order of the sample data, for example, i = 1 denotes the first sample data. The affinity is calculated from f(regex) and each sample data in the sample data list, and the regular expression with the maximum affinity value is optimal. The similarity between the sample data and the regular expression can be defined according to actual requirements, for example, cosine similarity may be used to calculate the affinity.
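The selection of the target regular expression in S350 and S360 can be sketched as follows. The concrete affinity function is not reproduced here, so this sketch makes two loudly stated assumptions: the similarity between a regular expression and a sample datum is taken to be the cosine similarity of their character-count vectors, and the affinity of a regular expression is taken to be the sum of that similarity over all sample data. All function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between the character-count vectors of two strings."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[ch] * vb[ch] for ch in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def affinity(regex, samples):
    """Assumed aggregation: sum of the similarities to every sample datum."""
    return sum(cosine_similarity(regex, s) for s in samples)


def select_target(candidates, samples):
    """Return the candidate regular expression with the maximum affinity."""
    return max(candidates, key=lambda regex: affinity(regex, samples))


samples = ["http://www.cbidu.com", "https://www.za.com", "http://www.alucaaa.com"]
candidates = [r"(https|http)(://www\.)\w+(\.com)", r"\w{4,5}(://www\.)\w{2,7}(\.com)"]
print(select_target(candidates, samples))
```

Any other similarity definition can be swapped in for cosine_similarity without changing the selection logic, which only requires picking the maximum value.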
According to the technical scheme, the plurality of regular expressions matched with the sample data list are generated through the data type list and the public data full list in combination with the length information entropy, so that the generation efficiency of the regular expressions can be improved, and the generation modes of the regular expressions can be enriched.
In one example, fig. 4 is a flowchart of a data extraction method provided in an embodiment of the present application, where the embodiment may be applicable to the case of data extraction and classification according to automatically generated regular expressions, the method may be performed by a data extraction device, which may be implemented by software and/or hardware, and may be generally integrated in an electronic device. The electronic device may be a computer device or the like. Accordingly, as shown in fig. 4, the method includes the following operations:
S410, acquiring data to be processed.
S420, analyzing the data to be processed, and generating a regular expression matched with the data to be processed.
The data to be processed is the original data which is required to be extracted and classified by using the regular expression.
In this embodiment of the present application, after obtaining data to be processed, the data to be processed may be analyzed, for example, several sample data are selected from the data to be processed, and then the regular expression for extracting the data to be processed is generated according to the selected sample data by using the method for generating the regular expression described in any one of the embodiments.
For example, an operation interface may be provided, on which a user may provide the data to be processed and input, at a specified position of the operation interface, samples for generating a regular expression as sample data. After receiving an instruction for generating a regular expression sent by the user through the operation interface, the background server directly generates the regular expression corresponding to the sample data according to the processing logic of the regular expression generation method in any of the above embodiments, and displays the regular expression at another specified position of the operation interface.
S430, extracting the data to be processed according to the generated regular expression.
Correspondingly, after the regular expression used for carrying out data extraction on the data to be processed is generated, the data extraction can be carried out on the data to be processed according to the generated regular expression so as to obtain the data meeting the requirements.
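The extraction step of S430 can be sketched with the standard Python re module. The function name extract, the log text, and the pattern below are illustrative assumptions; the pattern stands in for a regular expression produced by the generation method.

```python
import re


def extract(data, pattern):
    """Return every substring of the data to be processed that matches the pattern."""
    compiled = re.compile(pattern)
    return [match.group(0) for match in compiled.finditer(data)]


log_data = (
    "GET http://www.za.com 200\n"
    "GET https://www.cbidu.com 404\n"
    "ERROR disk full"
)
generated = r"(https|http)(://www\.)\w{2,7}(\.com)"
print(extract(log_data, generated))
# ['http://www.za.com', 'https://www.cbidu.com']
```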
The data extraction method can be applied to various application scenarios involving data extraction and classification, for example, screening web page link data out of log data, or screening metaphors or rank sentences out of a training corpus. The embodiment of the application does not limit the specific type of application scenario.
According to the method and the device, the public data tree corresponding to the sample data list is generated according to the sample data in the obtained sample data list, so that the data type list is generated according to the public data tree, a plurality of regular expressions matched with the sample data list are generated according to the data type list, the generated regular expressions are used for rapidly extracting and classifying data to be processed, automatic generation of the regular expressions is achieved, and the generation efficiency of the regular expressions and the extraction and classification efficiency of the data are improved.
In an example, fig. 5 is a block diagram of a regular expression generating apparatus provided in an embodiment of the present application, where the embodiment of the present application may be applicable to a case of automatically generating a regular expression, where the apparatus is implemented by software and/or hardware, and is specifically configured in an electronic device. The electronic device may be a computer device or the like.
The regular expression generating apparatus 500 shown in fig. 5 includes: a sample data list acquisition module 510, a common data tree generation module 520, a data type list generation module 530, and a first regular expression generation module 540. Wherein,
a sample data list obtaining module 510, configured to obtain a sample data list; the sample data list includes a plurality of sample data;
A common data tree generating module 520, configured to generate a common data tree corresponding to the sample data list according to each sample data;
a data type list generation module 530, configured to generate a data type list according to the public data tree;
a first regular expression generating module 540 is configured to generate a plurality of regular expressions matching the sample data list according to the data type list.
According to the method and the device for generating the regular expression, the common data tree corresponding to the sample data list is generated according to the sample data in the obtained sample data list, so that the data type list is generated according to the common data tree, a plurality of regular expressions matched with the sample data list are generated according to the data type list, automatic generation of the regular expression is achieved, and the generation efficiency of the regular expression is improved.
Optionally, the public data tree generating module 520 is specifically configured to: taking the sample data list as a current data list; generating a current target public continuous subsequence of each sample data in the current data list through a suffix tree data structure; taking the current target public continuous subsequence as a root node of a current public data tree, and sequentially determining temporary sub-nodes of the sample data according to the target public continuous subsequence and the sample data in the current data list; the temporary sub-nodes comprise a first temporary sub-node and a second temporary sub-node; constructing a target data list according to each temporary child node, and updating the current data list according to the target data list; and returning to execute the operation of generating a current target public continuous sub-sequence of each sample data in the current data list through a suffix tree data structure, and updating the sub-nodes of the current public data tree according to the root node of the target data list until the current target public continuous sub-sequence is empty.
Optionally, the public data tree generating module 520 is specifically configured to: constructing a first target data list according to each first temporary child node; and constructing a second target data list according to each second temporary child node; taking the root node of each first target data list as a first child node of the current public data tree; and taking the root node of each second target data list as a second child node of the current public data tree.
Optionally, the data type list generating module 530 is specifically configured to: generating a public data full list according to the public data tree and the sample data list; and generating the data type list according to the public data full list.
Optionally, the data type list generating module 530 is specifically configured to: traversing the public data tree, and constructing a public data intermediate list according to the traversing result; forming a corresponding sub-data list according to non-public data included in the sample data list; and expanding the public data intermediate list according to each sub data list to obtain the public data full list.
Optionally, the data type list generating module 530 is specifically configured to: determining the public data of the public data full list as a first data type; calculating the length information entropy of each sub-data list of the public data full list; and determining the data type of each sub data list according to the numerical relation between the length information entropy of each sub data list and the first set threshold value.
Optionally, the data type list generating module 530 is specifically configured to: determining the data type of the sub data list as a second data type under the condition that the length information entropy of the sub data list is larger than the first set threshold value; and determining the data type of the sub data list as the first data type under the condition that the length information entropy of the sub data list is smaller than or equal to the first set threshold value.
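The data type determination performed by the data type list generating module can be sketched as follows. As before, the length information entropy is assumed to be the Shannon entropy (base 2) of the data-length distribution; the labels "first" and "second", the function names, and the default threshold are hypothetical.

```python
import math
from collections import Counter


def length_entropy(items):
    """Assumed definition: Shannon entropy (base 2) of the string-length distribution."""
    counts = Counter(len(s) for s in items)
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def data_type_list(full_list, first_threshold=1.0):
    """Assign a data type to every entry of the public data full list."""
    types = []
    for item in full_list:
        if isinstance(item, str):
            # Public data shared by all samples is always the first data type
            types.append("first")
        elif length_entropy(item) > first_threshold:
            types.append("second")
        else:
            types.append("first")
    return types


list3 = [["http", "https", "http"], "://www.", ["cbidu", "za", "alucaaa"], ".com"]
print(data_type_list(list3, first_threshold=1.0))
# ['first', 'first', 'second', 'first']
```

With a lower first set threshold such as 0.5, the sub-data list ["http", "https", "http"] (entropy about 0.92 under the assumed definition) would also be assigned the second data type, which corresponds to the variant in which it is translated into a length range rather than an alternation.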
Optionally, the first regular expression generating module 540 is specifically configured to: and generating a plurality of regular expressions matched with the sample data list according to the data type list and the public data full list.
Optionally, the first regular expression generating module 540 is specifically configured to: acquiring current data to be processed of the public data full list according to a data sorting sequence; generating a sub-regular expression matched with the current data to be processed according to the quantity of the data to be processed included in the current data to be processed under the condition that the data type of the current data to be processed is determined to be a first data type; generating a sub-regular expression matched with the current data to be processed according to the length information entropy of the current data to be processed under the condition that the data type of the current data to be processed is determined to be a second data type; acquiring next data to be processed according to the data sorting sequence, and updating the current data to be processed according to the next data to be processed; returning to execute the operation of generating the sub-regular expression matched with the current data to be processed until all the data to be processed of the public data full list are processed; and generating a plurality of regular expressions matched with the sample data list according to each sub-regular expression.
Optionally, the first regular expression generating module 540 is specifically configured to: when the number of the data to be processed is determined to be the first number, the current data to be processed is directly used as a sub-regular expression matched with the current data to be processed; and when the number of the data to be processed is determined to be non-first, combining all the data of the current data to be processed as a sub-regular expression matched with the current data to be processed.
Optionally, the first regular expression generating module 540 is specifically configured to: when the length information entropy of the current data to be processed is larger than or equal to a second set threshold value, taking a preset character as a sub-regular expression matched with the current data to be processed; under the condition that the length information entropy of the current data to be processed is smaller than a second set threshold value and larger than a third set threshold value, the first length information and the second length information of each data of the current data to be processed are used as sub-regular expressions matched with the current data to be processed; and under the condition that the length information entropy of the current data to be processed is smaller than or equal to the third set threshold value, taking the third length information of each data of the current data to be processed as a sub-regular expression matched with the current data to be processed.
Optionally, the first regular expression generating module 540 is specifically configured to: acquiring the expression composition type of a target sub-regular expression in each sub-regular expression; and marking each target sub-regular expression by using a preset mark according to the expression composition type.
Optionally, the generating device of the regular expression further includes: the affinity calculation module is used for calculating the affinity of each regular expression by using an affinity function; and the target regular expression determining module is used for determining the regular expression corresponding to the target affinity as a target regular expression.
The regular expression generating device can execute the regular expression generating method provided by any embodiment of the application, and has the corresponding functional modules and beneficial effects of the executing method. Technical details which are not described in detail in the present embodiment can be referred to the method for generating a regular expression provided in any embodiment of the present application.
Since the above-described regular expression generating apparatus is an apparatus capable of executing the regular expression generating method in the embodiment of the present application, those skilled in the art can, based on the regular expression generating method described in the embodiment of the present application, understand the specific implementation of the regular expression generating apparatus of the present embodiment and its various variations. Therefore, how the regular expression generating apparatus implements the regular expression generating method in the embodiment of the present application will not be described in detail herein. Any apparatus adopted by those skilled in the art to implement the regular expression generation method in the embodiment of the present application falls within the scope of protection intended by the present application.
In an example, fig. 6 is a block diagram of a data extraction apparatus provided in an embodiment of the present application, where the embodiment of the present application may be applicable to a case of data extraction and classification according to an automatically generated regular expression, where the apparatus is implemented by software and/or hardware, and specifically configured in an electronic device. The electronic device may be a computer device or the like.
The data extraction apparatus 600 shown in fig. 6 includes: a to-be-processed data acquisition module 610, a second regular expression generation module 620, and a data extraction module 630. Wherein,
a data to be processed acquisition module 610, configured to acquire data to be processed;
a second regular expression generating module 620, configured to analyze the data to be processed and generate a regular expression matched with the data to be processed;
the data extraction module 630 is configured to perform data extraction on the data to be processed according to the generated regular expression;
wherein the regular expression is generated by the method of generating a regular expression of any of claims 1-13.
According to the method and the device, the public data tree corresponding to the sample data list is generated according to the sample data in the obtained sample data list, so that the data type list is generated according to the public data tree, a plurality of regular expressions matched with the sample data list are generated according to the data type list, the generated regular expressions are used for rapidly extracting and classifying data to be processed, automatic generation of the regular expressions is achieved, and the generation efficiency of the regular expressions and the extraction and classification efficiency of the data are improved.
The data extraction device can execute the data extraction method provided by any embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. Technical details not described in detail in this embodiment may be referred to the data extraction method provided in any embodiment of the present application.
Since the data extraction apparatus described above is an apparatus capable of performing the data extraction method in the embodiment of the present application, a person skilled in the art will be able to understand the specific implementation of the data extraction apparatus of the embodiment and various modifications thereof based on the data extraction method described in the embodiment of the present application, so how the data extraction apparatus implements the data extraction method in the embodiment of the present application will not be described in detail herein. The apparatus used to implement the data extraction method in the embodiments of the present application falls within the scope of protection intended by the present application.
In one example, the present application also provides an electronic device and a readable storage medium.
Fig. 7 is a schematic structural diagram, in block diagram form, of an electronic device used to implement a method for generating a regular expression or a method for extracting data according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing devices, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.
As shown in fig. 7, the electronic device includes: one or more processors 701, memory 702, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on memory to display graphical information of the GUI on an external input/output device, such as a display device coupled to the interface. In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 701 is taken as an example in fig. 7.
Memory 702 is a non-transitory computer-readable storage medium provided herein. The memory stores instructions executable by at least one processor to cause the at least one processor to perform the method for generating regular expressions or the method for extracting data provided by the present application. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to execute the regular expression generation method or the data extraction method provided by the present application.
The memory 702 is used as a non-transitory computer readable storage medium, and is used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the regular expression generation method or the data extraction method in the embodiments of the present application (e.g., the sample data list acquisition module 510, the common data tree generation module 520, the data type list generation module 530, and the first regular expression generation module 540 shown in fig. 5, or the pending data acquisition module 610, the second regular expression generation module 620, and the data extraction module 630 shown in fig. 6). The processor 701 executes various functional applications of the server and data processing, that is, implements the regular expression generation method or the data extraction method in the above-described method embodiment, by running non-transitory software programs, instructions, and modules stored in the memory 702.
Memory 702 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data created by use of an electronic device implementing a generation method of a regular expression or a data extraction method, or the like. In addition, the memory 702 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory 702 optionally includes memory remotely located relative to processor 701, which may be connected via a network to an electronic device implementing the regular expression generation method or the data extraction method. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device implementing the regular expression generation method or the data extraction method may further include: an input device 703 and an output device 704. The processor 701, the memory 702, the input device 703 and the output device 704 may be connected by a bus or in other manners; connection by a bus is taken as an example in fig. 7.
The input device 703 may receive input numeric or character information and generate key signal inputs related to user settings and function control of an electronic device implementing the regular expression generation method or the data extraction method, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointer stick, one or more mouse buttons, a track ball, a joystick, etc. The output device 704 may include a display apparatus, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibration motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also referred to as programs, software, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic disks, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client may be, but is not limited to, a smart phone, a notebook computer, a desktop computer, a tablet computer, a smart speaker, etc. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud computing, cloud service, cloud database, cloud storage and the like. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the embodiments of the present application, a public data tree corresponding to the sample data list is generated from the sample data in the acquired sample data list; a data type list is generated from the public data tree; and a plurality of regular expressions matching the sample data list are generated from the data type list. The generated regular expressions are then used to rapidly extract and classify data to be processed. Regular expressions are thereby generated automatically, which improves both the efficiency of generating regular expressions and the efficiency of extracting and classifying data.
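As a non-limiting illustration, the overall flow from sample data to a regular expression and then to data extraction may be sketched in Python as follows. The sketch is deliberately simplified and is not the method of the embodiments: it does not build the public data tree, but derives a data type list directly by grouping consecutive characters of the same coarse type, and it emits a single regular expression rather than a plurality. The function names `char_type`, `type_list` and `generate_regex`, and the two type categories `digit` and `alpha`, are hypothetical and chosen only for this example.

```python
import re
from itertools import groupby

# Assumed (hypothetical) mapping from coarse data types to regex character classes.
CLASS = {"digit": "[0-9]", "alpha": "[A-Za-z]"}


def char_type(ch):
    """Classify one character; anything that is not an ASCII digit or letter
    is kept as a literal character."""
    if "0" <= ch <= "9":
        return "digit"
    if ch.isascii() and ch.isalpha():
        return "alpha"
    return ch


def type_list(sample):
    """Collapse a sample string into a list of (type, run_length) pairs."""
    return [(t, len(list(run))) for t, run in groupby(sample, key=char_type)]


def generate_regex(samples):
    """Build one regular expression that fully matches every sample.

    Simplifying assumption: all samples share the same sequence of types;
    only the run lengths may differ between samples.
    """
    lists = [type_list(s) for s in samples]
    if len({tuple(t for t, _ in tl) for tl in lists}) != 1:
        raise ValueError("samples do not share a single type sequence")
    parts = []
    for column in zip(*lists):
        t = column[0][0]
        lo = min(n for _, n in column)
        hi = max(n for _, n in column)
        atom = CLASS.get(t) or re.escape(t)
        if lo == hi == 1:
            quant = ""
        elif lo == hi:
            quant = "{%d}" % lo
        else:
            quant = "{%d,%d}" % (lo, hi)
        parts.append(atom + quant)
    return "".join(parts)


# Generate a pattern from a sample data list, then use it to extract
# matching items from data to be processed.
pattern = generate_regex(["ab12", "xyz7"])
matches = [s for s in ["cd34", "q1", "hello"] if re.fullmatch(pattern, s)]
```

In this sketch the run lengths observed across the samples become the bounds of each quantifier, so the resulting pattern accepts only strings whose shape lies within the range covered by the samples.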
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions disclosed in the present application can be achieved, which is not limited herein.
The above embodiments do not limit the scope of the application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application are intended to be included within the scope of the present application.