CN112035449B - Data processing method and device, computer equipment and storage medium - Google Patents
- Publication number
- CN112035449B (application CN202010712476.3A)
- Authority
- CN
- China
- Prior art keywords
- data
- target
- groups
- classification
- data sets
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Fuzzy Systems (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Mathematical Physics (AREA)
- Quality & Reliability (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a data processing method and device, a computer device, and a storage medium, wherein the method includes: performing primary domain classification on data to be processed according to first key features contained in the data to be processed to obtain multiple groups of first data sets, wherein the data to be processed is unstructured data; converting a target first data set into a feature vector based on a classification model, and performing secondary domain classification on the target first data set according to the feature vector to obtain multiple groups of second data sets; extracting entity information from a target second data set based on an information extraction model; performing data cleaning on the entity information in the target second data set to obtain target data, wherein the target data is structured data; and importing the target data into a target structured database. The invention solves the technical problems in the related art of low efficiency and poor generality in extracting information from unstructured data.
Description
Technical Field
The present invention relates to the field of computers, and in particular, to a data processing method and apparatus, a computer device, and a storage medium.
Background
In the related art, structuring big data generally requires sorting out corresponding data templates according to data characteristics in order to extract data information. For large-scale data, however, this extraction approach is not universal: in big-data processing, a great deal of manual work is required whenever a new data template must be arranged, and exact matching and information extraction depend on manually rewritten precise rules, which easily causes hysteresis and limitations in information extraction. For example, the most common way of structuring text data today is to summarize data templates in a fixed format and use regular expressions to exactly extract the valid information in the text. This method is simple and effective for data with fixed templates and clear types. However, much real-world data has non-fixed templates and scattered, disordered types; if such data is still processed by a simple rule-based method, enormous manpower and time are consumed in arrangement and the efficiency is extremely low. Moreover, when data appears in a new form, the existing exact template-matching method fails almost completely, and a new data template must be arranged manually to handle it, further causing hysteresis in information processing.
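As a hypothetical sketch of the fixed-template, regular-expression extraction described above (the field names and record formats are illustrative assumptions, not taken from this document):

```python
import re

# A fixed-format template: "Name: X; Place: Y; Time: Z".
# It works only when every record follows this exact layout.
TEMPLATE = re.compile(
    r"Name:\s*(?P<name>[^;]+);\s*Place:\s*(?P<place>[^;]+);\s*Time:\s*(?P<time>\S+)"
)

def extract(record):
    """Return the fields if the record matches the template, else None."""
    m = TEMPLATE.search(record)
    return m.groupdict() if m else None

# A record in the expected layout is extracted correctly:
extract("Name: Zhang San; Place: Beijing; Time: 2020-07-23")
# A record in a new, unseen form does not match, so the rule fails:
extract("Zhang San visited Beijing on 2020-07-23")
```

The second call illustrates the limitation discussed above: any record in a new form fails to match, and a new template must be written by hand.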
In view of the above problems in the related art, no effective solution has been found yet.
Disclosure of Invention
The embodiment of the invention provides a data processing method and device, computer equipment and a storage medium, which at least solve the technical problems of low efficiency and limitation in extracting information in unstructured data in the related technology.
According to an embodiment of the present invention, there is provided a data processing method including: performing primary domain classification on the data to be processed according to first key features contained in the data to be processed to obtain a plurality of groups of first data sets, wherein the data to be processed is unstructured data; converting a target first data set into a feature vector based on a classification model, and performing secondary domain classification on the target first data set according to the feature vector to obtain a plurality of groups of second data sets, wherein the target first data set is any one group of data sets in the plurality of groups of first data sets; extracting entity information in a target second data set based on an information extraction model, wherein the target second data set is any one group of data set in the plurality of groups of second data sets; performing data cleaning on entity information in the target second data set to obtain target data, wherein the target data is structured data; and importing the target data into a target structured database.
Optionally, performing the primary domain classification on the data to be processed according to the first key features contained therein to obtain multiple groups of first data sets includes: step A1, classifying the data to be processed once according to the first key features to obtain multiple groups of first data sets; step A2, randomly extracting one or more groups of first data sets from the multiple groups, labeling the extracted groups with one or more primary class labels, and taking them as a first verification set; step A3, calculating a first precision rate P1 and a first recall rate R1 of the primary classification based on the actual class labels of the extracted groups and the first verification set, wherein for any group of first data sets whose class label is M, the precision rate is the ratio of the number of samples judged as label M whose actual label is also M to the number of samples judged as label M, and the recall rate is the ratio of the number of samples judged as label M whose actual label is also M to the number of samples in the group whose actual label is M; step A4, calculating the accuracy F1 of the primary classification by the formula F1 = 2P1R1/(P1 + R1); step A5, comparing F1 with a first threshold; and, if F1 is smaller than the first threshold, adding the first data misclassified by the primary classification to the multiple groups of first data sets and repeating steps A1-A5 until F1 is greater than or equal to the first threshold, then outputting the labeled multiple groups of first data sets.
Optionally, converting the target first data set into a feature vector based on the classification model, and performing secondary domain classification on the target first data set according to the feature vector to obtain multiple groups of second data sets, includes: step a1, digitizing the target first data set with the classification model to obtain a feature vector corresponding to the target first data set; step a2, classifying the target first data set according to the feature vector to obtain multiple groups of second data sets; step a3, randomly extracting one or more groups of second data sets from the multiple groups, labeling the extracted groups with one or more secondary class labels, and taking them as a second verification set; step a4, calculating a second precision rate P2 and a second recall rate R2 of the classification model based on the actual class labels of the extracted groups and the second verification set; step a5, calculating the accuracy F2 of the classification model by the formula F2 = 2P2R2/(P2 + R2); step a6, comparing F2 with a second threshold; and, if F2 is smaller than the second threshold, adding the second data misclassified by the classification model to the multiple groups of second data sets and repeating steps a1-a6 until F2 is greater than or equal to the second threshold, then outputting the multiple groups of second data sets.
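The vectorization and fine classification of steps a1-a2 might be sketched as follows; the bag-of-words features, vocabulary, and nearest-centroid rule are stand-in assumptions, since the document does not specify the classification model:

```python
from collections import Counter

# Hypothetical vocabulary for a "medical" coarse class.
VOCAB = ["fever", "cough", "fracture", "x-ray"]

def vectorize(text):
    """Step a1: turn a text record into a numeric feature vector."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in VOCAB]

# Hypothetical sub-domain centroids within the coarse class.
CENTROIDS = {
    "internal_medicine": [1, 1, 0, 0],
    "orthopedics":       [0, 0, 1, 1],
}

def fine_classify(text):
    """Step a2: assign the record to the sub-domain with the nearest centroid."""
    v = vectorize(text)
    return min(
        CENTROIDS,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(v, CENTROIDS[c])),
    )
```

In practice the classification model would be trained rather than hand-specified, but the flow (vectorize, then sub-classify by the vector) is the same.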
Optionally, extracting the entity information in the target second data set based on the information extraction model includes: step B1, extracting multiple pieces of entity information from the target second data set through the information extraction model; step B2, labeling the multiple pieces of entity information with multiple entity labels to obtain multiple labeled pieces of entity information, and taking them as a third verification set; step B3, calculating a third precision rate P3 and a third recall rate R3 of the information extraction model according to the actual entity labels of the multiple pieces of entity information and the third verification set; step B4, calculating the third accuracy F3 of the information extraction model by the formula F3 = 2P3R3/(P3 + R3); step B5, comparing F3 with a third threshold; and, if F3 is smaller than the third threshold, adding the third data mis-extracted by the information extraction model to the target second data set and repeating steps B1-B5 until F3 is greater than or equal to the third threshold, then outputting the entity information in the target second data set.
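A minimal stand-in for the information extraction model of step B1, using pattern-based entity tagging; the entity types and patterns are assumptions for illustration only:

```python
import re

# Toy "information extraction model": tag times and organization names.
# The labels and patterns below are invented for this sketch.
ENTITY_PATTERNS = {
    "time": r"\b\d{4}-\d{2}-\d{2}\b",
    "org":  r"\b[A-Z][A-Za-z]*\s(?:Ltd|Inc|Corp)\b",
}

def extract_entities(text):
    """Return (label, span) pairs for every pattern match in the text."""
    found = []
    for label, pattern in ENTITY_PATTERNS.items():
        for m in re.finditer(pattern, text):
            found.append((label, m.group()))
    return found
```

A learned sequence-labeling model would replace the patterns in a real pipeline, but the output shape (labeled entity spans feeding the verification of steps B2-B5) is the same.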
Optionally, performing data cleaning on the entity information in the target second data set to obtain the target data includes: step b1, performing data cleaning on the multiple pieces of entity information in the target second data set according to a preset rule, wherein the preset rule at least includes: filtering unreasonable data from the multiple pieces of entity information, and standardizing the format of the multiple pieces of entity information; step b2, randomly extracting one or more pieces of entity information from the cleaned entity information and calculating the qualification rate of the cleaning result; step b3, comparing the qualification rate with a fourth threshold; and, if the qualification rate is smaller than the fourth threshold, adding the fourth data that failed cleaning to the multiple pieces of entity information and repeating steps b1-b3 until the qualification rate is greater than or equal to the fourth threshold, then outputting the multiple pieces of entity information.
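The rule-based cleaning of steps b1-b3 can be sketched as below; the field names, validity rules, and date format are illustrative assumptions:

```python
import re

def clean_record(record):
    """Step b1: return a cleaned record, or None if it is unreasonable."""
    age = record.get("age")
    if age is None or not (0 <= age <= 150):   # filter unreasonable data
        return None
    date = record.get("date", "")
    m = re.fullmatch(r"(\d{4})/(\d{1,2})/(\d{1,2})", date)
    if m:                                       # standardize the date format
        y, mo, d = m.groups()
        date = f"{y}-{int(mo):02d}-{int(d):02d}"
    return {"age": age, "date": date}

records = [{"age": 30, "date": "2020/7/23"}, {"age": 999, "date": "2020/1/1"}]
cleaned = [r for r in (clean_record(x) for x in records) if r]

def qualification_rate(original, kept):
    """Step b2: fraction of records surviving the cleaning rules."""
    return len(kept) / len(original)
```

Comparing the qualification rate against the fourth threshold (step b3) then decides whether the failing records are fed back and the rules refined.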
Optionally, after performing data cleaning on the entity information in the target second data set to obtain the target data, the method further includes: determining a second key feature in the target data; searching the target data for a third key feature associated with the second key feature; searching, based on a knowledge graph, for a fourth key feature having a logical relationship with the third key feature, wherein the fourth key feature is structured data; and supplementing the fourth key feature to the target data.
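A toy illustration of the knowledge-graph supplement step just described, assuming a city-to-country relation; all feature names and graph entries are hypothetical:

```python
# Hypothetical knowledge graph: (entity, relation) -> related structured feature.
KNOWLEDGE_GRAPH = {
    ("Beijing", "located_in"): "China",
}

def supplement(target_data, second_key="city"):
    """Follow the second key feature to a third key feature, look up a
    logically related fourth key feature, and append it if found."""
    third = target_data.get(second_key)                    # third key feature
    fourth = KNOWLEDGE_GRAPH.get((third, "located_in"))    # fourth key feature
    if fourth is not None:
        target_data["country"] = fourth                    # supplement target data
    return target_data
```

When the graph has no matching entry, the target data is returned unchanged, matching the optional nature of the supplement.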
Optionally, after supplementing the fourth key feature to the target data, the method further includes: periodically updating the target data; and associating the fourth key feature with the second key feature in the updated target data.
According to an embodiment of the present invention, there is provided a data processing apparatus including: the first classification module is used for carrying out primary domain classification on the data to be processed according to first key features contained in the data to be processed to obtain a plurality of groups of first data sets, wherein the data to be processed is unstructured data; the second classification module is used for converting the target first data set into a feature vector based on a classification model, and carrying out secondary domain classification on the target first data set according to the feature vector to obtain a plurality of groups of second data sets, wherein the target first data set is any one group of data sets in the plurality of groups of first data sets; the extraction module is used for extracting entity information in a target second data set based on an information extraction model, wherein the target second data set is any one of the plurality of groups of second data sets; the cleaning module is used for carrying out data cleaning on the entity information in the target second data set to obtain target data, wherein the target data is structured data; and the importing module is used for importing the target data into a target structured database.
Optionally, the first classification module is configured to perform: step A1, classifying the data to be processed once according to the first key features to obtain multiple groups of first data sets; step A2, randomly extracting one or more groups of first data sets from the multiple groups, labeling the extracted groups with one or more primary class labels, and taking them as a first verification set; step A3, calculating a first precision rate P1 and a first recall rate R1 of the primary classification based on the actual class labels of the extracted groups and the first verification set, wherein for any group of first data sets whose class label is M, the precision rate is the ratio of the number of samples judged as label M whose actual label is also M to the number of samples judged as label M, and the recall rate is the ratio of the number of samples judged as label M whose actual label is also M to the number of samples in the group whose actual label is M; step A4, calculating the accuracy F1 of the primary classification by the formula F1 = 2P1R1/(P1 + R1); step A5, comparing F1 with a first threshold; and, if F1 is smaller than the first threshold, adding the first data misclassified by the primary classification to the multiple groups of first data sets and repeating steps A1-A5 until F1 is greater than or equal to the first threshold, then outputting the labeled multiple groups of first data sets.
Optionally, the second classification module is configured to perform: step a1, digitizing the target first data set with the classification model to obtain a feature vector corresponding to the target first data set; step a2, classifying the target first data set according to the feature vector to obtain multiple groups of second data sets; step a3, randomly extracting one or more groups of second data sets from the multiple groups, labeling the extracted groups with one or more secondary class labels, and taking them as a second verification set; step a4, calculating a second precision rate P2 and a second recall rate R2 of the classification model based on the actual class labels of the extracted groups and the second verification set; step a5, calculating the accuracy F2 of the classification model by the formula F2 = 2P2R2/(P2 + R2); step a6, comparing F2 with a second threshold; and, if F2 is smaller than the second threshold, adding the second data misclassified by the classification model to the multiple groups of second data sets and repeating steps a1-a6 until F2 is greater than or equal to the second threshold, then outputting the multiple groups of second data sets.
Optionally, the extracting module is configured to perform: step B1, extracting multiple pieces of entity information from the target second data set through the information extraction model; step B2, labeling the multiple pieces of entity information with multiple entity labels to obtain multiple labeled pieces of entity information, and taking them as a third verification set; step B3, calculating a third precision rate P3 and a third recall rate R3 of the information extraction model according to the actual entity labels of the multiple pieces of entity information and the third verification set; step B4, calculating the third accuracy F3 of the information extraction model by the formula F3 = 2P3R3/(P3 + R3); step B5, comparing F3 with a third threshold; and, if F3 is smaller than the third threshold, adding the third data mis-extracted by the information extraction model to the target second data set and repeating steps B1-B5 until F3 is greater than or equal to the third threshold, then outputting the entity information in the target second data set.
Optionally, the cleaning module is configured to perform: step b1, performing data cleaning on the multiple pieces of entity information in the target second data set according to a preset rule, wherein the preset rule at least includes: filtering unreasonable data from the multiple pieces of entity information, and standardizing the format of the multiple pieces of entity information; step b2, randomly extracting one or more pieces of entity information from the cleaned entity information and calculating the qualification rate of the cleaning result; step b3, comparing the qualification rate with a fourth threshold; and, if the qualification rate is smaller than the fourth threshold, adding the fourth data that failed cleaning to the multiple pieces of entity information and repeating steps b1-b3 until the qualification rate is greater than or equal to the fourth threshold, then outputting the multiple pieces of entity information.
Optionally, the apparatus further includes: the determining module is used for determining a second key feature in the target data after data cleaning is carried out on the entity information in the target second data set to obtain the target data; the searching module is used for searching third key features associated with the second key features in the target data; the searching module is used for searching a fourth key feature with a logical relation with the third key feature based on the knowledge graph, wherein the fourth key feature is structured data; and the adding module is used for supplementing the fourth key characteristic to the target data.
Optionally, the apparatus further includes: an updating module, configured to periodically update the target data after supplementing the fourth key feature to the target data; and an association module, configured to associate the fourth key feature with the second key feature in the updated target data.
According to a further embodiment of the invention there is also provided a computer device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
According to a further embodiment of the invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
According to the invention, unstructured data is first coarsely classified into broad domains according to key features; the coarsely classified data is then converted into feature vectors by the trained model and further finely classified according to the feature vectors, which resolves the scattered and disordered types of unstructured data. Entity backbone information is then extracted from the finely classified data through an information extraction model, the entity backbone information is cleaned to output structured data that is reasonable and format-standardized, and the result is finally imported into a structured database, making it convenient for users to analyze and manage unstructured data. Compared with the related art, the invention makes the processing of unstructured data more modular and pipelined, and solves the technical problems in the related art of low efficiency and limitation in extracting information from unstructured data.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a block diagram of a hardware configuration of a data processing method applied to a computer terminal according to an embodiment of the present invention;
FIG. 2 is a flow chart of a data processing method according to an embodiment of the invention;
FIG. 3 is a flowchart illustrating a method for data processing according to an embodiment of the present invention;
Fig. 4 is a block diagram of a data processing apparatus according to an embodiment of the present invention.
Detailed Description
The application will be described in detail hereinafter with reference to the drawings in conjunction with embodiments. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
Example 1
The method according to the first embodiment of the present application may be implemented in a mobile terminal, a server, a computer terminal, or a similar computing device. Taking a computer terminal as an example, fig. 1 is a block diagram of a hardware structure of a data processing method applied to a computer terminal according to an embodiment of the present application. As shown in fig. 1, the computer terminal may include one or more (only one is shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a microprocessor MCU or a processing device such as a programmable logic device FPGA) and a memory 104 for storing data, and optionally, a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those skilled in the art that the configuration shown in fig. 1 is merely illustrative and is not intended to limit the configuration of the computer terminal described above. For example, the computer terminal may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to a data processing method in an embodiment of the present invention, and the processor 102 executes the computer program stored in the memory 104 to perform various functional applications and data processing, that is, implement the above-mentioned method. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the computer terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission means 106 is arranged to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of a computer terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as a NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used to communicate with the internet wirelessly.
In order to solve the technical problems in the related art, in this embodiment, a data processing method is provided, and fig. 2 is a flowchart of a data processing method according to an embodiment of the present invention, as shown in fig. 2, where the flowchart includes the following steps:
step S202, carrying out primary domain classification on the data to be processed according to first key features contained in the data to be processed to obtain a plurality of groups of first data sets, wherein the data to be processed is unstructured data;
the unstructured data in this embodiment mainly refers to data that cannot be logically expressed and implemented by a fixed structure, and compared with the structured data, the most essential differences of the unstructured data mainly include three levels: unstructured data has a larger capacity than structured data; the speed of generation is faster than structured data; the data sources are diverse. Morphologically, the method mainly comprises the following steps: text, images, pictures, etc., video streams, television streams, etc.
Optionally, the key feature in this embodiment may be a keyword in the text data, but is not limited thereto. The main purpose of coarsely classifying unstructured data (i.e., the above primary domain classification) is to distinguish unstructured data by the domain to which it belongs, facilitating further data mining.
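A keyword-rule coarse classifier of the kind described here might look like the following sketch; the domains and keyword lists are invented for illustration:

```python
# Hypothetical keyword rules for the primary (coarse) domain classification.
COARSE_RULES = {
    "finance": ["stock", "bond", "interest rate"],
    "medical": ["patient", "diagnosis", "symptom"],
    "legal":   ["court", "plaintiff", "verdict"],
}

def coarse_classify(text):
    """Assign a record to the first domain whose keywords it mentions."""
    lowered = text.lower()
    for domain, keywords in COARSE_RULES.items():
        if any(kw in lowered for kw in keywords):
            return domain
    return "other"

# Group a raw stream into the first data sets by domain.
groups = {}
for doc in ["The patient showed a mild symptom.", "The court heard the plaintiff."]:
    groups.setdefault(coarse_classify(doc), []).append(doc)
```

Each resulting group corresponds to one first data set, ready for the finer secondary classification.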
Step S204, converting the target first data set into a feature vector based on the classification model, and carrying out secondary domain classification on the target first data set according to the feature vector to obtain a plurality of groups of second data sets, wherein the target first data set is any one group of data sets in the plurality of groups of first data sets;
In this embodiment, the above coarse classification only roughly divides unstructured data according to the broad domain to which it belongs. The coarsely classified data must therefore be further finely classified (i.e., the above secondary domain classification): subdividing the data under each coarse class sorts the unstructured data at a finer granularity and determines the finer domain to which it belongs.
Step S206, extracting entity information in a target second data set based on the information extraction model, wherein the target second data set is any one of a plurality of groups of second data sets;
In this embodiment, useful information is extracted from the subdivided result of step S204, for example entity information in text data, pattern information in image data, and the like. Optionally, taking text structuring as an example, the entity information may be text backbone information such as person names, place names, organization names, times, and behaviors.
Step S208, data cleaning is carried out on entity information in the target second data set to obtain target data, wherein the target data is structured data;
in this embodiment, the information extracted in step S206 is cleaned and corrected to obtain standardized data.
Step S210, importing the target data into a target structured database.
According to the embodiment of the invention, unstructured data is first coarsely classified into broad domains according to key features; the coarsely classified data is then converted into feature vectors by the trained model and further finely classified according to the feature vectors, which resolves the scattered and disordered types of unstructured data. Entity backbone information is then extracted from the finely classified data through an information extraction model, the entity backbone information is cleaned to output structured data that is reasonable and format-standardized, and the result is finally imported into a structured database, making it convenient for users to analyze and manage unstructured data. Compared with the related art, the embodiment makes the processing of unstructured data more modular and pipelined, and solves the technical problems in the related art of low efficiency and limitation in extracting information from unstructured data.
In an alternative embodiment of the present disclosure, performing the primary domain classification on the data to be processed according to the first key features contained therein to obtain multiple groups of first data sets includes: step A1, classifying the data to be processed once according to the first key features to obtain multiple groups of first data sets; step A2, randomly extracting one or more groups of first data sets from the multiple groups, labeling the extracted groups with one or more primary class labels, and taking them as a first verification set; step A3, calculating a first precision rate P1 and a first recall rate R1 of the primary classification based on the actual class labels of the extracted groups and the first verification set, wherein for any group of first data sets whose class label is M, the precision rate is the ratio of the number of samples judged as label M whose actual label is also M to the number of samples judged as label M, and the recall rate is the ratio of the number of samples judged as label M whose actual label is also M to the number of samples in the group whose actual label is M; step A4, calculating the accuracy F1 of the primary classification by the formula F1 = 2P1R1/(P1 + R1); step A5, comparing F1 with a first threshold; and, if F1 is smaller than the first threshold, adding the first data misclassified by the primary classification to the multiple groups of first data sets and repeating steps A1-A5 until F1 is greater than or equal to the first threshold, then outputting the labeled multiple groups of first data sets.
Taking text structuring as an example, the present scheme is further described with reference to an embodiment. Fig. 3 is a flowchart of an implementation of a data processing method according to an embodiment of the present invention. As shown in fig. 3, an unprocessed data stream is first obtained and input into a coarse classification model (i.e., the first-level domain classification described above), which classifies the stream according to the keywords detected in it (i.e., the varied and complicated unprocessed data stream is roughly classified according to the broad domain to which it belongs). In the coarse classification flow, the classification result is checked at regular intervals; when the accuracy (i.e., F1) reaches a preset coarse-classification threshold (i.e., the first threshold), the model is temporarily not iterated and the flow proceeds to the next step, second-level domain classification. If the accuracy does not reach the preset threshold, badcase data (bad cases, i.e., problem data; the first data above) is added and the model is iterated.
In this embodiment, calculating the accuracy of the coarse classification model generally requires extracting part of the data (i.e., the one or more groups of first data sets) as a verification set (i.e., the first verification set). The data in the verification set is labeled with class labels as the verification reference (i.e., the actual class labels), and is also labeled with class labels by the coarse classification model; finally, the consistency of the two sets of labels is compared.
Specifically, the consistency of the two sets of labels is measured in two aspects: the precision of the model (i.e., P1 above; Precision) and its recall (i.e., R1 above; Recall). For a certain class of data samples (i.e., any group of the first data sets above) with class label M, precision = (number of samples judged by the model to be label M whose actual label is also M) / (number of samples judged by the model to be label M), i.e., how much of the data the model assigns to a class actually belongs there; recall = (number of samples judged by the model to be label M whose actual label is also M) / (number of samples in the group whose actual label is M), i.e., how much of the data that should belong to a class is correctly found by the model. For the one or more groups of first data sets, the overall precision P1 and recall R1 of the first-level domain coarse classification can then be obtained by averaging (or weighted averaging) over the classes. Finally, the harmonic mean (F1-score) of precision and recall is taken as the final accuracy: F1 = 2·P1·R1/(P1+R1). When the model accuracy falls below the preset threshold, the model is iterated by adding badcases. Since coarse classification mainly uses rules, adding badcase data specifically means supplementing the coarse classification rules with data whose class the model judged incorrectly; the rule model is usually iterated multiple times, supplementing the rules each time, until its accuracy reaches the threshold.
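As a concrete illustration, the precision, recall, and F1 computation described above can be sketched as follows (a minimal sketch in Python; the function name and the plain macro-averaging over classes are assumptions for illustration, not mandated by this scheme):

```python
def precision_recall_f1(actual_labels, predicted_labels):
    """Macro-averaged precision (P1), recall (R1), and F1 over all class labels."""
    labels = set(actual_labels) | set(predicted_labels)
    precisions, recalls = [], []
    for m in labels:
        judged_m = sum(1 for p in predicted_labels if p == m)
        actual_m = sum(1 for a in actual_labels if a == m)
        correct_m = sum(1 for a, p in zip(actual_labels, predicted_labels)
                        if a == m and p == m)
        # Precision: of the samples judged to be label M, how many are actually M.
        precisions.append(correct_m / judged_m if judged_m else 0.0)
        # Recall: of the samples actually labeled M, how many were judged to be M.
        recalls.append(correct_m / actual_m if actual_m else 0.0)
    p = sum(precisions) / len(precisions)          # overall P1
    r = sum(recalls) / len(recalls)                # overall R1
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0   # harmonic mean F1
    return p, r, f1
```

When F1 falls below the threshold, the misclassified verification samples would be fed back as badcases, as described above.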
In addition, the unprocessed data stream in this embodiment generally refers to either dynamically growing incremental data or static data. Dynamic data can be stored in a distributed database for convenient querying and processing, while static data can be stored according to actual needs: in various databases, or as text-format files.
In this embodiment, the coarse classification model is mainly a rule model: text data is roughly judged according to the keywords it contains, assigned to a broad class, and then handed to the fine classification model for finer-grained processing. As new data arrives, the form of the data gradually changes, so the classification result needs to be checked in time and the coarse classification model iterated accordingly.
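The keyword-based rule model described above can be sketched as follows (a minimal illustration; the rule table, category names, and substring-count scoring are assumptions — the scheme only specifies that broad domains are assigned from detected keywords):

```python
# Illustrative rule table: broad domain -> trigger keywords (assumed values).
COARSE_RULES = {
    "travel": ["train", "flight", "arrive", "depart"],
    "entertainment": ["film", "music", "concert"],
    "economy": ["stock", "market", "inflation"],
}

def coarse_classify(text, rules=COARSE_RULES, default="other"):
    """Assign the broad domain whose keywords occur most often in the text.

    Scoring here is a simple substring count; a real rule model would use
    tokenization and hand-tuned, iteratively supplemented rules.
    """
    scores = {label: sum(text.count(kw) for kw in kws)
              for label, kws in rules.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default
```

Iterating the model on badcases then amounts to extending the rule table for texts the current rules misjudge.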
In an optional embodiment of the present disclosure, converting the target first data set into a feature vector based on the classification model and performing second-level domain classification on the target first data set according to the feature vector to obtain a plurality of groups of second data sets includes: step a1, digitizing the target first data set with the classification model to obtain a feature vector corresponding to the target first data set; step a2, classifying the target first data set according to the feature vector to obtain a plurality of groups of second data sets; step a3, randomly extracting one or more groups of second data sets from the plurality of groups, labeling them with one or more secondary category labels, and taking them as a second verification set; step a4, calculating a second precision P2 and a second recall R2 of the classification model based on the actual class labels of the one or more groups of second data sets and the second verification set; step a5, calculating the accuracy F2 of the classification model by the formula F2 = 2·P2·R2/(P2+R2); step a6, comparing F2 with a second threshold; if F2 is smaller than the second threshold, adding second data with classification errors to the plurality of groups of second data sets and repeating steps a1-a6 until F2 is greater than or equal to the second threshold, then outputting the plurality of groups of second data sets.
As shown in fig. 3, finely classifying the coarsely classified data includes: sampling and checking the classification result at regular intervals; when the accuracy (i.e., F2) reaches the threshold (i.e., the second threshold), the model is temporarily not iterated and the flow proceeds to the next step, entity information extraction; if the accuracy does not reach the threshold, badcase data (i.e., the second data) is added and the model is iterated.
In this embodiment, fine text classification refines the result of the coarse classification above: data in a broad domain is further subdivided into finer domains. The main steps are classification data labeling, classification model training, and classification model iteration.
According to the above embodiment, the model iteration process in fine classification uses the same accuracy calculation as coarse classification; the differences are mainly two. First, the labels and levels of the models differ (first-level versus second-level classification): fine classification sits under coarse classification, and each coarse class may contain several fine classes. For example, a news text may first pass through the coarse classification model to determine whether it is political, entertainment, or economic; if the coarse model judges it to be entertainment, the text is then passed to the fine classification model under entertainment to further determine whether it is music, film, or other entertainment. Second, the model types differ, i.e., the specific algorithms used to process the data: coarse classification usually uses rules, while fine classification usually uses machine learning and deep learning classification algorithms. The text data is first converted into feature vectors in some manner; these feature vectors can represent more detailed features of the text (i.e., classification is based on the whole text rather than on keywords alone). The converted text features are input into the model, and the fine classification model is produced after a certain number of training iterations.
In one example of this embodiment, the fine classification model may be a CNN (Convolutional Neural Network) related model, such as TextCNN (a CNN for text); it may also be an RNN (Recurrent Neural Network) related model, such as LSTM (Long Short-Term Memory network).
The fine classification model needs a shorter iteration period than the coarse classification model. The main reason is that the keywords on which coarse classification relies do not change much in the short term, while the form of the data under each fine class changes faster due to the diversity of information descriptions, placing higher demands on the model iteration period. In addition, the machine learning and deep learning models adopted for fine classification yield a finer and more accurate classification effect than the rule model of coarse classification.
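The vectorize-then-classify flow of fine classification can be sketched as follows; a bag-of-words vector and a nearest-centroid classifier stand in here for the TextCNN/LSTM models named above, purely to illustrate the flow (the vocabulary, labels, and function names are assumptions):

```python
import math

def vectorize(text, vocabulary):
    """Bag-of-words feature vector for the text over a fixed vocabulary."""
    words = text.lower().split()
    return [float(words.count(w)) for w in vocabulary]

def nearest_centroid(vec, centroids):
    """Assign the fine-class label whose centroid is closest in Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(centroids, key=lambda label: dist(vec, centroids[label]))
```

A trained deep model replaces both the hand-built vocabulary (with learned embeddings) and the centroid rule (with learned decision boundaries), but the two-stage shape — digitize, then classify — is the same.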
In an alternative embodiment, extracting entity information in the target second data set based on the information extraction model includes: step B1, extracting a plurality of entity information items from the target second data set through the information extraction model; step B2, labeling the plurality of entity information items with a plurality of entity labels to obtain a plurality of labeled entity information items, and taking them as a third verification set; step B3, calculating a third precision P3 and a third recall R3 of the information extraction model according to the actual entity labels of the plurality of entity information items and the third verification set; step B4, calculating a third accuracy F3 of the information extraction model by the formula F3 = 2·P3·R3/(P3+R3); step B5, comparing F3 with a third threshold; if F3 is smaller than the third threshold, adding third data with extraction errors to the target second data set and repeating steps B1-B5 until F3 is greater than or equal to the third threshold, then outputting the entity information in the target second data set.
As shown in fig. 3, extracting information from the finely classified data includes: sampling and checking the information extraction result at regular intervals; when the accuracy (i.e., F3) reaches the threshold (i.e., the third threshold), the model is temporarily not iterated and the flow proceeds to the next step, data cleaning; if the accuracy does not reach the threshold, badcase data (i.e., the third data) is added and the model is iterated.
In this embodiment, the main purpose of text entity extraction is to extract the trunk of the text information. Common entities include: person names, place names, times, institutions, behaviors, and the like. The specific entities to extract depend on the detailed text form of each fine class. The main steps are: entity labeling, entity extraction model training, and entity extraction model iteration.
The accuracy of the text entity extraction model is likewise measured by periodically labeling a test set (i.e., the third verification set), running the entity extraction model on it, and comparing the predictions with the actual entity labels to obtain the model's accuracy (i.e., F3). As with the classification models, this accuracy is the harmonic mean of the model's precision and recall. If the model accuracy on the test set reaches the threshold, the model is temporarily not updated; if it does not, the data mispredicted by the model on the test set (i.e., the third data), together with their actual entity labels, are put into the training set of the entity extraction model, and the previous version of the model is iterated again.
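The entity extraction step can be sketched as follows, with a few regular expressions standing in for the trained extraction model (the entity types, field names, and patterns are illustrative assumptions chosen to fit the travel example used later in this document):

```python
import re

def extract_entities(text):
    """Pull a small set of entities from travel text (regexes stand in for a model)."""
    entities = {}
    # Dates already in YYYY-MM-DD form; the later cleaning step normalizes others.
    date = re.search(r"\b\d{4}-\d{2}-\d{2}\b", text)
    if date:
        entities["departure_date"] = date.group(0)
    # Capitalized word sequence after "arrive(s/d) at/in" as the arrival station.
    station = re.search(r"arrive(?:s|d)? (?:at|in) ([A-Z][a-z]+(?: [A-Z][a-z]+)*)",
                        text)
    if station:
        entities["arrival_station"] = station.group(1)
    return entities
```

A trained sequence-labeling model would replace these patterns, but the contract is the same: text in, labeled entity fields out, with mispredicted samples fed back into the training set.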
Optionally, performing data cleaning on the entity information in the target second data set to obtain the target data includes: step b1, cleaning a plurality of entity information items in the target second data set according to preset rules, where the preset rules at least include: filtering unreasonable data from the plurality of entity information items and standardizing their format; step b2, randomly extracting one or more entity information items from the cleaned entity information and calculating the qualification rate of the cleaning result; step b3, comparing the qualification rate with a fourth threshold; if the qualification rate is smaller than the fourth threshold, adding fourth data that failed cleaning to the plurality of entity information items and repeating steps b1-b3 until the qualification rate is greater than or equal to the fourth threshold, then outputting the plurality of entity information items.
As shown in fig. 3, the data cleaning includes: sampling and checking the cleaning result at regular intervals; when the accuracy (i.e., the qualification rate) reaches the threshold (i.e., the fourth threshold), the cleaning rule model is temporarily not iterated and the flow proceeds to the next operation; if the accuracy does not reach the threshold, the problem data is sorted through and the cleaning rules are iterated.
In this embodiment, data cleaning mainly further cleans the extracted entities. The main cleaning process can be summarized as: filtering impurities, judging reasonableness, and standardizing formats. For example, when cleaning a Chinese "person name" entity, all characters other than Chinese characters are impurities; when cleaning a time entity, a month of "13" makes the time data unreasonable. As another example, time can be written in many formats: "May 24, 2020" is often rendered as "2020-05-24", and the format needs to be standardized and unified for convenience in subsequent processing.
Alternatively, the accuracy of the cleaning result may be measured from two aspects: format uniformity on the one hand and content reasonableness on the other. Format uniformity means that instances of the same entity share the same format and the content conforms to the specification, for example, all dates being cleaned into the XXXX-XX-XX format; content reasonableness means that the content of each cleaned field meets the specific requirements of that field, for example, an age must fall within [0, 150]. The cleaned result is the obtained structured data.
In this embodiment, measuring the accuracy of the cleaning result also requires regular sampling; when the sampled qualification rate falls below the threshold, the cleaning policy needs to be updated and cleaning rules added for the badcase data (i.e., the fourth data) that was not cleaned correctly.
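The cleaning rules described above — filtering unreasonable values and standardizing formats — can be sketched as follows (the accepted input date formats and function names are assumptions; the [0, 150] age range and the YYYY-MM-DD target format come from this document):

```python
from datetime import datetime

# Input formats the cleaner accepts (an assumed list); output is always YYYY-MM-DD.
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y"]

def clean_date(raw):
    """Standardize a date string to YYYY-MM-DD; return None if unreasonable."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # e.g. a month of "13" fails every format and is filtered out

def clean_age(raw):
    """Keep only ages inside the reasonable range [0, 150]."""
    try:
        age = int(raw)
    except ValueError:
        return None
    return age if 0 <= age <= 150 else None
```

Fields cleaned to None would count against the qualification rate and feed the next iteration of the cleaning rules.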
In another optional embodiment of the present disclosure, after performing data cleaning on the entity information in the target second data set to obtain the target data, the method further includes: determining a second key feature in the target data; searching the target data for a third key feature associated with the second key feature; searching, based on a knowledge graph, for a fourth key feature that has a logical relation with the third key feature, where the fourth key feature is structured data; and supplementing the fourth key feature to the target data.
Optionally, after supplementing the fourth key feature to the target data, the method further includes: periodically updating the target data; and associating the fourth key feature with the second key feature in the updated target data.
As shown in fig. 3, after the accuracy of the data cleaning reaches the threshold, the data is supplemented, with the supplementation logic updated periodically, and the resulting final data is imported into the structured database.
In this embodiment, data supplementation is often performed by means of knowledge graphs or other related structured data, increasing the richness of the data by reasoning over the structured data: the cleaning result is logically supplemented according to the data's other dependencies, mining the explicit and implicit information in the data as much as possible.
For example: the work unit of a person (i.e. the second key feature) is a (i.e. the third key feature), and the address of the unit a searched in the related knowledge graph or table is at B (i.e. the fourth key feature), the work place of the person can be supplemented based on the address.
Optionally, the logical supplementation may be implemented as a table join operation in the database.
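The logical supplementation step can be sketched as a lookup-table join (the table contents and field names are illustrative assumptions; the mapping mirrors the work-unit/address example above):

```python
# Stand-in for a knowledge graph / lookup table of organization addresses.
ORG_ADDRESS = {
    "Unit A": "City B",
}

def supplement(record, org_table=ORG_ADDRESS):
    """Fill in a derived work_place field by joining on the work_unit field."""
    enriched = dict(record)  # leave the original cleaned record untouched
    unit = record.get("work_unit")
    if unit in org_table:
        enriched["work_place"] = org_table[unit]
    return enriched
```

In a database, the same enrichment would be a join between the cleaned-entity table and the structured knowledge table on the work-unit key.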
In an alternative example of this scheme, taking text structuring as an example: suppose the text before structuring (i.e., the data to be processed above) is "On May 1, 2020, Zhang San arrived at Nanjing South by train". The text is first classified into a first-level domain based on the keywords "by train" and "arrived" (i.e., the first key feature above), and the coarse classification label is "travel". The text is then converted into a feature vector by the classification model; the feature vector covers all textual features of the text, and based on them the coarsely classified text is further finely classified (i.e., the second-level domain classification), with the fine classification label "transportation travel_railway travel". The entity information "Zhang San", "departure date", "arrival station", and "by train" is extracted from the text by the information extraction model. The extracted entity information is then cleaned, and the cleaned result includes: name "Zhang San", departure date "2020-05-01", arrival station "Nanjing South". Finally, through logical supplementation on the arrival station "Nanjing South", the city Zhang San arrived in is obtained as "Nanjing".
Through the above embodiment, various kinds of unstructured data pass through the processes of data classification, information extraction, and data cleaning, forming a mature data structuring production line from construction to iteration. With these more complete processes, most of the steps in text structuring are modeled and modularized, saving labor and time costs and making the fullest use of the data.
Example 2
In this embodiment, a data processing device is further provided, and the device is used to implement the foregoing embodiments and preferred embodiments, and will not be described in detail. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 4 is a block diagram of a data processing apparatus according to an embodiment of the present invention, as shown in fig. 4, including: the first classification module 402 is configured to perform primary domain classification on data to be processed according to first key features included in the data to be processed, so as to obtain multiple groups of first data sets, where the data to be processed is unstructured data; the second classification module 404 is connected to the first classification module 402, and is configured to convert the target first data set into a feature vector based on the classification model, and perform a second-level domain classification on the target first data set according to the feature vector to obtain multiple sets of second data sets, where the target first data set is any one of the multiple sets of first data sets; an extracting module 406, coupled to the second classifying module 404, configured to extract entity information in a target second data set based on an information extraction model, where the target second data set is any one of a plurality of sets of second data sets; the cleansing module 408 is connected to the extracting module 406, and is configured to perform data cleansing on entity information in the second data set to obtain target data, where the target data is structured data; an import module 410, coupled to the cleansing module 408, is configured to import the target data into the target structured database.
Optionally, the first classification module 402 is configured to: step A1, perform first-level classification on the data to be processed according to the first key feature to obtain a plurality of groups of first data sets; step A2, randomly extract one or more groups of first data sets from the plurality of groups, label them with one or more primary class labels, and take them as a first verification set; step A3, calculate a first precision P1 and a first recall R1 of the first-level classification based on the actual class labels of the one or more groups of first data sets and the first verification set, where, for any group of first data sets with class label M, the precision is the ratio of the number of samples judged to be label M whose actual label is also M to the total number of samples judged to be label M, and the recall is the ratio of the number of samples judged to be label M whose actual label is also M to the total number of samples whose actual label is M; step A4, calculate the accuracy F1 of the first-level classification by the formula F1 = 2·P1·R1/(P1+R1); step A5, compare F1 with a first threshold; if F1 is smaller than the first threshold, add first data with first-level classification errors to the plurality of groups of first data sets and repeat steps A1-A5 until F1 is greater than or equal to the first threshold, then output the plurality of groups of labeled first data sets.
Optionally, the second classification module 404 is configured to: step a1, digitize the target first data set with the classification model to obtain a feature vector corresponding to the target first data set; step a2, classify the target first data set according to the feature vector to obtain a plurality of groups of second data sets; step a3, randomly extract one or more groups of second data sets from the plurality of groups, label them with one or more secondary category labels, and take them as a second verification set; step a4, calculate a second precision P2 and a second recall R2 of the classification model based on the actual class labels of the one or more groups of second data sets and the second verification set; step a5, calculate the accuracy F2 of the classification model by the formula F2 = 2·P2·R2/(P2+R2); step a6, compare F2 with a second threshold; if F2 is smaller than the second threshold, add second data with classification errors to the plurality of groups of second data sets and repeat steps a1-a6 until F2 is greater than or equal to the second threshold, then output the plurality of groups of second data sets.
Optionally, the extracting module 406 is configured to: step B1, extract a plurality of entity information items from the target second data set through the information extraction model; step B2, label the plurality of entity information items with a plurality of entity labels to obtain a plurality of labeled entity information items, and take them as a third verification set; step B3, calculate a third precision P3 and a third recall R3 of the information extraction model according to the actual entity labels of the plurality of entity information items and the third verification set; step B4, calculate a third accuracy F3 of the information extraction model by the formula F3 = 2·P3·R3/(P3+R3); step B5, compare F3 with a third threshold; if F3 is smaller than the third threshold, add third data with extraction errors to the target second data set and repeat steps B1-B5 until F3 is greater than or equal to the third threshold, then output the entity information in the target second data set.
Optionally, the cleaning module 408 is configured to: step b1, clean a plurality of entity information items in the target second data set according to preset rules, where the preset rules at least include: filtering unreasonable data from the plurality of entity information items and standardizing their format; step b2, randomly extract one or more entity information items from the cleaned entity information and calculate the qualification rate of the cleaning result; step b3, compare the qualification rate with a fourth threshold; if the qualification rate is smaller than the fourth threshold, add fourth data that failed cleaning to the plurality of entity information items and repeat steps b1-b3 until the qualification rate is greater than or equal to the fourth threshold, then output the plurality of entity information items.
Optionally, the apparatus further includes: the determining module is used for determining second key features in the target data after data cleaning is carried out on the entity information in the target second data set to obtain the target data; the searching module is used for searching third key features related to the second key features in the target data; the searching module is used for searching a fourth key feature with a logic relation with the third key feature based on the knowledge graph, wherein the fourth key feature is structured data; and the adding module is used for supplementing the fourth key characteristic to the target data.
Optionally, the apparatus further includes: an updating module configured to periodically update the target data after the fourth key feature is supplemented to the target data; and an association module configured to associate the fourth key feature with the second key feature in the updated target data.
It should be noted that each of the above modules may be implemented by software or hardware, and for the latter, it may be implemented by, but not limited to: the modules are all located in the same processor; or the above modules may be located in different processors in any combination.
Example 3
An embodiment of the invention also provides a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
Alternatively, in the present embodiment, the above-described storage medium may be configured to store a computer program for performing the steps of:
S1, carrying out primary domain classification on data to be processed according to first key features contained in the data to be processed to obtain a plurality of groups of first data sets, wherein the data to be processed is unstructured data;
S2, converting a target first data set into a feature vector based on a classification model, and performing secondary domain classification on the target first data set according to the feature vector to obtain a plurality of groups of second data sets, wherein the target first data set is any one group of data sets in the plurality of groups of first data sets;
S3, extracting entity information in a target second data set based on an information extraction model, wherein the target second data set is any one of the plurality of groups of second data sets;
S4, data cleaning is carried out on the entity information in the target second data set to obtain target data, wherein the target data is structured data;
S5, importing the target data into a target structured database.
Alternatively, in the present embodiment, the storage medium may include, but is not limited to: a USB flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing a computer program.
An embodiment of the invention also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, where the transmission device is connected to the processor, and the input/output device is connected to the processor.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:
S1, carrying out primary domain classification on data to be processed according to first key features contained in the data to be processed to obtain a plurality of groups of first data sets, wherein the data to be processed is unstructured data;
S2, converting a target first data set into a feature vector based on a classification model, and performing secondary domain classification on the target first data set according to the feature vector to obtain a plurality of groups of second data sets, wherein the target first data set is any one group of data sets in the plurality of groups of first data sets;
S3, extracting entity information in a target second data set based on an information extraction model, wherein the target second data set is any one of the plurality of groups of second data sets;
S4, data cleaning is carried out on the entity information in the target second data set to obtain target data, wherein the target data is structured data;
S5, importing the target data into a target structured database.
Alternatively, for specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiments and optional implementations; details are not repeated here.
It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented on a general-purpose computing device; they may be concentrated on a single computing device or distributed across a network of computing devices. Optionally, they may be implemented as program code executable by computing devices, so that they may be stored in a storage device for execution by computing devices; in some cases, the steps shown or described may be performed in a different order than presented here. Alternatively, they may be fabricated into individual integrated circuit modules, or multiple of the modules or steps may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description covers only preferred embodiments of the present invention and is not intended to limit it; those skilled in the art may make various modifications and variations. Any modification, equivalent replacement, improvement, etc. made within the principle of the present invention should be included in its protection scope.
Claims (9)
1. A method of data processing, comprising:
Performing primary domain classification on the data to be processed according to first key features contained in the data to be processed to obtain a plurality of groups of first data sets, wherein the data to be processed is unstructured data;
Converting a target first data set into a feature vector, and performing secondary domain classification on the target first data set based on a classification model according to the feature vector to obtain a plurality of groups of second data sets, wherein the target first data set is any one group of data sets in the plurality of groups of first data sets;
Extracting entity information in a target second data set based on an information extraction model, wherein the target second data set is any one group of data set in the plurality of groups of second data sets;
Performing data cleaning on entity information in the target second data set to obtain target data, wherein the target data is structured data;
importing the target data into a target structured database;
The method for converting the target first data set into the feature vector, and carrying out secondary domain classification on the target first data set based on the classification model according to the feature vector to obtain a plurality of groups of second data sets comprises the following steps:
Step a1, performing digital processing on the target first data set by adopting the classification model to obtain a feature vector corresponding to the target first data set;
Step a2, classifying the target first data set once according to the feature vector to obtain a plurality of groups of second data sets;
Step a3, randomly extracting one or more groups of second data sets from the plurality of groups of second data sets, labeling one or more secondary category labels for the one or more groups of second data sets, and taking the one or more groups of second data sets as a second verification set;
Step a4, calculating a second precision rate P2 and a second recall rate R2 of the classification model based on the actual class labels of the one or more groups of second data sets and the second verification set;
Step a5, calculating the accuracy F2 of the classification model through the following formula: F2 = 2 × P2 × R2 / (P2 + R2);
Step a6, comparing the F2 with a second threshold value;
If F2 is smaller than the second threshold value, adding the second data misclassified by the classification model to the plurality of groups of second data sets, cycling the operations of steps a1-a6 until F2 is greater than or equal to the second threshold value, and outputting the plurality of groups of second data sets.
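The quality gate in steps a4-a6 is a standard F-measure check. A small sketch follows, with a hand-made verification set standing in for the sampled second data sets; the labels and the threshold value are illustrative assumptions.

```python
# Compute P2, R2 and F2 = 2*P2*R2/(P2+R2) for one secondary category label
# on a toy verification set; labels and threshold are illustrative only.

def precision_recall_f(predicted, actual, label):
    tp = sum(p == label and a == label for p, a in zip(predicted, actual))
    fp = sum(p == label and a != label for p, a in zip(predicted, actual))
    fn = sum(p != label and a == label for p, a in zip(predicted, actual))
    p2 = tp / (tp + fp) if tp + fp else 0.0   # precision for this label
    r2 = tp / (tp + fn) if tp + fn else 0.0   # recall for this label
    f2 = 2 * p2 * r2 / (p2 + r2) if p2 + r2 else 0.0
    return p2, r2, f2

predicted = ["bond", "bond", "stock", "bond"]   # model's secondary labels
actual    = ["bond", "stock", "stock", "bond"]  # annotated actual labels
p2, r2, f2 = precision_recall_f(predicted, actual, "bond")
# tp=2, fp=1, fn=0, so P2=2/3, R2=1.0, F2=0.8
second_threshold = 0.75
passes = f2 >= second_threshold   # step a6; if False, loop steps a1-a6
```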
2. The method of claim 1, wherein performing a first-level domain classification on the data to be processed according to a first key feature included in the data to be processed to obtain a plurality of sets of first data sets, comprises:
Step A1, classifying the data to be processed once according to the first key features to obtain a plurality of groups of first data sets;
Step A2, randomly extracting one or more groups of first data sets from the plurality of groups of first data sets, labeling one or more primary class labels for the one or more groups of first data sets, and taking the one or more groups of first data sets as a first verification set;
Step A3, calculating a first precision rate P1 and a first recall rate R1 of the first-stage classification based on the actual class labels of the one or more groups of first data sets and the first verification set, wherein the precision rate represents the ratio of the number of samples that are judged to be label M and actually are label M to the total number of samples judged to be label M, the recall rate represents the ratio of the number of samples that are judged to be label M and actually are label M to the total number of samples in any group of first data sets that actually are label M, and M represents the class label of any group of first data sets;
Step A4, calculating the accuracy F1 of the first-stage classification through the following formula: F1 = 2 × P1 × R1 / (P1 + R1);
Step A5, comparing the F1 with a first threshold value;
and if F1 is smaller than the first threshold value, adding the first data misclassified by the first-stage classification to the plurality of groups of first data sets, cycling the operations of steps A1 to A5 until F1 is greater than or equal to the first threshold value, and outputting the labeled plurality of groups of first data sets.
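The retraining loop of steps A1-A5 (add misclassified samples back, reclassify, re-check the score) can be sketched as follows. Two simplifying assumptions are made here: "retraining" is simulated by a memorizing classifier, and plain accuracy stands in for the full F1 computation.

```python
# Sketch of the steps A1-A5 feedback loop. 'Retraining' is simulated by a
# memorizing classifier, and plain accuracy stands in for F1 (assumptions).

def feedback_loop(samples, true_labels, threshold, max_rounds=10):
    memory = {}   # stands in for the model's learned parameters
    score = 0.0
    for _ in range(max_rounds):
        predicted = [memory.get(s, "unknown") for s in samples]        # A1
        score = sum(p == t for p, t in zip(predicted, true_labels)) / len(samples)
        if score >= threshold:                                         # A5
            break
        # add the misclassified first data back into the training pool
        for s, p, t in zip(samples, predicted, true_labels):
            if p != t:
                memory[s] = t
    return score

print(feedback_loop(["doc1", "doc2", "doc3"], ["news", "news", "legal"], 0.9))
# prints 1.0: every sample is classified correctly after one corrective round
```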
3. The method of claim 1, wherein extracting entity information in the target second data set based on the information extraction model comprises:
Step B1, extracting a plurality of entity information in the target second data set through the information extraction model;
Step B2, labeling a plurality of entity labels for the plurality of entity information to obtain a plurality of labeled entity information, and taking the plurality of labeled entity information as a third verification set;
Step B3, calculating a third precision rate P3 and a third recall rate R3 of the information extraction model according to the actual entity labels of the plurality of entity information and the third verification set;
Step B4, calculating a third accuracy F3 of the information extraction model through the following formula: F3 = 2 × P3 × R3 / (P3 + R3);
Step B5, comparing the F3 with a third threshold value;
and if the F3 is smaller than the third threshold value, adding the third data erroneously extracted by the information extraction model to the target second data set, cycling the operations of steps B1-B5 until the F3 is greater than or equal to the third threshold value, and outputting the entity information in the target second data set.
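The claims do not specify the information extraction model itself. As an illustration only, step B1 can be sketched with a rule-based extractor; the two entity types ("date", "amount") and their patterns are assumptions, not part of the patent.

```python
import re

# Toy rule-based stand-in for the information extraction model in step B1;
# the entity types ("date", "amount") and their patterns are assumptions.
PATTERNS = {
    "date":   re.compile(r"\d{4}-\d{2}-\d{2}"),
    "amount": re.compile(r"\$\d+(?:\.\d{2})?"),
}

def extract_entities(text):
    """Return every matched entity, keyed by its entity label."""
    return {label: pat.findall(text) for label, pat in PATTERNS.items()}

info = extract_entities("Invoice dated 2020-07-22, total $45.00, fee $3.")
# info == {"date": ["2020-07-22"], "amount": ["$45.00", "$3"]}
```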
4. The method of claim 1, wherein performing data cleansing on the entity information in the target second data set to obtain target data comprises:
Step b1, performing data cleaning on a plurality of entity information in the target second data set according to a preset rule, wherein the preset rule at least comprises: filtering unreasonable data in the plurality of entity information, and performing format standardization on the plurality of entity information;
Step b2, randomly extracting one or more entity information from the cleaned entity information, and calculating the qualification rate of the cleaning result;
Step b3, comparing the qualification rate with a fourth threshold value;
and if the qualification rate is smaller than the fourth threshold value, adding the fourth data that failed cleaning to the plurality of entity information, cycling the operations of steps b1-b3 until the qualification rate is greater than or equal to the fourth threshold value, and outputting the plurality of entity information.
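Steps b1-b3 can be sketched for a single entity type. The "age" semantics, the 0-130 plausibility range, and the threshold value below are illustrative assumptions.

```python
# Sketch of steps b1-b3 for one entity type. The 'age' semantics, the
# 0-130 plausibility range, and the threshold are illustrative assumptions.

def clean_entity(raw_value):
    """b1: standardize the format and filter unreasonable data."""
    try:
        age = int(raw_value.strip())   # format standardization
    except ValueError:
        return None                    # non-numeric: unreasonable
    return age if 0 <= age <= 130 else None   # plausibility filter

raw = [" 42", "17", "abc", "999", "60 "]
cleaned = [clean_entity(v) for v in raw]
qualification_rate = sum(c is not None for c in cleaned) / len(raw)   # b2
# cleaned == [42, 17, None, None, 60]; qualification_rate == 0.6
fourth_threshold = 0.9
needs_recleaning = qualification_rate < fourth_threshold   # b3: loop if True
```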
5. The method of claim 1, wherein after performing data cleansing on the entity information in the target second data set to obtain target data, the method further comprises:
determining a second key feature in the target data;
searching third key features associated with the second key features in the target data;
searching a fourth key feature having a logical relationship with the third key feature based on a knowledge graph, wherein the fourth key feature is structured data;
and supplementing the fourth key feature to the target data.
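The supplementation of claim 5 can be sketched with the knowledge graph modeled as a dictionary of relation edges followed for two hops; every feature name and relation below is a purely illustrative assumption.

```python
# Sketch of claim 5: follow two relation hops in a toy knowledge graph to
# find a fourth key feature; all names and relations are assumptions.
knowledge_graph = {
    ("Beijing", "is_capital_of"): "China",
    ("China", "currency"): "CNY",
}

def supplement(target_data, second_key, relation_a, relation_b):
    # third key feature: associated with the second key feature's value
    third = knowledge_graph.get((target_data.get(second_key), relation_a))
    # fourth key feature: in a logical relationship with the third
    fourth = knowledge_graph.get((third, relation_b))
    if fourth is not None:
        target_data["supplement"] = fourth   # add it to the target data
    return target_data

record = supplement({"city": "Beijing"}, "city", "is_capital_of", "currency")
# record == {"city": "Beijing", "supplement": "CNY"}
```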
6. The method of claim 5, wherein after supplementing the fourth key feature to the target data, the method further comprises:
Periodically updating the target data;
and associating the fourth key feature with the second key feature in the updated target data.
7. A data processing apparatus, comprising:
the first classification module is used for carrying out primary domain classification on the data to be processed according to first key features contained in the data to be processed to obtain a plurality of groups of first data sets, wherein the data to be processed is unstructured data;
The second classification module is used for converting the target first data set into a feature vector, and carrying out secondary domain classification on the target first data set based on a classification model according to the feature vector to obtain a plurality of groups of second data sets, wherein the target first data set is any one group of data sets in the plurality of groups of first data sets;
The extraction module is used for extracting entity information in a target second data set based on an information extraction model, wherein the target second data set is any one of the plurality of groups of second data sets;
The cleaning module is used for carrying out data cleaning on the entity information in the target second data set to obtain target data, wherein the target data is structured data;
The importing module is used for importing the target data into a target structured database;
The second classification module is used for: step a1, performing digital processing on a target first data set by adopting a classification model to obtain a feature vector corresponding to the target first data set; step a2, classifying the target first data set once according to the feature vector to obtain a plurality of groups of second data sets; step a3, randomly extracting one or more groups of second data sets from the plurality of groups of second data sets, labeling one or more secondary category labels for the one or more groups of second data sets, and taking the one or more groups of second data sets as a second verification set; step a4, calculating a second precision rate P2 and a second recall rate R2 of the classification model based on the actual class labels of the one or more groups of second data sets and the second verification set; step a5, calculating the accuracy F2 of the classification model through the following formula: F2 = 2 × P2 × R2 / (P2 + R2); step a6, comparing F2 with a second threshold value; and if F2 is smaller than the second threshold value, adding the second data misclassified by the classification model to the plurality of groups of second data sets, cycling the operations of steps a1-a6 until F2 is greater than or equal to the second threshold value, and outputting the plurality of groups of second data sets.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 6 when executing the computer program.
9. A computer storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010712476.3A CN112035449B (en) | 2020-07-22 | 2020-07-22 | Data processing method and device, computer equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112035449A CN112035449A (en) | 2020-12-04 |
CN112035449B (en) | 2024-06-14
Family
ID=73582469
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010712476.3A Active CN112035449B (en) | 2020-07-22 | 2020-07-22 | Data processing method and device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112035449B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112560460B (en) * | 2020-12-08 | 2022-02-25 | 北京百度网讯科技有限公司 | Method, apparatus, electronic device and readable storage medium for extracting structured information |
CN113468340B (en) * | 2021-06-28 | 2024-05-07 | 北京众标智能科技有限公司 | Construction system and construction method of industrial knowledge graph |
CN113469374B (en) * | 2021-09-02 | 2021-12-24 | 北京易真学思教育科技有限公司 | Data prediction method, device, equipment and medium |
CN115422351A (en) * | 2022-07-29 | 2022-12-02 | 科大讯飞(苏州)科技有限公司 | Event label detection method based on element knowledge and computer |
CN118585751B (en) * | 2024-06-11 | 2025-02-28 | 保定华仿科技股份有限公司 | A method and device for automatically marking fault data |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102214208A (en) * | 2011-04-27 | 2011-10-12 | 百度在线网络技术(北京)有限公司 | Method and equipment for generating structured information entity based on non-structured text |
CN110287785A (en) * | 2019-05-20 | 2019-09-27 | 深圳壹账通智能科技有限公司 | Text structure information extracting method, server and storage medium |
CN111143536A (en) * | 2019-12-30 | 2020-05-12 | 腾讯科技(深圳)有限公司 | Information extraction method based on artificial intelligence, storage medium and related device |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7035873B2 (en) * | 2001-08-20 | 2006-04-25 | Microsoft Corporation | System and methods for providing adaptive media property classification |
CN108197163B (en) * | 2017-12-14 | 2021-08-10 | 上海银江智慧智能化技术有限公司 | Structured processing method based on referee document |
CN108304911B (en) * | 2018-01-09 | 2020-03-13 | 中国科学院自动化研究所 | Knowledge extraction method, system and equipment based on memory neural network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112035449B (en) | Data processing method and device, computer equipment and storage medium | |
CN112148889B (en) | Recommendation list generation method and device | |
CN114930318B (en) | Classifying data using aggregated information from multiple classification modules | |
CN110851598B (en) | Text classification method and device, terminal equipment and storage medium | |
CN109255564B (en) | Pick-up point address recommendation method and device | |
CN113033198B (en) | Similar text pushing method and device, electronic equipment and computer storage medium | |
CN109558541B (en) | Information processing method and device and computer storage medium | |
CN115002200A (en) | User portrait based message pushing method, device, equipment and storage medium | |
KR20190114166A (en) | Industrial classifying system and method using autoencoder | |
CN111538903B (en) | Method and device for determining search recommended word, electronic equipment and computer readable medium | |
CN113360768A (en) | Product recommendation method, device and equipment based on user portrait and storage medium | |
CN113268615A (en) | Resource label generation method and device, electronic equipment and storage medium | |
CN104239553A (en) | Entity recognition method based on Map-Reduce framework | |
CN112632213A (en) | Address information standardization method and device, electronic equipment and storage medium | |
CN102428467A (en) | Similarity-Based Feature Set Supplementation For Classification | |
CN112734569A (en) | Stock risk prediction method and system based on user portrait and knowledge graph | |
CN112633988A (en) | User product recommendation method and device, electronic equipment and readable storage medium | |
CN113742498A (en) | Method for constructing and updating knowledge graph | |
CN114997322A (en) | Semi-supervised model-based classification prediction method, equipment and storage medium | |
CN113487103A (en) | Model updating method, device, equipment and storage medium | |
CN112883730A (en) | Similar text matching method and device, electronic equipment and storage medium | |
CN114860667B (en) | File classification method, device, electronic equipment and computer readable storage medium | |
CN110069558A (en) | Data analysing method and terminal device based on deep learning | |
WO2022234526A1 (en) | Natural language processing and machine-learning for event impact analysis | |
CN112560433B (en) | Information processing method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||