
CN119597834B - Unstructured data automatic processing method and system based on deep learning - Google Patents


Info

Publication number
CN119597834B
CN119597834B (application CN202510131702.1A)
Authority
CN
China
Prior art keywords
data
feature
features
constructing
rule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202510131702.1A
Other languages
Chinese (zh)
Other versions
CN119597834A (en)
Inventor
高经郡
高海玲
连文龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kejie Technology Co ltd
Original Assignee
Beijing Kejie Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kejie Technology Co ltd
Priority to CN202510131702.1A
Publication of CN119597834A
Application granted
Publication of CN119597834B
Legal status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/258Data format conversion from or to a database
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/10Pre-processing; Data cleansing
    • G06F18/15Statistical pre-processing, e.g. techniques for normalisation or restoring missing data
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features
    • G06F18/2113Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/29Graphical models, e.g. Bayesian networks
    • G06F18/295Markov models or related models, e.g. semi-Markov models; Markov random fields; Networks embedding Markov models
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/042Knowledge-based neural networks; Logical representations of neural networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition
    • G06N5/025Extracting rules from data
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • G06N5/042Backward inferencing
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • G06N5/046Forward inferencing; Production systems
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Animal Behavior & Ethology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a deep learning-based method and system for automatically processing unstructured data, relating to the technical field of data processing. The method comprises: acquiring unstructured data through an intelligent data collector, establishing a data feature fingerprint library, and constructing a data quality assessment model with an incremental learning algorithm to standardize the data; extracting dominant features and implicit features using a multi-layer cascade network structure and a self-attention mechanism, and constructing a feature association graph and a bidirectional reasoning channel; and performing rule matching and causal reasoning based on a domain knowledge base, finally realizing automatic structuring of the unstructured data. The invention improves the automation and accuracy of data processing, reduces manual intervention costs, and offers strong universality and extensibility.

Description

Unstructured data automatic processing method and system based on deep learning
Technical Field
The invention relates to data processing technology, and in particular to a deep learning-based method and system for automatically processing unstructured data.
Background
With the rapid development of information technology, the generation and storage of unstructured data have grown explosively, and unstructured data in the form of text, images, audio, and video is widely used across industries. However, unstructured data lacks a fixed storage structure; its formats are diverse and its content is complex, so traditional data processing methods struggle to classify, label, and exploit it efficiently. In scenarios where data volumes are huge and updates are frequent, rapidly mining and utilizing the value of the data faces great challenges.
The prior art relies on manual or semi-automatic modes to process unstructured data, but such methods are generally inefficient and costly, and automated, precise processing is particularly difficult in the links of data quality evaluation and deep feature extraction. Meanwhile, an effective model architecture for comprehensively mining the implicit features of data (such as association rules and variation trends) together with its explicit features (such as time series and numerical distributions) is still lacking. In addition, when constructing inference rules from feature association relationships, current techniques cannot fully exploit domain knowledge and causal inference methods, so the effects of data optimization and structural conversion are limited.
In order to solve the above problems, there is an urgent need for a deep learning-based method that automates the full unstructured data processing pipeline, from data acquisition and classification labeling to deep feature mining, rule reasoning, and optimization processing, improving both the efficiency and the quality of data processing.
Disclosure of Invention
The embodiment of the invention provides an unstructured data automatic processing method and system based on deep learning, which can solve the problems in the prior art.
In a first aspect of an embodiment of the present invention,
The method for automatically processing unstructured data based on deep learning comprises the following steps:
The method comprises the steps of obtaining unstructured data by using an intelligent data collector, constructing a data characteristic fingerprint library, classifying and labeling the unstructured data based on the data characteristic fingerprint library to obtain classified and labeled data, constructing a data quality assessment model by adopting an incremental learning algorithm, scoring the classified and labeled data from three dimensions of data integrity, timeliness and value density to obtain scoring results, constructing a priority processing queue according to the scoring results, and performing data cleaning and normalization processing on the classified and labeled data in the priority processing queue to obtain a standardized training data set;
Deep learning is carried out on the standardized training data set by adopting a multi-layer cascade network structure, the multi-layer cascade network structure identifies target features in the standardized training data set according to category information of the classification marking data and through a self-attention mechanism, a feature association graph is built on the basis of the target features, a bidirectional reasoning channel is built on the feature association graph, the bidirectional reasoning channel comprises a forward channel and a reverse channel, wherein the forward channel extracts numerical distribution features, time sequence features and attribute features in the standardized training data set through a convolutional neural network to obtain dominant feature data, the reverse channel extracts data item association rules, data change trends and data interaction modes in the standardized training data set through the convolutional neural network to obtain implicit feature data, and a dynamic weight distribution strategy is adopted to carry out feature combination on the dominant feature data and the implicit feature data to obtain a multidimensional feature tensor;
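As an illustrative sketch (not part of the claims), the dynamic weight distribution strategy for combining dominant and implicit feature data can be realized as a softmax gate over the two branches; the patent does not specify the gating form, so scoring each branch by its mean activation magnitude is an assumption:

```python
import numpy as np

def fuse_features(dominant, implicit):
    """Combine dominant and implicit feature vectors with dynamic weights.

    The gate scores each branch by its mean absolute activation and
    normalises the scores with a softmax, so the stronger branch
    contributes more to the fused tensor. The scoring rule is an
    illustrative assumption, not taken from the patent.
    """
    scores = np.array([np.abs(dominant).mean(), np.abs(implicit).mean()])
    exp = np.exp(scores - scores.max())          # numerically stable softmax
    w = exp / exp.sum()                          # dynamic weights, sum to 1
    return w[0] * dominant + w[1] * implicit     # fused feature tensor

fused = fuse_features(np.ones(4), np.zeros(4))   # dominant branch dominates
```

A real implementation would learn the gate jointly with the cascade network; this sketch only shows the weighting mechanics.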
And carrying out rule matching on the multidimensional feature tensor based on a preset domain knowledge base to obtain an inference rule set, carrying out layering processing on the multidimensional feature tensor by adopting the inference rule set to obtain initial structured data, analyzing the initial structured data by a causal inference method to obtain a decision basis chain, generating optimization parameters based on the decision basis chain, and carrying out optimization processing on the initial structured data according to the optimization parameters and the category information of the classification marking data to obtain final structured data.
In an alternative embodiment of the present invention,
The method for obtaining the unstructured data by using the intelligent data collector, constructing a data characteristic fingerprint database, and carrying out classification labeling on the unstructured data based on the data characteristic fingerprint database to obtain classification labeling data comprises the following steps:
The method comprises the steps that unstructured data are obtained through an intelligent data collector, the intelligent data collector comprises a local feature channel and a global feature channel, word-level feature extraction and sentence-level feature extraction are carried out on the local feature channel through a layered attention network to obtain local association features, a data relationship graph is built through a graph neural network through the global feature channel, global relationship features are extracted through message transmission, and attention fusion is carried out on the local association features and the global relationship features to obtain fusion feature data;
The feature importance weight is obtained by carrying out multi-head attention calculation on the fusion feature data, the fusion feature data is screened according to the feature importance weight, the screened features are mapped by adopting a dynamic local sensitive hash algorithm to generate feature fingerprints, and the feature fingerprints are constructed into a data feature fingerprint library supporting neighbor query;
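As an illustrative sketch of the locality-sensitive hashing step (the patent's "dynamic" variant is not specified, so a plain random-hyperplane scheme is assumed), each screened feature vector maps to a binary fingerprint whose bits are stable under scaling and similar for nearby vectors, which is the property the fingerprint library exploits for neighbor queries:

```python
import numpy as np

def lsh_fingerprint(vec, planes):
    """Map a feature vector to a binary fingerprint via random hyperplanes.

    Each bit records which side of one hyperplane the vector falls on;
    similar vectors tend to share fingerprints, enabling neighbor query.
    """
    return tuple((planes @ vec > 0).astype(int))

rng = np.random.default_rng(0)
planes = rng.standard_normal((16, 8))     # 16 hash bits for 8-d features
v = rng.standard_normal(8)
fp = lsh_fingerprint(v, planes)
fp_scaled = lsh_fingerprint(2 * v, planes)  # scaling preserves every bit
```

A fingerprint library would bucket vectors by fingerprint (or by bands of bits) so that neighbor queries only compare candidates in the same bucket.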
Performing data enhancement on the fusion characteristic data to obtain an enhancement characteristic set, calculating the contrast loss of a sample pair based on the enhancement characteristic set to obtain a semantic similarity matrix, constructing a data association graph by utilizing the neighbor query result of the data characteristic fingerprint library and the semantic similarity matrix, obtaining an initial classification labeling result on the data association graph through graph attention propagation known labels, and simultaneously extracting entity relations according to the data association graph to construct a domain knowledge graph;
constructing a meta learner, taking the feature vector and the confidence coefficient of the initial classification labeling result as input, carrying out evaluation modeling on the classification result of each sample to obtain classification reliability scores, marking the samples which are lower than a preset reliability threshold and verified by the data feature fingerprint library as first type data to be rechecked, carrying out logic verification on the initial classification labeling result by utilizing the domain knowledge graph, and marking the samples which are not verified as second type data to be rechecked;
Modeling the selection process of the first type data to be rechecked and the second type data to be rechecked as a Markov decision process, constructing a hybrid rewarding function according to the classification reliability score and the sample information gain, and selecting an optimal sample to be rechecked to perform manual labeling to obtain a manual labeling result through a deep reinforcement learning optimization sample selection strategy;
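The hybrid rewarding function above combines the classification reliability score with the sample information gain; a minimal sketch, assuming a simple convex combination with a balance coefficient `alpha` (the exact form and coefficient are assumptions, not specified by the patent):

```python
def hybrid_reward(reliability_score, info_gain, alpha=0.5):
    """Hybrid reward for the sample-selection Markov decision process.

    Rewards selecting samples the classifier is unsure about (low
    reliability score) and that are informative (high information gain);
    both inputs are assumed to lie in [0, 1].
    """
    return alpha * (1.0 - reliability_score) + (1.0 - alpha) * info_gain
```

The deep reinforcement learning policy would then be trained to pick, at each step, the to-be-rechecked sample maximizing the expected cumulative reward.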
And carrying out weighted fusion on the manual labeling result and the initial classification labeling result, and carrying out consistency verification on the weighted and fused labeling result by utilizing the similarity index of the data characteristic fingerprint library to generate classification labeling data comprising the fusion characteristic data, the semantic similarity matrix, the classification reliability score and the domain knowledge graph verification result.
In an alternative embodiment of the present invention,
The method for constructing the data quality assessment model by adopting an incremental learning algorithm, wherein the data quality assessment model scores the classified marking data from three dimensions of data integrity, timeliness and value density to obtain a scoring result, a priority processing queue is constructed according to the scoring result, and the data cleaning and normalization processing of the classified marking data in the priority processing queue to obtain a standardized training data set comprises the following steps:
Acquiring time stamp information of the classified marking data, calculating to obtain a data time attenuation coefficient by adopting an exponential decay function, and carrying out weighting treatment on the classified marking data based on the data time attenuation coefficient to obtain sample weight;
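The exponential decay function in this step can be sketched as follows; the half-life constant is an assumed example value, since the patent only requires that the time attenuation coefficient decay exponentially with sample age:

```python
import math

def time_decay_weight(age_days, half_life_days=30.0):
    """Data time attenuation coefficient for a classified, labeled sample.

    w = exp(-lambda * age), with lambda chosen so the weight halves every
    `half_life_days`: fresh data keeps weight near 1, stale data decays
    toward 0. The 30-day half-life is an illustrative assumption.
    """
    lam = math.log(2) / half_life_days
    return math.exp(-lam * age_days)
```

The resulting coefficient multiplies into the sample weight used by the quality assessment model's loss function.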
Constructing a data quality evaluation model, constructing a loss function by taking the sample weight as a weight item, optimizing the loss function by adopting an online random gradient descent algorithm until the model converges to obtain a trained data quality evaluation model, and performing quality evaluation on the classification label data by using the trained data quality evaluation model;
Calculating an effective feature ratio based on a quality evaluation result to obtain a data integrity score, obtaining a data timeliness score according to the data time attenuation coefficient, calculating a data value density score by combining the sample weight, and carrying out weighted fusion on the data integrity score, the data timeliness score and the data value density score to obtain a data quality score;
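The weighted fusion of the three per-dimension scores can be sketched as below; the weight split is an illustrative assumption, as the patent only requires a weighted combination of integrity, timeliness, and value density:

```python
def quality_score(integrity, timeliness, value_density,
                  weights=(0.4, 0.3, 0.3)):
    """Weighted fusion of the three dimension scores (each in [0, 1]).

    The (0.4, 0.3, 0.3) split is an assumed example; in practice the
    weights would be tuned for the deployment scenario.
    """
    w1, w2, w3 = weights
    return w1 * integrity + w2 * timeliness + w3 * value_density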
Constructing a priority processing queue according to the data quality scores, marking the classification marking data with the data quality scores lower than a first preset threshold value as data to be rejected, marking the classification marking data with the data quality scores higher than a second preset threshold value as priority processing data, calculating the data processing priority based on the data quality scores and the sample weights, and generating a priority data queue with processing marks;
According to the priority order of the priority data queue, field names and data formats of the classified marked data are subjected to unified conversion to obtain standardized characteristics, a time sequence completion model is constructed based on the timeliness score and the data integrity score of the data, and the missing fields are completed by the time sequence completion model to obtain structural characteristics;
constructing a characteristic fingerprint according to the structural characteristics and the data value density scores, identifying repeated data based on the characteristic fingerprint, and carrying out anomaly detection and denoising processing in combination with the data quality scores to generate quality processing records and cleaned characteristic data;
And respectively carrying out standardized conversion on numerical value features, category features and text features in the cleaned feature data, guiding word vector mapping of the text features by utilizing the feature fingerprints, and combining the converted standardized features with the data quality scores, the data processing priorities and the quality processing records to generate a standardized training data set.
In an alternative embodiment of the present invention,
Deep learning processing is carried out on the standardized training data set by adopting a multi-layer cascade network structure, and the multi-layer cascade network structure identifies target features in the standardized training data set through a self-attention mechanism according to the category information of the classification annotation data, wherein the method comprises the following steps:
Constructing a multi-layer cascade network structure, wherein the multi-layer cascade network structure comprises a feature extraction layer, an attention layer and a feature fusion layer, the feature extraction layer is composed of a plurality of convolution blocks, the attention layer adopts a self-attention mechanism, and the feature fusion layer is used for feature dimension reduction and combination;
Extracting category information from the category labeling data in the standardized training data set, constructing a category embedding matrix, and calculating based on the category embedding matrix and the standardized training data set to obtain a category mapping matrix;
Processing the standardized training data set by utilizing a feature extraction layer of a multi-layer cascade network structure, performing product operation on the category mapping matrix and the extracted features to obtain category perception features, and splicing the category perception features extracted in different layers in a channel dimension to obtain multi-scale cascade features;
In the attention layer, respectively generating a query matrix, a key matrix and a value matrix based on the multi-scale cascade characteristic and the category embedding matrix, calculating the similarity of the query matrix and the key matrix through a self-attention mechanism to obtain attention weight, and multiplying the attention weight and the value matrix to obtain attention characteristic;
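The attention-layer computation above (query/key similarity, softmax weights, product with the value matrix) can be sketched as single-head scaled dot-product self-attention; the projection matrices here are random stand-ins for the learned parameters:

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head self-attention over a sequence of feature vectors.

    Query, key and value matrices are linear projections of the input;
    attention weights are the row-softmax of the scaled query-key
    similarity, and the output is their product with the value matrix.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])       # scaled similarity
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(1)
x = rng.standard_normal((5, 8))                   # 5 positions, 8-d features
wq, wk, wv = (rng.standard_normal((8, 8)) for _ in range(3))
out, attn = self_attention(x, wq, wk, wv)
```

In the described architecture the query/key/value would be generated from the multi-scale cascade feature and the category embedding matrix rather than from raw inputs.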
in a feature fusion layer, carrying out residual connection and normalization processing on the attention features to obtain normalized features, constructing gating weights based on the normalized features and a category embedding matrix, and weighting the gating weights and the normalized features to obtain enhanced features;
And transmitting the enhancement features of each layer by layer in a multi-layer cascade network structure, and carrying out feature cascade, so as to obtain target features in the standardized training data set through cross-layer feature aggregation.
In an alternative embodiment of the present invention,
The step of carrying out rule matching on the multidimensional feature tensor based on a preset domain knowledge base to obtain an inference rule set comprises the following steps:
Performing dimension reduction processing on the multidimensional feature tensor by utilizing a feature dimension reduction network to obtain a feature sequence, constructing a feature index tree on the feature sequence, setting a sliding window with a dynamic size based on the feature index tree, adaptively adjusting the size of the sliding window according to the feature sequence, and performing sliding scanning on the feature sequence to extract a feature mode;
inputting the characteristic pattern into a pre-training text encoder to obtain a characteristic coding vector, and inputting a rule pattern in a preset domain knowledge base into the pre-training text encoder to obtain a rule coding vector;
Calculating the matching degree between the characteristic coding vector and each rule coding vector, calculating the mean value and standard deviation of the matching degrees, determining an adaptive threshold coefficient according to a preset confidence interval, weighting the mean value and the standard deviation based on the adaptive threshold coefficient to obtain an adaptive threshold, screening rule patterns with a matching degree higher than the adaptive threshold, and forming a candidate rule set from the screened rule patterns and the corresponding structured templates;
And carrying out dependency analysis on the rules in the candidate rule set, establishing a dependency relationship between the rules based on the input characteristics and the output characteristics of the rules, expressing the dependency relationship as a directed graph, expressing the rules by nodes, expressing the dependency relationship by directed edges, calculating the ingress and egress of each rule node, determining the dependency level of the rules, and adopting an improved topological sorting algorithm to distribute the rules with direct dependency relationship to adjacent levels to obtain an inference rule set.
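The level assignment described above can be sketched with plain Kahn's topological sort (the patent's "improved" variant is unspecified): a rule's level is one more than its deepest prerequisite, so directly dependent rules land on adjacent levels.

```python
from collections import defaultdict, deque

def rule_levels(rules, deps):
    """Assign rules to dependency levels via Kahn's topological sort.

    `deps` maps each rule to the rules it depends on (its parents in the
    directed dependency graph). Returns per-rule levels and a valid
    processing order. Illustrative sketch only.
    """
    indeg = {r: len(deps.get(r, ())) for r in rules}
    children = defaultdict(list)
    for r, parents in deps.items():
        for p in parents:
            children[p].append(r)
    level = {r: 0 for r in rules}
    queue = deque(r for r in rules if indeg[r] == 0)
    order = []
    while queue:
        r = queue.popleft()
        order.append(r)
        for c in children[r]:
            level[c] = max(level[c], level[r] + 1)  # one below deepest parent
            indeg[c] -= 1
            if indeg[c] == 0:
                queue.append(c)
    return level, order

levels, order = rule_levels(
    ["A", "B", "C", "D"],
    {"B": ["A"], "C": ["A"], "D": ["B", "C"]},
)
```

Here rule D depends on B and C, which both depend on A, so the levels come out 0, 1, 1, 2.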
In an alternative embodiment of the present invention,
Layering the multidimensional feature tensor by adopting the reasoning rule set, and obtaining initial structured data comprises the following steps:
Extracting feature vectors of rule related areas from a feature sequence for rules of each rule layer in an inference rule set, inputting the feature vectors into a rule matching network, wherein the rule matching network comprises a feature conversion layer and a matching degree calculation layer, outputting rule applicability scores, and normalizing the rule applicability scores to serve as rule confidence degrees;
the rules in each rule layer are ordered in a descending order based on the rule confidence level to obtain an ordering rule set, the rules are sequentially applied to corresponding structured templates according to the ordering order, and structured fragments are generated through feature mapping, wherein the structured templates define the mapping relation between features and structured data;
detecting an action area of the rules in the ordering rule set, extracting the range of the overlapped area when the action areas of the rules are overlapped, calculating the coverage of each rule in the overlapped area, taking the weighted sum of the confidence and the coverage of the rules as the priority of the rules, and selecting the rule with the highest priority to be applied to the overlapped area so as to ensure that a unique structuring fragment is generated;
Constructing a multi-layer verification network comprising an attribute verification layer, a relation verification layer and a value domain verification layer, verifying the integrity of the structured fragments based on an attribute graph at the attribute verification layer, verifying the relationship rationality among the structured fragments based on a knowledge graph at the relation verification layer, verifying the validity of attribute values in the structured fragments based on a statistical model at the value domain verification layer, and integrating verification results of all layers to obtain verified structured fragments;
extracting semantic features in the verified structured fragments, calculating semantic similarity among the fragments, constructing a hierarchical organization structure of the fragments based on the semantic similarity, and integrating the semantically related structured fragments according to the hierarchical organization structure to generate initial structured data conforming to a preset specification.
In an alternative embodiment of the present invention,
Analyzing the initial structured data through a causal reasoning method to obtain a decision basis chain, generating optimization parameters based on the decision basis chain, and optimizing the initial structured data according to the optimization parameters and the category information of the classification annotation data to obtain final structured data, wherein the step of optimizing the initial structured data comprises the following steps:
calculating partial correlation coefficients of variable pairs in the initial structured data, performing conditional independence tests based on the partial correlation coefficients to obtain the conditional independence relations among the variables, and constructing a causal graph according to the conditional independence relations;
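The partial-correlation test in this step can be sketched by regressing both variables on the conditioning variable and correlating the residuals: a value near zero suggests the pair is conditionally independent given the controls, so no causal edge is drawn between them. The synthetic data below is illustrative only.

```python
import numpy as np

def partial_corr(x, y, z):
    """Partial correlation of x and y controlling for z.

    Regress x and y on z (with intercept) via least squares, then
    correlate the residuals; near-zero output signals x independent of
    y given z, the conditional-independence criterion for the causal graph.
    """
    z1 = np.column_stack([np.ones_like(z), z])
    rx = x - z1 @ np.linalg.lstsq(z1, x, rcond=None)[0]
    ry = y - z1 @ np.linalg.lstsq(z1, y, rcond=None)[0]
    return float(np.corrcoef(rx, ry)[0, 1])

rng = np.random.default_rng(2)
z = rng.standard_normal(500)
x = 2 * z + 0.01 * rng.standard_normal(500)   # x driven by z
y = -3 * z + 0.01 * rng.standard_normal(500)  # y driven by z
raw = float(np.corrcoef(x, y)[0, 1])          # spuriously strong
pcorr = partial_corr(x, y, z)                 # near zero given z
```

In a full implementation the test would use Fisher's z-transform to turn each partial correlation into a significance decision at the chosen confidence level.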
Constructing a linear structural equation model based on the causal graph, and optimizing parameters of the linear structural equation model through maximum likelihood estimation to obtain a structural causal model, wherein the structural causal model is used for quantifying causal effect intensity among variables in the causal graph;
calculating the number of associated edges of each node in the causal graph, determining the node with the number of the associated edges exceeding a preset threshold value of the number of the associated edges as a decision node, tracing the associated nodes step by step along the causal edges in the causal graph from the decision node to obtain a factor chain, calculating the accumulated causal effect of the factor chain based on the structural causal model, and determining the factor chain with the accumulated causal effect exceeding the preset effect threshold value as a candidate decision basis chain;
In the structural causal model, changing the value of each node in the candidate decision basis chain and substituting the value into the structural causal model to obtain a corresponding downstream node predicted value after the node value is changed, determining the stability of the candidate decision basis chain based on the change amplitude of the predicted value, and determining the candidate decision basis chain with the stability meeting a preset condition as a decision basis chain;
Calculating the influence degree of each node in the decision basis chain on a target variable based on the structural causal model, screening out nodes with influence degree exceeding a preset influence threshold according to the influence degree as a feature set, taking the influence degree of each node in the feature set as a feature importance score, and combining the feature importance score with class distinction of classification marking data to obtain a feature optimization parameter;
Weighting and adjusting the value of the corresponding feature in the initial structured data based on the feature optimization parameters to obtain optimized data, wherein the feature optimization parameters are used for determining weight coefficients of feature adjustment;
calculating feature distribution differences between the optimized data and the initial structured data to obtain an optimization effect index, checking the optimized data against the causal relation constraints of the structural causal model to obtain a verification result, performing iterative optimization on the optimized data according to the optimization effect index and the verification result, and determining the current optimized data as the final structured data when the optimization effect index reaches a preset target and the verification result satisfies causal consistency.
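As an illustration of the first two steps above (partial correlation and conditional-independence testing for causal-graph construction), the following sketch uses the Fisher z-transform on a toy three-variable chain; the critical value, sample size, and variable layout are illustrative assumptions rather than part of the claimed method:

```python
import math
import numpy as np

def partial_corr(data, i, j, cond):
    """Partial correlation of columns i and j given the columns listed in cond,
    obtained by regressing the conditioning set out of both variables."""
    if not cond:
        return np.corrcoef(data[:, i], data[:, j])[0, 1]
    Z = data[:, cond]
    beta_i, *_ = np.linalg.lstsq(Z, data[:, i], rcond=None)
    beta_j, *_ = np.linalg.lstsq(Z, data[:, j], rcond=None)
    res_i = data[:, i] - Z @ beta_i   # residual of i after removing cond
    res_j = data[:, j] - Z @ beta_j   # residual of j after removing cond
    return np.corrcoef(res_i, res_j)[0, 1]

def independent(data, i, j, cond, crit=2.58):
    """Fisher z-test: |z| below the critical value suggests conditional independence."""
    n = data.shape[0]
    r = partial_corr(data, i, j, cond)
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - len(cond) - 3)
    return abs(z) < crit

# Toy chain x -> y -> w: x and w are dependent marginally, independent given y
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = 2 * x + rng.normal(size=5000)
w = -y + rng.normal(size=5000)
data = np.column_stack([x, y, w])
print(round(partial_corr(data, 0, 2, []), 1))   # strong negative marginal correlation
print(round(partial_corr(data, 0, 2, [1]), 1))  # near zero once y is controlled for
```

Edges of the causal graph would then be kept only between variable pairs for which no conditioning set renders them independent, as in standard constraint-based structure learning.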
In a second aspect of an embodiment of the present invention,
Provided is an unstructured data automatic processing system based on deep learning, comprising:
The first unit is used for acquiring unstructured data by utilizing an intelligent data acquisition device, constructing a data characteristic fingerprint library, classifying and labeling the unstructured data based on the data characteristic fingerprint library to obtain classified and labeled data, constructing a data quality assessment model by adopting an incremental learning algorithm, scoring the classified and labeled data from three dimensions of data integrity, timeliness and value density to obtain a scoring result, constructing a priority processing queue according to the scoring result, and performing data cleaning and normalization processing on the classified and labeled data in the priority processing queue to obtain a standardized training data set;
The second unit is used for performing deep learning processing on the standardized training data set by adopting a multi-layer cascade network structure, identifying target features in the standardized training data set through a self-attention mechanism according to the category information of the classification marking data, constructing a feature association diagram based on the target features, constructing a bidirectional reasoning channel on the feature association diagram, wherein the bidirectional reasoning channel comprises a forward channel and a reverse channel, the forward channel extracts numerical distribution features, time sequence features and attribute features in the standardized training data set through a convolutional neural network to obtain dominant feature data, the reverse channel extracts data item association rules, data change trends and data interaction modes in the standardized training data set through the convolutional neural network to obtain implicit feature data, and performing feature combination on the dominant feature data and the implicit feature data by adopting a dynamic weight allocation strategy to obtain multidimensional feature tensor;
and the third unit is used for carrying out rule matching on the multidimensional feature tensor based on a preset domain knowledge base to obtain an inference rule set, carrying out layering processing on the multidimensional feature tensor by adopting the inference rule set to obtain initial structured data, analyzing the initial structured data by a causal inference method to obtain a decision basis chain, generating optimization parameters based on the decision basis chain, and carrying out optimization processing on the initial structured data according to the optimization parameters and the category information of the classification marking data to obtain final structured data.
In a third aspect of an embodiment of the present invention,
There is provided an electronic device including:
A processor;
a memory for storing processor-executable instructions;
Wherein the processor is configured to invoke the instructions stored in the memory to perform the method described previously.
In a fourth aspect of an embodiment of the present invention,
There is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method as described above.
In the embodiment, the data quality assessment model is established by constructing the data characteristic fingerprint library and combining an incremental learning algorithm, the data is scored from three dimensions of data integrity, timeliness and value density, and the priority processing queue is established, so that the efficiency and accuracy of data processing can be effectively improved, important data is ensured to be processed preferentially, and meanwhile, the quality of training data is ensured through data cleaning and standardization processing. The multi-layer cascade network structure is adopted to combine with a self-attention mechanism to carry out deep learning processing, dominant features and implicit features are respectively extracted through a bidirectional reasoning channel, and a dynamic weight distribution strategy is used to carry out feature combination, so that various feature information in data can be comprehensively captured, the completeness and accuracy of feature extraction are improved, and the understanding capability of a model on complex unstructured data is enhanced. Based on a preset domain knowledge base, rule matching and causal reasoning are carried out, the optimization process is guided by a decision basis chain, intelligent conversion from unstructured data to structured data is realized, the interpretability and reliability of data processing are improved, the processing result is more in line with the actual application requirement, and high-quality data support is provided for subsequent data analysis and decision.
Drawings
FIG. 1 is a schematic flow chart of an unstructured data automatic processing method based on deep learning according to an embodiment of the invention;
Fig. 2 is a schematic structural diagram of an unstructured data automatic processing system based on deep learning according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The technical scheme of the invention is described in detail below by specific examples. The following embodiments may be combined with each other, and some embodiments may not be repeated for the same or similar concepts or processes.
Fig. 1 is a flow chart of an unstructured data automatic processing method based on deep learning according to an embodiment of the present invention, as shown in fig. 1, the method includes:
S101, obtaining unstructured data by using an intelligent data collector, constructing a data characteristic fingerprint library, classifying and labeling the unstructured data based on the data characteristic fingerprint library to obtain classified and labeled data, constructing a data quality assessment model by adopting an incremental learning algorithm, scoring the classified and labeled data from three dimensions of data integrity, timeliness and value density to obtain scoring results, constructing a priority processing queue according to the scoring results, and performing data cleaning and normalization processing on the classified and labeled data in the priority processing queue to obtain a standardized training data set;
S102, deep learning is conducted on the standardized training data set by adopting a multi-layer cascade network structure, the multi-layer cascade network structure identifies target features in the standardized training data set according to category information of the classification marking data through a self-attention mechanism, a feature association diagram is built on the basis of the target features, a bidirectional reasoning channel is built on the feature association diagram, the bidirectional reasoning channel comprises a forward channel and a reverse channel, wherein the forward channel extracts numerical distribution features, time sequence features and attribute features in the standardized training data set through a convolutional neural network to obtain dominant feature data, the reverse channel extracts data item association rules, data change trends and data interaction modes in the standardized training data set through the convolutional neural network to obtain implicit feature data, and a dynamic weight distribution strategy is adopted to conduct feature combination on the dominant feature data and the implicit feature data to obtain multidimensional feature tensor;
S103, carrying out rule matching on the multidimensional feature tensor based on a preset domain knowledge base to obtain an inference rule set, carrying out layering processing on the multidimensional feature tensor by adopting the inference rule set to obtain initial structured data, analyzing the initial structured data by a causal inference method to obtain a decision basis chain, generating optimization parameters based on the decision basis chain, and carrying out optimization processing on the initial structured data according to the optimization parameters and the category information of the classification marking data to obtain final structured data.
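As a minimal sketch of the dynamic weight allocation strategy in step S102, the following combines dominant and implicit feature vectors through a softmax gate; the gating rule (mean activation) and the dimensions are assumptions for illustration, since the method does not fix a functional form:

```python
import numpy as np

def dynamic_weight_combine(dominant_feat, implicit_feat):
    """Combine dominant and implicit feature vectors with data-dependent
    weights: a softmax gate over the two branches (illustrative sketch)."""
    scores = np.array([dominant_feat.mean(), implicit_feat.mean()])
    exp_s = np.exp(scores - scores.max())
    w = exp_s / exp_s.sum()                       # dynamic weights, sum to 1
    fused = w[0] * dominant_feat + w[1] * implicit_feat
    # Stack both branches and the fused vector into a multidimensional tensor
    return np.stack([dominant_feat, implicit_feat, fused]), w

dominant = np.array([0.2, 0.8, 0.5, 0.1])   # explicit/dominant feature data
implicit = np.array([0.6, 0.1, 0.3, 0.9])   # implicit feature data
tensor, weights = dynamic_weight_combine(dominant, implicit)
print(tensor.shape)  # (3, 4)
```

Because the weights depend on the inputs themselves, the branch carrying stronger activations receives a larger share of the fused representation, which is the intent of a dynamic (rather than fixed) allocation strategy.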
The intelligent data collector refers to a technical tool or device capable of automatically acquiring, processing and storing unstructured data. By identifying and analyzing data from different sources, unstructured data such as text, images, audio, video and the like are collected, and meanwhile, the integrity and the accuracy of the data are ensured. A data feature fingerprint library is a database dedicated to storing and managing data features. The method is characterized in that a unique and identifiable data characteristic set is formed by extracting and fingerprinting multi-dimensional characteristics (such as shape characteristics, structural characteristics, statistical characteristics and the like) of data content, and the unique and identifiable data characteristic set is used for data classification, retrieval and labeling. The feature association graph is a data expression form based on graph structure and is used for describing association relations among features. The nodes represent the features, the edges represent the dependencies or interactions between the features, and the graph can better capture the internal structure of the data.
The bidirectional reasoning channel is a calculation framework combining forward reasoning and reverse reasoning, the forward channel utilizes a convolutional neural network to extract numerical distribution characteristics, time sequence characteristics and attribute characteristics, and the reverse channel captures complex association rules and variation trends among data items through the convolutional neural network. Dynamic weight allocation strategy is a technology for dynamically adjusting different characteristic weights according to task demands. And the comprehensive analysis capability of dominant and recessive features is improved by optimizing the feature combination process.
The multidimensional feature tensor is a tensor form used in deep learning to represent a high-dimensional data structure. The method integrates the explicit feature data and the implicit feature data, and can provide rich context information for subsequent reasoning and decision.
In an optional implementation manner, the steps of obtaining unstructured data by using an intelligent data collector, constructing a data feature fingerprint database, and performing classification labeling on the unstructured data based on the data feature fingerprint database to obtain classification labeling data include:
The method comprises the steps that unstructured data are obtained through an intelligent data collector, the intelligent data collector comprises a local feature channel and a global feature channel, word-level feature extraction and sentence-level feature extraction are carried out on the local feature channel through a layered attention network to obtain local association features, a data relationship graph is built through a graph neural network through the global feature channel, global relationship features are extracted through message transmission, and attention fusion is carried out on the local association features and the global relationship features to obtain fusion feature data;
The feature importance weight is obtained by carrying out multi-head attention calculation on the fusion feature data, the fusion feature data is screened according to the feature importance weight, the screened features are mapped by adopting a dynamic local sensitive hash algorithm to generate feature fingerprints, and the feature fingerprints are constructed into a data feature fingerprint library supporting neighbor query;
Performing data enhancement on the fusion characteristic data to obtain an enhancement characteristic set, calculating the contrast loss of a sample pair based on the enhancement characteristic set to obtain a semantic similarity matrix, constructing a data association graph by utilizing the neighbor query result of the data characteristic fingerprint library and the semantic similarity matrix, obtaining an initial classification labeling result by propagating known labels on the data association graph through graph attention, and simultaneously extracting entity relations according to the data association graph to construct a domain knowledge graph;
constructing a meta learner, taking the feature vector and the confidence coefficient of the initial classification labeling result as input, carrying out evaluation modeling on the classification result of each sample to obtain classification reliability scores, marking the samples which are lower than a preset reliability threshold and verified by the data feature fingerprint library as first type data to be rechecked, carrying out logic verification on the initial classification labeling result by utilizing the domain knowledge graph, and marking the samples which are not verified as second type data to be rechecked;
Modeling the selection process of the first type data to be rechecked and the second type data to be rechecked as a Markov decision process, constructing a hybrid rewarding function according to the classification reliability score and the sample information gain, and selecting an optimal sample to be rechecked to perform manual labeling to obtain a manual labeling result through a deep reinforcement learning optimization sample selection strategy;
And carrying out weighted fusion on the manual labeling result and the initial classification labeling result, and carrying out consistency verification on the weighted and fused labeling result by utilizing the similarity index of the data characteristic fingerprint library to generate classification labeling data comprising the fusion characteristic data, the semantic similarity matrix, the classification reliability score and the domain knowledge graph verification result.
For example, when unstructured data is acquired by an intelligent data collector, feature extraction is performed first through a local feature channel and a global feature channel. In the local feature channel, the hierarchical attention network includes word-level attention layers and sentence-level attention layers. The word level attention layer encodes input text by using a two-way long-short term memory network to obtain a word vector sequence, calculates the importance weight of each word by an attention mechanism, and weights to obtain word level feature representation. On the basis of word-level features, the sentence-level attention layer calculates the association degree between sentences, generates sentence-level feature representation, and finally obtains local association features.
In the global feature channel, the graph neural network firstly builds a graph structure based on the association relation between data, nodes represent data examples, and edges represent the relation between the examples. Through a message transmission mechanism, each node acquires information from the neighbor node and updates the representation of the node, and node characteristics containing global relation information are obtained after iteration for a plurality of rounds. And fusing the local associated features and the global relationship features through a multi-head attention mechanism to obtain fused feature data.
When feature screening is carried out on the fusion feature data, importance weights of the feature dimensions are obtained through multi-head attention calculation, and the feature dimensions with higher weights are retained. Dimension-reduction mapping is then performed on the screened features using a dynamic locality-sensitive hashing algorithm to generate compact feature fingerprints. The feature fingerprints are stored in an index structure supporting k-nearest-neighbor (KNN) queries, building the data feature fingerprint library.
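The locality-sensitive hashing step can be illustrated with random-hyperplane signatures, under which near-duplicate feature vectors collide on most fingerprint bits while unrelated vectors differ on roughly half; the bit width, feature dimension, and random seed are illustrative assumptions:

```python
import numpy as np

def lsh_fingerprint(vec, planes):
    """Sign pattern of projections onto random hyperplanes -> compact bit fingerprint."""
    return ((planes @ vec) >= 0).astype(np.uint8)

def hamming(a, b):
    """Number of differing fingerprint bits."""
    return int(np.count_nonzero(a != b))

rng = np.random.default_rng(42)
planes = rng.normal(size=(64, 8))          # 64-bit fingerprint for 8-dim features

base = rng.normal(size=8)
near = base + 0.01 * rng.normal(size=8)    # a near-duplicate sample
far = rng.normal(size=8)                   # an unrelated sample

fp_base = lsh_fingerprint(base, planes)
fp_near = lsh_fingerprint(near, planes)
fp_far = lsh_fingerprint(far, planes)

# Near-duplicates collide on far more bits than unrelated vectors
print(hamming(fp_base, fp_near) < hamming(fp_base, fp_far))  # True
```

In a fingerprint library, such bit signatures can be bucketed by prefix so that KNN candidate lookup touches only a small fraction of stored fingerprints.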
In the data enhancement stage, the fusion feature data is subjected to random cropping, rotation, and other transformations to generate enhanced samples. The contrast loss between the original samples and the enhanced samples is calculated to obtain a semantic similarity matrix among the samples. A data association graph is constructed by combining the neighbor query results of the feature fingerprint library with the semantic similarity matrix. The known labels are propagated on the association graph through the graph attention network to obtain initial classification results for the unlabeled samples. At the same time, entity relation triples are extracted from the association graph to construct a domain knowledge graph.
The meta learner adopts a gradient boosting tree model; its input features comprise the feature vector and classification confidence of each sample, and it outputs a classification reliability score. The samples with scores lower than the threshold value and verified by the characteristic fingerprint library are marked as the first type of data to be rechecked. The relation rules in the domain knowledge graph are used to logically verify the initial classification result, and samples that do not conform to the rules are marked as the second type of data to be rechecked.
The selection process of the first and second types of data to be rechecked is modeled as a Markov decision process. In this modeling process, each sample is considered a state, and the system's decision to transition from the current state to the next is based on a series of constraints and objective functions. Modeling the selection of the data to be rechecked in this way enables dynamic optimization, ensuring that the selected samples are of the greatest value for manual labeling.
The classification reliability score reflects the credibility of the classification result of the sample and is obtained by comprehensively calculating multidimensional factors such as the consistency and the confidence of the initial classification labeling result and the sample characteristics, the similarity with the related sample in the data characteristic fingerprint library and the like. The sample information gain refers to the potential contribution of the labeling result of each sample in improving the performance of the overall data classification model. By combining the classification reliability score with the sample information gain, a hybrid bonus function is constructed that aims at weighing the selection of information samples that have high value while avoiding the selection of duplicate or inefficient samples.
Deep reinforcement learning is introduced to optimize the sample selection strategy. In this process, the reinforcement learning agent continuously updates its policies through interactions with the environment to achieve maximization of the hybrid rewards function. Specifically, the agent selects the optimal sample to be rechecked in each round of decision making according to the current sample characteristics, classification reliability scores and information gain calculations. As training proceeds, agents gradually learn to select more efficiently those samples that are most valuable to improve data quality and model performance.
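A hybrid reward of the kind described, trading off low classification reliability against high sample information gain, might look like the following; the linear form and the weight `alpha` are assumptions for illustration, as the text does not specify the functional form:

```python
def hybrid_reward(reliability_score, info_gain, alpha=0.6):
    """Hybrid reward: low-reliability samples with high information gain
    are the most valuable candidates for manual review."""
    return alpha * (1.0 - reliability_score) + (1.0 - alpha) * info_gain

# Candidate samples to recheck: (classification reliability, information gain)
candidates = {
    "s1": (0.95, 0.10),   # reliable and uninformative -> low reward
    "s2": (0.40, 0.80),   # unreliable and informative -> high reward
    "s3": (0.55, 0.30),
}
best = max(candidates, key=lambda k: hybrid_reward(*candidates[k]))
print(best)  # s2
```

An agent maximizing this reward would, as the passage describes, prefer samples whose manual labels add the most information while skipping those the classifier already handles reliably.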
And after the manual labeling result is obtained, carrying out weighted fusion on the manual labeling result and the initial classification labeling result. The weighted fusion realizes the preferential trust of the manual labeling result by distributing dynamic weights, and simultaneously reserves the reliable part in the initial classification labeling result. The process ensures that the fused labeling result can reflect the high accuracy of manual labeling and can maintain the robustness of the model.
And carrying out consistency verification on the weighted and fused labeling result through the similarity index of the data characteristic fingerprint library. The data characteristic fingerprint library provides high-dimensional characteristic fingerprints of known samples, and whether the fusion labeling result is consistent with the known samples can be verified through quick retrieval and similarity matching. This step aims at checking potential labeling errors or inconsistencies, and further improves the reliability of labeling results.
The finally generated classification labeling data not only comprises fusion characteristic data, but also comprises a semantic similarity matrix, classification reliability scores and a domain knowledge graph verification result. The information is integrated to form a complete data labeling system, and a powerful foundation is provided for subsequent data analysis, model training and decision support.
In this embodiment, the two-channel feature extraction mode of the hierarchical attention network and the graph neural network captures local semantic features of the data while modeling global relations among the data, improving the comprehensiveness and accuracy of the feature representation. A feature fingerprint library is constructed by adopting dynamic locality-sensitive hashing and neighbor indexing, and efficient label propagation is realized by combining contrastive learning with graph attention propagation, greatly reducing the manual labeling cost. Meanwhile, logic verification through the knowledge graph ensures the accuracy of the labeling result. Based on an active learning strategy driven by deep reinforcement learning, the most valuable samples are selected adaptively for manual labeling, significantly improving labeling efficiency. A multiple verification mechanism ensures the consistency and reliability of the labeling result and improves the quality of the classification labeling.
In an optional implementation manner, a data quality assessment model is constructed by adopting an incremental learning algorithm, the data quality assessment model scores the classified labeling data from three dimensions of data integrity, timeliness and value density to obtain a scoring result, a priority processing queue is constructed according to the scoring result, and data cleaning and normalization processing are performed on the classified labeling data in the priority processing queue to obtain a standardized training data set, wherein the standardized training data set comprises the following steps:
Acquiring time stamp information of the classified marking data, calculating to obtain a data time attenuation coefficient by adopting an exponential decay function, and carrying out weighting treatment on the classified marking data based on the data time attenuation coefficient to obtain sample weight;
Constructing a data quality evaluation model, constructing a loss function by taking the sample weight as a weight item, optimizing the loss function by adopting an online random gradient descent algorithm until the model converges to obtain a trained data quality evaluation model, and performing quality evaluation on the classification label data by using the trained data quality evaluation model;
Calculating an effective feature ratio based on a quality evaluation result to obtain a data integrity score, obtaining a data timeliness score according to the data time attenuation coefficient, calculating a data value density score by combining the sample weight, and carrying out weighted fusion on the data integrity score, the data timeliness score and the data value density score to obtain a data quality score;
Constructing a priority processing queue according to the data quality scores, marking the classification marking data with the data quality scores lower than a first preset threshold value as data to be rejected, marking the classification marking data with the data quality scores higher than a second preset threshold value as priority processing data, calculating the data processing priority based on the data quality scores and the sample weights, and generating a priority data queue with processing marks;
According to the priority order of the priority data queue, field names and data formats of the classified marked data are subjected to unified conversion to obtain standardized characteristics, a time sequence completion model is constructed based on the timeliness score and the data integrity score of the data, and the missing fields are completed by the time sequence completion model to obtain structural characteristics;
constructing a characteristic fingerprint according to the structural characteristics and the data value density scores, identifying repeated data based on the characteristic fingerprint, and carrying out anomaly detection and denoising processing in combination with the data quality scores to generate quality processing records and cleaned characteristic data;
And respectively carrying out standardized conversion on numerical value features, category features and text features in the cleaned feature data, guiding word vector mapping of the text features by utilizing the feature fingerprints, and combining the converted standardized features with the data quality scores, the data processing priorities and the quality processing records to generate a standardized training data set.
Illustratively, the class marking data is first acquired, including the data content and the time stamp information. For each piece of data, a time decay coefficient is calculated based on its generation time. In particular, a reference time point, such as the current time, may be set, a time difference between each data piece and the reference time is calculated, and the attenuation coefficient is calculated by the time difference. For example, for a piece of data generated 30 days ago, the time decay factor may be 0.85, while the data decay factor 60 days ago may decrease to 0.72.
And weighting the data based on the calculated time attenuation coefficient to obtain a sample weight. The weight value reflects the timeliness importance of the data, and newer data has higher weight. For example, if the time attenuation coefficient of a certain piece of data is 0.85, the initial sample weight may be set to 0.85.
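The decay figures in the example above (about 0.85 at 30 days and 0.72 at 60 days) are consistent with a simple exponential decay function; the rate constant below is an assumption chosen only to reproduce those numbers:

```python
import math

def time_decay(age_days, lam=0.00542):
    """Exponential time-decay coefficient: newer data keeps a weight near 1,
    older data decays smoothly toward 0. lam is an illustrative rate constant."""
    return math.exp(-lam * age_days)

# Reproduces the example figures: ~0.85 at 30 days, ~0.72 at 60 days
print(round(time_decay(30), 2), round(time_decay(60), 2))  # 0.85 0.72
```

The resulting coefficient can be used directly as the initial sample weight, as in the example where a coefficient of 0.85 yields an initial weight of 0.85.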
In constructing the data quality assessment model, the sample weights are used as weight terms of the loss function. Model training is carried out by adopting an online random gradient descent mode, and a batch of data is randomly selected for training each time, so that model parameters are continuously adjusted until convergence. The trained model may be used to evaluate data quality.
For each piece of data, scores were made from three dimensions, integrity, timeliness, and value density, respectively:
the integrity score is calculated based on the scale of the valid features. For example, a piece of data contains 10 characteristic fields, 8 of which are valid, then its integrity score may be 0.8.
The timeliness score directly uses the time decay factor calculated previously. For example, a data with an attenuation coefficient of 0.85 has a time-dependent score of 0.85.
The value density score is calculated in combination with the sample weight. For example, if a sample weight of a data is 0.85 and its feature information is rich, its value density score may be 0.88.
And carrying out weighted fusion on the scores of the three dimensions to obtain a final data quality score. For example, the integrity score, the timeliness score, and the value density score may be weighted by 0.3, 0.3, and 0.4, respectively, and the weighted average calculated as the final score.
A priority processing queue is constructed based on the data quality scores. Two thresholds are set, such as 0.6 and 0.8. Data with a score below 0.6 are marked for culling and data with a score above 0.8 are marked for preferential treatment. And calculating the processing priority by combining the quality scores and the sample weights to generate a priority queue.
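Putting the scoring and queueing steps together, a minimal sketch using the example's weights and thresholds (0.3/0.3/0.4 fusion, culling below 0.6) could look like this; the record values are invented for illustration:

```python
import heapq

def quality_score(integrity, timeliness, value_density,
                  weights=(0.3, 0.3, 0.4)):
    """Weighted fusion of the three dimension scores into one quality score."""
    return (weights[0] * integrity + weights[1] * timeliness
            + weights[2] * value_density)

records = {
    "r1": (0.8, 0.85, 0.88),   # complete, fresh, dense -> high priority
    "r2": (0.5, 0.40, 0.55),   # low quality -> marked for culling
    "r3": (0.9, 0.72, 0.75),
}

queue = []  # max-priority via negated score (heapq is a min-heap)
for rid, dims in records.items():
    score = quality_score(*dims)
    if score < 0.6:
        continue                      # below first threshold: cull
    heapq.heappush(queue, (-score, rid))

order = [heapq.heappop(queue)[1] for _ in range(len(queue))]
print(order)  # ['r1', 'r3']
```

Higher-quality records are popped first, so downstream cleaning and normalization proceed in the priority order the scoring establishes.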
The data is processed in priority order. Firstly, field unification is carried out: dates in different formats are unified into a standard format, and numerical values in different units are converted into unified units. Missing fields are filled in based on the time sequence completion model. For example, missing sales figures may be predicted from historical data.
Feature fingerprints are constructed for deduplication and anomaly detection. The feature fingerprint may contain a hash value combination of key fields. For example, for merchandise data, fields such as category, specification, date of manufacture, etc. may be combined to generate a fingerprint. And (5) repeating the data based on fingerprint identification and carrying out anomaly detection by combining the quality scores.
Finally, the cleaned data is standardized: the numerical features are normalized, the category features are converted into an encoded form, and the text features are converted into word vectors. The normalized features are combined with the quality scores and other information to generate the final training data set.
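The per-type standardization can be sketched as follows, with z-score scaling for numeric features and one-hot encoding for category features (word-vector mapping of text features is omitted); these are standard illustrative choices, not a prescribed implementation:

```python
import numpy as np

def zscore(col):
    """Standardize a numeric column to zero mean and unit variance."""
    return (col - col.mean()) / col.std()

def one_hot(labels):
    """Encode a categorical column as one-hot vectors, one column per category."""
    cats = sorted(set(labels))
    return np.array([[1.0 if v == c else 0.0 for c in cats] for v in labels])

nums = np.array([10.0, 20.0, 30.0, 40.0])   # e.g. a numeric feature column
cats = ["a", "b", "a", "c"]                  # e.g. a category feature column
z = zscore(nums)
oh = one_hot(cats)
print(np.isclose(z.mean(), 0.0), oh.shape)  # True (4, 3)
```

The standardized columns can then be concatenated with the quality score and priority fields to form rows of the training data set.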
In the embodiment, through a time attenuation mechanism and multi-dimensional quality evaluation, the accurate quantification of the data value is realized, the accuracy and the efficiency of data processing are improved, and the data quality evaluation is more objective and reasonable. By adopting the priority queue and the characteristic fingerprint mechanism, the intelligent scheduling and the anomaly identification of the data processing are realized, the resource consumption of the data processing is reduced, and the efficiency and the accuracy of the data cleaning are improved. Through standardized processing and feature conversion, the uniformity of the data format and the normalization of the feature expression are ensured, the effect of subsequent model training is improved, and the reliability and stability of data application are enhanced.
In an alternative embodiment, the deep learning process is performed on the standardized training data set by adopting a multi-layer cascade network structure, and the multi-layer cascade network structure identifies target features in the standardized training data set through a self-attention mechanism according to the category information of the classification annotation data, wherein the method comprises the following steps:
Constructing a multi-layer cascade network structure, wherein the multi-layer cascade network structure comprises a feature extraction layer, an attention layer and a feature fusion layer, the feature extraction layer is composed of a plurality of convolution blocks, the attention layer adopts a self-attention mechanism, and the feature fusion layer is used for feature dimension reduction and combination;
Extracting category information from the category labeling data in the standardized training data set, constructing a category embedding matrix, and calculating based on the category embedding matrix and the standardized training data set to obtain a category mapping matrix;
Processing the standardized training data set by utilizing a feature extraction layer of a multi-layer cascade network structure, performing product operation on the category mapping matrix and the extracted features to obtain category perception features, and splicing the category perception features extracted in different layers in a channel dimension to obtain multi-scale cascade features;
In the attention layer, respectively generating a query matrix, a key matrix and a value matrix based on the multi-scale cascade characteristic and the category embedding matrix, calculating the similarity of the query matrix and the key matrix through a self-attention mechanism to obtain attention weight, and multiplying the attention weight and the value matrix to obtain attention characteristic;
in a feature fusion layer, carrying out residual connection and normalization processing on the attention features to obtain normalized features, constructing gating weights based on the normalized features and a category embedding matrix, and weighting the gating weights and the normalized features to obtain enhanced features;
And propagating the enhanced features of each layer upward layer by layer in the multi-layer cascade network structure and performing feature cascading, so as to obtain target features in the standardized training data set through cross-layer feature aggregation.
Illustratively, the build process of a multi-layer cascade network structure first requires designing a feature extraction layer. The feature extraction layer is composed of a plurality of convolution blocks; each convolution block comprises three convolution layers with convolution kernel sizes of 3×3, 5×5 and 7×7 respectively, a stride of 1, and same padding. Each convolution layer is followed by a batch normalization layer and a ReLU activation function, and the convolution kernels of different sizes extract feature information at different scales.
For the classification annotation data in the standardized training data set, the category information is converted into vector form by one-hot encoding. Assuming the dataset contains 10 categories, each category label is converted into a 10-dimensional vector. These vectors are combined into a class embedding matrix whose dimensions are the number of classes × the embedding dimension. Taking an image classification task as an example, with an input image size of 224×224×3, the feature map obtained from the feature extraction layer has size 28×28×256.
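For the assumed 10-class case, the one-hot class embedding matrix can be written directly; with one-hot encoding the embedding dimension equals the number of classes, so row i is the embedding of class i:

```python
NUM_CLASSES = 10  # illustrative class count from the example above

def one_hot_embedding_matrix(num_classes):
    # identity-like matrix: row i is the one-hot vector of class i
    return [[1.0 if i == j else 0.0 for j in range(num_classes)]
            for i in range(num_classes)]

embedding = one_hot_embedding_matrix(NUM_CLASSES)
```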
In the feature extraction process, dot product operation is carried out on the category mapping matrix and the extracted features, so that the features with category perception capability are obtained. And for the features extracted by different convolution blocks, splicing in the channel dimension to form a multi-scale cascade feature. Taking the above image classification task as an example, three feature dimensions of different scales are 28×28×256, 14×14×512 and 7×7×1024, respectively, and the multi-scale features of 28×28×1792 are obtained after stitching.
In the processing process of the attention layer, a query matrix is generated based on the multi-scale cascade characteristics, and a key matrix and a value matrix are generated based on the category embedding matrix. The attention weight is obtained by calculating the similarity between the query matrix and the key matrix, and the weight is normalized by adopting a softmax function. Multiplying the normalized attention weight by a value matrix to obtain the attention feature fused with the category information.
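The attention step described here can be illustrated with a toy computation: softmax(Q·Kᵀ/√d)·V, where the query comes from the cascade features and the keys/values from the class embedding. The matrices, shapes, and values below are assumptions for illustration, not the embodiment's actual dimensions.

```python
import math

def matmul(a, b):
    # naive matrix product: rows of a against columns of b
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    d = len(keys[0])
    transposed_keys = [list(col) for col in zip(*keys)]
    scores = matmul(query, transposed_keys)                        # Q . K^T
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, values)                                 # weights . V

Q = [[1.0, 0.0]]                  # one query token from the cascade features
K = [[1.0, 0.0], [0.0, 1.0]]      # two class-embedding keys
V = [[5.0, 0.0], [0.0, 5.0]]      # corresponding value rows
attn_out = attention(Q, K, V)
```

Since the query aligns with the first key, the output is weighted toward the first value row, which is exactly how class information steers the fused feature.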
In the feature fusion layer, residual connection is firstly carried out on the attention features, the input features and the attention features are added, and then normalization processing is carried out through layer normalization. Based on the normalized features and the class embedding matrix, a gating weight mechanism is built using the fully connected layers, the weights being used to control the importance of the different features. And multiplying the gating weight with the normalized feature element by element to obtain the enhanced feature representation.
Finally, the enhanced features of each layer are propagated upward layer by layer in the network, and feature aggregation is performed through cross-layer connections. Specifically, feature maps of different levels are aligned in spatial size through upsampling and then concatenated along the channel dimension, finally yielding target features that fuse multi-level information.
In the embodiment, through the combination of the multi-layer cascade structure and the self-attention mechanism, multi-scale characteristic information in data can be effectively captured, the comprehensiveness and the accuracy of characteristic extraction are improved, and the adaptability of the model to targets with different scales is enhanced. The category information is introduced to guide the feature extraction process, and the recognition capability of the model to the category related features is improved and the noise interference in the feature extraction process is reduced through the construction of the category perception features and the application of the attention mechanism. The method adopts a mode of combining a gating mechanism and residual connection, so that the self-adaptive fusion and optimization of the features are realized, the discriminant of the feature representation is improved, the original feature information is maintained, and the generalization capability of the model is enhanced.
In an alternative embodiment, performing rule matching on the multidimensional feature tensor based on a preset domain knowledge base to obtain an inference rule set includes:
Performing dimension reduction processing on the multidimensional feature tensor by utilizing a feature dimension reduction network to obtain a feature sequence, constructing a feature index tree on the feature sequence, setting a sliding window with a dynamic size based on the feature index tree, adaptively adjusting the size of the sliding window according to the feature sequence, and performing sliding scanning on the feature sequence to extract a feature mode;
inputting the characteristic pattern into a pre-training text encoder to obtain a characteristic coding vector, and inputting a rule pattern in a preset domain knowledge base into the pre-training text encoder to obtain a rule coding vector;
Calculating the mean value and standard deviation of the matching degree, determining an adaptive threshold coefficient according to a preset confidence interval, weighting the mean value and the standard deviation based on the adaptive threshold coefficient to obtain an adaptive threshold, screening rule patterns with the matching degree higher than the adaptive threshold, and forming a candidate rule set by the screened rule patterns and the corresponding structured templates;
And carrying out dependency analysis on the rules in the candidate rule set, establishing dependency relationships between the rules based on the input features and output features of the rules, expressing the dependencies as a directed graph in which nodes represent rules and directed edges represent dependencies, calculating the in-degree and out-degree of each rule node to determine the dependency level of the rules, and adopting an improved topological sorting algorithm to assign rules with direct dependency relationships to adjacent levels, so as to obtain an inference rule set.
Illustratively, the input multidimensional feature tensor is first subjected to feature dimension reduction. A deep neural network comprising several convolution layers and pooling layers serves as the feature dimension-reduction network. Taking a medical diagnosis scenario as an example, the input multidimensional feature tensor contains various examination indicators of a patient, with dimensions 128×64×32; the feature dimension-reduction network converts it into a one-dimensional feature sequence of length 512.
After the feature sequence is obtained, a feature index tree is constructed. And adopting a B+ tree structure to store the characteristic sequence, storing the characteristic value by the leaf node, and storing the index information by the non-leaf node. A sliding window with dynamic size is arranged on the characteristic sequence, the initial window size is set to be 32, and the sliding window is dynamically adjusted according to the local change condition of the characteristic sequence. When the feature sequence fluctuates greatly, the window size is reduced to capture fine-grained features, and when the feature sequence is stationary, the window size is increased to obtain macroscopic features.
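The dynamic window adjustment can be sketched as below. The variance-based volatility measure and the halving/doubling policy are illustrative assumptions; the embodiment only states that the window shrinks under strong fluctuation and grows when the sequence is stationary.

```python
def local_volatility(segment):
    # population variance of the current window
    mean = sum(segment) / len(segment)
    return sum((v - mean) ** 2 for v in segment) / len(segment)

def extract_patterns(sequence, init_size=8, min_size=4, max_size=16, threshold=1.0):
    patterns, pos, size = [], 0, init_size
    while pos < len(sequence):
        window = sequence[pos:pos + size]
        patterns.append(window)
        # adapt the window used for the NEXT scan step
        if local_volatility(window) > threshold:
            size = max(min_size, size // 2)   # volatile: capture fine grain
        else:
            size = min(max_size, size * 2)    # stationary: widen the view
        pos += len(window)
    return patterns

# a flat prefix followed by a strongly oscillating tail
patterns = extract_patterns([0.0] * 16 + [5.0, -5.0] * 8)
```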
The sliding window scans over the feature sequence to extract the feature pattern. Each feature pattern contains a sequence of feature values within a window. The extracted feature pattern is input into a pre-trained BERT text encoder to obtain 768-dimensional feature encoding vectors. Meanwhile, the rule patterns in the knowledge base of the preset field are input into the same encoder, so that the rule coding vector is obtained.
The matching degree is obtained by computing the cosine similarity between the feature encoding vector and the rule encoding vector. Taking medical diagnosis rules as an example, for a rule pattern such as "blood pressure continuously elevated, accompanied by headache", the cosine similarity between its rule encoding vector and a given feature pattern's encoding vector might be 0.85.
And counting the mean value and standard deviation of all the matching degrees. Assume a mean value of 0.7 and a standard deviation of 0.15. A 95% confidence interval is set, with a corresponding adaptive threshold coefficient of 1.96. The mean and standard deviation are weighted to obtain an adaptive threshold of 0.7+1.96×0.15=0.994. And screening rule patterns with the matching degree higher than the threshold value to form a candidate rule set.
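The threshold arithmetic above (0.7 + 1.96 × 0.15 = 0.994), together with the cosine similarity used for matching, can be reproduced directly; the sample vectors are illustrative.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def adaptive_threshold(mean, std, coeff=1.96):
    # 95% confidence interval -> coefficient 1.96, as in the example
    return mean + coeff * std

threshold = adaptive_threshold(0.7, 0.15)
identical = cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

Note that with this threshold, only matches well above the mean (by roughly two standard deviations) enter the candidate rule set.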
And carrying out dependency analysis on the rules in the candidate rule set. A dependency network among the rules is constructed by analyzing the input features and output features of each rule. The dependencies are represented by a directed graph in which nodes represent rules and directed edges represent dependencies between rules. The in-degree and out-degree of each node are calculated to identify the relative position and importance of the rule.
And adopting an improved topological sorting algorithm to carry out hierarchical division on the rules according to the dependency relationship among the rules. The result of the topological ordering is an ordered list that ensures that the order of execution of the rules conforms to logical dependencies while avoiding the occurrence of circular dependencies.
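A plain Kahn-style topological layering sketches the idea (the embodiment's "improved" variant is not specified, so this is a baseline): rules with direct dependencies land in adjacent levels, and any circular dependency is detected.

```python
def topo_layers(rules, edges):
    # edges: (a, b) means rule b consumes the output of rule a
    indegree = {r: 0 for r in rules}
    children = {r: [] for r in rules}
    for a, b in edges:
        children[a].append(b)
        indegree[b] += 1
    layer = [r for r in rules if indegree[r] == 0]
    layers = []
    while layer:
        layers.append(sorted(layer))
        nxt = []
        for r in layer:
            for c in children[r]:
                indegree[c] -= 1
                if indegree[c] == 0:
                    nxt.append(c)
        layer = nxt
    if sum(len(l) for l in layers) != len(rules):
        raise ValueError("circular dependency among rules")
    return layers

layers = topo_layers(
    ["r1", "r2", "r3", "r4"],
    [("r1", "r2"), ("r1", "r3"), ("r2", "r4"), ("r3", "r4")],
)
```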
And finally, sorting the ordered rule sets into inference rule sets. The rules in the reasoning rule set are gradually executed according to the dependency level, so that the order and the high efficiency of the reasoning process are ensured.
In the embodiment, the feature mode is extracted through the dynamically adjusted sliding window, so that feature information of different scales can be adaptively captured, and the flexibility and accuracy of feature extraction are improved. And the similarity between the feature mode and the rule mode is calculated by adopting a pre-training text encoder, so that the semantic understanding capability of the deep learning model is fully utilized, and the accuracy rate and generalization capability of rule matching are improved. A hierarchical reasoning rule set is constructed based on rule dependency relationship, so that the logic relationship among rules is clearer, and the interpretability and the reasoning efficiency of the reasoning process are improved.
In an alternative embodiment, performing layering processing on the multidimensional feature tensor with the inference rule set to obtain the initial structured data includes:
Extracting feature vectors of rule related areas from a feature sequence for rules of each rule layer in an inference rule set, inputting the feature vectors into a rule matching network, wherein the rule matching network comprises a feature conversion layer and a matching degree calculation layer, outputting rule applicability scores, and normalizing the rule applicability scores to serve as rule confidence degrees;
the rules in each rule layer are ordered in a descending order based on the rule confidence level to obtain an ordering rule set, the rules are sequentially applied to corresponding structured templates according to the ordering order, and structured fragments are generated through feature mapping, wherein the structured templates define the mapping relation between features and structured data;
detecting an action area of the rules in the ordering rule set, extracting the range of the overlapped area when the action areas of the rules are overlapped, calculating the coverage of each rule in the overlapped area, taking the weighted sum of the confidence and the coverage of the rules as the priority of the rules, and selecting the rule with the highest priority to be applied to the overlapped area so as to ensure that a unique structuring fragment is generated;
Constructing a multi-layer verification network comprising an attribute verification layer, a relation verification layer and a value domain verification layer, verifying the integrity of the structured fragments based on an attribute graph at the attribute verification layer, verifying the relationship rationality among the structured fragments based on a knowledge graph at the relation verification layer, verifying the validity of attribute values in the structured fragments based on a statistical model at the value domain verification layer, and integrating verification results of all layers to obtain verified structured fragments;
extracting semantic features in the verified structured fragments, calculating semantic similarity among the fragments, constructing a hierarchical organization structure of the fragments based on the semantic similarity, and integrating the semantically related structured fragments according to the hierarchical organization structure to generate initial structured data conforming to a preset specification.
Illustratively, the multidimensional feature tensor is hierarchically processed using the inference rule set to generate initial structured data. First, region-wise extraction is performed on the feature sequence according to the rule levels in the inference rule set: for each rule in a rule layer, the feature region related to the rule's conditions is identified, and its feature vector is extracted as input.
The extracted feature vector is input into a rule matching network. The rule matching network is composed of a feature conversion layer and a matching degree calculation layer, the feature conversion layer normalizes and enhances the input feature vector, and the matching degree calculation layer outputs the applicability score of each rule by calculating the matching degree of the feature and the rule condition. The suitability score reflects how well the rule fits the feature. The suitability score is then normalized to generate a rule confidence level, which is a key indicator for measuring the reliability of the rule.
And ordering the rules in each rule layer in a descending order based on the rule confidence, and generating an ordering rule set. The ordering rule sets are sequentially arranged according to the importance and applicability of the rules, so that the high confidence rule is ensured to be preferentially applied. The rules in the ordered rule set are then applied sequentially to the corresponding structured templates. The structured templates define the mapping rules of features to structured data, and the structured fragments are generated by combining the mapping relation of the feature vectors and the templates. The structured fragments are the basic units of the original structured data.
When the action areas of the rules overlap, detecting the range of the overlapping area, extracting all the rules in the overlapping area, and calculating the coverage of each rule in the overlapping area. Coverage represents the proportion of the range in which the rule acts within the overlap region. And weighting the rule confidence and the coverage to obtain the rule priority. The highest priority rule is applied to the overlapping regions, ensuring that each region generates only a unique structured fragment, thereby avoiding data collision and redundancy.
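The overlap resolution can be sketched as a weighted score over confidence and coverage; the 0.6/0.4 weights are illustrative assumptions, as the embodiment does not fix the weighting.

```python
def rule_priority(confidence, coverage, w_conf=0.6, w_cov=0.4):
    # weighted sum of rule confidence and coverage within the overlap region
    return w_conf * confidence + w_cov * coverage

def resolve_overlap(candidates):
    # candidates: (rule_name, confidence, coverage_in_overlap);
    # the single highest-priority rule is applied to the overlap
    return max(candidates, key=lambda c: rule_priority(c[1], c[2]))[0]

winner = resolve_overlap([
    ("rule_a", 0.9, 0.3),   # confident but covers little of the overlap
    ("rule_b", 0.7, 0.8),   # less confident but covers most of it
])
```

Selecting a single winner per overlap guarantees that each region yields exactly one structured fragment, which is what prevents the data collisions mentioned above.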
And constructing a multi-layer verification network to perform multi-dimensional verification on the generated structured fragments. The multi-layer authentication network comprises an attribute authentication layer, a relationship authentication layer and a value domain authentication layer. The attribute verification layer checks the integrity of the structured fragments based on the attribute map, for example, to confirm whether all necessary attribute values are contained in the fragments. The relationship verification layer verifies whether the relationship between the structured fragments is reasonable based on the knowledge-graph, for example, checking whether the different fragments conform to predefined logical or semantic associations. The value field verification layer checks the validity of the fragment attribute values based on the statistical model, for example, to confirm whether the values are within a reasonable range. And integrating the verification results, and correcting or removing fragments which do not meet the specification to obtain the final verification structured fragments.
Semantic features are extracted from the verified structured fragments, and the relevance among the fragments is calculated through semantic similarity. The semantic features capture deep semantic information of the fragment content, and the semantic similarity reflects the semantic correlation degree among fragments. And constructing a hierarchical organization structure of the fragments based on the semantic similarity, and aggregating the fragments related to the semantic into larger units. And generating initial structured data which accords with a preset standard by integrating the relevant fragments.
In the embodiment, the accuracy of feature extraction is improved through layering processing and rule matching, the rule matching network can adapt to feature representation of different scenes, and flexibility and robustness of rule application are improved. The multi-layer verification network realizes multi-dimensional data quality control, ensures the accuracy of a structured result in terms of attribute integrity, relationship rationality, value validity and the like, and remarkably improves the data reliability. The hierarchical organization based on semantic similarity realizes intelligent integration of the structured fragments, solves the problems of data redundancy and inconsistency, and improves the normalization and usability of the structured data.
In an optional implementation manner, analyzing the initial structured data by a causal reasoning method to obtain a decision basis chain, generating an optimization parameter based on the decision basis chain, and performing optimization processing on the initial structured data according to the optimization parameter and the category information of the classification annotation data to obtain final structured data includes:
calculating partial correlation coefficients of variable pairs in the initial structured data, performing a conditional independence test based on the partial correlation coefficients to obtain conditional independence relations among the variables, and constructing a causal graph according to the conditional independence relations;
Constructing a linear structural equation model based on the causal graph, and optimizing parameters of the linear structural equation model through maximum likelihood estimation to obtain a structural causal model, wherein the structural causal model is used for quantifying causal effect intensity among variables in the causal graph;
calculating the number of associated edges of each node in the causal graph, determining the node with the number of the associated edges exceeding a preset threshold value of the number of the associated edges as a decision node, tracing the associated nodes step by step along the causal edges in the causal graph from the decision node to obtain a factor chain, calculating the accumulated causal effect of the factor chain based on the structural causal model, and determining the factor chain with the accumulated causal effect exceeding the preset effect threshold value as a candidate decision basis chain;
In the structural causal model, changing the value of each node in the candidate decision basis chain and substituting the value into the structural causal model to obtain a corresponding downstream node predicted value after the node value is changed, determining the stability of the candidate decision basis chain based on the change amplitude of the predicted value, and determining the candidate decision basis chain with the stability meeting a preset condition as a decision basis chain;
Calculating the influence degree of each node in the decision basis chain on a target variable based on the structural causal model, screening out nodes with influence degree exceeding a preset influence threshold according to the influence degree as a feature set, taking the influence degree of each node in the feature set as a feature importance score, and combining the feature importance score with class distinction of classification marking data to obtain a feature optimization parameter;
Weighting and adjusting the value of the corresponding feature in the initial structured data based on the feature optimization parameters to obtain optimized data, wherein the feature optimization parameters are used for determining weight coefficients of feature adjustment;
calculating feature distribution differences between the optimized data and the initial structured data to obtain an optimization effect index, inputting the optimized data into the causal relationship constraints of the structural causal model to obtain a verification result, performing iterative optimization on the optimized data according to the optimization effect index and the verification result, and determining the current optimized data as the final structured data when the optimization effect index reaches a preset target and the verification result satisfies causal consistency.
Illustratively, all variables in the initial structured data are first paired, and the partial correlation coefficient between each pair of variables is calculated. The partial correlation coefficient reflects the degree of linear relationship between the two variables while excluding the interfering effects of the other variables. By analyzing the partial correlation coefficients between the variables, it can be preliminarily determined which variables may have direct or indirect causal relationship.
A conditional independence test is then performed based on the calculated partial correlation coefficients to judge whether two variables are independent after controlling for the other variables. If two variables are conditionally independent, there is no direct causal link between them. Based on these conditional independence relations, a causal graph representing causal relationships between variables is constructed: its nodes represent variables and its directed edges represent direct causal effects. During construction, the graph must be kept acyclic and consistent with prior knowledge or domain constraints.
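A first-order partial correlation test on toy data illustrates the idea: a confounder z drives both x and y, so they correlate strongly, yet the partial correlation given z vanishes, ruling out a direct causal edge. The data are fabricated for illustration.

```python
import math

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def partial_corr(x, y, z):
    # first-order partial correlation of x and y controlling for z
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

z = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [2.1, 3.9, 6.0, 7.9, 10.1]       # approximately 2*z
y = [1.05, 2.05, 2.8, 4.05, 5.05]    # approximately z
```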
Based on the causal graph, a structural equation model is constructed for quantifying causal relationship strength between variables. In the structural equation model, the value of each variable can be expressed as a linear combination of its causal precursor variables. And optimizing model parameters by using a maximum likelihood estimation method, and finally generating a structural causal model describing causal graph quantitative relations. The model is used for evaluating the causal effect among variables and providing a quantitative basis for subsequent analysis.
And analyzing nodes in the causal graph, and marking the nodes with the number of the associated edges exceeding a preset threshold as decision nodes. Decision nodes are typically in critical locations in the causal graph, with a greater impact on the overall causal structure. Each decision node is traced back gradually along the causal path from its associated other nodes, forming a series of causal chains. The cumulative causal effects of these causal chains are quantified using a structural causal model, which represents the overall causal impact strength from the start to the end of the chain. And (3) screening out chains with stronger causal effects by setting a threshold value of accumulated causal effects, and taking the chains as candidate decision basis chains.
And verifying the stability of the candidate decision basis chain. And changing the values of the nodes in the chain one by one in the structural causal model, and observing the variation amplitude of the predicted value of the downstream node of the chain. If a certain candidate decision basis chain can keep higher prediction stability when the node value changes, the reliability is higher. And confirming the candidate decision basis chain meeting the stability condition as a final decision basis chain.
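A toy linear structural causal model illustrates the intervention-based stability check; the coefficients and the two-node chain are assumptions for illustration. Intervening on the chain's start (a do-operation) and reading off the downstream prediction quantifies the cumulative effect; in a linear model the per-unit effect is constant, i.e., the chain is stable.

```python
def simulate(x):
    # structural equations of the toy model (coefficients assumed)
    y = 2.0 * x + 1.0          # y is caused by x
    target = 3.0 * y           # the target variable is caused by y
    return target

def cumulative_effect(x_base, delta=1.0):
    # intervene on x and observe the change in the downstream target
    return simulate(x_base + delta) - simulate(x_base)

effect_low = cumulative_effect(1.0)    # effect of a unit change probed at x = 1
effect_high = cumulative_effect(10.0)  # same chain probed at x = 10
```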
And calculating the direct or indirect influence of each node in the decision-making chain on the target variable according to the structural causal model, and quantifying the direct or indirect influence into the influence degree. And screening the nodes with the influence degree exceeding the threshold value as a feature set, and recording the influence degree of each node as a feature importance score. And comprehensively calculating the feature importance scores and the category distinction degree by combining the statistical characteristics of the category distinction degree in the classification annotation data, and generating feature optimization parameters used in the optimization process.
And carrying out weighted adjustment on the corresponding features in the initial structured data based on the feature optimization parameters. The weighting adjustment process considers the magnitude of the feature optimization parameters, ensures that the key features are given higher weight, and therefore the values of the key features are optimized more remarkably. The adjusted data is used as optimized data, and further effect verification is needed.
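The weighted adjustment can be sketched as element-wise scaling of the selected features; the field names and weight values below are hypothetical.

```python
def apply_feature_weights(record, weights):
    # scale each feature by its optimization weight; unlisted features keep 1.0
    return {k: v * weights.get(k, 1.0) for k, v in record.items()}

adjusted = apply_feature_weights(
    {"age": 40.0, "blood_pressure": 120.0, "temperature": 36.5},
    {"blood_pressure": 1.2, "age": 1.05},   # feature optimization parameters
)
```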
Comparing the optimized data with the initial data, analyzing the characteristic distribution change condition of the optimized data, and generating an optimized effect index. And simultaneously, inputting the optimized data into the structural causal model again, and verifying whether the optimized data meets causal relation constraint conditions. If the causal relationship constraint is broken in the optimization process, the characteristic optimization parameters and the data weight allocation strategy need to be readjusted.
Through multiple rounds of iteration, consistency between the feature distribution and the target distribution of the optimized data is gradually improved, and meanwhile, the causal consistency of the optimized result is ensured to be maintained. And stopping iteration when the optimization effect index reaches a preset target range and the causal verification result meets constraint requirements, and confirming the current optimization data as final structured data.
The finally generated structured data meets the preset requirements in terms of causal consistency, feature distribution rationality and optimization effect, and can be directly used for subsequent data analysis or decision support tasks.
In the embodiment, the causal relationship in the data is deeply mined by constructing the causal graph and the structural causal model, and the accuracy and the interpretability of the data analysis are improved. Compared with the traditional correlation analysis, the causal reasoning can better reflect the real relation between variables, and avoids misleading caused by false correlation. The concept of a decision basis chain is introduced, and a reliable theoretical basis is provided for data optimization by tracking a causal chain and evaluating the stability of the causal chain. The feature selection and optimization method based on the causality can effectively identify key features and improve generalization capability and robustness of the model. The causal consistency is considered in the optimization process, and the optimized data still meets the original causal relation constraint through iterative verification. The method not only improves the data quality, but also ensures the interpretability and the credibility of the optimization result, and provides a more reliable data basis for subsequent decision analysis and model application.
Fig. 2 is a schematic structural diagram of an unstructured data automatic processing system based on deep learning according to an embodiment of the present invention, as shown in fig. 2, the system includes:
The first unit is used for acquiring unstructured data by utilizing an intelligent data acquisition device, constructing a data characteristic fingerprint library, classifying and labeling the unstructured data based on the data characteristic fingerprint library to obtain classified and labeled data, constructing a data quality assessment model by adopting an incremental learning algorithm, scoring the classified and labeled data from three dimensions of data integrity, timeliness and value density to obtain a scoring result, constructing a priority processing queue according to the scoring result, and performing data cleaning and normalization processing on the classified and labeled data in the priority processing queue to obtain a standardized training data set;
The second unit is used for performing deep learning processing on the standardized training data set by adopting a multi-layer cascade network structure, identifying target features in the standardized training data set through a self-attention mechanism according to the category information of the classification marking data, constructing a feature association diagram based on the target features, constructing a bidirectional reasoning channel on the feature association diagram, wherein the bidirectional reasoning channel comprises a forward channel and a reverse channel, the forward channel extracts numerical distribution features, time sequence features and attribute features in the standardized training data set through a convolutional neural network to obtain dominant feature data, the reverse channel extracts data item association rules, data change trends and data interaction modes in the standardized training data set through the convolutional neural network to obtain implicit feature data, and performing feature combination on the dominant feature data and the implicit feature data by adopting a dynamic weight allocation strategy to obtain multidimensional feature tensor;
and the third unit is used for carrying out rule matching on the multidimensional feature tensor based on a preset domain knowledge base to obtain an inference rule set, carrying out layering processing on the multidimensional feature tensor by adopting the inference rule set to obtain initial structured data, analyzing the initial structured data by a causal inference method to obtain a decision basis chain, generating optimization parameters based on the decision basis chain, and carrying out optimization processing on the initial structured data according to the optimization parameters and the category information of the classification marking data to obtain final structured data.
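The layered application of the inference rule set mentioned for the third unit (and detailed later in claim 5) assigns rules with a direct dependency to adjacent levels. A Kahn-style topological layering captures that idea; the rule names and the dependency map below are hypothetical, and the sketch assumes the dependency graph is acyclic.

```python
from collections import defaultdict, deque

def layer_rules(rules, depends_on):
    """Assign each rule to a dependency level: a rule sits one level
    below the deepest rule it depends on, so directly dependent rules
    land on adjacent levels (Kahn-style topological layering)."""
    indegree = {r: 0 for r in rules}
    children = defaultdict(list)
    for rule, deps in depends_on.items():
        for dep in deps:
            children[dep].append(rule)
            indegree[rule] += 1
    level = {r: 0 for r in rules}
    queue = deque(r for r in rules if indegree[r] == 0)
    while queue:
        r = queue.popleft()
        for child in children[r]:
            level[child] = max(level[child], level[r] + 1)
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    layers = defaultdict(list)
    for r, lv in level.items():
        layers[lv].append(r)
    return dict(layers)

# Hypothetical rule set: r3 consumes the outputs of r1 and r2; r4 consumes r3's.
print(layer_rules(["r1", "r2", "r3", "r4"],
                  {"r3": ["r1", "r2"], "r4": ["r3"]}))
```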
In a third aspect of an embodiment of the present invention, there is provided an electronic device, including:
a processor; and
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the instructions stored in the memory to perform the method described above.
In a fourth aspect of an embodiment of the present invention, there is provided a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method described above.
The present invention may be a method, apparatus, system, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for performing various aspects of the present invention.
It should be noted that the above embodiments are merely illustrative of the technical solution of the present invention and are not limiting. Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art should understand that the technical solutions described in the above embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. An unstructured data automatic processing method based on deep learning is characterized by comprising the following steps:
The method comprises the steps of obtaining unstructured data by using an intelligent data collector, constructing a data characteristic fingerprint library, classifying and labeling the unstructured data based on the data characteristic fingerprint library to obtain classified and labeled data, constructing a data quality assessment model by adopting an incremental learning algorithm, scoring the classified and labeled data from three dimensions of data integrity, timeliness and value density to obtain scoring results, constructing a priority processing queue according to the scoring results, and performing data cleaning and normalization processing on the classified and labeled data in the priority processing queue to obtain a standardized training data set;
Deep learning is carried out on the standardized training data set by adopting a multi-layer cascade network structure, and the multi-layer cascade network structure identifies target features in the standardized training data set through a self-attention mechanism according to the category information of the classification marking data; a feature association diagram is built based on the target features, and a bidirectional reasoning channel is built on the feature association diagram, the bidirectional reasoning channel comprising a forward channel and a reverse channel, wherein the forward channel extracts numerical distribution features, time sequence features and attribute features in the standardized training data set through a convolutional neural network to obtain dominant feature data, and the reverse channel extracts data item association rules, data change trends and data interaction modes in the standardized training data set through the convolutional neural network to obtain implicit feature data; and a dynamic weight distribution strategy is adopted to carry out feature combination on the dominant feature data and the implicit feature data to obtain a multidimensional feature tensor;
And carrying out rule matching on the multidimensional feature tensor based on a preset domain knowledge base to obtain an inference rule set, carrying out layering processing on the multidimensional feature tensor by adopting the inference rule set to obtain initial structured data, analyzing the initial structured data by a causal inference method to obtain a decision basis chain, generating optimization parameters based on the decision basis chain, and carrying out optimization processing on the initial structured data according to the optimization parameters and the category information of the classification marking data to obtain final structured data.
2. The method of claim 1, wherein obtaining unstructured data with an intelligent data collector, constructing a data feature fingerprint library, and classifying and labeling the unstructured data based on the data feature fingerprint library to obtain classification and labeling data comprises:
The method comprises the steps that unstructured data are obtained through an intelligent data collector, the intelligent data collector comprises a local feature channel and a global feature channel, word-level feature extraction and sentence-level feature extraction are carried out on the local feature channel through a layered attention network to obtain local association features, a data relationship graph is built through a graph neural network through the global feature channel, global relationship features are extracted through message transmission, and attention fusion is carried out on the local association features and the global relationship features to obtain fusion feature data;
The feature importance weight is obtained by carrying out multi-head attention calculation on the fusion feature data, the fusion feature data is screened according to the feature importance weight, the screened features are mapped by adopting a dynamic local sensitive hash algorithm to generate feature fingerprints, and the feature fingerprints are constructed into a data feature fingerprint library supporting neighbor query;
Performing data enhancement on the fusion feature data to obtain an enhanced feature set, calculating the contrastive loss of sample pairs based on the enhanced feature set to obtain a semantic similarity matrix, constructing a data association graph by utilizing the neighbor query result of the data feature fingerprint library and the semantic similarity matrix, obtaining an initial classification labeling result by propagating known labels on the data association graph through graph attention, and simultaneously extracting entity relations according to the data association graph to construct a domain knowledge graph;
constructing a meta learner, taking the feature vector and the confidence coefficient of the initial classification labeling result as input, carrying out evaluation modeling on the classification result of each sample to obtain classification reliability scores, marking the samples which are lower than a preset reliability threshold and verified by the data feature fingerprint library as first type data to be rechecked, carrying out logic verification on the initial classification labeling result by utilizing the domain knowledge graph, and marking the samples which are not verified as second type data to be rechecked;
Modeling the selection process of the first type data to be rechecked and the second type data to be rechecked as a Markov decision process, constructing a hybrid reward function according to the classification reliability score and the sample information gain, optimizing the sample selection strategy through deep reinforcement learning, and selecting the optimal samples to be rechecked for manual labeling to obtain a manual labeling result;
And carrying out weighted fusion on the manual labeling result and the initial classification labeling result, and carrying out consistency verification on the weighted and fused labeling result by utilizing the similarity index of the data characteristic fingerprint library to generate classification labeling data comprising the fusion characteristic data, the semantic similarity matrix, the classification reliability score and the domain knowledge graph verification result.
3. The method of claim 1, wherein constructing a data quality assessment model by adopting an incremental learning algorithm, scoring the classified and labeled data with the data quality assessment model from the three dimensions of data integrity, timeliness and value density to obtain a scoring result, constructing a priority processing queue according to the scoring result, and performing data cleaning and normalization processing on the classified and labeled data in the priority processing queue to obtain a standardized training data set comprises:
Acquiring time stamp information of the classified marking data, calculating to obtain a data time attenuation coefficient by adopting an exponential decay function, and carrying out weighting treatment on the classified marking data based on the data time attenuation coefficient to obtain sample weight;
Constructing a data quality evaluation model, constructing a loss function by taking the sample weight as a weight item, optimizing the loss function by adopting an online random gradient descent algorithm until the model converges to obtain a trained data quality evaluation model, and performing quality evaluation on the classification label data by using the trained data quality evaluation model;
Calculating an effective feature ratio based on a quality evaluation result to obtain a data integrity score, obtaining a data timeliness score according to the data time attenuation coefficient, calculating a data value density score by combining the sample weight, and carrying out weighted fusion on the data integrity score, the data timeliness score and the data value density score to obtain a data quality score;
Constructing a priority processing queue according to the data quality scores, marking the classification marking data with the data quality scores lower than a first preset threshold value as data to be rejected, marking the classification marking data with the data quality scores higher than a second preset threshold value as priority processing data, calculating the data processing priority based on the data quality scores and the sample weights, and generating a priority data queue with processing marks;
According to the priority order of the priority data queue, field names and data formats of the classified marked data are subjected to unified conversion to obtain standardized characteristics, a time sequence completion model is constructed based on the timeliness score and the data integrity score of the data, and the missing fields are completed by the time sequence completion model to obtain structural characteristics;
constructing a characteristic fingerprint according to the structural characteristics and the data value density scores, identifying repeated data based on the characteristic fingerprint, and carrying out anomaly detection and denoising processing in combination with the data quality scores to generate quality processing records and cleaned characteristic data;
And respectively carrying out standardized conversion on numerical value features, category features and text features in the cleaned feature data, guiding word vector mapping of the text features by utilizing the feature fingerprints, and combining the converted standardized features with the data quality scores, the data processing priorities and the quality processing records to generate a standardized training data set.
4. The method of claim 1, wherein deep learning the standardized training data set using a multi-layer cascade network structure, the multi-layer cascade network structure identifying target features in the standardized training data set through a self-attention mechanism according to the category information of the classification marking data, comprises:
Constructing a multi-layer cascade network structure, wherein the multi-layer cascade network structure comprises a feature extraction layer, an attention layer and a feature fusion layer, the feature extraction layer is composed of a plurality of convolution blocks, the attention layer adopts a self-attention mechanism, and the feature fusion layer is used for feature dimension reduction and combination;
Extracting category information from the category labeling data in the standardized training data set, constructing a category embedding matrix, and calculating based on the category embedding matrix and the standardized training data set to obtain a category mapping matrix;
Processing the standardized training data set by utilizing a feature extraction layer of a multi-layer cascade network structure, performing product operation on the category mapping matrix and the extracted features to obtain category perception features, and splicing the category perception features extracted in different layers in a channel dimension to obtain multi-scale cascade features;
In the attention layer, respectively generating a query matrix, a key matrix and a value matrix based on the multi-scale cascade characteristic and the category embedding matrix, calculating the similarity of the query matrix and the key matrix through a self-attention mechanism to obtain attention weight, and multiplying the attention weight and the value matrix to obtain attention characteristic;
in a feature fusion layer, carrying out residual connection and normalization processing on the attention features to obtain normalized features, constructing gating weights based on the normalized features and a category embedding matrix, and weighting the gating weights and the normalized features to obtain enhanced features;
And transmitting the enhancement features of each layer by layer in a multi-layer cascade network structure, and carrying out feature cascade, so as to obtain target features in the standardized training data set through cross-layer feature aggregation.
5. The method of claim 1, wherein rule matching the multi-dimensional feature tensor based on a preset domain knowledge base to obtain an inference rule set comprises:
Performing dimension reduction processing on the multidimensional feature tensor by utilizing a feature dimension reduction network to obtain a feature sequence, constructing a feature index tree on the feature sequence, setting a sliding window with a dynamic size based on the feature index tree, adaptively adjusting the size of the sliding window according to the feature sequence, and performing sliding scanning on the feature sequence to extract a feature mode;
inputting the characteristic pattern into a pre-training text encoder to obtain a characteristic coding vector, and inputting a rule pattern in a preset domain knowledge base into the pre-training text encoder to obtain a rule coding vector;
Calculating the matching degree between the feature coding vectors and the rule coding vectors, calculating the mean value and standard deviation of the matching degrees, determining an adaptive threshold coefficient according to a preset confidence interval, weighting the mean value and the standard deviation based on the adaptive threshold coefficient to obtain an adaptive threshold, screening rule patterns with a matching degree higher than the adaptive threshold, and forming a candidate rule set from the screened rule patterns and the corresponding structured templates;
And carrying out dependency analysis on the rules in the candidate rule set, establishing dependency relationships between the rules based on the input features and output features of the rules, expressing the dependency relationships as a directed graph in which nodes represent rules and directed edges represent dependency relationships, calculating the in-degree and out-degree of each rule node to determine the dependency level of each rule, and adopting an improved topological sorting algorithm to assign rules with a direct dependency relationship to adjacent levels to obtain an inference rule set.
6. The method of claim 1, wherein layering the multi-dimensional feature tensor with the set of inference rules to obtain initial structured data comprises:
Extracting feature vectors of rule related areas from a feature sequence for rules of each rule layer in an inference rule set, inputting the feature vectors into a rule matching network, wherein the rule matching network comprises a feature conversion layer and a matching degree calculation layer, outputting rule applicability scores, and normalizing the rule applicability scores to serve as rule confidence degrees;
the rules in each rule layer are ordered in a descending order based on the rule confidence level to obtain an ordering rule set, the rules are sequentially applied to corresponding structured templates according to the ordering order, and structured fragments are generated through feature mapping, wherein the structured templates define the mapping relation between features and structured data;
detecting an action area of the rules in the ordering rule set, extracting the range of the overlapped area when the action areas of the rules are overlapped, calculating the coverage of each rule in the overlapped area, taking the weighted sum of the confidence and the coverage of the rules as the priority of the rules, and selecting the rule with the highest priority to be applied to the overlapped area so as to ensure that a unique structuring fragment is generated;
Constructing a multi-layer verification network comprising an attribute verification layer, a relation verification layer and a value domain verification layer, verifying the integrity of the structured fragments based on an attribute graph at the attribute verification layer, verifying the relationship rationality among the structured fragments based on a knowledge graph at the relation verification layer, verifying the validity of attribute values in the structured fragments based on a statistical model at the value domain verification layer, and integrating verification results of all layers to obtain verified structured fragments;
extracting semantic features in the verified structured fragments, calculating semantic similarity among the fragments, constructing a hierarchical organization structure of the fragments based on the semantic similarity, and integrating the semantically related structured fragments according to the hierarchical organization structure to generate initial structured data conforming to a preset specification.
7. The method of claim 1, wherein analyzing the initial structured data by a causal reasoning method to obtain a decision basis chain, generating an optimization parameter based on the decision basis chain, and optimizing the initial structured data according to the optimization parameter and the category information of the classification annotation data to obtain final structured data comprises:
calculating bias correlation coefficients of variable pairs in initial structured data, performing a condition independence test based on the bias correlation coefficients to obtain a condition independent relation among the variables, and constructing a causal graph according to the condition independent relation;
Constructing a linear structural equation model based on the causal graph, and optimizing parameters of the linear structural equation model through maximum likelihood estimation to obtain a structural causal model, wherein the structural causal model is used for quantifying causal effect intensity among variables in the causal graph;
calculating the number of associated edges of each node in the causal graph, determining nodes whose number of associated edges exceeds a preset associated-edge-count threshold as decision nodes, tracing the associated nodes step by step along the causal edges in the causal graph starting from each decision node to obtain factor chains, calculating the accumulated causal effect of each factor chain based on the structural causal model, and determining factor chains whose accumulated causal effect exceeds a preset effect threshold as candidate decision basis chains;
In the structural causal model, changing the value of each node in the candidate decision basis chain and substituting the value into the structural causal model to obtain a corresponding downstream node predicted value after the node value is changed, determining the stability of the candidate decision basis chain based on the change amplitude of the predicted value, and determining the candidate decision basis chain with the stability meeting a preset condition as a decision basis chain;
Calculating the influence degree of each node in the decision basis chain on a target variable based on the structural causal model, screening out nodes with influence degree exceeding a preset influence threshold according to the influence degree as a feature set, taking the influence degree of each node in the feature set as a feature importance score, and combining the feature importance score with class distinction of classification marking data to obtain a feature optimization parameter;
Weighting and adjusting the value of the corresponding feature in the initial structured data based on the feature optimization parameters to obtain optimized data, wherein the feature optimization parameters are used for determining weight coefficients of feature adjustment;
calculating the feature distribution difference between the optimized data and the initial structured data to obtain an optimization effect index, inputting the optimized data into the causal relation constraints of the structural causal model to obtain a verification result, performing iterative optimization on the optimized data according to the optimization effect index and the verification result, and determining the current optimized data as the final structured data when the optimization effect index reaches a preset target and the verification result satisfies causal consistency.
8. An unstructured data automatic processing system based on deep learning for implementing the method of any of the preceding claims 1-7, comprising:
The first unit is used for acquiring unstructured data by utilizing an intelligent data collector, constructing a data feature fingerprint library, classifying and labeling the unstructured data based on the data feature fingerprint library to obtain classified and labeled data, constructing a data quality assessment model by adopting an incremental learning algorithm, scoring the classified and labeled data from the three dimensions of data integrity, timeliness and value density to obtain a scoring result, constructing a priority processing queue according to the scoring result, and performing data cleaning and normalization processing on the classified and labeled data in the priority processing queue to obtain a standardized training data set;
The second unit is used for performing deep learning processing on the standardized training data set by adopting a multi-layer cascade network structure, identifying target features in the standardized training data set through a self-attention mechanism according to the category information of the classification marking data, constructing a feature association diagram based on the target features, and constructing a bidirectional reasoning channel on the feature association diagram, wherein the bidirectional reasoning channel comprises a forward channel and a reverse channel, the forward channel extracts numerical distribution features, time sequence features and attribute features in the standardized training data set through a convolutional neural network to obtain dominant feature data, the reverse channel extracts data item association rules, data change trends and data interaction modes in the standardized training data set through the convolutional neural network to obtain implicit feature data, and feature combination is performed on the dominant feature data and the implicit feature data by adopting a dynamic weight allocation strategy to obtain a multidimensional feature tensor;
and the third unit is used for carrying out rule matching on the multidimensional feature tensor based on a preset domain knowledge base to obtain an inference rule set, carrying out layering processing on the multidimensional feature tensor by adopting the inference rule set to obtain initial structured data, analyzing the initial structured data by a causal inference method to obtain a decision basis chain, generating optimization parameters based on the decision basis chain, and carrying out optimization processing on the initial structured data according to the optimization parameters and the category information of the classification marking data to obtain final structured data.
9. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the instructions stored in the memory to perform the method of any of claims 1 to 7.
10. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the method of any of claims 1 to 7.
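The causal-graph construction in claim 7 rests on conditional-independence testing via partial correlation. Below is a minimal sketch of that building block: the partial correlation is computed from regression residuals and tested with a Fisher z-transform. The synthetic data, the critical value, and the helper names are illustrative assumptions, not the patent's actual test.

```python
import math
import numpy as np

def partial_corr(x, y, z):
    """Partial correlation of x and y given conditioning variables z
    (a list of 1-D arrays), computed from the residuals of least-squares
    regressions of x and y on z."""
    zmat = np.column_stack([np.ones(len(x))] + [col for col in z])
    rx = x - zmat @ np.linalg.lstsq(zmat, x, rcond=None)[0]
    ry = y - zmat @ np.linalg.lstsq(zmat, y, rcond=None)[0]
    return float(np.corrcoef(rx, ry)[0, 1])

def independent(x, y, z, n, crit=1.96):
    """Fisher z-transform test: |z-score| below the critical value is
    taken as conditional independence of x and y given z."""
    r = max(min(partial_corr(x, y, z), 0.999999), -0.999999)
    z_score = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - len(z) - 3)
    return abs(z_score) < crit

rng = np.random.default_rng(42)
n = 2000
c = rng.normal(size=n)           # common cause
a = c + 0.1 * rng.normal(size=n)
b = c + 0.1 * rng.normal(size=n)
print(independent(a, b, [], n))   # False: a and b are marginally dependent
# likely True: conditioning on the common cause removes the dependence
print(independent(a, b, [c], n))
```

Edges of the causal graph are then kept only between variable pairs that remain dependent under every tested conditioning set, which is exactly the role the condition-independence test plays in the claim.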
CN202510131702.1A 2025-02-06 2025-02-06 Unstructured data automatic processing method and system based on deep learning Active CN119597834B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510131702.1A CN119597834B (en) 2025-02-06 2025-02-06 Unstructured data automatic processing method and system based on deep learning


Publications (2)

Publication Number Publication Date
CN119597834A CN119597834A (en) 2025-03-11
CN119597834B true CN119597834B (en) 2025-04-25

Family

ID=94839214

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510131702.1A Active CN119597834B (en) 2025-02-06 2025-02-06 Unstructured data automatic processing method and system based on deep learning

Country Status (1)

Country Link
CN (1) CN119597834B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120260789B (en) * 2025-03-21 2025-10-14 广东医通软件有限公司 Method and system for optimizing medical decision based on artificial intelligence
CN119941192B (en) * 2025-04-03 2025-06-24 中国水利水电第五工程局有限公司 Science and technology project data management method, system and medium for construction method revising process
CN119961577B (en) * 2025-04-09 2025-07-15 北京奥维云网大数据科技股份有限公司 Automatic AI corpus data screening and obtaining method based on big data analysis
CN120542486A (en) * 2025-05-13 2025-08-26 台州市黄岩中盛证书有限公司 An adaptive corona printing typesetting optimization system based on neural network model
CN120376027B (en) * 2025-06-27 2025-09-19 南昌市检验检测中心 Automatic acceptance and feature recognition method for user reports of adverse reactions to cosmetics

Citations (2)

Publication number Priority date Publication date Assignee Title
CN118300809A (en) * 2024-01-19 2024-07-05 南京大学 A network security intelligence mapping system and method based on hierarchical perception
CN118381627A (en) * 2024-04-01 2024-07-23 宁波和利时信息安全研究院有限公司 LLM driven industrial network intrusion detection method and response system

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
WO2022036616A1 (en) * 2020-08-20 2022-02-24 中山大学 Method and apparatus for generating inferential question on basis of low labeled resource
CN119311854B (en) * 2024-11-14 2025-06-17 江西数易科技有限公司 Data query method and system based on cross-modal similarity text mining




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant