Disclosure of Invention
The present application has been made to solve the above technical problems. Embodiments of the application provide a data security protection integrated system and a method thereof, which adopt a deep-learning-based artificial intelligence detection algorithm to extract feature information from the data to be detected and from the sensitive vocabulary, and then calculate a transfer matrix between the features of the data to be detected and the features of the sensitive vocabulary to represent the feature similarity between the two, so as to judge whether the data is sensitive data. In this way, a large amount of data can be processed automatically, and the accuracy and efficiency of sensitive data identification are improved.
Accordingly, according to one aspect of the present application, there is provided a data security protection integrated system, comprising:
a data acquisition module, configured to acquire data to be detected and a sensitive vocabulary set;
a to-be-detected data semantic understanding module, configured to pass the data to be detected through a context encoder comprising an embedding layer to obtain a plurality of context semantic feature vectors;
a first scale perception module, configured to cascade the plurality of context semantic feature vectors to obtain a first-scale semantic association feature vector;
a second scale perception module, configured to arrange the plurality of context semantic feature vectors two-dimensionally into a context semantic feature matrix and then pass the matrix through a convolutional neural network model comprising a plurality of mixed convolutional layers to obtain a second-scale semantic association feature vector;
a multi-scale fusion module, configured to perform interpolation order fusion on the first-scale semantic association feature vector and the second-scale semantic association feature vector to obtain a data feature vector to be detected;
a sensitive vocabulary semantic understanding module, configured to pass the sensitive vocabulary set through a context encoder comprising an embedding layer to obtain a sensitive data feature vector;
a transfer calculation module, configured to calculate a transfer matrix between the data feature vector to be detected and the sensitive data feature vector as a classification feature matrix; and
a detection result generation module, configured to pass the classification feature matrix through a classifier to obtain a classification result, wherein the classification result is used to indicate whether the data to be detected is sensitive data.
In the above data security protection integrated system, the to-be-detected data semantic understanding module includes: an embedding unit, configured to map the text data of each data item in the data to be detected into a word embedding vector using the embedding layer of the context encoder; a data adding unit, configured to append the numerical data in each data item to the tail of the word embedding vector of that data item to obtain a plurality of data item embedding vectors; and a context encoding unit, configured to perform context semantic encoding on the plurality of data item embedding vectors using the Transformer-based BERT model of the context encoder to obtain the plurality of context semantic feature vectors.
In the above data security protection integrated system, the context encoding unit includes: a one-dimensional arrangement subunit, configured to arrange the plurality of data item embedding vectors one-dimensionally to obtain a data item global embedding vector; a self-attention generation subunit, configured to calculate the product between the data item global embedding vector and the transpose of each of the plurality of data item embedding vectors to obtain a plurality of self-attention correlation matrices; a normalized self-attention subunit, configured to normalize each of the plurality of self-attention correlation matrices to obtain a plurality of normalized self-attention correlation matrices; a weight generation subunit, configured to pass each of the plurality of normalized self-attention correlation matrices through a classification function to obtain a plurality of probability values; and a weighting subunit, configured to weight each of the plurality of data item embedding vectors with the corresponding one of the plurality of probability values to obtain the plurality of context semantic feature vectors.
In the above data security protection integrated system, the second scale perception module is configured to use each mixed convolutional layer of the convolutional neural network model to perform the following processing on the input data in the forward pass of that layer: performing convolutional encoding on the context semantic feature matrix using a first convolution kernel having a first size to obtain a first scale feature map; performing convolutional encoding on the context semantic feature matrix using a second convolution kernel having a first dilation rate to obtain a second scale feature map; performing convolutional encoding on the context semantic feature matrix using a third convolution kernel having a second dilation rate to obtain a third scale feature map; performing convolutional encoding on the context semantic feature matrix using a fourth convolution kernel having a third dilation rate to obtain a fourth scale feature map, wherein the first, second, third and fourth convolution kernels have the same size, and the second, third and fourth convolution kernels have different dilation rates; aggregating the first scale feature map, the second scale feature map, the third scale feature map and the fourth scale feature map along the channel dimension to obtain an aggregated feature map; performing global pooling on each feature matrix of the aggregated feature map along the channel dimension to generate a pooled feature map; and performing activation processing on the pooled feature map to generate an activated feature map; wherein the output of the last layer of the convolutional neural network model comprising a plurality of mixed convolutional layers is the second-scale semantic association feature vector.
In the above data security protection integrated system, the multi-scale fusion module includes: a difference calculation unit, configured to calculate the position-wise difference between the first-scale semantic association feature vector and the second-scale semantic association feature vector to obtain a differential feature vector; a position-wise addition unit, configured to calculate the position-wise sum of the first-scale semantic association feature vector and the second-scale semantic association feature vector to obtain a point-added feature vector; a cosine similarity calculation unit, configured to calculate the cosine similarity between the differential feature vector and the point-added feature vector; and a weighted fusion unit, configured to take the cosine similarity between the differential feature vector and the point-added feature vector as a weight parameter and fuse the first-scale semantic association feature vector and the second-scale semantic association feature vector by the following fusion formula to obtain the data feature vector to be detected; wherein the fusion formula is: $V_i = \alpha V_1 + (1-\alpha) V_2$, where $V_1$ represents the first-scale semantic association feature vector, $V_2$ represents the second-scale semantic association feature vector, $\alpha$ represents the weight parameter, and $V_i$ represents the data feature vector to be detected.
In the above data security protection integrated system, the sensitive vocabulary semantic understanding module includes: an embedding vectorization unit, configured to map each sensitive word in the sensitive vocabulary set into a word embedding vector using the embedding layer of the context encoder to obtain a sequence of word embedding vectors; a semantic encoding unit, configured to perform global context semantic encoding on the sequence of word embedding vectors using the Transformer-based BERT model of the context encoder to obtain a plurality of word feature vectors; and a cascading unit, configured to cascade the plurality of word feature vectors to obtain the sensitive data feature vector.
In the above data security protection integrated system, the transfer calculation module is configured to calculate the transfer matrix between the data feature vector to be detected and the sensitive data feature vector according to the following transfer formula;
wherein the transfer formula is: $V_b = M \otimes V_a$,
where $V_a$ represents the data feature vector to be detected, $V_b$ represents the sensitive data feature vector, $M$ represents the transfer matrix, and $\otimes$ denotes multiplication of a matrix by a vector.
According to another aspect of the present application, there is provided a data security protection integrated method, comprising:
acquiring data to be detected and a sensitive vocabulary set;
passing the data to be detected through a context encoder comprising an embedding layer to obtain a plurality of context semantic feature vectors;
cascading the plurality of context semantic feature vectors to obtain a first-scale semantic association feature vector;
arranging the plurality of context semantic feature vectors two-dimensionally into a context semantic feature matrix, and then passing the matrix through a convolutional neural network model comprising a plurality of mixed convolutional layers to obtain a second-scale semantic association feature vector;
performing interpolation order fusion on the first-scale semantic association feature vector and the second-scale semantic association feature vector to obtain a data feature vector to be detected;
passing the sensitive vocabulary set through a context encoder comprising an embedding layer to obtain a sensitive data feature vector;
calculating a transfer matrix between the data feature vector to be detected and the sensitive data feature vector as a classification feature matrix; and
passing the classification feature matrix through a classifier to obtain a classification result, wherein the classification result is used to indicate whether the data to be detected is sensitive data.
In the above data security protection integrated method, passing the data to be detected through a context encoder comprising an embedding layer to obtain a plurality of context semantic feature vectors includes: mapping the text data of each data item in the data to be detected into a word embedding vector using the embedding layer of the context encoder; appending the numerical data in each data item to the tail of the word embedding vector of that data item to obtain a plurality of data item embedding vectors; and performing context semantic encoding on the plurality of data item embedding vectors using the Transformer-based BERT model of the context encoder to obtain the plurality of context semantic feature vectors.
In the above data security protection integrated method, performing context semantic encoding on the plurality of data item embedding vectors using the Transformer-based BERT model of the context encoder to obtain the plurality of context semantic feature vectors includes: arranging the plurality of data item embedding vectors one-dimensionally to obtain a data item global embedding vector; calculating the product between the data item global embedding vector and the transpose of each of the plurality of data item embedding vectors to obtain a plurality of self-attention correlation matrices; normalizing each of the plurality of self-attention correlation matrices to obtain a plurality of normalized self-attention correlation matrices; passing each of the plurality of normalized self-attention correlation matrices through a classification function to obtain a plurality of probability values; and weighting each of the plurality of data item embedding vectors with the corresponding one of the plurality of probability values to obtain the plurality of context semantic feature vectors.
Compared with the prior art, the data security protection integrated system and the method thereof provided by the application adopt a deep-learning-based artificial intelligence detection algorithm to extract feature information from the data to be detected and from the sensitive vocabulary, and then calculate a transfer matrix between the features of the data to be detected and the features of the sensitive vocabulary to represent the feature similarity between the two, so as to judge whether the data is sensitive data. In this way, a large amount of data can be processed automatically, and the accuracy and efficiency of sensitive data identification are improved.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application and not all of the embodiments of the present application, and it should be understood that the present application is not limited by the example embodiments described herein.
Fig. 1 is a block diagram of a data security protection integrated system according to an embodiment of the present application. As shown in fig. 1, a data security protection integrated system 100 according to an embodiment of the present application includes: a data acquisition module 110, configured to acquire data to be detected and a sensitive vocabulary set; a to-be-detected data semantic understanding module 120, configured to pass the data to be detected through a context encoder comprising an embedding layer to obtain a plurality of context semantic feature vectors; a first scale perception module 130, configured to cascade the plurality of context semantic feature vectors to obtain a first-scale semantic association feature vector; a second scale perception module 140, configured to arrange the plurality of context semantic feature vectors two-dimensionally into a context semantic feature matrix and then pass the matrix through a convolutional neural network model comprising a plurality of mixed convolutional layers to obtain a second-scale semantic association feature vector; a multi-scale fusion module 150, configured to perform interpolation order fusion on the first-scale semantic association feature vector and the second-scale semantic association feature vector to obtain a data feature vector to be detected; a sensitive vocabulary semantic understanding module 160, configured to pass the sensitive vocabulary set through a context encoder comprising an embedding layer to obtain a sensitive data feature vector; a transfer calculation module 170, configured to calculate a transfer matrix between the data feature vector to be detected and the sensitive data feature vector as a classification feature matrix; and a detection result generation module 180, configured to pass the classification feature matrix through a classifier to obtain a classification result, wherein the classification result is used to indicate whether the data to be detected is sensitive data.
Fig. 2 is a schematic architecture diagram of a data security protection integrated system according to an embodiment of the present application. As shown in fig. 2, first, data to be detected and a sensitive vocabulary set are acquired. The data to be detected is then passed through a context encoder comprising an embedding layer to obtain a plurality of context semantic feature vectors. Next, the plurality of context semantic feature vectors are cascaded to obtain a first-scale semantic association feature vector. In parallel, the plurality of context semantic feature vectors are arranged two-dimensionally into a context semantic feature matrix, which is then passed through a convolutional neural network model comprising a plurality of mixed convolutional layers to obtain a second-scale semantic association feature vector. Interpolation order fusion is then performed on the first-scale semantic association feature vector and the second-scale semantic association feature vector to obtain a data feature vector to be detected. Separately, the sensitive vocabulary set is passed through a context encoder comprising an embedding layer to obtain a sensitive data feature vector. Then, a transfer matrix between the data feature vector to be detected and the sensitive data feature vector is calculated as a classification feature matrix. Finally, the classification feature matrix is passed through a classifier to obtain a classification result, which indicates whether the data to be detected is sensitive data.
In the above data security protection integrated system 100, the data acquisition module 110 is configured to acquire the data to be detected and the sensitive vocabulary set. As mentioned in the background above, sensitive data identification is of great importance in data security protection; its main purpose is to identify and mark sensitive data stored in a system so that corresponding security measures can be taken against access, leakage or abuse by unauthorized persons. However, most current data is large in volume and complex, manual review is slow, and different people may judge the same data differently, so the identification results for sensitive data are inconsistent. Therefore, an efficient and accurate sensitive data identification scheme is desired.
Accordingly, in the process of identifying sensitive data, in order to achieve both accuracy and effectiveness, identification can be performed based on the feature similarity, in a high-dimensional space, between the implicit features of the data to be detected and the implicit features of the sensitive vocabulary. In other words, in the technical scheme of the application, a deep-learning-based artificial intelligence detection algorithm is adopted to extract feature information from the data to be detected and from the sensitive vocabulary, and a transfer matrix between the features of the data to be detected and the features of the sensitive vocabulary is then calculated to represent the feature similarity between the two, so as to judge whether the data is sensitive data. In this way, a large amount of data can be processed automatically, and the accuracy and efficiency of sensitive data identification are improved. Specifically, in the technical scheme of the application, data to be detected and a sensitive vocabulary set are first acquired.
In the above data security protection integrated system 100, the to-be-detected data semantic understanding module 120 is configured to pass the data to be detected through a context encoder comprising an embedding layer to obtain a plurality of context semantic feature vectors. In sensitive data identification, the context information of the data is critical to accurately determining whether the data is sensitive. Thus, to capture the context information and semantic association features in the data to be detected, the data to be detected is processed by a context encoder comprising an embedding layer. It should be appreciated that a context encoder is a model for converting text data into continuous vector representations. First, each word or character in the text data is mapped into an embedding vector by the embedding layer; then, the Transformer of the context encoder applies a self-attention mechanism to the plurality of embedding vectors and converts them into a plurality of context semantic feature vectors, so that the semantic association features of the data are captured comprehensively and the accuracy of sensitive data identification is improved.
Fig. 3 is a block diagram of the to-be-detected data semantic understanding module in the data security protection integrated system according to an embodiment of the present application. As shown in fig. 3, the to-be-detected data semantic understanding module 120 includes: an embedding unit 121, configured to map the text data of each data item in the data to be detected into a word embedding vector using the embedding layer of the context encoder; a data adding unit 122, configured to append the numerical data in each data item to the tail of the word embedding vector of that data item to obtain a plurality of data item embedding vectors; and a context encoding unit 123, configured to perform context semantic encoding on the plurality of data item embedding vectors using the Transformer-based BERT model of the context encoder to obtain the plurality of context semantic feature vectors.
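As a non-limiting illustration, the work of the embedding unit 121 and the data adding unit 122 may be sketched in Python as follows. The hash-based `toy_embedding` function, the width `EMBED_DIM`, and the mean pooling of token embeddings into one vector per data item are assumptions made only for this sketch; the application does not specify them, and a trained embedding layer would be used in practice.

```python
import hashlib

EMBED_DIM = 4  # hypothetical embedding width, chosen only for this demo


def toy_embedding(token, dim=EMBED_DIM):
    """Deterministic stand-in for a learned embedding layer (assumption)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    # Map the first `dim` bytes of the digest to floats in [-1, 1].
    return [b / 127.5 - 1.0 for b in digest[:dim]]


def embed_data_item(text, numeric_values, dim=EMBED_DIM):
    """Embed the text of one data item and append its numeric values to the tail."""
    tokens = text.lower().split()
    vectors = [toy_embedding(t, dim) for t in tokens] or [[0.0] * dim]
    # Assumption: token embeddings are mean-pooled into one vector per data item.
    pooled = [sum(col) / len(vectors) for col in zip(*vectors)]
    # Data adding step: numeric data goes at the tail of the word embedding vector.
    return pooled + [float(v) for v in numeric_values]


# Each data item is a (text, numeric values) pair; the values are made up.
items = [("customer phone number", [11]), ("order total", [129.5, 3])]
embedded = [embed_data_item(text, nums) for text, nums in items]
```

In this sketch, data items carrying different amounts of numeric data yield vectors of different lengths; a practical implementation would presumably pad or project them to a common width before further processing.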
Fig. 4 is a block diagram of a context encoding unit in a data security integrated system according to an embodiment of the present application. The context encoding unit 123 includes: a one-dimensional arrangement subunit 1231, configured to perform one-dimensional arrangement on the plurality of data item embedding vectors to obtain a data item global embedding vector; a self-attention generation subunit 1232 configured to calculate a product between the global data item embedding vector and a transpose vector of each of the plurality of data item embedding vectors to obtain a plurality of self-attention correlation matrices; a normalized self-attention subunit 1233, configured to perform normalization processing on each of the plurality of self-attention correlation matrices to obtain a plurality of normalized self-attention correlation matrices; a weight generating subunit 1234, configured to obtain a plurality of probability values from each normalized self-attention correlation matrix in the plurality of normalized self-attention correlation matrices by using a classification function; and the weighting subunit 1235 is configured to weight each of the plurality of data item embedding vectors with each of the plurality of probability values as a weight to obtain the plurality of context semantic feature vectors.
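As a non-limiting illustration, the five subunits of the context encoding unit 123 may be sketched as follows. Several details are assumptions made only for this sketch because the text leaves them open: normalization is taken to be scaling by the square root of the global vector length, each normalized matrix is reduced to a single score by averaging its entries, and the classification function is taken to be Softmax over those scores.

```python
import math


def outer(u, v):
    """Product of column vector u and the transpose of v: a len(u) x len(v) matrix."""
    return [[a * b for b in v] for a in u]


def softmax(scores):
    """Numerically stable Softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_encode(item_vectors):
    # 1. One-dimensional arrangement: concatenate all data item embedding vectors.
    global_vec = [x for vec in item_vectors for x in vec]
    scale = math.sqrt(len(global_vec))
    scores = []
    for vec in item_vectors:
        # 2. Self-attention correlation matrix: global vector times vec transposed.
        corr = outer(global_vec, vec)
        # 3. Normalization (assumption: divide every entry by sqrt(d)).
        corr = [[x / scale for x in row] for row in corr]
        # 4a. Reduce the matrix to one score (assumption: mean of its entries).
        count = len(corr) * len(corr[0])
        scores.append(sum(sum(row) for row in corr) / count)
    # 4b. Classification function (assumption: Softmax) gives one probability per item.
    weights = softmax(scores)
    # 5. Weight each data item embedding vector by its probability value.
    features = [[w * x for x in vec] for w, vec in zip(weights, item_vectors)]
    return weights, features


items = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context_vectors = context_encode(items)
```

With these inputs the first two items receive equal weights and the third, which overlaps more with the global vector, receives a larger one.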
In the above data security protection integrated system 100, the first scale perception module 130 is configured to concatenate the plurality of context semantic feature vectors to obtain a first scale semantic association feature vector. In order to comprehensively consider semantic information in the context semantic feature vectors and capture the semantic association features of the data to be detected in a global mode, the context semantic feature vectors need to be fused. The plurality of context semantic feature vectors are integrated by adopting cascading operation to form a global feature representation, so that the whole semantic association features of the data to be detected are reflected better.
In the above data security protection integrated system 100, the second scale perception module 140 is configured to arrange the plurality of context semantic feature vectors two-dimensionally into a context semantic feature matrix and then pass the matrix through a convolutional neural network model comprising a plurality of mixed convolutional layers to obtain a second-scale semantic association feature vector. To further extract local and global semantic association features of the data, the plurality of context semantic feature vectors are further processed using a convolutional neural network model. It should be understood that arranging the plurality of context semantic feature vectors into the context semantic feature matrix allows different context semantic information to be organized and represented in a two-dimensional space, which helps the model better understand the relationships between contexts and extract local semantic association features. Meanwhile, convolutional neural networks have good feature extraction capability in image and text processing, and the mixed convolutional layer can capture both local detail and global context associations. Therefore, by applying convolution operations to the context semantic feature matrix, semantic association features of different scales can be effectively extracted from it.
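As a non-limiting illustration, the two arrangements of the same context semantic feature vectors (cascading for the first scale, two-dimensional arrangement for the second scale) may be sketched as follows. Stacking the vectors as matrix rows and requiring them to share one length are assumptions made only for this sketch.

```python
def cascade(vectors):
    """First scale: concatenate the context semantic feature vectors end to end."""
    return [x for vec in vectors for x in vec]


def arrange_2d(vectors):
    """Second-scale input: stack the vectors as rows of a matrix (assumption)."""
    width = len(vectors[0])
    if any(len(vec) != width for vec in vectors):
        raise ValueError("all context semantic feature vectors must share one length")
    return [list(vec) for vec in vectors]


# Two made-up context semantic feature vectors of width 3.
context_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
first_scale_vector = cascade(context_vectors)  # one flat vector of length 6
feature_matrix = arrange_2d(context_vectors)   # 2 x 3 matrix for the CNN
```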
Accordingly, in one specific example, the second scale perception module 140 is configured to use each mixed convolutional layer of the convolutional neural network model to perform the following processing on the input data in the forward pass of that layer: performing convolutional encoding on the context semantic feature matrix using a first convolution kernel having a first size to obtain a first scale feature map; performing convolutional encoding on the context semantic feature matrix using a second convolution kernel having a first dilation rate to obtain a second scale feature map; performing convolutional encoding on the context semantic feature matrix using a third convolution kernel having a second dilation rate to obtain a third scale feature map; performing convolutional encoding on the context semantic feature matrix using a fourth convolution kernel having a third dilation rate to obtain a fourth scale feature map, wherein the first, second, third and fourth convolution kernels have the same size, and the second, third and fourth convolution kernels have different dilation rates; aggregating the first scale feature map, the second scale feature map, the third scale feature map and the fourth scale feature map along the channel dimension to obtain an aggregated feature map; performing global pooling on each feature matrix of the aggregated feature map along the channel dimension to generate a pooled feature map; and performing activation processing on the pooled feature map to generate an activated feature map; wherein the output of the last layer of the convolutional neural network model comprising a plurality of mixed convolutional layers is the second-scale semantic association feature vector.
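As a non-limiting illustration, one mixed convolutional layer may be sketched as follows: four same-sized kernels are applied in parallel, the first without dilation and the other three with different dilation rates, and the four feature maps are aggregated, globally pooled per channel, and activated. The dilation rates (2, 3, 4), the zero padding, the use of global average pooling, and the ReLU activation are assumptions made only for this sketch.

```python
def dilated_conv2d(matrix, kernel, dilation=1):
    """Zero-padded, same-size 2-D cross-correlation with a dilated square kernel."""
    rows, cols = len(matrix), len(matrix[0])
    k = len(kernel)
    half = (k // 2) * dilation
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    rr = r - half + i * dilation
                    cc = c - half + j * dilation
                    # Positions outside the matrix count as zeros (zero padding).
                    if 0 <= rr < rows and 0 <= cc < cols:
                        acc += kernel[i][j] * matrix[rr][cc]
            out[r][c] = acc
    return out


def mixed_conv_layer(matrix, kernels, dilations=(1, 2, 3, 4)):
    """One mixed convolutional layer: four branches, aggregate, pool, activate."""
    # Four parallel branches; the list of maps is the channel-wise aggregation.
    feature_maps = [dilated_conv2d(matrix, k, d) for k, d in zip(kernels, dilations)]
    # Global average pooling of each channel's feature matrix (assumption).
    pooled = [sum(map(sum, fm)) / (len(fm) * len(fm[0])) for fm in feature_maps]
    # Activation (assumption: ReLU).
    return [max(0.0, p) for p in pooled]


# Hypothetical kernels: a centre tap and a top-left corner tap.
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
corner = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
ones = [[1.0] * 5 for _ in range(5)]
out_identity = mixed_conv_layer(ones, [identity] * 4)
out_corner = mixed_conv_layer(ones, [corner] * 4)
```

The corner-tap kernel makes the effect of dilation visible: the larger the dilation rate, the farther the tap reaches from the centre, so more of its reads fall into the zero padding and the pooled response shrinks.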
In the above data security protection integrated system 100, the multi-scale fusion module 150 is configured to perform interpolation order fusion on the first-scale semantic association feature vector and the second-scale semantic association feature vector to obtain a data feature vector to be detected. In order to make comprehensive use of semantic association features of different scales, the first-scale semantic association feature vector and the second-scale semantic association feature vector are further fused. It should be understood that the first-scale semantic association feature vector focuses more on the overall semantic association features of the data, while the second-scale semantic association feature vector focuses more on local and global semantic associations. By fusing the features of the two scales, the different features and information of both can be used together, so that the data to be detected is described and represented more comprehensively and the accuracy and robustness of sensitive data identification are improved.
In particular, in the technical solution of the present application, the first-scale semantic association feature vector and the second-scale semantic association feature vector correspond to different high-dimensional feature manifolds in the high-dimensional feature space, but in the class probability label domain they point to the same class probability label. The high-dimensional feature manifolds of the two vectors are therefore implicitly associated at the manifold expression level; that is, they exhibit smoothness and robustness at the manifold characterization level. On this basis, in the technical solution of the present application, the manifold difference and the manifold superposition expansion of the two vectors in the high-dimensional feature space are represented by their position-wise difference and position-wise point addition, respectively, and the cosine similarity between the resulting differential feature vector and point-added feature vector is used to represent the smoothness and robustness of the high-dimensional manifolds of the first-scale semantic association feature vector and the second-scale semantic association feature vector between manifold characterization levels.
Further, the cosine similarity between the differential feature vector and the point-added feature vector is taken as a weight parameter, and the first-scale semantic association feature vector and the second-scale semantic association feature vector are fused according to the following formula to obtain the data feature vector to be detected: $V_i = \alpha V_1 + (1-\alpha) V_2$, where $V_1$ represents the first-scale semantic association feature vector, $V_2$ represents the second-scale semantic association feature vector, $\alpha$ represents the weight parameter, and $V_i$ represents the data feature vector to be detected. In this way, the high-dimensional feature manifold of the data feature vector to be detected is collinear, at the geometric level, with those of the first-scale semantic association feature vector and the second-scale semantic association feature vector, but differs from them in manifold range and manifold measure, and is consistent with them under manifold transformation from an algebraic point of view. The data feature vector to be detected can therefore be obtained by feature fusion that exploits the high-dimensional implicit association between the first-scale semantic association feature vector and the second-scale semantic association feature vector, which improves the smoothness and robustness of the fused data feature vector to be detected and, in turn, the accuracy of the classification judgment obtained when the final classification feature matrix is passed through the classifier.
Accordingly, in one specific example, the multi-scale fusion module 150 includes: a difference calculation unit, configured to calculate the position-wise difference between the first-scale semantic association feature vector and the second-scale semantic association feature vector to obtain a differential feature vector; a position-wise addition unit, configured to calculate the position-wise sum of the first-scale semantic association feature vector and the second-scale semantic association feature vector to obtain a point-added feature vector; a cosine similarity calculation unit, configured to calculate the cosine similarity between the differential feature vector and the point-added feature vector; and a weighted fusion unit, configured to take the cosine similarity between the differential feature vector and the point-added feature vector as a weight parameter and fuse the first-scale semantic association feature vector and the second-scale semantic association feature vector by the following fusion formula to obtain the data feature vector to be detected; wherein the fusion formula is: $V_i = \alpha V_1 + (1-\alpha) V_2$, where $V_1$ represents the first-scale semantic association feature vector, $V_2$ represents the second-scale semantic association feature vector, $\alpha$ represents the weight parameter, and $V_i$ represents the data feature vector to be detected.
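As a non-limiting illustration, the four units of the multi-scale fusion module 150 may be sketched as follows. The sketch assumes that the two input vectors have already been brought to the same length, and it falls back to an equal blend when either intermediate vector is all zeros; neither point is specified in the text.

```python
import math


def fuse(v1, v2):
    """Fuse two equal-length vectors as alpha * v1 + (1 - alpha) * v2."""
    if len(v1) != len(v2):
        raise ValueError("both semantic association feature vectors must share one length")
    diff = [a - b for a, b in zip(v1, v2)]   # differential feature vector
    added = [a + b for a, b in zip(v1, v2)]  # point-added feature vector
    norm_d = math.sqrt(sum(x * x for x in diff))
    norm_a = math.sqrt(sum(x * x for x in added))
    if norm_d == 0.0 or norm_a == 0.0:
        alpha = 0.5  # assumption: equal blend when the cosine is undefined
    else:
        # Cosine similarity between the differential and point-added vectors.
        alpha = sum(d * a for d, a in zip(diff, added)) / (norm_d * norm_a)
    fused = [alpha * a + (1.0 - alpha) * b for a, b in zip(v1, v2)]
    return alpha, fused


# Worked example: diff = [3, -1], added = [3, 1], cosine = 8 / 10 = 0.8.
alpha, fused = fuse([3.0, 0.0], [0.0, 1.0])
```

For the worked example the fused vector is approximately [2.4, 0.2]. Because a cosine similarity lies in [-1, 1], the weight parameter can be negative when the second vector has the larger norm; the text does not state whether it is clamped.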
In the above-mentioned data security protection integrated system 100, the sensitive vocabulary semantic understanding module 160 is configured to pass the sensitive vocabulary set through a context encoder including an embedding layer to obtain a sensitive data feature vector. The sensitive vocabulary set is typically presented in text form, and sensitive words in text form cannot conveniently be compared or computed directly against the data to be detected. Therefore, a context encoder is used to mine the contextual semantic features of the sensitive vocabulary and to convert the sensitive vocabulary set into a computable and comparable vector representation for feature comparison and classification with the data to be detected.
Fig. 5 is a block diagram of a sensitive vocabulary semantic understanding module in a data security protection integrated system according to an embodiment of the present application. As shown in Fig. 5, the sensitive vocabulary semantic understanding module 160 includes: an embedding vectorization unit 161, configured to map each sensitive word in the sensitive vocabulary set into a word embedding vector by using the embedding layer of the context encoder to obtain a sequence of word embedding vectors; a semantic coding unit 162, configured to perform global context semantic coding on the sequence of word embedding vectors by using the transformer-based BERT model of the context encoder to obtain a plurality of word feature vectors; and a concatenation unit 163, configured to concatenate the plurality of word feature vectors to obtain the sensitive data feature vector.
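The data flow through units 161 to 163 can be sketched as follows. This is only a toy stand-in under stated assumptions: the embedding layer is replaced by a deterministic pseudo-random lookup, and the BERT model is replaced by a single parameter-free self-attention step, so the numbers carry no real semantics. All function names and the dimension of 4 are hypothetical.

```python
import math
import random


def embed(word, dim=4):
    """Embedding vectorization (stand-in): a deterministic vector per word."""
    rng = random.Random(word)  # seeding with the word makes the lookup repeatable
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def self_attention(vectors):
    """Semantic coding (stand-in for BERT): every output vector is a
    softmax-weighted mix of all input vectors, so each word sees the whole set."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append([sum(w * vec[i] for w, vec in zip(weights, vectors))
                        for i in range(dim)])
    return outputs


def encode_vocabulary(words, dim=4):
    """Embed each word, encode the sequence globally, then concatenate."""
    if not words:
        raise ValueError("the sensitive vocabulary set must not be empty")
    embeddings = [embed(word, dim) for word in words]
    word_features = self_attention(embeddings)
    # Concatenation unit: join the word feature vectors end to end.
    return [value for vector in word_features for value in vector]
```

For a vocabulary of n words the resulting sensitive data feature vector has n × dim entries; a real system would use a trained encoder in place of the two stand-in functions.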
In the above data security protection integrated system 100, the transfer calculation module 170 is configured to calculate a transfer matrix between the feature vector of the data to be detected and the feature vector of the sensitive data as a classification feature matrix. In order to capture semantic associations and transfer features between data to be detected and sensitive data, a transfer matrix between the data feature vectors to be detected and the sensitive data feature vectors is further calculated. It should be appreciated that the transfer matrix may be regarded as a transformation matrix mapping the data feature vectors to be detected to sensitive data feature vectors. Each element in the matrix represents a transfer relationship between a certain dimension of the feature vector of the data to be detected and a corresponding dimension of the feature vector of the sensitive data, reflecting the similarity, the difference and the semantic transfer degree between the data to be detected and the sensitive data. Based on the characteristic information, a classifier is further used for judging whether the data to be detected belongs to the sensitive data category.
Accordingly, in one specific example, the transfer calculation module 170 is configured to calculate the transfer matrix between the data feature vector to be detected and the sensitive data feature vector according to the following transfer formula;
wherein the transfer formula is: V_b = M V_a,
wherein V_a represents the data feature vector to be detected, V_b represents the sensitive data feature vector, M represents the transfer matrix, and M V_a represents the multiplication of the matrix by the vector.
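A relation in which the transfer matrix maps the data feature vector to be detected onto the sensitive data feature vector does not determine the matrix uniquely, and the description does not say how it is solved. The following sketch therefore makes an assumption: it picks the solution of smallest Frobenius norm, namely the outer product of V_b and V_a divided by the squared norm of V_a. The function names are hypothetical.

```python
def transfer_matrix(v_a, v_b):
    """Return a matrix M (len(v_b) rows, len(v_a) columns) with M v_a = v_b.

    ASSUMPTION: the minimum-norm solution M = v_b v_a^T / (v_a . v_a) is used;
    the source only states the relation the matrix has to satisfy.
    """
    squared_norm = sum(x * x for x in v_a)
    if squared_norm == 0.0:
        raise ValueError("v_a must not be the zero vector")
    return [[b * a / squared_norm for a in v_a] for b in v_b]


def apply_matrix(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]
```

With this choice, the entry in row i and column j is proportional to the product of the i-th dimension of V_b and the j-th dimension of V_a, so the matrix is of rank one; applying it to V_a with apply_matrix recovers V_b up to floating-point rounding.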
In the above data security protection integrated system 100, the detection result generating module 180 is configured to pass the classification feature matrix through a classifier to obtain a classification result, where the classification result is used to indicate whether the data to be detected is sensitive data. The classifier is a trained machine learning model. It is usually trained on labeled training data, which includes features of the data to be detected and the corresponding class labels (sensitive data or non-sensitive data). Through training, the classifier learns the association between the features and the categories and can then classify unknown data. Here, the classification feature matrix is used as the input, and a classification result indicating whether the data to be detected is sensitive data is obtained. In this way, the data to be detected is automatically classified and sensitive data is identified, and further processing or decision making can be carried out according to the classification result so as to ensure the security of the data.
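As an illustration of the classification step, the following sketch flattens the classification feature matrix, applies one linear layer, and normalizes the two scores with softmax. The description does not specify the classifier architecture, so this structure is an assumption, and the weights, bias, and label strings are hypothetical placeholders for parameters that would be learned from the labeled training data.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(matrix, weights, bias, labels=("non-sensitive", "sensitive")):
    """Flatten the matrix, score each class linearly, return (label, probabilities).

    ASSUMPTION: a single linear layer followed by softmax; `weights` holds one
    row per class and `bias` one value per class.
    """
    features = [value for row in matrix for value in row]
    if any(len(w_row) != len(features) for w_row in weights):
        raise ValueError("each weight row must match the flattened matrix size")
    logits = [sum(w * x for w, x in zip(w_row, features)) + b
              for w_row, b in zip(weights, bias)]
    probabilities = softmax(logits)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return labels[best], probabilities
```

The returned probabilities sum to one, and the label with the larger probability is reported as the classification result.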
In summary, the data security protection integrated system according to the embodiment of the application has been described. It adopts an artificial intelligence detection algorithm based on deep learning to extract feature information from the data to be detected and from the sensitive vocabulary, and then calculates a transfer matrix between the features of the data to be detected and the features of the sensitive vocabulary to represent the feature similarity of the two, so as to judge whether the data is sensitive data. Thus, a large amount of data can be processed automatically, and the accuracy and efficiency of sensitive data identification are improved.
Fig. 6 is a flowchart of a data security protection integration method according to an embodiment of the present application. As shown in Fig. 6, a data security protection integration method according to an embodiment of the present application includes the steps of: S110, acquiring data to be detected and a sensitive vocabulary set; S120, passing the data to be detected through a context encoder comprising an embedding layer to obtain a plurality of context semantic feature vectors; S130, cascading the context semantic feature vectors to obtain a first-scale semantic association feature vector; S140, two-dimensionally arranging the context semantic feature vectors into a context semantic feature matrix, and then obtaining a second-scale semantic association feature vector through a convolutional neural network model comprising a plurality of mixed convolutional layers; S150, carrying out interpolation ordered fusion on the first-scale semantic association feature vector and the second-scale semantic association feature vector to obtain a data feature vector to be detected; S160, passing the sensitive vocabulary set through a context encoder comprising an embedding layer to obtain a sensitive data feature vector; S170, calculating a transfer matrix between the data feature vector to be detected and the sensitive data feature vector as a classification feature matrix; and S180, passing the classification feature matrix through a classifier to obtain a classification result, where the classification result is used to indicate whether the data to be detected is sensitive data.
Here, it will be understood by those skilled in the art that the specific operations of the respective steps in the above-described data security integration method have been described in detail in the above description of the data security integration system with reference to fig. 1 to 5, and thus, repetitive descriptions thereof will be omitted.