
CN117474006A - Data security protection integrated system and method

Info

Publication number: CN117474006A
Application number: CN202311376336.3A
Authority: CN
Inventors: 翁武焰; 何颖; 吴慧明
Original assignee: Fujian Zhongxin Wang 'an Information Technology Co ltd
Current assignee: Fujian Zhongxin Wang 'an Information Technology Co ltd
Priority date: 2023-10-23
Filing date: 2023-10-23
Publication date: 2024-01-30

Abstract

This application relates to the field of intelligent protection technology and specifically discloses an integrated data security protection system and a method thereof. A deep-learning-based artificial intelligence detection algorithm extracts feature information from the data to be detected and from a set of sensitive words, and a transfer matrix between the features of the data to be detected and the features of the sensitive vocabulary is then calculated to represent the similarity between the two, thereby determining whether the data is sensitive data. In this way, large amounts of data can be processed automatically, improving the accuracy and efficiency of sensitive data identification.

Description

Data security protection integrated system and method thereof

Technical Field

The application relates to the technical field of intelligent protection, and more particularly, to a data security protection integrated system and a method thereof.

Background

With the rapid development of the internet, data security is becoming increasingly important, and sensitive data identification plays a significant role in it. The main purpose of sensitive data identification is to identify and mark the sensitive data stored in a system so that corresponding security measures can be taken to protect it. Sensitive data refers to data that may cause serious harm to society or to individuals if leaked. Sensitive data is also called private data and covers all information that is undisclosed or classified. It includes personal private data such as names, identity card numbers, addresses, telephone numbers, bank account numbers, email addresses, passwords, medical information and educational background, as well as enterprise private information such as the business conditions of the enterprise, customer information and trade secrets.

By means of sensitive data identification, such data can be found and marked in time, so that appropriate security measures can be taken to prevent access, leakage or abuse by unauthorized persons. Because most current data is large in volume and complex, traditional manual sorting is slow, and different people may judge the same data differently, so the results of sensitive data identification are inconsistent.

Accordingly, a data security protection integrated system and a method thereof are desired.

Disclosure of Invention

The present application has been made in order to solve the above technical problems. The embodiments of the application provide a data security protection integrated system and a method thereof, which adopt a deep-learning-based artificial intelligence detection algorithm to extract feature information from the data to be detected and from the sensitive vocabulary, and then calculate a transfer matrix between the features of the data to be detected and the features of the sensitive vocabulary to represent the feature similarity between the two, so as to judge whether the data is sensitive data. Thus, a large amount of data can be processed automatically, and the accuracy and efficiency of sensitive data identification are improved.

Accordingly, according to one aspect of the present application, there is provided a data security integrated system comprising:

a data acquisition module, configured to acquire data to be detected and a sensitive vocabulary set;

a to-be-detected data semantic understanding module, configured to pass the data to be detected through a context encoder comprising an embedding layer to obtain a plurality of context semantic feature vectors;

a first scale perception module, configured to concatenate the plurality of context semantic feature vectors to obtain a first-scale semantic association feature vector;

a second scale perception module, configured to arrange the plurality of context semantic feature vectors two-dimensionally into a context semantic feature matrix and then pass the matrix through a convolutional neural network model comprising a plurality of mixed convolutional layers to obtain a second-scale semantic association feature vector;

a multi-scale fusion module, configured to perform interpolation-based ordered fusion on the first-scale semantic association feature vector and the second-scale semantic association feature vector to obtain a data feature vector to be detected;

a sensitive vocabulary semantic understanding module, configured to pass the sensitive vocabulary set through a context encoder comprising an embedding layer to obtain a sensitive data feature vector;

a transfer calculation module, configured to calculate a transfer matrix between the data feature vector to be detected and the sensitive data feature vector as a classification feature matrix; and

a detection result generation module, configured to pass the classification feature matrix through a classifier to obtain a classification result, wherein the classification result indicates whether the data to be detected is sensitive data.

In the above data security protection integrated system, the to-be-detected data semantic understanding module includes: an embedding unit, configured to map the text data of each data item in the data to be detected into a word embedding vector using the embedding layer of the context encoder; a data adding unit, configured to append the numerical data of each data item to the tail of the word embedding vector of that data item to obtain a plurality of data item embedding vectors; and a context encoding unit, configured to perform context semantic encoding on the plurality of data item embedding vectors using a Transformer-based BERT model of the context encoder to obtain the plurality of context semantic feature vectors.

In the above data security protection integrated system, the context encoding unit includes: a one-dimensional arrangement subunit, configured to arrange the plurality of data item embedding vectors one-dimensionally to obtain a data item global embedding vector; a self-attention generation subunit, configured to calculate the product between the data item global embedding vector and the transpose of each of the plurality of data item embedding vectors to obtain a plurality of self-attention association matrices; a normalized self-attention subunit, configured to normalize each of the plurality of self-attention association matrices to obtain a plurality of normalized self-attention association matrices; a weight generation subunit, configured to pass each of the plurality of normalized self-attention association matrices through a classification function to obtain a plurality of probability values; and a weighting subunit, configured to weight each of the plurality of data item embedding vectors with the corresponding probability value as its weight to obtain the plurality of context semantic feature vectors.

In the above data security protection integrated system, the second scale perception module is configured to: use each mixed convolutional layer of the convolutional neural network model to process its input data in the forward pass of the layer as follows: performing convolutional encoding on the context semantic feature matrix using a first convolution kernel having a first size to obtain a first-scale feature map; performing convolutional encoding on the context semantic feature matrix using a second convolution kernel having a first dilation rate to obtain a second-scale feature map; performing convolutional encoding on the context semantic feature matrix using a third convolution kernel having a second dilation rate to obtain a third-scale feature map; performing convolutional encoding on the context semantic feature matrix using a fourth convolution kernel having a third dilation rate to obtain a fourth-scale feature map, wherein the first, second, third and fourth convolution kernels have the same size, and the second, third and fourth convolution kernels have different dilation rates; aggregating the first-scale, second-scale, third-scale and fourth-scale feature maps along the channel dimension to obtain an aggregated feature map; performing global pooling on each feature matrix of the aggregated feature map along the channel dimension to generate a pooled feature map; and performing activation processing on the pooled feature map to generate an activated feature map; wherein the output of the last layer of the convolutional neural network model comprising the plurality of mixed convolutional layers is the second-scale semantic association feature vector.

In the above data security protection integrated system, the multi-scale fusion module includes: a difference calculation unit, configured to calculate the position-wise difference between the first-scale semantic association feature vector and the second-scale semantic association feature vector to obtain a differential feature vector; a position-wise addition unit, configured to calculate the position-wise sum of the first-scale semantic association feature vector and the second-scale semantic association feature vector to obtain a point-added feature vector; a cosine similarity calculation unit, configured to calculate the cosine similarity between the differential feature vector and the point-added feature vector; and a weighted fusion unit, configured to take the cosine similarity between the differential feature vector and the point-added feature vector as a weight parameter and fuse the first-scale semantic association feature vector and the second-scale semantic association feature vector by the following fusion formula to obtain the data feature vector to be detected; wherein the fusion formula is: $V_i = \alpha V_1 + (1-\alpha) V_2$, where $V_1$ denotes the first-scale semantic association feature vector, $V_2$ denotes the second-scale semantic association feature vector, $\alpha$ denotes the weight parameter, and $V_i$ denotes the data feature vector to be detected.
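The fusion steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `fuse_scales`, the small `eps` guard and the toy vectors are assumptions, and the sketch assumes the two vectors have already been brought to the same length, which the text does not spell out.

```python
import numpy as np


def fuse_scales(v1: np.ndarray, v2: np.ndarray, eps: float = 1e-12):
    """Fuse two equal-length feature vectors as alpha*v1 + (1 - alpha)*v2."""
    diff = v1 - v2    # differential feature vector (position-wise difference)
    added = v1 + v2   # point-added feature vector (position-wise sum)
    # Cosine similarity between the two; eps (an assumption) guards a zero norm.
    alpha = float(diff @ added) / (np.linalg.norm(diff) * np.linalg.norm(added) + eps)
    return alpha * v1 + (1.0 - alpha) * v2, alpha


# Toy vectors chosen only for illustration.
v1 = np.array([3.0, 0.0])
v2 = np.array([0.0, 1.0])
fused, alpha = fuse_scales(v1, v2)
```

For these toy vectors the weight comes out at about 0.8, so the fused vector lies on the line through `v1` and `v2`, closer to `v1`.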

In the above data security protection integrated system, the sensitive vocabulary semantic understanding module includes: an embedding vectorization unit, configured to map each sensitive word in the sensitive vocabulary set into a word embedding vector using the embedding layer of the context encoder to obtain a sequence of word embedding vectors; a semantic encoding unit, configured to perform global context semantic encoding on the sequence of word embedding vectors using a Transformer-based BERT model of the context encoder to obtain a plurality of word feature vectors; and a concatenation unit, configured to concatenate the plurality of word feature vectors to obtain the sensitive data feature vector.

In the above data security protection integrated system, the transfer calculation module is configured to: calculating a transfer matrix between the data feature vector to be detected and the sensitive data feature vector according to the following transfer formula;

wherein, the transfer formula is:

where $V_a$ denotes the data feature vector to be detected, $V_b$ denotes the sensitive data feature vector, $M$ denotes the transfer matrix, and the product in the formula denotes multiplication of the matrix by a vector.
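The text does not state how the transfer matrix is solved for, so the following is only a sketch under explicit assumptions: it assumes the relation $V_a = M V_b$ (the direction is an assumption) and picks the minimum-norm matrix that satisfies it, which is a rank-one outer product. The function name `transfer_matrix` and the toy vectors are hypothetical.

```python
import numpy as np


def transfer_matrix(v_a: np.ndarray, v_b: np.ndarray) -> np.ndarray:
    """Minimum-norm M with M @ v_b == v_a (assumed relation, not from the source)."""
    denom = float(v_b @ v_b)
    if denom == 0.0:
        raise ValueError("v_b must be non-zero")
    # Rank-one solution: M = v_a v_b^T / (v_b^T v_b).
    return np.outer(v_a, v_b) / denom


# Toy vectors of different lengths, for illustration only.
v_a = np.array([1.0, 2.0, 3.0])   # stands in for the data feature vector to be detected
v_b = np.array([0.5, -1.0])       # stands in for the sensitive data feature vector
M = transfer_matrix(v_a, v_b)
```

Under this assumption the two vectors need not have the same length, and `M` has one row per entry of `v_a` and one column per entry of `v_b`.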

According to another aspect of the present application, there is provided a data security protection integration method, including:

acquiring data to be detected and a sensitive vocabulary set;

passing the data to be detected through a context encoder comprising an embedding layer to obtain a plurality of context semantic feature vectors;

concatenating the plurality of context semantic feature vectors to obtain a first-scale semantic association feature vector;

arranging the plurality of context semantic feature vectors two-dimensionally into a context semantic feature matrix, and then passing the matrix through a convolutional neural network model comprising a plurality of mixed convolutional layers to obtain a second-scale semantic association feature vector;

performing interpolation-based ordered fusion on the first-scale semantic association feature vector and the second-scale semantic association feature vector to obtain a data feature vector to be detected;

passing the sensitive vocabulary set through a context encoder comprising an embedding layer to obtain a sensitive data feature vector;

calculating a transfer matrix between the data feature vector to be detected and the sensitive data feature vector as a classification feature matrix; and

passing the classification feature matrix through a classifier to obtain a classification result, wherein the classification result indicates whether the data to be detected is sensitive data.

In the above data security protection integration method, passing the data to be detected through a context encoder comprising an embedding layer to obtain a plurality of context semantic feature vectors includes: mapping the text data of each data item in the data to be detected into a word embedding vector using the embedding layer of the context encoder; appending the numerical data of each data item to the tail of the word embedding vector of that data item to obtain a plurality of data item embedding vectors; and performing context semantic encoding on the plurality of data item embedding vectors using a Transformer-based BERT model of the context encoder to obtain the plurality of context semantic feature vectors.

In the above data security protection integration method, performing context semantic encoding on the plurality of data item embedding vectors using the Transformer-based BERT model of the context encoder to obtain the plurality of context semantic feature vectors includes: arranging the plurality of data item embedding vectors one-dimensionally to obtain a data item global embedding vector; calculating the product between the data item global embedding vector and the transpose of each of the plurality of data item embedding vectors to obtain a plurality of self-attention association matrices; normalizing each of the plurality of self-attention association matrices to obtain a plurality of normalized self-attention association matrices; passing each of the plurality of normalized self-attention association matrices through a classification function to obtain a plurality of probability values; and weighting each of the plurality of data item embedding vectors with the corresponding probability value as its weight to obtain the plurality of context semantic feature vectors.

Compared with the prior art, the data security protection integrated system and the method thereof provided by the application adopt a deep-learning-based artificial intelligence detection algorithm to extract feature information from the data to be detected and from the sensitive vocabulary, and then calculate a transfer matrix between the features of the data to be detected and the features of the sensitive vocabulary to represent the feature similarity between the two, so as to judge whether the data is sensitive data. Thus, a large amount of data can be processed automatically, and the accuracy and efficiency of sensitive data identification are improved.

Drawings

The foregoing and other objects, features and advantages of the present application will become more apparent from the following more particular description of embodiments of the present application, as illustrated in the accompanying drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application, are incorporated in and constitute a part of this specification, and serve to illustrate the application together with its embodiments without limiting the application. In the drawings, like reference numerals generally refer to like parts or steps.

Fig. 1 is a block diagram of a data security integrated system according to an embodiment of the present application.

Fig. 2 is a schematic architecture diagram of a data security protection integrated system according to an embodiment of the present application.

Fig. 3 is a block diagram of a data semantic understanding module to be detected in the data security protection integrated system according to an embodiment of the present application.

Fig. 4 is a block diagram of a context encoding unit in a data security integrated system according to an embodiment of the present application.

Fig. 5 is a block diagram of a sensitive vocabulary semantic understanding module in a data security integrated system according to an embodiment of the present application.

Fig. 6 is a flowchart of a data security protection integration method according to an embodiment of the present application.

Detailed Description

Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application and not all of the embodiments of the present application, and it should be understood that the present application is not limited by the example embodiments described herein.

Fig. 1 is a block diagram of a data security integrated system according to an embodiment of the present application. As shown in fig. 1, a data security integrated system 100 according to an embodiment of the present application includes: the data acquisition module 110 is configured to acquire data to be detected and a sensitive vocabulary set; the to-be-detected data semantic understanding module 120 is configured to pass the to-be-detected data through a context encoder including an embedded layer to obtain a plurality of context semantic feature vectors; a first scale perception module 130, configured to concatenate the plurality of context semantic feature vectors to obtain a first scale semantic association feature vector; the second scale perception module 140 is configured to two-dimensionally arrange the plurality of context semantic feature vectors into a context semantic feature matrix, and then obtain a second scale semantic association feature vector through a convolutional neural network model including a plurality of hybrid convolutional layers; the multi-scale fusion module 150 is configured to perform interpolation order fusion on the first-scale semantic association feature vector and the second-scale semantic association feature vector to obtain a data feature vector to be detected; a sensitive vocabulary semantic understanding module 160, configured to pass the sensitive vocabulary set through a context encoder including an embedded layer to obtain a sensitive data feature vector; a transfer calculation module 170, configured to calculate a transfer matrix between the feature vector of the data to be detected and the feature vector of the sensitive data as a classification feature matrix; the detection result generating module 180 is configured to pass the classification feature matrix through a classifier to obtain a classification result, where the classification result is used to indicate whether the detection data is sensitive data.

Fig. 2 is a schematic architecture diagram of a data security protection integrated system according to an embodiment of the present application. As shown in Fig. 2, the data to be detected and a sensitive vocabulary set are first acquired. The data to be detected is then passed through a context encoder comprising an embedding layer to obtain a plurality of context semantic feature vectors. Next, the plurality of context semantic feature vectors are concatenated to obtain a first-scale semantic association feature vector. At the same time, the plurality of context semantic feature vectors are arranged two-dimensionally into a context semantic feature matrix, which is passed through a convolutional neural network model comprising a plurality of mixed convolutional layers to obtain a second-scale semantic association feature vector. The first-scale semantic association feature vector and the second-scale semantic association feature vector are then combined by interpolation-based ordered fusion to obtain a data feature vector to be detected. Separately, the sensitive vocabulary set is passed through a context encoder comprising an embedding layer to obtain a sensitive data feature vector. A transfer matrix between the data feature vector to be detected and the sensitive data feature vector is then calculated as a classification feature matrix. Finally, the classification feature matrix is passed through a classifier to obtain a classification result, which indicates whether the data to be detected is sensitive data.
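The last step of the flow above, passing the classification feature matrix through a classifier, might look like the following. The text does not describe the classifier's internals at this point, so this sketch assumes a common arrangement: flatten the matrix, apply one fully connected layer, and take a softmax over two classes. The weights here are random and untrained, and every name (`classify`, `weights`, `bias`) is hypothetical.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def classify(feature_matrix: np.ndarray, weights: np.ndarray, bias: np.ndarray):
    """Flatten the matrix, apply a linear layer, and return (label, probabilities)."""
    logits = weights @ feature_matrix.reshape(-1) + bias
    probs = softmax(logits)
    return int(np.argmax(probs)), probs


rng = np.random.default_rng(0)
feature_matrix = rng.normal(size=(3, 2))             # stand-in classification feature matrix
weights = rng.normal(size=(2, feature_matrix.size))  # untrained, for illustration only
bias = np.zeros(2)
label, probs = classify(feature_matrix, weights, bias)  # label 1 could mean "sensitive"
```

In practice the weights would be learned from labelled examples; the mapping of label values to "sensitive" and "not sensitive" is likewise an assumption.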

In the above data security protection integrated system 100, the data acquisition module 110 is configured to acquire data to be detected and a sensitive vocabulary set. As mentioned in the background above, sensitive data identification is of great importance in data security protection; its main purpose is to identify and mark the sensitive data stored in a system so that corresponding security measures can be taken against access, leakage or abuse by unauthorized persons. However, most current data is large in volume and complex, traditional manual sorting is slow, and different people may judge the same data differently, so the results of sensitive data identification are inconsistent. Therefore, an efficient and accurate sensitive data identification scheme is desired.

Accordingly, in order to achieve both accuracy and efficiency when identifying sensitive data, the identification can be based on the feature similarity, in a high-dimensional space, between the implicit features of the data to be detected and the implicit features of the sensitive vocabulary. In other words, in the technical solution of the application, a deep-learning-based artificial intelligence detection algorithm is adopted to extract feature information from the data to be detected and from the sensitive vocabulary, and a transfer matrix between the features of the data to be detected and the features of the sensitive vocabulary is then calculated to represent the feature similarity between the two, so as to judge whether the data is sensitive data. Thus, a large amount of data can be processed automatically, and the accuracy and efficiency of sensitive data identification are improved. Specifically, in the technical solution of the application, the data to be detected and a sensitive vocabulary set are first acquired.

In the above data security protection integrated system 100, the to-be-detected data semantic understanding module 120 is configured to pass the data to be detected through a context encoder comprising an embedding layer to obtain a plurality of context semantic feature vectors. In sensitive data identification, the context information of the data is critical to accurately determining whether the data is sensitive. Thus, to capture the context information and semantic association features in the data to be detected, the data to be detected is processed by a context encoder comprising an embedding layer. It should be appreciated that a context encoder is a model for converting text data into a continuous vector representation. First, each word or character in the text data is mapped to an embedding vector by the embedding layer. Then, the Transformer of the context encoder applies a self-attention mechanism to the plurality of embedding vectors and converts them into a plurality of context semantic feature vectors, so that the semantic association features of the data are captured comprehensively and the accuracy of sensitive data identification is improved.

Fig. 3 is a block diagram of the to-be-detected data semantic understanding module in the data security protection integrated system according to an embodiment of the present application. As shown in Fig. 3, the to-be-detected data semantic understanding module 120 includes: an embedding unit 121, configured to map the text data of each data item in the data to be detected into a word embedding vector using the embedding layer of the context encoder; a data adding unit 122, configured to append the numerical data of each data item to the tail of the word embedding vector of that data item to obtain a plurality of data item embedding vectors; and a context encoding unit 123, configured to perform context semantic encoding on the plurality of data item embedding vectors using a Transformer-based BERT model of the context encoder to obtain the plurality of context semantic feature vectors.
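The embedding unit and the data adding unit can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the toy vocabulary, the random embedding table, the mean pooling of token embeddings, and the names `embed_text` and `data_item_embedding`. A real system would use the embedding layer of the trained context encoder instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and embedding table (not from the source).
VOCAB = {"name": 0, "phone": 1, "balance": 2, "<unk>": 3}
EMBED_DIM = 4
EMBEDDING_TABLE = rng.normal(size=(len(VOCAB), EMBED_DIM))


def embed_text(text: str) -> np.ndarray:
    """Map the text of one data item to a word embedding vector.

    Mean pooling over token embeddings is an assumption of this sketch.
    """
    ids = [VOCAB.get(tok, VOCAB["<unk>"]) for tok in text.lower().split()]
    return EMBEDDING_TABLE[ids].mean(axis=0)


def data_item_embedding(text: str, numeric_value: float) -> np.ndarray:
    """Append the item's numerical data to the tail of its word embedding vector."""
    return np.concatenate([embed_text(text), [numeric_value]])


# Each data item pairs a text field with one numerical value (toy data).
items = [("name", 0.0), ("phone", 13800.0), ("balance", 2500.5)]
item_vectors = [data_item_embedding(text, value) for text, value in items]
```

Each resulting vector has `EMBED_DIM + 1` entries, with the numerical value in the last position.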

Fig. 4 is a block diagram of the context encoding unit in the data security protection integrated system according to an embodiment of the present application. As shown in Fig. 4, the context encoding unit 123 includes: a one-dimensional arrangement subunit 1231, configured to arrange the plurality of data item embedding vectors one-dimensionally to obtain a data item global embedding vector; a self-attention generation subunit 1232, configured to calculate the product between the data item global embedding vector and the transpose of each of the plurality of data item embedding vectors to obtain a plurality of self-attention association matrices; a normalized self-attention subunit 1233, configured to normalize each of the plurality of self-attention association matrices to obtain a plurality of normalized self-attention association matrices; a weight generation subunit 1234, configured to pass each of the plurality of normalized self-attention association matrices through a classification function to obtain a plurality of probability values; and a weighting subunit 1235, configured to weight each of the plurality of data item embedding vectors with the corresponding probability value as its weight to obtain the plurality of context semantic feature vectors.
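A literal reading of the five subunits can be sketched as below. Several details are not specified in the text and are filled with labelled assumptions: the normalization is taken to be a scaling by the square root of the global vector length, each normalized matrix is reduced to one score by averaging its entries, and the classification function is taken to be a softmax across data items. The function names are hypothetical, and this is not a full BERT encoder.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def context_encode(item_vectors):
    """Weight each data item embedding by a self-attention-style probability."""
    x = np.stack(item_vectors).astype(float)   # shape (n_items, dim)
    n_items, _ = x.shape
    global_vec = x.reshape(-1)                 # one-dimensional arrangement
    scores = np.empty(n_items)
    for i, item in enumerate(x):
        # Product of the global vector with the transposed item vector.
        assoc = np.outer(global_vec, item)     # shape (n_items * dim, dim)
        # Assumed normalization: scale by sqrt of the global vector length.
        assoc = assoc / np.sqrt(global_vec.size)
        # Assumed reduction of a matrix to one score: mean of its entries.
        scores[i] = assoc.mean()
    weights = softmax(scores)                  # one probability value per item
    return weights[:, None] * x, weights


rng = np.random.default_rng(0)
toy_items = [rng.normal(size=4) for _ in range(3)]
context_vectors, weights = context_encode(toy_items)
```

The output keeps one context semantic feature vector per data item, each a scaled copy of its input embedding.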

In the above data security protection integrated system 100, the first scale perception module 130 is configured to concatenate the plurality of context semantic feature vectors to obtain a first-scale semantic association feature vector. In order to take the semantic information in all of the context semantic feature vectors into account and capture the semantic association features of the data to be detected globally, the context semantic feature vectors need to be fused. The plurality of context semantic feature vectors are integrated by a concatenation operation to form a global feature representation, which better reflects the overall semantic association features of the data to be detected.

In the above data security protection integrated system 100, the second scale perception module 140 is configured to arrange the plurality of context semantic feature vectors two-dimensionally into a context semantic feature matrix and then pass the matrix through a convolutional neural network model comprising a plurality of mixed convolutional layers to obtain a second-scale semantic association feature vector. To further extract the local and global semantic association features of the data, the plurality of context semantic feature vectors are further processed using a convolutional neural network model. It should be understood that arranging the plurality of context semantic feature vectors into a context semantic feature matrix organizes the different pieces of context semantic information in a two-dimensional space, which helps the model understand the relationships between contexts and extract local semantic association features. Meanwhile, convolutional neural networks have good feature extraction capability in image and text processing, and a mixed convolutional layer can capture the association between local detail and global context at the same time. Therefore, by applying convolution operations to the context semantic feature matrix, semantic association features of different scales can be extracted from it effectively.
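The two arrangements of the same context semantic feature vectors, concatenation for the first scale and a two-dimensional matrix for the second scale, differ only in shape. A minimal sketch with toy values (the variable names are hypothetical, and stacking one vector per row is an assumption about the two-dimensional layout):

```python
import numpy as np

# Two toy context semantic feature vectors of equal length.
context_vectors = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])]

# First scale: concatenate into one long vector.
first_scale_vector = np.concatenate(context_vectors)       # shape (6,)

# Second scale input: one vector per row of a 2-D matrix.
context_feature_matrix = np.stack(context_vectors, axis=0)  # shape (2, 3)
```

The matrix form keeps each vector on its own row, which is what lets a two-dimensional convolution see neighbouring positions both within a vector and across vectors.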

Accordingly, in one specific example, the second scale perception module 140 is configured to: use each mixed convolutional layer of the convolutional neural network model to process its input data in the forward pass of the layer as follows: performing convolutional encoding on the context semantic feature matrix using a first convolution kernel having a first size to obtain a first-scale feature map; performing convolutional encoding on the context semantic feature matrix using a second convolution kernel having a first dilation rate to obtain a second-scale feature map; performing convolutional encoding on the context semantic feature matrix using a third convolution kernel having a second dilation rate to obtain a third-scale feature map; performing convolutional encoding on the context semantic feature matrix using a fourth convolution kernel having a third dilation rate to obtain a fourth-scale feature map, wherein the first, second, third and fourth convolution kernels have the same size, and the second, third and fourth convolution kernels have different dilation rates; aggregating the first-scale, second-scale, third-scale and fourth-scale feature maps along the channel dimension to obtain an aggregated feature map; performing global pooling on each feature matrix of the aggregated feature map along the channel dimension to generate a pooled feature map; and performing activation processing on the pooled feature map to generate an activated feature map; wherein the output of the last layer of the convolutional neural network model comprising the plurality of mixed convolutional layers is the second-scale semantic association feature vector.
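One mixed convolutional layer as described above can be sketched in plain NumPy. The specifics are assumptions chosen for illustration: 3×3 kernels, dilation rates of 1, 2, 3 and 4 for the four branches, zero padding that preserves the input size, global average pooling per channel, and ReLU as the activation. The names `dilated_conv2d` and `mixed_conv_layer` are hypothetical.

```python
import numpy as np


def dilated_conv2d(x: np.ndarray, kernel: np.ndarray, dilation: int = 1) -> np.ndarray:
    """Same-size 2-D cross-correlation with a square, odd-sized dilated kernel."""
    k = kernel.shape[0]
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, pad)                      # zero padding on all sides
    h, w = x.shape
    out = np.zeros((h, w))
    for i in range(k):
        for j in range(k):
            r, c = i * dilation, j * dilation
            out += kernel[i, j] * xp[r:r + h, c:c + w]
    return out


def mixed_conv_layer(x, kernels, dilations=(1, 2, 3, 4)):
    """Four same-size kernels with different dilation rates, then pool and activate."""
    maps = [dilated_conv2d(x, k, d) for k, d in zip(kernels, dilations)]
    aggregated = np.stack(maps, axis=0)      # aggregate along the channel dimension
    pooled = aggregated.mean(axis=(1, 2))    # global average pooling per channel
    return np.maximum(pooled, 0.0)           # ReLU activation (an assumption)


rng = np.random.default_rng(0)
feature_matrix = rng.normal(size=(5, 6))     # stand-in context semantic feature matrix
kernels = [rng.normal(size=(3, 3)) for _ in range(4)]
activated = mixed_conv_layer(feature_matrix, kernels)
```

Larger dilation rates widen the receptive field of a kernel without adding weights, which is how the four branches see the same matrix at different scales.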

In the above data security protection integrated system 100, the multi-scale fusion module 150 is configured to perform interpolation-based ordered fusion on the first-scale semantic association feature vector and the second-scale semantic association feature vector to obtain a data feature vector to be detected. In order to make full use of semantic association features of different scales, the first-scale semantic association feature vector and the second-scale semantic association feature vector are fused. It should be understood that the first-scale semantic association feature vector focuses more on the overall semantic association features of the data, whereas the second-scale semantic association feature vector focuses more on the association between local and global semantics. By fusing the features of the two scales, the distinct information carried by each can be used together, so that the data to be detected is described and represented more comprehensively and the accuracy and robustness of sensitive data identification are improved.

In particular, in the technical solution of the present application, the first-scale semantic association feature vector and the second-scale semantic association feature vector represent different high-dimensional feature manifolds in the high-dimensional feature space, but in the class probability label domain both point to the same class probability label. The high-dimensional feature manifolds of the two vectors are therefore implicitly associated at the manifold expression level; that is, they exhibit smoothness and robustness at the manifold characterization level. On this basis, the manifold difference and the manifold superposition of the two vectors in the high-dimensional feature space are represented by their position-wise difference and position-wise addition, respectively, and the cosine similarity between the resulting differential feature vector and point-added feature vector is used to represent the smoothness and robustness of the two high-dimensional manifolds at the manifold characterization level.
Further, the cosine similarity between the differential feature vector and the point-added feature vector is taken as a weight parameter, and the first-scale semantic association feature vector and the second-scale semantic association feature vector are fused according to the following formula to obtain the data feature vector to be detected: $V_i = \alpha V_1 + (1-\alpha) V_2$, where $V_1$ represents the first-scale semantic association feature vector, $V_2$ represents the second-scale semantic association feature vector, $\alpha$ represents the weight parameter, and $V_i$ represents the data feature vector to be detected. In this way, the high-dimensional feature manifold of the data feature vector to be detected is collinear at the geometric level with those of the first-scale and second-scale semantic association feature vectors, differing from them only in manifold range and manifold measure, and is consistent with them under manifold transformation from the algebraic point of view. The fusion thus exploits the high-dimensional implicit association between the first-scale semantic association feature vector and the second-scale semantic association feature vector to improve the smoothness and robustness of the fused data feature vector to be detected, and thereby improves the accuracy of the classification judgment made by the classifier on the final classification feature matrix.

Accordingly, in one specific example, the multi-scale fusion module 150 includes: a difference calculation unit, configured to calculate the position-wise difference between the first-scale semantic association feature vector and the second-scale semantic association feature vector to obtain a differential feature vector; a position-wise addition unit, configured to calculate the position-wise sum of the first-scale semantic association feature vector and the second-scale semantic association feature vector to obtain a point-added feature vector; a cosine similarity calculation unit, configured to calculate the cosine similarity between the differential feature vector and the point-added feature vector; and a weighted fusion unit, configured to use the cosine similarity between the differential feature vector and the point-added feature vector as a weight parameter and to fuse the first-scale semantic association feature vector and the second-scale semantic association feature vector by the following fusion formula to obtain the data feature vector to be detected; wherein the fusion formula is: $V_i = \alpha V_1 + (1-\alpha) V_2$, where $V_1$ represents the first-scale semantic association feature vector, $V_2$ represents the second-scale semantic association feature vector, $\alpha$ represents the weight parameter, and $V_i$ represents the data feature vector to be detected.
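
The four units above can be sketched as follows. This is an illustrative sketch, assuming the two feature vectors have already been brought to the same length; the function names `cosine_similarity` and `fuse` and the handling of zero-norm vectors are assumptions of the example, not part of the present application.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is treated as having zero similarity
    return dot / (norm_u * norm_v)


def fuse(v1, v2):
    """Return (V_i, alpha) with V_i = alpha * V_1 + (1 - alpha) * V_2."""
    if len(v1) != len(v2):
        raise ValueError("both feature vectors must have the same length")
    diff = [a - b for a, b in zip(v1, v2)]    # differential feature vector
    added = [a + b for a, b in zip(v1, v2)]   # point-added feature vector
    alpha = cosine_similarity(diff, added)    # weight parameter
    fused = [alpha * a + (1.0 - alpha) * b for a, b in zip(v1, v2)]
    return fused, alpha
```

Since the inner product of the difference and the sum equals the difference of the squared norms of the two vectors, the weight in this sketch lies in [-1, 1] and is zero when the two vectors have equal norm, in which case the fused vector equals the second-scale vector.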

In the above-mentioned data security protection integrated system 100, the sensitive vocabulary semantic understanding module 160 is configured to pass the sensitive vocabulary set through a context encoder including an embedding layer to obtain a sensitive data feature vector. The sensitive vocabulary set is typically presented in text form, and sensitive words in text form cannot conveniently be compared and computed against the data to be detected directly. A context encoder is therefore used to mine the contextual semantic features of the sensitive vocabulary and to convert the sensitive vocabulary set into a computable and comparable vector representation for feature comparison and classification with the data to be detected.

Fig. 5 is a block diagram of the sensitive vocabulary semantic understanding module in the data security protection integrated system according to an embodiment of the present application. As shown in Fig. 5, the sensitive vocabulary semantic understanding module 160 includes: an embedding vectorization unit 161, configured to map each sensitive word in the sensitive vocabulary set into a word embedding vector using the embedding layer of the context encoder to obtain a sequence of word embedding vectors; a semantic encoding unit 162, configured to perform global context semantic encoding on the sequence of word embedding vectors using the transformer-based BERT model of the context encoder to obtain a plurality of word feature vectors; and a concatenation unit 163, configured to concatenate the plurality of word feature vectors to obtain the sensitive data feature vector.

In the above data security protection integrated system 100, the transfer calculation module 170 is configured to calculate a transfer matrix between the feature vector of the data to be detected and the feature vector of the sensitive data as a classification feature matrix. In order to capture semantic associations and transfer features between data to be detected and sensitive data, a transfer matrix between the data feature vectors to be detected and the sensitive data feature vectors is further calculated. It should be appreciated that the transfer matrix may be regarded as a transformation matrix mapping the data feature vectors to be detected to sensitive data feature vectors. Each element in the matrix represents a transfer relationship between a certain dimension of the feature vector of the data to be detected and a corresponding dimension of the feature vector of the sensitive data, reflecting the similarity, the difference and the semantic transfer degree between the data to be detected and the sensitive data. Based on the characteristic information, a classifier is further used for judging whether the data to be detected belongs to the sensitive data category.
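
One way to picture such a transfer matrix is sketched below. The sketch assumes, following the description above, that the transfer matrix $M$ maps the data feature vector to be detected $V_a$ onto the sensitive data feature vector $V_b$, i.e. $V_b = M V_a$. Because many matrices satisfy this relation, the example picks the minimum-norm solution $M = V_b V_a^{\top} / (V_a^{\top} V_a)$; both the direction of the mapping and this particular choice are assumptions of the example rather than the transfer formula of the present application, and the names `transfer_matrix` and `matvec` are hypothetical.

```python
def transfer_matrix(v_a, v_b):
    """Minimum-norm M with M @ v_a == v_b (assumed relation, see lead-in)."""
    norm_sq = sum(x * x for x in v_a)
    if norm_sq == 0.0:
        raise ValueError("v_a must be a non-zero vector")
    # Outer product of v_b and v_a, scaled by 1 / (v_a . v_a).
    return [[b * a / norm_sq for a in v_a] for b in v_b]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]
```

For example, `matvec(transfer_matrix(v_a, v_b), v_a)` reproduces `v_b` up to floating-point error, and the resulting matrix has one row per dimension of `v_b` and one column per dimension of `v_a`.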

Accordingly, in one specific example, the transfer calculation module 170 is configured to: calculating a transfer matrix between the data feature vector to be detected and the sensitive data feature vector according to the following transfer formula;

Wherein, the transfer formula is:

In the above data security protection integrated system 100, the detection result generating module 180 is configured to pass the classification feature matrix through a classifier to obtain a classification result, where the classification result is used to indicate whether the data to be detected is sensitive data. The classifier is a trained machine learning model. Its training is usually performed on labeled training data, which includes features of the data to be detected and the corresponding class labels (sensitive data or non-sensitive data). Through training, the classifier learns the association between features and categories and can then classify unseen data. Here, the classification feature matrix is used as the input, and a classification result indicating whether the data to be detected is sensitive data is obtained. Automatic classification and sensitive data identification are thereby realized for the data to be detected, and further processing or decision making can be carried out according to the classification result to ensure the security of the data.
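
The classification step can be sketched as follows. The present application only states that a trained classifier is used; the flattening of the matrix, the single fully connected layer, the Softmax function, the label names and the `weights` and `biases` parameters (which would come from training) are all assumptions made for this illustration.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(matrix, weights, biases, labels=("non-sensitive", "sensitive")):
    """Flatten the classification feature matrix, apply one linear layer, then Softmax."""
    flat = [x for row in matrix for x in row]
    logits = [sum(w * x for w, x in zip(w_row, flat)) + b
              for w_row, b in zip(weights, biases)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs
```

The returned label corresponds to the class with the highest probability, and the probability list can be kept alongside it when a downstream policy needs a confidence threshold.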

In summary, the data security protection integrated system according to the embodiment of the present application has been described. It adopts an artificial intelligence detection algorithm based on deep learning to extract feature information from the data to be detected and from the sensitive vocabulary, and then calculates a transfer matrix between the features of the data to be detected and the features of the sensitive vocabulary to represent the similarity between the two, thereby judging whether the data is sensitive data. In this way, large amounts of data can be processed automatically, and the accuracy and efficiency of sensitive data identification are improved.

Fig. 6 is a flowchart of a data security protection integration method according to an embodiment of the present application. As shown in Fig. 6, the data security protection integration method according to an embodiment of the present application includes the steps of: S110, acquiring data to be detected and a sensitive vocabulary set; S120, passing the data to be detected through a context encoder comprising an embedding layer to obtain a plurality of context semantic feature vectors; S130, concatenating the plurality of context semantic feature vectors to obtain a first-scale semantic association feature vector; S140, arranging the plurality of context semantic feature vectors two-dimensionally into a context semantic feature matrix, and then obtaining a second-scale semantic association feature vector through a convolutional neural network model comprising a plurality of mixed convolution layers; S150, performing interpolation and ordered fusion on the first-scale semantic association feature vector and the second-scale semantic association feature vector to obtain a data feature vector to be detected; S160, passing the sensitive vocabulary set through a context encoder comprising an embedding layer to obtain a sensitive data feature vector; S170, calculating a transfer matrix between the data feature vector to be detected and the sensitive data feature vector as a classification feature matrix; and S180, passing the classification feature matrix through a classifier to obtain a classification result, where the classification result is used to indicate whether the data to be detected is sensitive data.

Here, it will be understood by those skilled in the art that the specific operations of the respective steps in the above-described data security integration method have been described in detail in the above description of the data security integration system with reference to fig. 1 to 5, and thus, repetitive descriptions thereof will be omitted.

Claims

1. An integrated data security protection system, characterized by including:

A data collection module, used to obtain data to be detected and a sensitive vocabulary set;

A semantic understanding module for the data to be detected, used to pass the data to be detected through a context encoder including an embedding layer to obtain multiple contextual semantic feature vectors;

A first scale perception module, configured to concatenate the plurality of contextual semantic feature vectors to obtain a first scale semantic association feature vector;

A second scale perception module, used to two-dimensionally arrange the plurality of contextual semantic feature vectors into a contextual semantic feature matrix and then pass it through a convolutional neural network model including multiple hybrid convolutional layers to obtain a second scale semantic association feature vector;

A multi-scale fusion module, configured to interpolate and orderly fuse the first-scale semantic correlation feature vector and the second-scale semantic correlation feature vector to obtain a feature vector of the data to be detected;

A sensitive vocabulary semantic understanding module, used to pass the sensitive vocabulary set through a context encoder including an embedding layer to obtain a sensitive data feature vector;

A transfer calculation module, used to calculate the transfer matrix between the feature vector of the data to be detected and the feature vector of the sensitive data as a classification feature matrix;

A detection result generation module, used to pass the classification feature matrix through a classifier to obtain a classification result, where the classification result is used to indicate whether the data to be detected is sensitive data.

2. The data security protection integrated system according to claim 1, characterized in that the semantic understanding module of the data to be detected includes:

An embedding unit, configured to use the embedding layer of the context encoder to map the text data of each data item in the data to be detected into word embedding vectors;

A data adding unit, configured to add the numerical data in each data item to the end of the word embedding vector of each data item to obtain multiple data item embedding vectors;

A context encoding unit, configured to use the transformer-based BERT model of the context encoder to perform contextual semantic encoding on the plurality of data item embedding vectors to obtain the plurality of contextual semantic feature vectors.

3. The integrated data security protection system according to claim 2, characterized in that the context encoding unit includes:

A one-dimensional arrangement subunit, used to arrange the plurality of data item embedding vectors in one dimension to obtain a data item global embedding vector;

A self-attention generation subunit, used to calculate the product between the data item global embedding vector and the transposed vector of each data item embedding vector in the plurality of data item embedding vectors to obtain multiple self-attention correlation matrices;

A standardized self-attention subunit, used to perform standardization processing on each of the plurality of self-attention correlation matrices to obtain a plurality of standardized self-attention correlation matrices;

A weight generation subunit, used to pass each of the standardized self-attention correlation matrices among the multiple standardized self-attention correlation matrices through a classification function to obtain multiple probability values;

A weighting subunit, configured to weight each of the plurality of data item embedding vectors using each of the plurality of probability values as a weight to obtain the plurality of contextual semantic feature vectors.

4. The data security protection integrated system according to claim 3, characterized in that the second scale perception module is used to process the input data in the forward pass of each hybrid convolutional layer of the convolutional neural network model as follows:

Convolutionally encode the contextual semantic feature matrix using a first convolution kernel having a first size to obtain a first scale feature map;

Convolutionally encode the contextual semantic feature matrix using a second convolution kernel with a first dilation rate to obtain a second scale feature map;

Convolutionally encode the contextual semantic feature matrix using a third convolution kernel with a second dilation rate to obtain a third scale feature map;

Convolutionally encode the contextual semantic feature matrix using a fourth convolution kernel with a third dilation rate to obtain a fourth scale feature map, wherein the first convolution kernel, the second convolution kernel, the third convolution kernel and the fourth convolution kernel have the same size, and the second convolution kernel, the third convolution kernel and the fourth convolution kernel have different dilation rates;

Aggregate the first scale feature map, the second scale feature map, the third scale feature map and the fourth scale feature map along the channel dimension to obtain an aggregate feature map;

Perform global pooling processing on each feature matrix along the channel dimension on the aggregated feature map to generate a pooled feature map;

Perform activation processing on the pooled feature map to generate an activation feature map;

Wherein, the output of the last layer of the convolutional neural network model including multiple hybrid convolutional layers is the second scale semantic association feature vector.

5. The integrated data security protection system according to claim 4, characterized in that the multi-scale fusion module includes:

A difference calculation unit configured to calculate the position-wise difference between the first-scale semantic correlation feature vector and the second-scale semantic correlation feature vector to obtain a difference feature vector;

A position-wise addition unit, used to calculate the position-wise sum of the first scale semantic correlation feature vector and the second scale semantic correlation feature vector to obtain a point-added feature vector;

A cosine similarity calculation unit, used to calculate the cosine similarity between the differential feature vector and the point plus feature vector;

A weighted fusion unit, configured to use the cosine similarity between the differential feature vector and the point plus feature vector as a weight parameter, and use the following fusion formula to fuse the first scale semantic association feature vector and the second scale semantic association feature vector to obtain the feature vector of the data to be detected; wherein the fusion formula is: $V_i = \alpha V_1 + (1-\alpha) V_2$, where $V_1$ represents the first scale semantic association feature vector, $V_2$ represents the second scale semantic association feature vector, $\alpha$ represents the weight parameter, and $V_i$ represents the feature vector of the data to be detected.

6. The data security protection integrated system according to claim 5, characterized in that the sensitive vocabulary semantic understanding module includes:

An embedding vectorization unit, configured to use the embedding layer of the context encoder to respectively map each sensitive word in the sensitive word set to a word embedding vector to obtain a sequence of word embedding vectors;

A semantic encoding unit, configured to use the transformer-based BERT model of the context encoder to perform global contextual semantic encoding on the sequence of word embedding vectors to obtain multiple word feature vectors;

A concatenation unit, used to concatenate the plurality of word feature vectors to obtain the sensitive data feature vector.

7. The data security protection integrated system according to claim 6, characterized in that the transfer calculation module is used to calculate the transfer matrix between the feature vector of the data to be detected and the feature vector of the sensitive data using the following transfer formula;

Wherein the transfer formula is:

where $V_a$ represents the feature vector of the data to be detected, $V_b$ represents the feature vector of the sensitive data, $M$ represents the transfer matrix, and the operator in the formula represents the multiplication of a matrix and a vector.

8. An integrated method of data security protection, characterized by including:

Obtain the data to be detected and the sensitive vocabulary collection;

Pass the data to be detected through a context encoder including an embedding layer to obtain multiple contextual semantic feature vectors;

Concatenate the plurality of contextual semantic feature vectors to obtain a first-scale semantic association feature vector;

The plurality of contextual semantic feature vectors are two-dimensionally arranged into a contextual semantic feature matrix and then passed through a convolutional neural network model including multiple hybrid convolutional layers to obtain a second scale semantic association feature vector;

Perform interpolation and orderly fusion on the first scale semantic correlation feature vector and the second scale semantic correlation feature vector to obtain a feature vector of the data to be detected;

Pass the sensitive vocabulary set through a context encoder including an embedding layer to obtain a sensitive data feature vector;

Calculate the transfer matrix between the feature vector of the data to be detected and the feature vector of the sensitive data as a classification feature matrix;

Pass the classification feature matrix through a classifier to obtain a classification result, where the classification result is used to indicate whether the data to be detected is sensitive data.

9. The data security protection integrated method according to claim 8, characterized in that the data to be detected is passed through a context encoder including an embedding layer to obtain a plurality of contextual semantic feature vectors, including:

Use the embedding layer of the context encoder to map the text data of each data item in the data to be detected into word embedding vectors;

Add the numerical data in each data item to the tail of the word embedding vector of each data item to obtain multiple data item embedding vectors;

Perform contextual semantic encoding on the plurality of data item embedding vectors using the transformer-based BERT model of the context encoder to obtain the plurality of contextual semantic feature vectors.

10. The data security protection integrated method according to claim 9, characterized in that using the transformer-based BERT model of the context encoder to perform contextual semantic encoding on the plurality of data item embedding vectors to obtain the plurality of contextual semantic feature vectors includes:

One-dimensionally arrange the plurality of data item embedding vectors to obtain a data item global embedding vector;

Calculating a product between the data item global embedding vector and the transposed vector of each data item embedding vector in the plurality of data item embedding vectors to obtain a plurality of self-attention correlation matrices;

Perform standardization processing on each of the self-attention correlation matrices respectively to obtain a plurality of standardized self-attention correlation matrices;

Pass each of the multiple standardized self-attention correlation matrices through a classification function to obtain multiple probability values;

Each of the plurality of data item embedding vectors is weighted using each of the plurality of probability values as a weight to obtain the plurality of contextual semantic feature vectors.