Detailed Description
In order to describe the technical contents, the achieved objects and effects of the present invention in detail, the following description will be made with reference to the embodiments in conjunction with the accompanying drawings.
Referring to fig. 1, a personal information identification method includes acquiring metadata of a database, and acquiring data types and data characteristic values of various fields in the database;
Respectively checking each field according to the data type, data value range, data length and data rule of the preset personal information data items and the data type and data characteristic values of each field, to determine the candidate personal information data items corresponding to each field;
Calculating the name matching degree, the entity identification matching degree and the numerical fingerprint matching degree of each field and each candidate personal information data item corresponding to each field respectively, and calculating the final matching degree of each field and each candidate personal information data item according to the name matching degree, the entity identification matching degree and the numerical fingerprint matching degree;
And if the final matching degree of one field and one candidate personal information data item is maximum and is larger than a preset threshold value, taking the candidate personal information data item as a target personal information data item matched with the field.
From the above description, the invention has the beneficial effects of realizing the automatic intelligent detection of the personal information in the database and ensuring the detection efficiency and accuracy.
Further, the data types include a text type, a numeric type, a date and time type, a short text type, an enumerated dictionary type, and a binary type.
Further, acquiring the data types and data characteristic values of the fields in the database specifically comprises:
if the data type of a field is the text type, acquiring the minimum value and the maximum value of the text length;
if the data type of a field is the numeric type or the date and time type, acquiring the maximum value, the minimum value, the average value and the variance of the data;
if the data type of a field is the short text type or the enumerated dictionary type, acquiring a list of the values in its value range;
if the data type of a field is the binary type, acquiring the minimum length and the maximum length of the binary data.
It will be appreciated from the above description that for fields of different data types, different data characteristic values are obtained, facilitating subsequent data verification to quickly determine potentially matching personal information data items.
Further, the calculating the name matching degree, the entity identification matching degree and the numerical fingerprint matching degree of each field and each candidate personal information data item corresponding to each field respectively, and calculating the final matching degree of each field and each candidate personal information data item according to the name matching degree, the entity identification matching degree and the numerical fingerprint matching degree specifically comprises:
Respectively calculating the matching degree between the field name and field description of each field and the data item name of each corresponding candidate personal information data item, to obtain the name matching degree of each field and each candidate personal information data item;
Calculating, through a conditional random field, the matching degree between the data values of each field whose data type is the short text type and each corresponding candidate personal information data item, to obtain the entity identification matching degree of each such field and each candidate personal information data item;
Respectively calculating the matching degree between the numerical fingerprint of each field whose data type is the numeric type or the date and time type and the numerical fingerprint of each corresponding candidate personal information data item, to obtain the numerical fingerprint matching degree of each such field and each candidate personal information data item;
Respectively calculating the coverage between the value set of each field whose data type is the enumerated dictionary type and the dictionary value set of each corresponding candidate personal information data item, to obtain the numerical fingerprint matching degree of each such field and each candidate personal information data item;
And calculating the final matching degree of each field and each candidate personal information data item according to the name matching degree, the entity identification matching degree and the numerical fingerprint matching degree of each field and each candidate personal information data item.
As can be seen from the above description, the accuracy of matching degree calculation can be improved by calculating the matching degree through three dimensions of field name, data entity identification and data fingerprint, so that the identification accuracy of the subsequent target personal information data item can be improved.
Further, calculating the matching degree between the field name and field description of each field and the data item name of each corresponding candidate personal information data item, to obtain the name matching degree of each field and each candidate personal information data item, specifically comprises the following steps:
calculating a word vector of a field name of a field to obtain a first word vector;
Calculating word vectors of field descriptions of the field to obtain second word vectors;
Calculating a word vector of the data item name of a candidate personal information data item corresponding to the field to obtain a third word vector;
According to the first word vector and the third word vector, calculating cosine similarity between the field name of the field and the data item name of the candidate personal information data item to obtain first cosine similarity;
According to the second word vector and the third word vector, calculating cosine similarity between the field description of the field and the data item name of the candidate personal information data item to obtain second cosine similarity;
and calculating the name matching degree of the field and the candidate personal information data item according to the first cosine similarity, the second cosine similarity and the preset first weight coefficient and second weight coefficient.
As can be seen from the above description, the accuracy of the name matching degree calculation can be improved by converting text into word vectors and calculating the name matching degree according to cosine similarity.
Further, calculating, through the conditional random field, the matching degree between the data values of each field whose data type is the short text type and each corresponding candidate personal information data item, to obtain the entity identification matching degree of each such field and each candidate personal information data item, specifically comprises:
acquiring a field with a data type of a short text type;
Calculating a prediction score of a candidate personal information data item corresponding to the data value of the field through a preset conditional random field;
And calculating the entity identification matching degree of the field and the candidate personal information data item according to the prediction score and a preset third weight coefficient.
From the above description, it is known that by identifying named entities, identifying entities in text, it is possible to further determine matching personal information data items for short text-type data.
Further, calculating the matching degree between the numerical fingerprint of each field whose data type is the numeric type or the date and time type and the numerical fingerprint of each corresponding candidate personal information data item, to obtain the numerical fingerprint matching degree of each such field and each candidate personal information data item, specifically comprises:
Acquiring a field whose data type is the numeric type or the date and time type;
Acquiring a six-dimensional characteristic value according to the numerical value of the field to obtain a numerical fingerprint of the field, wherein the six-dimensional characteristic value comprises a minimum value, a first quartile, a median, a third quartile, a maximum value and a variance;
Acquiring a six-dimensional characteristic value of a candidate personal information data item corresponding to the field, and obtaining a numerical fingerprint of the candidate personal information data item;
And calculating the Euclidean distance according to the numerical fingerprint of the field and the numerical fingerprint of the candidate personal information data item, and calculating the matching degree of the field and the numerical fingerprint of the candidate personal information data item according to the Euclidean distance and a preset fourth weight coefficient.
From the above description, it can be seen that a numerical fingerprint is generated from the characteristics of the numerical distribution, and the matching degree is measured by the distance between numerical fingerprints. For numeric information, a numerical fingerprint consists of six dimensions: minimum, first quartile, median, third quartile, maximum and variance.
Further, calculating the coverage between the value set of each field whose data type is the enumerated dictionary type and the dictionary value set of each corresponding candidate personal information data item, to obtain the numerical fingerprint matching degree of each such field and each candidate personal information data item, specifically comprises the following steps:
acquiring a field of which the data type is an enumeration dictionary type;
Acquiring the numerical value of the field, and performing ascending sort according to a natural sequence to obtain a numerical value set of the field;
Acquiring a dictionary value set of a candidate personal information data item corresponding to the field;
And calculating the coverage of the numerical value set in the dictionary value set, and calculating the numerical fingerprint matching degree of the field and the candidate personal information data item according to the coverage and a preset fifth weight coefficient.
As can be seen from the above description, for data of the enumerated dictionary type, the matching degree is measured by the coverage between sets.
Further, the calculating the name matching degree, the entity identification matching degree and the numerical fingerprint matching degree of each field and each candidate personal information data item corresponding to each field respectively, and calculating the final matching degree of each field and each candidate personal information data item according to the name matching degree, the entity identification matching degree and the numerical fingerprint matching degree, further includes:
setting to 0 the entity identification matching degree between each field whose data type is not the short text type and each of its candidate personal information data items;
setting to 0 the numerical fingerprint matching degree between each field whose data type is not the numeric type, the date and time type or the enumerated dictionary type and each of its candidate personal information data items.
The invention also proposes a computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, implements a method as described above.
Example 1
Referring to FIG. 2, a personal information identification method, which can be applied to automatically and intelligently detect the personal information contained in a database, includes the following steps:
S1, data exploration.
First, database metadata is acquired. Personal information is typically stored in a database, whether a traditional relational or non-relational database, which uses metadata to describe the data format stored, including information of all tables and field information, specifically table names, field types (i.e., data storage types), field descriptions (comments). Thus, database metadata is first acquired, providing support for matching of subsequent data.
Then, the data type and the data characteristic value of each field in the database are obtained. I.e. traversing the data of each field in the database to obtain the characteristic of the data stored in the field.
In this embodiment, the data types include a text type, a numeric type, a date and time type, a short text type, an enumerated dictionary type, and a binary type. For fields of different data types, different data characteristic values are obtained, specifically as follows:
if the data type of a field is the text type, acquiring the minimum value and the maximum value of the text length;
if the data type of a field is the numeric type or the date and time type, acquiring the maximum value and the minimum value of the data and, further, the average value, the variance and the like;
if the data type of a field is the short text type or the enumerated dictionary type, acquiring a list of the values in its value range;
if the data type of a field is binary type, the minimum length and the maximum length of binary data are obtained.
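As a non-limiting illustration, the per-type characteristic values listed above could be collected as in the following minimal Python sketch. The function name `profile_field`, the type labels and the dictionary keys are hypothetical names chosen for this example; the sketch also assumes that the field is non-empty and that date and time values have already been converted to numbers.

```python
from statistics import mean, pvariance


def profile_field(data_type, values):
    """Return the data characteristic values of one field, depending on its data type.

    All names here are illustrative assumptions, not part of the described method.
    """
    if data_type in ("text", "binary"):
        # Text and binary fields are characterised by their length range.
        lengths = [len(v) for v in values]
        return {"min_length": min(lengths), "max_length": max(lengths)}
    if data_type in ("numeric", "datetime"):
        # Assumption: date and time values were converted to numbers beforehand
        # (for example, milliseconds since 1970).
        return {
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
            "variance": pvariance(values),
        }
    if data_type in ("short_text", "enum_dict"):
        # Short text and enumerated dictionary fields keep the list of distinct values.
        return {"value_list": sorted(set(values))}
    raise ValueError(f"unknown data type: {data_type}")
```

The resulting dictionaries can then be compared against the preset requirements of each personal information data item in the data verification step.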
S2, data verification: each field is verified according to the data type, data value range, data length and data rule of the preset personal information data items and the data type and data characteristic values of the field, to determine the candidate personal information data items corresponding to each field.
Any data field stored in the database may hold personal information, and personal information covers a relatively large number of data items, such as name, identification number, mobile phone number, home address, religious belief, and so on. By checking the data type, value range, length, rules, etc., the matching personal information data item can be determined or narrowed down to a certain range. Passing data verification is a necessary condition for proceeding to the matching degree calculation.
In this embodiment, the data verification includes data type verification, data value range verification, data length verification, and data rule verification.
Data type verification checks the relationship between the data storage type and the personal information data item; a data item may have one or more possible data storage types. For example, the data storage type of the identification number data item is usually the text type, so the numeric, date and time, binary and other data types can be excluded. The data storage type of the date of birth data item can be the text type, the date and time type, or the numeric type (milliseconds since 1970).
Most numeric data items, such as age, height and date of birth, have a corresponding value range. Therefore, data items with value range requirements undergo type conversion and then value range verification. For example, for a field whose data type is the numeric type or the date and time type, it is determined whether the maximum value and the minimum value of the data are both within the value range of a certain personal information data item; if so, that personal information data item is considered a data item that the field may match.
In addition, fields whose data type is the short text type or the enumerated dictionary type also have corresponding value ranges, so they can be value range verified as well.
Both text-type data items and binary data items typically have a length requirement. The length range may be discrete or continuous. For example, an identification card number has 15 or 18 characters, a mobile phone number has 11, 14 or 15 characters (with the international code), and a license plate number has 7 to 8 characters. Thus, the data of fields whose data type is the text type or the binary type can be length verified.
A particular personal information data item usually also has a verification rule. For example, the last character of the identification card number is a check digit computed from the first 17 digits, and the first 3 digits of a mobile phone number must fall within certain ranges, such as 130-139 and 145-147. Depending on the specific rule, the check digit can be verified according to the rule, and a regular expression can be used to verify the structure of the data's constituent elements.
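By way of example only, the two rule checks mentioned above might be sketched in Python as follows. The weights and the check character table are those of the published 18-character resident identity number scheme (GB 11643-1999); they are not given in this description and are included as background knowledge. The mobile prefix pattern covers only the ranges 130-139 and 145-147 named above, and the function and constant names are hypothetical.

```python
import re

# Weights and check characters of the 18-character resident identity number
# (per GB 11643-1999; not stated in this description).
ID_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
ID_CHECK_CHARS = "10X98765432"
ID_PATTERN = re.compile(r"[0-9]{17}[0-9X]")

# 11-digit mobile numbers whose first three digits are 130-139 or 145-147.
MOBILE_PATTERN = re.compile(r"(13[0-9]|14[5-7])[0-9]{8}")


def id_check_digit_ok(value):
    """Structural check by regular expression, then check digit verification."""
    if not ID_PATTERN.fullmatch(value):
        return False
    total = sum(int(digit) * weight for digit, weight in zip(value[:17], ID_WEIGHTS))
    return ID_CHECK_CHARS[total % 11] == value[17]


def mobile_prefix_ok(value):
    """Structural check of an 11-digit mobile number against the allowed prefixes."""
    return MOBILE_PATTERN.fullmatch(value) is not None
```

A field would pass rule verification for a data item only if its values satisfy the corresponding check.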
For example, assuming that the data type of a field is the text type, the minimum text length is 11, the maximum text length is 15, and the first 3 digits of every value are within the preset range, the mobile phone number data item is considered a candidate personal information data item of the field.
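The screening in the example above can be illustrated by the following minimal Python sketch, which keeps a data item as a candidate only when the type, length and rule checks all pass. The `DataItemSpec` structure, the two sample specifications and the prefix rule are hypothetical and serve only to illustrate the idea; value range verification for numeric fields is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DataItemSpec:
    """Preset requirements of one personal information data item (illustrative)."""
    name: str
    allowed_types: frozenset
    min_length: Optional[int] = None
    max_length: Optional[int] = None
    rule: Optional[Callable[[str], bool]] = None


def is_candidate(spec, data_type, values):
    """Return True if the field passes type, length and rule verification."""
    if data_type not in spec.allowed_types:
        return False
    if spec.min_length is not None or spec.max_length is not None:
        lengths = [len(v) for v in values]
        if spec.min_length is not None and min(lengths) < spec.min_length:
            return False
        if spec.max_length is not None and max(lengths) > spec.max_length:
            return False
    if spec.rule is not None and not all(spec.rule(v) for v in values):
        return False
    return True


def candidates_for(specs, data_type, values):
    """Names of all data items that remain candidates for the field."""
    return [spec.name for spec in specs if is_candidate(spec, data_type, values)]


# Assumed prefix range, taken from the example ranges 130-139 and 145-147.
MOBILE_PREFIXES = {str(p) for p in list(range(130, 140)) + list(range(145, 148))}

SPECS = [
    DataItemSpec("mobile phone number", frozenset({"text"}), 11, 15,
                 rule=lambda v: v[:3] in MOBILE_PREFIXES),
    DataItemSpec("license plate number", frozenset({"text"}), 7, 8),
]
```

With these sample specifications, a text field holding `["13812345678", "14598765432"]` keeps only the mobile phone number as a candidate.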
S3, matching degree calculation: the name matching degree, the entity identification matching degree and the numerical fingerprint matching degree between each field and each of its candidate personal information data items are calculated, and the final matching degree of each field and each candidate personal information data item is calculated from these three matching degrees.
During the data verification stage, it is generally only possible to narrow the personal information data items that a stored data field may match down to a certain range. Only data items with an explicit rule, such as the identification number, can basically be determined by data verification alone. Thus, the remaining possible data items must be selected among by other means. This stage computes the matching degree along the dimensions of field name, entity identification and numerical fingerprint, and intelligently matches data items according to the weights, the threshold and the magnitude of the matching degree.
Specifically, the method comprises the following steps:
S301, respectively calculating the matching degree between the field name and field description of each field and the data item name of each corresponding candidate personal information data item, to obtain the name matching degree between each field and each candidate personal information data item.
The field name plays a key role in matching fields, but the same data item can be expressed in many ways; for example, the data item "name" may be expressed as "name", "Chinese name", "great name", "opposite name", and the like. The matching degree can be measured by computing the distance between the field name and the data item name with natural language processing techniques. Field names are typically quite terse, so the field descriptions (remarks) in the database metadata can also be used to assist in measuring the matching degree.
The natural language processing technique word2vec (word to vector) reduces the processing of text content to vector operations in a vector space. Using word2vec requires preparing a large corpus for training, and the trained model can then be used to calculate word vectors. In this embodiment, the similarity between two vectors in an inner product space is measured by the cosine of the angle between them. The cosine of the angle between two vectors A and B is calculated as follows:
$$\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}.$$
Therefore, the method for calculating the name matching degree in this step is specifically as follows:
Calculating the word vector of the field name of a field to obtain a first word vector, and calculating the word vector of the field description of the field to obtain a second word vector; at the same time, calculating the word vector of the data item name of a candidate personal information data item corresponding to the field to obtain a third word vector.
For example, assuming that the field name of a field is FIELDNAME, whose word vector after word2vec conversion is vecF, the field description of the field is fieldDescription, whose word vector is vecT, and the data item name of a candidate personal information data item corresponding to the field is TARGETFIELDNAME, whose word vector is vecD, the name matching degree degree1 between the field and the candidate personal information data item is calculated as follows:
$$\mathrm{degree1} = \cos(\mathrm{vecF}, \mathrm{vecD}) \times \mathrm{weight1} + \cos(\mathrm{vecT}, \mathrm{vecD}) \times \mathrm{weight2},$$
wherein weight1 is the first weight coefficient corresponding to the field name, and weight2 is the second weight coefficient corresponding to the field description.
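As a minimal sketch, and assuming that the name matching degree is the weighted sum of the two cosine similarities, the calculation could look as follows in Python. The word vectors are taken as plain lists of numbers that some already trained word2vec model is assumed to have produced; training and lookup are outside the scope of this example, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


def name_matching_degree(vec_f, vec_t, vec_d, weight1, weight2):
    """Assumed form: weighted sum of the two cosine similarities.

    vec_f: word vector of the field name
    vec_t: word vector of the field description
    vec_d: word vector of the candidate data item name
    """
    first = cosine_similarity(vec_f, vec_d)
    second = cosine_similarity(vec_t, vec_d)
    return first * weight1 + second * weight2
```

Giving the field name a larger weight than the field description is one plausible choice, since the description is used only as an auxiliary signal.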
S302, calculating the matching degree of the data value of the field with the data type of the short text type and each candidate personal information data item corresponding to the data value through the conditional random field, and obtaining the entity identification matching degree between each field with the data type of the short text type and each candidate personal information data item.
Named entity recognition, namely, recognizing the entities such as person names, organization names, place names and the like in the text. Matching data items may be further determined for text-type data. Conditional random field CRF (conditional random field) is a conditional probability distribution model for another set of output random variables given a set of input random variables, and is characterized by assuming that the output random variables constitute a Markov (Markov) random field. Such machine learning techniques are commonly used for entity recognition, automatic learning from training data sets, and the trained models can be used to recognize text, match possible person names, addresses, work units, and so on.
If the data stored in the database is of a short text type and the candidate personal information data items to be matched belong to identifiable named entities, the degree of matching between the fields and the candidate personal information data items is measured from a predictive score identified by the entities. Otherwise, the matching degree value of entity identification is 0. That is, for fields whose data type is not a short text type, and fields whose data type is a short text type but whose corresponding candidate personal information data items do not belong to identifiable named entities, the entity recognition matching degree between these fields and their corresponding candidate personal information data items is set to 0 directly.
In this embodiment, the method for calculating the entity identification matching degree in this step is specifically as follows:
A field FIELDNAME whose data type is the short text type is obtained; then the prediction score of the data value fieldValue of the field FIELDNAME for a corresponding candidate personal information data item TARGETFIELDNAME is calculated by a preset conditional random field and denoted crfScore. The entity identification matching degree degree2 between the field FIELDNAME and the candidate personal information data item TARGETFIELDNAME is calculated as follows:
$$\mathrm{degree2} = \mathrm{crfScore} \times \mathrm{weight3},$$
wherein crfScore is the score of the data value fieldValue predicted by the conditional random field (CRF), and weight3 is the third weight coefficient, corresponding to entity identification.
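The gating described in this step can be illustrated by the short Python sketch below. It assumes that the matching degree is the product of the prediction score and the weight, and it takes the score as an input rather than running an actual conditional random field; the function name, parameters and type label are hypothetical.

```python
def entity_matching_degree(data_type, item_is_named_entity, crf_score, weight3):
    """Entity identification matching degree of a field and one candidate data item.

    Returns 0 unless the field is of the short text type and the candidate
    data item is an identifiable named entity (e.g. a person name or address).
    Assumption: the degree is the product of the CRF score and weight3.
    """
    if data_type != "short_text" or not item_is_named_entity:
        return 0.0
    return crf_score * weight3
```

In practice the score would come from a conditional random field model trained on labelled text.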
S303, respectively calculating the matching degree between the numerical fingerprint of each field whose data type is the numeric type or the date and time type and the numerical fingerprint of each corresponding candidate personal information data item, to obtain the numerical fingerprint matching degree between each such field and each candidate personal information data item.
The numerical distribution of part of the personal information has certain laws, such as age, date of birth. A numerical fingerprint may be generated from the characteristics of the numerical distribution, the degree of matching between the two being measured from the distance between the numerical fingerprints.
In this embodiment, the numerical fingerprints of fields of different data types differ; for example, for numeric information, the numerical fingerprint consists of six dimensions: minimum, first quartile, median, third quartile, maximum and variance. Therefore, the method for calculating the numerical fingerprint matching degree in this step is specifically as follows:
A field whose data type is the numeric type or the date and time type is obtained, and all values of the field are traversed and counted to obtain six characteristic values, namely the minimum value, the first quartile, the median, the third quartile, the maximum value and the variance, which are combined to form the numerical fingerprint of the field. At the same time, the six-dimensional characteristic values of a candidate personal information data item corresponding to the field are acquired to obtain the numerical fingerprint of the candidate personal information data item. The normalized Euclidean distance is then used to measure the matching degree between the numerical fingerprints:
$$d(C, D) = \sqrt{\sum_{k=1}^{n} \left( \frac{C_k - D_k}{s_k} \right)^2},$$
wherein C is the numerical fingerprint of the field fieldValue, D is the numerical fingerprint of the candidate personal information data item targetFieldValue, and n is the number of dimensions of the numerical fingerprint (in this embodiment n=6, i.e. six dimensions in total); C_k and D_k denote the kth dimension of the respective numerical fingerprint, e.g. C_1 is the first dimension, i.e. the minimum; and s_k is the standard deviation of the numerical fingerprint D in the kth dimension. The numerical fingerprint matching degree degree3 of the field and the candidate personal information data item is calculated from this distance and weight4, the fourth weight coefficient, corresponding to the numeric numerical fingerprint.
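The following Python sketch illustrates one way the six-dimensional fingerprint and the normalized Euclidean distance might be computed. Several points are assumptions made for the example only: the quartiles use the inclusive method of the standard library, the per-dimension standard deviations `s` are supplied by the caller and must be non-zero, and the distance is mapped to a matching degree as weight4 / (1 + distance), which is merely one plausible monotone mapping and is not prescribed by this description.

```python
import math
from statistics import pvariance, quantiles


def numeric_fingerprint(values):
    """Six-dimensional fingerprint: min, Q1, median, Q3, max, variance."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values), pvariance(values))


def normalized_euclidean(c, d, s):
    """Euclidean distance with each dimension scaled by its standard deviation.

    s holds one non-zero standard deviation per fingerprint dimension.
    """
    return math.sqrt(sum(((ck - dk) / sk) ** 2 for ck, dk, sk in zip(c, d, s)))


def fingerprint_matching_degree(c, d, s, weight4):
    """ASSUMED mapping: identical fingerprints score weight4, distant ones approach 0."""
    return weight4 / (1.0 + normalized_euclidean(c, d, s))
```

For instance, the values 1 to 5 yield the fingerprint (1, 2.0, 3.0, 4.0, 5, 2), which can then be compared with the reference fingerprint of a candidate data item such as age.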
S304, respectively calculating the coverage between the value set of each field whose data type is the enumerated dictionary type and the dictionary value set of each corresponding candidate personal information data item, to obtain the numerical fingerprint matching degree between each such field and each candidate personal information data item.
Enumerated dictionary data items include religious belief, ethnicity, educational background, blood type, and so on. The values of such a dictionary, taken in natural order, form a set that represents the characteristics of the value distribution, and the matching degree is measured by the coverage between the sets.
Therefore, the method for calculating the numerical fingerprint matching degree in this step is specifically as follows:
A field whose data type is the enumerated dictionary type is obtained, and all values in the field are sorted in ascending natural order to obtain the value set of the field. Meanwhile, the dictionary value set of a candidate personal information data item corresponding to the field is obtained; the dictionary value set may follow national, local and industry standards, for example the national standard GB3304-91. The numerical fingerprint matching degree of the field and the candidate personal information data item is calculated as follows:
$$\mathrm{degree3} = \mathrm{coverage}(E, F) \times \mathrm{weight5},$$
where E is the value set of the field fieldValue in the database, F is the dictionary value set of the candidate personal information data item targetFieldValue corresponding to that field, coverage(E, F) is the coverage of E in F, and weight5 is the fifth weight coefficient, corresponding to the enumerated dictionary numerical fingerprint.
Further, for fields whose data type is not the numeric type, the date and time type or the enumerated dictionary type, the numerical fingerprint matching degree between these fields and their respective candidate personal information data items is set to 0.
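As a hedged illustration of the coverage-based matching for enumerated dictionary fields, the Python sketch below assumes that coverage is the share of the field's distinct values that also appear in the dictionary value set, and that the matching degree is that coverage multiplied by the weight. Both the exact coverage definition and the function names are assumptions made for this example.

```python
def coverage(field_values, dictionary_values):
    """ASSUMED definition: fraction of distinct field values found in the dictionary."""
    e = set(field_values)
    f = set(dictionary_values)
    if not e:
        return 0.0
    return len(e & f) / len(e)


def enum_matching_degree(field_values, dictionary_values, weight5):
    """Numerical fingerprint matching degree for an enumerated dictionary field."""
    return coverage(field_values, dictionary_values) * weight5
```

For example, a field holding the values A, B, O and one unknown code covers three of its four distinct values in a blood type dictionary {A, B, AB, O}, giving a coverage of 0.75.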
S305, calculating the final matching degree of each field and each candidate personal information data item according to the name matching degree, the entity identification matching degree and the numerical fingerprint matching degree of each field and each candidate personal information data item.
The matching degrees of the three dimensions are accumulated to obtain the final matching degree: degree = degree1 + degree2 + degree3.
S4, determining target personal information data items matched with the fields.
The matching degree calculation yields the final matching degree of each field and each of its candidate personal information data items. Then, for each field, the final matching degrees of its candidate personal information data items are compared. A match counts as valid only if its final matching degree is larger than a preset threshold G; if, for one field, the final matching degrees of several candidate personal information data items are larger than the threshold G, the candidate personal information data item with the largest final matching degree is taken as the best match.
And if the final matching degree of one field and one candidate personal information data item is maximum and is larger than a preset threshold value, taking the candidate personal information data item as a target personal information data item matched with the field.
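The accumulation and threshold selection described above can be sketched in Python as follows. The dictionary of per-candidate degrees and the function names are hypothetical; the sum of the three matching degrees and the rule "largest final matching degree, and larger than the threshold" follow the description.

```python
def final_matching_degree(degree1, degree2, degree3):
    """Final matching degree: sum of name, entity and fingerprint matching degrees."""
    return degree1 + degree2 + degree3


def select_target_item(degrees_by_item, threshold):
    """Return the candidate with the largest final matching degree if it exceeds
    the threshold, otherwise None (the field matches no data item)."""
    if not degrees_by_item:
        return None
    best_item = max(degrees_by_item, key=degrees_by_item.get)
    if degrees_by_item[best_item] > threshold:
        return best_item
    return None
```

Applied to a field with candidates scored 0.9 for "name" and 0.7 for "address" and a threshold of 0.8, the selection returns "name"; with a threshold of 0.95 no target data item is assigned.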
According to the method, data exploration is conducted from two aspects, the metadata of the database and the distribution characteristics of the traversed data; the data is verified by data type, data value range, data length and data rule, quickly narrowing down the personal information data items that may match; the matching degree between a field and its candidate personal information data items is further calculated along the three dimensions of field name, data entity identification and data fingerprint; and finally the target personal information data item is determined through the threshold and the final matching degree, so that the personal information contained in the database is automatically and intelligently detected.
By intelligently matching the data items in data resources to personal information data items, scientific and efficient support is provided for data inspection, a reference is provided for quality management in data governance, and abnormal data and outliers can be intelligently detected. The data verification and data fingerprint scheme can also be used to intelligently match data types during data access, reducing the workload of manually matching data items.
As regulators continue to improve laws and regulations on data security and society attaches increasing importance to data security, this embodiment can be used by data security inspection departments to quickly take stock of the data resources of network operators, intelligently detect the personal information involved, and provide strong support for data asset reports, thereby helping to build a peaceful, secure, open and cooperative cyberspace and providing an efficient reference scheme for the self-supervision of data managers.
Example 2
The present embodiment is a computer readable storage medium corresponding to the above embodiment, and has a computer program stored thereon, where the computer program when executed by a processor implements the processes in the above personal information identification method embodiment, and the same technical effects can be achieved, and for avoiding repetition, a detailed description is omitted here.
In summary, the personal information identification method and the computer-readable storage medium provided by the invention perform data exploration from two aspects, database metadata and the distribution characteristics of the traversed data, and verify the data by data type, data value range, data length and data rule, so that the personal information data items that may match are quickly determined; the matching degree between a field and its candidate personal information data items is then calculated along the three dimensions of field name, data entity identification and data fingerprint, and the target personal information data item is determined according to the final matching degree, thereby realizing automatic intelligent detection of the personal information contained in a database and ensuring the detection accuracy. By exploiting the characteristics of the metadata structure and the data value ranges and combining data verification with matching degree calculation in a stepwise manner, the invention achieves intelligent matching of data items, so that stored personal information items are automatically detected and support is provided for data security supervision and management work.
The foregoing description is only illustrative of the present invention and is not intended to limit the scope of the invention, and all equivalent changes made by the specification and drawings of the present invention, or direct or indirect application in the relevant art, are included in the scope of the present invention.