CN114398528B - Personal information identification method and computer-readable storage medium - Google Patents

Personal information identification method and computer-readable storage medium

Info

Publication number
CN114398528B
Authority
CN
China
Prior art keywords
field
personal information
data
type
matching degree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111479395.4A
Other languages
Chinese (zh)
Other versions
CN114398528A (en)
Inventor
杜新胜
曾超
朱健伟
王超
蔡丽莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guotou Intelligent Xiamen Information Co ltd
Xiamen Meiya Pico Information Co Ltd
Original Assignee
Guotou Intelligent Xiamen Information Co ltd
Xiamen Meiya Pico Information Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guotou Intelligent Xiamen Information Co ltd, Xiamen Meiya Pico Information Co Ltd filed Critical Guotou Intelligent Xiamen Information Co ltd
Priority to CN202111479395.4A priority Critical patent/CN114398528B/en
Publication of CN114398528A publication Critical patent/CN114398528A/en
Application granted granted Critical
Publication of CN114398528B publication Critical patent/CN114398528B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/907 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/28 Determining representative reference patterns, e.g. by averaging or distorting; Generating dictionaries
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Algebra (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Collating Specific Patterns (AREA)

Abstract

The present invention discloses a personal information identification method and computer-readable storage medium. The method comprises: obtaining database metadata and the data type and data characteristic value of each field in the database; verifying each field based on the preset data type, data value range, data length, and data rules of the personal information data item, as well as the data type and data characteristic value of each field, and determining the candidate personal information data item corresponding to each field; calculating the name matching degree, entity recognition matching degree, and numerical fingerprint matching degree of each field and its corresponding candidate personal information data item, and calculating the final matching degree of each field and its candidate personal information data item; if the final matching degree between a field and a candidate personal information data item is the largest and greater than a preset threshold, then the candidate personal information data item is used as the target personal information data item. The present invention can automatically identify personal information data items involved in the database.

Description

Personal information identification method and computer readable storage medium
Technical Field
The present invention relates to the field of data identification technologies, and in particular, to a personal information identification method and a computer readable storage medium.
Background
With the development of the internet and the shift to big data, personal information has become sensitive network data, and protecting it is an important part of network security. When conducting data security reviews, supervisory bodies need information technology support in order to make those reviews broad, deep, efficient, and accurate.
Related technologies have begun to emerge. On the client side, there are dynamic detection techniques that detect the collection of personal information by applications. On the server side, there are technologies for data component identification, data component vulnerability detection, data content identification, data asset display, and the like. These technologies can identify various cloud components, big data components, relational database components, encrypted containers, and so on; can detect database permission vulnerabilities, application vulnerabilities, data leakage vulnerabilities, and so on; can identify personal data by means of regular expressions, keywords, and similar methods; and can display data assets along dimensions such as data type and data volume.
However, a regulated entity may, for its own purposes, hide or conceal its storage of personal information, which makes personal information identification a key step. The existing personal information identification techniques used for data audits are simple and are neither mature nor comprehensive.
Disclosure of Invention
The invention aims to provide a personal information identification method and a computer readable storage medium that can automatically identify the personal information data items involved in a database.
To solve the above technical problem, the technical solution adopted by the invention is a personal information identification method comprising the following steps:
acquiring metadata of a database, and acquiring the data type and data characteristic values of each field in the database;
checking each field against the preset data type, data value range, data length, and data rules of the personal information data items, using the data type and data characteristic values of that field, to determine the candidate personal information data items corresponding to each field;
calculating, for each field and each of its candidate personal information data items, the name matching degree, the entity recognition matching degree, and the numerical fingerprint matching degree, and calculating the final matching degree of each field and each candidate personal information data item from these three matching degrees;
and if the final matching degree between a field and one of its candidate personal information data items is the largest and is greater than a preset threshold, taking that candidate personal information data item as the target personal information data item matched with the field.
The invention also provides a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method described above.
The beneficial effects are as follows. Data exploration is carried out from two aspects: the database metadata, and the data distribution characteristics obtained by traversing the data. The data is then verified against data type, data value range, data length, and data rules, so that the personal information data items that may match are determined quickly. Next, the matching degrees between each field and its candidate personal information data items are calculated along three dimensions (field name, data entity recognition, and data fingerprint), and the target personal information data item is determined from the final matching degree. This achieves automatic, intelligent detection of personal information in the database and ensures detection accuracy. By exploiting the metadata structure and the data value ranges, and by combining data verification with matching degree calculation in successive stages, the invention matches data items intelligently, automatically detects the stored personal information items, and supports data security supervision and management.
Drawings
FIG. 1 is a flow chart of a personal information identification method of the present invention;
FIG. 2 is a flow chart of the method according to a first embodiment of the invention.
Detailed Description
In order to describe the technical contents, the achieved objects and effects of the present invention in detail, the following description will be made with reference to the embodiments in conjunction with the accompanying drawings.
Referring to FIG. 1, a personal information identification method includes: acquiring metadata of a database, and acquiring the data type and data characteristic values of each field in the database;
checking each field against the preset data type, data value range, data length, and data rules of the personal information data items, using the data type and data characteristic values of that field, to determine the candidate personal information data items corresponding to each field;
calculating, for each field and each of its candidate personal information data items, the name matching degree, the entity recognition matching degree, and the numerical fingerprint matching degree, and calculating the final matching degree of each field and each candidate personal information data item from these three matching degrees;
and if the final matching degree between a field and one of its candidate personal information data items is the largest and is greater than a preset threshold, taking that candidate personal information data item as the target personal information data item matched with the field.
As can be seen from the above description, the invention achieves automatic, intelligent detection of personal information in the database and ensures detection efficiency and accuracy.
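The final selection step can be illustrated with a minimal Python sketch. The text does not state how the three matching degrees are combined, so the plain sum in `final_matching_degree` is an assumption, and the function names and sample values are hypothetical.

```python
def final_matching_degree(name_degree, entity_degree, fingerprint_degree):
    """Combine the three matching degrees.

    Assumption: a plain sum; the text only says the final degree is
    calculated "according to" the three component degrees.
    """
    return name_degree + entity_degree + fingerprint_degree


def pick_target_item(candidate_degrees, threshold):
    """Return the candidate with the largest final matching degree,
    or None if no candidate exceeds the preset threshold."""
    if not candidate_degrees:
        return None
    best_item = max(candidate_degrees, key=candidate_degrees.get)
    if candidate_degrees[best_item] > threshold:
        return best_item
    return None


# Hypothetical final matching degrees for one field.
degrees = {"mobile_phone_number": 0.9, "identity_card_number": 0.4}
print(pick_target_item(degrees, threshold=0.6))
```

A field whose best candidate does not exceed the threshold is left unmatched rather than forced onto the nearest data item.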
Further, the data types include a text type, a numeric type, a date-time type, a short text type, an enumerated dictionary type, and a binary type.
Further, acquiring the data type and data characteristic values of each field in the database specifically comprises:
if the data type of a field is the text type, acquiring the minimum and maximum text length;
if the data type of a field is the numeric type or the date-time type, acquiring the maximum, minimum, average, and variance of the data;
if the data type of a field is the short text type or the enumerated dictionary type, acquiring the value range list of the values;
if the data type of a field is the binary type, acquiring the minimum and maximum length of the binary data.
As can be seen from the above description, different data characteristic values are acquired for fields of different data types, which facilitates the subsequent data verification and the quick determination of potentially matching personal information data items.
Further, calculating the name matching degree, the entity recognition matching degree, and the numerical fingerprint matching degree of each field and each of its candidate personal information data items, and calculating the final matching degree from them, specifically comprises:
calculating the matching degree between the field name and field description of each field and the data item name of each corresponding candidate personal information data item, to obtain the name matching degree of each field and each candidate personal information data item;
calculating, by means of a conditional random field, the matching degree between the data values of each field whose data type is the short text type and each corresponding candidate personal information data item, to obtain the entity recognition matching degree of each such field and each candidate personal information data item;
calculating the matching degree between the numerical fingerprint of each field whose data type is the numeric type or the date-time type and the numerical fingerprint of each corresponding candidate personal information data item, to obtain the numerical fingerprint matching degree of each such field and each candidate personal information data item;
calculating the coverage between the data set of each field whose data type is the enumerated dictionary type and the dictionary value set of each corresponding candidate personal information data item, to obtain the numerical fingerprint matching degree of each such field and each candidate personal information data item;
and calculating the final matching degree of each field and each candidate personal information data item from the name matching degree, the entity recognition matching degree, and the numerical fingerprint matching degree of that field and that candidate personal information data item.
As can be seen from the above description, calculating the matching degree along three dimensions (field name, data entity recognition, and data fingerprint) improves the accuracy of the matching degree calculation and therefore the accuracy with which the target personal information data item is subsequently identified.
Further, calculating the matching degree between the field name and field description of each field and the data item name of each corresponding candidate personal information data item, to obtain the name matching degree of each field and each candidate personal information data item, specifically comprises:
calculating the word vector of the field name of a field to obtain a first word vector;
calculating the word vector of the field description of the field to obtain a second word vector;
calculating the word vector of the data item name of a candidate personal information data item corresponding to the field to obtain a third word vector;
calculating, from the first word vector and the third word vector, the cosine similarity between the field name of the field and the data item name of the candidate personal information data item to obtain a first cosine similarity;
calculating, from the second word vector and the third word vector, the cosine similarity between the field description of the field and the data item name of the candidate personal information data item to obtain a second cosine similarity;
and calculating the name matching degree of the field and the candidate personal information data item from the first cosine similarity, the second cosine similarity, and preset first and second weight coefficients.
As can be seen from the above description, converting text into word vectors and calculating the name matching degree from cosine similarity improves the accuracy of the name matching degree calculation.
Further, calculating, by means of a conditional random field, the matching degree between the data values of each field whose data type is the short text type and each corresponding candidate personal information data item, to obtain the entity recognition matching degree of each such field and each candidate personal information data item, specifically comprises:
acquiring a field whose data type is the short text type;
calculating, by means of a preset conditional random field, the prediction score of a candidate personal information data item corresponding to the data values of the field;
and calculating the entity recognition matching degree of the field and the candidate personal information data item from the prediction score and a preset third weight coefficient.
As can be seen from the above description, named entity recognition identifies the entities in text, so that the matching personal information data items can be further determined for short-text-type data.
Further, calculating the matching degree between the numerical fingerprint of each field whose data type is the numeric type or the date-time type and the numerical fingerprint of each corresponding candidate personal information data item, to obtain the numerical fingerprint matching degree of each such field and each candidate personal information data item, specifically comprises:
acquiring a field whose data type is the numeric type or the date-time type;
acquiring six-dimensional characteristic values from the values of the field to obtain the numerical fingerprint of the field, wherein the six-dimensional characteristic values comprise the minimum, the first quartile, the median, the third quartile, the maximum, and the variance;
acquiring the six-dimensional characteristic values of a candidate personal information data item corresponding to the field to obtain the numerical fingerprint of the candidate personal information data item;
and calculating the Euclidean distance between the numerical fingerprint of the field and the numerical fingerprint of the candidate personal information data item, and calculating the numerical fingerprint matching degree of the field and the candidate personal information data item from the Euclidean distance and a preset fourth weight coefficient.
As can be seen from the above description, a numerical fingerprint is generated from the characteristics of the value distribution, and the matching degree is measured by the distance between numerical fingerprints. For numeric information, a numerical fingerprint consists of six dimensions: minimum, first quartile, median, third quartile, maximum, and variance.
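The six-dimensional numerical fingerprint and its distance-based comparison can be sketched in Python as follows. The text does not say how the Euclidean distance and the fourth weight coefficient are turned into a matching degree, so the mapping `weight4 / (1 + distance)` is an assumption, as are the choice of inclusive quartiles, the population variance, and the function names.

```python
import math
from statistics import pvariance, quantiles


def numeric_fingerprint(values):
    """Return (min, Q1, median, Q3, max, variance) for a list of numbers.

    Assumptions: quartiles use the 'inclusive' method and the variance
    is the population variance; the text does not fix either choice.
    """
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values), pvariance(values))


def fingerprint_matching_degree(fp_field, fp_item, weight4=1.0):
    """Map the Euclidean distance between two fingerprints to a degree.

    Assumption: degree = weight4 / (1 + distance), so identical
    fingerprints score weight4 and the score decays with distance.
    """
    distance = math.dist(fp_field, fp_item)
    return weight4 / (1.0 + distance)


ages_in_field = [18, 25, 33, 41, 60]        # hypothetical field values
ages_reference = [1, 20, 35, 50, 90]        # hypothetical reference sample
fp_a = numeric_fingerprint(ages_in_field)
fp_b = numeric_fingerprint(ages_reference)
print(fingerprint_matching_degree(fp_a, fp_b))
```

Because the variance is on a different scale from the other five dimensions, a real implementation would probably normalize the dimensions before taking the distance; the sketch leaves that out.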
Further, calculating the coverage between the data set of each field whose data type is the enumerated dictionary type and the dictionary value set of each corresponding candidate personal information data item, to obtain the numerical fingerprint matching degree of each such field and each candidate personal information data item, specifically comprises:
acquiring a field whose data type is the enumerated dictionary type;
acquiring the values of the field and sorting them in ascending natural order to obtain the value set of the field;
acquiring the dictionary value set of a candidate personal information data item corresponding to the field;
and calculating the coverage of the value set in the dictionary value set, and calculating the numerical fingerprint matching degree of the field and the candidate personal information data item from the coverage and a preset fifth weight coefficient.
As can be seen from the above description, for enumerated-dictionary-type data the matching degree is measured by the coverage between the sets.
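A minimal Python sketch of the coverage-based matching degree follows. The direction of "coverage" is not spelled out in the text; the sketch assumes it is the fraction of the field's distinct values that appear in the candidate's dictionary value set, and that the matching degree is that fraction multiplied by the fifth weight coefficient. The function names and sample codes are hypothetical.

```python
def coverage(field_values, dictionary_values):
    """Fraction of the field's distinct values found in the dictionary.

    Assumption: coverage is measured relative to the field's value set.
    """
    value_set = sorted(set(field_values))   # ascending natural order
    if not value_set:
        return 0.0
    dictionary = set(dictionary_values)
    hits = sum(1 for value in value_set if value in dictionary)
    return hits / len(value_set)


def enum_matching_degree(field_values, dictionary_values, weight5=1.0):
    """Assumption: matching degree = weight5 * coverage."""
    return weight5 * coverage(field_values, dictionary_values)


# Hypothetical gender codes: the field uses "0", "1", "2", "9";
# the candidate data item's dictionary defines "0", "1", "2".
print(enum_matching_degree(["1", "2", "1", "0", "9"], ["0", "1", "2"]))
```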
Further, calculating the name matching degree, the entity recognition matching degree, and the numerical fingerprint matching degree of each field and each of its candidate personal information data items, and calculating the final matching degree from them, further comprises:
setting to 0 the entity recognition matching degree between each field whose data type is not the short text type and each of its candidate personal information data items;
and setting to 0 the numerical fingerprint matching degree between each field whose data type is not the numeric type, the date-time type, or the enumerated dictionary type and each of its candidate personal information data items.
The invention also proposes a computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, implements a method as described above.
Example 1
Referring to FIG. 2, this embodiment provides a personal information identification method that can be applied to the automatic, intelligent detection of personal information involved in a database. As shown in FIG. 2, the method includes the following steps:
S1: data exploration.
First, the database metadata is acquired. Personal information is typically stored in a database. Whether relational or non-relational, a database uses metadata to describe the format of the stored data, including information on all tables and fields, specifically table names, field types (i.e., data storage types), and field descriptions (comments). The database metadata is therefore acquired first, to support the subsequent data matching.
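As one illustration of metadata acquisition, the following Python sketch reads table names, field names, and declared field types from an SQLite database using the standard `sqlite3` module. SQLite is only an example chosen here because it needs no server; the text does not name a database product, SQLite has no field comments, and the table and function names are hypothetical.

```python
import sqlite3


def read_metadata(conn):
    """Return {table_name: [(field_name, declared_type), ...]}."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    metadata = {}
    for table in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        metadata[table] = [(col[1], col[2]) for col in columns]
    return metadata


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id_no TEXT, birth INTEGER)")
print(read_metadata(conn))
```

Other database systems expose the same information, including column comments where supported, through their own catalog or information schema views.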
Then, the data type and data characteristic values of each field in the database are acquired. That is, the data of each field in the database is traversed to obtain the characteristics of the data stored in that field.
In this embodiment, the data types include the text type, the numeric type, the date-time type, the short text type, the enumerated dictionary type, and the binary type. Different data characteristic values are acquired for fields of different data types, specifically as follows:
if the data type of a field is the text type, acquiring the minimum and maximum text length;
if the data type of a field is the numeric type or the date-time type, acquiring the maximum and minimum of the data and, further, the average, the variance, and the like;
if the data type of a field is the short text type or the enumerated dictionary type, acquiring the value range list of the values;
if the data type of a field is the binary type, acquiring the minimum and maximum length of the binary data.
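The per-type characteristic values listed above can be sketched in Python as follows. The type labels, the function name, the handling of NULL values, and the use of the population variance are assumptions made for illustration.

```python
from statistics import mean, pvariance


def profile_field(data_type, values):
    """Collect the characteristic values for one field.

    `data_type` is one of the hypothetical labels "text", "numeric",
    "datetime", "short_text", "enum_dict", "binary".
    """
    values = [v for v in values if v is not None]  # assumption: skip NULLs
    if not values:
        return {}
    if data_type in ("text", "binary"):
        lengths = [len(v) for v in values]
        return {"min_length": min(lengths), "max_length": max(lengths)}
    if data_type in ("numeric", "datetime"):
        # Assumption: date-time values were already converted to numbers,
        # for example milliseconds since 1970.
        return {"min": min(values), "max": max(values),
                "mean": mean(values), "variance": pvariance(values)}
    if data_type in ("short_text", "enum_dict"):
        return {"value_list": sorted(set(values))}
    raise ValueError(f"unknown data type: {data_type}")


print(profile_field("text", ["13912345678", "8613912345678", None]))
print(profile_field("enum_dict", ["M", "F", "M"]))
```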
S2: data verification. Each field is checked against the preset data type, data value range, data length, and data rules of the personal information data items, using the data type and data characteristic values of that field, to determine the candidate personal information data items corresponding to each field.
Any data field stored in the database may hold personal information, and personal information covers a relatively large number of data items, such as name, identity card number, mobile phone number, home address, religious belief, and so on. Checking the data type, value range, length, rules, and so on either determines the matching personal information data item outright or narrows it down to a certain range. Passing data verification is a necessary condition for the matching degree to be calculated.
In this embodiment, data verification includes data type verification, data value range verification, data length verification, and data rule verification.
Data type verification checks the relationship between the data storage type and the personal information data item; a data item may have one or more possible data storage types. For example, the data storage type of the identity card number data item is usually the text type, so the numeric, date, binary, and similar data types can be excluded. The data storage type of the date of birth data item may be the text type, the date-time type, or the numeric type (milliseconds since 1970).
Most numeric data items, such as age, height, and date of birth, have a corresponding value range. Data items with a value range requirement are therefore type-converted and then range-checked. For example, for a field whose data type is the numeric type or the date-time type, it is determined whether the maximum and minimum of the data both fall within the value range of a given personal information data item; if so, that personal information data item is considered a data item that the field may match.
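The value range check described above amounts to testing whether a field's observed minimum and maximum both lie inside a data item's permitted range. A minimal Python sketch, in which the data item names and their ranges are hypothetical examples rather than values from the text:

```python
# Hypothetical preset value ranges for a few personal information data items.
ITEM_RANGES = {
    "age": (0, 150),
    "height_cm": (30, 250),
}


def passes_range_check(field_min, field_max, item_min, item_max):
    """True if the whole observed range lies inside the item's range."""
    return item_min <= field_min and field_max <= item_max


def range_candidates(field_min, field_max, item_ranges=ITEM_RANGES):
    """Return the data items whose value range contains the field's data."""
    return [name for name, (low, high) in item_ranges.items()
            if passes_range_check(field_min, field_max, low, high)]


print(range_candidates(18, 65))    # only "age" can hold values below 30
print(range_candidates(50, 120))   # both items remain candidates
```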
In addition, fields whose data type is the short text type or the enumerated dictionary type also have corresponding value ranges, so a value range check can be applied to them as well.
Text-type data items and binary data items usually have a length requirement. The length range may be discrete or continuous. For example, an identity card number has 15 or 18 characters, a mobile phone number has 11, 14, or 15 characters (with the international code), and a license plate number has 7 to 8 characters. The data of fields whose data type is the text type or the binary type can therefore be length-checked.
A specific personal information data item usually also has a verification rule. For example, the last character of the identity card number is a check character calculated from the first 17 digits, and the first 3 digits of a mobile phone number must fall within certain ranges, such as 130-139 and 145-147. Depending on the verification rule, the check character can be verified by calculation, or a regular expression can be used to verify the structure of the data's constituent elements.
For example, if the data type of a field is the text type, the minimum text length is 11, the maximum text length is 15, and the first 3 digits of every value fall within the preset ranges, the mobile phone number data item is considered a candidate personal information data item for the field.
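The two kinds of rule verification mentioned above, check character calculation and regular expression structure checks, can be sketched in Python. The check character routine follows the weighted mod-11 scheme used for 18-character resident identity numbers; the phone pattern covers only the prefixes quoted in the text (130-139 and 145-147) and only 11-digit numbers without an international code, so it is an illustration rather than a complete rule. Function names are hypothetical.

```python
import re

# Position weights and check characters of the 18-character identity number.
ID_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
ID_CHECK_CHARS = "10X98765432"


def id_check_char(first17):
    """Compute the check character from the first 17 digits."""
    total = sum(int(digit) * weight for digit, weight in zip(first17, ID_WEIGHTS))
    return ID_CHECK_CHARS[total % 11]


def is_valid_id18(value):
    """Structure check by regular expression, then check character."""
    if re.fullmatch(r"[0-9]{17}[0-9X]", value) is None:
        return False
    return id_check_char(value[:17]) == value[17]


# Only the prefixes quoted in the text; real prefix lists are longer.
MOBILE_RE = re.compile(r"(?:13[0-9]|14[5-7])[0-9]{8}")


def looks_like_mobile(value):
    return MOBILE_RE.fullmatch(value) is not None


print(is_valid_id18("11010519491231002X"))
print(looks_like_mobile("13912345678"))
```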
S3: matching degree calculation. The name matching degree, the entity recognition matching degree, and the numerical fingerprint matching degree of each field and each of its candidate personal information data items are calculated, and the final matching degree of each field and each candidate personal information data item is calculated from these three matching degrees.
The data verification stage can generally only narrow the personal information data items that a stored data field may match down to a certain range. Only a data item with an explicit rule, such as the identity card number, can essentially be determined by data verification alone. The possibly matching data items must therefore be further selected by other means. This stage calculates matching degrees along the dimensions of field name, entity recognition, numerical fingerprint, and so on, and intelligently matches data items according to the weights, the threshold, and the magnitude of the matching degree values.
Specifically, the stage comprises the following steps:
S301: calculating the matching degree between the field name and field description of each field and the data item name of each corresponding candidate personal information data item, to obtain the name matching degree between each field and each candidate personal information data item.
The field name plays a key role in matching fields, but the same data item can be expressed in many ways; for example, the data item "name" may be expressed as "name", "chinese name", "great name", "opposite name", and the like. The matching degree can be measured by using natural language processing techniques to compute the distance between the field name and the data item name. Field names are typically very terse, so the field descriptions (remarks) in the database metadata can also be used to help measure the matching degree.
The natural language processing technique word2vec (word to vector) reduces the processing of text content to vector operations in a vector space. Using word2vec requires preparing a large corpus for deep learning training, and the trained model can then be used to calculate word vectors. In this embodiment, the similarity between two vectors in an inner product space is measured by the cosine of the angle between them. The cosine of the angle θ between two vectors A and B is calculated as follows:
$$\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$
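The cosine of the angle between two vectors is their dot product divided by the product of their lengths. A minimal Python sketch on plain lists, where returning 0.0 for a zero-length vector is an assumption made so that an empty or unknown word vector does not raise an error:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (equal-length sequences)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat a zero vector as "no similarity"
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # parallel vectors
```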
Accordingly, the name matching degree in this step is calculated as follows:
The word vector of the field name of a field is calculated to obtain a first word vector, and the word vector of the field description of the field is calculated to obtain a second word vector; at the same time, the word vector of the data item name of a candidate personal information data item corresponding to the field is calculated to obtain a third word vector.
For example, assuming that a field name of a field is FIELDNAME, a word2vec converted word vector is vecF, a field description of the field is fieldDescription, a word2vec converted word vector is vecT, a data item name of a candidate personal information data item corresponding to the field is TARGETFIELDNAME, a word2vec converted word vector is vecD, a calculation formula of a name matching degree degreel between the field and the candidate personal information data item is as follows:
,
wherein weight1 is the first weight coefficient corresponding to the field name, and weight2 is the second weight coefficient corresponding to the field description.
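The name matching degree described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names, the default weight values, and the handling of zero vectors are assumptions, and the word vectors are taken as plain lists of numbers rather than the output of a trained word2vec model. The weighted-sum form follows the description of weight1 and weight2 and of the two cosine similarities in claim 4.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector is treated as having no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


def name_matching_degree(vec_f, vec_t, vec_d, weight1=0.7, weight2=0.3):
    """Weighted sum of two cosine similarities.

    vec_f: word vector of the field name (first word vector)
    vec_t: word vector of the field description (second word vector)
    vec_d: word vector of the candidate data item name (third word vector)
    weight1, weight2: hypothetical default weights, not taken from the source
    """
    first = cosine_similarity(vec_f, vec_d)
    second = cosine_similarity(vec_t, vec_d)
    return weight1 * first + weight2 * second
```

With these hypothetical weights, a field whose name vector coincides with the data item vector but whose description vector is orthogonal to it scores 0.7, so the field name dominates while the description only assists.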
S302, calculating, by means of a conditional random field, the matching degree between the data values of each field whose data type is the short text type and each of its candidate personal information data items, to obtain the entity identification matching degree between each field whose data type is the short text type and each of its candidate personal information data items.
Named entity recognition identifies entities such as person names, organization names and place names in text, and can be used to further determine the matching data items for text-type data. A conditional random field (CRF) is a conditional probability distribution model of a set of output random variables given a set of input random variables, characterized by the assumption that the output random variables form a Markov random field. This machine learning technique is commonly used for entity recognition: it learns automatically from a training data set, and the trained model can be used to recognize text and match possible person names, addresses, work units, and so on.
If the data stored in a database field is of the short text type and the candidate personal information data item to be matched is an identifiable named entity, the matching degree between the field and the candidate personal information data item is measured by the prediction score of entity recognition. Otherwise, the entity identification matching degree is 0. That is, for fields whose data type is not the short text type, and for fields whose data type is the short text type but whose candidate personal information data items are not identifiable named entities, the entity identification matching degree between the field and its candidate personal information data items is directly set to 0.
In this embodiment, the entity identification matching degree in this step is calculated as follows:
A field FIELDNAME whose data type is the short text type is obtained, and a preset conditional random field is then used to calculate the prediction score of the data value fieldValue of the field FIELDNAME for a candidate personal information data item TARGETFIELDNAME corresponding to the field. Denoting this score by crfScore, the entity identification matching degree degree2 between the field FIELDNAME and the candidate personal information data item TARGETFIELDNAME is calculated as follows:
$$\mathrm{degree2} = \mathrm{crfScore} \times \mathrm{weight3},$$
where crfScore is the score of the data value fieldValue predicted by the conditional random field (CRF), and weight3 is the third weight coefficient, corresponding to entity identification.
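The rule for the entity identification matching degree can be sketched as follows. This is a hedged illustration only: the function name, the string label `"short_text"`, the boolean flag and the default weight are all hypothetical, the prediction score is passed in as a number rather than produced by a trained CRF model, and combining the score with weight3 by multiplication is an assumption.

```python
def entity_matching_degree(data_type, is_named_entity_item, crf_score, weight3=1.0):
    """Entity identification matching degree for one field and one candidate item.

    data_type: data type label of the field (hypothetical label "short_text")
    is_named_entity_item: True if the candidate data item is an identifiable
        named entity (e.g. a person name or an address)
    crf_score: prediction score from a trained CRF model, supplied by the caller
    weight3: weight for entity identification (hypothetical default)
    """
    # Fields that are not short text, or candidate items that are not
    # identifiable named entities, get a matching degree of 0.
    if data_type != "short_text" or not is_named_entity_item:
        return 0.0
    # Assumption: the score and the weight are combined by multiplication.
    return crf_score * weight3
```

Keeping the CRF model outside the function makes the zero-score rule easy to check independently of any particular entity recognition library.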
S303, calculating the matching degree between the numerical fingerprint of each field whose data type is the numeric type or the date-time type and the numerical fingerprint of each of its candidate personal information data items, to obtain the numerical fingerprint matching degree between each such field and each of its candidate personal information data items.
The numerical distribution of some personal information, such as age and date of birth, follows certain patterns. A numerical fingerprint can be generated from the characteristics of the numerical distribution, and the matching degree between two items can be measured by the distance between their numerical fingerprints.
In this embodiment, fields of different data types have different numerical fingerprints. For example, for numeric information, the numerical fingerprint consists of six dimensions: the minimum, first quartile, median, third quartile, maximum and variance. Therefore, the numerical fingerprint matching degree in this step is calculated as follows:
A field whose data type is the numeric type or the date-time type is obtained, and all values of the field are then traversed and counted to obtain six-dimensional characteristic values, namely the minimum, first quartile, median, third quartile, maximum and variance, which are combined to obtain the numerical fingerprint of the field. At the same time, the six-dimensional characteristic values of a candidate personal information data item corresponding to the field are obtained, giving the numerical fingerprint of the candidate personal information data item. The standardized Euclidean distance is then used to measure the matching degree between the numerical fingerprints. The numerical fingerprint matching degree between the field and the candidate personal information data item is calculated as follows:
,
where C is the numerical fingerprint of the field fieldValue, D is the numerical fingerprint of the candidate personal information data item targetFieldValue, and n is the number of dimensions of the numerical fingerprint (n = 6 in this embodiment, i.e. six dimensions in total); the subscript k denotes the k-th dimension of a numerical fingerprint, e.g. C1 is the first dimension, i.e. the minimum; the standard deviation term in the formula is the standard deviation of the numerical fingerprint D in the k-th dimension; and weight4 is the fourth weight coefficient, corresponding to the numeric-type numerical fingerprint.
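The six-dimensional fingerprint and the standardized Euclidean distance can be sketched as follows. Several points are assumptions rather than statements from the text: the quartile interpolation method, the use of the population variance, the name `s` for the per-dimension standard deviations, and in particular the mapping from distance to matching degree, taken here as weight4 / (1 + distance) purely for illustration.

```python
import math
import statistics


def numeric_fingerprint(values):
    """Return (min, Q1, median, Q3, max, variance) for a list of numbers."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("at least two values are required")
    # Assumption: inclusive quartile interpolation and population variance.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (data[0], q1, median, q3, data[-1], statistics.pvariance(data))


def standardized_euclidean(c, d, s):
    """Standardized Euclidean distance between fingerprints c and d.

    s holds one positive standard deviation per dimension (hypothetical name).
    """
    if not (len(c) == len(d) == len(s)):
        raise ValueError("c, d and s must have the same length")
    if any(sk <= 0 for sk in s):
        raise ValueError("standard deviations must be positive")
    return math.sqrt(sum(((ck - dk) / sk) ** 2 for ck, dk, sk in zip(c, d, s)))


def numeric_fingerprint_degree(c, d, s, weight4=1.0):
    """Assumed mapping: identical fingerprints score weight4, distant ones near 0."""
    return weight4 / (1.0 + standardized_euclidean(c, d, s))
```

Date-time values would first need converting to numbers, for example timestamps, before being passed to `numeric_fingerprint`; that conversion is left out of this sketch.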
S304, calculating the coverage between the value set of each field whose data type is the enumerated dictionary type and the dictionary value set of each of its candidate personal information data items, to obtain the numerical fingerprint matching degree between each field whose data type is the enumerated dictionary type and each of its candidate personal information data items.
Enumerated dictionary data items include religious belief, ethnicity, education level, blood type, and so on. The dictionary values, arranged in natural order, form a set that characterizes the value distribution, and the matching degree is measured by the coverage between the sets.
Therefore, the numerical fingerprint matching degree in this step is calculated as follows:
A field whose data type is the enumerated dictionary type is obtained, and all values of the field are then sorted in ascending natural order to obtain the value set of the field. At the same time, the dictionary value set of a candidate personal information data item corresponding to the field is obtained; the dictionary value set of a candidate personal information data item may refer to national, local and industry standards, for example the national standard GB3304-91. The numerical fingerprint matching degree between the field and the candidate personal information data item is calculated as follows:
,
where E is the value set of the field fieldValue in the database, F is the dictionary value set of the candidate personal information data item targetFieldValue corresponding to the field, and weight5 is the fifth weight coefficient, corresponding to the enumerated-dictionary-type numerical fingerprint.
Further, for fields whose data type is not the numeric type, the date-time type or the enumerated dictionary type, the numerical fingerprint matching degree between these fields and their respective candidate personal information data items is set to 0.
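The coverage-based matching degree for enumerated dictionary fields can be sketched as follows. The definition of coverage used here, the share of distinct field values that also appear in the dictionary value set, is an assumption, as are the function name and the default weight; the text only states that the degree is derived from the coverage and weight5.

```python
def coverage_degree(field_values, dictionary_values, weight5=1.0):
    """Matching degree from the coverage of field values by a dictionary.

    field_values: values found in the database field (set E after deduplication)
    dictionary_values: standard dictionary values of the candidate item (set F)
    weight5: weight for the enumerated dictionary fingerprint (hypothetical default)
    """
    e = set(field_values)
    f = set(dictionary_values)
    if not e:
        return 0.0  # an empty field cannot match any dictionary
    # Assumed coverage: fraction of distinct field values found in the dictionary.
    return weight5 * len(e & f) / len(e)
```

For instance, a field containing the blood type codes A, B, O and an unknown code X, compared against the dictionary {A, B, AB, O}, has three of its four distinct values covered. Because Python sets are unordered, the ascending sort mentioned in the text does not affect this calculation.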
S305, calculating the final matching degree between each field and each of its candidate personal information data items according to the name matching degree, the entity identification matching degree and the numerical fingerprint matching degree between the field and the candidate personal information data item.
The matching degrees of these dimensions are summed to obtain the final matching degree: degree = degree1 + degree2 + degree3.
S4, determining target personal information data items matched with the fields.
The matching degree calculation above yields the final matching degree between each field and each of its candidate personal information data items. For each field, the final matching degrees of its candidate personal information data items are then compared. A match is valid only if its final matching degree is greater than a preset threshold G; if, for one field, the final matching degrees of several candidate personal information data items are greater than the threshold G, the candidate with the largest final matching degree is taken as the best match.
And if the final matching degree of one field and one candidate personal information data item is maximum and is larger than a preset threshold value, taking the candidate personal information data item as a target personal information data item matched with the field.
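The summation and threshold selection can be sketched as follows. The function names and the dictionary layout of the candidates are hypothetical; the summation and the rule "largest final matching degree above the threshold" follow the text.

```python
def final_degree(degree1, degree2, degree3):
    """Final matching degree: sum of name, entity and fingerprint degrees."""
    return degree1 + degree2 + degree3


def best_match(candidates, threshold):
    """Pick the target data item for one field.

    candidates: dict mapping a candidate data item name to its
        (degree1, degree2, degree3) tuple (hypothetical layout)
    threshold: the preset threshold G
    Returns (name, final degree) of the best valid match, or None if no
    candidate exceeds the threshold.
    """
    best = None
    for name, parts in candidates.items():
        total = final_degree(*parts)
        # Only degrees strictly greater than the threshold count as valid.
        if total > threshold and (best is None or total > best[1]):
            best = (name, total)
    return best
```

Returning `None` when nothing exceeds the threshold keeps "no personal information detected" distinct from a weak match.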
In this embodiment, data exploration is carried out from two aspects, namely the metadata of the database and the distribution characteristics obtained by traversing the data; the data are verified in terms of data type, data value range, data length and data rules, which quickly narrows down the personal information data items that may match; the matching degree between a field and its candidate personal information data items is then calculated along three dimensions, namely field name, data entity identification and data fingerprint; and finally the target personal information data item is determined by means of the threshold and the final matching degree value, so that the personal information involved in the database is detected automatically and intelligently.
Intelligently matching the data items in data resources to the personal information data items they involve provides scientific and efficient support for data inspection, offers a reference for quality management in data governance, and allows abnormal data and abnormal values to be detected intelligently. The data verification and data fingerprint schemes can also be used to intelligently match data types during data access, reducing the workload of manually matching data items.
As regulators continue to improve the laws and regulations on data security and society attaches increasing importance to data security, this embodiment can be used by data security inspection departments to quickly survey the data resources of network operators, intelligently detect the personal information involved, and provide strong support for data asset reports, thereby helping to build a peaceful, secure, open and cooperative cyberspace; it also provides an efficient reference scheme for the self-regulation of data managers.
Example two
The present embodiment is a computer readable storage medium corresponding to the above embodiment, and has a computer program stored thereon, where the computer program when executed by a processor implements the processes in the above personal information identification method embodiment, and the same technical effects can be achieved, and for avoiding repetition, a detailed description is omitted here.
In summary, the personal information identification method and computer-readable storage medium provided by the invention carry out data exploration from two aspects, namely database metadata and the distribution characteristics obtained by traversing the data, and verify the data in terms of data type, data value range, data length and data rules, thereby quickly determining the personal information data items that may match. The matching degree between a field and its candidate personal information data items is then calculated along three dimensions, namely field name, data entity identification and data fingerprint, and the target personal information data item is determined according to the final matching degree, which achieves automatic and intelligent detection of the personal information involved in a database while ensuring detection accuracy. By exploiting the characteristics of the metadata structure and the data value ranges, and by combining data verification with matching degree calculation in a progressive, step-by-step manner, the invention matches data items intelligently, so that stored personal information items are detected automatically and support is provided for data security supervision and management.
The foregoing description is only illustrative of the present invention and is not intended to limit the scope of the invention, and all equivalent changes made by the specification and drawings of the present invention, or direct or indirect application in the relevant art, are included in the scope of the present invention.

Claims (7)

1. A personal information identification method, comprising:
Acquiring metadata of a database, and acquiring data types and data characteristic values of various fields in the database;
Respectively acquiring data characteristic values based on the data types of the fields, wherein the data characteristic values comprise a data length range of a text type field, a statistical characteristic value of a digital type or date and time type field, a value range list of a short text type or enumerated dictionary type field, and a data length range of a binary type field;
Respectively checking each field according to the data type, the data value field, the data length and the data rule of the preset personal information data item and the data type and the data characteristic value of each field to determine candidate personal information data items corresponding to each field;
the name matching degree, the entity identification matching degree and the numerical fingerprint matching degree of each field and each candidate personal information data item corresponding to each field are calculated respectively, and the final matching degree of each field and each candidate personal information data item is calculated according to the name matching degree, the entity identification matching degree and the numerical fingerprint matching degree specifically as follows:
Respectively calculating the field names of the fields and the matching degree of the field descriptions and the data item names of the candidate personal information data items corresponding to the field names of the fields to obtain the name matching degree of the fields and the candidate personal information data items;
calculating the matching degree of the data value of the field with the data type of the short text type and each candidate personal information data item corresponding to the data value through the conditional random field to obtain the entity identification matching degree of the field with the data type of the short text type and each candidate personal information data item;
respectively calculating the matching degree between the numerical fingerprint of each field whose data type is the digital type or the date and time type and the numerical fingerprint of each candidate personal information data item corresponding to it, to obtain the numerical fingerprint matching degree between each field whose data type is the digital type or the date and time type and each of its candidate personal information data items;
Respectively calculating the coverage between the data set of each field whose data type is the enumerated dictionary type and the dictionary value set of each candidate personal information data item corresponding to it, to obtain the numerical fingerprint matching degree between each field whose data type is the enumerated dictionary type and each of its candidate personal information data items;
calculating the final matching degree of each field and each candidate personal information data item according to the name matching degree, the entity identification matching degree and the numerical fingerprint matching degree of each field and each candidate personal information;
The matching degree of the data value of the field with the data type of the short text type and each candidate personal information data item corresponding to the field is calculated through the conditional random field, and the entity identification matching degree of the field with the data type of the short text type and each candidate personal information data item is obtained specifically as follows:
acquiring a field with a data type of a short text type;
Calculating a prediction score of a candidate personal information data item corresponding to the data value of the field through a preset conditional random field;
calculating the entity identification matching degree of the field and the candidate personal information data item according to the prediction score and a preset third weight coefficient;
The respectively calculating the matching degree between the numerical fingerprint of each field whose data type is the digital type or the date and time type and the numerical fingerprint of each candidate personal information data item corresponding to it, to obtain the numerical fingerprint matching degree between each such field and each of its candidate personal information data items, is specifically:
Acquiring a field of a data type which is a digital type or a date and time type;
Acquiring a six-dimensional characteristic value according to the numerical value of the field to obtain a numerical fingerprint of the field, wherein the six-dimensional characteristic value comprises a minimum value, a first quartile, a median, a third quartile, a maximum value and a variance;
Acquiring a six-dimensional characteristic value of a candidate personal information data item corresponding to the field, and obtaining a numerical fingerprint of the candidate personal information data item;
Calculating Euclidean distance according to the numerical fingerprint of the field and the numerical fingerprint of the candidate personal information data item, and calculating the matching degree of the field and the numerical fingerprint of the candidate personal information data item according to the Euclidean distance and a preset fourth weight coefficient;
and if the final matching degree of one field and one candidate personal information data item is maximum and is larger than a preset threshold value, taking the candidate personal information data item as a target personal information data item matched with the field.
2. The personal information identification method of claim 1, wherein the data types include a text type, a digital type, a date and time type, a short text type, an enumerated dictionary type, and a binary type.
3. The personal information identification method according to claim 2, wherein the data type and the data characteristic value of each field in the acquisition database are specifically:
if the data type of one field is the text type, acquiring the minimum value and the maximum value of the text length;
if the data type of a field is a digital type or a date and time type, acquiring the maximum value, the minimum value, the average value and the variance of the data;
If the data type of one field is a short text type or an enumerated dictionary type, acquiring a value range list of the numerical value;
if the data type of a field is binary type, the minimum length and the maximum length of binary data are obtained.
4. The personal information identification method according to claim 1, wherein the calculating of the matching degree of the field names of the fields and the data item names of the candidate personal information data items corresponding to the field descriptions respectively, and the obtaining of the matching degree of the names of the fields and the candidate personal information data items thereof are specifically:
Calculating a word vector of a field name of a field to obtain a first word vector; calculating word vectors of field descriptions of the field to obtain second word vectors;
calculating a word vector of the data item name of a candidate personal information data item corresponding to the field to obtain a third word vector;
According to the first word vector and the third word vector, calculating cosine similarity between the field name of the field and the data item name of the candidate personal information data item to obtain first cosine similarity;
According to the second word vector and the third word vector, calculating cosine similarity between the field description of the field and the data item name of the candidate personal information data item to obtain second cosine similarity;
and calculating the name matching degree of the field and the candidate personal information data item according to the first cosine similarity, the second cosine similarity and the preset first weight coefficient and second weight coefficient.
5. The personal information identification method according to claim 1, wherein the respectively calculating the coverage between the data set of each field whose data type is the enumerated dictionary type and the dictionary value set of each candidate personal information data item corresponding to it, to obtain the numerical fingerprint matching degree between each field whose data type is the enumerated dictionary type and each of its candidate personal information data items, is specifically:
acquiring a field of which the data type is an enumeration dictionary type;
Acquiring the numerical value of the field, and performing ascending sort according to a natural sequence to obtain a numerical value set of the field;
Acquiring a dictionary value set of a candidate personal information data item corresponding to the field;
And calculating the coverage of the numerical value set in the dictionary value set, and calculating the numerical fingerprint matching degree of the field and the candidate personal information data item according to the coverage and a preset fifth weight coefficient.
6. The personal information identification method as claimed in claim 1, wherein the calculating the name matching degree, the entity identification matching degree and the numerical fingerprint matching degree of each field and each candidate personal information data item corresponding thereto, respectively, and calculating the final matching degree of each field and each candidate personal information data item thereof according to the name matching degree, the entity identification matching degree and the numerical fingerprint matching degree, further comprises:
setting the entity identification matching degree of the field with the data type not being the short text type and each candidate personal information data item to 0;
setting the numerical fingerprint matching degree between each field whose data type is not the digital type, the date and time type or the enumerated dictionary type and each of its candidate personal information data items to 0.
7. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-6.
CN202111479395.4A 2021-12-06 2021-12-06 Personal information identification method and computer-readable storage medium Active CN114398528B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111479395.4A CN114398528B (en) 2021-12-06 2021-12-06 Personal information identification method and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN114398528A CN114398528A (en) 2022-04-26
CN114398528B true CN114398528B (en) 2025-09-23

Family

ID=81225179

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111479395.4A Active CN114398528B (en) 2021-12-06 2021-12-06 Personal information identification method and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN114398528B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: Unit 102-402, No. 12 Guanri Road, Phase II, Software Park, Xiamen Torch High tech Zone, Xiamen, Fujian Province, 361000

Applicant after: Guotou Intelligent (Xiamen) Information Co.,Ltd.

Address before: AIU Cupressaceae No. 12 building, 361000 Fujian province Xiamen software park two sunrise Road

Applicant before: XIAMEN MEIYA PICO INFORMATION Co.,Ltd.

Country or region before: China

GR01 Patent grant