CN116910650A - Data identification method, device, storage medium and computer equipment - Google Patents
- Publication number
- CN116910650A, CN202310855260.6A
- Authority
- CN
- China
- Prior art keywords
- data
- identified
- identification
- indicator
- item
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24564—Applying rules; Deductive queries
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Bioethics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Medical Informatics (AREA)
- Evolutionary Computation (AREA)
- Computer Hardware Design (AREA)
- Computer Security & Cryptography (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to the field of data processing technology and discloses a data identification method, device, storage medium and computer equipment. The method includes: in response to a data identification request, obtaining the data filtering rule in the request and extracting the data to be identified from a source data table based on that rule; determining multiple indicator items to be identified of the data to be identified and their corresponding identification algorithms; identifying the field content of each indicator item to be identified with its identification algorithm to obtain identification results; and finally substituting the multiple identification results into a data matching calculation formula to obtain a calculation result, from which the data category of the data to be identified is determined. The method improves the efficiency of data identification and, by combining multiple dimensions, improves identification accuracy, so that the sensitivity identification result of the data is obtained accurately and sensitive data content can be precisely classified and controlled.
Description
Technical Field
The present invention relates to the field of data processing technology, and in particular to a data identification method, device, storage medium and computer equipment.
Background
As industries gradually standardize the classification and grading of data, and with the implementation of the data security protection law, enterprises are paying increasing attention to the data stored in their databases, especially data involving sensitive content. To classify and grade stored data well, the sensitivity of the data content must first be identified accurately; only then can the data be managed and controlled reasonably.
In the prior art, identifying the sensitivity of stored data is still limited to a preliminary identification and judgment based on the data content, or to a simple judgment of the data content using two-dimensional composite rules. When applied to large batches of data, both approaches suffer from low identification efficiency, and the resulting sensitivity identification results are prone to error. Sensitive data cannot be identified accurately, and the stored data therefore cannot be precisely identified and classified.
Summary of the Invention
In view of this, the main purpose of the data identification method, device, storage medium and computer equipment provided by the present application is to solve the technical problems of low identification efficiency and low accuracy of identification results in prior-art methods for identifying sensitive data.
According to a first aspect of the present invention, a data identification method is provided, the method including:
in response to a data identification request, obtaining the data filtering rule carried in the data identification request, and extracting the data to be identified from a source data table based on the data filtering rule;
determining multiple indicator items to be identified of the data to be identified, and obtaining the identification algorithm corresponding to each indicator item to be identified;
identifying, based on the identification algorithm, the field content of each indicator item to be identified of the data to be identified, to obtain the identification result corresponding to each indicator item to be identified;
inputting the multiple identification results of the data to be identified into a preset rule matching calculation expression for calculation to obtain a calculation result, and determining the sensitivity identification result of the data to be identified according to the calculation result.
According to a second aspect of the present invention, a data identification device is provided, the device including:
a data extraction module, configured to, in response to a data identification request, obtain the data filtering rule carried in the data identification request and extract the data to be identified from a source data table based on the data filtering rule;
an algorithm confirmation module, configured to determine multiple indicator items to be identified of the data to be identified and obtain the identification algorithm corresponding to each indicator item to be identified;
a data identification module, configured to identify, based on the identification algorithm, the field content of each indicator item to be identified of the data to be identified, to obtain the identification result corresponding to each indicator item to be identified;
a result output module, configured to input the multiple identification results of the data to be identified into a preset rule matching calculation expression for calculation to obtain a calculation result, and to determine the sensitivity identification result of the data to be identified according to the calculation result.
According to a third aspect of the present invention, a storage medium is provided, on which a computer program is stored; when the program is executed by a processor, the above data identification method is implemented.
According to a fourth aspect of the present invention, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the above data identification method is implemented.
In the data identification method, device, storage medium and computer equipment provided by the present invention, the application first responds to a data identification request, obtains the data filtering rule carried in the request, and extracts the data to be identified from the source data table based on that rule. It then determines multiple indicator items to be identified of the data to be identified and obtains the identification algorithm corresponding to each of them. Next, based on the identification algorithm, it identifies the field content of each indicator item to be identified and obtains the corresponding identification result. Finally, it inputs the multiple identification results into a preset rule matching calculation expression, obtains a calculation result, and determines the sensitivity identification result of the data to be identified according to that result.
Before identifying the data, the above method first screens the data in the source data table using the data filtering rule in the data identification request, so that only the data that actually needs to be identified is retained. When the volume of data is large, this preliminary screening can be carried out quickly on large batches of data and effectively improves processing efficiency. The method then determines multiple indicator items to be identified of the data to be identified and, using the identification algorithm of each indicator item, identifies the field content of each item in a targeted manner. Identifying the data from multiple indicator dimensions gives a more comprehensive understanding of the data and an accurate identification result in each dimension. Finally, the identification results of the indicator items are calculated through the rule matching calculation expression, and the sensitivity identification result of the data to be identified is determined from the calculation across multiple indicator dimensions. This improves the data recognition rate, yields the data classification result of the data to be identified, and accurately identifies sensitive data. The method improves the efficiency of data identification, in particular for large batches of complex data, and by combining multiple dimensions it effectively improves identification accuracy, accurately obtains the sensitivity identification result of the data, and achieves precise classification and control of sensitive data content.
The above description is only an overview of the technical solution of the present application. So that the technical means of the present application can be understood more clearly and implemented according to the content of the specification, and so that the above and other purposes, features and advantages of the present application are more apparent, specific embodiments of the present application are set out below.
Brief Description of the Drawings
The drawings described here are provided for a further understanding of the present invention and form a part of the present application. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not unduly limit it. In the drawings:
Figure 1 shows a schematic flow chart of a data identification method in an embodiment provided by the present invention;
Figure 2 shows a schematic flow chart of a data identification method in an embodiment provided by the present invention;
Figure 3 shows a principle flow chart of a data identification method in an embodiment provided by the present invention;
Figure 4 shows a schematic structural diagram of a data identification device in an embodiment provided by the present invention;
Figure 5 shows a schematic structural diagram of a data identification device in an embodiment provided by the present invention;
Figure 6 shows a schematic structural diagram of a computer device in an embodiment provided by the present invention.
Detailed Description
Exemplary embodiments of the present application are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present application, it should be understood that the present application may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present application can be understood more thoroughly and its scope conveyed completely to those skilled in the art.
An embodiment of the present application provides a data identification method. As shown in Figure 1, the method includes the following steps:
101. In response to a data identification request, obtain the data filtering rule carried in the data identification request, and extract the data to be identified from the source data table based on the data filtering rule.
Specifically, a data filtering rule selects the data that needs to be identified or analyzed according to specific conditions. Such a condition may be a numerical range, a date range, a specific value in a certain data column, and so on. Using a data filtering rule makes it possible to determine the target data before identification begins, which effectively improves the efficiency of data identification.
The present application therefore proposes a data identification method that first responds to a data identification request sent by a user. The data identification request carries a data filtering rule, which may be edited and set by the user or selected from pre-stored rules. Whichever form is used, the data filtering rule helps the user quickly pick out the data that needs to be identified from a large batch of data, thereby improving identification efficiency. The scope of the data filtering rule can likewise be set by the user: the stricter the rule, the smaller the range of data selected, which facilitates precise analysis. In the present application, the source data table contains a large amount of data, but not all of it needs to be identified. For example, given a source data table holding a large amount of personnel information, if only people over 30 years old need to be identified, a data filtering rule can be set requiring the content of the age column to be greater than 30. This excludes people whose age indicator item is 30 or less and avoids meaningless identification work later on. The data filtering rule can thus quickly select the multiple pieces of data to be identified from the source data table for subsequent processing.
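The age example above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the record layout, the `extract_records` helper and the sample rows are all hypothetical, and the filtering rule is modelled simply as a predicate function.

```python
from typing import Any, Callable

Record = dict[str, Any]


def extract_records(source_table: list[Record],
                    rule: Callable[[Record], bool]) -> list[Record]:
    """Return only the records that satisfy the data filtering rule."""
    return [record for record in source_table if rule(record)]


# Hypothetical source data table of personnel records.
source_table = [
    {"name": "A", "age": 28},
    {"name": "B", "age": 30},
    {"name": "C", "age": 41},
]


def over_30(record: Record) -> bool:
    """Filtering rule from the example: the age column must be greater than 30."""
    return record["age"] > 30


to_identify = extract_records(source_table, over_30)
```

Only records that pass the rule reach the later identification steps, which is where the efficiency gain described above would come from.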
102. Determine multiple indicator items to be identified of the data to be identified, and obtain the identification algorithm corresponding to each indicator item to be identified.
Specifically, the data to be identified contains multiple data indicator items, each of which can serve as an indicator item to be identified. An indicator item to be identified acts as a data feature for identification and analysis; it is one dimension along which the data is identified or evaluated. In an actual identification process, the data content needs to be mined and analyzed in depth, and representative indicator items are selected so that the data can be better understood and analyzed. Each indicator item to be identified requires a specific identification method to ensure the accuracy and credibility of the result and so better support identification and analysis. The identification algorithm is the mathematical method used to analyze and identify a data indicator item, and each data indicator item has an identification algorithm suited to its own field content. Choosing an appropriate identification algorithm is very important for correct identification and analysis.
In the embodiments of the present application, after the data to be identified is extracted, its indicator items to be identified are determined from its data indicator items. The indicator items to be identified may be all of the data indicator items or a subset selected by the user, that is, the user may want to understand the data from certain specific dimensions. Once the indicator items to be identified are determined, the identification algorithm corresponding to each is obtained in preparation for the subsequent identification. In the present application, each indicator item to be identified represents one dimension along which the data is identified, so obtaining multiple indicator items to be identified means determining multiple identification dimensions. For example, database name, resource name, table name, table comment, field name, field comment, data type, data length, data proportion, deduplication proportion, any field, and any comment in the data to be identified can each serve as an identification dimension. The identification algorithm for each indicator item is obtained so that the data can be identified along each dimension and an accurate identification result obtained for that dimension.
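One way to picture the pairing of indicator items with identification algorithms is a lookup table keyed by indicator item. The sketch below is an assumption made for illustration: the registry `RECOGNIZERS`, the helper `algorithms_for`, the item keys and both recognizer functions are hypothetical, and the phone pattern is a deliberately simplified stand-in for a real algorithm.

```python
import re
from typing import Callable

Recognizer = Callable[[str], bool]


def looks_like_phone(value: str) -> bool:
    """Illustrative check: 11 digits starting with 1 (a simplified assumption)."""
    return re.fullmatch(r"1\d{10}", value) is not None


def mentions_user(value: str) -> bool:
    """Illustrative check: the text mentions 'user'."""
    return "user" in value.lower()


# Hypothetical registry: one identification algorithm per indicator item.
RECOGNIZERS: dict[str, Recognizer] = {
    "field_content": looks_like_phone,
    "table_comment": mentions_user,
}


def algorithms_for(indicator_items: list[str]) -> dict[str, Recognizer]:
    """Look up the algorithm for each selected indicator item."""
    missing = [item for item in indicator_items if item not in RECOGNIZERS]
    if missing:
        raise KeyError(f"no identification algorithm registered for: {missing}")
    return {item: RECOGNIZERS[item] for item in indicator_items}


selected = algorithms_for(["table_comment"])
```

Passing a subset of keys corresponds to the case where the user selects only some dimensions; passing every key corresponds to using all data indicator items.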
103. Based on the identification algorithm, identify the field content of each indicator item to be identified of the data to be identified, and obtain the identification result corresponding to each indicator item to be identified.
In the embodiments of the present application, after the indicator items to be identified and their identification algorithms are determined, the field content of each indicator item is extracted and identified with the corresponding algorithm to obtain an identification result. For example, when the data to be identified is a set of sales data, a specific identification algorithm such as a decision tree algorithm can be used to determine whether the end product in a given indicator item belongs to a specific product category, and thus to determine the sales of that product. Identifying the field content of each indicator item with its own algorithm yields accurate identification results and makes the data to be identified easier to understand and analyze intuitively and accurately.
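The per-item identification step can be sketched as applying each algorithm to the field content of its own indicator item. Everything named here is hypothetical: the `recognize` helper, the field values and the two substring checks stand in for whatever algorithms (a decision tree, for instance) an implementation would actually use.

```python
from typing import Callable


def recognize(fields: dict[str, str],
              algorithms: dict[str, Callable[[str], bool]]) -> dict[str, bool]:
    """Apply each indicator item's algorithm to that item's field content."""
    return {item: algorithm(fields[item]) for item, algorithm in algorithms.items()}


# Hypothetical field contents of one piece of data to be identified.
fields = {"field_name": "home_address", "field_comment": "customer residence"}

# Hypothetical per-item algorithms; simple substring checks for illustration.
algorithms = {
    "field_name": lambda text: "add" in text,
    "field_comment": lambda text: "user" in text,
}

results = recognize(fields, algorithms)
```

Each entry of `results` is one dimension's identification result, ready to be combined in the next step.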
104. Input the multiple identification results of the data to be identified into the preset rule matching calculation expression for calculation, obtain a calculation result, and determine the sensitivity identification result of the data to be identified according to the calculation result.
Specifically, the rule matching calculation expression computes a rule matching result from the identification results of the different indicator dimensions and a set of operators. The identification result of each indicator dimension is entered into an expression composed of operators and evaluated according to predefined operation rules to give the final rule matching result. In this process, the identification result of each indicator dimension is mapped to a numerical or logical value and filled into the corresponding position of the expression. The expression can be used to compute the relationship between two indicator dimensions or to judge whether a certain condition holds.
In the embodiments of the present application, substituting the identification results of the multiple indicator items into the rule matching calculation expression means combining and calculating those results according to a predefined formula to obtain a calculation result. The calculation result may be a specific value used to evaluate the data to be identified, such as a score in some respect; a classification label, such as excellent, good or poor; or a division by indicator, such as normal or risky. From these different types of calculation result, the data category of the data to be identified can be determined accurately, which gives an accurate classification result and makes the value and characteristics of the data easier to understand.
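As a rough sketch of how logical identification results might be combined, the following evaluator treats the rule matching calculation expression as a nested tuple of operators and indicator names. The tuple encoding, the operator set (`and`, `or`, `not`), the indicator names and the final labels are all assumptions chosen for illustration; the source does not specify the expression syntax.

```python
from typing import Union

# An expression is either an indicator name or (operator, operand, ...).
Expr = Union[str, tuple]


def evaluate(expr: Expr, results: dict[str, bool]) -> bool:
    """Evaluate a nested logical expression over per-indicator results."""
    if isinstance(expr, str):
        return results[expr]
    operator, *operands = expr
    if operator == "not":
        return not evaluate(operands[0], results)
    values = [evaluate(operand, results) for operand in operands]
    if operator == "and":
        return all(values)
    if operator == "or":
        return any(values)
    raise ValueError(f"unknown operator: {operator}")


# Hypothetical rule: field name matches AND (comment matches OR content matches).
expression = ("and", "field_name", ("or", "field_comment", "data_content"))
results = {"field_name": True, "field_comment": False, "data_content": True}

is_sensitive = evaluate(expression, results)
label = "sensitive" if is_sensitive else "not sensitive"
```

A numeric variant would map each result to a score and compare a weighted sum with a threshold; the logical form is shown here only because it is the shortest to verify by hand.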
The principle flow chart of the data identification method, device, storage medium and computer equipment provided by the present invention is shown in Figure 3. The application first responds to a data identification request, obtains the data filtering rule carried in the request, and extracts the data to be identified from the source data table based on that rule. It then determines multiple indicator items to be identified of the data to be identified and obtains the identification algorithm corresponding to each of them. Next, based on the identification algorithm, it identifies the field content of each indicator item to be identified and obtains the corresponding identification result. Finally, it inputs the multiple identification results into the preset rule matching calculation expression, obtains a calculation result, and determines the sensitivity identification result of the data to be identified according to that result.
Before identifying the data, the above method first screens the data in the source data table using the data filtering rule in the data identification request, so that only the data that actually needs to be identified is retained. When the volume of data is large, this preliminary screening can be carried out quickly on large batches of data and effectively improves processing efficiency. The method then determines multiple indicator items to be identified of the data to be identified and, using the identification algorithm of each indicator item, identifies the field content of each item in a targeted manner. Identifying the data from multiple indicator dimensions gives a more comprehensive understanding of the data and an accurate identification result in each dimension. Finally, the identification results of the indicator items are calculated through the rule matching calculation expression, and the sensitivity identification result of the data to be identified is determined from the calculation across multiple indicator dimensions. This improves the data recognition rate, yields the data classification result of the data to be identified, and accurately identifies sensitive data. The method improves the efficiency of data identification, in particular for large batches of complex data, and by combining multiple dimensions it effectively improves identification accuracy, accurately obtains the sensitivity identification result of the data, and achieves precise classification and control of sensitive data content.
An embodiment of the present application also provides a data identification method. As shown in Figure 2, the method includes the following steps:
201. In response to a data identification request, obtain the data filtering rule carried in the data identification request.
In the embodiments of the present application, a data identification request is a request that the user sends to the system by some means, such as a network interface, an application program or a human-computer interaction interface, asking for certain specific data to be identified and the relevant identification results returned. In practice, a data identification request is usually triggered by a business requirement, problem or scenario. Data identification requests may also take different forms and carry different content in different application scenarios. In the present application, the data identification request carries a data filtering rule, which is used to screen the data in the source data table in a preliminary pass, so that the method provided by the present application can be applied to scenarios involving large batches of data.
202、基于数据筛选规则在源数据表中提取出待识别数据。202. Extract the data to be identified from the source data table based on the data filtering rules.
Specifically, the source data table includes multiple data records, and each data record includes multiple data indicator items. First, the data filtering rules are obtained and the data filtering conditions in them are determined, where each data filtering condition includes a determination indicator item and a determination condition. Then the determination indicator item is matched against the data indicator items one by one, and the data indicator item identical to the determination indicator item is extracted. Finally, the field content of that data indicator item is evaluated against the determination condition corresponding to the determination indicator item; if the field content satisfies the determination condition, the data record to which the data indicator item belongs is determined to satisfy the data filtering condition and is marked as data to be identified.
In this embodiment, once the data filtering rules are obtained, the data filtering conditions in them are determined. Each data filtering condition includes a determination indicator item and a determination condition. For example, in a source data table, the determination indicator item may be any of a range of indicators such as database name, resource name, table name, table comment, field name, field comment, data type, data length, data proportion, deduplication proportion, any field or any comment. A determination indicator item and its determination condition can be related by a variety of determination methods. For character content, the method may be contains, does not contain, equals, does not equal, regular expression, starts with a specified character, ends with a specified character, between, dictionary, and so on; for numeric content, it may be equal to, less than, greater than, less than or equal to, greater than or equal to, not equal to, interval, a special industry data algorithm, an identification algorithm, and so on. The result is data filtering conditions such as "database name equals bus", "column name contains add", "data content matches the address identification algorithm" and "table comment contains user". The determination indicator item is then matched against the data indicator items of the existing data records, and the data indicator item that is exactly identical to it is extracted. The field content of that data indicator item is evaluated against the determination condition; if it satisfies the condition, the data record is data to be identified, and the preliminary filtering of the data is complete.
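A data filtering condition of this kind can be sketched as a pair of determination indicator item and determination method. The following is a minimal illustration, not the applicant's implementation: the class `FilterCondition`, the `METHODS` table, the method names and the record keys are all hypothetical, and only a few of the character-based determination methods are covered.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical determination methods for character content (illustrative names).
METHODS: Dict[str, Callable[[str, str], bool]] = {
    "contains": lambda value, arg: arg in value,
    "not_contains": lambda value, arg: arg not in value,
    "equals": lambda value, arg: value == arg,
    "starts_with": lambda value, arg: value.startswith(arg),
    "ends_with": lambda value, arg: value.endswith(arg),
    "regex": lambda value, arg: re.search(arg, value) is not None,
}


@dataclass(frozen=True)
class FilterCondition:
    indicator: str  # determination indicator item, e.g. "database_name"
    method: str     # key into METHODS
    argument: str   # value the field content is checked against

    def matches(self, record: Dict[str, str]) -> bool:
        # Only a data indicator item with exactly the same name is considered.
        if self.indicator not in record:
            return False
        return METHODS[self.method](record[self.indicator], self.argument)


# One data record, keyed by its data indicator items (assumed layout).
record = {
    "database_name": "bus",
    "column_name": "address",
    "table_comment": "user table",
}

cond_db = FilterCondition("database_name", "equals", "bus")
cond_col = FilterCondition("column_name", "contains", "add")
cond_missing = FilterCondition("field_comment", "contains", "user")
```

Here `cond_db` and `cond_col` correspond to the examples "database name equals bus" and "column name contains add"; `cond_missing` names an indicator item the record does not have, so it cannot match.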
Further, there may be multiple data filtering conditions. First, the data records in the source data table are filtered against the data filtering conditions one by one. Then, when a data record satisfies all of the data filtering conditions at the same time, the data record is extracted and redundancy processing is performed on it. Finally, the data records that have undergone redundancy processing are integrated, and the integrated data records are marked as data to be identified.
In this embodiment, the data filtering rules usually contain multiple data filtering conditions, and a narrower filtering range makes it easier to locate the data to be identified quickly and precisely. The data records are therefore filtered one by one against each data filtering condition, and the data records that satisfy all of the data filtering conditions are determined; these are the data to be identified in the subsequent steps. All of the data records obtained are integrated and assembled, and redundancy processing removes duplicate and invalid data; the resulting set of data records becomes the data to be identified.
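The combination of several filtering conditions with redundancy processing can be sketched as follows. This is an illustrative sketch under stated assumptions: conditions are modeled as plain predicates, "redundancy processing" is interpreted as dropping exact duplicate records, and the function name `select_records` and the record keys are hypothetical.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]
Condition = Callable[[Record], bool]


def select_records(records: Iterable[Record], conditions: List[Condition]) -> List[Record]:
    """Keep records that satisfy every condition, dropping exact duplicates."""
    seen = set()
    selected: List[Record] = []
    for record in records:
        if not all(condition(record) for condition in conditions):
            continue
        key = tuple(sorted(record.items()))  # hashable fingerprint of the record
        if key in seen:
            continue  # redundancy processing: skip a repeated record
        seen.add(key)
        selected.append(record)
    return selected


rows = [
    {"database_name": "bus", "column_name": "address"},
    {"database_name": "bus", "column_name": "address"},  # exact duplicate
    {"database_name": "bus", "column_name": "price"},
    {"database_name": "hr", "column_name": "address"},
]
conditions = [
    lambda r: r["database_name"] == "bus",
    lambda r: "add" in r["column_name"],
]
to_identify = select_records(rows, conditions)
```

Only the first row survives: the second is a duplicate, and the third and fourth each fail one condition.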
203. Determine multiple indicator items to be identified for the data to be identified, the indicator items to be identified including qualitative indicator items and quantitative indicator items.
Specifically, the source data table information of the data to be identified is first obtained, and metadata indicator items are generated from it. The data indicator items of the data to be identified are then obtained and integrated with the metadata indicator items to give the qualitative indicator items of the data to be identified. Next, the field content of each data indicator item in the data to be identified is obtained, and statistical calculation is performed on the field content with a preset statistical algorithm to obtain a statistical result. Finally, the quantitative indicator items of the data to be identified are generated from the statistical result.
In this embodiment, the indicator items to be identified are divided into qualitative indicator items and quantitative indicator items, and the qualitative indicator items are further divided into metadata indicator items and data indicator items. Specifically, after the data to be identified is obtained, the source data table information is obtained according to the source of the data, that is, the source data table to which the data to be identified belongs. The source data table information includes the name of the database in which the source data table resides, the table name of the source data table, the table comment of the source data table, and similar information; all of these are metadata indicator items of the data to be identified. The data indicator items of the data to be identified correspond to information such as the column comments and column names in the data to be identified. The metadata indicator items and the data indicator items together make up the qualitative indicator items. Processing the field content of the qualitative indicator items then yields the quantitative indicator items of the data to be identified; for example, a statistical algorithm is applied to the field content to obtain quantified indicators such as data length, data proportion and deduplication proportion, which serve as the quantitative indicator items of the data to be identified. Finally, the qualitative indicator items and the quantitative indicator items are integrated to obtain the indicator items to be identified of the data to be identified. The subsequent identification of the data to be identified is carried out on the basis of these indicator items.
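The assembly of qualitative and quantitative indicator items can be sketched as below. The source names data length, data proportion and deduplication proportion but does not define them, so the definitions here are assumptions made for illustration: data length is taken as the longest non-empty value, data proportion as the share of non-empty values, and deduplication proportion as distinct non-empty values over non-empty values. All names and sample values are hypothetical.

```python
from typing import Dict, List, Optional


def quantitative_indicators(values: List[Optional[str]]) -> Dict[str, float]:
    """Compute assumed statistics over the field content of one column."""
    total = len(values)
    non_empty = [v for v in values if v not in (None, "")]
    if total == 0 or not non_empty:
        return {"data_length": 0.0, "data_proportion": 0.0, "dedup_proportion": 0.0}
    return {
        # Assumed definition: length of the longest non-empty value.
        "data_length": float(max(len(v) for v in non_empty)),
        # Assumed definition: share of values that are non-empty.
        "data_proportion": len(non_empty) / total,
        # Assumed definition: distinct non-empty values / non-empty values.
        "dedup_proportion": len(set(non_empty)) / len(non_empty),
    }


# Qualitative indicator items: metadata items plus data (column) items.
metadata_items = {"database_name": "bus", "table_name": "customer", "table_comment": "customer info"}
data_items = {"column_name": "phone", "column_comment": "mobile number"}
qualitative = {**metadata_items, **data_items}

# Quantitative indicator items derived from the column's field content.
column_values = ["13800000000", "13800000000", "13900000001", None]
stats = quantitative_indicators(column_values)

# Indicator items to be identified: both groups integrated.
indicators = {**qualitative, **stats}
```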
204. Obtain the identification algorithm corresponding to each indicator item to be identified of the data to be identified, and identify the field content of each indicator item to be identified based on the identification algorithm, to obtain the identification result corresponding to each indicator item to be identified.
Specifically, the identification algorithm corresponding to the indicator item to be identified is first obtained, where the identification algorithm includes at least one of a string matching algorithm, a feature extraction algorithm, a statistical learning algorithm and a machine learning algorithm. The field content of the indicator item to be identified is then extracted, and the fields to be identified within the field content are determined according to the identification algorithm corresponding to the indicator item. Finally, the fields to be identified are identified with the identification algorithm to obtain the identification result of the indicator item to be identified.
In this embodiment, each indicator item to be identified corresponds to one identification algorithm. Commonly used identification algorithms include: string matching algorithms, which identify and filter sensitive content by matching specified strings, keywords, regular expressions and the like; feature extraction algorithms, which extract, analyze, model and mine feature information in the data by various methods in order to identify sensitive content; statistical learning algorithms, which build statistical models through statistical analysis and modeling of the data in order to identify and classify sensitive content; and machine learning algorithms, which train on the data using machine learning algorithms and models in order to predict and classify sensitive content. After the identification algorithm for each indicator item to be identified is obtained, the fields to be identified are extracted from the field content of each indicator item according to the category of data that its identification algorithm handles, and are identified, finally yielding the identification result corresponding to each indicator item to be identified.
Specifically, as shown in Figure 3, each indicator item to be identified represents one dimension of data identification. Processing and identifying the data to be identified on the basis of a single indicator item gives an accurate view of the data along one dimension; this application identifies the field content of multiple indicator items to be identified and obtains identification results across multiple dimensions, so that the data is identified comprehensively and accurately.
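The mapping from indicator items to identification algorithms can be sketched with two simple string matching algorithms. This is only an illustration: the keyword list, the phone-number pattern, the `IDENTIFIERS` table and the sample data are assumptions, and the feature extraction, statistical learning and machine learning algorithms named in the text are not shown.

```python
import re
from typing import Any, Callable, Dict, List

# Assumed pattern for illustration: 11 digits starting with 1.
PHONE_PATTERN = re.compile(r"1\d{10}")


def keyword_match(text: str) -> bool:
    """String matching on names and comments using an assumed keyword list."""
    return any(word in text.lower() for word in ("phone", "mobile", "tel"))


def phone_ratio(values: List[str]) -> float:
    """Share of sampled field values that fully match the phone pattern."""
    if not values:
        return 0.0
    hits = sum(1 for value in values if PHONE_PATTERN.fullmatch(value))
    return hits / len(values)


# One identification algorithm per indicator item to be identified (hypothetical keys).
IDENTIFIERS: Dict[str, Callable[[Any], Any]] = {
    "column_name": keyword_match,
    "column_comment": keyword_match,
    "sample_values": phone_ratio,
}


def identify(indicators: Dict[str, Any]) -> Dict[str, Any]:
    """Apply each indicator item's algorithm to its field content."""
    return {
        name: IDENTIFIERS[name](content)
        for name, content in indicators.items()
        if name in IDENTIFIERS
    }


results = identify({
    "column_name": "user_phone",
    "column_comment": "contact number",
    "sample_values": ["13800000000", "13900000001", "n/a", "13700000002"],
})
```

Each entry of `results` is the identification result for one dimension: the column name matches a keyword, the column comment does not, and three of the four sampled values match the pattern.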
205. Edit the rule matching calculation expression, and substitute the multiple identification results into the rule matching calculation expression for calculation.
Specifically, in response to an expression editing instruction, the editing rules of the rule matching calculation expression are obtained. Based on the editing rules, the operators associated with the multiple indicator items to be identified are obtained, where there is at least one operator and the operators include at least one of arithmetic operators, relational operators and logical operators. The multiple indicator items to be identified and the at least one operator are then combined to obtain the rule matching calculation expression.
In this embodiment, the data is identified from multiple dimensions, and the final identification result is determined by combining the identification results of all dimensions. Therefore, after the identification result of each indicator item to be identified is obtained, a rule matching calculation expression is composed to perform a combined calculation over the individual results and obtain a data identification result that takes every dimension into account. To edit the rule matching calculation expression, the operators associated with each indicator item to be identified are obtained on the basis of the indicator items determined earlier. The supported operators include the common arithmetic operators, relational operators and logical operators, so a wide range of calculations is possible. After the operators associated with all indicator items to be identified are determined, the indicator items and the operators are combined to give the rule matching calculation expression (the data matching calculation formula). For example, when the operators are logical operators, the rule matching calculation expression generated from the indicator items to be identified may be (1||2)&&3, meaning that condition 1 or condition 2 is satisfied and condition 3 is also satisfied; or it may be 1||2||3||4, meaning that at least one of condition 1, condition 2, condition 3 and condition 4 is satisfied (requiring all four conditions at the same time would instead be written 1&&2&&3&&4). The numbers 1, 2, 3 and 4 in the expression each denote the identification result of one indicator item to be identified.
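Evaluating a logical rule matching calculation expression can be sketched with a small recursive-descent parser, reading `||` as logical OR and `&&` as logical AND as in C-style languages, with `&&` binding more tightly. This is an illustrative sketch only: it handles logical operators and parentheses, not the arithmetic or relational operators mentioned in the text, and the function names are hypothetical.

```python
import re
from typing import Dict, List, Optional

TOKEN = re.compile(r"\s*(\d+|\|\||&&|\(|\))")


def tokenize(expression: str) -> List[str]:
    """Split an expression such as '(1||2)&&3' into tokens."""
    expression = expression.strip()
    tokens: List[str] = []
    pos = 0
    while pos < len(expression):
        match = TOKEN.match(expression, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(expression: str, results: Dict[int, bool]) -> bool:
    """Substitute identification results for the numbers and evaluate."""
    tokens = tokenize(expression)
    pos = 0

    def peek() -> Optional[str]:
        return tokens[pos] if pos < len(tokens) else None

    def take() -> str:
        nonlocal pos
        token = peek()
        if token is None:
            raise ValueError("unexpected end of expression")
        pos += 1
        return token

    def parse_or() -> bool:
        value = parse_and()
        while peek() == "||":
            take()
            right = parse_and()  # always parse the operand, then combine
            value = value or right
        return value

    def parse_and() -> bool:
        value = parse_primary()
        while peek() == "&&":
            take()
            right = parse_primary()
            value = value and right
        return value

    def parse_primary() -> bool:
        token = take()
        if token == "(":
            value = parse_or()
            if take() != ")":
                raise ValueError("expected ')'")
            return value
        if token.isdigit():
            return bool(results[int(token)])
        raise ValueError(f"unexpected token {token!r}")

    value = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return value


identification_results = {1: False, 2: True, 3: True, 4: False}
combined = evaluate("(1||2)&&3", identification_results)
```

A hand-written parser is used here instead of Python's `eval` so that only the intended tokens are ever accepted.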
206. Obtain the calculation result and compare it with the preset sensitivity threshold to determine the sensitivity identification result of the data to be identified.
Specifically, the preset sensitivity threshold is obtained and the calculation result is compared against it. When the calculation result is greater than the sensitivity threshold, the calculation result is marked as the first data category, and a data label of sensitive content is added to the first data category. When the calculation result is less than or equal to the sensitivity threshold, the calculation result is marked as the second data category, and a data label of non-sensitive content is added to the second data category.
In this embodiment, the calculation results obtained from different rule matching calculation expressions are evaluated against a sensitivity threshold that separates the data categories, which completes the identification and classification of the data to be identified and determines whether it is sensitive content. For example, when the calculation result is a numeric value, the preset sensitivity threshold is also a numeric value, and the two values are compared. When the calculation result is greater than the preset sensitivity threshold, the data to be identified is assigned to the first data category and determined to be sensitive content; when the calculation result is less than or equal to the preset sensitivity threshold, the data to be identified is assigned to the second data category and determined to be non-sensitive content. The preset sensitivity threshold lets users analyze the calculation result quickly, obtain the correct data category of the data to be identified, complete the whole data analysis process, and quickly determine whether the data to be identified is sensitive content, which facilitates subsequent processing.
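The threshold comparison can be sketched in a few lines. The function name, category names, label strings and numeric values below are hypothetical; only the comparison rule (greater than the threshold is sensitive, less than or equal is non-sensitive) comes from the text.

```python
from typing import Tuple


def classify(calculation_result: float, sensitivity_threshold: float) -> Tuple[str, str]:
    """Return (data category, data label) for a numeric calculation result."""
    if calculation_result > sensitivity_threshold:
        return ("first_data_category", "sensitive")
    # Less than or equal to the threshold falls into the second category.
    return ("second_data_category", "non-sensitive")


above = classify(0.8, 0.6)
at_threshold = classify(0.6, 0.6)
```

A result exactly equal to the threshold is treated as non-sensitive, matching the "less than or equal to" wording.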
The present invention provides a data identification method, device, storage medium and computer equipment. In response to a data identification request, the data filtering rules carried in the request are first obtained, and the data to be identified is extracted from the source data table based on the data filtering rules. Multiple indicator items to be identified, including qualitative indicator items and quantitative indicator items, are then determined for the data to be identified. Next, the identification algorithm corresponding to each indicator item to be identified is obtained, and the field content of each indicator item is identified with that algorithm to obtain the corresponding identification result. A rule matching calculation expression is then composed, and the multiple identification results are substituted into it for calculation. Finally, the calculation result is obtained and compared with the preset sensitivity threshold to determine the sensitivity identification result of the data to be identified.
The above method determines the data to be identified from the multiple data filtering conditions in the data filtering rules, which achieves precise filtering of large batches of data. It determines multiple indicator items to be identified, including qualitative indicator items and quantitative indicator items, and applies the identification algorithm corresponding to each indicator item to the field content in a targeted way, which yields accurate identification results. Finally, it composes a rule matching calculation expression from the indicator items to be identified, calculates over the identification results with that expression, and compares the calculation result with the preset sensitivity threshold to determine whether the data to be identified is sensitive content. The method effectively improves the efficiency of processing large batches of data, combines multiple dimensions to identify data and thereby improves identification accuracy, and uses the preset sensitivity threshold to determine precisely whether the data to be identified is sensitive content.
Further, as a specific implementation of the method of Figure 1, an embodiment of the present application provides a data identification device. As shown in Figure 4, the device includes a data extraction module 301, an algorithm confirmation module 302, a data identification module 303 and a result output module 304.
The data extraction module 301 may be used to obtain, in response to a data identification request, the data filtering rules carried in the data identification request, and to extract the data to be identified from the source data table based on the data filtering rules;
The algorithm confirmation module 302 may be used to determine multiple indicator items to be identified for the data to be identified, and to obtain the identification algorithm corresponding to each indicator item to be identified;
The data identification module 303 may be used to identify the field content of each indicator item to be identified of the data to be identified based on the identification algorithm, to obtain the identification result corresponding to each indicator item to be identified;
The result output module 304 may be used to input the multiple identification results of the data to be identified into a preset rule matching calculation expression for calculation, to obtain a calculation result, and to determine the sensitivity identification result of the data to be identified according to the calculation result.
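The way the four modules hand data to one another can be sketched as a chain of callables. This is a deliberately simplified illustration, not the structure of the claimed device: the class `DataIdentificationDevice`, its method `handle`, the request layout and the stand-in lambdas are all hypothetical.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


class DataIdentificationDevice:
    """Chains four callables that stand in for modules 301 to 304."""

    def __init__(
        self,
        extract: Callable[[Record], List[Record]],
        confirm: Callable[[List[Record]], Dict[str, Callable[[Any], Any]]],
        identify: Callable[[List[Record], Dict[str, Callable[[Any], Any]]], List[Any]],
        output: Callable[[List[Any]], str],
    ) -> None:
        self.extract = extract    # data extraction module 301
        self.confirm = confirm    # algorithm confirmation module 302
        self.identify = identify  # data identification module 303
        self.output = output      # result output module 304

    def handle(self, request: Record) -> str:
        data = self.extract(request)
        algorithms = self.confirm(data)
        results = self.identify(data, algorithms)
        return self.output(results)


# Stand-in behavior for each module, chosen only to make the sketch runnable.
device = DataIdentificationDevice(
    extract=lambda request: [r for r in request["table"] if request["rule"](r)],
    confirm=lambda data: {"column_name": lambda name: "phone" in name},
    identify=lambda data, algorithms: [
        algorithms["column_name"](record["column_name"]) for record in data
    ],
    output=lambda results: "sensitive" if any(results) else "non-sensitive",
)

request = {
    "rule": lambda record: record["database_name"] == "bus",
    "table": [
        {"database_name": "bus", "column_name": "user_phone"},
        {"database_name": "hr", "column_name": "salary"},
    ],
}
label = device.handle(request)
```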
In a specific application scenario, the data extraction module 301 may be used to obtain the data filtering rules and determine the data filtering conditions in them, where each data filtering condition includes a determination indicator item and a determination condition; to match the determination indicator item against the multiple data indicator items one by one and extract the data indicator item identical to the determination indicator item; and to evaluate the field content of the data indicator item against the determination condition corresponding to the determination indicator item, and, if the field content satisfies the determination condition, determine that the data record corresponding to the data indicator item satisfies the data filtering condition and mark the data record as data to be identified.
In a specific application scenario, the data extraction module 301 may also be used to filter the multiple data records in the source data table one by one based on multiple data filtering conditions; when a data record satisfies all of the data filtering conditions at the same time, to extract the data record and perform redundancy processing on it; and to integrate the data records that have undergone redundancy processing and mark the integrated data records as data to be identified.
In a specific application scenario, the algorithm confirmation module 302 may be used to obtain the source data table information of the data to be identified and generate metadata indicator items from it; to obtain the data indicator items of the data to be identified and integrate them with the metadata indicator items to obtain the qualitative indicator items of the data to be identified; to obtain the field content of each data indicator item in the data to be identified and perform statistical calculation on the field content with a preset statistical algorithm to obtain a calculation result; and to generate the quantitative indicator items of the data to be identified from the calculation result.
In a specific application scenario, the data identification module 303 may also be used to obtain the identification algorithm corresponding to the indicator item to be identified, where the identification algorithm includes at least one of a string matching algorithm, a feature extraction algorithm, a statistical learning algorithm and a machine learning algorithm; to extract the field content of the indicator item to be identified and determine the fields to be identified within the field content according to the identification algorithm corresponding to the indicator item; and to identify the fields to be identified with the identification algorithm to obtain the identification result of the indicator item to be identified.
In a specific application scenario, as shown in Figure 5, the device further includes a formula editing module 305. The formula editing module 305 may be used to obtain, in response to an expression editing instruction, the editing rules of the rule matching calculation expression; to obtain, based on the editing rules, the operators associated with the multiple indicator items to be identified, where there is at least one operator and the operators include at least one of arithmetic operators, relational operators and logical operators; and to combine the multiple indicator items to be identified with the at least one operator to obtain the rule matching calculation expression.
In a specific application scenario, the result output module 304 may be used to obtain the preset sensitivity threshold and compare the calculation result against it; when the calculation result is greater than the sensitivity threshold, to mark the calculation result as the first data category and add a data label of sensitive content to the first data category; and when the calculation result is less than or equal to the sensitivity threshold, to mark the calculation result as the second data category and add a data label of non-sensitive content to the second data category.
It should be noted that, for other descriptions of the functional units of the data identification device provided in this embodiment, reference may be made to the corresponding descriptions of Figures 1 and 2, which are not repeated here.
Based on the method shown in Figure 1, this embodiment correspondingly further provides a storage medium on which a computer program is stored; when the program is executed by a processor, the above data identification method is implemented.
Based on this understanding, the technical solution of this application can be embodied in the form of a software product. The software product can be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive or a portable hard disk) and includes several instructions that cause a computer device (such as a personal computer, a server or a network device) to execute the data identification method of each implementation scenario of this application.
Based on the methods shown in Figures 1 and 2 and the data identification device embodiments shown in Figures 4 and 5, and in order to achieve the above purpose, this embodiment further provides a physical device for data identification, as shown in Figure 6. The device includes a communication bus, a processor, a memory and a communication interface, and may also include an input/output interface and a display device, where the functional units communicate with one another over the bus. The memory stores a computer program, and the processor is used to execute the program stored in the memory so as to perform the data identification method of the above embodiments.
Optionally, the physical device may also include a user interface, a network interface, a camera, a radio frequency (RF) circuit, sensors, an audio circuit, a Wi-Fi module, and so on. The user interface may include a display and an input unit such as a keyboard, and may optionally also include a USB interface, a card reader interface, and the like. The network interface may optionally include a standard wired interface, a wireless interface (such as a Wi-Fi interface), and the like.
Those skilled in the art will understand that the structure of the data identification physical device provided in this embodiment does not limit the physical device, which may include more or fewer components, combine certain components, or arrange the components differently.
The storage medium may also include an operating system and a network communication module. The operating system is a program that manages the hardware and software resources of the above physical device and supports the running of the information processing program and other software and/or programs. The network communication module is used for communication among the components within the storage medium and for communication with other hardware and software in the information processing physical device.
From the above description of the embodiments, those skilled in the art will clearly understand that this application can be implemented by software together with the necessary general-purpose hardware platform, or by hardware. In applying the technical solution of this application, the data filtering rules carried in a data identification request are first obtained in response to the request, and the data to be identified is extracted from the source data table based on the data filtering rules. Multiple indicator items to be identified are then determined for the data to be identified, and the identification algorithm corresponding to each indicator item is obtained. The field content of each indicator item to be identified is then identified with its identification algorithm to obtain the corresponding identification result. Finally, the multiple identification results of the data to be identified are input into the preset rule matching calculation expression for calculation, a calculation result is obtained, and the sensitivity identification result of the data to be identified is determined according to the calculation result.
Before identifying the data, the above method first filters the data in the source data table according to the data filtering rule in the data identification request, so that only the data that actually needs identification is obtained. When the volume of data to be identified is large, this allows a fast preliminary screening of large batches of data and effectively improves processing efficiency. The method then determines multiple indicator items to be identified and, using the identification algorithm of each indicator item, identifies the field content of each indicator item one by one in a targeted way. Identifying the data along multiple indicator dimensions gives a more complete view of the data and an accurate identification result under each dimension. Finally, the identification results of the indicator items are evaluated by the rule-matching calculation expression, and the sensitivity identification result of the data to be identified is determined from the calculation result across the indicator dimensions. This improves the data recognition rate and ultimately yields the data classification result, so that sensitive data is accurately identified. The above method can improve identification efficiency, in particular for large batches of complex data; by combining multiple dimensions it effectively improves identification accuracy, yields accurate sensitivity identification results, and enables precise classification and control of sensitive data content.
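The flow summarized above (filter the source table, identify each indicator item with its own algorithm, combine the per-item results in a rule-matching expression, then map the result to a sensitivity level) can be sketched as follows. This is only an illustrative sketch, not the implementation claimed in the application: the indicator names `phone` and `id_number`, the regular expressions, the weighted-sum expression, and the thresholds are all assumptions chosen for the example.

```python
import re
from typing import Callable, Dict, List

Row = Dict[str, str]

# Hypothetical identification algorithms, one per indicator item.
# Each takes the field content and reports whether it matches.
def looks_like_phone(value: str) -> bool:
    return re.fullmatch(r"1\d{10}", value) is not None

def looks_like_id_number(value: str) -> bool:
    return re.fullmatch(r"\d{17}[\dXx]", value) is not None

# Assumed mapping from indicator item to its identification algorithm.
IDENTIFIERS: Dict[str, Callable[[str], bool]] = {
    "phone": looks_like_phone,
    "id_number": looks_like_id_number,
}

def filter_rows(source_table: List[Row], rule: Callable[[Row], bool]) -> List[Row]:
    """Pre-screen the source table with the data filtering rule."""
    return [row for row in source_table if rule(row)]

def identify(row: Row) -> Dict[str, bool]:
    """Run each indicator item's algorithm on its field content."""
    return {item: algo(row.get(item, "")) for item, algo in IDENTIFIERS.items()}

def rule_expression(results: Dict[str, bool]) -> int:
    """Assumed rule-matching expression: a weighted sum of the per-item results."""
    return 2 * int(results["id_number"]) + int(results["phone"])

def sensitivity(score: int) -> str:
    """Assumed thresholds that map the calculation result to a sensitivity level."""
    if score >= 2:
        return "high"
    if score == 1:
        return "medium"
    return "low"

def classify(source_table: List[Row], rule: Callable[[Row], bool]) -> List[str]:
    """Full flow: filter, identify per indicator item, combine, classify."""
    return [
        sensitivity(rule_expression(identify(row)))
        for row in filter_rows(source_table, rule)
    ]
```

A caller would pass the source table and a filtering predicate, for example `classify(table, lambda row: row["dept"] == "sales")`, and receive one sensitivity label per row that survives the filter. Filtering first keeps the per-indicator algorithms from running on rows that do not need identification.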
Those skilled in the art can understand that the accompanying drawings are only schematic diagrams of a preferred implementation scenario, and that the modules or processes in the drawings are not necessarily required for implementing the present application. Those skilled in the art can also understand that the modules of a device in an implementation scenario may be distributed in that device as described for the scenario, or may be changed accordingly and located in one or more devices different from those of the scenario. The modules of the above implementation scenarios may be combined into one module or further split into multiple sub-modules.
The above serial numbers of the present application are for description only and do not indicate the merits of the implementation scenarios. What is disclosed above is merely several specific implementation scenarios of the present application; the present application is not limited thereto, and any variation that those skilled in the art can conceive of shall fall within the protection scope of the present application.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310855260.6A CN116910650A (en) | 2023-07-12 | 2023-07-12 | Data identification method, device, storage medium and computer equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116910650A (en) | 2023-10-20 |
Family
ID=88366093
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310855260.6A Pending CN116910650A (en) | 2023-07-12 | 2023-07-12 | Data identification method, device, storage medium and computer equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116910650A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150324606A1 (en) * | 2014-05-10 | 2015-11-12 | Informatica Corporation | Identifying and Securing Sensitive Data at its Source |
CN113360522A (en) * | 2020-03-05 | 2021-09-07 | 奇安信科技集团股份有限公司 | Method and device for quickly identifying sensitive data |
CN113515771A (en) * | 2021-03-19 | 2021-10-19 | 卓望数码技术(深圳)有限公司 | Data sensitivity determination method, electronic device and computer-readable storage medium |
CN116150200A (en) * | 2022-11-16 | 2023-05-23 | 马上消费金融股份有限公司 | Data processing method, device, electronic equipment and storage medium |
CN116402596A (en) * | 2023-02-15 | 2023-07-07 | 恒生电子股份有限公司 | Data analysis method, device, computer equipment and readable storage medium |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117648635A (en) * | 2024-01-30 | 2024-03-05 | 深圳昂楷科技有限公司 | Sensitive information classification and grading method and system and electronic equipment |
CN117648635B (en) * | 2024-01-30 | 2024-05-03 | 深圳昂楷科技有限公司 | Sensitive information classification and grading method and system and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109522746B (en) | A data processing method, electronic device and computer storage medium | |
WO2020062660A1 (en) | Enterprise credit risk evaluation method, apparatus and device, and storage medium | |
CN113760891B (en) | Data table generation method, device, equipment and storage medium | |
CN112597182B (en) | Optimization method, device, terminal and storage medium of data query statement | |
TWI682287B (en) | Knowledge graph generating apparatus, method, and computer program product thereof | |
CN106709622A (en) | Database analysis device and database analysis method | |
CN114493255A (en) | Enterprise anomaly monitoring method and related equipment based on knowledge graph | |
CN111967437A (en) | Text recognition method, device, equipment and storage medium | |
CN106991175A (en) | Customer information mining method, device, equipment and storage medium | |
CN111768242A (en) | Order rate prediction method, device and readable storage medium | |
CN111552690A (en) | Data generation method, device, terminal and storage medium | |
WO2016188334A1 (en) | Method and device for processing application access data | |
CN114840519A (en) | Data labeling method, equipment and storage medium | |
CN111666101A (en) | Software homologous analysis method and device | |
JP6419667B2 (en) | Test DB data generation method and apparatus | |
CN116910650A (en) | Data identification method, device, storage medium and computer equipment | |
CN114398562B (en) | Shop data management method, device, equipment and storage medium | |
CN111859985A (en) | AI customer service model testing method, device, electronic equipment and storage medium | |
CN111582754A (en) | Risk checking method, device and equipment and computer readable storage medium | |
CN112434104B (en) | Redundant rule screening method and device for association rule mining | |
CN117093556A (en) | Log classification method, device, computer equipment and computer readable storage medium | |
CN110059480A (en) | Attack monitoring method, device, computer equipment and storage medium | |
CN115563943A (en) | Report processing method, device, equipment and storage medium | |
CN114780602A (en) | Data tracing analysis method and device, computer equipment and storage medium | |
CN114462405A (en) | Text type identification method and device, storage medium and electronic device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||