
CN113591485B - Intelligent data quality auditing system and method based on data science - Google Patents


Info

Publication number
CN113591485B
CN113591485B (Application CN202110671379.9A)
Authority
CN
China
Prior art keywords
data
field
type
anomaly detection
invalid
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110671379.9A
Other languages
Chinese (zh)
Other versions
CN113591485A
Inventor
黄建平
张旭东
王红凯
张建松
沈思琪
陈可
谢裕清
冯珺
阳东
刘晓枫
丁雪花
蒋斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Zhejiang Electric Power Co Ltd
Zhejiang Huayun Information Technology Co Ltd
Information and Telecommunication Branch of State Grid Zhejiang Electric Power Co Ltd
Original Assignee
State Grid Zhejiang Electric Power Co Ltd
Zhejiang Huayun Information Technology Co Ltd
Information and Telecommunication Branch of State Grid Zhejiang Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Zhejiang Electric Power Co Ltd, Zhejiang Huayun Information Technology Co Ltd, Information and Telecommunication Branch of State Grid Zhejiang Electric Power Co Ltd filed Critical State Grid Zhejiang Electric Power Co Ltd
Priority to CN202110671379.9A priority Critical patent/CN113591485B/en
Publication of CN113591485A publication Critical patent/CN113591485A/en
Application granted granted Critical
Publication of CN113591485B publication Critical patent/CN113591485B/en


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an intelligent data quality auditing system and method based on data science. The method comprises: data collection — collecting metadata of the objects under inspection and collecting and parsing log data; data feature extraction — identifying and discarding invalid tables and invalid fields, automatically revising field types according to the data content via a revision algorithm, and extracting features according to field type; anomaly detection — presetting a library of data anomaly detection methods and matching it against the data features to select and run the appropriate method; and task scheduling and orchestration — providing an orchestration server and node servers, where the orchestration server splits the above tasks into sub-steps according to task requests and distributes them to different node servers for processing. The invention lowers the barrier to data asset management and data quality governance, makes data quality auditing general-purpose, scalable, automated, and intelligent, and improves the overall efficiency and quality of data quality audits.

Description

An intelligent data quality auditing system and method based on data science

Technical Field

The invention relates to the field of data processing, and in particular to an intelligent data quality auditing system and method based on data science.

Background

With the development of the digital economy, industries no longer pursue sheer data volume; the demands placed on data quality during data application keep rising. Faced with massive data resources, discovering and locating data quality problems faster, more accurately, and more intelligently, and carrying out the corresponding governance work, is the focus and core of current enterprise-level data asset management.

In the prior art, the invention published as CN105554152A discloses a method and device for data feature extraction. In more detail, the invention published as CN108256074A discloses a verification method that comprises: obtaining models of a data warehouse to be verified, each model comprising multiple pieces of field information, the field information including a field definition and a field type; verifying the field information against a pre-stored data dictionary comprising multiple standard terms, each with a standard definition and a standard type; and, if the field definition matches the standard definition but the field type does not match the standard type, modifying the field type to be consistent with the standard type. That method verifies the data warehouse models against standard terminology and, where the definitions match but the types do not, modifies the field type to match the standard type, thereby obtaining a standardized, consistent model.

Each existing approach has its strengths and weaknesses. Under the traditional data quality governance model, however, the objects to be checked must be designated by business experts who, based on business specifications and experiential knowledge, specify particular data tables and fields and state what characteristics each field has and which rules apply to it. This places extremely high demands on the experts' experience and professional skills, keeps the scope of detection narrow, and makes the process highly dependent on individual experts. For massive, large-scale data, experts would have to designate detection objects and scopes one by one; the resulting data features generalize poorly and are time-consuming and laborious to maintain. Large-scale, automated identification of detection objects and extraction of the corresponding data features is therefore impossible, and data quality auditing remains inefficient and heavily dependent on manual experience.

Summary of the Invention

To address the low auditing efficiency and low detection accuracy of the prior art on large-scale data, the invention provides an intelligent data quality auditing system and method based on data science. Through data feature extraction, anomaly detection, and task scheduling and orchestration, it lowers the barrier to data asset management and data quality governance, makes data quality auditing general-purpose, scalable, automated, and intelligent, and improves the overall efficiency and quality of the audit work.

The technical solution of the invention is as follows.

An intelligent data quality auditing system based on data science, comprising:

a data collection module, which collects metadata of the objects under inspection and collects and parses log data; a data feature extraction module, which extracts features according to field type; an anomaly detection module, which matches data features to select and run the corresponding anomaly detection method; and a task scheduling and orchestration module, comprising an orchestration server and node servers, where the orchestration server splits the above tasks into sub-steps according to task requests and distributes them to different node servers for processing.

The invention uses feature extraction and anomaly detection as the foundation of data auditing and relies on the task scheduling and orchestration module to allocate server resources sensibly, ultimately enabling auditing of large-scale data.

The invention also provides an intelligent data quality auditing method based on data science, for use with the above system, comprising the following steps:

S1, data collection: collect metadata of the objects under inspection and collect and parse log data. S2, data feature extraction: identify and discard invalid tables and invalid fields, automatically revise field types according to the data content via a revision algorithm, and extract features according to field type. S3, anomaly detection: preset a library of data anomaly detection methods and match it against the data features to select and run the corresponding method. S4, task scheduling and orchestration: provide an orchestration server and node servers; the orchestration server splits the above tasks into sub-steps according to task requests and distributes them to different node servers for processing.
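Steps S1 through S4 can be sketched as a minimal sequential pipeline. All function bodies below are illustrative assumptions, not the patented logic; the single-node `audit` function stands in for the orchestrated, multi-node case:

```python
# A minimal end-to-end sketch of steps S1-S4, with each step stubbed out.
# All function bodies are illustrative assumptions, not the patented logic.

def collect(source):
    """S1: gather metadata and parsed logs for the objects under inspection."""
    return {"tables": source["tables"], "logs": source.get("logs", [])}

def extract_features(collected):
    """S2: drop invalid (here: empty) tables, then extract per-table features."""
    valid = {t: cols for t, cols in collected["tables"].items() if cols}
    return {t: {"field_count": len(cols)} for t, cols in valid.items()}

def detect_anomalies(features):
    """S3: match each feature against a preset method library (stubbed)."""
    return {t: [] for t in features}        # no anomalies in this toy run

def audit(source):
    """S4 (degenerate single-node case): run the sub-steps in order."""
    return detect_anomalies(extract_features(collect(source)))

print(audit({"tables": {"users": ["id", "name"], "empty_tbl": []}}))
# → {'users': []}   (the empty table was discarded, no anomalies found)
```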

Preferably, the data feature extraction process comprises: performing a preliminary identification of the field types and discarding invalid tables and invalid fields; checking the Chinese description of the data against the field type, sampling the mismatched data, computing the proportion of each field type in the sample, and revising the field type according to those proportions; and extracting features according to field type. The field types include at least one of numeric, text, and date.
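The sampling-and-proportion revision can be sketched as below. The type-inference heuristics, date formats, threshold, and sample size are illustrative assumptions:

```python
import random
from collections import Counter
from datetime import datetime

def infer_value_type(value: str) -> str:
    """Classify a single stored value as numeric, date, or text (heuristic)."""
    try:
        float(value)
        return "numeric"
    except ValueError:
        pass
    for fmt in ("%Y-%m-%d", "%Y/%m/%d"):
        try:
            datetime.strptime(value, fmt)
            return "date"
        except ValueError:
            continue
    return "text"

def revise_field_type(values, declared_type, threshold=0.8, sample_size=100):
    """Sample the column, count type proportions, and recommend a revision
    when one type's share exceeds the threshold."""
    sample = random.sample(values, min(sample_size, len(values)))
    counts = Counter(infer_value_type(v) for v in sample)
    best_type, best_count = counts.most_common(1)[0]
    if best_type != declared_type and best_count / len(sample) > threshold:
        return best_type           # recommended revised field type
    return declared_type           # keep the declared type

print(revise_field_type(["12", "7.5", "30"], "text"))  # → numeric
```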

Preferably, the preliminary identification comprises: identifying the field types of the data against an existing field-type database, or introducing a recognition model trained with a neural network, to obtain a preliminary identification result. Discarding invalid tables and invalid fields comprises: defining invalid tables and invalid fields; judging, from each table's metadata and data content, empty tables, zombie tables, log tables, backup tables, temporary tables, single-field tables, and low-activity tables to be invalid tables; judging empty fields and single-value fields to be invalid fields; and identifying and discarding both. Different field types have their own characteristics; the prior art typically uses databases and trained models for comparison and recognition, which helps reduce implementation cost and guarantees a baseline accuracy. Invalid tables and fields cover the common kinds of invalid data, and discarding them reduces the processing load of subsequent feature extraction and analysis.
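The invalid-table and invalid-field judgments can be sketched with simple metadata heuristics. The metadata keys, thresholds, and naming conventions below are illustrative assumptions:

```python
def is_invalid_table(meta: dict) -> bool:
    """Flag a table as invalid based on the categories named in the text.
    Metadata keys and thresholds are illustrative assumptions."""
    if meta["row_count"] == 0:                  # empty table
        return True
    if meta["column_count"] <= 1:               # single-field table
        return True
    if meta["days_since_last_write"] > 365:     # zombie table
        return True
    if meta["reads_last_90_days"] < 10:         # low-activity ("low-heat") table
        return True
    name = meta["name"].lower()
    # log / backup / temporary tables recognized by naming convention
    return any(tag in name for tag in ("_log", "_bak", "_tmp"))

def is_invalid_field(values) -> bool:
    """A field is invalid when it is empty or holds a single distinct value."""
    distinct = {v for v in values if v not in (None, "")}
    return len(distinct) <= 1
```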

Preferably, revising the field type comprises: using an NLP (natural language processing) module to segment and semantically analyze the Chinese description of the data, then using a type decision tree to identify paths of similar words or characters after parsing; where the semantics of the description do not match the field type, the field is marked as a suspected type revision. The data whose descriptions are semantically identical or similar is then sampled repeatedly, the proportion of each field type in the samples is computed, and the type whose proportion exceeds a threshold becomes the recommended revision, so that the field is finally revised to the type of the data actually stored. NLP can segment and semantically analyze the Chinese description, and the decision tree can identify paths of similar meaning to help judge whether a field is a suspected revision; the final result is determined by a proportion threshold. The revision process supplements the preliminary identification and further improves recognition accuracy.

Preferably, extracting features according to field type comprises: for numeric fields, extracting features and feature values using the mean, maximum, minimum, median, variance, quartiles, interquartile range, value clustering, and length clustering; for text fields, extracting statistical attribute features from length clustering and structural distribution, and content features via word segmentation and semantic recognition of the data content; and for date fields, parsing the structure and extracting features from the date format and length.
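For the numeric case, the listed statistics map directly onto the standard library. This sketch covers the statistical features; the value- and length-clustering steps are reduced to a crude length set for illustration:

```python
import statistics

def numeric_features(values):
    """Extract the statistical features named in the text for a numeric field:
    mean, max, min, median, variance, quartiles, and interquartile range."""
    xs = sorted(float(v) for v in values)
    q1, q2, q3 = statistics.quantiles(xs, n=4)    # three quartile cut points
    return {
        "mean": statistics.mean(xs),
        "max": xs[-1],
        "min": xs[0],
        "median": statistics.median(xs),
        "variance": statistics.pvariance(xs),
        "q1": q1, "q3": q3,
        "iqr": q3 - q1,                            # interquartile range
        "lengths": {len(str(v)) for v in values},  # crude length clustering
    }

print(numeric_features([1, 2, 3, 4, 5])["iqr"])  # → 3.0
```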

Preferably, after the field-type revision a verification step follows: date-type data is converted to text and copied into a verification group and an interference group; the verification group inserts year/month/day descriptors according to the original date format, while the interference group appends counting-unit descriptors according to the digit count of the original date data. Both groups are spliced into the text data adjacent to them, the spliced text is semantically analyzed by the NLP module, and the recognition speed of each verification/interference pair is recorded. If the verification group is recognized faster than the interference group by more than a margin threshold, the revision passes verification; otherwise the original date data is flagged as a suspected type error. Because both date and numeric data are usually related to the adjacent text data, text spliced from a correctly identified date is easier to recognize and thus recognized faster; if the original identification was wrong, the spliced text is wrong, offers no speed advantage over the interference group, and may even be slower, so the field is flagged as a suspected error.

Preferably, the anomaly detection process comprises:

building a library of data anomaly detection methods, with a detection method configured for each data feature; the library is stored as a dictionary whose keys are tuples of a data feature name and its feature parameters and whose values are the corresponding anomaly detection methods; matching each data feature against the library and running the matched methods; and traversing large-scale data features, matching and checking each one. The method library is designed from the standpoints of statistics, common sense, natural laws, and general domain knowledge: for example, a value-type feature raises an anomaly when a field value is an extreme value, and a date feature raises an anomaly for field content that does not conform to the date format. The library is configured according to actual usage needs, and detection is targeted after matching. A Python dictionary stores key-value pairs: the key holds the tuple of the data feature name and its feature parameters, and the value holds that feature's anomaly detection method, whose thresholds are given by the feature parameters. Storing the library as a dictionary cleanly separates keys from values, which aids subsequent matching.
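The dictionary-based method library can be sketched as follows. The feature names, parameter values, and detection rules are illustrative assumptions, not the patent's actual library:

```python
from datetime import datetime

def extreme_value_check(values, low, high):
    """Report indices whose value falls outside the [low, high] range."""
    return [i for i, v in enumerate(values) if not low <= v <= high]

def date_format_check(values, fmt):
    """Report indices whose content does not parse with the expected format."""
    bad = []
    for i, v in enumerate(values):
        try:
            datetime.strptime(v, fmt)
        except ValueError:
            bad.append(i)
    return bad

# Keys: (feature name, feature parameters); values: detection callables.
# The parameters in the key double as the method's thresholds.
method_library = {
    ("numeric_range", (0, 150)): lambda vs: extreme_value_check(vs, 0, 150),
    ("date_format", ("%Y-%m-%d",)): lambda vs: date_format_check(vs, "%Y-%m-%d"),
}

# Select a method by key and run it on a field's values:
check = method_library[("numeric_range", (0, 150))]
print(check([23, 45, 999]))  # → [2]  (index 2 holds an extreme value)
```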

Preferably, the matching comprises: embedding both the name of the data feature to be processed and the keys of the anomaly detection method library as word vectors obtained via NLP, and computing the cosine similarity between the vectors; keys whose similarity exceeds the threshold are the potential keys for that feature, and the anomaly detection methods they map to are the matching result. The cosine similarity is computed as:

cos(u, v) = (u · v) / (‖u‖ ‖v‖)

where u and v are the two word vectors. Word vectors are multi-dimensional, and cosine similarity allows them to be judged and compared fairly accurately.
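The similarity computation and key matching can be sketched as below; the 0.9 threshold is an assumed placeholder, and the vectors would in practice come from an NLP embedding model:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two word vectors: (u . v) / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def match_methods(feature_vec, library_keys, key_vecs, threshold=0.9):
    """Return the library keys whose embedded vector is similar enough to
    the feature's vector (the threshold value is an assumption)."""
    return [k for k, kv in zip(library_keys, key_vecs)
            if cosine_similarity(feature_vec, kv) > threshold]

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # → 1.0
```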

Preferably, the large-scale feature traversal comprises: scaling each dimension of the word vector to be matched into the range 0 to 255 and representing the vector as an array of n pixels laid out in sequence, where n is the vector's dimensionality and each dimension's value is a pixel's gray level; copying the image represented by the pixel array onto a white background image of m pixels to obtain a replica, where m is x² times n and x is a natural number of at least 2; reducing the replica back down to n pixels; reading each pixel's gray value to form a new, coarsened word vector; and using these coarsened vectors for the cosine similarity computation to reduce the computational load on large-scale data. If large-scale data were processed exactly as single items are, the computation would be enormous and overall efficiency low, so the vectors are blurred in this way. Although a blurred vector deviates from the original, vectors that were similar remain suitably similar, so the similarity results differ little, and the computational pressure of massive data can be handled.
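The pixel trick can be simulated without an image library: each of the n gray pixels lands in a block of x² canvas pixels, the rest of which are white (255), and average-downsampling back to n pixels mixes each value with the white background. This pure-Python interpretation of the scheme is a sketch, not the patent's exact image pipeline:

```python
def coarsen_vector(vec, x=2):
    """Blur a word vector via the pixel scheme described in the text:
    scale to 0..255 grays, paste the n pixels onto a white canvas of
    m = n * x**2 pixels, then average back down to n pixels."""
    lo, hi = min(vec), max(vec)
    span = (hi - lo) or 1.0
    grays = [round((v - lo) / span * 255) for v in vec]  # scale to 0..255
    block = x * x                                        # canvas pixels per value
    coarse = []
    for g in grays:
        # each block holds the original pixel plus (block - 1) white pixels
        pixels = [g] + [255] * (block - 1)
        coarse.append(sum(pixels) / block)               # average downsample
    return coarse

print(coarsen_vector([0.0, 1.0], x=2))  # → [191.25, 255.0]
```

Because the white padding applies the same affine map to every dimension, vectors that were close stay close after coarsening, which is what keeps the similarity results usable.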

Alternatively, the large-scale feature traversal may comprise: scaling each dimension of the word vector to be matched into the range 0 to 255, dividing 0 to 255 into several levels, replacing each dimension's value with the midpoint of the level it falls in, generating a new coarsened word vector, and using these vectors for the cosine similarity computation to reduce the computational load on large-scale data. This scheme likewise relies on blurring the word vectors to reduce the computation on large-scale data.
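The alternative quantization scheme can be sketched as follows; the number of levels is an illustrative assumption:

```python
def quantize_vector(vec, levels=8):
    """Blur a word vector by the alternative scheme: scale dimensions to
    0..255, split that range into equal levels, and snap each value to
    its level's midpoint. The level count is an illustrative assumption."""
    lo, hi = min(vec), max(vec)
    span = (hi - lo) or 1.0
    width = 256 / levels                        # width of each level
    out = []
    for v in vec:
        g = (v - lo) / span * 255               # scale to 0..255
        idx = min(int(g // width), levels - 1)  # level index
        out.append(idx * width + width / 2)     # midpoint of that level
    return out

print(quantize_vector([0.0, 0.5, 1.0], levels=8))  # → [16.0, 112.0, 240.0]
```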

Preferably, the task scheduling and orchestration process comprises:

The orchestration server splits tasks into different nodes deployed across multiple servers, allocating computing resources to reduce the load on each server. It collects and distributes cluster requests centrally, using a producer-consumer pattern to distribute tasks to the node server cluster in order, assigns execution strategies according to the cluster configuration, and reports task execution in real time. Tasks are scheduled according to the state of the node server cluster: when a node fails, its tasks are transferred to healthy nodes, so that execution is unaffected by the downtime of individual node servers. Sensible planning of server computing power speeds up the whole auditing method and further widens the efficiency advantage of the feature extraction and anomaly detection above.
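The producer-consumer distribution can be sketched with a thread-backed queue: the orchestrator enqueues sub-steps and worker "nodes" consume them. The node names, sub-step names, and node count are illustrative assumptions; real node servers, failover, and execution-strategy assignment are out of scope for this sketch:

```python
import queue
import threading

tasks = queue.Queue()
results = []
lock = threading.Lock()

def node_worker(name):
    """A consumer 'node': process sub-steps until the poison pill arrives."""
    while True:
        step = tasks.get()
        if step is None:            # poison pill: shut the node down
            tasks.task_done()
            return
        with lock:
            results.append((name, step))
        tasks.task_done()

# Orchestrator (producer): split the audit into sub-steps and enqueue them.
for step in ["collect", "extract", "detect", "report"]:
    tasks.put(step)

nodes = [threading.Thread(target=node_worker, args=(f"node-{i}",))
         for i in range(2)]
for n in nodes:
    n.start()
tasks.join()                        # wait until every sub-step is processed
for _ in nodes:
    tasks.put(None)                 # stop the nodes
for n in nodes:
    n.join()

print(sorted(s for _, s in results))  # → ['collect', 'detect', 'extract', 'report']
```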

The substantive effects of the invention include: providing solutions and system services that apply data science methods and artificial intelligence to data quality checking; lowering the barrier to data asset management and data quality governance; making data quality auditing general-purpose, scalable, automated, and intelligent; and improving the overall efficiency and quality of the audit work.

Detailed Description

The technical solution of the application is described below with reference to an embodiment. Numerous specific details are given to better illustrate the invention; those skilled in the art will understand that it can also be practiced without some of them. In some instances, methods, means, components, and circuits well known to those skilled in the art are not described in detail, so as to highlight the gist of the invention.

Embodiment:

An intelligent data quality auditing system and method based on data science, the system comprising:

a data collection module, which collects metadata of the objects under inspection and collects and parses log data; a data feature extraction module, which extracts features according to field type; an anomaly detection module, which matches data features to select and run the corresponding anomaly detection method; and a task scheduling and orchestration module, comprising an orchestration server and node servers, where the orchestration server splits the above tasks into sub-steps according to task requests and distributes them to different node servers for processing.

This embodiment uses feature extraction and anomaly detection as the foundation of data auditing and relies on the task scheduling and orchestration module to allocate server resources sensibly, ultimately enabling auditing of large-scale data.

The corresponding auditing method in this embodiment comprises the following steps:

S1, data collection:

Collect metadata of the objects under inspection and collect and parse log data.

S2, data feature extraction:

Perform a preliminary identification of the field types and discard invalid tables and invalid fields; check the Chinese description of the data against the field type, sample the mismatched data, compute the proportion of each field type in the sample, and revise the field type according to those proportions; extract features according to field type. The field types include at least one of numeric, text, and date.

The preliminary identification comprises: identifying the field types of the data against an existing field-type database, or introducing a recognition model trained with a neural network, to obtain a preliminary identification result. Discarding invalid tables and invalid fields comprises: defining invalid tables and invalid fields; judging, from each table's metadata and data content, empty tables, zombie tables, log tables, backup tables, temporary tables, single-field tables, and low-activity tables to be invalid tables; judging empty fields and single-value fields to be invalid fields; and identifying and discarding both. Different field types have their own characteristics; the prior art typically uses databases and trained models for comparison and recognition, which helps reduce implementation cost and guarantees a baseline accuracy. Invalid tables and fields cover the common kinds of invalid data, and discarding them reduces the processing load of subsequent feature extraction and analysis.

Revising the field type comprises: using an NLP module to segment and semantically analyze the Chinese description of the data, then using a type decision tree to identify paths of similar words or characters after parsing; where the semantics of the description do not match the field type, the field is marked as a suspected type revision. The data whose descriptions are semantically identical or similar is then sampled repeatedly, the proportion of each field type in the samples is computed, and the type whose proportion exceeds a threshold becomes the recommended revision, so that the field is finally revised to the type of the data actually stored. NLP can segment and semantically analyze the Chinese description, and the decision tree can identify paths of similar meaning to help judge whether a field is a suspected revision; the final result is determined by a proportion threshold. The revision process supplements the preliminary identification and further improves recognition accuracy.

其中根据字段类型提取特征的过程包括：对数值型字段，利用均值、最大值、最小值、中位数、方差、四分位数、四分位距、数值聚类以及长度聚类进行特征和特征值提取；对于文本型字段，从长度聚类和结构分布统计属性特征，并通过数据内容的分词和语义识别进行内容特征上的提取；对日期型字段，进行结构解析，对日期格式和长度进行特征提取。The process of extracting features according to field types includes: for numeric fields, extracting features and feature values by means of the mean, maximum, minimum, median, variance, quartiles, interquartile range, numeric clustering and length clustering; for text fields, computing statistical attribute features from length clustering and structure distribution, and extracting content features through word segmentation and semantic recognition of the data content; and for date fields, performing structural analysis and extracting features of the date format and length.
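A minimal sketch of the numeric-field statistics using Python's standard library; the clustering-based features (numeric clustering, length clustering) named above are omitted here.

```python
import statistics

def numeric_features(values):
    """Compute the statistical features listed above for a numeric field."""
    vs = sorted(values)
    q1, q2, q3 = statistics.quantiles(vs, n=4)   # quartiles (Python 3.8+)
    return {
        "mean": statistics.mean(vs),
        "max": vs[-1],
        "min": vs[0],
        "median": statistics.median(vs),
        "variance": statistics.variance(vs),
        "quartiles": (q1, q2, q3),
        "iqr": q3 - q1,                          # interquartile range
    }
```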

更具体地，可以从数据特征库查找该字段类型适用的数据特征及特征提取方法，并根据对应数据特征的依赖以及互斥关系网络，对该字段类型所有适用的数据特征提取方法进行遍历，例如确定某数据字段为数值型后，特征提取算法将会载入长度、整数、正数、负数、小数等属性特征提取的方法，以及手机号、邮编等业务特征提取的方法，通过对数据内容进行持续的识别和提取，可以获得是长度集中、是整数、是手机号等特征，同时会对“正-负”这两种对立互斥的特征进行区分，从而获得该字段多角度的特征和特征值。More specifically, the data features and feature extraction methods applicable to the field type can be looked up in the data feature library, and all applicable extraction methods for that field type can be traversed according to the dependency and mutual-exclusion relationship network of the corresponding data features. For example, after a data field is determined to be numeric, the feature extraction algorithm loads methods for extracting attribute features such as length, integer, positive number, negative number and decimal, as well as methods for extracting business features such as mobile phone number and zip code. By continuously identifying and extracting from the data content, features such as concentrated length, integer and mobile-phone-number can be obtained; at the same time, the two opposing, mutually exclusive features "positive" and "negative" are distinguished, thereby obtaining multi-angle features and feature values for the field.
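The traversal of applicable extraction methods with a mutually exclusive pair can be sketched as below; the feature-library contents, the simplistic phone-number rule and the mutex list are hypothetical stand-ins for the data feature library described above.

```python
# Hypothetical feature library for numeric fields: feature name -> predicate.
FEATURE_LIBRARY = {
    "numeric": {
        "is_integer":  lambda vs: all(float(v).is_integer() for v in vs),
        "is_positive": lambda vs: all(float(v) > 0 for v in vs),
        "is_negative": lambda vs: all(float(v) < 0 for v in vs),
        # toy business-feature rule: 11 digits starting with 1
        "is_phone":    lambda vs: all(len(str(v)) == 11 and str(v).startswith("1")
                                      for v in vs),
    }
}
MUTEX = [("is_positive", "is_negative")]  # opposing features cannot both hold

def extract(field_type, values):
    """Traverse all extraction methods applicable to the field type."""
    feats = {name: fn(values) for name, fn in FEATURE_LIBRARY[field_type].items()}
    for a, b in MUTEX:                    # sanity-check mutually exclusive pairs
        assert not (feats[a] and feats[b])
    return feats
```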

另外，修订字段类型结束后，还包括验证步骤：将日期类数据转换为文本类数据，并复制为验证组和干扰组，验证组根据原日期格式插入年月日描述，干扰组根据原日期类数据位数增加计数单位描述，将验证组和干扰组插入自身相邻的文本类数据中，并通过NLP自然语言处理模块对拼接后的文本类数据进行语义识别，记录每一对干扰组和验证组的识别速度，如验证组的识别速度快于干扰组且超过幅度阈值，则通过验证，否则将对应的原日期类数据列为疑似错误类型。由于不论是日期类数据还是数值类数据，往往与其相邻的文本类数据有联系，当原本识别正确时，验证组拼接后的文本较容易识别，因此识别速度较快，而如果原本识别错误，则验证组拼接后的文本是错误的，因此相比于干扰组没有识别速度的优势，甚至更慢，因此将被列为疑似错误类型。In addition, after field-type revision, a verification step is included: the date-type data are converted to text and copied into a verification group and an interference group. The verification group inserts year/month/day descriptors according to the original date format, while the interference group appends counting-unit descriptors according to the digit count of the original date data. Both groups are spliced into the adjacent text data, the spliced text is semantically recognized by the NLP natural language processing module, and the recognition speed of each verification/interference pair is recorded. If the verification group is recognized faster than the interference group by more than a margin threshold, verification passes; otherwise the corresponding original date data is flagged as a suspected type error. Since both date-type and numeric-type data are usually related to the adjacent text, text spliced from a correctly identified field is easier to recognize and is thus recognized faster; if the original identification was wrong, the spliced verification text is wrong and shows no speed advantage over the interference group, or is even slower, so the field is flagged as a suspected type error.

S3:异常检测:S3: Anomaly Detection:

构建数据异常检测方法库，根据每种数据特征设置对应的检测方法，汇总形成数据异常检测方法库，数据异常检测方法库以字典类型存储，数据特征名称及其特征参数组成的元组作为字典的键，数据特征对应的异常检测方法作为字典的值；对数据特征进行异常检测方法匹配，根据匹配结果中的异常检测方法进行检测；大规模数据特征遍历，对每个数据特征进行匹配和检测。其中方法库的设置是从统计学、常识、自然规律、专业通用知识等角度对不同的数据特征分别设计对应的异常检测方法，比如数据值类特征设计当字段值出现极值时报异常、日期特征对不符合日期格式的字段内容报异常等，方法库的设置根据实际使用需求进行具体确定，匹配后针对性地进行检测。而Python的字典类型是个键值对，使用Python的字典类型来存储数据特征及其异常检测方法，字典的键存储的是数据特征名称及其特征参数组成的元组，字典的值存储的是该数据特征对应的异常检测方法，其中每个异常检测方法的阈值由特征参数给出，通过字典的方式存储，可以清楚划分键和值，利于后续的匹配。Construct a data anomaly detection method library: set a corresponding detection method for each data feature and aggregate them into the library. The library is stored as a dictionary type, with a tuple of the data-feature name and its feature parameters as the dictionary key and the anomaly detection method for that feature as the value. Data features are matched against anomaly detection methods, detection runs the methods in the matching result, and large-scale feature traversal matches and detects every data feature. The method library is built by designing detection methods for different data features from the perspectives of statistics, common sense, natural laws and general domain knowledge: for example, for value-type features, report an anomaly when a field value is an extreme value; for date features, report an anomaly when field content does not conform to the date format. The library's contents are determined by actual usage needs, and detection is targeted after matching. Python's dictionary type is a key-value store: the key holds the tuple of the feature name and its feature parameters, the value holds the corresponding anomaly detection method, and each method's threshold is supplied by the feature parameters. Storing the library as a dictionary cleanly separates keys from values, which facilitates subsequent matching.
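A sketch of the dictionary-typed method library described above, with the (feature name, feature parameter) tuple as key and the detection function as value; the two concrete rules here (an extreme-value limit and a date-format check) are illustrative assumptions.

```python
from datetime import datetime

def extreme_value_check(values, limit):
    """Flag values whose magnitude exceeds the limit (the feature parameter)."""
    return [v for v in values if abs(v) > limit]

def date_format_check(values, fmt):
    """Flag strings that do not parse under the given date format."""
    bad = []
    for v in values:
        try:
            datetime.strptime(v, fmt)
        except ValueError:
            bad.append(v)
    return bad

# key: (feature name, feature parameter) tuple; value: detection method
METHOD_LIBRARY = {
    ("extreme_value", 1000): lambda vs: extreme_value_check(vs, 1000),
    ("date_format", "%Y-%m-%d"): lambda vs: date_format_check(vs, "%Y-%m-%d"),
}
```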

其中匹配包括以下过程：对待处理的数据特征名称和异常检测方法库中的键分别嵌入经NLP得到的词向量，计算词向量之间的余弦相似度，相似度大于阈值的键即为该数据特征对应的潜在键，这些键所对应的异常检测方法即是匹配结果；余弦相似度的计算公式如下：The matching process includes the following steps: embed the name of the data feature to be processed and the keys in the anomaly detection method library as word vectors obtained via NLP, and calculate the cosine similarity between the word vectors; keys whose similarity exceeds the threshold are the potential keys for that data feature, and the anomaly detection methods corresponding to these keys form the matching result. The cosine similarity is calculated as follows:

cos(u, v) = (u · v) / (‖u‖ ‖v‖)

其中u和v分别表示两个词向量。词向量包含多维数值，借助余弦相似度，可以较为准确地判断和比较。Where u and v represent the two word vectors, respectively. Word vectors contain multi-dimensional values, and cosine similarity allows them to be judged and compared fairly accurately.
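The cosine similarity can be written directly from the formula:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two word vectors: (u . v) / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```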

大规模数据特征遍历过程包括：将待匹配的词向量中每一维度数值按比例缩放至0到255范围内，以依次展开排列的n个像素点阵列表示每个词向量，其中n为词向量的维度，该词向量每个维度的值为每个像素点的灰度值，以将像素点阵列所表示的图像复制至m个像素点的白底图片中得到复刻图，其中m为n的x^2倍，x为大于等于2的自然数，降低复刻图的像素至n，读取每个像素的灰度值，组成新的特殊词向量，使用特殊词向量进行余弦相似度的计算以减少大规模数据量下的计算强度。面对大规模的数据时，如果仍然同处理单个数据的方式完全一致，则运算量将非常大，整体效率偏低，因此采用上述方式将向量模糊化，模糊化的词向量与原词向量之间虽然会产生偏差，但原本相似的词向量之间仍然保留有合适的相似度，因此相似度的计算结果相差较小，通过这种方式可以应对海量数据下的计算压力。The large-scale feature traversal process includes: scale each dimension of the word vector to be matched proportionally into the range 0 to 255; represent each word vector as an array of n pixels laid out in sequence, where n is the vector's dimension and each dimension's value is the gray value of one pixel; copy the image represented by the pixel array onto a white background picture of m pixels to obtain a replica, where m is x^2 times n and x is a natural number greater than or equal to 2; reduce the replica to n pixels; read each pixel's gray value to form a new special word vector; and compute the cosine similarity with the special word vectors to reduce the computational load on large-scale data. When facing large-scale data, processing every comparison exactly as for a single item would be extremely expensive and inefficient overall, so the above method blurs the vectors. Although the blurred word vector deviates from the original, originally similar word vectors still retain a suitable similarity, so the similarity results differ only slightly; in this way the computational pressure of massive data can be handled.
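Since downscaling the m-pixel replica back to n pixels averages each original value with x^2 - 1 white (255) pixels, the fuzzification reduces to a simple blend. The sketch below assumes linear min-max scaling and area-averaging downsampling, and does not handle a constant vector (zero range); these are illustrative assumptions, not the patent's exact image pipeline.

```python
def fuzzify(vector, x=2):
    """Sketch of the pixel-based fuzzification: scale each dimension to 0-255,
    paste it into a white (255) image x^2 times larger, then average back
    down to n pixels. Equivalent to blending each value with white."""
    lo, hi = min(vector), max(vector)
    scaled = [255 * (v - lo) / (hi - lo) for v in vector]   # scale to 0..255
    block = x * x                                           # pixels per value
    # each downscaled pixel averages one original value with (x^2 - 1) whites
    return [(s + (block - 1) * 255) / block for s in scaled]
```

Note that the blend is an affine map of the original vector, so originally similar vectors remain similar after fuzzification, as the text argues.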

另外还有一种替代方案，即大规模数据特征遍历过程包括：将待匹配的词向量中每一维度数值按比例缩放至0到255范围内，并将0至255分为若干阶，将每个维度的数值修改为该数值对应阶内的中间数，生成新的特殊词向量，使用特殊词向量进行余弦相似度的计算以减少大规模数据量下的计算强度。该方案仍然是以模糊化词向量为主，降低大规模数据下的计算量。There is also an alternative: the large-scale feature traversal process includes scaling each dimension of the word vector to be matched proportionally into the range 0 to 255, dividing 0 to 255 into several bands, and replacing each dimension's value with the midpoint of its band to generate a new special word vector; cosine similarity is then computed with the special word vectors to reduce the computational load on large-scale data. This scheme likewise relies mainly on blurring the word vectors to reduce computation under large-scale data.

S4:任务调度编排:S4: Task scheduling and orchestration:

设置编排服务器和节点服务器,编排服务器根据任务请求将上述任务拆分为若干子步后分发给不同节点服务器处理。Set up an orchestration server and a node server. The orchestration server splits the above task into several sub-steps according to the task request and distributes them to different node servers for processing.

编排服务器将任务拆分成不同节点，分别部署在多个服务器，分配服务器运算资源以减低各服务器计算压力；编排服务器对集群请求统一收集及分发，采用生产者消费者模式，有序分发任务至节点服务器集群，根据集群配置情况分配任务执行策略，并实时反馈任务执行情况；根据节点服务器集群情况进行任务调度，当集群中有某个节点失效的情况下，其上的任务转移到其他正常的节点上，以保证任务运行不受部分节点服务器宕机影响。合理规划服务器算力可以为整个稽核方法提供效率上的加成，进一步扩大本方案中特征提取和异常检测的效率优势。The orchestration server splits tasks across different nodes deployed on multiple servers and allocates server computing resources to reduce the computing pressure on each server. It collects and distributes cluster requests uniformly, adopts the producer-consumer model to distribute tasks to the node server cluster in an orderly manner, allocates task execution strategies according to the cluster configuration, and feeds back task execution status in real time. Tasks are scheduled according to the state of the node server cluster: when a node in the cluster fails, its tasks are transferred to other healthy nodes, ensuring that task execution is unaffected by the downtime of some node servers. Reasonable planning of server computing power provides an efficiency boost for the whole auditing method, further amplifying the efficiency advantages of feature extraction and anomaly detection in this solution.
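A minimal producer-consumer sketch of the orchestration/node split, with threads standing in for node servers; the worker count and the poison-pill shutdown are illustrative choices, and cross-machine distribution and failover are omitted.

```python
import queue
import threading

def node_worker(tasks, results):
    """A node server: consume sub-steps until a poison pill (None) arrives."""
    while True:
        sub_step = tasks.get()
        if sub_step is None:          # poison pill: shut this node down
            break
        results.put(f"done:{sub_step}")

def orchestrate(task, n_nodes=3):
    """Orchestration server: split the task into sub-steps and distribute
    them to node workers via a shared queue (producer-consumer model)."""
    tasks, results = queue.Queue(), queue.Queue()
    workers = [threading.Thread(target=node_worker, args=(tasks, results))
               for _ in range(n_nodes)]
    for w in workers:
        w.start()
    for sub_step in task:             # distribute sub-steps in order
        tasks.put(sub_step)
    for _ in workers:                 # one poison pill per node
        tasks.put(None)
    for w in workers:
        w.join()
    out = []
    while not results.empty():
        out.append(results.get())
    return sorted(out)
```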

本实施例以特征提取和异常检测作为数据稽核的实现基础,并借助任务调度编排进行服务器资源的合理分配,最终实现大规模数据的稽核。本实施例的实质性效果包括:提供数据科学方法和人工智能技术在数据质量检核方面的解决方案与系统功能服务,降低数据资产管理和数据质量治理的门槛,实现数据质量稽核的通用性、规模化、自动化以及智能化,整体提升数据质量稽核的效率与工作质量。This embodiment uses feature extraction and anomaly detection as the basis for data auditing, and uses task scheduling and orchestration to reasonably allocate server resources, ultimately achieving large-scale data auditing. The substantial effects of this embodiment include: providing solutions and system function services for data science methods and artificial intelligence technologies in data quality verification, lowering the threshold for data asset management and data quality governance, achieving the universality, scale, automation and intelligence of data quality auditing, and overall improving the efficiency and work quality of data quality auditing.

通过以上实施方式的描述,所属领域的技术人员可以了解到,为描述的方便和简洁,仅以上述各功能模块的划分进行举例说明,实际应用中可以根据需要而将上述功能分配由不同的功能模块完成,即将具体装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。Through the description of the above implementation methods, technical personnel in the relevant field can understand that for the convenience and simplicity of description, only the division of the above-mentioned functional modules is used as an example. In actual applications, the above-mentioned functions can be assigned to different functional modules as needed, that is, the internal structure of the specific device can be divided into different functional modules to complete all or part of the functions described above.

在本申请所提供的实施例中,应该理解到,所揭露的方法可以通过其它的方式实现。例如既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个可读取存储介质中。基于这样的理解,本申请实施例的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该软件产品存储在一个存储介质中,包括若干指令用以使得一个设备(可以是单片机,芯片等)或处理器(processor)执行本申请各个实施例方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。In the embodiments provided in the present application, it should be understood that the disclosed method can be implemented in other ways. For example, it can be implemented in the form of hardware or in the form of a software functional unit. If it is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a readable storage medium. Based on such an understanding, the technical solution of the embodiment of the present application is essentially or the part that contributes to the prior art or all or part of the technical solution can be embodied in the form of a software product, which is stored in a storage medium, including several instructions to enable a device (which can be a single-chip microcomputer, chip, etc.) or a processor (processor) to perform all or part of the steps of the various embodiments of the present application. The aforementioned storage medium includes: various media that can store program codes, such as a USB flash drive, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a disk or an optical disk.

以上内容,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以权利要求的保护范围为准。The above contents are only specific implementation methods of the present application, but the protection scope of the present application is not limited thereto. Any technician familiar with the technical field can easily think of changes or substitutions within the technical scope disclosed in the present application, which should be included in the protection scope of the present application. Therefore, the protection scope of the present application should be based on the protection scope of the claims.

Claims (8)

1. An intelligent data quality auditing system based on data science, which is characterized by comprising:
a data acquisition module: collecting and analyzing metadata of the detection object and collecting and analyzing log data;
a data feature extraction module: extracting features according to field types;
an anomaly detection module: matching data features to select a corresponding anomaly detection method and performing detection;
a task scheduling module: comprising an orchestration server and node servers, wherein the orchestration server divides a task into a plurality of sub-steps according to a task request and distributes the sub-steps to different node servers for processing;
The data feature extraction module performs:
performing preliminary identification of field types on the data, and eliminating an invalid table and an invalid field;
Judging Chinese description and field types of the data, sampling the unmatched data, calculating the duty ratio of each field type in the sample, and revising the field types according to the duty ratio result;
Extracting features according to the field types;
the field type comprises at least one of a numerical type, a text type and a date type;
the process of revising the field type includes: the Chinese description of the data is subjected to word segmentation and semantic recognition by an NLP natural language processing module; after parsing, path recognition of approximate words or approximate characters is carried out through a type decision tree; if the semantics of the Chinese description do not match the field type, it is marked as a suspected revised field type; the data content whose Chinese description has the same or similar semantics is then sampled multiple times, the duty ratio of different field types in the sampled data is counted, the type whose duty ratio exceeds a threshold value is taken as the recommended revised field type, and the field type is finally revised to the field type to which the actually stored data belongs.
2. An intelligent data quality auditing method based on data science, which is used for an auditing system as claimed in claim 1, and is characterized by comprising the following steps:
S1: and (3) data acquisition: collecting and analyzing metadata of the detection object and collecting and analyzing log data;
S2: and (3) data characteristic extraction: identifying and removing an invalid table and an invalid field, automatically revising the field type according to the data content through a revising algorithm, and extracting features according to the field type;
S3: abnormality detection: presetting a data anomaly detection method library, and matching with data characteristics to select a corresponding anomaly detection method and detect;
s4: task scheduling: setting an arranging server and a node server, dividing the task into a plurality of sub-steps according to the task request by the arranging server, and then distributing the sub-steps to different node servers for processing;
The process of data feature extraction comprises the following steps:
performing preliminary identification of field types on the data, and eliminating an invalid table and an invalid field;
Judging Chinese description and field types of the data, sampling the unmatched data, calculating the duty ratio of each field type in the sample, and revising the field types according to the duty ratio result;
Extracting features according to the field types;
the field type comprises at least one of a numerical type, a text type and a date type;
the process of revising the field type includes: the Chinese description of the data is subjected to word segmentation and semantic recognition by an NLP natural language processing module; after parsing, path recognition of approximate words or approximate characters is carried out through a type decision tree; if the semantics of the Chinese description do not match the field type, it is marked as a suspected revised field type; the data content whose Chinese description has the same or similar semantics is then sampled multiple times, the duty ratio of different field types in the sampled data is counted, the type whose duty ratio exceeds a threshold value is taken as the recommended revised field type, and the field type is finally revised to the field type to which the actually stored data belongs.
3. The intelligent data quality auditing method based on data science according to claim 2, wherein the preliminary identification process includes: performing preliminary identification on data to be identified according to the existing field type database, or introducing an identification model trained by a neural network to perform preliminary identification, so as to obtain a preliminary identification result of the field type; the process of eliminating the invalid table and the invalid field comprises the following steps: defining an invalid table and an invalid field, and uniformly judging an empty table, a zombie table, a log table, a backup table, a temporary table, a single-field table and a low-heat table as invalid tables through metadata information and data content judgment of the tables; uniformly judging the null field and the single value field as invalid fields; and identifying and rejecting invalid tables and fields.
4. The intelligent data quality auditing method based on data science according to claim 2, wherein the process of extracting features according to field types includes: for numeric-type fields, extracting features and feature values by means of the average value, maximum value, minimum value, median, variance, quartiles, interquartile range, numerical-value clustering and length clustering; for text-type fields, computing statistical attribute features from length clustering and structure distribution, and extracting content features through word segmentation and semantic recognition of the data content; and for date-type fields, performing structural analysis and extracting features of the date format and length.
5. The intelligent data quality auditing method based on data science according to claim 2, wherein the anomaly detection process includes:
Constructing a data anomaly detection method library, setting corresponding detection methods according to each data feature, and summarizing to form the data anomaly detection method library, wherein the data anomaly detection method library is stored in a dictionary type, a tuple consisting of a data feature name and a feature parameter thereof is used as a key of the dictionary, and an anomaly detection method corresponding to the data feature is used as a value of the dictionary;
Performing anomaly detection method matching on the data characteristics, and detecting according to the anomaly detection method in the matching result;
And traversing the large-scale data features, and matching and detecting each data feature.
6. The intelligent data quality auditing method based on data science according to claim 5, in which the matching includes the following processes: the data feature name to be processed and the keys in the anomaly detection method library are respectively embedded as word vectors obtained through NLP, the cosine similarity between the word vectors is calculated, keys with similarity greater than a threshold value are the potential keys corresponding to the data feature, and the anomaly detection methods corresponding to the keys are the matching results;
The cosine similarity is calculated as follows:

cos(u, v) = (u · v) / (‖u‖ ‖v‖)

where u and v represent two word vectors, respectively.
7. The intelligent data quality auditing method according to claim 6, wherein the large-scale data feature traversal process includes: scaling each dimension value in the word vectors to be matched to the range of 0 to 255; representing each word vector as an array of n pixel points arranged in sequence, where n is the dimension of the word vector and the value of each dimension of the word vector is the gray value of one pixel point; copying the image represented by the pixel point array to a white background picture of m pixel points to obtain a replica, where m is x^2 times n and x is a natural number greater than or equal to 2; reducing the pixels of the replica to n; reading the gray value of each pixel to form a new special word vector; and performing cosine similarity calculation using the special word vector to reduce the calculation intensity under a large-scale data volume.
8. The intelligent data quality auditing method based on data science according to claim 2, wherein the task scheduling process includes:
the arrangement server splits the tasks into different nodes, deploys the different nodes on a plurality of servers respectively, and distributes the operation resources of the servers to reduce the calculation pressure of each server;
The arrangement server uniformly collects and distributes cluster requests, orderly distributes tasks to the node server clusters by adopting a producer consumer mode, distributes task execution strategies according to cluster configuration conditions, and feeds back task execution conditions in real time;
and performing task scheduling according to the condition of the node server cluster, and transferring the tasks on the node server cluster to other normal nodes under the condition that a certain node in the cluster fails, so as to ensure that the task operation is not influenced by downtime of part of the node servers.
CN202110671379.9A 2021-06-17 2021-06-17 Intelligent data quality auditing system and method based on data science Active CN113591485B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110671379.9A CN113591485B (en) 2021-06-17 2021-06-17 Intelligent data quality auditing system and method based on data science

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110671379.9A CN113591485B (en) 2021-06-17 2021-06-17 Intelligent data quality auditing system and method based on data science

Publications (2)

Publication Number Publication Date
CN113591485A CN113591485A (en) 2021-11-02
CN113591485B true CN113591485B (en) 2024-07-12

Family

ID=78243877

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110671379.9A Active CN113591485B (en) 2021-06-17 2021-06-17 Intelligent data quality auditing system and method based on data science

Country Status (1)

Country Link
CN (1) CN113591485B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114185869B (en) * 2021-12-03 2024-07-16 四川新网银行股份有限公司 Data model auditing method based on data standard
CN114238276B (en) * 2021-12-16 2025-04-25 深圳前海环融联易信息科技服务有限公司 A method for configuring backend interface fields based on service orchestration system
CN114328452A (en) * 2021-12-24 2022-04-12 中国联合网络通信集团有限公司 Data auditing method, device, platform, electronic device and storage medium
CN114331659A (en) * 2021-12-31 2022-04-12 中国工商银行股份有限公司 Construction method, monitoring method and device of financial indicator monitoring model
CN114648337A (en) * 2022-03-23 2022-06-21 中银金融科技有限公司 Transaction data quality automatic analysis method and device
CN114897516A (en) * 2022-07-12 2022-08-12 山东乐习信息科技有限公司 Method for overall process data management of special equipment
CN116543530B (en) * 2023-06-06 2025-08-12 宁波港消防技术服务有限公司 Fire control dispatching management and monitoring method and system
CN118133094B (en) * 2024-05-08 2024-07-26 信通院(江西)科技创新研究院有限公司 Digital economic measurement and investigation system display platform

Citations (2)

Publication number Priority date Publication date Assignee Title
CN112052138A (en) * 2020-08-31 2020-12-08 平安科技(深圳)有限公司 Service data quality detection method and device, computer equipment and storage medium
CN112329847A (en) * 2020-11-03 2021-02-05 北京神州泰岳软件股份有限公司 Abnormity detection method and device, electronic equipment and storage medium

Family Cites Families (15)

Publication number Priority date Publication date Assignee Title
WO2003012685A2 (en) * 2001-08-03 2003-02-13 Tristlam Limited A data quality system
CN101026647A (en) * 2007-04-09 2007-08-29 中国网络通信集团公司 Telecommunication income check dynamic hierarchical management system and method
US20130282600A1 (en) * 2012-04-23 2013-10-24 Sap Ag Pattern Based Audit Issue Reporting
CN103473672A (en) * 2013-09-30 2013-12-25 国家电网公司 System, method and platform for auditing metadata quality of enterprise-level data center
US20160078367A1 (en) * 2014-10-15 2016-03-17 Brighterion, Inc. Data clean-up method for improving predictive model training
CN106708909B (en) * 2015-11-18 2020-12-08 阿里巴巴集团控股有限公司 Data quality detection method and device
CN110109910A (en) * 2018-01-08 2019-08-09 广东神马搜索科技有限公司 Data processing method and system, electronic equipment and computer readable storage medium
CN110008909B (en) * 2019-04-09 2020-09-15 浩鲸云计算科技股份有限公司 An AI-based real-name business real-time audit system
CN110851559B (en) * 2019-10-14 2020-10-09 中科曙光南京研究院有限公司 Automatic data element identification method and identification system
CN110765337B (en) * 2019-11-15 2021-04-06 中科院计算技术研究所大数据研究院 Service providing method based on internet big data
CN110851426A (en) * 2019-11-19 2020-02-28 重庆华龙网海数科技有限公司 Data DNA visualization relation analysis system and method
CN111261298B (en) * 2019-12-25 2024-02-23 医渡云(北京)技术有限公司 Medical data quality prejudging method and device, readable medium and electronic equipment
CN112506897A (en) * 2020-11-17 2021-03-16 贵州电网有限责任公司 Method and system for analyzing and positioning data quality problem
CN112651296B (en) * 2020-11-23 2025-02-28 安徽继远软件有限公司 A method and system for automatically detecting data quality problems without prior knowledge
CN112948365A (en) * 2021-03-04 2021-06-11 浪潮云信息技术股份公司 Data quality detection method based on intelligent data element matching

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
CN112052138A (en) * 2020-08-31 2020-12-08 平安科技(深圳)有限公司 Service data quality detection method and device, computer equipment and storage medium
CN112329847A (en) * 2020-11-03 2021-02-05 北京神州泰岳软件股份有限公司 Abnormity detection method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113591485A (en) 2021-11-02

Similar Documents

Publication Publication Date Title
CN113591485B (en) Intelligent data quality auditing system and method based on data science
CN112035653A (en) A method and device for extracting key policy information, storage medium, and electronic device
CN113590698A (en) Artificial intelligence technology-based data asset classification modeling and hierarchical protection method
CN112732690B (en) Stabilizing system and method for chronic disease detection and risk assessment
CN114491034B (en) Text classification method and intelligent device
CN119599821B (en) Intelligent auditing method and system for power grid business based on AI enhancement
CN118733714B (en) Semantic big model optimization method and system for electric power scene
CN114969467A (en) Data analysis and classification method and device, computer equipment and storage medium
CN115146062A (en) Intelligent event analysis method and system integrating expert recommendation and text clustering
CN115629760A (en) A defect location method and system based on code knowledge graph
CN111475380A (en) Log analysis method and device
CN119988562A (en) A question-answering method, device, equipment and medium based on a large model
CN119669403A (en) Power grid marketing data annotation method based on perplexity-driven large language model
CN119336920A (en) A knowledge graph information domain quality assessment method and system
CN114491168A (en) Method and system for regulating and controlling cloud sample data sharing, computer equipment and storage medium
CN119179789A (en) Standard information management method and system based on knowledge graph
CN113569005B (en) Large-scale data characteristic intelligent extraction method based on data content
CN114968258B (en) Code tracing-oriented clone code inheritance relationship judging method, system and medium
Yang et al. Research on intelligent recognition and tracking technology of sensitive data for electric power big data
CN117251605A (en) Multi-source data query method and system based on deep learning
CN119399778B (en) Document intelligent review method and system based on multiple rule bases
CN119807721B (en) Source code security detection large model dimension reduction method based on homomorphic combination
CN119202260B (en) Power audit text classification method based on large language model
CN118966206B (en) Intelligent log analysis system
CN117573956B (en) Metadata management method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant