CN117762914A

CN117762914A - Data quality detection method and system

Info

Publication number: CN117762914A
Application number: CN202311791516.8A
Authority: CN
Inventors: 王斌; 蒋博一; 潘黎
Original assignee: Chengdu Data Group Co ltd
Current assignee: Chengdu Data Group Co ltd
Priority date: 2023-12-25
Filing date: 2023-12-25
Publication date: 2024-03-26

Abstract

The invention discloses a data quality detection method and system, relating to the technical field of data quality detection. The method comprises the following steps: acquiring at least one piece of data to be detected from a data source and disassembling it into data characters; constructing a screening rule and grouping together data characters that carry the same kind of character information; and constructing a data analysis model that incorporates different systems. Compared with traditional data quality detection methods, disassembling the data to be detected into data characters allows data quality problems to be detected at a finer granularity; constructing the screening rule and the data analysis model allows the information in the data characters to be extracted more accurately, yielding a plurality of data character items; comparing historical data character indicators with the data character items yields more accurate item difference scores; and constructing a discrimination rule and setting a discrimination threshold allows a more accurate judgment of whether the data quality meets the standard.

Description

A data quality detection method and system

Technical field

The invention relates to the technical field of data quality detection, and specifically to a data quality detection method and system.

Background

When data is analyzed and managed, abnormal data directly causes significant changes in the analysis results, so that the results deviate from expectations. Managers' inference, control and prediction of the entire project process then become inaccurate, and wrong judgments bring great risk to the whole project. Data quality detection is therefore required.

A search found the Chinese invention patent with grant publication number "CN112395280B", which discloses "a data quality detection method and system thereof". That application builds a data model from previously integrated historical data and performs predictive identification on new sample data to obtain data quality fluctuations, so that adjustment measures can be taken in advance and data quality can be managed in a targeted manner.

In addition, the Chinese invention patent with application publication number "CN109933581A" discloses "a data quality detection method and system". That application stores data in equal parts across the disks and memory of multiple servers using distributed file storage, lets multiple computing terminals compute simultaneously on the data in distributed memory to achieve distributed in-memory parallel computing, and then aggregates the results from the computing terminals to complete the quality detection.

However, the above two patents have the following problems in actual use:

First, although the first patent can build a data model from historical data and perform predictive identification on new sample data, it cannot detect and process abnormal data in real time. Adjustment measures can only be taken after data quality fluctuations have occurred, so the problems cannot be handled in time.

Second, although the second patent uses distributed in-memory parallel computing to improve the efficiency of data quality detection, it does not improve the accuracy or reliability of the detection, and its ability to detect and process abnormal data is limited.

Therefore, in view of the above two shortcomings, the applicant proposes a new data quality detection method and system.

Summary of the invention

The purpose of the present invention is to provide a data quality detection method and system that solve the problems raised in the background above.

To achieve the above purpose, the present invention provides the following technical solutions:

In a first aspect, a data quality detection method is provided, the method comprising:

obtaining at least one piece of data to be detected from a data source, and disassembling the data to be detected into data characters;

constructing a screening rule, and grouping together data characters that carry the same kind of character information;

constructing a data analysis model that incorporates different systems, extracting the character information in the data characters to obtain a plurality of data character items, and comparing historical data character indicators with the data character items to obtain an item difference score for each data character;

constructing a discrimination rule and setting a discrimination threshold, comparing the item difference score of each data character with the preset threshold to determine whether the individual data character is "qualified", counting the number of "qualified" data character items, and judging the overall quality of the data source from that number;

establishing impact indicators, assigning weights to the item difference scores of the data characters based on the application scenario of the data, and making a final judgment of the overall quality of the data to be detected from the number of "qualified" data character items after weighting.

As a further preference of this technical solution, the item difference score is compared with the preset threshold:

when the difference score is higher than the threshold, the item is judged to be high-quality data and recorded as "qualified";

when the difference score is lower than the threshold, the item is judged to be low-quality data and recorded as "unqualified".

As a further preference of this technical solution, the quality grade of the data is further judged from the number of "qualified" data character items:

when at least 80% of the data character items are "qualified", the data quality is "80 to 90 points";

when 50% to 79% of the data character items are "qualified", the data quality is "60 to 79 points";

when fewer than 50% of the data character items are "qualified", the data quality is "40 to 59 points".
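As a non-limiting illustration, the threshold comparison and the grade bands above might be sketched in Python as follows. The function names `label_item` and `quality_grade` are hypothetical, and the labels "qualified" and "unqualified" render 达标 and 不达标. Two points are assumptions, because the text does not settle them: a score exactly equal to the threshold is treated as "unqualified", and a qualified share between 79% and 80% is placed in the middle band.

```python
from typing import Iterable


def label_item(score: float, threshold: float) -> str:
    """Label one item difference score against the discrimination threshold."""
    # Assumption: a score equal to the threshold counts as "unqualified";
    # the text only defines the strictly-higher and strictly-lower cases.
    return "qualified" if score > threshold else "unqualified"


def quality_grade(labels: Iterable[str]) -> str:
    """Map the share of "qualified" items to the grade bands given in the text."""
    labels = list(labels)
    if not labels:
        raise ValueError("no data character items to grade")
    share = sum(1 for label in labels if label == "qualified") / len(labels)
    if share >= 0.80:
        return "80-90 points"
    if share >= 0.50:  # assumption: shares between 79% and 80% also land here
        return "60-79 points"
    return "40-59 points"


if __name__ == "__main__":
    scores = [0.9, 0.7, 0.65, 0.5, 0.8]
    labels = [label_item(s, threshold=0.6) for s in scores]
    print(labels, quality_grade(labels))
```

The grade is returned as a band rather than a single number because the text itself only assigns a range of points to each share of qualified items.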

As a further preference of this technical solution, the construction criteria of the impact indicators include:

a. Accuracy: for a "qualified" item whose character information differs from the historical data, mark it as "possibly problematic" and assign a weight of 20% to 35%;

b. Consistency: for data characters from the same data source or the same batch, when several items show differences between their character information and the historical data, mark them as "requiring close attention" and assign a weight of 40% to 55%;

c. Stability: for a "qualified" item whose character information does not differ from the historical data, mark it as "stable" and assign a weight of 10% to 40%.

As a further preference of this technical solution, the difference obtained by comparing the item difference score of a data character with the preset threshold is reconstructed according to the construction criteria of the impact indicators: the reconstructed value equals the original value multiplied by the weight, plus the original value.
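As a non-limiting illustration, the reconstruction rule above (original value multiplied by the weight, plus the original value) might be sketched as follows. The names `WEIGHT_RANGES` and `reconstruct` are hypothetical. How a specific weight is chosen inside each range is not stated in the text, so the caller supplies it here and the sketch only checks that it lies within the stated range.

```python
# Weight ranges per impact indicator, as stated in criteria a, b and c.
WEIGHT_RANGES = {
    "accuracy": (0.20, 0.35),
    "consistency": (0.40, 0.55),
    "stability": (0.10, 0.40),
}


def reconstruct(value: float, indicator: str, weight: float) -> float:
    """Return value * weight + value for a weight inside the indicator's range."""
    low, high = WEIGHT_RANGES[indicator]
    if not low <= weight <= high:
        raise ValueError(
            f"weight {weight} is outside the {indicator} range [{low}, {high}]"
        )
    # Equivalent to value * (1 + weight): the weight amplifies the original value.
    return value * weight + value


if __name__ == "__main__":
    print(reconstruct(2.0, "consistency", 0.5))
```

Because the rule adds the weighted value to the original, a larger weight always moves the reconstructed value further from zero in the direction of the original value.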

As a further preference of this technical solution, the method of constructing the screening rule includes:

checking the data characters for invalid, ambiguous, erroneous and duplicate characters, and marking those characters;

replacing or deleting the marked characters;

constructing a classification standard, and grouping together data characters of the same kind based on their character information.
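As a non-limiting illustration, the three screening steps above might be sketched as follows. The names `screen` and `group_by_kind` are hypothetical. The sketch assumes that "invalid" means whitespace or a character in a Unicode control/unassigned category, that "duplicate" means a character already seen, and that marked characters are deleted rather than replaced. Detecting "ambiguous" or "erroneous" characters would need domain rules that the text does not specify, so that part is not modeled. Grouping by the major Unicode category is likewise only one possible classification standard.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, Iterable, List


def screen(chars: Iterable[str]) -> List[str]:
    """Drop invalid and duplicate characters, keeping first occurrences in order."""
    seen = set()
    kept = []
    for ch in chars:
        # Assumption: whitespace and Unicode "C*" categories count as invalid.
        is_invalid = ch.isspace() or unicodedata.category(ch).startswith("C")
        is_duplicate = ch in seen
        if is_invalid or is_duplicate:
            continue  # marked characters are deleted in this sketch
        seen.add(ch)
        kept.append(ch)
    return kept


def group_by_kind(chars: Iterable[str]) -> Dict[str, List[str]]:
    """Group characters by major Unicode category (L letters, N numbers, ...)."""
    groups = defaultdict(list)
    for ch in chars:
        groups[unicodedata.category(ch)[0]].append(ch)
    return dict(groups)


if __name__ == "__main__":
    cleaned = screen("a1\x00a b2")
    print(cleaned, group_by_kind(cleaned))
```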

As a further preference of this technical solution, the analysis method of the data analysis model includes:

extracting useful character features from the screened data characters;

based on the extracted character features, associating and generating a plurality of data character items in the same field;

storing and managing the generated data character items in a database.

In a second aspect, to complete this technical solution, the applicant further proposes a data quality detection system based on the above data quality detection method, the system comprising:

Data acquisition module: used to obtain at least one piece of data to be detected from the data source;

Data disassembly module: used to disassemble the data to be detected into data characters;

Data grouping module: used to construct the screening rule and group together data characters that carry the same kind of character information;

Data extraction module: used to construct the data analysis model that incorporates different systems, and to extract the character information in the data characters to obtain a plurality of data character items;

Indicator comparison module: used to compare historical data character indicators with the data character items to obtain the item difference score of each data character;

Discrimination module: used to construct the discrimination rule, set the discrimination threshold, compare the item difference score of each data character with the preset threshold to determine whether the individual data character is "qualified", and count the number of "qualified" data character items;

Quality judgment module: used to further judge the quality grade of the data from the number of "qualified" data character items;

Impact indicator module: used to establish the impact indicators and assign weights to the item difference scores of the data characters based on the application scenario of the data;

Final judgment module: used to make a final judgment of the overall quality of the data to be detected from the number of "qualified" data character items after weighting.

Compared with the prior art, the beneficial effects of the present invention are as follows:

Compared with traditional data quality detection methods, this data quality detection method and system first detect data quality problems at a finer granularity by disassembling the data to be detected into data characters. Second, by constructing the screening rule and the data analysis model, they extract the information in the data characters more accurately and obtain a plurality of data character items. Third, by comparing historical data character indicators with the data character items, they obtain more accurate item difference scores. Finally, by constructing the discrimination rule and setting the discrimination threshold, they judge more accurately whether the data quality meets the standard.

In addition, by disassembling the data to be detected into data characters and using the screening rule and the data analysis model for data extraction and analysis, the method and system detect data quality quickly and effectively, which improves detection efficiency.

Furthermore, by constructing the screening rule, the data analysis model, the discrimination rule and the impact indicators, the present invention detects and judges data quality automatically, which reduces the cost and error rate of manual intervention.

It should also be noted that the data quality detection method provided by the present invention achieves comprehensive detection and evaluation of data quality by constructing the screening rule, the data analysis model, the discrimination rule and the impact indicators. The method is broadly applicable and can be used to check various data sources, such as databases, files and network data. The method also takes the application scenario of the data into account: by assigning weights to the item difference scores of the data characters, it brings the detection results closer to actual business needs.

It should be added that the present invention also provides specific methods for constructing the screening rule, the data analysis model, the discrimination rule and the impact indicators, which makes the data quality detection process clearer and more orderly. Implementing the present invention can effectively improve the accuracy and efficiency of data quality detection and provide reliable data quality assurance for enterprises and institutions of all kinds.

Description of the drawings

Figure 1 is a content analysis diagram of the data quality detection method of the present invention;

Figure 2 is a module composition diagram of the data quality detection system of the present invention;

Figure 3 is a flow chart of the steps for obtaining the detection data of the data source according to the present invention;

Figure 4 is a flow chart of the operating steps of the data analysis model of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

The present invention provides a technical solution: a data quality detection method comprising: obtaining at least one piece of data to be detected from a data source, and disassembling the data to be detected into data characters; constructing a screening rule, and grouping together data characters that carry the same kind of character information; constructing a data analysis model that incorporates different systems, extracting the character information in the data characters to obtain a plurality of data character items, and comparing historical data character indicators with the data character items to obtain an item difference score for each data character; constructing a discrimination rule and setting a discrimination threshold, comparing the item difference score of each data character with the preset threshold to determine whether the individual data character is "qualified", counting the number of "qualified" data character items, and judging the overall quality of the data source from that number; and establishing impact indicators, assigning weights to the item difference scores of the data characters based on the application scenario of the data, and making a final judgment of the overall quality of the data to be detected from the number of "qualified" data character items after weighting.

As a specific implementation, obtaining the detection data of the data source includes the following steps:

Step 1. Determine the data source: first determine the data source to be detected, which can be a database, a file, network data and so on;

Step 2. Obtain the data: obtain the data to be detected from the determined data source;

Step 3. Data cleaning: clean the data to be detected by removing invalid, ambiguous, erroneous and duplicate characters, to ensure the accuracy of the data;

Step 4. Data disassembly: disassemble the cleaned data into data characters, that is, convert the data into a recognizable character sequence according to specific rules or formats;

Step 5. Data storage: store and manage the disassembled data characters in a database to facilitate subsequent data quality detection and analysis.
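As a non-limiting illustration, steps 2 to 5 above might be sketched with Python's standard `sqlite3` module as follows. The function names, the table name `data_chars` and its columns are hypothetical. The sketch assumes the data source is an in-memory list of strings, that cleaning means dropping non-printable characters and surrounding whitespace, and that one data character corresponds to one Unicode code point; the text leaves all three open.

```python
import sqlite3
from typing import Iterable, List


def clean(record: str) -> str:
    """Step 3 stand-in: drop non-printable characters and surrounding whitespace."""
    return "".join(ch for ch in record if ch.isprintable()).strip()


def disassemble(record: str) -> List[str]:
    """Step 4 stand-in: one data character per Unicode code point."""
    return list(record)


def ingest(records: Iterable[str]) -> sqlite3.Connection:
    """Steps 2 to 5: read records, clean, disassemble, and store the characters."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE data_chars (record_id INTEGER, position INTEGER, ch TEXT)"
    )
    for record_id, record in enumerate(records):
        chars = disassemble(clean(record))
        conn.executemany(
            "INSERT INTO data_chars (record_id, position, ch) VALUES (?, ?, ?)",
            [(record_id, pos, ch) for pos, ch in enumerate(chars)],
        )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = ingest([" ab\x07c ", "xy"])
    print(conn.execute("SELECT * FROM data_chars").fetchall())
```

Storing the position alongside each character keeps the original sequence recoverable, which later comparison steps may need.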

It should be noted that, through the above steps, the detection data in the data source to be detected can be obtained for subsequent data quality detection and analysis.

As a specific implementation, the present invention compares the item difference score with the preset threshold. When the difference score is higher than the threshold, the item is judged to be high-quality data and recorded as "qualified"; when the difference score is lower than the threshold, the item is judged to be low-quality data and recorded as "unqualified". The quality grade of the data is then further judged from the number of "qualified" data character items.

It should be noted that the construction criteria of the impact indicators of the present invention include:

b. Consistency: for data characters from the same data source or the same batch, when several items show differences between their character information and the historical data, mark that data source or batch of data characters as "requiring close attention" and assign a weight of 40% to 55%;

It should be emphasized that the difference obtained by comparing the item difference score of a data character with the preset threshold is reconstructed according to the construction criteria of the impact indicators: the reconstructed value equals the original value multiplied by the weight, plus the original value.

It should further be added that the detection logic of the data quality detection method of the present invention is as follows. First, the data to be detected is obtained from the data source and disassembled into data characters. Next, data characters that carry the same kind of character information are grouped together according to the constructed screening rule. A data analysis model that incorporates different systems is then constructed, and the character information in the data characters is extracted to obtain a plurality of data character items. The historical data character indicators are then compared with the data character items to obtain the item difference score of each data character. After that, according to the constructed discrimination rule and the set discrimination threshold, the item difference score of each data character is compared with the preset threshold to determine whether the individual data character is "qualified", and the number of "qualified" data character items is counted. Finally, the overall quality of the data source is judged from the number of "qualified" data character items. In addition, impact indicators can be established: based on the application scenario of the data, weights are assigned to the item difference scores of the data characters, and the overall quality of the data to be detected is finally judged from the number of "qualified" data character items after weighting.
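The text does not give a formula for the item difference score. As a non-limiting illustration only, one possible scoring is sketched below. It assumes that each data character item and each historical indicator is a single number, and it defines the score as one minus the relative gap to the historical value, floored at zero, so that a higher score means closer agreement with history (consistent with "higher than the threshold" meaning "qualified"). The names `item_difference_score` and `score_items` are hypothetical.

```python
from typing import Dict


def item_difference_score(current: float, historical: float) -> float:
    """Score in [0, 1]: 1.0 when current equals historical, 0.0 when far apart."""
    # Assumption: the gap is measured relative to the historical value.
    scale = max(abs(historical), 1e-9)  # guard against division by zero
    relative_gap = abs(current - historical) / scale
    return max(0.0, 1.0 - relative_gap)


def score_items(
    items: Dict[str, float], history: Dict[str, float]
) -> Dict[str, float]:
    """Score every item that has a historical indicator; skip the rest."""
    return {
        name: item_difference_score(value, history[name])
        for name, value in items.items()
        if name in history
    }


if __name__ == "__main__":
    print(score_items({"length": 90.0, "digits": 4.0}, {"length": 100.0}))
```

Items without a historical indicator are skipped here; an implementation could instead treat them as unscored or as failures, which the text does not specify.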

As a specific implementation, the method of constructing the screening rule includes:

checking the data characters for invalid, ambiguous, erroneous and duplicate characters, and marking those characters; replacing or deleting the marked characters; and constructing a classification standard, and grouping together data characters of the same kind based on their character information. It should be added that characters of the same kind can be divided in the following ways:

First, classification by shape similarity: this method classifies characters mainly by their shape features. For example, among Chinese characters, "日" and "月" have very similar shapes and can be placed in the same category.

Second, classification by functional similarity: this method classifies characters mainly by their function or use. For example, in Chinese, pronouns such as "你" (you), "我" (I) and "他" (he) have similar functions and can be placed in the same category; in English, conjunctions such as "and", "or" and "but" also have similar functions and can be placed in the same category.

Third, classification by semantic similarity: this method classifies characters mainly by their semantics. For example, words such as "apple", "banana" and "orange" all denote fruit and can be placed in the same category; in English, articles such as "the", "a" and "an" have similar semantics and can be placed in the same category.

Fourth, classification by frequency and probability of occurrence: this method classifies characters mainly by how frequently and how probably they occur in the text. For example, in Chinese text, particles such as "的", "了" and "啊" occur very frequently and can be placed in a high-frequency character category; in English text, common words such as "the", "and" and "or" also occur with very high probability and can be placed in the high-frequency category.
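As a non-limiting illustration, the fourth way of dividing characters (by frequency of occurrence) might be sketched with `collections.Counter` as follows. The function name `frequency_classes` and the 5% cutoff are hypothetical, since the text gives no numeric boundary for "high-frequency". Whitespace is ignored as an assumption.

```python
from collections import Counter
from typing import Dict


def frequency_classes(text: str, high_share: float = 0.05) -> Dict[str, str]:
    """Label each character by its share of all non-whitespace characters."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        ch: "high-frequency" if n / total >= high_share else "low-frequency"
        for ch, n in counts.items()
    }


if __name__ == "__main__":
    print(frequency_classes("的的的的我", high_share=0.5))
```

The same counting works at word level for English text by passing a list of words through `Counter` instead of iterating over characters.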

The method of constructing the discrimination rule includes:

a. Setting the discrimination threshold: determine a suitable discrimination threshold from the item difference scores of historical data characters. The threshold can be set from the overall distribution of the data, such as the mean or the median. It can also be set from the business characteristics of the data, such as its fluctuation range and outliers.

b. Comparing the item difference score of each data character with the discrimination threshold: for each data character, compare its item difference score with the discrimination threshold. If the item difference score is greater than the threshold, the data character is judged to be high-quality data; conversely, if the item difference score is less than the threshold, the data character is judged to be low-quality data.

c. Counting the number of "qualified" data character items: among all data characters, count those whose item difference score is greater than the discrimination threshold.

d. Judging the overall quality of the data source: judge the overall quality of the data source from the number of "qualified" data character items. The higher the proportion of "qualified" data character items, the higher the overall quality of the data source.
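As a non-limiting illustration, steps a to d above might be sketched with Python's standard `statistics` module as follows. The function names are hypothetical. The sketch covers only the distribution-based way of setting the threshold (mean or median of historical scores); thresholds derived from business characteristics are not modeled. A score equal to the threshold is not counted as qualified, which is an assumption because the text leaves the equal case undefined.

```python
import statistics
from typing import List, Sequence, Tuple


def threshold_from_history(scores: Sequence[float], method: str = "median") -> float:
    """Step a: derive the discrimination threshold from historical scores."""
    if method == "median":
        return statistics.median(scores)
    if method == "mean":
        return statistics.fmean(scores)
    raise ValueError(f"unsupported method: {method}")


def judge_source(scores: List[float], threshold: float) -> Tuple[int, float]:
    """Steps b to d: count qualified items and return (count, qualified share)."""
    if not scores:
        raise ValueError("no item difference scores to judge")
    qualified = sum(1 for s in scores if s > threshold)
    return qualified, qualified / len(scores)


if __name__ == "__main__":
    threshold = threshold_from_history([0.2, 0.6, 0.9])
    print(threshold, judge_source([0.7, 0.5, 0.9, 0.6], threshold))
```

The median is less sensitive than the mean to a few extreme historical scores, which may matter when the history itself contains abnormal data.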

It should be noted that the data quality detection method provided by the present invention achieves comprehensive detection and evaluation of data quality by constructing the screening rule, the data analysis model, the discrimination rule and the impact indicators. The method is broadly applicable and can be used to check various data sources, such as databases, files and network data. The method also takes the application scenario of the data into account: by assigning weights to the item difference scores of the data characters, it brings the detection results closer to actual business needs.

In addition, the present invention also provides specific methods for constructing the screening rule, the data analysis model, the discrimination rule and the impact indicators, which makes the data quality detection process clearer and more orderly. Implementing the present invention can effectively improve the accuracy and efficiency of data quality detection and provide reliable data quality assurance for enterprises and institutions of all kinds.

As a specific implementation, the operating steps of the data analysis model include:

Step 1. Data preprocessing: before the screened data characters are analyzed in depth, they need to be preprocessed. Preprocessing mainly consists of noise removal and standardization. Noise removal eliminates errors and outliers from the data to ensure the accuracy of the analysis. Standardization converts the data to a common scale to facilitate subsequent analysis. Both operations improve the quality of the data and lay a solid foundation for the subsequent analysis.
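As a non-limiting illustration, the two preprocessing operations might be sketched as follows for numeric features. The text does not name specific techniques, so this sketch swaps in two common ones as assumptions: outlier removal based on the median absolute deviation (with a hypothetical cutoff `k = 3.0`) and z-score standardization. The function names are hypothetical.

```python
import statistics
from typing import List, Sequence


def remove_outliers(values: Sequence[float], k: float = 3.0) -> List[float]:
    """Drop values more than k median-absolute-deviations from the median."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return list(values)  # no spread to measure against; keep everything
    return [v for v in values if abs(v - med) / mad <= k]


def standardize(values: Sequence[float]) -> List[float]:
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


if __name__ == "__main__":
    cleaned = remove_outliers([10, 11, 9, 10, 500])
    print(cleaned, standardize(cleaned))
```

Outliers are removed before standardization because a single extreme value would otherwise inflate the standard deviation and compress all other values toward zero.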

Step 2. Applying machine learning algorithms: the preprocessed data is used to train a machine learning algorithm. This process has two main parts: feature extraction and model construction. Feature extraction converts data characters into feature vectors that a computer can process, capturing information such as the syntax and semantics of the text. Model construction uses the training data and a machine learning algorithm (such as a decision tree, a support vector machine or a neural network) to learn the relationship between the feature vectors and the target variable, producing prediction and classification models.
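As a non-limiting illustration, feature extraction and model construction might be sketched with the standard library alone as follows. This is not one of the algorithms named above: it swaps in character bigram counts as the feature vector and a simple nearest-centroid classifier with cosine similarity, purely to keep the example self-contained. The function names and the sample labels "date" and "email" are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def features(text: str, n: int = 2) -> Counter:
    """Feature extraction: sparse vector of character n-gram counts."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def train(samples: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Model construction: sum the feature vectors of each label into a centroid."""
    centroids = defaultdict(Counter)
    for text, label in samples:
        centroids[label].update(features(text))
    return dict(centroids)


def predict(centroids: Dict[str, Counter], text: str) -> str:
    """Classify text as the label whose centroid is most similar."""
    vec = features(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


if __name__ == "__main__":
    model = train([
        ("2023-12-25", "date"), ("2024-03-26", "date"),
        ("alice@example.com", "email"), ("bob@example.org", "email"),
    ])
    print(predict(model, "2024-01-05"), predict(model, "carol@example.net"))
```

Summing rather than averaging the vectors is sufficient here because cosine similarity ignores vector length.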

Step 3. Data storage and management: the constructed model can predict and classify new data characters. To facilitate subsequent queries and use, the generated data character items need to be stored and managed. Storage options include relational databases, non-relational databases and distributed storage systems. Management mainly covers adding, deleting and modifying data, as well as updating and optimizing the model.

Step 4. Model update and optimization: the data analysis model is extensible and maintainable. As the business develops and the data changes, the model needs to be updated and optimized continuously. This includes introducing new data, adjusting algorithm parameters and improving the model structure. The model also needs to be evaluated to ensure its accuracy and effectiveness in actual use.

Step 5. Result output and application: data characters that have been predicted and classified by the model can be used in various business scenarios. For example, in finance they can be used for credit risk assessment and investment decisions; in healthcare, for disease prediction and medication recommendations; in education, for predicting student performance and evaluating teaching effectiveness. These applications help improve business efficiency and decision quality.

作为一种具体的实施方式，申请还提出了一种基于上述数据质量检测方法的检测系统，该系统包括以下模块：As a specific implementation, the application also proposes a detection system based on the above-mentioned data quality detection method. The system includes the following modules:

Data acquisition module: used to obtain the data to be detected from various data sources. These data sources can be databases, files, network connections, or other data storage and transmission channels.

Data disassembly module: preprocesses the data to be detected and disassembles it into the state of data characters. This process generally includes data cleaning, format conversion, and normalization.

Data induction module: summarizes and classifies the disassembled data characters and constructs screening rules so that data characters with similar character information are grouped together.

Data extraction module: based on the induction results, constructs a data analysis model internally loaded with different systems, extracts the character information within the data characters, and obtains multiple data character items.

Indicator comparison module: compares historical data character indicators with the current data character items and calculates the item difference score. This score reflects the degree of change and abnormality in the data.

Discrimination module: based on the discrimination rule and the discrimination threshold, evaluates the item difference score of each data character to determine whether a single data character is "up to standard", and counts the number of data character items that are "up to standard".

Quality judgment module: judges the quality level of the data according to the number of data character items that are "up to standard". This level reflects the accuracy and reliability of the data.

Impact indicator module: establishes impact indicators based on the application scenario of the data and assigns weights to the item difference scores of multiple data characters. The weights reflect how strongly each item difference score affects the overall data quality.

Final judgment module: makes the final judgment on the overall quality of the data to be detected according to the weighted number of data character items that are "up to standard". The result serves as a reference that helps data users understand the quality of the data.
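As a non-limiting illustration of how the discrimination module might hand its result to the quality judgment module, the following minimal sketch compares each item difference score with a single preset threshold and counts the items that are up to standard. It follows the comparison direction stated in claim 2 (a score higher than the threshold is up to standard). The function names, the dictionary layout, and the treatment of a score exactly equal to the threshold as not up to standard are assumptions, since the application does not specify them.

```python
def judge_items(scores, threshold):
    """Map each item name to True if its difference score exceeds the threshold.

    Assumption: a score equal to the threshold is treated as not up to standard.
    """
    return {name: score > threshold for name, score in scores.items()}


def up_to_standard_ratio(verdicts):
    """Return the share of items judged up to standard."""
    if not verdicts:
        raise ValueError("no data character items to judge")
    return sum(verdicts.values()) / len(verdicts)
```

For example, `judge_items({"name": 0.9, "phone": 0.4}, 0.5)` marks `name` as up to standard and `phone` as not, giving a ratio of 0.5.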

As a specific implementation, the following modules can also be added to the data quality detection system:

Anomaly detection and early warning module: if abnormal data characters are found during data quality detection, this module marks them promptly and sends early warning information to the data managers. This helps the data managers take timely measures and reduces the impact of low-quality data on business systems.

Data visualization module: to present the data quality detection results more intuitively, this module provides data visualization functions. Through charts, reports, and similar forms, data users can clearly understand the state of the data quality and optimize it in a targeted manner.

User interaction module: this module is intended to improve the user-friendliness of the system so that data users can operate it more conveniently. Through the interface, users can set screening rules, view detection results, and adjust model parameters to meet the needs of different business scenarios.

System optimization and upgrade module: to continuously improve system performance and adapt to a changing data environment, this module is responsible for optimizing and upgrading the system. This includes improvements to the data analysis model and the algorithm strategies, in order to improve the accuracy and efficiency of data quality detection.

Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principles and spirit of the invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims

1. A data quality detection method, characterized in that the method includes:

Obtain at least one piece of data to be detected from the data source, and disassemble the data to be detected into the state of data characters;

Construct screening rules and group data characters with similar character information together;

Construct a data analysis model internally loaded with different systems, extract the character information within the data characters, obtain multiple data character items, and compare historical data character indicators with the data character items to obtain the item difference score of each data character;

Construct a discrimination rule, set a discrimination threshold, compare the item difference score of each data character with the preset threshold to determine whether a single data character is "up to standard", count the number of data character items that are "up to standard", and judge the overall quality of the data source based on that number;

Establish impact indicators based on the application scenario of the data, assign weights to the item difference scores of multiple data characters, and make the final judgment on the overall quality of the data to be detected based on the weighted number of data character items that are "up to standard".

2. A data quality detection method according to claim 1, characterized in that: the item difference score is compared with a preset threshold;

When the item difference score is higher than the threshold, the item is judged to be high-quality data and recorded as "up to standard";

When the item difference score is lower than the threshold, the item is judged to be low-quality data and recorded as "not up to standard".

3. A data quality detection method according to claim 2, characterized in that: the quality level of the data is further judged based on the number of data character items that are "up to standard";

When at least 80% of the data character items are "up to standard", the data quality is "80-90 points";

When 50%-79% of the data character items are "up to standard", the data quality is "60-79 points";

When less than 50% of the data character items are "up to standard", the data quality is "40-59 points".
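By way of a non-limiting illustration, the three bands recited in claim 3 could be computed as in the following minimal sketch. The function name and the returned labels are hypothetical. The claim leaves the interval between 79% and 80% unspecified; the sketch assumes that any share from 50% up to, but not including, 80% falls in the middle band.

```python
def quality_band(up_to_standard, total):
    """Return the score band for a count of up-to-standard items out of total.

    Integer arithmetic is used so that the 50% and 80% boundaries are exact.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= up_to_standard <= total:
        raise ValueError("up_to_standard must lie between 0 and total")
    if up_to_standard * 100 >= 80 * total:
        return "80-90 points"
    if up_to_standard * 100 >= 50 * total:
        return "60-79 points"
    return "40-59 points"
```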

4. A data quality detection method according to claim 3, characterized in that: the construction standards of the impact indicators include:

a. Accuracy: for an item that is "up to standard", when its character information differs from the historical data, the item is marked as "possible problem" and assigned a weight of 20%-35%;

b. Consistency: for data characters from the same data source or the same batch, when multiple items show differences between their character information and the historical data, they are marked as "requiring focus" and assigned a weight of 40%-55%;

c. Stability: for an item that is "up to standard", when its character information does not differ from the historical data, the item is marked as "good stability" and assigned a weight of 10%-40%.

5. A data quality detection method according to claim 4, characterized in that: the item difference score of the data character is compared with the preset threshold, and the resulting difference is reconstructed according to the construction standards of the impact indicators, the reconstructed value being equal to the original value multiplied by the weight value plus the original value.
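By way of a non-limiting illustration, the reconstruction recited in claim 5 (original value multiplied by the weight value, plus the original value) could be sketched as follows, using the weight ranges of claim 4. The dictionary keys, the function name, and the range check are assumptions; the claims do not state how a specific weight is chosen within its range.

```python
# Weight ranges taken from claim 4, expressed as fractions.
WEIGHT_RANGES = {
    "accuracy": (0.20, 0.35),
    "consistency": (0.40, 0.55),
    "stability": (0.10, 0.40),
}


def reconstruct(original, indicator, weight):
    """Return original * weight + original for a weight inside the indicator's range."""
    low, high = WEIGHT_RANGES[indicator]
    if not low <= weight <= high:
        raise ValueError(f"weight {weight} outside {low}-{high} for {indicator}")
    return original * weight + original
```

For example, an original difference of 10.0 under the accuracy indicator with an assumed weight of 0.25 is reconstructed as 10.0 × 0.25 + 10.0 = 12.5.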

6. A data quality detection method according to claim 1, characterized in that: the method of constructing screening rules includes:

Check the data characters for invalid, ambiguous, wrong and repeated characters and mark these characters;

Replace or delete marked characters;

Construct a classification standard based on character information to group data characters of similar characters together.
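By way of a non-limiting illustration, the screening steps of claim 6 could be sketched as follows. The sketch treats each data character as a string token, which is an assumption. It handles only invalid tokens (empty or containing non-printable characters) and repeated tokens, and it merges the marking and deletion steps by dropping such tokens directly. Detecting ambiguous or wrong characters depends on the application scenario and is omitted. The three classes used for grouping are likewise hypothetical.

```python
def screen(tokens):
    """Drop invalid and repeated tokens, keeping first occurrences in order."""
    seen = set()
    kept = []
    for raw in tokens:
        token = raw.strip()
        if not token or not token.isprintable():
            continue  # invalid: empty or contains non-printable characters
        if token in seen:
            continue  # repeated: an identical token was already kept
        seen.add(token)
        kept.append(token)
    return kept


def classify(token):
    """Assign a token to a hypothetical class based on its characters."""
    if token.isdigit():
        return "numeric"
    if token.isalpha():
        return "alphabetic"
    return "mixed"


def group_by_class(tokens):
    """Group screened tokens so that tokens of the same class sit together."""
    groups = {}
    for token in tokens:
        groups.setdefault(classify(token), []).append(token)
    return groups
```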

7. A data quality detection method according to claim 1, characterized in that: the analysis method of the data analysis model includes:

Extract useful character features from filtered data characters;

Based on the extracted character features, associate and generate multiple data character items in the same field;

Store the generated data character items in a database and manage them.

8. A data quality detection system, characterized by including:

Data acquisition module: used to obtain at least one piece of data to be detected in the data source;

Data disassembly module: used to disassemble the data to be detected into the state of data characters;

Data induction module: used to construct screening rules and group data characters with similar character information together;

Data extraction module: used to construct a data analysis model internally loaded with different systems, extract the character information within the data characters, and obtain multiple data character items;

Indicator comparison module: used to compare historical data character indicators with the data character items to obtain the item difference score of each data character;

Discrimination module: used to construct a discrimination rule, set a discrimination threshold, compare the item difference score of each data character with the preset threshold to determine whether a single data character is "up to standard", and count the number of data character items that are "up to standard";

Quality judgment module: used to further judge the quality level of the data based on the number of data character items that are "up to standard";

Impact indicator module: used to establish impact indicators based on the application scenario of the data and to assign weights to the item difference scores of multiple data characters;

Final judgment module: used to make the final judgment on the overall quality of the data to be detected based on the weighted number of data character items that are "up to standard".