CN115328902B - A data quality inspection rule matching method, storage medium and system

Info

Publication number: CN115328902B
Application number: CN202211049853.5A
Authority: CN
Inventors: 徐欢; 施勇; 段琳; 李标奇; 徐敏
Original assignee: Information Center of Yunnan Power Grid Co Ltd
Current assignee: Information Center of Yunnan Power Grid Co Ltd
Priority date: 2022-08-30
Filing date: 2022-08-30
Publication date: 2025-05-16
Anticipated expiration: 2042-08-30
Also published as: CN115328902A

Abstract

A plurality of field metadata and a plurality of data quality inspection rules are collected, and the association degree between each field metadata and each data quality inspection rule is calculated. Field metadata whose association degree meets the standard are matched with the corresponding data quality inspection rules. The method then identifies candidate field metadata, which have been matched with a data quality inspection rule, and field metadata to be matched, which have not. If, for a field metadata to be matched, there exists candidate field metadata whose text similarity to it is greater than a preset threshold and whose data type is consistent with it, the parameter information contained in the data quality inspection rule selected by the user is replaced with the data information of the field metadata to be matched, and the condition parameters are replaced with new condition parameters input by the user. This yields a new data quality inspection rule, with which the field metadata to be matched is then matched.

Description

Data quality inspection rule matching method, storage medium and system

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a method, a storage medium, and a system for matching data quality inspection rules.

Background

When the power grid system operates, a large amount of service data can be generated, the service data can reflect the operation condition of the power grid system, and the service data are stored in the service system after being acquired. At present, a data quality inspection rule is generally adopted to inspect the quality of service data in a service system, and if the quality inspection result of the service data is abnormal, a worker needs to monitor the power grid operation service corresponding to the abnormal service data.

When selecting data quality inspection rules for quality inspection of service data, the association degree between the field metadata describing the service data and each data quality inspection rule is calculated first; if the association degree meets the standard, the field metadata is matched with that data quality inspection rule for quality inspection. If the association degree between a given field metadata and every data quality inspection rule fails to meet the standard, the field metadata cannot be matched with any data quality inspection rule, and the service data described by that field metadata therefore cannot be quality inspected.

Disclosure of Invention

The technical problem to be solved by the invention is how to improve the situation that field metadata is not matched with a data quality check rule.

In order to solve the technical problems, the invention provides a data quality inspection rule matching method, which comprises the following steps:

A. Collecting name information, source information and data type information of a plurality of field metadata used for describing service data from a service system;

B. Acquiring preset multiple data quality inspection rules, and field name information, field source information and condition parameters contained in each data quality inspection rule;

C. Judging whether the association degree between each field metadata and each data quality inspection rule meets the standard or not according to the name information and the source information of each field metadata and the field name information and the field source information contained in each data quality inspection rule;

D. Matching the field metadata with the association degree reaching the standard with a data quality check rule;

E. Identifying, among the plurality of field metadata, candidate field metadata that have been matched with a data quality check rule and field metadata to be matched that have not been matched with any data quality check rule;

F. For each field metadata to be matched, performing the following steps F1, F2, F3 and F4:

F1. Judging whether there exists candidate field metadata whose text similarity to the field metadata to be matched is greater than a preset threshold value and whose data type is consistent with it, and if so, displaying the candidate field metadata and its matched data quality check rule for the user to select;

F2. Obtaining the candidate field metadata selected by the user and the data quality check rule matched with it, and replacing the field name information and field source information contained in the data quality check rule with the name information and source information of the field metadata to be matched;

F3. Obtaining new condition parameters input by the user, and replacing the condition parameters contained in the data quality inspection rule selected by the user with the new condition parameters to obtain a new data quality inspection rule;

F4. Matching the field metadata to be matched with the new data quality check rule.

Preferably, in the step C, if for a data quality inspection rule the text similarity between its field name information and the name information of a field metadata reaches a first preset value, and the text similarity between its field source information and the source information of that field metadata reaches a second preset value, then the association degree between the data quality inspection rule and the field metadata meets the standard.

Preferably, in the step F1, the text similarity between the metadata of the field to be matched and the metadata of each candidate field is calculated according to the name information of the metadata of the field to be matched and the name information of the metadata of each candidate field, whether the metadata of the candidate field with the text similarity greater than a preset threshold exists is judged, and if so, whether the metadata of the field to be matched is consistent with the metadata of the candidate field in data type is judged by comparing the data type information of the metadata of the field to be matched with the data type information of the metadata of the candidate field.

Preferably, in the step F1, if the text similarity with the field metadata to be matched is greater than a preset threshold and there are multiple candidate field metadata with the same data type, the multiple candidate field metadata are sorted and displayed for the user to select according to the text similarity from large to small.

Preferably, in the step F1, candidate field metadata with text similarity ranked before a predetermined ranking is selected for display.

Preferably, in the step F2, an SQL engine is first used to decompose the data quality inspection rule into a select clause including the field name information, a from clause including the field source information, and a where clause including the condition parameters; the field name information in the select clause is then replaced with the name information of the field metadata to be matched, and the field source information in the from clause is replaced with the source information of the field metadata to be matched. In the step F3, the condition parameters in the where clause are replaced with the new condition parameters input by the user, and the replaced select clause, from clause and where clause are then combined to obtain the new data quality inspection rule.

The present invention also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps in a data quality check rule matching method as described above.

The invention also provides a data quality inspection rule matching system, which comprises a computer readable storage medium and a processor which are connected with each other, wherein the computer readable storage medium is as described above.

For each field metadata to be matched, that is, each field metadata not matched with any data quality inspection rule, the invention judges whether there exists candidate field metadata whose text similarity to it is greater than a preset threshold value and whose data type is consistent with it. If so, the candidate field metadata and its matched data quality inspection rule are displayed for the user to select. The candidate field metadata selected by the user and its matched data quality inspection rule are then acquired, the field name information and field source information contained in the rule are replaced with the name information and source information of the field metadata to be matched, and the condition parameters contained in the rule are replaced with new condition parameters input by the user. In other words, the original data quality inspection rule serves as a template whose field name information, field source information and condition parameters are changed according to the name information and source information of the field metadata to be matched and the new condition parameters input by the user. The resulting new data quality inspection rule is matched with the field metadata to be matched, so that quality inspection of the service data described by the field metadata to be matched can be performed using the new data quality inspection rule.

Drawings

Fig. 1 is a flow chart of a data quality check rule matching method.

Detailed Description

The invention is further described in detail below in connection with the detailed description.

The embodiment provides a data quality check rule matching system, which comprises a computer readable storage medium and a processor which are connected with each other, wherein the computer readable storage medium stores a computer program, and the computer program is executed by the processor to implement a data quality check rule matching method as shown in fig. 1. The method specifically comprises the following steps A, B, C, D, E and F.

A. A plurality of field metadata describing service data, name information, source information and data type information of each field metadata are collected from a service system.

In this embodiment, the service system stores a large amount of service data generated by the power grid system during operation, and the service data can reflect the operation condition of the power grid system. The data quality inspection rule matching system collects a plurality of field metadata for describing service data from the service system, and collects data information such as the name information, source information and data type information of each field metadata. For example, field metadata one has the name information "name", the source information "t_userinfo" and the data type information "text type"; field metadata two has the name information "name", the source information "t_admininfo" and the data type information "text type"; field metadata three has the name information "age", the source information "t_userinfo" and the data type information "numerical type".

B. Acquiring preset multiple data quality inspection rules, and field name information, field source information and condition parameters contained in each data quality inspection rule.

In order to perform quality inspection on service data in a service system, a plurality of data quality inspection rules are usually preset, and each data quality inspection rule includes parameter information such as field name information, field source information and condition parameters. For example, data quality inspection rule one, "SELECT name FROM t_userinfo WHERE len(name) > 8", is preset; its field name information is "name", its field source information is "t_userinfo", and its condition parameter is "8". Data quality inspection rule two, "SELECT date FROM t_userinfo WHERE date IS NULL", is also preset; its field name information is "date", its field source information is "t_userinfo", and its condition parameter is "null". The data quality inspection rule matching system acquires the plurality of preset data quality inspection rules, and acquires the field name information, field source information and condition parameters contained in each data quality inspection rule.
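The extraction of field name information, field source information and condition parameters from a rule can be sketched as follows. This is a minimal illustration, not the patented implementation: the `RuleParts` class, the `RULE_PATTERN` regular expression and the `parse_rule` function are hypothetical names, and the pattern assumes rules of the single-column, single-table shape used in the examples above.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleParts:
    field_name: str       # column named in the select clause
    field_source: str     # table named in the from clause
    condition: str        # predicate text of the where clause
    condition_param: str  # trailing literal of the predicate, e.g. "8"


# Assumption: only "SELECT <col> FROM <table> WHERE <predicate>" is supported.
RULE_PATTERN = re.compile(
    r"^\s*select\s+(?P<field>\w+)\s+from\s+(?P<source>\w+)"
    r"\s+where\s+(?P<cond>.+?)\s*$",
    re.IGNORECASE,
)


def parse_rule(sql: str) -> RuleParts:
    """Split a rule into the pieces that step B collects."""
    match = RULE_PATTERN.match(sql)
    if match is None:
        raise ValueError(f"unsupported rule shape: {sql!r}")
    cond = match.group("cond")
    # Treat the last token after any operator or space as the parameter.
    param = re.split(r"[\s<>=!]+", cond)[-1]
    return RuleParts(match.group("field"), match.group("source"), cond, param)


rule_one = parse_rule("SELECT name FROM t_userinfo WHERE len(name) > 8")
rule_two = parse_rule("SELECT date FROM t_userinfo WHERE date IS NULL")
```

A production system would more plausibly rely on a real SQL parser; the regular expression is used here only to keep the sketch self-contained.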

C. Judging whether the association degree between each field metadata and each data quality inspection rule meets the standard according to the name information and source information of each field metadata and the field name information and field source information contained in each data quality inspection rule.

The system uses the Levenshtein distance algorithm to calculate the text similarity between the name information of each field metadata and the field name information of each data quality inspection rule, and judges whether it reaches a first preset value (80%). It likewise calculates the text similarity between the source information of each field metadata and the field source information of each data quality inspection rule, and judges whether it reaches a second preset value (100%). If the name similarity reaches the first preset value and the source similarity reaches the second preset value, the system judges that the association degree between the field metadata and the data quality inspection rule meets the standard; otherwise it judges that the association degree does not meet the standard. For example, for field metadata one, field metadata two, field metadata three, data quality inspection rule one and data quality inspection rule two, the system needs to calculate six association degrees: between field metadata one and rule one, between field metadata one and rule two, between field metadata two and rule one, between field metadata two and rule two, between field metadata three and rule one, and between field metadata three and rule two. It then determines whether each of these association degrees reaches the standard, as follows:

The system calculates the text similarity between the name information "name" of field metadata one and the field name information "name" of data quality inspection rule one; the result is 100%, which reaches the first preset value (80%). It then calculates the text similarity between the source information "t_userinfo" of field metadata one and the field source information "t_userinfo" of data quality inspection rule one; the result is 100%, which reaches the second preset value (100%). In this case the system judges that the association degree between field metadata one and data quality inspection rule one meets the standard.

The system calculates the text similarity between the name information "name" of field metadata one and the field name information "date" of data quality inspection rule two; the result is 50%, which does not reach the first preset value (80%). It then calculates the text similarity between the source information "t_userinfo" of field metadata one and the field source information "t_userinfo" of data quality inspection rule two; the result is 100%, which reaches the second preset value (100%). In this case the system judges that the association degree between field metadata one and data quality inspection rule two does not meet the standard.

The system calculates the text similarity between the name information "name" of field metadata two and the field name information "name" of data quality inspection rule one; the result is 100%, which reaches the first preset value (80%). It then calculates the text similarity between the source information "t_admininfo" of field metadata two and the field source information "t_userinfo" of data quality inspection rule one; the result is 50%, which does not reach the second preset value (100%). In this case the system judges that the association degree between field metadata two and data quality inspection rule one does not meet the standard.

The system calculates the text similarity between the name information "name" of field metadata two and the field name information "date" of data quality inspection rule two; the result is 50%, which does not reach the first preset value (80%). It then calculates the text similarity between the source information "t_admininfo" of field metadata two and the field source information "t_userinfo" of data quality inspection rule two; the result is 50%, which does not reach the second preset value (100%). In this case the system judges that the association degree between field metadata two and data quality inspection rule two does not meet the standard.

The system calculates the text similarity between the name information "age" of field metadata three and the field name information "name" of data quality inspection rule one; the result is 30%, which does not reach the first preset value (80%). It then calculates the text similarity between the source information "t_userinfo" of field metadata three and the field source information "t_userinfo" of data quality inspection rule one; the result is 100%, which reaches the second preset value (100%). In this case the system judges that the association degree between field metadata three and data quality inspection rule one does not meet the standard.

The system calculates the text similarity between the name information "age" of field metadata three and the field name information "date" of data quality inspection rule two; the result is 30%, which does not reach the first preset value (80%). It then calculates the text similarity between the source information "t_userinfo" of field metadata three and the field source information "t_userinfo" of data quality inspection rule two; the result is 100%, which reaches the second preset value (100%). In this case the system judges that the association degree between field metadata three and data quality inspection rule two does not meet the standard.

It should be noted that the Levenshtein distance algorithm, also called the edit distance algorithm, obtains the edit distance between two character strings by calculating the minimum number of edit operations required to convert one character string into the other. The smaller the edit distance, the greater the text similarity of the two character strings. The edit operations are replacing one character with another character, inserting one character, and deleting one character.
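The edit distance described above can be sketched with the standard dynamic-programming formulation. The function names `levenshtein` and `text_similarity` are illustrative, and the normalization used to turn a distance into a percentage (one minus the distance divided by the longer string's length) is an assumption, since the text does not state which normalization the system uses.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


def text_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein(a, b) / longest
```

Under this assumed normalization, "name" and "date" differ by two substitutions and score 0.5, in line with the 50% figure in the embodiment. Other percentages quoted in the embodiment may reflect a different normalization, so they should be read as illustrative; the pass or fail outcome against the thresholds is what the method depends on.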

D. Matching the field metadata whose association degree meets the standard with the data quality inspection rule.

After judging whether the association degree between each field metadata and each data quality inspection rule meets the standard, the system establishes a mapping relationship between each field metadata and data quality inspection rule whose association degree meets the standard, so that they are matched, and establishes no mapping relationship where the association degree does not meet the standard, so that they are not matched. Specifically, the system has judged that the association degree between field metadata one and data quality inspection rule one meets the standard, while the association degrees between field metadata one and data quality inspection rule two, between field metadata two and each of data quality inspection rules one and two, and between field metadata three and each of data quality inspection rules one and two do not. That is, field metadata one matches data quality check rule one "SELECT name FROM t_userinfo WHERE len(name) > 8", while field metadata two and field metadata three match no data quality check rule.
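Steps C and D together can be sketched as a threshold check followed by building a mapping. The names `association_ok` and `match_rules`, the tuple layout `(name, source)` and the similarity normalization are all assumptions made for illustration; only the two thresholds (80% and 100%) come from the embodiment.

```python
from typing import Dict, List, Tuple

NAME_THRESHOLD = 0.8    # first preset value (80%)
SOURCE_THRESHOLD = 1.0  # second preset value (100%)


def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming over two rows."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ch_a != ch_b)))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    # Assumed normalization; the source does not specify one.
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)


def association_ok(field: Tuple[str, str], rule: Tuple[str, str]) -> bool:
    """Both the name and the source similarity must reach their thresholds."""
    return (similarity(field[0], rule[0]) >= NAME_THRESHOLD
            and similarity(field[1], rule[1]) >= SOURCE_THRESHOLD)


def match_rules(fields: Dict[str, Tuple[str, str]],
                rules: Dict[str, Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each field id to the ids of the rules it is matched with."""
    return {fid: [rid for rid, rule in rules.items() if association_ok(f, rule)]
            for fid, f in fields.items()}


fields = {"one": ("name", "t_userinfo"),
          "two": ("name", "t_admininfo"),
          "three": ("age", "t_userinfo")}
rules = {"rule_one": ("name", "t_userinfo"),
         "rule_two": ("date", "t_userinfo")}
mapping = match_rules(fields, rules)
```

With the example data of the embodiment, only field metadata one ends up mapped to rule one; the other two fields map to an empty list.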

E. Identifying, among the plurality of field metadata, candidate field metadata that have been matched with a data quality check rule and field metadata to be matched that have not been matched with any data quality check rule.

In this embodiment, the system marks field metadata one, which has been matched with data quality inspection rule one, as candidate field metadata, and marks field metadata two and field metadata three, which have not been matched with any data quality inspection rule, as field metadata to be matched, thereby distinguishing the candidate field metadata from the field metadata to be matched.
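Step E amounts to partitioning the fields by whether they already have a matched rule. A minimal sketch, assuming the matching result is held as a dictionary from a field identifier to a list of rule identifiers (a hypothetical representation, as is the function name `partition_fields`):

```python
from typing import Dict, List, Tuple


def partition_fields(mapping: Dict[str, List[str]]) -> Tuple[List[str], List[str]]:
    """Split field ids into candidates (have a rule) and fields to be matched."""
    candidates = [fid for fid, rule_ids in mapping.items() if rule_ids]
    to_be_matched = [fid for fid, rule_ids in mapping.items() if not rule_ids]
    return candidates, to_be_matched


# Matching result of the embodiment: only field metadata one has a rule.
candidates, to_be_matched = partition_fields(
    {"one": ["rule_one"], "two": [], "three": []}
)
```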

F. The following steps F1, F2, F3, F4 are performed for each field metadata to be matched.

(1) The execution of steps F1, F2, F3, F4 for field metadata two to be matched is detailed as follows:

F1. Judging whether candidate field metadata with the text similarity larger than a preset threshold value and consistent in data type exists or not, if so, displaying the candidate field metadata and the matched data quality check rule thereof for the user to select;

The system calculates the text similarity between the field metadata to be matched and each candidate field metadata according to the name information of the field metadata to be matched and the name information of each candidate field metadata, and judges whether there exists candidate field metadata whose text similarity to the field metadata to be matched is greater than a preset threshold (for example, 80%). If so, it compares the data type information of the field metadata to be matched with the data type information of that candidate field metadata to judge whether their data types are consistent. If the text similarity is greater than the preset threshold and the data types are consistent, the system displays the candidate field metadata and its matched data quality check rule for the user to select.

In this embodiment, the name information of field metadata two to be matched is "name" and its data type information is "text type". There is one candidate field metadata, namely candidate field metadata one, whose name information is "name" and whose data type information is "text type". From the name information "name" of field metadata two to be matched and the name information "name" of candidate field metadata one, the text similarity between the two is calculated to be 100%, which is greater than the preset threshold (80%); that is, candidate field metadata one has a text similarity to field metadata two to be matched that is greater than the preset threshold. Comparing the data type information "text type" of field metadata two to be matched with the data type information "text type" of candidate field metadata one then shows that the data types are consistent. Since the text similarity is greater than the preset threshold and the data types are consistent, the system displays candidate field metadata one and its matched data quality check rule one "SELECT name FROM t_userinfo WHERE len(name) > 8" for the user to select.

In other embodiments, if there are multiple candidate field metadata, and more than one of them has a text similarity to field metadata two to be matched that is greater than the preset threshold and a data type consistent with it, these candidate field metadata are sorted by text similarity from large to small, and the top three are selected and displayed for the user to select.
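Step F1, including the sorting and top-three selection described above, can be sketched as a filter followed by a sort. The `FieldMetadata` class, the `suggest_candidates` function, its parameter names and the similarity normalization are assumptions for illustration; the 80% threshold and the limit of three come from the embodiment.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FieldMetadata:
    name: str
    source: str
    data_type: str


def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming over two rows."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ch_a != ch_b)))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    # Assumed normalization; the source does not specify one.
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)


def suggest_candidates(pending: FieldMetadata,
                       candidates: List[FieldMetadata],
                       threshold: float = 0.8,
                       top_n: int = 3) -> List[Tuple[FieldMetadata, float]]:
    """Candidates above the threshold with the same data type, best first."""
    scored = []
    for cand in candidates:
        score = similarity(pending.name, cand.name)
        # Strictly greater than the threshold, and data types must agree.
        if score > threshold and cand.data_type == pending.data_type:
            scored.append((cand, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


one = FieldMetadata("name", "t_userinfo", "text")
two = FieldMetadata("name", "t_admininfo", "text")
three = FieldMetadata("age", "t_userinfo", "numeric")
for_two = suggest_candidates(two, [one])
for_three = suggest_candidates(three, [one])
```

For field metadata two the single candidate is returned with a score of 1.0, while field metadata three receives no suggestion, matching the two cases walked through in this embodiment.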

F2. Acquiring the candidate field metadata selected by the user and its matched data quality inspection rule, and replacing the field name information and field source information contained in the data quality inspection rule with the name information and source information of the field metadata to be matched.

In this embodiment, the system displays candidate field metadata one and its matched data quality check rule one "SELECT name FROM t_userinfo WHERE len(name) > 8". If the user, after viewing them, finds data quality check rule one suitable, the user can select candidate field metadata one and its matched data quality check rule one, and the system acquires the user's selection. The SQL splitting module in the SQL engine then decomposes data quality check rule one "SELECT name FROM t_userinfo WHERE len(name) > 8" into a select clause "SELECT name", a from clause "FROM t_userinfo" and a where clause "WHERE len(name) > 8". Next, the parameter filling module in the SQL engine replaces the field name information "name" in the select clause with the name information "name" of field metadata two to be matched, and replaces the field source information "t_userinfo" in the from clause with the source information "t_admininfo" of field metadata two to be matched.

F3. Acquiring new condition parameters input by the user, and replacing the condition parameters contained in the data quality inspection rule selected by the user with the new condition parameters to obtain the new data quality inspection rule.

After selecting candidate field metadata one and its matched data quality check rule one, the user also needs to input a new condition parameter, such as "10", into the system based on experience. After acquiring the new condition parameter "10" input by the user, the system uses the parameter filling module in the SQL engine to replace the condition parameter "8" in the where clause with the new condition parameter "10", and then uses the SQL combination module in the SQL engine to combine the replaced select clause, from clause and where clause into the new data quality check rule "SELECT name FROM t_admininfo WHERE len(name) > 10".

As can be seen from steps F2 and F3, the system takes data quality inspection rule one "SELECT name FROM t_userinfo WHERE len(name) > 8" as a template and changes its field name information "name", field source information "t_userinfo" and condition parameter "8" according to the name information "name" and source information "t_admininfo" of field metadata two to be matched and the new condition parameter "10" input by the user, so as to obtain the new data quality inspection rule "SELECT name FROM t_admininfo WHERE len(name) > 10".
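The decompose, replace and recombine sequence of steps F2 and F3 can be sketched as follows. This is not the SQL engine of the embodiment: `split_rule`, `replace_token` and `derive_rule` are hypothetical helpers that stand in for its splitting, parameter filling and combination modules, and the regular expression assumes the simple three-clause rule shape used in the examples.

```python
import re
from typing import Tuple

# Assumption: rules have exactly one select, one from and one where clause.
CLAUSES = re.compile(
    r"^\s*(select\s+.+?)\s+(from\s+.+?)\s+(where\s+.+?)\s*$",
    re.IGNORECASE,
)


def split_rule(sql: str) -> Tuple[str, str, str]:
    """Stand-in for the splitting module: return the three clauses."""
    match = CLAUSES.match(sql)
    if match is None:
        raise ValueError(f"unsupported rule shape: {sql!r}")
    return match.group(1), match.group(2), match.group(3)


def replace_token(clause: str, old: str, new: str) -> str:
    """Replace whole-word occurrences of old with new inside one clause."""
    return re.sub(rf"\b{re.escape(old)}\b", lambda _m: new, clause)


def derive_rule(template: str, old_name: str, new_name: str,
                old_source: str, new_source: str,
                old_param: str, new_param: str) -> str:
    select_clause, from_clause, where_clause = split_rule(template)
    # F2: field name in the select clause, field source in the from clause.
    select_clause = replace_token(select_clause, old_name, new_name)
    from_clause = replace_token(from_clause, old_source, new_source)
    # F3: only the condition parameter changes in the where clause.
    where_clause = replace_token(where_clause, old_param, new_param)
    # Stand-in for the combination module.
    return " ".join((select_clause, from_clause, where_clause))


new_rule = derive_rule(
    "SELECT name FROM t_userinfo WHERE len(name) > 8",
    "name", "name", "t_userinfo", "t_admininfo", "8", "10",
)
```

The sketch follows the text in leaving the field name inside the where clause untouched. In the worked example the two field names are identical, so this makes no difference; a variant for fields with different names would also need to rewrite that reference.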

It should be noted that SQL is an abbreviation of Structured Query Language, a computer language used to access, query, update and manage data in relational databases. The SQL engine is one of the important subsystems of a database: it accepts SQL statements sent by the application layer above it and directs the executor below it to run an execution plan.

F4. Matching the field metadata to be matched with the new data quality check rule.

Once the new data quality check rule "SELECT name FROM t_admininfo WHERE len(name) > 10" is obtained, the association degree between field metadata two to be matched and the new rule meets the standard. The system therefore establishes a mapping relationship between field metadata two to be matched and the new data quality check rule, so that the two are matched, and the service data described by field metadata two to be matched can be quality checked using the new data quality check rule.

(2) The execution of steps F1, F2, F3, F4 for field metadata three to be matched is detailed as follows:

In this embodiment, the step F1 is performed first. The name information of field metadata three to be matched is "age" and its data type information is "numerical type". There is one candidate field metadata, namely candidate field metadata one, whose name information is "name" and whose data type information is "text type". From the name information "age" of field metadata three to be matched and the name information "name" of candidate field metadata one, the text similarity between the two is calculated to be 30%, which is not greater than the preset threshold (80%); that is, there is no candidate field metadata whose text similarity to field metadata three to be matched is greater than the preset threshold. It is therefore unnecessary to compare the data types of field metadata three to be matched and candidate field metadata one. Since the condition that the text similarity is greater than the preset threshold and the data types are consistent is not satisfied, the system does not display candidate field metadata one and its matched data quality inspection rule one "SELECT name FROM t_userinfo WHERE len(name) > 8" for the user to select.

The system then does not need to perform steps F2, F3, F4.

The above embodiments are provided only to illustrate the present invention and are not intended to limit the scope of patent protection. Insubstantial changes and substitutions made by one skilled in the art in light of the teachings of the invention still fall within the scope of the claims.

Claims

1. A data quality inspection rule matching method is characterized by comprising the following steps:

2. The method according to claim 1, wherein in the step C, if the data quality inspection rule is provided, the text similarity between the field name information and the name information of the metadata of the field reaches a first preset value, and the text similarity between the field source information and the source information of the metadata of the field reaches a second preset value, the association degree between the data quality inspection rule and the metadata of the field is up to standard.

3. The method according to claim 1, wherein in the step F1, the text similarity between the field metadata to be matched and each candidate field metadata is calculated according to the name information of the field metadata to be matched and the name information of each candidate field metadata, and it is determined whether there exists candidate field metadata whose text similarity is greater than a preset threshold; if so, whether the data types of the field metadata to be matched and the candidate field metadata are consistent is determined by comparing the data type information of the field metadata to be matched with the data type information of the candidate field metadata.

4. The method according to claim 1, wherein in the step F1, if the text similarity with the field metadata to be matched is greater than a preset threshold and there are a plurality of candidate field metadata with the same data type, the plurality of candidate field metadata are sorted and displayed for the user to select according to the text similarity from large to small.

5. The method according to claim 4, wherein in the step F1, candidate field metadata with text similarity ranked before a predetermined ranking is selected for presentation.

6. The method of claim 1, wherein in step F2, the SQL engine is used to decompose the data quality inspection rule into a select clause including field name information, a from clause including field source information, and a where clause including condition parameters, then the field name information in the select clause is replaced with the name information of the field metadata to be matched, the field source information in the from clause is replaced with the source information of the field metadata to be matched, and in step F3, the condition parameters in the where clause are replaced with new condition parameters input by a user, and then the replaced select clause, from clause, and where clause are combined to obtain the new data quality inspection rule.

7. A computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor realizes the steps in the data quality check rule matching method according to any one of claims 1 to 6.

8. A data quality check rule matching system comprising a computer readable storage medium and a processor coupled to each other, wherein the computer readable storage medium is as claimed in claim 7.