
CN113886398A - Data processing method, apparatus and electronic equipment - Google Patents


Info

Publication number
CN113886398A
Authority
CN
China
Prior art keywords
data
attribute
target
target data
rules
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111224470.2A
Other languages
Chinese (zh)
Other versions
CN113886398B (en)
Inventor
余蓥良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd filed Critical Lenovo Beijing Ltd
Priority to CN202111224470.2A priority Critical patent/CN113886398B/en
Publication of CN113886398A publication Critical patent/CN113886398A/en
Application granted granted Critical
Publication of CN113886398B publication Critical patent/CN113886398B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application provides a data processing method, a data processing apparatus and electronic equipment. The method comprises: determining, based on target data, data of a plurality of attribute features included in the target data, wherein the target data belongs to data of a specified type that can be represented in the form of a two-dimensional table; and mining association relationships among data of different attribute features in the target data to obtain at least one group of association relationship rules, wherein each group of association relationship rules comprises association rules that the data within at least two attribute features need to satisfy, and the association relationship rules are used for determining abnormal data existing in data of the specified type. The scheme of the present application makes it possible to determine abnormal data in data.

Description

Data processing method and device and electronic equipment
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a data processing method and apparatus, and an electronic device.
Background
With the advent of the big data age, daily life and work often involve large amounts of data that need to be processed.
In many cases, some abnormal data may exist in the acquired data, and the abnormal data may affect the accuracy of data analysis, so how to determine the abnormal data existing in the data is a technical problem that needs to be solved by those skilled in the art.
Disclosure of Invention
The application provides a data processing method and device and electronic equipment.
The data processing method comprises the following steps:
determining data of a plurality of attribute features included in target data based on the target data, wherein the target data belongs to specified type data which can be represented in a two-dimensional table form;
mining association relationships among data of different attribute features in the target data to obtain at least one group of association relationship rules, wherein each group of association relationship rules comprises: association rules that the data within at least two attribute features need to satisfy;
wherein the association relationship rules are used for determining abnormal data existing in data of the specified type.
In a possible implementation manner, the mining association relationships between data of different attribute features in the target data to obtain at least one set of association relationship rules includes:
and mining association relations among different attribute features in the target data by using a data mining algorithm according to the set confidence coefficient to obtain at least one group of association relation rules.
In another possible implementation manner, the mining association between different attribute features in the target data according to the set confidence level and by using a data mining algorithm to obtain at least one set of association rules includes:
performing frequent item set mining on each group of data in the target data by using a data mining algorithm to obtain a plurality of mined frequent item sets which accord with a set confidence coefficient, wherein each group of data corresponds to one line of data in a two-dimensional table converted from the target data;
and analyzing the incidence relation among different attribute characteristics in the multiple frequent item sets by using the data mining algorithm to obtain at least one group of incidence relation rules.
In another possible implementation manner, the mining association relationships between data of different attribute features in the target data to obtain at least one set of association relationship rules includes:
for each attribute feature combination in the target data, counting the equal probability, that is, the probability that the data belonging to the same group under the attribute feature combination are equal, wherein the target data comprises at least one attribute feature combination, the attribute feature combination comprises two attribute features in the target data, and each group of data corresponds to one row of data in the two-dimensional table converted from the target data;
and determining at least one target attribute feature combination with the equal probability higher than a probability threshold value to obtain an equal rule corresponding to the target attribute feature combination, wherein the equal rule of the target attribute feature combination represents that data belonging to the same group in two attributes in the attribute feature combination are equal.
In another possible implementation manner, the counting the equal probability that the data in the same group in each attribute feature combination in the target data is equal includes:
determining the correlation degree between every two attribute features in the target data based on the data under each attribute feature in the target data;
determining at least one candidate attribute feature combination with the correlation degree higher than a correlation degree threshold value, wherein the candidate attribute feature combination comprises two attribute features with the correlation degree higher than the correlation degree threshold value;
and respectively counting the equal probability of the data in the same group in each candidate attribute feature combination.
In another possible implementation manner, before determining a correlation degree between every two attribute features in the target data based on data under each attribute feature in the target data, the method further includes:
determining the attribute features in the target data whose data is non-numerical data, and converting the non-numerical data under those attribute features into numerical data.
In another possible implementation manner, the method further includes:
and determining abnormal data in the data to be detected belonging to the specified type of data according to the at least one group of incidence relation rules, wherein the data to be detected and the target data contain the same attribute characteristics.
Wherein, a data processing device comprises:
a data determination unit configured to determine data of a plurality of attribute features included in target data based on the target data, the target data belonging to a specified type of data that can be represented in a two-dimensional table form;
a data mining unit configured to mine association relationships among data of different attribute features in the target data to obtain at least one group of association relationship rules, wherein each group of association relationship rules comprises: association rules that the data within at least two attribute features need to satisfy; wherein the association relationship rules are used for determining abnormal data existing in data of the specified type.
In one possible implementation, the data mining unit includes:
an algorithm mining unit configured to mine, according to a set confidence and by using a data mining algorithm, association relationships among different attribute features in the target data to obtain at least one group of association relationship rules.
Wherein, an electronic equipment includes: a processor and a memory;
wherein the processor is configured to perform the data processing method as described in any one of the above;
the memory is used for storing programs needed by the processor to execute operations.
As can be seen from the above, the present application mines the association relationships between the data of different attribute features in the target data and determines at least one group of association relationship rules existing in the target data. Because each group of association relationship rules comprises association rules that the data within at least two attribute features need to satisfy, and the rules are obtained by mining target data that belongs to data of the specified type, the rules are generally applicable to data of the specified type, and abnormal data existing in data of the specified type can be analyzed and detected based on them.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description are only embodiments of the present application, and those skilled in the art can obtain other drawings from the provided drawings without creative effort.
Fig. 1 is a schematic flow chart illustrating a data processing method according to an embodiment of the present application;
Fig. 2 is a schematic flow chart illustrating a data processing method according to an embodiment of the present application;
Fig. 3 is a schematic flow chart illustrating a data processing method according to an embodiment of the present application;
Fig. 4 is a schematic flow chart illustrating a data processing method according to an embodiment of the present application;
Fig. 5 is a schematic flow chart illustrating a data processing method according to an embodiment of the present application;
Fig. 6 is a schematic diagram illustrating a composition structure of a data processing apparatus according to an embodiment of the present application;
Fig. 7 is a schematic diagram illustrating a composition structure of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without inventive step, are within the scope of the present disclosure.
Fig. 1 shows a schematic flow chart of an embodiment of a data processing method according to the present application. The method of the present embodiment may include:
S101, determining data of a plurality of attribute features included in the target data based on the target data.
Wherein the target data belongs to specified type data which can be represented in a two-dimensional table form. That is, the specific form of the target data may be table data of a two-dimensional table, or may be data that is presented in another form but can be converted into a two-dimensional table.
For example, the target data may be a piece of table data; alternatively, it may be structured data or unstructured data that can be stored in tabular form.
The specified type data is used to characterize the data class to which the target data belongs, and the specified type may specify the number and kind of attribute features contained in the data. For example, the specified type data may be data that is capable of being converted into a two-dimensional table and includes specified attribute characteristics.
For example, the specified data type may be access record data generated by the same website in different actual time periods, and since fields included in the access record data generated at different times are the same, different access data records all belong to the same type of table data.
For another example, the data belonging to the specified type of data may be student management data or the like containing the same field.
Wherein the attribute feature is a data representation of the target data in an attribute dimension. Specifically, the attribute feature in the target data corresponds to a field in the target data when the target data is represented by a two-dimensional table.
It will be appreciated that the target data may comprise a plurality of pieces of data, each piece of data corresponding to a row of records in a two-dimensional table into which the target data is translated.
S102, mining association relationships among data of different attribute features in the target data to obtain at least one group of association relationship rules.
Wherein each group of association relation rules comprises: association rules that the data within the at least two attribute features need to satisfy. It can be understood that the association rule corresponding to at least two attribute features is actually an association rule determined based on the association relationship between the at least two attribute features.
It is to be understood that the association rule between the at least two attribute features may be a data relationship rule or a logical relationship rule that is satisfied between the at least two attribute features. Each group of association relationship rules so determined is applicable to any group of data in the target data; that is, the association rule is satisfied between the at least two attribute features in each group of data.
The target data may include multiple groups of data (which may also be referred to as multiple pieces of data), and each group of data (each piece of data) corresponds to one row of the two-dimensional table into which the target data is converted.
For example, the association rule is that the feature values of at least two attribute features are the same. Then the characteristic values on the at least two attribute features within the set of data should be the same for any set of data in the target data.
As another example, the association rule may be that the sum of attribute feature a and attribute feature B is equal to attribute feature C. Correspondingly, for any group of data, the sum of the value of the attribute feature a and the value of the attribute feature B in the group of data should be equal to the value of the attribute feature C.
As another example, the association rule may be: in the case where attribute feature 1 is equal to S1 and attribute feature 2 is equal to S2, attribute feature 3 is equal to S3.
It can be understood that, since the target data belongs to data of the specified type, mining the association relationships in the target data actually yields the association relationship rules that need to be satisfied between different attribute features in data of the specified type. Based on this, the association relationship rules determined in the application can be used for determining abnormal data existing in data of the specified type.
Further, the abnormal data existing in data to be detected that belongs to the specified type can be determined according to the at least one group of association relationship rules, wherein the data to be detected and the target data contain the same attribute features. The data to be detected can be the target data itself or data other than the target data.
For example, it may be detected, based on the at least one group of association relationship rules, whether the target data contains data that does not satisfy a rule; any such data in the target data is abnormal data. For instance, if a mined association rule is that the value of attribute feature A should be equal to the value of attribute feature B, but the values of the two attribute features in a certain piece of data in the target data are not equal, the data of the two attribute features in that piece of data is regarded as abnormal data.
For another example, abnormal data detection may be performed on a piece of data other than the target data and belonging to the specified type of data based on the at least one set of association rules.
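The detection step described above can be sketched as follows. This is only an illustrative sketch, not the implementation of the application: the rule representation (a name plus a predicate), the function name `find_abnormal`, and the sample rows are all assumptions introduced here. It uses the two example rule shapes mentioned earlier (A equals B, and A plus B equals C).

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]
Rule = Tuple[str, Callable[[Row], bool]]

# Hypothetical rules of the kinds given as examples in the text.
RULES: List[Rule] = [
    ("A == B", lambda r: r["A"] == r["B"]),
    ("A + B == C", lambda r: r["A"] + r["B"] == r["C"]),
]


def find_abnormal(rows: List[Row], rules: List[Rule]) -> List[Tuple[int, str]]:
    """Return (row_index, rule_name) for every rule that a row violates."""
    violations = []
    for index, row in enumerate(rows):
        for name, holds in rules:
            if not holds(row):
                violations.append((index, name))
    return violations


# Illustrative data to be detected; it has the same fields as the mined data.
rows = [
    {"A": 2, "B": 2, "C": 4},  # satisfies both rules
    {"A": 2, "B": 3, "C": 5},  # violates "A == B"
    {"A": 1, "B": 1, "C": 3},  # violates "A + B == C"
]
abnormal = find_abnormal(rows, RULES)
```

Keeping each rule as an independent predicate means the same rule list can be applied to the target data or to any other data of the specified type.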
As can be seen from the above, the present application mines the association relationships between the data of different attribute features in the target data and determines at least one group of association relationship rules existing in the target data. Because each group of association relationship rules comprises association rules that the data within at least two attribute features need to satisfy, and the rules are obtained by mining target data that belongs to data of the specified type, the rules are generally applicable to data of the specified type, and abnormal data existing in data of the specified type can be analyzed and detected based on them.
It will be appreciated that there are many possible ways of mining the association relationship rules present in the target data. Several cases are described below as examples.
In a possible case, the application can utilize a data mining algorithm to mine the association relationship between different attribute features in the target data according to the set confidence level to obtain at least one group of association relationship rules.
The confidence is a threshold, and can be specifically set according to needs.
The data mining algorithm can be any association rule mining algorithm capable of mining implicit relationships in data. For example, the data mining algorithm may be the Apriori algorithm or the FP-Growth algorithm.
For this possible case, a specific implementation is described below.
Fig. 2 shows a schematic flow chart of another embodiment of the data processing method of the present application. The method of the present embodiment may include:
S201, determining data of a plurality of attribute features included in the target data based on the target data.
This step can be referred to the related description of the previous embodiment, and is not described herein again.
S202, performing frequent item set mining on each group of data in the target data by using a data mining algorithm to obtain a plurality of mined frequent item sets that satisfy the set confidence.
As mentioned above, each set of data corresponds to a row of data in the two-dimensional table converted from the target data.
A frequent item set is defined as follows: given a series of sets that share some elements, the elements that frequently occur together in those sets form a subset; if this subset satisfies a certain threshold condition (i.e., the confidence), it is a frequent item set. In the present application, a frequent item set is a set of data items whose number of co-occurrences across the multiple groups of data in the target data satisfies the confidence.
The confidence represents the probability that one piece of data appears given that another piece of data appears.
According to different data mining algorithms, the specific implementation process of mining frequent item sets is different, and the method is not limited by the application.
It can be understood that, before frequent item set mining, each data item in the target data is converted into a data item that carries its attribute feature. This makes the attribute feature to which each data item belongs explicit, and prevents identical values under different attribute features from interfering with frequent item set mining. Specifically, a data item may be converted into the attribute feature to which it belongs plus the data item.
Wherein each data item is a data value under one attribute feature in the target data.
For example, if the target data is a two-dimensional table and the data under field T in a certain row of the table is aaa, the data is converted into Taaa, that is, the field T followed by the value aaa.
S203, analyzing the association relationships among different attribute features in the multiple frequent item sets by using the data mining algorithm to obtain at least one group of association relationship rules.
Wherein each group of association relation rules comprises: association rules that the data within the at least two attribute features need to satisfy. And the incidence relation rule is used for determining abnormal data existing in the specified type of data.
Because the data mining algorithm can also be called as an association relation mining algorithm, the mined frequent item set is analyzed through the data mining algorithm, the association relation existing among the data items with different attribute characteristics in the frequent item set can be determined, and the association relation existing among the different attribute characteristics is finally obtained.
It can be understood that data mining algorithms are mature and can analyze the association relationships between different attribute features in a piece of data accurately and comprehensively. By relying on a data mining algorithm, this embodiment can therefore mine the association relationship rules existing in the target data more comprehensively, accurately and efficiently, so that abnormal data existing in data of the specified type can be analyzed more comprehensively and accurately based on the mined rules.
For the convenience of understanding, the following describes a process of determining association rules between different fields in a two-dimensional table by using a data mining algorithm, taking target data as an example of the two-dimensional table.
Fig. 3 shows a schematic flow chart of an embodiment of a data processing method according to the present application. The method of the present embodiment may include:
S301, obtaining a two-dimensional table, and determining the data of each row and each field of the two-dimensional table.
The two-dimensional table belongs to a two-dimensional table of specified type data, for example, the two-dimensional table can be a student family information record table.
S302, for each data item in the two-dimensional table, converting the data item into data formed by splicing the field to which it belongs and the data item, to obtain a converted two-dimensional table.
Each data item is one piece of data in the two-dimensional table, that is, the data uniquely determined by a row and a column. For example, the data in the first row and the first column of the two-dimensional table is a data item, and the data in the first row and the second column of the two-dimensional table is another data item.
The data obtained by splicing the field and the data item reflects both the specific value of the data and the field to which the data belongs. This conversion prevents identical data items under different fields from being treated as the same frequently occurring data.
For example, if the data item in the second row under field S of the two-dimensional table is 100, it may be converted to S100, that is, the field S followed by the value 100.
S303, according to the set confidence, performing frequent item set mining on each row of data in the converted two-dimensional table by using a data mining algorithm to obtain a plurality of mined frequent item sets that satisfy the confidence.
It can be understood that, since the data in the two-dimensional table is already converted, only the data whose fields and data items are identical and frequently appear may be identified as belonging to frequent items in the frequent item set during the frequent item set mining process for each row of data in the converted two-dimensional table.
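The conversion in S302 can be sketched as below. This is a minimal illustration under stated assumptions: the function name `to_transactions` and the sample table are hypothetical, and an `=` separator is inserted between field and value (the text splices them directly, e.g. S100) purely to keep the spliced items unambiguous.

```python
from typing import FrozenSet, List, Sequence


def to_transactions(fields: Sequence[str],
                    table: Sequence[Sequence[object]]) -> List[FrozenSet[str]]:
    """Convert each row into a set of 'field=value' items (one set per row)."""
    return [
        frozenset(f"{field}={value}" for field, value in zip(fields, row))
        for row in table
    ]


# Hypothetical two-field table; the value 100 appears under both fields.
fields = ["S", "T"]
table = [
    [100, 100],
    [100, 200],
]
transactions = to_transactions(fields, table)
# The first row yields two distinct items, "S=100" and "T=100",
# rather than a single item 100 that would be counted across fields.
```

Without the field prefix, the two 100 values in the first row would collapse into one item and inflate its frequency.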
S304, analyzing the association relationships among different fields in the multiple frequent item sets by using the data mining algorithm to obtain the association relationship rules of at least one group of fields.
Each set of fields may include at least two fields, and thus, the association rule of each set of fields is the association rule satisfied between the data in at least two fields in the set of fields.
For example, the association rule for a set of fields may be: when the field S1 is B1 and the field S3 is V1, the field S6 is F1.
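Steps S303 and S304 can be sketched with a deliberately simple brute-force miner. This is not Apriori or FP-Growth and not necessarily how the application implements mining; it enumerates item combinations directly, which is only practical for small tables. The sketch assumes the conventional two-threshold formulation (`min_support` for frequent item sets, `min_confidence` for rules), whereas the text speaks of a single set confidence; all function names and the sample data are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple


def frequent_itemsets(transactions: List[FrozenSet[str]],
                      min_support: float,
                      max_size: int = 3) -> Dict[FrozenSet[str], float]:
    """Return every item set (up to max_size items) with support >= min_support."""
    total = len(transactions)
    items = sorted({item for t in transactions for item in t})
    result = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            support = sum(candidate <= t for t in transactions) / total
            if support >= min_support:
                result[candidate] = support
    return result


def derive_rules(itemsets: Dict[FrozenSet[str], float],
                 min_confidence: float) -> List[Tuple[FrozenSet[str], str, float]]:
    """Derive rules 'antecedent -> consequent' with one item as the consequent."""
    rules = []
    for itemset, support in itemsets.items():
        if len(itemset) < 2:
            continue
        for consequent in itemset:
            antecedent = itemset - {consequent}
            # Every subset of a frequent item set is itself frequent,
            # so the antecedent is always present in `itemsets`.
            confidence = support / itemsets[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, confidence))
    return rules


# Hypothetical converted rows (field=value items).
transactions = [
    frozenset({"S1=B1", "S3=V1", "S6=F1"}),
    frozenset({"S1=B1", "S3=V1", "S6=F1"}),
    frozenset({"S1=B1", "S3=V2", "S6=F2"}),
    frozenset({"S1=B2", "S3=V1", "S6=F2"}),
]
itemsets = frequent_itemsets(transactions, min_support=0.5)
rules = derive_rules(itemsets, min_confidence=1.0)
```

On this sample data the derived rules include the one with antecedent {S1=B1, S3=V1} and consequent S6=F1, which mirrors the form of the example rule above.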
It is to be understood that Fig. 3 takes the target data being a two-dimensional table as an example. If the target data is other data that can be converted into a two-dimensional table, the association relationship rules may be mined by using the method of the embodiment of Fig. 3 after the target data is converted into a two-dimensional table.
In this embodiment, the mined association relationship rules are likewise used to determine abnormal data existing in the two-dimensional table and in other two-dimensional tables belonging to the specified type of data. Other two-dimensional tables of the same type include the same fields as the two-dimensional table.
For example, suppose the two-dimensional table is the student score table of Class 1, Grade 2, which has the fields: student name, year and month of birth, age, math score, Chinese score, and total score. Then assume that the association relationship rules mined from the student score table include the rule: if the year and month of birth is # year # month, the age is equal to 10 years.
Then, based on the above association rule, if the year and month of birth of a certain student is # year # month but the age of that student is not equal to 10 years, the data in the age field for that student is abnormal.
Similarly, if the student score table of Class 2, Grade 2 is a data table of the same type as the above score table, that is, it also has the fields student name, year and month of birth, age, math score, Chinese score, and total score, then the association relationship rule can be used to perform anomaly detection on the Class 2, Grade 2 student score table, so as to detect abnormal data in which the age does not match the year and month of birth.
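Applying a mined conditional rule to a second table of the same type might look like the sketch below. The rule encoding (two dictionaries), the function name `violates`, the field names, and the sample rows are assumptions made for illustration. The text leaves the actual year and month unspecified, so the value "2011-09" is only a placeholder and carries no meaning beyond this example.

```python
from typing import Dict, List

Row = Dict[str, object]

# Placeholder rule: the birth value is NOT taken from the source text.
ANTECEDENT = {"birth_year_month": "2011-09"}
CONSEQUENT = {"age": 10}


def violates(row: Row, antecedent: Row, consequent: Row) -> bool:
    """True when the rule's condition applies to the row but its conclusion fails."""
    applies = all(row.get(field) == value for field, value in antecedent.items())
    holds = all(row.get(field) == value for field, value in consequent.items())
    return applies and not holds


# Hypothetical rows from another table with the same fields.
other_table: List[Row] = [
    {"name": "student_1", "birth_year_month": "2011-09", "age": 10},
    {"name": "student_2", "birth_year_month": "2011-09", "age": 12},
    {"name": "student_3", "birth_year_month": "2012-01", "age": 9},
]
abnormal_names = [
    row["name"] for row in other_table if violates(row, ANTECEDENT, CONSEQUENT)
]
```

Rows to which the condition does not apply (such as the third one) are never flagged, because a conditional rule says nothing about them.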
In the following, referring to fig. 4, a description is given of yet another possible scenario of mining the association relationship between data of different attribute features in the target data.
Fig. 4 shows another schematic flow chart of the data processing method of the present application. The method of this embodiment may include:
S401, determining data of a plurality of attribute features included in the target data based on the target data.
Wherein the target data belongs to specified type data which can be represented in a two-dimensional table form.
This step can be referred to the related description of the previous embodiment, and is not described herein again.
S402, for each attribute feature combination in the target data, counting the equal probability of the data in the same group under the attribute feature combination.
The target data comprises at least one attribute feature combination, and the attribute feature combination comprises two attribute features in the target data.
As described above, each group of data (also referred to as each piece of data) corresponds to one row of data in the two-dimensional table converted from the target data.
It will be appreciated that any group of data includes data belonging to a plurality of different attribute features in the target data. For a given attribute feature combination, the data in the same group under the combination is the data in that group belonging to each of the attribute features in the combination.
For example, assuming that the attribute feature combination includes the attribute feature S1 and the attribute feature S2, it may be determined that, for each group of data, the data of the attribute feature S1 and the data of the attribute feature S2 in the group are the data belonging to the same group under the attribute feature combination.
For the sake of easy distinction, the probability that data in the same group under the attribute feature combination is equal is referred to as equal probability.
The equal probability corresponding to an attribute feature combination reflects how likely it is, in the target data, that the data in the same group under the combination are equal. It is the ratio of the number of groups in which the data of the attribute features in the combination are equal to the total number of groups in the target data.
Still taking the above attribute feature combination including the attribute feature S1 and the attribute feature S2 as an example, for the data of each group, it may be first detected whether the data of the attribute feature S1 and the data of the attribute feature S2 in the group are the same, and finally, the number of groups with the same data of the two attribute features in all the groups is counted. In combination with the number of equal groups and the total number of all groups in the target data table, the probability that the data for which the two attribute features are in the same group is the same, i.e., equal probability, can be determined.
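The counting described above, together with the probability-threshold comparison described earlier in the disclosure, can be sketched as follows. The function names, the sample table, and the threshold value 0.7 are assumptions chosen only for illustration.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def equal_probabilities(fields: Sequence[str],
                        table: Sequence[Sequence[object]]
                        ) -> Dict[Tuple[str, str], float]:
    """For every pair of fields, compute (rows with equal values) / (all rows)."""
    total = len(table)
    probabilities = {}
    for i, j in combinations(range(len(fields)), 2):
        equal_rows = sum(1 for row in table if row[i] == row[j])
        probabilities[(fields[i], fields[j])] = equal_rows / total
    return probabilities


def equality_rules(probabilities: Dict[Tuple[str, str], float],
                   threshold: float) -> List[Tuple[str, str]]:
    """Keep the field pairs whose equal probability is higher than the threshold."""
    return [pair for pair, p in probabilities.items() if p > threshold]


# Hypothetical table: S1 equals S2 in three of the four rows.
fields = ["S1", "S2", "S3"]
table = [
    [1, 1, 5],
    [2, 2, 5],
    [3, 3, 3],
    [4, 0, 4],
]
probabilities = equal_probabilities(fields, table)
rules = equality_rules(probabilities, threshold=0.7)
```

With a threshold below 1, rows such as the last one (where S1 and S2 differ) do not prevent the equality rule from being mined; they are exactly the rows the rule later flags as abnormal.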
It can be understood that if the amount of data in the target data is large, or there are many kinds of attribute features in the target data, analyzing the equal probability of every attribute feature combination one by one will inevitably result in an excessive data processing amount.
In order to reduce the data processing amount, the correlation degree between every two attribute features in the target data can be determined based on the data under each attribute feature in the target data.
Here, the correlation degree between two attribute features is calculated from the data under those two attribute features.
On this basis, at least one candidate attribute feature combination with a correlation degree higher than the correlation degree threshold can be determined, where each candidate attribute feature combination comprises two attribute features whose correlation degree is higher than the correlation degree threshold. Accordingly, the equal probability that the data in the same row in each candidate attribute feature combination are equal can be counted separately.
It can be understood that the association rule mainly determined in the present application is an equality rule between two attribute features. Since it is two attribute features with a correlation degree higher than the correlation degree threshold that may be equal, the present application only needs to analyze the equal probability of attribute feature combinations composed of two such attribute features.
The correlation degree threshold may be set as needed and is not limited here.
Optionally, some data in the target data may be in a non-numerical form, such as character strings. To facilitate calculating the correlation degree between different attribute features, the present application may also determine the attribute features in the target data whose data is non-numerical data, and convert the non-numerical data under those attribute features into numerical data. This is described in detail below for the case where the target data is a two-dimensional table, and is not elaborated here.
S403, determining at least one target attribute feature combination with an equal probability higher than the probability threshold to obtain an equality rule corresponding to the target attribute feature combination.
The probability threshold may be set as needed, for example, the probability threshold may be 90%, and the like, which is not limited.
For ease of distinction, an attribute feature combination whose equal probability is higher than the probability threshold is referred to as a target attribute feature combination.
The equality rule of the target attribute feature combination characterizes that data belonging to the same group of two attribute features in the attribute feature combination are equal.
It can be understood that the equal probability of an attribute feature combination is the proportion of groups in which the data of the two attribute features in the combination are equal; the higher this proportion, the more likely it is that the data in the same group under the two attribute features are equal. Based on this, when the equal probability corresponding to the attribute feature combination exceeds the set probability threshold, it can be determined that the data belonging to the same group under the two attribute features in the combination are equal.
For example, assume that the probability threshold is 90% and that the target data includes 100 pieces of data. If field 1 is equal to field 2 in 92 of the 100 pieces of data, the equal probability corresponding to the field combination consisting of field 1 and field 2 is 92%. Since this exceeds the probability threshold, the data belonging to field 1 and the data belonging to field 2 in any one piece of data in the target data can be considered equal.
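The threshold step of S403 can be illustrated with the numbers from this example. The sketch below is only one possible reading: the function name `mine_equality_rules`, the dictionary layout of the table, and the generated sample values are assumptions, and for simplicity it examines every pair of fields without the correlation-based screening.

```python
from itertools import combinations


def mine_equality_rules(table, probability_threshold=0.9):
    """Return {(field_a, field_b): equal_probability} for pairs above the threshold.

    `table` maps each field name to its list of values, one value per piece of data.
    """
    rules = {}
    for field_a, field_b in combinations(table, 2):
        pairs = list(zip(table[field_a], table[field_b]))
        if not pairs:
            continue  # nothing to count for empty fields
        probability = sum(a == b for a, b in pairs) / len(pairs)
        if probability > probability_threshold:
            rules[(field_a, field_b)] = probability
    return rules


# 100 pieces of data in which field1 and field2 agree in exactly 92 of them.
field1 = list(range(100))
field2 = list(range(92)) + [-1] * 8
rules = mine_equality_rules({"field1": field1, "field2": field2})
print(rules)
```

With the 90% threshold, the pair (field1, field2) is kept with an equal probability of 0.92; raising the threshold above 0.92 would drop it.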
It can be understood that after the equality rule of the target attribute feature combination is determined, abnormal data existing in the data to be detected can be detected based on that equality rule. For example, if two pieces of data that belong to the same group in the data to be detected and correspond to the two attribute features of the target attribute feature combination are not equal, this indicates that the data in that group of the data to be detected is abnormal.
To facilitate understanding of the embodiment of Fig. 4, the following describes the process of mining equality rules, taking the case where the target data is a two-dimensional table as an example.
Fig. 5 shows a schematic flow chart of an embodiment of a data processing method according to the present application. The method of this embodiment may include:
S501, obtaining a two-dimensional table.
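A check of data to be detected against mined equality rules might look like the following. This is a hedged sketch: the function name `find_anomalies`, the row-as-dictionary layout, and the field names and values are all hypothetical.

```python
def find_anomalies(rows, rules):
    """Return (row_index, field_a, field_b) for every row that violates an equality rule.

    `rows` is a list of dicts, one per group of data to be detected;
    `rules` is an iterable of (field_a, field_b) pairs that are expected to be equal.
    """
    anomalies = []
    for index, row in enumerate(rows):
        for field_a, field_b in rules:
            if row[field_a] != row[field_b]:
                anomalies.append((index, field_a, field_b))
    return anomalies


# Hypothetical data to be detected, with the same fields as the target data.
rows_to_check = [
    {"field1": "A", "field2": "A"},
    {"field1": "B", "field2": "C"},  # violates the rule field1 == field2
    {"field1": "D", "field2": "D"},
]
violations = find_anomalies(rows_to_check, [("field1", "field2")])
print(violations)
```

The second row (index 1) is reported because its two fields differ. The rule alone does not say which of the two values is the wrong one; that judgment needs additional rules or context.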
The two-dimensional table comprises a plurality of rows and a plurality of columns of data, wherein each column in the two-dimensional table corresponds to one field.
S502, determining a field of which the data in the two-dimensional table is non-numerical data, and converting the non-numerical data in the field into numerical data.
The non-numerical data may be character string data or the like, that is, data that is not a value represented by numbers.
In the present application, the purpose of converting non-numeric data to numeric data is merely to facilitate subsequent calculation of the correlation between two fields.
Non-numeric data in a field may be converted into numeric data in various ways, as long as the different data in the field can still be distinguished after conversion.
For example, for a field whose data is non-numeric, each distinct value contained in the field may be assigned a unique identification value that represents it.
For ease of understanding, the process of converting each data under the field of non-numeric data in a two-dimensional table is described below with reference to an example of a two-dimensional table.
Table 1 is a two-dimensional table.
TABLE 1
[Table 1 appears as an image in the original publication and is not reproduced here.]
The first column of Table 1 is the identification number of each record. For example, Table 1 contains 17 rows of data, i.e., 17 records, which are marked in sequence as record 1 to record 17.
Table 1 also includes 8 fields, for example: systema.column1, systema.column2, and the like. The data in the column in which each field is located is the data under that field.
As can be seen from Table 1, the data under each field in Table 1 is in character form rather than numerical data, and therefore the data under each field needs to be converted.
On this basis, for each field, the different kinds of data in the field are numbered in sequence, so that each kind of data in the field corresponds to one code.
Correspondingly, the data in each row under the field is converted into its corresponding code according to the mapping between each kind of data and its code under that field.
The two-dimensional table as converted from table 1 may be as shown in table 2.
TABLE 2
[Table 2 appears as an image in the original publication and is not reproduced here.]
Comparing Table 1 and Table 2 shows that, for each field, each value under the field in Table 2 represents the original data under that field. Taking the field "systema.column1" as an example, the field contains only two kinds of data, B1 and B2, so each B1 under the field can be converted into the number 0 and each B2 into the number 1. The other fields are handled similarly and are not described in detail.
Of course, table 1 and table 2 are merely an example illustration of a two-dimensional table conversion, and the manner of converting non-numerical data into numerical data by other means is also applicable to the present embodiment, and is not limited thereto.
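The per-field numbering described for S502 can be sketched as follows. The function name `encode_field` and the choice to number values in order of first appearance are assumptions; the text only requires that each kind of data under a field receives its own code.

```python
def encode_field(values):
    """Convert the non-numeric data under one field into integer codes.

    Codes are assigned in order of first appearance (an assumption of this sketch).
    Returns the encoded column and the value-to-code mapping.
    """
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)  # next unused code for a new kind of data
        encoded.append(codes[value])
    return encoded, codes


# A field that contains only the two kinds of data B1 and B2.
encoded, codes = encode_field(["B1", "B2", "B1", "B1", "B2"])
print(encoded)
print(codes)
```

Because B1 appears first, it receives code 0 and B2 receives code 1, which matches the conversion described for the field systema.column1.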
S503, determining the correlation degree between every two fields in the two-dimensional table based on the data under each field in the two-dimensional table.
For example, each data under each field may form a matrix, and for any two fields, the correlation between the matrices corresponding to the two fields may be calculated, so as to obtain the correlation between the two fields.
S504, determining at least one candidate field combination with a correlation degree higher than the correlation degree threshold.
Each candidate field combination comprises two fields with a correlation degree higher than the correlation degree threshold.
S505, respectively counting the equal probability that the data in the same row in each candidate field combination are equal.
For example, taking Table 2 as an example, the rows in which the data under systema.column1 and systema.column2 are equal can be counted to obtain the number of rows in which the two fields are equal, and the ratio of that number to the total number of rows is then taken as the equal probability.
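Steps S503 and S504 can be sketched as below. The text does not name a specific correlation measure, so the use of the Pearson correlation coefficient here is an assumption, as are the function names, the threshold value 0.8, and the sample fields f1 to f3.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric columns."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: correlation is undefined, treated as 0 here
    return cov / sqrt(var_x * var_y)


def candidate_field_pairs(table, correlation_threshold=0.8):
    """Return the field pairs whose correlation degree exceeds the threshold."""
    return [
        (a, b)
        for a, b in combinations(table, 2)
        if pearson(table[a], table[b]) > correlation_threshold
    ]


# Hypothetical converted table: field name -> numeric codes, one per row.
table = {
    "f1": [0, 1, 0, 1, 0, 1],
    "f2": [0, 1, 0, 1, 0, 1],
    "f3": [0, 0, 1, 1, 0, 0],
}
candidates = candidate_field_pairs(table)
print(candidates)
```

Only the pair (f1, f2) passes the screening, so the equal probability needs to be counted for one combination instead of three. Note that a coefficient computed on arbitrary codes depends on how the codes were assigned to each field, so in practice the measure and threshold would need to be chosen with the encoding in mind.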
S506, determining at least one target field combination with equal probability higher than the probability threshold from at least one candidate field combination to obtain an equal rule corresponding to the target field combination.
The equality rule for the target field combination characterizes that the data belonging to the same row in both fields within the target field combination is equal.
For example, the above analysis of Table 2 yields the rule that the data in the same row are equal for any two of the three fields systema.column1, systemb.column2, and systemc.column1.
After the equality rule is obtained, anomaly detection can be performed on a two-dimensional table to be detected. If the data in a certain row under the two fields of a target field combination in the two-dimensional table to be detected are not equal, the data in that row under the two fields are abnormal.
The two-dimensional table to be detected may be the two-dimensional table from which the equality rule was mined, or a two-dimensional table of the same type as that table.
For example, assume that combining the equality rules corresponding to the multiple target field combinations gives systema.column1 = systemb.column2 = systemc.column1. Checking Table 1 against this rule shows that, in row 10 of Table 1, the data "Value 9" under the field systemb.column2 is not equal to the data in row 10 under the two fields systema.column1 and systemc.column1, so the data "Value 9" in row 10 under the field systemb.column2 is abnormal data.
The application also provides a data processing device corresponding to the data processing method.
Fig. 6 shows a schematic diagram of the component structure of a data processing apparatus, which may include:
a data determining unit 601 configured to determine data of a plurality of attribute features included in target data based on the target data, the target data belonging to a specified type of data that can be represented in a two-dimensional table form;
a data mining unit 602, configured to mine association relationships between data of different attribute features in the target data to obtain at least one group of association relationship rules, where each group of association relationship rules includes: association rules that the data within at least two attribute features need to satisfy; wherein the association relationship rules are used for determining abnormal data existing in the specified type of data.
In one possible implementation, the data mining unit includes:
an algorithm mining unit, configured to mine, according to a set confidence level and using a data mining algorithm, the association relationships among different attribute features in the target data to obtain at least one group of association relationship rules.
In another possible implementation manner, the algorithm mining unit includes:
an item set mining unit, configured to perform frequent item set mining on each group of data in the target data using a data mining algorithm to obtain a plurality of mined frequent item sets that satisfy the set confidence level, where each group of data corresponds to one row of data in the two-dimensional table converted from the target data;
a rule analysis unit, configured to analyze, using the data mining algorithm, the association relationships among different attribute features in the plurality of frequent item sets to obtain at least one group of association relationship rules.
In another possible implementation manner of the present application, the data mining unit includes:
a probability determining unit, configured to count, for each attribute feature combination in the target data, the equal probability that the data in the same group under the attribute feature combination are equal, where the target data includes at least one attribute feature combination, the attribute feature combination includes two attribute features in the target data, and each group of data corresponds to one row of data in the two-dimensional table converted from the target data;
a rule determining unit, configured to determine at least one target attribute feature combination with an equal probability higher than a probability threshold to obtain an equality rule corresponding to the target attribute feature combination, where the equality rule of the target attribute feature combination characterizes that the data belonging to the same group under the two attribute features in the attribute feature combination are equal.
In one possible implementation manner, the probability determination unit includes:
a correlation degree calculation unit, configured to determine the correlation degree between every two attribute features in the target data based on the data under each attribute feature in the target data;
a combination determining unit, configured to determine at least one candidate attribute feature combination with a correlation degree higher than a correlation degree threshold, where the candidate attribute feature combination comprises two attribute features with a correlation degree higher than the correlation degree threshold;
a probability statistical unit, configured to separately count the equal probability that the data in the same group in each candidate attribute feature combination are equal.
In an alternative, the apparatus further comprises:
a data conversion unit, configured to, before the correlation degree calculation unit determines the correlation degree between every two attribute features in the target data,
determine the attribute features in the target data whose data is non-numerical data, and convert the non-numerical data under those attribute features into numerical data.
In an embodiment of any of the above apparatuses in this application, the apparatus further comprises:
an anomaly detection unit, configured to determine, according to the at least one group of association relationship rules, abnormal data existing in data to be detected that belongs to the specified type of data, where the data to be detected and the target data include the same attribute features.
In yet another aspect, the present application further provides an electronic device. Fig. 7 shows a schematic diagram of the component structure of the electronic device, which may be any type of electronic device and includes at least a processor 701 and a memory 702;
wherein the processor 701 is configured to perform the data processing method as in any of the above embodiments.
The memory 702 is used to store programs needed for the processor to perform operations.
It is to be understood that the electronic device may further include a display unit 703 and an input unit 704.
Of course, the electronic device may have more or less components than those shown in fig. 7, which is not limited thereto.
In another aspect, the present application further provides a computer-readable storage medium, in which at least one instruction, at least one program, a set of codes, or a set of instructions is stored, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by a processor to implement the data processing method according to any one of the above embodiments.
The present application also proposes a computer program comprising computer instructions stored in a computer readable storage medium. The computer program is for performing the data processing method as in any of the above embodiments when run on an electronic device.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. Meanwhile, the features described in the embodiments of the present specification may be replaced or combined with each other, so that those skilled in the art can implement or use the present application. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing is only a preferred embodiment of the present application and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (10)

1. A data processing method, comprising:
determining, based on target data, data of a plurality of attribute features included in the target data, the target data belonging to a specified type of data that can be represented in the form of a two-dimensional table;
mining association relationships between data of different attribute features in the target data to obtain at least one group of association relationship rules, each group of association relationship rules comprising: an association rule that the data within at least two attribute features needs to satisfy;
wherein the association relationship rules are used to determine abnormal data existing in the specified type of data.

2. The method according to claim 1, wherein mining the association relationships between data of different attribute features in the target data to obtain at least one group of association relationship rules comprises:
mining, according to a set confidence level and using a data mining algorithm, the association relationships between different attribute features in the target data to obtain at least one group of association relationship rules.

3. The method according to claim 2, wherein mining, according to the set confidence level and using the data mining algorithm, the association relationships between different attribute features in the target data to obtain at least one group of association relationship rules comprises:
performing frequent item set mining on each group of data in the target data using the data mining algorithm to obtain a plurality of mined frequent item sets that satisfy the set confidence level, each group of data corresponding to one row of data in the two-dimensional table converted from the target data;
analyzing, using the data mining algorithm, the association relationships between different attribute features present in the plurality of frequent item sets to obtain at least one group of association relationship rules.

4. The method according to claim 1, wherein mining the association relationships between data of different attribute features in the target data to obtain at least one group of association relationship rules comprises:
for each attribute feature combination in the target data, counting the equal probability that the data in the same group under the attribute feature combination are equal, the target data including at least one attribute feature combination, the attribute feature combination including two attribute features in the target data, and each group of data corresponding to one row of data in the two-dimensional table converted from the target data;
determining at least one target attribute feature combination whose equal probability is higher than a probability threshold to obtain an equality rule corresponding to the target attribute feature combination, the equality rule of the target attribute feature combination characterizing that the data belonging to the same group under the two attributes within the attribute feature combination are equal.

5. The method according to claim 4, wherein counting the equal probability that the data in the same group within each attribute feature combination in the target data are equal comprises:
determining, based on the data under each attribute feature in the target data, the correlation degree between every two attribute features in the target data;
determining at least one candidate attribute feature combination whose correlation degree is higher than a correlation degree threshold, the candidate attribute feature combination including two attribute features whose correlation degree is higher than the correlation degree threshold;
separately counting the equal probability that the data in the same group within each candidate attribute feature combination are equal.

6. The method according to claim 5, further comprising, before determining the correlation degree between every two attribute features in the target data based on the data under each attribute feature in the target data:
determining the attribute features in the target data whose data is non-numerical data, and converting the non-numerical data within those attribute features into numerical data.

7. The method according to claim 1, further comprising:
determining, according to the at least one group of association relationship rules, abnormal data existing in data to be detected that belongs to the specified type of data, the data to be detected including the same attribute features as the target data.

8. A data processing apparatus, comprising:
a data determining unit, configured to determine, based on target data, data of a plurality of attribute features included in the target data, the target data belonging to a specified type of data that can be represented in the form of a two-dimensional table;
a data mining unit, configured to mine association relationships between data of different attribute features in the target data to obtain at least one group of association relationship rules, each group of association relationship rules comprising: an association rule that the data within at least two attribute features needs to satisfy; wherein the association relationship rules are used to determine abnormal data existing in the specified type of data.

9. The apparatus according to claim 8, wherein the data mining unit comprises:
an algorithm mining unit, configured to mine, according to a set confidence level and using a data mining algorithm, the association relationships between different attribute features in the target data to obtain at least one group of association relationship rules.

10. An electronic device, comprising: a processor and a memory;
wherein the processor is configured to perform the data processing method according to any one of claims 1 to 7;
the memory is configured to store programs required by the processor to perform operations.
CN202111224470.2A 2021-10-20 2021-10-20 Data processing method, device and electronic equipment Active CN113886398B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111224470.2A CN113886398B (en) 2021-10-20 2021-10-20 Data processing method, device and electronic equipment

Publications (2)

Publication Number Publication Date
CN113886398A true CN113886398A (en) 2022-01-04
CN113886398B CN113886398B (en) 2025-09-23

Family

ID=79003915

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111224470.2A Active CN113886398B (en) 2021-10-20 2021-10-20 Data processing method, device and electronic equipment

Country Status (1)

Country Link
CN (1) CN113886398B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090033664A1 (en) * 2007-07-31 2009-02-05 Hao Ming C Generating a visualization to show mining results produced from selected data items and attribute(s) in a selected focus area and other portions of a data set
CN106598030A (en) * 2016-12-22 2017-04-26 西安理工大学 Axle temperature correlation analysis method based on data
CN108334548A (en) * 2017-12-26 2018-07-27 爱品克科技(武汉)股份有限公司 A kind of data mining technology based on correlation rule
CN111277465A (en) * 2020-01-20 2020-06-12 支付宝(杭州)信息技术有限公司 Abnormal data message detection method and device and electronic equipment
CN111353051A (en) * 2019-12-04 2020-06-30 江苏蓝河智能科技有限公司 K-means and Apriori-based algorithm maritime big data association analysis method
CN112380274A (en) * 2020-11-16 2021-02-19 北京航空航天大学 Control process-oriented anomaly detection system
CN113313409A (en) * 2021-06-16 2021-08-27 中国南方电网有限责任公司 Power system secondary equipment defect analysis method and system based on data association

Also Published As

Publication number Publication date
CN113886398B (en) 2025-09-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant