CN119691470B

CN119691470B - Audit method based on big data

Info

Publication number: CN119691470B
Application number: CN202510205822.1A
Authority: CN
Inventors: 蒋川; 张铭; 廖雪伶
Original assignee: Chengdu Ict Information Technology Co ltd
Current assignee: Chengdu Ict Information Technology Co ltd
Priority date: 2025-02-25
Filing date: 2025-02-25
Publication date: 2025-06-13
Anticipated expiration: 2045-02-25
Also published as: CN119691470A

Abstract

The invention discloses an audit method based on big data and relates to the technical field of audit methods. The method comprises: initializing the initial position feature and initial importance feature of each item node in a data set; calculating the support given by the adjacent nodes of each item node; calculating the aggregate feature of each item node from the normalized support; determining the initial importance feature at which the similarity function is maximized as the target importance feature of the item node; calculating the matching degree between item nodes according to the target importance features of the item nodes in two data sets; determining two item nodes whose matching degree is not smaller than a threshold as associated items; and auditing each data set based on the associated items. By finding the relations between items in the two data sets, the method provides a reference for the auditing process and thereby improves auditing efficiency and auditing quality.

Description

Audit method based on big data

Technical Field

The invention relates to the technical field of auditing methods, in particular to an auditing method based on big data.

Background

Auditing examines the execution of economic activities, financial balances, financial regulations and the like within a given industry. It is organized, led and planned from the top down by audit organizations and audit staff, and it has expanded from auditing the economic activity of a single unit to auditing the economic activity of a whole industry, shifting from micro-level audits to meso-level or macro-level audits. Auditing takes data as its basis, so the quality of the data directly affects audit quality, and the quality of the data currently used for auditing needs to be improved.

Disclosure of Invention

The invention aims to provide an audit method based on big data, which can improve the quality of data materials used for audit work, help organizations and institutions to more efficiently conduct audit work, improve audit quality and reduce risks under the condition of ensuring compliance.

The technical aim of the invention is realized by the following technical scheme:

In a first aspect, the present application provides an audit method based on big data, comprising the following specific steps:

acquiring at least two data sets to be audited of different types, and preprocessing the data sets, wherein the preprocessing comprises data cleaning and abnormal data detection;

based on each preprocessed data set, obtaining the initial position features and initial importance features of each item node in the data set by random initialization from a Gaussian distribution;

calculating the support given by the adjacent nodes of each item node by using the initial position features and the initial importance features, and calculating, according to the normalized support, the aggregate feature of each item node that fuses the features of its adjacent nodes;

constructing a similarity function of each item node by using the initial position features and the aggregate features, and determining the initial importance feature at which the similarity function is maximized as the target importance feature of each item node;

according to the target importance features of each item node in the at least two data sets, calculating the matching degree between item nodes in data sets of different types, determining two item nodes whose matching degree is not smaller than a threshold as associated items, and auditing each data set based on the associated items.

The beneficial effects are as follows. In this scheme, the data sets to be audited are first preprocessed by data cleaning and abnormal data detection, which removes duplicate data and abnormal data so that the data in each data set are more concise and accurate. Next, the initial position features and initial importance features of the item nodes in each data set are obtained by random initialization from a Gaussian distribution. The support given by the adjacent nodes of each item node is then calculated from the initial position features and initial importance features, and the aggregate feature of each item node, which fuses the features of its adjacent nodes, is calculated from the normalized support. A similarity function is constructed for each item node from the initial position features and the aggregate features, and the initial importance feature at which the similarity function is maximized is determined as the target importance feature of that item node. Finally, the matching degree between item nodes in data sets of different types is calculated from the target importance features of the item nodes in each data set, and two item nodes whose matching degree is not smaller than a threshold are determined as associated items. A high matching degree indicates that the two item nodes are strongly correlated, so auditing each data set on the basis of the associated items provides a reference for the audit and improves its effect.

For example, when a data set of economic activities and a data set of financial balances are audited jointly, the expenditure of a certain event in the economic activities may be a certain value and the expenditure of a certain item in the financial balances may be the same value. Even when the detailed content of the activity is unclear, these two items in the two data sets show a high matching degree. Therefore, when data sets of different types are audited jointly, finding the relations between items in the two data sets provides a reference for the auditing process and substantially improves the auditing work: the quality of the data materials used for auditing is improved, organizations and institutions are helped to conduct audits more efficiently, audit quality is improved, and risk is reduced while compliance is ensured.

On the basis of the technical scheme, the invention can be improved as follows.

Further, the data cleaning specifically includes:

discretizing the attribute items of each original data in the normalized data set, and transforming each attribute value obtained after discretization into a preset integer interval according to its size;

calculating the information gain rate of the attribute items of each original data based on the transformed attribute values, and constructing an attribute set from the attribute items whose information gain rate is not smaller than a preset value;

inserting each data corresponding to the attribute set into a preset prefix tree, traversing each leaf node in the prefix tree to delete duplicate data, and obtaining the data set after data cleaning.

The beneficial effect of this further scheme is that performing data cleaning by calculating the information gain rate and inserting the data into a prefix tree reduces the time complexity of the detection process and ensures the accuracy of the data set.
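The discretization step can be illustrated with a minimal sketch. It assumes equal-width binning of values that are already normalized; the function name `discretize` and the parameters `n_bins` and `lo` are hypothetical and are not taken from the application, which does not state how the preset integer interval is chosen.

```python
def discretize(values, n_bins, lo=0):
    """Equal-width binning (an assumption): map each value to an integer
    in [lo, lo + n_bins - 1] while preserving the order of the values."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        return [lo for _ in values]          # constant attribute: one bin
    width = (v_max - v_min) / n_bins
    codes = []
    for v in values:
        idx = int((v - v_min) / width)
        codes.append(lo + min(idx, n_bins - 1))  # clamp the maximum into the last bin
    return codes


amounts = [0.0, 0.1, 0.35, 0.5, 0.99, 1.0]   # already normalized to [0, 1]
codes = discretize(amounts, n_bins=4)
```

Because the mapping is monotone, larger attribute values never receive a smaller integer code, which matches the requirement that values be transformed "according to size".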

Further, the abnormal data detection specifically includes:

clustering the normalized data set by using a K-Means clustering algorithm to obtain a plurality of data clusters formed by the original data in the data set;

based on the plurality of data clusters, calculating a first Euclidean distance between each data in each data cluster and the other data in the same cluster, and a second Euclidean distance between each data in each data cluster and each data in the other data clusters;

calculating an outlier factor of each data in each data cluster based on the first Euclidean distance, and determining the data whose outlier factor is not smaller than a first threshold as local isolated data;

determining each data whose second Euclidean distance is not smaller than a second threshold as global isolated data, determining the original data corresponding to the local isolated data and the global isolated data as abnormal data, deleting the abnormal data, and obtaining the data set after abnormal data detection.

The beneficial effect of this further scheme is that, because the data to be audited are generally complex and large in volume, and density-based isolated point detection is unsatisfactory in both execution efficiency and the identification of global isolated points, combining it with the idea of a clustering algorithm reduces the execution time while still allowing global isolated points to be identified.
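A small sketch of the clustering step and the global isolated data test is given here for orientation. The application does not say how the distances to "each data in other data clusters" are combined into a single value, so the sketch assumes their mean; the starting centroids, the threshold `12.0` and the function names are likewise hypothetical.

```python
import math


def kmeans(points, start_centroids, iters=20):
    """Plain Lloyd iterations from the given starting centroids; returns one label per point."""
    centroids = list(start_centroids)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def cross_cluster_distance(points, labels):
    """Mean Euclidean distance from each point to the points of the other clusters
    (the 'second Euclidean distance', aggregated by mean as an assumption)."""
    result = []
    for p, lab in zip(points, labels):
        others = [math.dist(p, q) for q, other in zip(points, labels) if other != lab]
        result.append(sum(others) / len(others) if others else 0.0)
    return result


points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (12, 12)]
labels = kmeans(points, [(0, 0), (5, 5)])
second = cross_cluster_distance(points, labels)
global_isolated = [p for p, d in zip(points, second) if d >= 12.0]  # hypothetical second threshold
```

With this toy data the point `(12, 12)` stays in the second cluster but lies far from every point of the first cluster, so it is the only one whose aggregated second distance reaches the threshold.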

Further, the information gain rate of the attribute items of each original data is specifically:

$$g_r(D, A) = \frac{g(D, A)}{H_A(D)}$$

wherein:

$$H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$$

In the formula, $g_r(D, A)$ represents the information gain rate of attribute item A in data set D, $g(D, A)$ represents the information gain of attribute item A in data set D, $|D_i|$ represents the number of samples for which attribute item A takes the value i, $|D|$ represents the total number of samples in data set D, and n represents the number of values of attribute item A.
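As a worked illustration of the information gain rate, the following sketch computes it for one discretized attribute. Information gain is measured against a target column; the application does not say which column plays that role, so the `labels` argument here is an assumption, as are the function names.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def gain_ratio(attr_values, labels):
    """Information gain of the attribute divided by its split entropy H_A(D)."""
    total = len(labels)
    groups = {}
    for a, y in zip(attr_values, labels):
        groups.setdefault(a, []).append(y)   # D_i: samples where the attribute equals i
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    split = -sum(len(g) / total * math.log2(len(g) / total) for g in groups.values())
    return gain / split if split > 0 else 0.0  # a constant attribute carries no information


informative = gain_ratio([0, 0, 1, 1], ["a", "a", "b", "b"])
uninformative = gain_ratio([0, 1, 0, 1], ["a", "a", "b", "b"])
```

An attribute would then be kept in the attribute set when its gain rate is not smaller than the preset value; in this toy case the first attribute separates the labels perfectly and the second not at all.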

Further, the outlier factor of each data in each data cluster is specifically:

;

In the formula, $L(X_i)$ represents the outlier factor of data $X_i$, $B(Z_i)$ represents the reachable distance of data $Z_i$, $\rho(X_i)$ represents the reachable density of data $X_i$, and $A_{ik}$ represents the set formed by the k data closest to data $X_i$.
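The quantities named above (reachable distance, reachable density, the k nearest data) match the ingredients of the textbook local outlier factor, so the sketch below uses that textbook definition inside a single cluster. This is an assumption: the exact expression used in the application may differ, and the names `k_nearest` and `outlier_factors` are hypothetical. The points are assumed to be distinct, otherwise the density would divide by zero.

```python
import math


def k_nearest(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    order = sorted((j for j in range(len(points)) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))
    return order[:k]


def outlier_factors(points, k):
    """Textbook local outlier factor for every point of one cluster."""
    n = len(points)
    knn = [k_nearest(points, i, k) for i in range(n)]
    k_dist = [math.dist(points[i], points[knn[i][-1]]) for i in range(n)]

    def reach(i, j):
        # reachable distance of i with respect to neighbour j
        return max(k_dist[j], math.dist(points[i], points[j]))

    # reachable density: inverse of the mean reachable distance to the k neighbours
    density = [k / sum(reach(i, j) for j in knn[i]) for i in range(n)]
    # outlier factor: mean neighbour density divided by the point's own density
    return [sum(density[j] for j in knn[i]) / (k * density[i]) for i in range(n)]


cluster = [(0, 0), (0, 1), (1, 0), (1, 1), (4, 4)]
factors = outlier_factors(cluster, k=2)
local_isolated = [p for p, f in zip(cluster, factors) if f >= 1.5]  # hypothetical first threshold
```

Points inside the dense square obtain a factor of about 1, while `(4, 4)` has a much lower density than its neighbours and therefore a factor well above the threshold.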

Further, the support given by the adjacent nodes of each item node is specifically:

In the formula, $\alpha_{mn}$ denotes the support given by adjacent node n to item node m, and the other two terms denote the initial importance feature of item node m and the initial importance feature of adjacent node n;

the aggregate feature of each item node, which fuses the features of its adjacent nodes, is specifically:

In the formula, $N(v_m)$ denotes the set of adjacent nodes of item node m, $\beta_{mn}$ denotes the support after normalization by the softmax function, and the other two terms denote the aggregate feature of item node m and the initial position feature of adjacent node n.
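The support and aggregation steps can be sketched as an attention-style weighted sum. The sketch assumes that the raw support is the inner product of the two importance features, which is consistent with the transpose operation mentioned in the claims but is not confirmed by the text; the softmax normalization and the use of neighbour position features follow the definitions above. All names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def aggregate(m, neighbors, importance, position):
    """Return (normalized supports, aggregate feature) for item node m."""
    raw = [dot(importance[m], importance[n]) for n in neighbors]   # assumed form of the support
    peak = max(raw)                                                # shift for numerical stability
    exps = [math.exp(r - peak) for r in raw]
    total = sum(exps)
    beta = [e / total for e in exps]                               # softmax-normalized support
    dim = len(position[m])
    agg = [sum(b * position[n][d] for b, n in zip(beta, neighbors)) for d in range(dim)]
    return beta, agg


importance = {0: [1.0, 0.0], 1: [2.0, 0.0], 2: [0.0, 2.0]}
position = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
beta, agg = aggregate(0, [1, 2], importance, position)
```

Node 1 has an importance feature aligned with node 0, so it receives the larger normalized support and its position feature dominates the aggregate.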

Further, the similarity function specifically includes:

In the formula, the function value denotes the objective function value, the other two terms denote the aggregate feature of item node m and the initial position feature of item node m, and the superscript T denotes the vector transpose operation;

The matching degree between the item nodes is specifically:

$$q = \tanh\left((h_m \times h_n)^{T} \cdot (h_m \times h_n)\right)$$

wherein $h_m$ and $h_n$ are formed from the target importance features of item node m and item node n with the corresponding weight matrices and bias terms.

In the formula, $q$ represents the matching degree between item node m and item node n, the superscript T represents the vector transpose operation, $w^i$ and $b^i$ respectively represent the weight matrix and bias term corresponding to item node m, and $w^j$ and $b^j$ respectively represent the weight matrix and bias term corresponding to item node n.
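The matching degree can be illustrated as follows. Two points are assumptions rather than statements of the application: each node's target importance feature is passed through an affine map (weight matrix times feature plus bias), and the product of the two mapped vectors is taken element-wise. The function names and the threshold `0.9` are hypothetical.

```python
import math


def affine(weight, bias, feature):
    """Weight matrix times feature vector plus bias (assumed form of the mapping)."""
    return [sum(w * x for w, x in zip(row, feature)) + b for row, b in zip(weight, bias)]


def matching_degree(s_m, s_n, w_i, b_i, w_j, b_j):
    h_m = affine(w_i, b_i, s_m)
    h_n = affine(w_j, b_j, s_n)
    prod = [a * b for a, b in zip(h_m, h_n)]      # element-wise product (assumption)
    return math.tanh(sum(p * p for p in prod))    # tanh of the product vector's squared norm


identity = [[1.0, 0.0], [0.0, 1.0]]
zero = [0.0, 0.0]
q_unrelated = matching_degree([1.0, 0.0], [0.0, 1.0], identity, zero, identity, zero)
q_related = matching_degree([1.0, 1.0], [1.0, 1.0], identity, zero, identity, zero)
associated = q_related >= 0.9   # hypothetical threshold
```

Under these assumptions the argument of tanh is never negative, so the matching degree lies between 0 and 1 and can be compared directly with a threshold.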

In a second aspect, the present application provides a big data based auditing system for carrying out the method of any one of the first aspect, comprising:

the first module is used for acquiring at least two data sets to be audited of different types, and preprocessing the data sets, wherein the preprocessing comprises data cleaning and abnormal data detection;

The second module is used for randomly initializing by utilizing Gaussian distribution based on each preprocessed data set to obtain initial position characteristics and initial importance characteristics of each item node in the data set;

The third module is used for calculating the support degree given by the adjacent node of each item node by utilizing the initial position feature and the initial importance feature, and obtaining the aggregate feature of the adjacent node feature fused by each item node according to the support degree calculation after normalization processing;

A fourth module, configured to construct a similarity function of each item node using the initial position feature and the aggregate feature, and determine an initial importance feature when the similarity function is maximized as a target importance feature of each item node;

and a fifth module, configured to calculate, according to the target importance characteristics of each item node in at least two data sets, a matching degree between each item node in different types of data sets, determine two item nodes with matching degrees not smaller than a threshold value as related items, and audit each data set based on each related item.

In a third aspect, the application provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of the first aspects when executing the computer program.

In a fourth aspect, the present application provides a non-transitory computer readable storage medium storing computer instructions that cause a computer to perform the method of any one of the first aspects.

Compared with the prior art, the invention has at least the following beneficial effects:

According to the method, the data sets to be audited are first preprocessed by data cleaning and abnormal data detection, which removes duplicate data and abnormal data so that the data in each data set are more concise and accurate. The initial position features and initial importance features of the item nodes in each data set are then obtained by random initialization from a Gaussian distribution. Next, the support given by the adjacent nodes of each item node is calculated from the initial position features and initial importance features, and the aggregate feature of each item node, which fuses the features of its adjacent nodes, is calculated from the normalized support. A similarity function is constructed for each item node from the initial position features and the aggregate features, and the initial importance feature at which the similarity function is maximized is determined as the target importance feature of that item node. Finally, the matching degree between item nodes is calculated from the target importance features of the item nodes in the two data sets, and two item nodes whose matching degree is not smaller than a threshold are determined as associated items. A high matching degree indicates that the two item nodes are strongly correlated, so auditing each data set on the basis of the associated items provides a reference for the audit and improves its final effect.

In the application, when data sets of different types are audited jointly, finding the relations between items in the two data sets provides a reference for the auditing process and substantially improves the auditing work: the quality of the data materials used for auditing is improved, organizations and institutions are helped to conduct audits more efficiently, audit quality is improved, and risk is reduced while compliance is ensured. Performing data cleaning by calculating the information gain rate and inserting the data into a prefix tree reduces the time complexity of the detection process and ensures the accuracy of the data set. In addition, because the volume of data to be audited is large and density-based isolated point detection is unsatisfactory in both execution efficiency and the identification of global isolated points, combining it with the idea of a clustering algorithm reduces the execution time while still allowing global isolated points to be identified.

Drawings

The accompanying drawings, which are included to provide a further understanding of embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings:

FIG. 1 is a method flow diagram of an audit method in an embodiment of the present invention;

FIG. 2 is a schematic diagram of the connection of an audit system in an embodiment of the present invention;

FIG. 3 is a schematic connection diagram of an electronic device according to an embodiment of the invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

It should be noted that like reference numerals and letters refer to like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.

In the description of the embodiments of the present invention, "plurality" means at least 2.

In order to improve the quality of the data materials used for auditing work, help organizations and institutions conduct auditing work more efficiently, improve auditing quality, and reduce risk while ensuring compliance, this embodiment provides an auditing method based on big data which, as shown in FIG. 1, comprises the following specific steps:

S1, acquiring at least two data sets to be audited of different types, and preprocessing the data sets, wherein the preprocessing comprises data cleaning and abnormal data detection.

Optionally, the data cleaning specifically includes:

S11, discretizing attribute items of each original data in the normalized data set, and transforming each attribute value obtained after discretization into a preset integer interval according to the size.

S12, calculating the information gain rate of the attribute items of each original data based on the transformed attribute values, and constructing an attribute set from the attribute items whose information gain rate is not smaller than a preset value.

The information gain rate of the attribute items of each original data is specifically:

$$g_r(D, A) = \frac{g(D, A)}{H_A(D)}$$

wherein:

$$H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$$

S13, inserting each data corresponding to the attribute set into a preset prefix tree, traversing each leaf node in the prefix tree to delete the repeated data, and obtaining a data set after data cleaning.

Specifically, the preset prefix tree has the following characteristics and structural improvement points:

1) Non-leaf nodes act as index splitting entries and do not store data information.

2) Only leaf nodes store data and may store multiple pieces of data.

3) With n attributes serving as index items, the tree has n+2 layers, of which the first layer is the root node and the last layer stores the sample data information.

Duplicate data are detected in the prefix tree as follows. First, the data in each leaf node are traversed in turn. Second, each sample point is compared with the other data in the leaf node where it is located; if the similarity between two pieces of data is larger than a given threshold, the data is output to a similar data set, and after it has been compared with the other data it is marked as compared. For example, if A and B are in the same leaf node, A is compared with B during the traversal; if their similarity is larger than the given threshold, A is output to the similar data set and deleted from the leaf node where it is located. The leaf nodes where C, D and G are located each hold only one piece of data, and data that exist alone in a leaf node are not duplicate data.

Specifically, with the improved prefix tree, similar data are quickly gathered in the same leaf node, which reduces the amount of computation. The leaf nodes are then traversed and the similarity between data is calculated only within each leaf node, which completes the detection of duplicate data and improves its efficiency.
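The prefix tree described above can be sketched with nested dictionaries: inner levels only index on the selected attributes, and each leaf holds a list of records. The similarity measure (fraction of equal fields), the field names and the threshold are assumptions made for illustration; the application does not specify them.

```python
def build_prefix_tree(records, index_attrs):
    """Inner nodes index on attribute values; only leaves (lists) store records."""
    root = {}
    for rec in records:
        node = root
        for attr in index_attrs[:-1]:
            node = node.setdefault(rec[attr], {})
        node.setdefault(rec[index_attrs[-1]], []).append(rec)
    return root


def leaves(node):
    """Yield every leaf list in insertion order."""
    if isinstance(node, list):
        yield node
    else:
        for child in node.values():
            yield from leaves(child)


def similarity(a, b):
    """Fraction of shared fields with equal values (hypothetical measure)."""
    keys = a.keys() & b.keys()
    return sum(a[k] == b[k] for k in keys) / len(keys)


def deduplicate(records, index_attrs, threshold):
    kept = []
    for leaf in leaves(build_prefix_tree(records, index_attrs)):
        survivors = []
        for rec in leaf:  # comparisons stay inside one leaf
            if all(similarity(rec, s) <= threshold for s in survivors):
                survivors.append(rec)
        kept.extend(survivors)
    return kept


records = [
    {"amount_bin": 3, "dept_bin": 1, "payee": "ACME", "memo": "Q1 rent"},
    {"amount_bin": 3, "dept_bin": 1, "payee": "ACME", "memo": "Q1 rent (copy)"},
    {"amount_bin": 3, "dept_bin": 2, "payee": "ACME", "memo": "Q1 rent"},
    {"amount_bin": 5, "dept_bin": 1, "payee": "Beta", "memo": "travel"},
]
cleaned = deduplicate(records, ["amount_bin", "dept_bin"], threshold=0.7)
```

Only the first two records share a leaf, so only one pairwise comparison is made; records that sit alone in their leaf are kept without any comparison, which is where the saving over an all-pairs comparison comes from.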

Optionally, the detecting of the abnormal data specifically includes:

S14, clustering the normalized data set by using a K-Means clustering algorithm, and obtaining a plurality of data clusters formed by the original data in the data set.

S15, calculating, based on the plurality of data clusters, a first Euclidean distance between each data in each data cluster and the other data in the same cluster, and a second Euclidean distance between each data in each data cluster and each data in the other data clusters.

S16, calculating an outlier factor of each data in each data cluster based on the first Euclidean distance, and determining the data with the outlier factor not smaller than a first threshold value as local isolated data.

The outlier factor of each data in each data cluster is specifically:

;

S17, determining each data with the second Euclidean distance not smaller than a second threshold value as global isolated data, determining original data corresponding to the local isolated data and the global isolated data as abnormal data, deleting the abnormal data, and obtaining a data set detected by the abnormal data.

Specifically, because the data to be audited are generally complex and large in volume, and density-based isolated point detection is unsatisfactory in both execution efficiency and the identification of global isolated points, combining it with the idea of a clustering algorithm reduces the execution time while still allowing global isolated points to be identified.

S2, based on each preprocessed data set, utilizing Gaussian distribution random initialization to obtain initial position characteristics and initial importance characteristics of each item node in the data set.

Here, the position feature represents the environment information of the adjacent nodes, and the importance feature represents a unique support relation; compared with the position feature, the importance feature is more discriminative.
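A minimal sketch of the random initialization in step S2 follows. The feature dimension, the mean and standard deviation of the Gaussian distribution, the seed and the node identifiers are all hypothetical; the application only states that a Gaussian distribution is used.

```python
import random


def init_features(node_ids, dim, seed=0, mean=0.0, std=1.0):
    """Draw an initial position feature and an initial importance feature
    for every item node from a Gaussian distribution."""
    rng = random.Random(seed)  # a fixed seed keeps the sketch reproducible
    position = {v: [rng.gauss(mean, std) for _ in range(dim)] for v in node_ids}
    importance = {v: [rng.gauss(mean, std) for _ in range(dim)] for v in node_ids}
    return position, importance


position, importance = init_features(["invoice_17", "payment_03", "contract_08"], dim=4)
```

Both feature sets start as noise; in the method it is the importance features that are subsequently adjusted when the similarity function is maximized.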

S3, calculating the support given by the adjacent nodes of each item node by using the initial position features and the initial importance features, and calculating, according to the normalized support, the aggregate feature of each item node that fuses the features of its adjacent nodes.

The support given by the adjacent nodes of each item node is specifically:

Further, the aggregate feature of each item node, which fuses the features of its adjacent nodes, is specifically:

S4, constructing a similarity function of each item node by using the initial position features and the aggregation features, and determining the initial importance features when the similarity functions are maximized as target importance features of each item node, wherein the maximized value of the similarity functions is 100%, namely 1.

Specifically, the similarity function specifically includes:

In the formula, the function value denotes the objective function value, the other two terms denote the aggregate feature of item node m and the initial position feature of item node m, and the superscript T denotes the vector transpose operation.

S5, calculating to obtain the matching degree between the item nodes in the data sets of different types according to the target importance characteristics of the item nodes in the data sets, determining the two item nodes with the matching degree not smaller than a threshold value as related items, and auditing the data sets based on the related items.

Specifically, when the data sets of different types are subjected to joint audit, the relation among all matters in the two data sets of different types is found, so that references are provided for the audit process, the audit work is improved, the quality of data materials for the audit work is improved, organizations and institutions are helped to carry out the audit work more efficiently, the audit quality is improved, and risks are reduced under the condition of ensuring compliance.

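The matching degree between the item nodes is specifically:

$$q = \tanh\left((h_m \times h_n)^{T} \cdot (h_m \times h_n)\right)$$

wherein $h_m$ and $h_n$ are formed from the target importance features of item node m and item node n with the weight matrix and bias term corresponding to each node.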

Embodiment 2. This embodiment of the application provides an audit system based on big data, which is applied to the method of Embodiment 1 and, as shown in FIG. 2, comprises the following modules:

The first module is used for acquiring at least two data sets to be audited of different types, and preprocessing the data sets, wherein the preprocessing comprises data cleaning and abnormal data detection.

And the second module is used for randomly initializing by utilizing Gaussian distribution based on each preprocessed data set to obtain the initial position characteristic and the initial importance characteristic of each item node in the data set.

And the third module is used for calculating the support degree given by the adjacent node of each item node by utilizing the initial position feature and the initial importance feature, and obtaining the aggregate feature of the adjacent node feature fused by each item node according to the support degree calculation after normalization processing.

And a fourth module, configured to construct a similarity function of each item node using the initial position feature and the aggregate feature, and determine an initial importance feature when the similarity function is maximized as a target importance feature of each item node.

Embodiment 3. This embodiment of the application provides an electronic device which, as shown in FIG. 3, includes a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the method of Embodiment 1 when executing the computer program.

Embodiment 4. This embodiment of the application provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method of Embodiment 1.

The foregoing embodiments are provided to illustrate the general principles of the invention and are not intended to limit the scope of the invention to the particular embodiments described; any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims

1. An audit method based on big data, characterized by comprising the following specific steps:

Acquire at least two data sets to be audited of different types, and preprocess the data sets, wherein the preprocessing includes data cleaning and abnormal data detection;

Based on each preprocessed data set, the initial position features and initial importance features of each item node in the data set are obtained by random initialization using a Gaussian distribution;

Using the initial position features and the initial importance features, calculate the support given by the adjacent nodes of each item node, and calculate, according to the normalized support, the aggregate feature of each item node that fuses the features of its adjacent nodes;

Construct a similarity function for each item node using the initial position features and the aggregate features, and determine the initial importance feature at which the similarity function is maximized as the target importance feature of each item node;

According to the target importance characteristics of each item node in at least two of the data sets, the matching degree between each item node in different types of data sets is calculated, two item nodes whose matching degree is not less than a threshold are determined as related items, and each data set is audited based on each of the related items;

The support given by the adjacent nodes of each item node is as follows:

In the formula, $\alpha_{mn}$ represents the support given by the adjacent node n to the item node m, the other two terms represent the initial importance feature of the item node m and the initial importance feature of the adjacent node n, and the superscript T represents the vector transposition operation;

The aggregate feature of each item node, which fuses the features of its adjacent nodes, is specifically:

In the formula, $N(v_m)$ represents the set of adjacent nodes of item node m, $\beta_{mn}$ represents the support after normalization by the softmax function, and the other two terms represent the aggregate feature of item node m and the initial position feature of the adjacent node n;

The similarity function is specifically:

In the formula, the function value represents the objective function value, the other two terms represent the aggregate feature of item node m and the initial position feature of item node m, and the superscript T represents the vector transposition operation;

The matching degree between the item nodes is as follows:

$$q = \tanh\left((h_m \times h_n)^{T} \cdot (h_m \times h_n)\right)$$

wherein $h_m$ and $h_n$ are formed from the target importance features of item node m and item node n with the corresponding weight matrices and bias terms.

In the formula, $q$ represents the matching degree between item node m and item node n, the superscript T represents the vector transposition operation, $w^i$ and $b^i$ represent the weight matrix and bias term corresponding to item node m, respectively, and $w^j$ and $b^j$ represent the weight matrix and bias term corresponding to item node n, respectively.

2. The big data-based audit method according to claim 1, wherein the data cleaning is specifically:

Discretize the attribute items of each original data in the normalized data set, and transform each attribute value obtained after discretization into a preset integer range according to the size;

The information gain rate of each attribute item of the original data is calculated based on the transformed attribute value, and the attribute set is constructed through each information gain rate that is not less than a preset value;

Insert the corresponding data in the attribute set into the preset prefix tree, traverse each leaf node in the prefix tree to delete duplicate data, and obtain a data set after data cleaning.

3. The big data-based audit method according to claim 1, wherein the abnormal data detection is specifically:

The K-Means clustering algorithm is used to cluster the normalized data set, and multiple data clusters consisting of the original data in the data set are obtained;

Based on the plurality of data clusters, a first Euclidean distance between each data in each data cluster and other data in the same cluster is calculated, as well as a second Euclidean distance between each data in each data cluster and each data in other data clusters;

Calculate the outlier factor of each data in each data cluster based on the first Euclidean distance, and determine the data whose outlier factor is not less than a first threshold as local isolated data;

Each data whose second Euclidean distance is not less than a second threshold is determined as global isolated data, and the original data corresponding to the local isolated data and the global isolated data is determined as abnormal data, and the abnormal data is deleted to obtain a data set after abnormal data detection.
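
A rough sketch of the two distance computations and the global-isolation test is given below, assuming the cluster assignment has already been produced by K-Means. The claim does not state how the many second Euclidean distances of one data point are reduced to a single comparison against the second threshold; the sketch assumes the minimum (distance to the nearest point in any other cluster), which is an assumption of this example. The function names are hypothetical.

```python
import math


def pairwise_distances(clusters):
    """Return (first, second) distance lists keyed by (cluster index, point index).

    first:  distances from a point to the other points in its own cluster.
    second: distances from a point to every point in all other clusters.
    """
    first, second = {}, {}
    for c, members in enumerate(clusters):
        for i, p in enumerate(members):
            first[(c, i)] = [
                math.dist(p, q) for j, q in enumerate(members) if j != i
            ]
            second[(c, i)] = [
                math.dist(p, q)
                for d, others in enumerate(clusters)
                if d != c
                for q in others
            ]
    return first, second


def global_isolated(second, threshold):
    """Keys of points whose nearest point in another cluster is at least `threshold` away.

    Using the minimum as the reduction is an assumption, not taken from the claim.
    """
    return sorted(
        key for key, dists in second.items() if dists and min(dists) >= threshold
    )
```

For example, with clusters `[[(0, 0), (0, 1)], [(1, 0), (1, 1)], [(10, 10)]]` and a threshold of 5.0, only the single point of the third cluster is flagged.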

4. The big data-based audit method according to claim 2, wherein the information gain rate of each attribute item of the original data is specifically:

where:

In the formula, g_r(D, A) represents the information gain rate of attribute item A in data set D, g(D, A) represents the information gain of attribute item A in data set D, |D_i| represents the number of samples for which attribute item A takes the i-th value, |D| represents the total number of samples in data set D, and n represents the number of values of attribute item A.
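
The formulas of this claim are not reproduced in the text. The symbol definitions are consistent with the usual information gain ratio, g_r(D, A) = g(D, A) / H_A(D) with H_A(D) = -Σ_{i=1..n} (|D_i| / |D|) · log2(|D_i| / |D|), and the sketch below assumes that standard form. It also assumes a target column `labels` against which the information gain is measured; the claim does not name one, so that parameter is hypothetical, as are the function names.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2) of the empirical distribution of `values`."""
    total = len(values)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(values).values()
    )


def information_gain(attribute, labels):
    """g(D, A) = H(D) - H(D | A), with D described by `labels` and A by `attribute`."""
    total = len(labels)
    groups = {}
    for a, y in zip(attribute, labels):
        groups.setdefault(a, []).append(y)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def gain_ratio(attribute, labels):
    """g_r(D, A) = g(D, A) / H_A(D); returns 0.0 when A takes a single value."""
    split_info = entropy(attribute)  # H_A(D): entropy of the attribute's own values
    if split_info == 0:
        return 0.0
    return information_gain(attribute, labels) / split_info
```

Attribute items would then be kept for the attribute set when `gain_ratio` is not less than the preset value.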

5. The big data-based audit method according to claim 3, wherein the outlier factor of each piece of data in each data cluster is specifically:

In the formula, L(X_i) represents the outlier factor of data X_i, B(Z_i) represents the reachability distance of data Z_i, ρ(X_i) represents the reachability density of data X_i, and A_ik represents the set of the k pieces of data closest to data X_i.
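
The formula of this claim is not reproduced in the text. The symbols match the standard local outlier factor (LOF), so the sketch below assumes that standard definition: the reachability distance from X_i to a neighbor Z is max(k-distance(Z), d(X_i, Z)), the reachability density ρ(X_i) is k divided by the sum of reachability distances over A_ik, and L(X_i) is the mean of ρ(Z) / ρ(X_i) over A_ik. The function name is hypothetical, and the sketch assumes duplicate points have already been removed (otherwise a density can become a division by zero).

```python
import math


def local_outlier_factors(points, k):
    """Standard LOF score for each point, using exactly k nearest neighbors per point."""
    n = len(points)
    if k < 1 or k >= n:
        raise ValueError("k must satisfy 1 <= k < number of points")
    dist = [[math.dist(p, q) for q in points] for p in points]
    # neighbors[i] holds the indices of the k points closest to point i (the set A_ik).
    neighbors = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    # k-distance: distance from each point to its k-th nearest neighbor.
    k_dist = [dist[i][neighbors[i][-1]] for i in range(n)]

    def reach(i, j):
        """Reachability distance from point i to its neighbor j."""
        return max(k_dist[j], dist[i][j])

    # Reachability density: inverse of the mean reachability distance to the neighbors.
    density = [k / sum(reach(i, j) for j in neighbors[i]) for i in range(n)]
    # Outlier factor: mean ratio of the neighbors' densities to the point's own density.
    return [
        sum(density[j] for j in neighbors[i]) / (k * density[i]) for i in range(n)
    ]
```

Points inside a uniformly dense region score close to 1, while a point far from its neighbors scores well above 1; data whose score is not less than the first threshold would be treated as local isolated data.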

6. An audit system based on big data, applied to the big data-based audit method according to any one of claims 1 to 5, characterized in that the system comprises:

The first module is used to obtain at least two data sets of different types to be audited and to preprocess the data sets, wherein the preprocessing includes data cleaning and abnormal data detection;

The second module is used to obtain, based on each preprocessed data set, the initial position feature and initial importance feature of each item node in the data set by random initialization from a Gaussian distribution;

The third module is used to calculate the support given by the adjacent nodes of each item node by using the initial position feature and the initial importance feature, and to calculate, according to the normalized support, the aggregated feature of each item node fused with the features of its adjacent nodes;

The fourth module is used to construct a similarity function of each item node using the initial position feature and the aggregated feature, and to determine the initial importance feature at which the similarity function is maximized as the target importance feature of each item node;

The fifth module is used to calculate the matching degree between item nodes in data sets of different types according to the target importance feature of each item node in at least two of the data sets, to determine two item nodes whose matching degree is not less than a threshold as related items, and to audit each data set based on each of the related items;

In this system, the support given by the adjacent nodes of each item node is specifically:

The similarity function is specifically:

The matching degree between item nodes is specifically:

q = tanh((h_m × h_n)^T · (h_m × h_n)), where:

In the formula, q represents the matching degree between item node m and item node n, and the superscript T represents the vector transposition operation; Z_m^H1 represents the target importance feature of item node m, Z_n^H1 represents the target importance feature of item node n, w^i and b^i represent the weight matrix and bias term corresponding to item node m, respectively, and w^j and b^j represent the weight matrix and bias term corresponding to item node n, respectively.

7. An electronic device, characterized in that it comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method according to any one of claims 1 to 5 when executing the computer program.

8. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions enable a computer to execute the method according to any one of claims 1 to 5.