
CN119691470B - Audit method based on big data - Google Patents

Publication number
CN119691470B
Authority
CN
China
Prior art keywords
data
node
item
feature
event node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202510205822.1A
Other languages
Chinese (zh)
Other versions
CN119691470A (en)
Inventor
蒋川
张铭
廖雪伶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Ict Information Technology Co ltd
Original Assignee
Chengdu Ict Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Ict Information Technology Co ltd
Priority to CN202510205822.1A
Publication of CN119691470A
Application granted
Publication of CN119691470B
Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an audit method based on big data and relates to the technical field of audit methods. The method comprises: initializing an initial position feature and an initial importance feature for each item node in a data set; calculating the support degree given by the adjacent nodes of each item node, and calculating the aggregate feature of each item node from the normalized support degrees; determining the initial importance feature at which the similarity function is maximized as the target importance feature of the item node; calculating the matching degree between item nodes according to the target importance features of the item nodes in two data sets; determining two item nodes whose matching degree is not smaller than a threshold as associated items; and auditing each data set based on the associated items. By finding the relations between the items in the two data sets, the method provides a reference for the audit process and thereby improves audit efficiency and audit quality.

Description

Audit method based on big data
Technical Field
The invention relates to the technical field of auditing methods, in particular to an auditing method based on big data.
Background
An audit examines the execution of the economic activities, financial balance, financial regulations and the like of a given industry. It is organized, led and planned from top to bottom by audit organizations and audit staff. Auditing has expanded from the economic activities of a single unit to those of a whole industry, and has shifted from micro-level economic activities to meso-level or macro-level economic activities. Data is the basis of an audit, so the quality of the data directly affects the audit quality, and the quality of the data currently used for auditing needs to be improved.
Disclosure of Invention
The invention aims to provide an audit method based on big data, which can improve the quality of data materials used for audit work, help organizations and institutions to more efficiently conduct audit work, improve audit quality and reduce risks under the condition of ensuring compliance.
The technical aim of the invention is achieved by the following technical scheme:
In a first aspect, the present application provides an audit method based on big data, comprising the following specific steps:
acquiring at least two data sets to be audited of different types, and preprocessing the data sets, wherein the preprocessing comprises data cleaning and abnormal data detection;
based on each preprocessed data set, obtaining the initial position feature and the initial importance feature of each item node in the data set by random initialization from a Gaussian distribution;
calculating the support degree given by the adjacent nodes of each item node by using the initial position feature and the initial importance feature, and calculating, from the normalized support degrees, the aggregate feature of each item node, which fuses the features of its adjacent nodes;
constructing a similarity function for each item node by using the initial position feature and the aggregate feature, and determining the initial importance feature at which the similarity function is maximized as the target importance feature of the item node;
calculating, according to the target importance features of the item nodes in the at least two data sets, the matching degree between item nodes in data sets of different types, determining two item nodes whose matching degree is not smaller than a threshold as associated items, and auditing each data set based on the associated items.
The beneficial effects are as follows. In this scheme, the data sets to be audited are first preprocessed by data cleaning and abnormal data detection, which removes repeated data and abnormal data so that the data in each data set is more compact and accurate. The initial position feature and the initial importance feature of each item node in the data set are then obtained by random initialization from a Gaussian distribution. Next, the support degree given by the adjacent nodes of each item node is calculated by using the initial position features and the initial importance features, and the aggregate feature of each item node, which fuses the features of its adjacent nodes, is calculated from the normalized support degrees. A similarity function is constructed for each item node from the initial position feature and the aggregate feature, and the initial importance feature at which the similarity function is maximized is determined as the target importance feature of the item node. Finally, the matching degree between item nodes in data sets of different types is calculated according to the target importance features of the item nodes in each data set, and two item nodes whose matching degree is not smaller than a threshold are determined as associated items. A pair of associated items indicates that the two item nodes are highly correlated and can serve as references for each other, so auditing each data set on the basis of the associated items improves the final audit result.
For example, suppose that a data set of economic activities and a data set of financial balances are audited jointly. If the expenditure of a certain event in the economic activities is a certain value and the expenditure of a certain item in the financial balances is the same value, the two items show a high matching degree even when the detailed content of the activity is unclear. Therefore, when data sets of different types are audited jointly, finding the relations between the items in the two data sets provides a reference for the audit process. This improves the audit work and the quality of the data materials it relies on, helps organizations and institutions conduct audit work more efficiently, improves audit quality, and reduces risk while ensuring compliance.
On the basis of the technical scheme, the invention can be improved as follows.
Further, the data cleaning specifically includes:
discretizing the attribute items of each piece of original data in the normalized data set, and mapping each attribute value obtained after discretization into a preset integer interval according to its magnitude;
calculating the information gain ratio of the attribute items of each piece of original data based on the mapped attribute values, and constructing an attribute set from the attribute items whose information gain ratio is not smaller than a preset value;
inserting each piece of data corresponding to the attribute set into a preset prefix tree, and traversing each leaf node in the prefix tree to delete the repeated data, thereby obtaining the data set after data cleaning.
Cleaning the data by calculating the information gain ratio and inserting the data into a prefix tree reduces the time complexity of the duplicate detection process and ensures the accuracy of the data set.
Further, the abnormal data detection specifically includes:
clustering the normalized data set by using the K-Means clustering algorithm to obtain a plurality of data clusters formed by the original data in the data set;
based on the plurality of data clusters, calculating a first Euclidean distance between each piece of data in each data cluster and the other data in the same cluster, and a second Euclidean distance between each piece of data in each data cluster and each piece of data in the other data clusters;
calculating an outlier factor for each piece of data in each data cluster based on the first Euclidean distance, and determining the data whose outlier factor is not smaller than a first threshold as local isolated data;
determining each piece of data whose second Euclidean distance is not smaller than a second threshold as global isolated data, determining the original data corresponding to the local isolated data and the global isolated data as abnormal data, and deleting the abnormal data to obtain the data set after abnormal data detection.
The beneficial effect of this further scheme is as follows: the data to be audited is generally complicated and large in volume, and density-based isolated point detection alone is unsatisfactory both in execution efficiency and in identifying global isolated points; combining it with the idea of a clustering algorithm reduces the execution time while also identifying the global isolated points.
Further, the information gain ratio of the attribute items of each original data is specifically:
$$\mathrm{GainRatio}(D, a) = \frac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)}$$
wherein:
$$\mathrm{IV}(a) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|};$$
In the formula, $\mathrm{GainRatio}(D, a)$ denotes the information gain ratio of attribute item a in data set D, $\mathrm{Gain}(D, a)$ denotes the information gain of attribute item a in data set D, $|D_i|$ denotes the number of samples for which attribute item a takes the value i, $|D|$ denotes the total number of samples in data set D, and n denotes the number of values of attribute item a.
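Further, the outlier factor of each data in each data cluster is specifically:
$$\mathrm{LOF}_k(p) = \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{\mathrm{lrd}_k(o)}{\mathrm{lrd}_k(p)}, \qquad \mathrm{lrd}_k(p) = \left( \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \mathrm{reach\text{-}dist}_k(p, o) \right)^{-1};$$
In the formula, $\mathrm{LOF}_k(p)$ denotes the outlier factor of data p, $\mathrm{reach\text{-}dist}_k(p, o)$ denotes the reachability distance of data p, $\mathrm{lrd}_k(p)$ denotes the local reachability density of data p, and $N_k(p)$ denotes the set formed by the k pieces of data nearest to data p.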
Further, the support degree given by the adjacent nodes of each item node is specifically:
$$\alpha_{mn} = q_m^{T} q_n$$
In the formula, $\alpha_{mn}$ denotes the support degree given by adjacent node n to item node m, $q_m$ denotes the initial importance feature of item node m, the superscript T denotes the vector transpose operation, and $q_n$ denotes the initial importance feature of adjacent node n;
the aggregate feature of each item node, which fuses the features of its adjacent nodes, is specifically:
$$z_m = \sum_{n \in \mathcal{N}(m)} \hat{\alpha}_{mn}\, p_n$$
In the formula, $z_m$ denotes the aggregate feature of item node m, $\mathcal{N}(m)$ denotes the set of adjacent nodes of item node m, $\hat{\alpha}_{mn}$ denotes the support degree after normalization by the softmax function, and $p_n$ denotes the initial position feature of adjacent node n.
Further, the similarity function is specifically:
$$L = z_m^{T} p_m$$
In the formula, $L$ denotes the value of the objective function, $z_m$ denotes the aggregate feature of item node m, $p_m$ denotes the initial position feature of item node m, and the superscript T denotes the vector transpose operation;
the matching degree between item nodes is specifically:
$$S_{mn} = u_m^{T} u_n$$
wherein:
$$u_m = W_m q_m^{*} + b_m, \qquad u_n = W_n q_n^{*} + b_n;$$
In the formula, $S_{mn}$ denotes the matching degree between item node m and item node n, the superscript T denotes the vector transpose operation, $q_m^{*}$ denotes the target importance feature of item node m, $q_n^{*}$ denotes the target importance feature of item node n, $W_m$ and $b_m$ denote the weight matrix and the bias term corresponding to item node m, and $W_n$ and $b_n$ denote the weight matrix and the bias term corresponding to item node n.
In a second aspect, the present application provides an audit system based on big data, which applies the method of the first aspect and comprises:
a first module, configured to acquire at least two data sets to be audited of different types and preprocess the data sets, wherein the preprocessing comprises data cleaning and abnormal data detection;
a second module, configured to obtain, based on each preprocessed data set, the initial position feature and the initial importance feature of each item node in the data set by random initialization from a Gaussian distribution;
a third module, configured to calculate the support degree given by the adjacent nodes of each item node by using the initial position feature and the initial importance feature, and to calculate, from the normalized support degrees, the aggregate feature of each item node, which fuses the features of its adjacent nodes;
a fourth module, configured to construct a similarity function for each item node by using the initial position feature and the aggregate feature, and to determine the initial importance feature at which the similarity function is maximized as the target importance feature of the item node;
a fifth module, configured to calculate, according to the target importance features of the item nodes in the at least two data sets, the matching degree between item nodes in data sets of different types, to determine two item nodes whose matching degree is not smaller than a threshold as associated items, and to audit each data set based on the associated items.
In a third aspect, the application provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of the first aspects when executing the computer program.
In a fourth aspect, the present application provides a non-transitory computer readable storage medium storing computer instructions that cause a computer to perform the method of any one of the first aspects.
Compared with the prior art, the invention has at least the following beneficial effects:
In the method, the data sets to be audited are first preprocessed by data cleaning and abnormal data detection, which removes repeated data and abnormal data so that the data in each data set is more compact and accurate. The initial position feature and the initial importance feature of each item node in the data set are then obtained by random initialization from a Gaussian distribution. Next, the support degree given by the adjacent nodes of each item node is calculated by using the initial position features and the initial importance features, and the aggregate feature of each item node, which fuses the features of its adjacent nodes, is calculated from the normalized support degrees. A similarity function is constructed for each item node from the initial position feature and the aggregate feature, and the initial importance feature at which the similarity function is maximized is determined as the target importance feature of the item node. Finally, the matching degree between item nodes is calculated according to the target importance features of the item nodes in the two data sets, and two item nodes whose matching degree is not smaller than a threshold are determined as associated items. A pair of associated items indicates that the two item nodes are highly correlated and can serve as references for each other, so auditing each data set on the basis of the associated items improves the final audit result.
In the application, when data sets of different types are audited jointly, finding the relations between the items in the two data sets provides a reference for the audit process. This improves the audit work and the quality of the data materials it relies on, helps organizations and institutions conduct audit work more efficiently, improves audit quality, and reduces risk while ensuring compliance. Cleaning the data by calculating the information gain ratio and inserting the data into a prefix tree reduces the time complexity of the duplicate detection process and ensures the accuracy of the data set. In addition, because the volume of data to be audited is large and density-based isolated point detection alone is unsatisfactory both in execution efficiency and in identifying global isolated points, combining it with the idea of a clustering algorithm reduces the execution time while also identifying the global isolated points.
Drawings
The accompanying drawings, which are included to provide a further understanding of embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings:
FIG. 1 is a method flow diagram of an audit method in an embodiment of the present invention;
FIG. 2 is a schematic diagram of the connection of an audit system in an embodiment of the present invention;
FIG. 3 is a schematic connection diagram of an electronic device according to an embodiment of the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that like reference numerals and letters refer to like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
In the description of the embodiments of the present invention, "plurality" means at least 2.
In order to improve the quality of the data materials used for audit work, help organizations and institutions conduct audit work more efficiently, improve audit quality, and reduce risk while ensuring compliance, this embodiment provides an audit method based on big data which, as shown in FIG. 1, comprises the following specific steps:
S1, acquiring at least two data sets to be audited of different types, and preprocessing the data sets, wherein the preprocessing comprises data cleaning and abnormal data detection.
Optionally, the data cleaning specifically includes:
S11, discretizing attribute items of each original data in the normalized data set, and transforming each attribute value obtained after discretization into a preset integer interval according to the size.
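Step S11 can be illustrated with a minimal sketch. It assumes equal-width binning and a target interval of [0, n_bins - 1]; the source does not specify the discretization rule, so the function name `discretize` and the parameter `n_bins` are hypothetical.

```python
def discretize(values, n_bins=10):
    """Map normalized numeric values to integers in [0, n_bins - 1].

    Assumption: equal-width binning; the source does not fix the rule.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything falls into bin 0.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


amounts = [0.0, 0.12, 0.5, 0.51, 1.0]
bins = discretize(amounts, n_bins=4)
print(bins)  # [0, 0, 2, 2, 3]
```

Mapping values to small integers in this way makes them usable as branch keys when the data is later inserted into the prefix tree.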
S12, calculating the information gain ratio of the attribute items of the original data based on the mapped attribute values, and constructing an attribute set from the attribute items whose information gain ratio is not smaller than a preset value.
The information gain ratio of the attribute items of each original data is specifically:
$$\mathrm{GainRatio}(D, a) = \frac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)}$$
wherein:
$$\mathrm{IV}(a) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|};$$
In the formula, $\mathrm{GainRatio}(D, a)$ denotes the information gain ratio of attribute item a in data set D, $\mathrm{Gain}(D, a)$ denotes the information gain of attribute item a in data set D, $|D_i|$ denotes the number of samples for which attribute item a takes the value i, $|D|$ denotes the total number of samples in data set D, and n denotes the number of values of attribute item a.
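The gain ratio computation of step S12 can be sketched as below. The sketch follows the usual definition (information gain divided by the split information of the attribute). It assumes the gain is measured against some reference column `labels`; the source does not say which column plays that role, so that argument and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def gain_ratio(attr_values, labels):
    """Gain ratio of one attribute with respect to `labels`.

    Assumption: `labels` is a reference column chosen by the caller.
    """
    total = len(labels)
    groups = {}
    for a, y in zip(attr_values, labels):
        groups.setdefault(a, []).append(y)
    # Information gain: entropy before the split minus weighted entropy after.
    cond = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - cond
    # Split information: -sum |D_i|/|D| * log2(|D_i|/|D|).
    split_info = entropy(attr_values)
    return gain / split_info if split_info > 0 else 0.0


attr = [0, 0, 1, 1]
label = ["x", "x", "y", "y"]
ratio = gain_ratio(attr, label)
print(ratio)  # 1.0: the attribute separates the labels perfectly
```

Attributes whose ratio reaches the preset value would then be kept as index attributes for the prefix tree.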
S13, inserting each data corresponding to the attribute set into a preset prefix tree, traversing each leaf node in the prefix tree to delete the repeated data, and obtaining a data set after data cleaning.
Specifically, the preset prefix tree has the following characteristics and structural improvement points:
1) Non-leaf nodes act as index splitting entries and do not store data information.
2) Only leaf nodes store data and may store multiple pieces of data.
3) For data with n attributes used as index items, the tree has n+2 layers, wherein the first layer is the root node and the last layer stores the sample data information.
When detecting repeated data in the prefix tree, the data in each leaf node are traversed in turn, and each sample point is compared with the other data in the leaf node where it is located. If the similarity between two pieces of data is larger than a given threshold, the data is output to a similar data set; once a piece of data has been compared with the other data, it is marked as compared so that it is not compared again in later traversals. For example, if A and B are in the same leaf node, A is compared with B; if their similarity is larger than the given threshold, A is output to the similar data set and deleted from the leaf node where it is located. The leaf nodes where C, D and G are located each hold only one piece of data, and data that exist singly are not repeated data.
Specifically, after the prefix tree is improved, similar data can be quickly gathered in the same leaf node, the operation process is reduced, then the leaf node is traversed, the similarity among the data is calculated in the leaf node, the detection of the repeated data is completed, and the efficiency of the repeated data detection is improved.
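The prefix-tree deduplication described above can be sketched as follows. This is an illustration under stated assumptions, not the patented structure: the tree is a nested dictionary whose inner levels only index, the leaves are lists of records, and `similarity` is a hypothetical field-overlap measure because the source does not define the similarity used.

```python
def build_prefix_tree(records, key_attrs):
    """Insert records into a nested-dict prefix tree keyed by `key_attrs`.

    Inner levels only index; records are stored in leaf lists.
    """
    root = {}
    for rec in records:
        node = root
        for attr in key_attrs[:-1]:
            node = node.setdefault(rec[attr], {})
        node.setdefault(rec[key_attrs[-1]], []).append(rec)
    return root


def iter_leaves(node):
    """Yield every leaf list in the tree."""
    if isinstance(node, list):
        yield node
    else:
        for child in node.values():
            yield from iter_leaves(child)


def similarity(a, b):
    """Hypothetical similarity: fraction of shared fields with equal values."""
    keys = a.keys() & b.keys()
    return sum(a[k] == b[k] for k in keys) / len(keys) if keys else 0.0


def deduplicate(records, key_attrs, threshold=0.8):
    """Keep one representative of each group of near-duplicate records."""
    kept = []
    for leaf in iter_leaves(build_prefix_tree(records, key_attrs)):
        unique = []
        for rec in leaf:
            # Compare only against records in the same leaf.
            if not any(similarity(rec, u) > threshold for u in unique):
                unique.append(rec)
        kept.extend(unique)
    return kept


records = [
    {"dept": 1, "amount": 3, "memo": "rent"},
    {"dept": 1, "amount": 3, "memo": "rent"},
    {"dept": 2, "amount": 3, "memo": "rent"},
]
cleaned = deduplicate(records, ["dept", "amount"])
print(len(cleaned))  # 2
```

Because comparisons happen only inside a leaf, the number of pairwise similarity checks depends on leaf sizes rather than on the size of the whole data set.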
Optionally, the detecting of the abnormal data specifically includes:
S14, clustering the normalized data set by using a K-Means clustering algorithm, and obtaining a plurality of data clusters formed by the original data in the data set.
S15, calculating, based on the plurality of data clusters, a first Euclidean distance between each data in each data cluster and the other data in the same cluster, and a second Euclidean distance between each data in each data cluster and each data in the other data clusters.
S16, calculating an outlier factor of each data in each data cluster based on the first Euclidean distance, and determining the data with the outlier factor not smaller than a first threshold value as local isolated data.
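The outlier factor of each data in each data cluster is specifically:
$$\mathrm{LOF}_k(p) = \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{\mathrm{lrd}_k(o)}{\mathrm{lrd}_k(p)}, \qquad \mathrm{lrd}_k(p) = \left( \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \mathrm{reach\text{-}dist}_k(p, o) \right)^{-1};$$
In the formula, $\mathrm{LOF}_k(p)$ denotes the outlier factor of data p, $\mathrm{reach\text{-}dist}_k(p, o)$ denotes the reachability distance of data p, $\mathrm{lrd}_k(p)$ denotes the local reachability density of data p, and $N_k(p)$ denotes the set formed by the k pieces of data nearest to data p.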
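The outlier factor of step S16 can be sketched with the standard local outlier factor definition, in which the reachability distance from p to o is the larger of the distance between them and the k-distance of o. This is an illustrative assumption about the exact formula; the function names are hypothetical, and the sketch presumes duplicates have already been removed so that no reachability sum is zero.

```python
import math


def k_nearest(points, i, k):
    """Indices of the k points closest to points[i] (excluding i)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def k_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbour."""
    return math.dist(points[i], points[k_nearest(points, i, k)[-1]])


def lrd(points, i, k):
    """Local reachability density of points[i]."""
    neigh = k_nearest(points, i, k)
    reach = [max(k_distance(points, o, k), math.dist(points[i], points[o]))
             for o in neigh]
    return len(neigh) / sum(reach)


def lof(points, i, k):
    """Local outlier factor of points[i]; values well above 1 suggest an outlier."""
    neigh = k_nearest(points, i, k)
    return sum(lrd(points, o, k) for o in neigh) / (len(neigh) * lrd(points, i, k))


# One cluster: four points on a unit square plus one far-away point.
cluster = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = [lof(cluster, i, 2) for i in range(len(cluster))]
print(scores)
```

In this example the far-away point gets a score several times larger than the square's corners, so a first threshold slightly above 1 would flag it as local isolated data.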
S17, determining each data whose second Euclidean distance is not smaller than a second threshold as global isolated data, determining the original data corresponding to the local isolated data and the global isolated data as abnormal data, and deleting the abnormal data to obtain the data set after abnormal data detection.
Specifically, the data to be audited is generally complicated and large in volume, and density-based isolated point detection alone is unsatisfactory both in execution efficiency and in identifying global isolated points; combining it with the idea of a clustering algorithm reduces the execution time while also identifying the global isolated points.
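The clustering of step S14 can be sketched with plain Lloyd iterations, the textbook form of K-Means. The function name `k_means` and the parameters `iters` and `seed` are hypothetical; the source names only the algorithm, not an implementation.

```python
import math
import random


def k_means(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, cluster index per point)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centroids.append(tuple(sum(x) / len(members)
                                           for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = k_means(points, k=2)
print(labels)
```

Once every point carries a cluster label, the intra-cluster distances feed the outlier factor and the inter-cluster distances feed the global isolated data check.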
S2, based on each preprocessed data set, utilizing Gaussian distribution random initialization to obtain initial position characteristics and initial importance characteristics of each item node in the data set.
Here, the position features represent the environment information of the adjacent nodes, while the importance features represent unique support relations and are more discriminative than the position features.
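The random initialization of step S2 can be sketched as below. The feature dimension `dim`, the standard normal parameters and the `seed` argument are assumptions made for illustration; the source states only that a Gaussian distribution is used.

```python
import random


def init_features(node_ids, dim=4, seed=0):
    """Draw position and importance vectors from N(0, 1) for every node."""
    rng = random.Random(seed)
    position = {m: [rng.gauss(0.0, 1.0) for _ in range(dim)] for m in node_ids}
    importance = {m: [rng.gauss(0.0, 1.0) for _ in range(dim)] for m in node_ids}
    return position, importance


position, importance = init_features(["item_a", "item_b"], dim=3)
print(position["item_a"], importance["item_a"])
```

Fixing the seed makes the initialization reproducible, which is convenient when comparing audit runs.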
S3, calculating the support degree given by the adjacent nodes of each item node by using the initial position features and the initial importance features, and calculating, from the normalized support degrees, the aggregate feature of each item node, which fuses the features of its adjacent nodes.
The support degree given by the adjacent nodes of each item node is specifically:
$$\alpha_{mn} = q_m^{T} q_n$$
In the formula, $\alpha_{mn}$ denotes the support degree given by adjacent node n to item node m, $q_m$ denotes the initial importance feature of item node m, the superscript T denotes the vector transpose operation, and $q_n$ denotes the initial importance feature of adjacent node n;
further, the aggregate feature of each item node, which fuses the features of its adjacent nodes, is specifically:
$$z_m = \sum_{n \in \mathcal{N}(m)} \hat{\alpha}_{mn}\, p_n$$
In the formula, $z_m$ denotes the aggregate feature of item node m, $\mathcal{N}(m)$ denotes the set of adjacent nodes of item node m, $\hat{\alpha}_{mn}$ denotes the support degree after normalization by the softmax function, and $p_n$ denotes the initial position feature of adjacent node n.
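Step S3 can be sketched as an attention-style aggregation. The sketch assumes the support degree is the inner product of the two importance vectors, which is an interpretation rather than something the extracted text states explicitly; the softmax normalization and the weighted sum of adjacent position features follow the description. All names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(m, neighbors, position, importance):
    """Aggregate feature of node m: softmax-weighted sum of neighbour positions.

    Assumption: support is the inner product of importance vectors.
    """
    support = [dot(importance[m], importance[n]) for n in neighbors[m]]
    weights = softmax(support)
    dim = len(position[m])
    return [sum(w * position[n][d] for w, n in zip(weights, neighbors[m]))
            for d in range(dim)]


importance = {"m": [1.0, 0.0], "a": [1.0, 0.0], "b": [1.0, 0.0]}
position = {"m": [0.0, 0.0], "a": [2.0, 0.0], "b": [0.0, 2.0]}
neighbors = {"m": ["a", "b"]}
z_m = aggregate("m", neighbors, position, importance)
print(z_m)  # [1.0, 1.0]: both neighbours receive equal support
```

Because the weights sum to 1, the aggregate feature always lies inside the convex hull of the adjacent position features.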
S4, constructing a similarity function of each item node by using the initial position features and the aggregation features, and determining the initial importance features when the similarity functions are maximized as target importance features of each item node, wherein the maximized value of the similarity functions is 100%, namely 1.
Specifically, the similarity function is:
$$L = z_m^{T} p_m$$
In the formula, $L$ denotes the value of the objective function, $z_m$ denotes the aggregate feature of item node m, $p_m$ denotes the initial position feature of item node m, and the superscript T denotes the vector transpose operation.
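The maximization in step S4 can be illustrated with a deliberately simple optimizer. The source does not name an optimization procedure, so this sketch swaps in random-search hill climbing. It also assumes that the objective is the sum over nodes of the inner product between the aggregate feature and the node's own position feature, with support computed as an inner product of importance vectors. All names are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def aggregate(m, neighbors, position, importance):
    """Softmax-weighted sum of neighbour positions (support = inner product)."""
    scores = [dot(importance[m], importance[n]) for n in neighbors[m]]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e / total * position[n][d] for e, n in zip(exps, neighbors[m]))
            for d in range(len(position[m]))]


def objective(neighbors, position, importance):
    """Sum over nodes of z_m^T p_m (assumed form of the similarity)."""
    return sum(dot(aggregate(m, neighbors, position, importance), position[m])
               for m in neighbors)


def fit_importance(neighbors, position, importance, steps=200, scale=0.1, seed=0):
    """Random-search hill climbing: keep a perturbation only if it helps."""
    rng = random.Random(seed)
    best = {m: list(v) for m, v in importance.items()}
    best_score = objective(neighbors, position, best)
    for _ in range(steps):
        trial = {m: [x + rng.gauss(0.0, scale) for x in v] for m, v in best.items()}
        score = objective(neighbors, position, trial)
        if score > best_score:
            best, best_score = trial, score
    return best, best_score


position = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [-1.0, 0.0]}
importance = {"a": [0.1, 0.0], "b": [0.0, 0.1], "c": [0.1, 0.1]}
neighbors = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
start = objective(neighbors, position, importance)
target, final = fit_importance(neighbors, position, importance)
print(start, final)  # final is never lower than start
```

A gradient-based optimizer would normally converge faster; hill climbing is used here only because it keeps the sketch short and guarantees that the objective never decreases.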
S5, calculating to obtain the matching degree between the item nodes in the data sets of different types according to the target importance characteristics of the item nodes in the data sets, determining the two item nodes with the matching degree not smaller than a threshold value as related items, and auditing the data sets based on the related items.
Specifically, when data sets of different types are audited jointly, finding the relations between the items in the two data sets provides a reference for the audit process. This improves the audit work and the quality of the data materials it relies on, helps organizations and institutions conduct audit work more efficiently, improves audit quality, and reduces risk while ensuring compliance.
The matching degree between item nodes is specifically:
$$S_{mn} = u_m^{T} u_n$$
wherein:
$$u_m = W_m q_m^{*} + b_m, \qquad u_n = W_n q_n^{*} + b_n;$$
In the formula, $S_{mn}$ denotes the matching degree between item node m and item node n, the superscript T denotes the vector transpose operation, $q_m^{*}$ denotes the target importance feature of item node m, $q_n^{*}$ denotes the target importance feature of item node n, $W_m$ and $b_m$ denote the weight matrix and the bias term corresponding to item node m, and $W_n$ and $b_n$ denote the weight matrix and the bias term corresponding to item node n.
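The matching and thresholding of step S5 can be sketched as follows. The sketch assumes the matching degree is the inner product of the two target importance vectors after each passes through its own affine map (weight matrix and bias term). As a simplification, one weight matrix and bias are shared per data set rather than per item node. The item names, parameters and threshold are hypothetical.

```python
def affine(W, b, x):
    """Return W x + b for a matrix W (list of rows), bias b and vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def matching_degree(q_m, q_n, W_m, b_m, W_n, b_n):
    """Inner product of the two projected target importance vectors."""
    u_m = affine(W_m, b_m, q_m)
    u_n = affine(W_n, b_n, q_n)
    return sum(a * c for a, c in zip(u_m, u_n))


def associated_items(targets_a, targets_b, params_a, params_b, threshold):
    """Pairs (m, n) across two data sets whose matching degree >= threshold."""
    W_a, b_a = params_a
    W_b, b_b = params_b
    return [(m, n)
            for m, q_m in targets_a.items()
            for n, q_n in targets_b.items()
            if matching_degree(q_m, q_n, W_a, b_a, W_b, b_b) >= threshold]


identity = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])  # placeholder parameters
activities = {"activity_17": [1.0, 0.0]}
ledger = {"ledger_3": [0.9, 0.1], "ledger_8": [-1.0, 0.0]}
pairs = associated_items(activities, ledger, identity, identity, threshold=0.8)
print(pairs)  # [('activity_17', 'ledger_3')]
```

Each returned pair is a candidate cross-reference for the auditor, for example an expenditure in the economic activity data set and the ledger entry it most plausibly corresponds to.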
Embodiment 2. This embodiment of the application provides an audit system based on big data, which applies the method of Embodiment 1 and, as shown in FIG. 2, comprises the following modules:
A first module, configured to acquire at least two data sets to be audited of different types and preprocess the data sets, wherein the preprocessing comprises data cleaning and abnormal data detection.
A second module, configured to obtain, based on each preprocessed data set, the initial position feature and the initial importance feature of each item node in the data set by random initialization from a Gaussian distribution.
A third module, configured to calculate the support degree given by the adjacent nodes of each item node by using the initial position feature and the initial importance feature, and to calculate, from the normalized support degrees, the aggregate feature of each item node, which fuses the features of its adjacent nodes.
A fourth module, configured to construct a similarity function for each item node by using the initial position feature and the aggregate feature, and to determine the initial importance feature at which the similarity function is maximized as the target importance feature of the item node.
A fifth module, configured to calculate, according to the target importance features of the item nodes in the at least two data sets, the matching degree between item nodes in data sets of different types, to determine two item nodes whose matching degree is not smaller than a threshold as associated items, and to audit each data set based on the associated items.
Embodiment 3. This embodiment of the application provides an electronic device which, as shown in FIG. 3, includes a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the method of Embodiment 1 when executing the computer program.
Embodiment 4. This embodiment of the application provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method of Embodiment 1.
The foregoing description of the embodiments is provided to illustrate the general principles of the invention. It is not intended to limit the scope of the invention or to restrict the invention to the particular embodiments described; any modifications, equivalents, improvements and the like that fall within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (8)

1. An audit method based on big data, characterized by comprising the following steps:
acquiring at least two data sets to be audited, the data sets being of different types, and preprocessing the data sets, wherein the preprocessing includes data cleaning and abnormal data detection;
for each preprocessed data set, obtaining the initial position feature and the initial importance feature of each item node in the data set by random initialization from a Gaussian distribution;
using the initial position features and the initial importance features, calculating the support degree given to each item node by its adjacent nodes, and calculating, from the normalized support degrees, the aggregate feature of each item node, which fuses the features of its adjacent nodes;
constructing a similarity function for each item node from the initial position feature and the aggregate feature, and determining the initial importance feature at which the similarity function is maximized as the target importance feature of each item node;
calculating, from the target importance features of the item nodes in the at least two data sets, the matching degree between item nodes in data sets of different types, determining two item nodes whose matching degree is not less than a threshold as related items, and auditing each data set based on the related items;
wherein the support degree given to each item node by its adjacent nodes is specifically:
[formula not reproduced in the source text]
where $\alpha_{mn}$ denotes the support degree given by adjacent node n to item node m, the remaining symbols denote the initial importance feature of item node m and the initial importance feature of adjacent node n, and the superscript T denotes vector transposition;
the aggregate feature of each item node, which fuses the features of its adjacent nodes, is specifically:
[formula not reproduced in the source text]
where the left-hand side denotes the aggregate feature of item node m, $N(v_m)$ denotes the set of adjacent nodes of item node m, $\beta_{mn}$ denotes the support degree after normalization by the softmax function, and the remaining symbol denotes the initial position feature of adjacent node n;
the similarity function is specifically:
[formula not reproduced in the source text]
where $P_s$ denotes the objective function value, the remaining symbols denote the aggregate feature of item node m and the initial position feature of item node m, and the superscript T denotes vector transposition;
the matching degree between item nodes is specifically:
$q = \tanh\big((h_m \times h_n)^{T} \cdot (h_m \times h_n)\big)$, where $h_m$ and $h_n$ are defined by:
[formula not reproduced in the source text]
where q denotes the matching degree between item node m and item node n, the superscript T denotes vector transposition, the feature symbols denote the target importance feature of item node m and the target importance feature of item node n, $w_i$ and $b_i$ denote the weight matrix and bias term corresponding to item node m, and $w_j$ and $b_j$ denote the weight matrix and bias term corresponding to item node n.

2. The audit method based on big data according to claim 1, characterized in that the data cleaning is specifically:
discretizing the attribute items of each original data record in the normalized data set, and mapping the discretized attribute values, according to their magnitude, into a preset integer interval;
calculating the information gain ratio of each attribute item of the original data based on the transformed attribute values, and constructing an attribute set from the attribute items whose information gain ratio is not less than a preset value;
inserting the data corresponding to the attribute set into a preset prefix tree, and traversing the leaf nodes of the prefix tree to delete duplicate data, thereby obtaining the cleaned data set.

3. The audit method based on big data according to claim 1, characterized in that the abnormal data detection is specifically:
clustering the normalized data set with the K-Means clustering algorithm to obtain a plurality of data clusters composed of the original data in the data set;
based on the plurality of data clusters, calculating a first Euclidean distance between each data point in each data cluster and the other data points in the same cluster, and a second Euclidean distance between each data point in each data cluster and each data point in the other data clusters;
calculating the outlier factor of each data point in each data cluster based on the first Euclidean distance, and determining data points whose outlier factor is not less than a first threshold as locally isolated data;
determining data points whose second Euclidean distance is not less than a second threshold as globally isolated data, determining the original data corresponding to the locally isolated data and the globally isolated data as abnormal data, and deleting the abnormal data to obtain the data set after abnormal data detection.

4. The audit method based on big data according to claim 2, characterized in that the information gain ratio of each attribute item of the original data is specifically:
[formula not reproduced in the source text]
where $g_r(D, A)$ denotes the information gain ratio of attribute item A in data set D, $g(D, A)$ denotes the information gain of attribute item A in data set D, $|D_i|$ denotes the number of samples for which attribute item A takes the value i, $|D|$ denotes the total number of samples in data set D, and n denotes the number of distinct values of attribute item A.

5. The audit method based on big data according to claim 3, characterized in that the outlier factor of each data point in each data cluster is specifically:
[formula not reproduced in the source text]
where $L(X_i)$ denotes the outlier factor of data point $X_i$, $B(Z_i)$ denotes the reachable distance of data point $Z_i$, $\rho(X_i)$ denotes the reachable density of data point $X_i$, and $A_{ik}$ denotes the set of the k data points nearest to $X_i$.

6. An audit system based on big data, applied to the audit method based on big data according to any one of claims 1 to 5, characterized by comprising:
a first module, configured to acquire at least two data sets to be audited, the data sets being of different types, and to preprocess the data sets, wherein the preprocessing includes data cleaning and abnormal data detection;
a second module, configured to obtain, for each preprocessed data set, the initial position feature and the initial importance feature of each item node in the data set by random initialization from a Gaussian distribution;
a third module, configured to calculate, from the initial position features and the initial importance features, the support degree given to each item node by its adjacent nodes, and to calculate from the normalized support degrees the aggregate feature of each item node, which fuses the features of its adjacent nodes;
a fourth module, configured to construct a similarity function for each item node from the initial position feature and the aggregate feature, and to determine the initial importance feature at which the similarity function is maximized as the target importance feature of each item node;
a fifth module, configured to calculate, from the target importance features of the item nodes in the at least two data sets, the matching degree between item nodes in data sets of different types, to determine two item nodes whose matching degree is not less than a threshold as related items, and to audit each data set based on the related items;
wherein, in the system, the support degree given to each item node by its adjacent nodes is specifically:
[formula not reproduced in the source text]
where $\alpha_{mn}$ denotes the support degree given by adjacent node n to item node m, the remaining symbols denote the initial importance feature of item node m and the initial importance feature of adjacent node n, and the superscript T denotes vector transposition;
the aggregate feature of each item node, which fuses the features of its adjacent nodes, is specifically:
[formula not reproduced in the source text]
where the left-hand side denotes the aggregate feature of item node m, $N(v_m)$ denotes the set of adjacent nodes of item node m, $\beta_{mn}$ denotes the support degree after normalization by the softmax function, and the remaining symbol denotes the initial position feature of adjacent node n;
the similarity function is specifically:
[formula not reproduced in the source text]
where $P_s$ denotes the objective function value, the remaining symbols denote the aggregate feature of item node m and the initial position feature of item node m, and the superscript T denotes vector transposition;
the matching degree between item nodes is specifically:
$q = \tanh\big((h_m \times h_n)^{T} \cdot (h_m \times h_n)\big)$, where $h_m$ and $h_n$ are defined by:
[formula not reproduced in the source text]
where q denotes the matching degree between item node m and item node n, the superscript T denotes vector transposition, the first feature symbol denotes the target importance feature of item node m, $Z_n^{H1}$ denotes the target importance feature of item node n, $w_i$ and $b_i$ denote the weight matrix and bias term corresponding to item node m, and $w_j$ and $b_j$ denote the weight matrix and bias term corresponding to item node n.

7. An electronic device, characterized by comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method according to any one of claims 1 to 5 when executing the computer program.

8. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions cause a computer to execute the method according to any one of claims 1 to 5.
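The attribute selection in claims 2 and 4 can be illustrated with a short Python sketch. Because the claimed formula is not reproduced in this text, the sketch assumes the standard definition used in C4.5-style decision trees, which matches the symbols listed in claim 4: the information gain g(D, A) divided by the split information of A, where the split information is the entropy of the proportions |D_i|/|D| over the n values of A. The class label column and the function names are hypothetical; the claims do not state which target the information gain is measured against.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2) of a list of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(values).values())


def gain_ratio(attribute, labels):
    """Information gain ratio of one discretized attribute item.

    attribute: the value of attribute item A for each sample in D
    labels:    a class label for each sample (hypothetical target column)
    """
    total = len(attribute)
    # Partition the labels by attribute value: D_1, ..., D_n.
    groups = {}
    for value, label in zip(attribute, labels):
        groups.setdefault(value, []).append(label)
    # Conditional entropy of the labels given the attribute.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional   # g(D, A)
    split_info = entropy(attribute)        # entropy of |D_i| / |D|
    # An attribute with a single value carries no split information.
    return gain / split_info if split_info > 0 else 0.0


def select_attributes(table, labels, preset_value):
    """Keep the attribute items whose gain ratio is not less than preset_value.

    table: dict mapping attribute name -> list of discretized values
    """
    return [name for name, column in table.items()
            if gain_ratio(column, labels) >= preset_value]
```

An attribute that separates the labels perfectly with two equally frequent values has a gain ratio of 1, while an attribute that is independent of the labels has a gain ratio of 0, so a preset value between these extremes retains only the informative attribute items.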
CN202510205822.1A 2025-02-25 2025-02-25 Audit method based on big data Active CN119691470B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510205822.1A CN119691470B (en) 2025-02-25 2025-02-25 Audit method based on big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202510205822.1A CN119691470B (en) 2025-02-25 2025-02-25 Audit method based on big data

Publications (2)

Publication Number Publication Date
CN119691470A CN119691470A (en) 2025-03-25
CN119691470B CN119691470B (en) 2025-06-13

Family

ID=95027844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510205822.1A Active CN119691470B (en) 2025-02-25 2025-02-25 Audit method based on big data

Country Status (1)

Country Link
CN (1) CN119691470B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113076352A (en) * 2021-03-17 2021-07-06 远光软件股份有限公司 Auditing method, electronic device and storage medium
CN113657549A (en) * 2021-08-31 2021-11-16 平安医疗健康管理股份有限公司 Medical data auditing method, device, equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8917902B2 (en) * 2011-08-24 2014-12-23 The Nielsen Company (Us), Llc Image overlaying and comparison for inventory display auditing
JP7012895B1 (en) * 2021-07-19 2022-01-28 株式会社Tkc Accounting systems, methods, and programs
CN118312909B (en) * 2024-06-06 2024-10-18 湖南三湘银行股份有限公司 Bank auditing method and system based on deep neural network
CN118396684B (en) * 2024-06-26 2024-09-20 广东省广告集团股份有限公司 User advertisement recommendation method and device based on fused neural network and model construction method thereof

Also Published As

Publication number Publication date
CN119691470A (en) 2025-03-25

Similar Documents

Publication Publication Date Title
US12216683B1 (en) Systems and methods for generating and implementing knowledge graphs for knowledge representation and analysis
US10515090B2 (en) Data extraction and transformation method and system
US20220075762A1 (en) Method for classifying an unmanaged dataset
RU2268488C2 (en) Method and system for data organization
US6542896B1 (en) System and method for organizing data
CN102197406B (en) fuzzy data manipulation
CN105469096B (en) A kind of characteristic bag image search method based on Hash binary-coding
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
CN107291895B (en) A Fast Hierarchical Document Query Method
CN107463665A (en) A kind of data correlation rule mining algorithms
US11188981B1 (en) Identifying matching transfer transactions
CN109582783B (en) Hot topic detection method and device
CN108564009A (en) A kind of improvement characteristic evaluation method based on mutual information
US20220229854A1 (en) Constructing ground truth when classifying data
CN109992676A (en) A kind of cross-media resource retrieval method and retrieval system
CN119807912A (en) Abnormal data detection method based on improved differential privacy and clustering algorithm optimization
CN119691470B (en) Audit method based on big data
CN114328600A (en) Method, device, equipment and storage medium for determining standard data element
CN115964658B (en) A clustering-based classification label updating method and system
Gabor-Toth et al. Linking Deutsche Bundesbank Company Data
CN111625530A (en) Large-scale vector retrieval method and device
CN114328844B (en) A text data set management method, device, equipment and storage medium
CN113988878B (en) Graph database technology-based anti-fraud method and system
CN115186138A (en) A method and terminal for comparison of distribution network data
CN111753084B (en) Short text feature extraction and classification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant