
CN110502552B - Classification data conversion method based on fine-tuning conditional probability - Google Patents


Info

Publication number
CN110502552B
Authority
CN
China
Prior art keywords
data
fine
tuning
conditional probability
numerical
Prior art date
Legal status
Active
Application number
CN201910770010.6A
Other languages
Chinese (zh)
Other versions
CN110502552A (en)
Inventor
熊庆宇
李秋德
吉胜芬
高旻
余洋
王凯歌
吉皇
Current Assignee
Chongqing University
Original Assignee
Chongqing University
Priority date
Filing date
Publication date
Application filed by Chongqing University filed Critical Chongqing University
Priority to CN201910770010.6A priority Critical patent/CN110502552B/en
Publication of CN110502552A publication Critical patent/CN110502552A/en
Application granted granted Critical
Publication of CN110502552B publication Critical patent/CN110502552B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03Data mining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Fuzzy Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to data preprocessing for data mining and machine learning, and provides a categorical data conversion method based on fine-tuned conditional probabilities, comprising the following steps: S1, collecting the categorical data; S2, preprocessing the data by cleaning missing, noisy, and invalid entries from the categorical data; S3, computing conditional probabilities and converting the cleaned categorical data into numerical vectors; S4, fine-tuning the conditional probabilities by numerically adjusting the vectors produced in step S3; and S5, numerically embedding the categorical data, i.e., embedding or mapping the original categorical data onto the fine-tuned numerical vectors of step S4 to obtain numerical data. The invention converts the categorical values of a categorical data set into high-quality numerical vectors whose distribution remains faithful to that of the original data, thereby ensuring the reliability of downstream data mining tasks.

Description

A Categorical Data Conversion Method Based on Fine-Tuned Conditional Probabilities

Technical Field

The invention relates to data preprocessing for data mining and machine learning, and in particular to a categorical data conversion method based on fine-tuned conditional probabilities.

Background Art

In a data mining or machine learning task, the collected data usually contain both numerical and categorical attributes. However, most machine learning algorithms (e.g., neural networks, support vector machines, logistic regression) can only operate directly on numerical data; only a few, such as decision trees and Bayesian methods, can handle categorical data directly. Moreover, algorithms that work directly on numerical data are usually more efficient than those that work directly on categorical data. To make numerical-input machine learning algorithms broadly applicable, categorical data must therefore be converted into numerical data. A variety of categorical data conversion methods have been proposed at home and abroad, but most of them share a common defect: they convert categorical data into low-quality numerical data that deviates from the true distribution of the original data, degrading the performance and reliability of the downstream machine learning algorithm. It is therefore extremely important to devise an efficient and well-founded categorical data conversion method.

Among the many methods for converting categorical data into numerical data, the most common is one-hot encoding, which converts each categorical value of a categorical attribute into a high-dimensional 0-1 vector. When the cardinality of a categorical attribute is large, this method easily runs into the curse of dimensionality, increasing both the storage cost of the data and the running time of subsequent machine learning algorithms. To address this, patent CN109740680A discloses a classification method and system for mixed-value attribute approval data that first applies one-hot encoding to obtain high-dimensional numerical data and then uses a neural network for deep encoding to reduce the attribute dimensionality, but it takes considerable time to find a good neural network structure. Patent US20190164083A1 discloses a categorical data conversion and clustering method for machine learning in natural language processing, which likewise starts from one-hot encoding and then applies a clustering algorithm to reduce the attribute dimensionality. Beyond one-hot encoding and its refinements, patent CN109255373A discloses a data processing method for digitizing categorical data, but it only applies to environmental problems such as land use and soil type and is not generally applicable. Granted patent US9619757B2 discloses a nominal attribute conversion method based on outcome likelihoods, which converts each categorical value into the likelihood (or probability) of that value appearing in the data set; this method does not take class label information into account and may therefore lose information.
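For context, the following minimal sketch (in Python, not part of the patent; the toy attribute name and values are illustrative) shows why one-hot dimensionality grows with value cardinality, while a per-class conditional-probability encoding of the kind introduced below only needs as many entries as there are class labels.

```python
# Toy comparison: one-hot width grows with the number of distinct categorical
# values, while a conditional-probability vector has one entry per class label.
import pandas as pd

df = pd.DataFrame({
    "marital_status": ["married", "single", "divorced", "married", "single"],
    "label": [1, 0, 0, 1, 1],
})

one_hot = pd.get_dummies(df["marital_status"])
print(one_hot.shape[1])          # 3 columns here; thousands for high-cardinality attributes

print(df["label"].nunique())     # a conditional-probability vector needs only l = 2 entries
```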

After taking class label information into account, Kasif et al. proposed a conversion method based on memory-based reasoning that converts each categorical value of a categorical attribute into a conditional probability vector. However, they did not feed the converted conditional probability vectors into numerical-input machine learning algorithms, but only used them to compute distances between categorical values. Hernández-Pereira et al. applied the conditional probabilities of this conversion method to a numerical-input neural network and obtained good experimental results on an intrusion detection problem. Because it considers class label information, the memory-based reasoning conversion yields higher-quality numerical data. A closer analysis, however, shows that it relies on an attribute independence assumption, namely that the attributes of the data set are mutually independent. This assumption is violated whenever dependencies exist between attributes (and attributes are usually interdependent), so the converted conditional probabilities become less reliable and deviate slightly from the true distribution of the original data.

Summary of the Invention

The purpose of the present invention is to provide a categorical data conversion method based on fine-tuned conditional probabilities that converts the categorical values of a categorical data set into high-quality numerical vectors, so that the converted numerical data still follows the true distribution of the original data, thereby improving the classification performance of the downstream machine learning algorithm and ensuring the reliability of the data mining task.

The categorical data conversion method based on fine-tuned conditional probabilities proposed by the present invention comprises:

S1, data collection of the categorical data;

S2, data preprocessing: cleaning missing data, noisy data, and invalid data from the categorical data;

S3, conditional probability calculation: converting the cleaned categorical data into numerical vectors;

S4, fine-tuning of the conditional probabilities: numerically adjusting the numerical vectors converted in step S3;

S5, numerical embedding of the categorical data: the numerical vectors fine-tuned in step S4 are used to embed or map the original categorical data into numerical data.

Beneficial effects of the categorical data conversion method based on fine-tuned conditional probabilities of the present invention include the following. Reliability: the categorical values of a categorical data set can be converted into high-quality numerical vectors, and the converted numerical data preserves the true distribution of the original data, ensuring the reliability of the data mining task;

High performance: once the converted data is fed into the next-stage machine learning algorithm, high performance indicators (high accuracy, recall, F-score, etc.) can be achieved;

Efficiency: the dimensionality of the converted data is far lower than with one-hot encoding, and the method requires less running time than one-hot encoding and its refinements;

Convenience: few preset parameters are required, which spares the user the trouble of parameter setting and suits practical application scenarios;

Universality: it is a data-driven conversion method that adapts automatically to a wide range of categorical data sets.

Brief Description of the Drawings

FIG. 1 is a flow chart of the algorithm of a categorical data conversion method based on fine-tuned conditional probabilities according to an embodiment of the present invention;

FIG. 2 is a diagram of a practical application environment of the method according to an embodiment of the present invention;

FIG. 3 is an example of a categorical data matrix used by the method according to an embodiment of the present invention;

FIG. 4 is an architecture diagram of an application system of the method according to an embodiment of the present invention;

FIG. 5 is an example of categorical data conversion performed by the method according to an embodiment of the present invention;

Reference numerals: 101, data collection; 102, network; 103, database; 104, service system; 105, user equipment; 200, categorical data sample; 301, data conversion module; 302, classifier module; 303, analysis report; 401, conditional probability calculation; 402, valid range estimation; 403, conditional probability fine-tuning; 404, post-tuning verification; 405, condition judgment; 501, categorical data set; 502, categorical attribute; 505, numerical data set.

Detailed Description of the Embodiments

The categorical data conversion method based on fine-tuned conditional probabilities of the present invention is further described below with reference to the accompanying drawings and specific embodiments. The following embodiments are only intended to illustrate the technical solution of the present invention more clearly and do not limit its scope of protection.

As shown in FIG. 1, the present invention is a categorical data conversion method based on fine-tuned conditional probabilities, comprising:

S1, data collection 101 of the categorical data;

S2, data preprocessing: cleaning missing data, noisy data, and invalid data from the categorical data;

S3, conditional probability calculation 401: converting the cleaned categorical data into numerical vectors;

S4, fine-tuning of the conditional probabilities: numerically adjusting the numerical vectors converted in step S3;

S5, numerical embedding of the categorical data: the numerical vectors fine-tuned in step S4 are used to embed or map the original categorical data into numerical data.

Beneficial effects of the categorical data conversion method based on fine-tuned conditional probabilities of the present invention include the following. Reliability: the categorical values of the categorical data set 501 can be converted into high-quality numerical vectors, and the converted numerical data set 505 preserves the true distribution of the original data, ensuring the reliability of the data mining task;

High performance: once the converted data is fed into the next-stage machine learning algorithm, high performance indicators (high accuracy, recall, F-score, etc.) can be achieved;

Efficiency: the dimensionality of the converted data is far lower than with one-hot encoding, and the method requires less running time than one-hot encoding and its refinements;

Convenience: few preset parameters are required, which spares the user the trouble of parameter setting and suits practical application scenarios;

Universality: it is a data-driven conversion method that adapts automatically to a wide range of categorical data sets 501.

Let X be a categorical data set 501 containing N samples, each represented by an m-dimensional vector $[a_1(x), \dots, a_m(x)]$, where $a_i(x)$ is the categorical value of the i-th attribute of sample x; the set of class labels of X is $C = \{c_1, \dots, c_l\}$. In the flow chart of the algorithm, conditional probability calculation 401 first extracts the data of each categorical attribute 502 $A_i$ together with the class label C, then computes the conditional probability of every categorical value $a_i(x)$ within attribute $A_i$, and generates the following l-dimensional numerical vector:

$a_i(x) \to [P(c_1 \mid a_i(x)), \dots, P(c_j \mid a_i(x)), \dots, P(c_l \mid a_i(x))] \quad (1)$

The conditional probability term $P(c_j \mid a_i(x))$ in Eq. (1) is computed by Bayesian estimation with Laplace smoothing:

$$P(c_j \mid a_i(x)) = \frac{\sum_{x' \in X} I(a_i(x'), a_i(x)) \, I(c(x'), c_j) + \lambda}{\sum_{x' \in X} I(a_i(x'), a_i(x)) + l\lambda} \quad (2)$$

where $I(x, y)$ in Eq. (2) is an indicator function, i.e., $I(x, y) = 1$ when $x = y$ and $I(x, y) = 0$ otherwise, $c(x')$ denotes the class label of sample $x'$, and $\lambda \ (\geq 0)$ is a Laplace smoothing factor.
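The computation of Eqs. (1)-(2) can be sketched as follows (a minimal Python illustration, not taken from the patent; the function name, data layout, and toy values are assumptions, and the smoothing follows the standard Laplace-smoothed estimate described above).

```python
from collections import Counter, defaultdict

def conditional_probability_vectors(attribute_values, labels, lam=1.0):
    """Map each categorical value a_i(x) of one attribute A_i to the l-dimensional
    vector [P(c_1|a), ..., P(c_l|a)] of Eq. (1), estimated with Laplace smoothing
    as in Eq. (2); lam is the smoothing factor lambda (>= 0)."""
    classes = sorted(set(labels))                      # c_1, ..., c_l
    l = len(classes)
    value_counts = Counter(attribute_values)           # occurrences of each categorical value
    joint_counts = defaultdict(Counter)                # co-occurrences of (value, class)
    for a, c in zip(attribute_values, labels):
        joint_counts[a][c] += 1
    return {
        a: [(joint_counts[a][c] + lam) / (value_counts[a] + l * lam) for c in classes]
        for a in value_counts
    }

# Toy example: one categorical attribute with binary class labels
vectors = conditional_probability_vectors(
    ["married", "single", "married", "divorced", "single"],
    [1, 0, 1, 0, 1],
)
print(vectors["married"])   # l-dimensional numerical vector replacing the value "married"
```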

Valid range estimation 402: using the Valid Ranges Algorithm, compute for every categorical value $a_i(x)$ in attribute $A_i$ the valid range $[P_{\min}(c_j \mid a_i), P_{\max}(c_j \mid a_i)]$, where $0 \le P_{\min}(c_j \mid a_i) \le P_{\max}(c_j \mid a_i) \le 1$.

S01. If the conditional probability term $P(c_j \mid a_i(x))$ correctly classifies more samples than it misclassifies, i.e., when $Neg\_ratio(a_i, c_j) > pos\_ratio(a_i, c_j)$, fine-tune this probability term $P(c_j \mid a_i(x))$; otherwise exit the fine-tuning process.

S02. Compute the absolute value of the difference between the average of the valid range of the categorical value $a_i(x)$ and its conditional probability, $\Delta(c_j \mid a_i(x)) = \left| P(c_j \mid a_i(x)) - \bar{P}(c_j \mid a_i) \right|$, where $\bar{P}(c_j \mid a_i) = \frac{P_{\min}(c_j \mid a_i) + P_{\max}(c_j \mid a_i)}{2}$.

S03. Update the conditional probability $P(c_j \mid a_i(x))$ using the quantities computed in step S02, obtaining an adjusted probability $\tilde{P}(c_j \mid a_i(x))$.

S04. Normalize the updated conditional probabilities, i.e., rescale the adjusted vector so that $\sum_{j=1}^{l} \tilde{P}(c_j \mid a_i(x)) = 1$.

Post-tuning verification 404: a machine learning classifier is used to verify whether the performance of the fine-tuned conditional probabilities has improved, i.e., whether the fine-tuning algorithm fits the distribution of the original data more faithfully.

Condition judgment 405: determine whether the performance of the conditional probabilities in post-tuning verification 404 has improved. If it has, the current fine-tuning step is effective; return to conditional probability fine-tuning 403 and continue. Otherwise, terminate the fine-tuning process and exit the program. In addition, to prevent the fine-tuning process from entering an infinite loop, the number of fine-tuning iterations is limited to a preset maximum of 1000.
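The fine-tuning loop of 403-405 can be sketched as below (Python, illustrative only). The exact update rule of step S03 appears in the patent only as an equation image, so the small step toward the valid-range midpoint, the 10% step size, and the `validate` scoring callback are assumptions; the S01 gate on which probability terms are adjusted is omitted for brevity. The overall structure (adjust, re-normalize, verify with a classifier, keep only improving adjustments, and cap the loop at 1000 iterations) follows the description above.

```python
def fine_tune(prob_vectors, valid_ranges, validate, max_iters=1000):
    """prob_vectors: {value: [P(c_1|a), ..., P(c_l|a)]} from Eqs. (1)-(2).
    valid_ranges:  {value: [(P_min, P_max), ...]} per class from estimation 402.
    validate:      callable scoring an encoding with a ML classifier (verification 404)."""
    best_score = validate(prob_vectors)
    for _ in range(max_iters):                         # preset cap avoids an infinite loop (405)
        candidate = {}
        for value, probs in prob_vectors.items():
            adjusted = []
            for p, (p_min, p_max) in zip(probs, valid_ranges[value]):
                midpoint = (p_min + p_max) / 2.0       # average of the valid range (S02)
                adjusted.append(p + 0.1 * (midpoint - p))   # assumed S03 update toward it
            total = sum(adjusted) or 1.0
            candidate[value] = [p / total for p in adjusted]  # normalization (S04)
        score = validate(candidate)                    # post-tuning verification (404)
        if score <= best_score:                        # no improvement: stop fine-tuning (405)
            break
        prob_vectors, best_score = candidate, score
    return prob_vectors
```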

The computing environment diagram comprises four functional blocks coupled by a communication network 102: data collection 101, a storage database 103, a data mining service system 104, and user equipment 105. The data collection 101 terminal may collect useful categorical data automatically online via desktop computers, laptops, or mobile devices (e.g., e-commerce web data, medical monitoring data), or the categorical data set 501 may be collected manually and then entered into the system (e.g., market survey data, census data). The data collection 101 terminal sends the collected categorical data set 501 over the network 102 to the database 103 for storage; the database storing the categorical data set 501 may be a local workstation, a remote server, or a cloud data server. A user sends a request to the service system 104 through the user equipment 105 to analyze a particular data mining task (for example, a credit card fraud detection task). After receiving the request, the service system 104 retrieves the corresponding categorical data set 501 from the database 103, performs data mining analysis, and returns the analysis report 303 to the user equipment 105 for the user to review and act upon.

Data collection 101 stores the collected categorical data sets 501 in the database 103; an example of such a data set is shown in FIG. 3. The categorical data sample 200 is a data matrix for credit card fraud detection in which each row represents a credit loan customer and each column describes basic customer information (attributes such as gender, marital status, income, and credit history). These attributes are categorical data (e.g., gender takes the values "male" and "female") rather than numerical data (e.g., 0.12, 1.85).

When the user equipment 105 requests the service system 104 to analyze a data mining task, the service system 104 serving the user is shown in FIG. 4. The service system 104 first retrieves the corresponding categorical data set 501 from the database 103, then runs the data conversion module 301 and the classifier module 302, and compiles the analysis report 303. The data conversion module 301 of the present invention converts the categorical data in the database 103 into high-quality numerical data; it comprises a data preprocessing sub-module with data cleaning functions (cleaning missing data, noisy data, etc.), a conditional probability calculation sub-module, a conditional probability fine-tuning sub-module, and a numerical embedding sub-module (data embedding or data mapping). The converted numerical data is fed into the classifier module 302, which selects a suitable machine learning model (e.g., a neural network, support vector machine, or logistic regression) and a loss function (squared loss, 0-1 loss, cross-entropy loss, log loss, etc.) to train a classifier. The classifier in the classifier module 302 then evaluates the data converted by the data conversion module 301 and produces the analysis report 303. The analysis report 303 mainly contains the predicted labels of the samples and an evaluation of the classifier's performance and efficiency.
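A minimal sketch of the hand-off from the data conversion module 301 to the classifier module 302 follows (the library, model choice, and toy numbers are illustrative assumptions, not prescribed by the patent).

```python
# Train and evaluate a numeric-input classifier on the converted data, then
# summarize a metric for the analysis report 303. Toy data for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_num = np.array([[0.13, 0.47], [0.33, 0.12], [0.14, 0.45], [0.30, 0.15]])  # from module 301
y = np.array([1, 0, 1, 0])                                                  # class labels

clf = LogisticRegression()                       # any numeric-input learner works here
scores = cross_val_score(clf, X_num, y, cv=2, scoring="accuracy")
print({"accuracy": scores.mean()})               # one entry of the analysis report 303
```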

An embodiment of the data conversion module 301:

The data conversion module 301 converts the categorical data in a data set into numerical data; a credit card fraud detection data set is used as an example below. This data set comes from the credit card department of a bank in a certain city; a total of 284,807 data records were collected in 2013, each containing 28 categorical attributes 502. An example of this data set is shown in the categorical data matrix sample 200.

The operation steps are as follows:

Step 1: the data preprocessing sub-module of the data conversion module 301 cleans the raw data (removing missing data, noisy data, etc.) to obtain the processed categorical data set 501;

Step 2: each categorical attribute 502 and the class labels are extracted from the categorical data set 501;

Step 3: the conditional probabilities 401 are computed via Eqs. (1) and (2); for example, the conditional probability vector corresponding to the categorical value "married" is [0.15, 0.51, ...], the one corresponding to "single" is [0.33, 0.12, ...], and so on;

Step 4: conditional probability fine-tuning 403 is carried out as illustrated in FIG. 5; for example, after fine-tuning, the conditional probability vector [0.15, 0.51, ...] of the categorical value "married" becomes the fine-tuned vector 403 [0.13, 0.47, ...];

Step 5: the categorical data of the categorical data set 501 is converted using the fine-tuned conditional probabilities 403, and the converted numerical data is saved in the numerical data set 505, as sketched below.
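Step 5 (the numerical embedding of step S5) amounts to looking up each categorical value and replacing it with its fine-tuned vector; a minimal sketch follows (the column name, value names, and vectors are the illustrative toy numbers from Steps 3-4, not real data).

```python
# Map each categorical value of the data set 501 onto its fine-tuned conditional
# probability vector 403, producing the numerical data set 505.
import pandas as pd

fine_tuned = {                      # fine-tuned vectors (toy values from Step 4)
    "married": [0.13, 0.47],
    "single":  [0.33, 0.12],
}

categorical = pd.DataFrame({"marital_status": ["married", "single", "married"]})  # data set 501

rows = categorical["marital_status"].map(fine_tuned).tolist()
numerical = pd.DataFrame(rows, columns=["marital_status_c1", "marital_status_c2"])  # data set 505
print(numerical)
```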

Claims (2)

1. A categorical data conversion method based on fine-tuned conditional probabilities, characterized by comprising:

S1, data collection of the categorical data;

S2, data preprocessing: cleaning missing data, noisy data, and invalid data from the categorical data;

S3, conditional probability calculation: converting the cleaned categorical data into numerical vectors;

S4, fine-tuning of the conditional probabilities: numerically adjusting the numerical vectors converted in step S3;

S5, numerical embedding of the categorical data: the numerical vectors fine-tuned in step S4 are used to embed or map the original categorical data into numerical data;

wherein steps S2 and S3 specifically include: X is a categorical data set containing N samples, each represented by an m-dimensional vector $[a_1(x), \dots, a_m(x)]$, where $a_i(x)$ is the categorical value of the i-th attribute of sample x, and the set of class labels of X is $C = \{c_1, \dots, c_l\}$; the conditional probability calculation first extracts the data of each categorical attribute $A_i$ and the class label C, then computes the conditional probability of every categorical value $a_i(x)$ within attribute $A_i$, and generates the following l-dimensional numerical vector:

$a_i(x) \to [P(c_1 \mid a_i(x)), \dots, P(c_j \mid a_i(x)), \dots, P(c_l \mid a_i(x))] \quad (1)$

wherein the conditional probability term $P(c_j \mid a_i(x))$ in Eq. (1) is computed by Bayesian estimation with Laplace smoothing:

$$P(c_j \mid a_i(x)) = \frac{\sum_{x' \in X} I(a_i(x'), a_i(x)) \, I(c(x'), c_j) + \lambda}{\sum_{x' \in X} I(a_i(x'), a_i(x)) + l\lambda} \quad (2)$$

wherein $I(x, y)$ in Eq. (2) is an indicator function, i.e., $I(x, y) = 1$ when $x = y$ and $I(x, y) = 0$ otherwise, and $\lambda \ge 0$ is a Laplace smoothing factor;

wherein step S3 further includes: valid range estimation, using the Valid Ranges Algorithm to compute for every categorical value $a_i(x)$ in attribute $A_i$ the valid range $[P_{\min}(c_j \mid a_i), P_{\max}(c_j \mid a_i)]$, where $0 \le P_{\min}(c_j \mid a_i) \le P_{\max}(c_j \mid a_i) \le 1$;

wherein step S4 includes:

S01. if the conditional probability term $P(c_j \mid a_i(x))$ correctly classifies more samples than it misclassifies, i.e., when $Neg\_ratio(a_i, c_j) > pos\_ratio(a_i, c_j)$, fine-tuning this probability term $P(c_j \mid a_i(x))$, and otherwise exiting the fine-tuning process;

S02. computing the absolute value of the difference between the average of the valid range of the categorical value $a_i(x)$, $\bar{P}(c_j \mid a_i) = \frac{P_{\min}(c_j \mid a_i) + P_{\max}(c_j \mid a_i)}{2}$, and the conditional probability $P(c_j \mid a_i(x))$;

S03. updating the conditional probability $P(c_j \mid a_i(x))$ with the adjusted value obtained from step S02;

S04. normalizing the updated conditional probabilities so that the entries of the vector for $a_i(x)$ sum to 1;

wherein step S4 further includes: post-tuning verification, using a machine learning classifier to verify whether the performance of the fine-tuned conditional probabilities has improved, i.e., whether the fine-tuning algorithm fits the distribution of the original data more faithfully;

wherein step S5 includes: condition judgment, determining whether the performance of the conditional probabilities in the post-tuning verification has improved; if it has, the current fine-tuning is effective, and the process returns to conditional probability fine-tuning and continues; otherwise the fine-tuning process terminates and the program exits.

2. The categorical data conversion method based on fine-tuned conditional probabilities according to claim 1, characterized in that the number of fine-tuning iterations is limited to a preset maximum of 1000.
CN201910770010.6A 2019-08-20 2019-08-20 Classification data conversion method based on fine-tuning conditional probability Active CN110502552B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910770010.6A CN110502552B (en) 2019-08-20 2019-08-20 Classification data conversion method based on fine-tuning conditional probability

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910770010.6A CN110502552B (en) 2019-08-20 2019-08-20 Classification data conversion method based on fine-tuning conditional probability

Publications (2)

Publication Number Publication Date
CN110502552A CN110502552A (en) 2019-11-26
CN110502552B (en) 2022-10-28

Family

ID=68588872

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910770010.6A Active CN110502552B (en) 2019-08-20 2019-08-20 Classification data conversion method based on fine-tuning conditional probability

Country Status (1)

Country Link
CN (1) CN110502552B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444400A (en) * 2020-04-07 2020-07-24 中国汽车工程研究院股份有限公司 Force and Flow Field Data Management Methods
CN114549178A (en) * 2022-02-23 2022-05-27 中国工商银行股份有限公司 Credit evaluation method, credit evaluation device, electronic device and medium
CN115264048B (en) * 2022-07-26 2023-05-23 重庆大学 Intelligent gear decision design method for automatic transmission based on data mining
CN117009339A (en) * 2023-08-15 2023-11-07 中国银行股份有限公司 Data cleaning method, device, equipment and readable storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103294828A (en) * 2013-06-25 2013-09-11 厦门市美亚柏科信息股份有限公司 Verification method and verification device of data mining model dimension
CN104391860A (en) * 2014-10-22 2015-03-04 安一恒通(北京)科技有限公司 Content type detection method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7020593B2 (en) * 2002-12-04 2006-03-28 International Business Machines Corporation Method for ensemble predictive modeling by multiplicative adjustment of class probability: APM (adjusted probability model)
US10558766B2 (en) * 2015-12-31 2020-02-11 Palo Alto Research Center Incorporated Method for Modelica-based system fault analysis at the design stage

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103294828A (en) * 2013-06-25 2013-09-11 厦门市美亚柏科信息股份有限公司 Verification method and verification device of data mining model dimension
CN104391860A (en) * 2014-10-22 2015-03-04 安一恒通(北京)科技有限公司 Content type detection method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Simon Kasif et al., "A probabilistic framework for memory-based reasoning", Artificial Intelligence, vol. 104, no. 1-2, Sep. 1998, pp. 287-311. *
Mei-Ling Shyu et al., "Handling nominal features in anomaly intrusion detection problems", 15th International Workshop on Research Issues in Data Engineering: Stream Data Mining and Applications, Sep. 2005, pp. 55-62. *

Also Published As

Publication number Publication date
CN110502552A (en) 2019-11-26

Similar Documents

Publication Publication Date Title
CN110502552B (en) Classification data conversion method based on fine-tuning conditional probability
WO2021088499A1 (en) False invoice issuing identification method and system based on dynamic network representation
CN109949152A (en) A Personal Credit Default Prediction Method
CN104751363B (en) Stock Forecasting of Middle And Long Period Trends method and system based on Bayes classifier
CN115410275A (en) Office place personnel state detection method and system based on image recognition
Sudipa et al. Trend Forecasting of the Top 3 Indonesian Bank Stocks Using the ARIMA Method
CN110163378A (en) Characteristic processing method, apparatus, computer readable storage medium and computer equipment
CN119006144A (en) Business project management method, device, computer equipment and storage medium
CN114219630A (en) Service risk prediction method, device, equipment and medium
CN113569048A (en) Method and system for automatically dividing affiliated industries based on enterprise operation range
CN117973675A (en) A method and system for judging false closed loop of defects and countermeasures based on planned work orders
CN120163653A (en) Dynamic risk control method, device, equipment and storage medium
CN118297640A (en) Product marketing management system and method based on big data
CN112329862A (en) Decision tree-based anti-money laundering method and system
CN116843345A (en) Intelligent wind control system and method for trading clients based on artificial intelligence technology
CN117726426A (en) Credit evaluation method, credit evaluation device, electronic equipment and storage medium
CN117853151A (en) Electronic commerce data analysis system and method based on big data
CN116384751A (en) Method and computing device for carrying out standardized risk index and risk rating prediction
CN115293867A (en) Financial reimbursement user portrait optimization method, device, equipment and storage medium
CN115237970A (en) Data prediction method, device, equipment, storage medium and program product
CN113724060A (en) Credit risk assessment method and system
CN118468207B (en) Enterprise abnormal behavior monitoring system and method based on big data
CN119359307B (en) Supply chain financial transaction safety early warning method and system
CN119693125B (en) Credit risk level assessment method, apparatus, device and storage medium
CN117952717B (en) A method and system for processing air ticket orders based on big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant