
CN119862442B - An unsupervised deep learning method based on synthetic data generation - Google Patents

An unsupervised deep learning method based on synthetic data generation

Info

Publication number
CN119862442B
Authority
CN
China
Prior art keywords
image data
sample
gaussian
data
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202510354534.2A
Other languages
Chinese (zh)
Other versions
CN119862442A (en)
Inventor
陈新
季良杰
王宇峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Huakun Zhenyu Intelligent Technology Co ltd
Original Assignee
Sichuan Huakun Zhenyu Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Huakun Zhenyu Intelligent Technology Co ltd
Priority to CN202510354534.2A
Publication of CN119862442A
Application granted
Publication of CN119862442B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/10 Pre-processing; Data cleansing
    • G06F18/15 Statistical pre-processing, e.g. techniques for normalisation or restoring missing data
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Algebra (AREA)
  • Pure & Applied Mathematics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of unsupervised deep learning methods, and in particular relates to an unsupervised deep learning method based on synthetic data generation. A Gaussian mixture model is used to find the cluster centers and standard deviations of the original unlabeled data. An optimal superposition weight is selected for the Gaussian mixture model by measuring the difference between the distribution of the raw data and that of data generated by isotropic Gaussian blobs from the cluster centers and standard deviations of the Gaussian mixture model. Under the optimal superposition weight, the Gaussian mixture model determines the number of clusters and the standard deviations, from which the isotropic Gaussian blobs generate a large amount of labeled, balanced synthetic data. The synthetic data is used to train a deep neural network model, and the trained model performs downstream tasks on the initial unlabeled data. The method achieves sample balance in the generated synthetic data, assists the training of the deep neural network, and improves model accuracy in unsupervised learning scenarios.

Description

Unsupervised deep learning method based on synthetic data generation
Technical Field
The invention belongs to the technical field of unsupervised deep learning methods, and in particular relates to an unsupervised deep learning method based on synthetic data generation.
Background
With the development of wireless networks, mining data and extracting its features for use in applications has become a common task in the data age; examples include unlabeled image classification in computer vision and abnormal traffic detection in networks. In the real world, however, there are many kinds of unlabeled data. Because no labels are available as a reference and guide for training, such data poses a significant challenge for data mining. At present, the core idea of the common methods for mining unlabeled data is to analyze the statistical properties and similarities of the data in order to discover its latent structures and patterns; such methods are referred to as unsupervised training. Among these methods, deep neural networks stand out for their flexibility and their ability to realize complex feature transformations, which makes them suitable for different unsupervised learning tasks.
To extract and summarize the regularities in data, a deep neural network requires a large number of input samples for training. The current mainstream solution is to generate synthetic data from the existing samples, thereby enlarging the set of samples used for training. Synthetic data generation follows a common paradigm: the structure of the original data is analyzed first, and synthetic data is then generated from the analysis result. Examples include data generation based on principal component analysis and synthetic data generation based on Gaussian mixture functions. Although such methods alleviate, to some extent, the shortage of samples in unsupervised deep learning, existing methods pay no attention to sample balance in the generated synthetic data: the number of samples differs across classes, and the features of different classes are poorly separated. This confuses the training process of the deep neural network, so the results are poor in practice and fail to meet practical requirements.
Therefore, how to improve the training process of deep neural networks in existing methods so as to achieve sample balance in the generated synthetic data, assist the training of the deep neural network, and improve model accuracy in unsupervised learning scenarios is a technical problem that currently needs to be solved.
Disclosure of Invention
The invention aims to provide an unsupervised deep learning method based on synthetic data generation that achieves sample balance in the generated synthetic data, assists the training of a deep neural network, and improves model accuracy in unsupervised learning scenarios.
In order to solve the technical problems, the invention adopts the following technical scheme:
An unsupervised deep learning method based on synthetic data generation, comprising the steps of:
S1, a Gaussian mixture model obtains superposition weights from a Bayesian optimizer and finds cluster centers and standard deviations from the unlabeled raw data, and an isotropic Gaussian blob generator uses the cluster centers and standard deviations of the current Gaussian mixture model to generate synthetic data;
S2, measuring the distribution difference between the synthetic data and the raw data, and obtaining the optimal superposition weight for the Gaussian mixture function through Bayesian optimization iterations;
S3, taking the cluster centers and standard deviations of the Gaussian mixture model under the optimal superposition weight as the reference, and using the isotropic Gaussian blob generator as the synthetic data generator, which generates labeled synthetic data whose distribution is similar to that of the raw data;
S4, inputting the labeled synthetic data into a deep neural network model for model training;
S5, inputting the initial unlabeled data into the trained deep neural network model and executing a downstream task.
Preferably, the specific process of finding the cluster centers and standard deviations from the unlabeled raw data in step S1 is as follows:
S11, representing the unlabeled raw data as $X$, containing $N$ samples, where each sample $i$ contains $M$ features and $x_{i,j}$ denotes the value of the $j$-th feature of the $i$-th sample, with $1 \le i \le N$ and $1 \le j \le M$; the unlabeled raw data $X$ is divided into $K$ classes, indexed by $k \in \{1, 2, \dots, K\}$;
S12, clustering the unlabeled raw data $X$, with the specific formula:
$$L = \{l_1, l_2, \dots, l_N\}, \quad l_i \in \{1, 2, \dots, K\};$$
where $L$ is the clustering result, $l_i$ is the class to which sample $i$ belongs, $l_1$ is the class to which sample 1 belongs, $l_2$ is the class to which sample 2 belongs, and $l_N$ is the class to which sample $N$ belongs;
S13, calculating the cluster centers $C$ and standard deviations $\sigma$ from the probability that each data sample belongs to a given cluster.
Preferably, the specific process in step S13 of calculating the cluster centers $C$ and standard deviations $\sigma$ from the probability that each data sample belongs to a given cluster is as follows:
S131, creating a Gaussian mixture model as a linear superposition of $K$ Gaussian models, with the specific formula:
$$p(x) = \sum_{k=1}^{K} w_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k);$$
where $w_k$ is the superposition weight, $\sum_{k=1}^{K} w_k = 1$, $\mu_k$ is the mean vector of the $k$-th Gaussian component, $\Sigma_k$ is the covariance matrix of the $k$-th Gaussian component, and $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is the multivariate normal density function of the $k$-th component;
S132, the Bayesian optimizer randomly initializes the superposition weights $w$ for the Gaussian mixture model and computes, from the raw data, the original sample mean $\mu$ and covariance matrix $\Sigma$, generating the sample means $\mu_k$; at this point the Gaussian mixture model has $3 \times K$ parameters to be updated;
S133, calculating the probability $r_{i,j}$ that each sample $i$ belongs to each cluster $j$, which takes $N \times K$ evaluations per iteration, with the formula:
$$r_{i,j} = \frac{w_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{l=1}^{K} w_l\, \mathcal{N}(x_i \mid \mu_l, \Sigma_l)};$$
where $w_j$ is the mixing weight of the $j$-th component, $X$ denotes the sample set, $x_i$ is the $i$-th sample, $\mu_j$ is the mean vector of the $j$-th Gaussian component, and $\Sigma_j$ is the covariance matrix of the $j$-th Gaussian component;
S134, the isotropic Gaussian blob generator generates labeled, balanced synthetic data of the same scale as the raw data.
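The mixture density of S131 and the membership probabilities of S133 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the patented implementation: each component is taken to be isotropic (covariance equal to a scalar variance times the identity) so that no matrix inversion is needed, and the function names `isotropic_pdf`, `mixture_pdf`, and `responsibilities` are hypothetical.

```python
import math


def isotropic_pdf(x, mean, sigma):
    """Density of N(mean, sigma^2 * I) at point x (x and mean are equal-length lists)."""
    dim = len(x)
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    norm = (2.0 * math.pi * sigma ** 2) ** (dim / 2.0)
    return math.exp(-sq_dist / (2.0 * sigma ** 2)) / norm


def mixture_pdf(x, weights, means, sigmas):
    """Mixture density: sum over components of w_k * N(x | mu_k, sigma_k^2 * I)."""
    return sum(w * isotropic_pdf(x, m, s)
               for w, m, s in zip(weights, means, sigmas))


def responsibilities(samples, weights, means, sigmas):
    """Return r[i][j], the probability that sample i belongs to component j."""
    k = len(weights)
    result = []
    for x in samples:
        joint = [w * isotropic_pdf(x, m, s)
                 for w, m, s in zip(weights, means, sigmas)]
        total = sum(joint)
        if total == 0.0:
            # Every density underflowed; fall back to a uniform assignment.
            result.append([1.0 / k] * k)
        else:
            result.append([v / total for v in joint])
    return result
```

Each row of the returned table sums to 1, and the table has one entry per sample and component, which matches the N × K evaluations mentioned in S133.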
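Preferably, in each iteration of step S133, the parameters are updated by the following formulas:
$$w_j = \frac{1}{N} \sum_{i=1}^{N} r_{i,j};$$
$$\mu_j = \frac{\sum_{i=1}^{N} r_{i,j}\, x_i}{\sum_{i=1}^{N} r_{i,j}};$$
$$\Sigma_j = \frac{\sum_{i=1}^{N} r_{i,j}\, (x_i - \mu_j)(x_i - \mu_j)^{\top}}{\sum_{i=1}^{N} r_{i,j}} + \lambda I;$$
where $\mu_j$ is the sample mean, $N$ is the total number of data samples, $I$ is the identity matrix, and $\lambda$ is the regularization term.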
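The parameter update can be illustrated with a short plain-Python sketch. It assumes the standard expectation-maximization update for a Gaussian mixture, with a regularization term added to the diagonal of each covariance matrix; the function name `m_step` and the argument `reg` are hypothetical, and `resp[i][j]` stands for the probability that sample i belongs to cluster j.

```python
def m_step(samples, resp, reg=1e-6):
    """Update weights, means, and covariances from membership probabilities.

    samples: list of N feature lists; resp: N x K table of probabilities.
    reg is added to the diagonal of every covariance (the regularization term).
    """
    n, dim, k = len(samples), len(samples[0]), len(resp[0])
    weights, means, covs = [], [], []
    for j in range(k):
        # Effective number of samples assigned to component j (guarded against 0).
        nj = max(sum(resp[i][j] for i in range(n)), 1e-12)
        weights.append(nj / n)
        mu = [sum(resp[i][j] * samples[i][d] for i in range(n)) / nj
              for d in range(dim)]
        means.append(mu)
        cov = [[0.0] * dim for _ in range(dim)]
        for i in range(n):
            diff = [samples[i][d] - mu[d] for d in range(dim)]
            for a in range(dim):
                for b in range(dim):
                    cov[a][b] += resp[i][j] * diff[a] * diff[b]
        for a in range(dim):
            for b in range(dim):
                cov[a][b] /= nj
            cov[a][a] += reg  # keeps the matrix well conditioned
        covs.append(cov)
    return weights, means, covs
```

With hard assignments (each row of `resp` containing a single 1), the update reduces to per-cluster sample proportions, means, and covariances.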
Preferably, the specific process in step S134 by which the isotropic Gaussian blob generator generates labeled, balanced synthetic data of the same scale as the raw data is as follows:
S1341, the isotropic Gaussian blob generator generates labels from the parameters of the Gaussian mixture model under the optimal weight: a cluster $k$ is selected at random from the $K$ clusters, the data object $d_i$ then belongs to $k$, and the generator labels the data object $d_i$ accordingly, with $d_i \in D$, obtaining a label set of size $N$ for the dataset $D$;
S1342, the isotropic Gaussian blob data generator, centered on $C_k$ with standard deviation $\sigma_k$ and with $n_k$ as the sample size, generates the data objects $d_i$ by the formula:
$$d_i = C_k + \sigma_k\, \varepsilon_i;$$
where $\varepsilon_i$ follows the normal distribution $\mathcal{N}(0, I)$. The generation process is performed $K$ times, until the isotropic Gaussian blob generator has generated samples for all clusters, finally yielding a balanced dataset $D$ with $N$ synthetic samples.
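A minimal sketch of the balanced generation step follows. It assumes that each cluster receives the same share of the requested samples (the integer quotient of the total by the number of clusters, so any remainder is dropped) and that each feature is drawn independently around the cluster center with the cluster's standard deviation; the function name `make_balanced_blobs` and the `seed` argument are hypothetical.

```python
import random


def make_balanced_blobs(centers, sigmas, n_total, seed=0):
    """Draw the same number of isotropic Gaussian samples around every center.

    centers: list of K center vectors; sigmas: list of K standard deviations.
    Returns (data, labels), where labels[i] is the cluster index of data[i].
    """
    rng = random.Random(seed)
    per_cluster = n_total // len(centers)  # equal share for every cluster
    data, labels = [], []
    for label, (center, sigma) in enumerate(zip(centers, sigmas)):
        for _ in range(per_cluster):
            # d_i = C_k + sigma_k * eps, with eps drawn from N(0, I)
            data.append([c + sigma * rng.gauss(0.0, 1.0) for c in center])
            labels.append(label)
    return data, labels


data, labels = make_balanced_blobs(
    centers=[[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]],
    sigmas=[1.0, 0.5, 2.0],
    n_total=300,
)
```

The generator is called once per cluster, so the class counts are equal by construction. Libraries such as scikit-learn provide a comparable generator (`make_blobs`), which may be what a practical implementation would use.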
Preferably, the specific process in step S2 of obtaining the optimal superposition weight for the Gaussian mixture function through Bayesian optimization iterations is as follows:
S21, judging whether the distribution of the data produced by the generator is similar to that of the raw data by means of the Kolmogorov-Smirnov test, whose statistic $S$ is:
$$S = \sup_x \left| F_m(x) - G_n(x) \right|;$$
where $\sup_x$ denotes the supremum of $|F_m(x) - G_n(x)|$ over $x$, $F_m(x)$ is the empirical distribution function of the raw data, and $G_n(x)$ is the empirical distribution function of the synthetic data:
$$F_m(x) = \frac{1}{m} \sum_{i=1}^{m} I(X_i \le x);$$
$$G_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(Y_i \le x);$$
$I$ is the indicator function, equal to 1 when $X_i \le x$ (respectively $Y_i \le x$) and 0 otherwise; $n$ is the number of synthetic data samples, and $m$ is the number of raw data samples;
S22, repeating steps S12, S13, and S21 to obtain the optimal superposition weight of the Gaussian mixture model.
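The two-sample Kolmogorov-Smirnov statistic of S21 can be computed directly from the two empirical distribution functions. The sketch below assumes one-dimensional data (for multi-feature data it could, for example, be applied per feature, which is an assumption and is not stated in the text); the function name `ks_statistic` is hypothetical.

```python
import bisect


def ks_statistic(original, synthetic):
    """Largest gap between the empirical CDFs of two one-dimensional samples."""
    xs, ys = sorted(original), sorted(synthetic)
    m, n = len(xs), len(ys)
    largest = 0.0
    # Both empirical CDFs are step functions that change only at data points,
    # so it is enough to evaluate the gap at every observed value.
    for value in xs + ys:
        f = bisect.bisect_right(xs, value) / m  # share of original values <= value
        g = bisect.bisect_right(ys, value) / n  # share of synthetic values <= value
        largest = max(largest, abs(f - g))
    return largest
```

A value near 0 indicates that the synthetic sample follows the raw data closely, and a value of 1 means the two samples do not overlap at all. SciPy's `scipy.stats.ks_2samp` computes the same statistic together with a p-value.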
Preferably, the specific process of generating the labeled synthetic data in step S3 is as follows:
S31, running the Gaussian mixture function with the obtained optimal superposition weight of the Gaussian mixture model to obtain the cluster centers $C$ and standard deviations $\sigma$ to which the Gaussian mixture model iterates under that weight;
S32, specifying the required number of samples and running the isotropic Gaussian blob generator once per cluster center in turn, generating labeled, sample-balanced synthetic data $D'$.
Preferably, the specific process in step S4 of inputting the labeled synthetic data into the deep neural network model for model training is as follows:
S41, deploying a deep neural network model comprising a masking layer, a dense layer, a batch normalization layer, and a dropout layer, where the masking layer is the first layer of the deep neural network model, and the dense layer has a plurality of neurons and applies a linear transformation followed by an activation function that adds nonlinearity;
S42, through the masking layer, masking the samples in the input sequence whose value equals a preset specified value $v$, as follows:
$$m_i = \begin{cases} 0, & x_i = v \\ 1, & \text{otherwise} \end{cases};$$
where $m_i$ denotes the $i$-th element of the mask vector and $x_i$ denotes the $i$-th sample value in the input sequence;
S43, computing through the dense layer, with the specific formula:
$$h = \mathrm{ReLU}(W D' + b);$$
where $h$ is the output of the dense layer, $W$ is the learnable weight matrix of the linear transformation, $b$ is the bias vector, and ReLU is the rectified linear unit function, $\mathrm{ReLU}(z) = \max(0, z)$, in which $z$ is the result of the linear transformation in the neural network: ReLU outputs $z$ if $z > 0$ and 0 otherwise; $D'$ is the labeled, sample-balanced synthetic data;
S44, inputting the output of the dense layer into the batch normalization layer for processing, with the specific formula:
$$\hat{h} = \gamma \cdot \frac{h - \mu_B}{\sigma_B} + \beta;$$
where $\mu_B$ and $\sigma_B$ are the mean and standard deviation of $h$ over the current batch, and $\gamma$ and $\beta$ are learnable parameters;
S45, inputting the output of the batch normalization layer into the dropout layer for processing, with the specific formula:
$$\tilde{h} = d \odot \hat{h};$$
where $d$ is a binary mask vector in which each element is 0 with probability $p$ and 1 with probability $1 - p$;
S46, the output $\tilde{h}$ of the dropout layer enters an output dense layer with $k$ neurons and a softmax activation function, which calculates the probability that the input data belongs to each label:
$$y = \mathrm{softmax}(W_o \tilde{h} + b_o);$$
where the dimension of $y$ is $k \times N$, $W_o$ is the weight matrix, and $b_o$ is the bias vector; the softmax activation function is:
$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}};$$
where $z_i$ is the $i$-th element of the input vector $z$;
S47, measuring the predicted label probabilities with the cross-entropy:
$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log \hat{y}_{i, l_i};$$
where $l_i$ denotes the class to which sample $i$ belongs, $\log$ denotes the logarithm, and $\hat{y}_{i, l_i}$ is the predicted probability; using this training loss, the Adam optimizer updates the learnable parameters of the deep neural network until the network converges.
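The output layer's softmax and the cross-entropy loss can be sketched as follows. This is an illustrative sketch: the loss is averaged over the samples, which is a common convention and an assumption here, and the function names `softmax` and `cross_entropy` are hypothetical.

```python
import math


def softmax(z):
    """Turn a list of scores into probabilities that sum to 1."""
    shift = max(z)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, labels):
    """Mean negative log-probability assigned to the true class of each sample.

    probabilities: one probability list per sample; labels: true class indices.
    """
    losses = [-math.log(p[label]) for p, label in zip(probabilities, labels)]
    return sum(losses) / len(losses)
```

For a single sample with two equally likely classes, the loss equals the natural logarithm of 2; it approaches 0 as the probability assigned to the true class approaches 1.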
Preferably, the specific process of step S5 is as follows:
The initial unlabeled data samples X are input into the deep neural network model pre-trained on the synthetic data in step S4, the corresponding results are output, and the downstream task is executed.
The beneficial effects of the invention include:
The invention provides an unsupervised deep learning method based on synthetic data generation. A Gaussian mixture model finds the cluster centers and standard deviations of the original unlabeled data. Bayesian optimization selects the optimal superposition weight for the Gaussian mixture model by measuring the difference between the distribution of the raw data and that of data generated by isotropic Gaussian blobs from the cluster centers and standard deviations of the Gaussian mixture model. Under the optimal superposition weight, the Gaussian mixture model determines the number of clusters and the standard deviations, from which the isotropic Gaussian blobs generate a large amount of labeled, balanced synthetic data. The synthetic data is used to train a deep neural network model, and the trained model performs downstream tasks on the initial unlabeled data. Using the clustering method, a large amount of labeled synthetic data whose distribution is similar to that of the unlabeled data can be generated from the statistical characteristics of the unlabeled data. This provides sufficient training samples for deep neural network training in unsupervised scenarios and improves model performance; it also achieves sample balance in the generated synthetic data, assists the training of the deep neural network, and improves model accuracy in unsupervised learning scenarios.
The invention mines the statistical characteristics of the unlabeled data by optimizing the superposition weights of the Gaussian mixture model with Bayesian optimization, and generates labeled synthetic data to enlarge the training set and provide label references for training the deep neural network model, thereby improving model performance. The per-class sample sizes of the synthetic data are taken into account: the same number of samples is generated for each class, which balances the synthetic data and avoids the confusion that unbalanced generated samples cause during deep neural network training. The structure of the deep neural network can be flexibly adjusted to the specific task, and the output of the pre-trained deep neural network supports further processing, so the requirements of various downstream tasks can be met.
Drawings
Fig. 1 is a schematic flow chart of an unsupervised deep learning method based on synthetic data generation in the present invention.
FIG. 2 is a schematic flow chart of a Bayesian optimizer of the present invention for selecting optimal superposition weights for a Gaussian mixture model.
Detailed Description
The invention is described in further detail below with reference to Figs. 1 and 2:
Example 1
Referring to Figs. 1 and 2, an unsupervised deep learning method based on synthetic data generation includes the following steps:
S1, a Gaussian mixture model randomly obtains superposition weights from a Bayesian optimizer and finds cluster centers and standard deviations from the unlabeled raw data, and an isotropic Gaussian blob generator uses the cluster centers and standard deviations of the current Gaussian mixture model to generate synthetic data;
S2, measuring the distribution difference between the synthetic data and the raw data with the Kolmogorov-Smirnov test, and obtaining the optimal superposition weight for the Gaussian mixture function through Bayesian optimization iterations;
S3, taking the cluster centers and standard deviations of the Gaussian mixture model under the optimal superposition weight as the reference, and using the isotropic Gaussian blob generator as the synthetic data generator, which generates labeled synthetic data whose distribution is similar to that of the raw data.
The Gaussian mixture function is run with the optimal parameters of the Gaussian mixture model, and the isotropic Gaussian blob generator is then run for the desired number of samples $N'$ to generate the labels and the synthetic data. To ensure that the generated data is balanced, the generator takes the cluster centers and variances one by one and generates synthetic data with a balanced sample size for each class. To generate $N'$ samples with $K$ cluster centers, the isotropic Gaussian blob generator is called $K$ times in total. Each call selects one $C_k$ and generates $N'/K$ samples; this finally yields a balanced dataset of size $N'$, in which the cluster index serves as the label of each synthetic sample.
S4, inputting the labeled synthetic data into a deep neural network model for model training. An adjustable deep neural network model is designed in this step to suit the requirements of the downstream task. The deep neural network comprises a masking layer, a dense layer, a batch normalization layer, and a dropout layer.
The combination of these network layers can be flexibly adjusted for different specific tasks. The final output enters an output dense layer with k neurons and a softmax activation function, which calculates the probability that the input data belongs to each label. The cross-entropy of the predicted label probabilities is used as the training loss, and the learnable parameters of the deep neural network model are updated by gradient descent until the network converges.
S5, inputting the initial unlabeled data into the trained deep neural network model, executing the downstream task, and outputting the result.
Example 2
Based on Example 1, the specific process of finding the cluster centers and standard deviations from the unlabeled raw data in step S1 is as follows:
S11, constructing a structured representation of the data: the unlabeled raw data is represented as $X$, containing $N$ samples, where each sample $i$ contains $M$ features and $x_{i,j}$ denotes the value of the $j$-th feature of the $i$-th sample, with $1 \le i \le N$ and $1 \le j \le M$; the unlabeled raw data $X$ is divided into $K$ classes, indexed by $k \in \{1, 2, \dots, K\}$;
S12, clustering the unlabeled raw data $X$, with the specific formula:
$$L = \{l_1, l_2, \dots, l_N\}, \quad l_i \in \{1, 2, \dots, K\};$$
where $L$ is the clustering result, $l_i$ is the class to which sample $i$ belongs, $l_1$ is the class to which sample 1 belongs, $l_2$ is the class to which sample 2 belongs, and $l_N$ is the class to which sample $N$ belongs;
S13, calculating the cluster centers $C$ and standard deviations $\sigma$ from the probability that each data sample belongs to a given cluster.
In this example, the specific process in step S13 of calculating the cluster centers $C$ and standard deviations $\sigma$ from the probability that each data sample belongs to a given cluster is as follows:
S131, creating a Gaussian mixture model as a linear superposition of $K$ Gaussian models, with the specific formula:
$$p(x) = \sum_{k=1}^{K} w_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k);$$
where $w_k$ is the superposition weight, $\sum_{k=1}^{K} w_k = 1$, $\mu_k$ is the mean vector of the $k$-th Gaussian component, $\Sigma_k$ is the covariance matrix of the $k$-th Gaussian component, and $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is the multivariate normal density function of the $k$-th component;
S132, the Bayesian optimizer randomly initializes the superposition weights $w$ for the Gaussian mixture model and computes, from the raw data, the original sample mean $\mu$ and covariance matrix $\Sigma$, generating the sample means $\mu_k$; at this point the Gaussian mixture model has $3 \times K$ parameters to be updated;
S133, calculating the probability $r_{i,j}$ that each sample $i$ belongs to each cluster $j$, which takes $N \times K$ evaluations per iteration, with the formula:
$$r_{i,j} = \frac{w_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{l=1}^{K} w_l\, \mathcal{N}(x_i \mid \mu_l, \Sigma_l)};$$
where $w_j$ is the mixing weight of the $j$-th component, $X$ denotes the sample set, $x_i$ is the $i$-th sample, $\mu_j$ is the mean vector of the $j$-th Gaussian component, and $\Sigma_j$ is the covariance matrix of the $j$-th Gaussian component;
S134, the isotropic Gaussian blob generator generates labeled, balanced synthetic data of the same scale as the raw data.
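In each iteration of step S133, the parameters are updated by the following formulas:
$$w_j = \frac{1}{N} \sum_{i=1}^{N} r_{i,j};$$
$$\mu_j = \frac{\sum_{i=1}^{N} r_{i,j}\, x_i}{\sum_{i=1}^{N} r_{i,j}};$$
$$\Sigma_j = \frac{\sum_{i=1}^{N} r_{i,j}\, (x_i - \mu_j)(x_i - \mu_j)^{\top}}{\sum_{i=1}^{N} r_{i,j}} + \lambda I;$$
where $\mu_j$ is the sample mean, $N$ is the total number of data samples, $I$ is the identity matrix, and $\lambda$ is the regularization term.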
The specific process in step S134 by which the isotropic Gaussian blob generator generates labeled, balanced synthetic data of the same scale as the raw data is as follows:
S1341, the isotropic Gaussian blob generator generates labels from the parameters of the Gaussian mixture model under the optimal weight: a cluster $k$ is selected at random from the $K$ clusters, the data object $d_i$ then belongs to $k$, and the generator labels the data object $d_i$ accordingly, with $d_i \in D$, obtaining a label set of size $N$ for the dataset $D$;
S1342, the isotropic Gaussian blob data generator, centered on $C_k$ with standard deviation $\sigma_k$ and with $n_k$ as the sample size, generates the data objects $d_i$ by the formula:
$$d_i = C_k + \sigma_k\, \varepsilon_i;$$
where $\varepsilon_i$ follows the normal distribution $\mathcal{N}(0, I)$. The generation process is performed $K$ times, until the isotropic Gaussian blob generator has generated samples for all clusters, finally yielding a balanced dataset $D$ with $N$ synthetic samples.
The isotropic Gaussian blob generator produces synthetic data on the same scale as the raw data, while the present invention focuses on the balance of the generated synthetic data. Therefore, during data generation, the generator is called in sequence according to the number of samples and the number of clusters, producing data with balanced sample sizes. The generator takes the cluster centers $C$ and standard deviations $\sigma$ produced by the Gaussian mixture function as its own cluster centers and standard deviations, where $C_k$ and $\sigma_k$ denote the cluster center and standard deviation of a specific class, and generates a dataset that has characteristics similar to the raw data and the same size. To ensure sample balance of the synthetic data, the isotropic Gaussian blob generator is invoked in a loop.
Example 3
Based on Example 1 or Example 2, the specific process in step S2 of obtaining the optimal superposition weight for the Gaussian mixture function through Bayesian optimization iterations is as follows:
S21, judging whether the distribution of the data produced by the generator is similar to that of the raw data by means of the Kolmogorov-Smirnov test, whose statistic $S$ is:
$$S = \sup_x \left| F_m(x) - G_n(x) \right|;$$
where $\sup_x$ denotes the supremum of $|F_m(x) - G_n(x)|$ over $x$.
$F_m(x)$ is the empirical distribution function of the raw data, and $G_n(x)$ is the empirical distribution function of the synthetic data:
$$F_m(x) = \frac{1}{m} \sum_{i=1}^{m} I(X_i \le x);$$
$$G_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(Y_i \le x);$$
$I$ is the indicator function, equal to 1 when $X_i \le x$ (respectively $Y_i \le x$) and 0 otherwise; $n$ is the number of synthetic data samples, and $m$ is the number of raw data samples;
S22, repeating steps S12, S13, and S21 to obtain the optimal superposition weight of the Gaussian mixture model.
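The loop of fitting, generating, and scoring can be sketched end to end on one-dimensional data. The sketch makes several simplifying assumptions that are not taken from the text: plain random search stands in for the Bayesian optimizer, the superposition weights are held fixed while expectation-maximization updates only the means and standard deviations, and all function names (`fit_fixed_weight_gmm`, `sample_balanced`, `ks_statistic`, `search_weights`) are hypothetical.

```python
import bisect
import math
import random


def normal_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def fit_fixed_weight_gmm(data, weights, n_iter=30, reg=1e-6):
    """1-D EM that keeps the given weights fixed and updates means and stds."""
    k = len(weights)
    ordered = sorted(data)
    # Start the means at evenly spaced quantiles and the stds at the overall spread.
    means = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    center = sum(data) / len(data)
    spread = math.sqrt(sum((x - center) ** 2 for x in data) / len(data) + reg)
    stds = [spread] * k
    for _ in range(n_iter):
        resp = []
        for x in data:
            joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
            total = sum(joint)
            resp.append([v / total if total > 0.0 else 1.0 / k for v in joint])
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            stds[j] = math.sqrt(var + reg)
    return means, stds


def sample_balanced(means, stds, n_total, rng):
    """Draw the same number of samples from every cluster."""
    per_cluster = n_total // len(means)
    return [rng.gauss(m, s) for m, s in zip(means, stds) for _ in range(per_cluster)]


def ks_statistic(original, synthetic):
    xs, ys = sorted(original), sorted(synthetic)
    return max(abs(bisect.bisect_right(xs, v) / len(xs)
                   - bisect.bisect_right(ys, v) / len(ys)) for v in xs + ys)


def search_weights(data, k, n_trials=20, seed=0):
    """Random search (a stand-in for Bayesian optimization) over the weights."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        raw = [rng.random() + 1e-3 for _ in range(k)]
        weights = [r / sum(raw) for r in raw]
        means, stds = fit_fixed_weight_gmm(data, weights)
        synthetic = sample_balanced(means, stds, len(data), rng)
        score = ks_statistic(data, synthetic)
        if best is None or score < best[0]:
            best = (score, weights, means, stds)
    return best
```

A Bayesian optimizer would replace the random proposals with proposals guided by the scores of earlier trials; the scoring and selection logic would stay the same.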
The specific process of generating the labeled synthetic data in step S3 is as follows:
S31, running the Gaussian mixture function with the obtained optimal superposition weight of the Gaussian mixture model to obtain the cluster centers $C$ and standard deviations $\sigma$ to which the Gaussian mixture model iterates under that weight;
S32, specifying the required number of samples and running the isotropic Gaussian blob generator once per cluster center in turn, generating labeled, sample-balanced synthetic data $D'$.
Example 4
On the basis of embodiment 1, embodiment 2 or embodiment 3, the specific process of inputting the labeled synthesis data into the deep neural network model for model training in step S4 is as follows:
And S41, deploying a deep neural network model, wherein the deep neural network model comprises a mask layer, a dense layer, a batch normalization layer and a random inactivation layer, the mask layer is the first layer of the deep neural network model, the dense layer is provided with a plurality of neurons and uses linear transformation and a non-linear ReLU activation function, and the combination of the network layers can be flexibly adjusted for different specific tasks.
S42, inputting a median value of the sequence and a preset appointed value through the mask layerThe equal samples are masked as follows:
;
Representing the i-th element of the mask vector, Representing the i-th sample value in the input sequence.
S43, calculating through the dense layer, wherein the specific formula is as follows:
h=ReLU(WD’+b);
Where h is the output of dense layer computation, W is the learnable weight matrix of linear variation, b is the bias vector, reLU is the modified linear unit function, relu=max (0, z), z is the linear transformation result in the neural network, when the linear unit function is modified by ReLU, if z >0, then z is output, otherwise 0, d' is the sample balanced synthesized data with labels.
S44, inputting the output result of the dense layer into a batch normalization layer for processing, wherein the normalization layer can disorder the data sequence, quicken the convergence of the network and improve the generalization capability of the network, and the specific formula is as follows:
;
Wherein, the AndThe mean and standard deviation of h over the current lot,AndIs a parameter that can be learned;
S45, inputting the output of the batch normalization layer into the random inactivation (dropout) layer for processing, which mitigates overfitting of the deep neural network and reduces training time, the specific formula being as follows:
$\tilde{h} = d \odot \hat{h}$;
Wherein d is a binary masking vector, each element of which is 0 with probability p and 1 with probability 1−p;
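The random inactivation step can be illustrated with the following sketch, which multiplies each activation by an independently drawn binary mask element. The function name `dropout`, the explicit `rng` argument and the example values are assumptions. Many frameworks additionally rescale the kept activations by 1/(1−p) during training ("inverted dropout"); that rescaling is deliberately left out here because the source does not state it.

```python
import random


def dropout(h, p, rng):
    """Zero each activation with probability p, keep it with probability 1 - p.

    Returns (masked activations, binary mask d).
    No 1/(1-p) rescaling is applied; that variant is not stated in the source.
    """
    mask = [0 if rng.random() < p else 1 for _ in h]
    return [m * v for m, v in zip(mask, h)], mask


rng = random.Random(42)  # fixed seed so the example is reproducible
out, d = dropout([0.5, 1.2, -0.3, 2.0], p=0.5, rng=rng)
print(d, out)
```

At inference time the layer is normally disabled, which corresponds to calling the sketch with `p=0.0`.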
S46, predicting labels based on the synthetic data: the output $\tilde{h}$ of the random inactivation layer enters an output dense layer with k neurons and a softmax activation function, which calculates the probability that the input data belongs to each label:
$y = \mathrm{softmax}(W_o \tilde{h} + b_o)$;
Wherein the dimension of y is k×N, $W_o$ is the weight matrix and $b_o$ is the bias vector, and the softmax activation function is:
$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}$;
Wherein $z_i$ is the i-th element of the input vector z, and $z_j$ is the j-th element of the input vector z;
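The softmax activation of the output layer can be written in a few lines. This sketch uses the standard definition (exponentiate each element and divide by the sum of the exponentials); subtracting the maximum before exponentiating is a common numerical-stability trick added here as an assumption and does not change the result. The function name and the example logits are illustrative.

```python
import math


def softmax(z):
    """Map a vector of k scores to k probabilities that sum to 1."""
    m = max(z)                            # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```

The predicted label is the index of the largest probability, here class 0.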
S47, training the deep neural network by back propagation: the predicted probabilities of the labels are measured by cross entropy:
$L = -\sum_{i=1}^{N} \log \hat{p}_{i,c_i}$;
Wherein $c_i$ represents the class to which sample i belongs, log denotes the logarithm, and $\hat{p}_{i,c_i}$ is the predicted probability of that class; the training loss is used by the Adam optimizer to update the learnable parameters of the deep neural network until the deep neural network converges.
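The cross-entropy measurement can be sketched as below. This is an illustration under assumptions, not the patent's exact loss: it uses the natural logarithm, averages over the batch (a sum would differ only by a constant factor), and clips probabilities with a small `eps` to avoid taking the logarithm of zero. The function name and sample values are made up.

```python
import math


def cross_entropy(probs, labels, eps=1e-12):
    """Mean negative log-probability assigned to the true class of each sample.

    probs: list of per-sample probability vectors (e.g. softmax outputs).
    labels: list of integer class indices, one per sample.
    eps: clipping constant to avoid log(0) (assumption of this sketch).
    """
    total = 0.0
    for p, c in zip(probs, labels):
        total -= math.log(max(p[c], eps))
    return total / len(labels)


probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
labels = [0, 1]
print(round(cross_entropy(probs, labels), 4))  # 0.2899
```

The loss is 0 when every true class receives probability 1 and grows as the predicted probability of the true class shrinks, which is the quantity an optimizer such as Adam would minimize.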
The specific process of step S5 is as follows:
Inputting the initial unlabeled data samples X into the deep neural network model pre-trained on the synthetic data in step S4, outputting the corresponding result, and executing the downstream task.
According to the invention, the statistical characteristics of the unlabeled data are mined by a Gaussian mixture model whose superposition weights are selected by Bayesian optimization, and labeled synthetic data are generated to expand the training sample size and provide label references for training the deep neural network model, thereby improving model performance. The sample size of each class of synthetic data is taken into account: the same number of samples is generated for every class, so that the synthetic data are sample-balanced and the deep neural network model is not misled by unbalanced generated samples. The structure of the deep neural network can be flexibly adjusted according to the specific task, and the output of the pre-trained deep neural network can be processed further, so that the method adapts to the requirements of various downstream tasks and supports other data processing methods configured for specific problems on the basis of the pre-training results.
In summary, an unsupervised deep learning method based on synthetic data generation is provided. A Gaussian mixture model finds the clustering centers and standard deviations of the original unlabeled data. Bayesian optimization selects the optimal superposition weights for the Gaussian mixture model by measuring the difference between the distribution of the original data and that of the data generated by the isotropic Gaussian spots from the model's clustering centers and standard deviations. Under the optimal superposition weights, the Gaussian mixture model determines the number of clusters and the standard deviations, from which the isotropic Gaussian spots generate a large amount of labeled, balanced synthetic data. The synthetic data are used to train a deep neural network model, and the trained model performs downstream tasks on the initial unlabeled data. Because the labeled synthetic data are generated by clustering from the statistical characteristics of the unlabeled data, their distribution matches that of the original data; this provides sufficient training samples for the deep neural network model in an unsupervised scenario, improves model accuracy, and effectively solves the technical problem that imbalance in current synthetic data generation leads to low accuracy of deep learning models.

Claims (6)

1. An unsupervised deep learning method based on synthetic image data generation, characterized by comprising the following steps:
S1: in an unlabeled image classification task in computer vision, a Gaussian mixture model obtains superposition weights from a Bayesian optimizer and finds cluster centers and standard deviations based on the unlabeled original image data, and isotropic Gaussian spots use the cluster centers and standard deviations of the current Gaussian mixture model to generate synthetic image data;
S2: measuring the distribution difference between the synthetic image data and the original image data, and obtaining the optimal superposition weights for the Gaussian mixture function through Bayesian optimization iterations;
S3: the isotropic Gaussian spots, taking the cluster centers and standard deviations of the Gaussian mixture model under the optimal superposition weights as the reference, serve as a synthetic image data generator, which generates labeled synthetic image data whose distribution is similar to that of the original image data;
S4: inputting the labeled synthetic image data into a deep neural network model for model training;
S5: inputting the initial unlabeled image data into the trained deep neural network model to perform downstream tasks;
the specific process in step S1 of finding the cluster centers and standard deviations based on the unlabeled original image data is as follows:
S11: denoting the unlabeled original image data as X, which contains N samples, each sample i containing M features, where any $x_{i,j}$ represents the value of the j-th feature of the i-th sample, 1 ≤ i ≤ N, 1 ≤ j ≤ M, and dividing the unlabeled original image data X into K classes, K = [1, 2, ···, k];
S12: performing clustering on the unlabeled original image data X, the clustering result consisting of the class to which each of samples 1, 2, ···, N belongs;
S13: calculating the cluster centers C and standard deviations σ from the probabilities of the image data samples under the given clustering;
the specific process in step S13 of calculating the cluster centers C and standard deviations σ is as follows:
S131: creating a Gaussian mixture model as a linear superposition of K Gaussian models, the specific formula being as follows:
$p(x) = \sum_{k=1}^{K} w_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$;
wherein $w_k$ is the superposition parameter, $\mu_k$ is the mean vector of the k-th Gaussian component, $\Sigma_k$ is the covariance matrix of the k-th Gaussian component, and $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is the multivariate normal density function of the k-th component;
S132: the Bayesian optimizer randomly initializes the superposition parameter w for the Gaussian mixture model, and the original sample mean, the covariance matrix and the generated sample mean are calculated based on the original image data, at which point the Gaussian mixture model has 3×k parameters to be updated;
S133: calculating the probability $\gamma_{i,j}$ that each sample i belongs to each cluster, n×k calculations in total, the calculation formula being as follows:
$\gamma_{i,j} = \frac{w_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{l=1}^{K} w_l \, \mathcal{N}(x_i \mid \mu_l, \Sigma_l)}$;
wherein $w_j$ is the mixing weight of the j-th component, x represents the sample set, $x_i$ is the i-th sample, $\mu_j$ is the mean vector of the j-th Gaussian component, and $\Sigma_j$ is the covariance matrix of the j-th Gaussian component;
S134: the isotropic Gaussian spots generate labeled and balanced synthetic image data of the same scale as the original image data;
the specific process in step S134 is as follows:
S1341: the isotropic Gaussian spots generate labels according to the parameters of the Gaussian mixture model under the optimal weights: a cluster center k is randomly selected from the K clusters, the image data object $d_i$ then belongs to k, and the generator generates the label by marking the image data object $d_i$, $d_i \in D$, obtaining a label set of size N for the image data set D;
S1342: the image data generator, namely the isotropic Gaussian spots, generates the image data objects $d_i$ with $C_k$ and $\sigma_k$ as the center and standard deviation and $N/K$ as the sample size, the formula being as follows:
$d_i = C_k + \sigma_k \, \varepsilon$;
wherein $\varepsilon$ obeys the normal distribution $\mathcal{N}(0, I)$; the generation process is executed K times, until the isotropic Gaussian spots have generated samples for all clusters, finally obtaining a balanced image data set D with N synthetic samples.
2. The unsupervised deep learning method based on synthetic image data generation according to claim 1, characterized in that in each iteration of step S133 the parameters are updated by formulas defined in terms of the sample mean, the total number N of image data samples, the identity matrix I, and a regularization term.
3. The unsupervised deep learning method based on synthetic image data generation according to claim 1, characterized in that the specific process in step S2 of obtaining the optimal superposition weights for the Gaussian mixture function through Bayesian optimization iterations is as follows:
S21: using the Kolmogorov-Smirnov test to determine whether the image data generated by the generator is similar in distribution to the original image data, where the statistic S of the Kolmogorov-Smirnov test is:
$S = \sup_x |F_m(x) - G_n(x)|$;
wherein $\sup_x$ denotes the maximum of $|F_m(x) - G_n(x)|$ over x, $F_m(x)$ is the empirical distribution function of the original image data, and $G_n(x)$ is the empirical distribution function of the synthetic image data:
$F_m(x) = \frac{1}{m} \sum_{i=1}^{m} I(X_i \le x)$;
$G_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(Y_i \le x)$;
I is the indicator function, with I = 1 when $X_i \le x$ (respectively $Y_i \le x$), n is the sample size of the synthetic image data, and m represents the number of original image data samples;
S22: executing steps S12, S13 and S21 in a loop to obtain the optimal superposition weights of the Gaussian mixture model.
4. The unsupervised deep learning method based on synthetic image data generation according to claim 3, characterized in that the specific process in step S3 of generating the labeled synthetic image data is as follows:
S31: based on the obtained optimal superposition weights of the Gaussian mixture model, running the Gaussian mixture function to obtain the cluster centers C and standard deviations σ that the Gaussian mixture model reaches by iteration under these weights;
S32: specifying the required number of samples, and running the isotropic Gaussian spots once per cluster center, in turn, to generate the labeled, sample-balanced synthetic image data D’.
5. The unsupervised deep learning method based on synthetic image data generation according to claim 1, characterized in that the specific process in step S4 of inputting the labeled synthetic image data into the deep neural network model for model training is as follows:
S41: deploying a deep neural network model comprising a mask layer, a dense layer, a batch normalization layer and a random inactivation (dropout) layer, wherein the mask layer is the first layer of the deep neural network model, and the dense layer has a plurality of neurons and uses a linear transformation and an activation function that adds non-linearity;
S42: masking, through the mask layer, the samples in the input sequence whose values are equal to a preset specified value $v$, the specific formula being as follows:
$m_i = \begin{cases} 0, & x_i = v \\ 1, & x_i \neq v \end{cases}$;
wherein $m_i$ represents the i-th element of the mask vector, and $x_i$ represents the i-th sample value in the input sequence;
S43: calculating through the dense layer, the specific formula being as follows:
h=ReLU(WD’+b);
wherein h is the output of the dense layer, W is the learnable weight matrix of the linear transformation, b is the bias vector, D’ is the labeled, sample-balanced synthetic image data, and ReLU is the rectified linear unit function, ReLU(z)=max(0,z), where z is the result of the linear transformation in the neural network: if z>0, z is output; otherwise 0 is output;
S44: inputting the output of the dense layer into the batch normalization layer for processing, the specific formula being as follows:
$\hat{h} = \gamma \cdot \frac{h - \mu}{\sigma} + \beta$;
wherein $\mu$ and $\sigma$ are the mean and standard deviation of h over the current batch, and $\gamma$ and $\beta$ are learnable parameters;
S45: inputting the output of the batch normalization layer into the random inactivation layer for processing, the specific formula being as follows:
$\tilde{h} = d \odot \hat{h}$;
wherein d is a binary masking vector, each element of which is 0 with probability p and 1 with probability 1−p;
S46: the output $\tilde{h}$ of the random inactivation layer enters an output dense layer with k neurons and a softmax activation function, which calculates the probability that the input image data belongs to each label:
$y = \mathrm{softmax}(W_o \tilde{h} + b_o)$;
wherein the dimension of y is k×N, $W_o$ is the weight matrix, $b_o$ is the bias vector, and the softmax activation function is:
$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}$;
wherein $z_i$ is the i-th element of the input vector z;
S47: measuring the predicted probabilities of the labels by cross entropy:
$L = -\sum_{i=1}^{N} \log \hat{p}_{i,c_i}$;
wherein $c_i$ represents the class to which sample i belongs, log denotes the logarithm, and $\hat{p}_{i,c_i}$ is the predicted probability of that class; the training loss is used by the Adam optimizer to update the learnable parameters of the deep neural network until the deep neural network converges.
6. The unsupervised deep learning method based on synthetic image data generation according to claim 1, characterized in that the specific process of step S5 is as follows:
inputting the initial unlabeled image data samples X into the deep neural network model pre-trained on the synthetic image data in step S4, outputting the corresponding result, and executing the downstream task.
CN202510354534.2A 2025-03-25 2025-03-25 An unsupervised deep learning method based on synthetic data generation Active CN119862442B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510354534.2A CN119862442B (en) 2025-03-25 2025-03-25 An unsupervised deep learning method based on synthetic data generation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202510354534.2A CN119862442B (en) 2025-03-25 2025-03-25 An unsupervised deep learning method based on synthetic data generation

Publications (2)

Publication Number Publication Date
CN119862442A CN119862442A (en) 2025-04-22
CN119862442B true CN119862442B (en) 2025-08-08

Family

ID=95395005

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510354534.2A Active CN119862442B (en) 2025-03-25 2025-03-25 An unsupervised deep learning method based on synthetic data generation

Country Status (1)

Country Link
CN (1) CN119862442B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111982910A (en) * 2020-07-06 2020-11-24 华南理工大学 Weak supervision machine vision detection method and system based on artificial defect simulation
CN114341928A (en) * 2019-06-26 2022-04-12 塞雷布里优公司 Medical scanning protocol for patient data acquisition analysis in a scanner

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7039239B2 (en) * 2002-02-07 2006-05-02 Eastman Kodak Company Method for image region classification using unsupervised and supervised learning
CN114492574B (en) * 2021-12-22 2025-04-04 中国矿业大学 Unsupervised adversarial domain adaptation image classification method based on pseudo-label loss of Gaussian uniform mixture model
CN117132003B (en) * 2023-10-26 2024-02-06 云南师范大学 Early prediction method for student academic performance of online learning platform based on self-training and semi-supervised learning
CN119322954B (en) * 2024-12-17 2025-04-18 南通市产品质量监督检验所 A supervised data generation method for photovoltaic power plants based on generative adversarial networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant