CN119862442B - An unsupervised deep learning method based on synthetic data generation
- Publication number
- CN119862442B (application CN202510354534.2A)
- Authority
- CN
- China
- Prior art keywords
- image data
- sample
- gaussian
- data
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/10—Pre-processing; Data cleansing
- G06F18/15—Statistical pre-processing, e.g. techniques for normalisation or restoring missing data
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N7/00—Computing arrangements based on specific mathematical models
- G06N7/01—Probabilistic graphical models, e.g. probabilistic networks
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Probability & Statistics with Applications (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computing Systems (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Software Systems (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biomedical Technology (AREA)
- Algebra (AREA)
- Pure & Applied Mathematics (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Health & Medical Sciences (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Image Analysis (AREA)
Abstract
The invention belongs to the technical field of unsupervised deep learning and particularly relates to an unsupervised deep learning method based on synthetic data generation. A Gaussian mixture model searches for the original cluster centers and standard deviations of unlabeled data. Bayesian optimization selects the optimal superposition weights for the Gaussian mixture model by measuring the difference between the original data distribution and data generated by isotropic Gaussian blobs from the model's cluster centers and standard deviations. The Gaussian mixture model then determines the number of clusters and the standard deviations under the optimal superposition weights, from which the isotropic Gaussian blobs generate a large amount of labeled, class-balanced synthetic data. The synthetic data are used to train a deep neural network model, and the trained model executes downstream tasks on the initial unlabeled data. The method achieves sample balance in the generated synthetic data, assists the training of the deep neural network, and improves model accuracy in unsupervised learning scenarios.
Description
Technical Field
The invention belongs to the technical field of unsupervised deep learning and particularly relates to an unsupervised deep learning method based on synthetic data generation.
Background
With the development of wireless networks, mining data and extracting its features for application has become a common task of the data age, for example unlabeled image classification in computer vision or abnormal traffic detection in networks. In the real world, however, unlabeled data of many kinds abounds, and the lack of labels to reference and guide training makes such data a significant challenge for data mining. Currently, the core idea of common methods for mining unlabeled data is to analyze the statistical properties and similarities of the data in order to discover its latent structures and patterns; such methods are referred to as unsupervised learning. Among them, deep neural networks stand out for their flexibility: they can realize complex feature transformations and therefore suit a variety of unsupervised learning tasks.
For a deep neural network to refine and summarize the rules in data, it requires a large amount of data as training input, and the current mainstream solution is to generate synthetic data from existing samples, thereby enlarging the sample pool available for training. Synthetic data generation follows a paradigm in which the structure of the original data is first analyzed and synthetic data are then generated from the analysis result, for example data generation based on principal component analysis, or synthetic data generation based on Gaussian mixture functions. Although such methods alleviate the insufficient sample size of unsupervised deep learning to some extent, they pay no attention to sample balance in the generated synthetic data: the numbers of samples per class are unbalanced, and the features of different classes are poorly separated. This confuses the training process of the deep neural network, degrades performance in practice, and makes it difficult to meet the requirements of real applications.
Therefore, how to improve on the existing methods so as to achieve sample balance in the generated synthetic data, assist the training of the deep neural network, and improve model accuracy in unsupervised learning scenarios is the technical problem to be solved at present.
Disclosure of Invention
The invention aims to provide an unsupervised deep learning method based on synthetic data generation that achieves sample balance in the generated synthetic data, assists the training of a deep neural network, and improves model accuracy in unsupervised learning scenarios.
In order to solve the technical problems, the invention adopts the following technical scheme:
An unsupervised deep learning method based on synthetic data generation, comprising the steps of:
S1, a Gaussian mixture model obtains superposition weights from a Bayesian optimizer and searches for cluster centers and standard deviations in the unlabeled original data; isotropic Gaussian blobs use the current cluster centers and standard deviations of the Gaussian mixture model to generate synthetic data;
S2, the distribution difference between the synthetic data and the original data is measured, and the optimal superposition weights for the Gaussian mixture function are obtained by Bayesian-optimization iteration;
S3, with the cluster centers and standard deviations of the Gaussian mixture model under the optimal superposition weights as a baseline, isotropic Gaussian blobs serve as the synthetic data generator, which produces labeled synthetic data resembling the original data distribution;
S4, the labeled synthetic data are input into a deep neural network model for training;
S5, the initial unlabeled data are input into the trained deep neural network model to execute the downstream task (a library-level sketch of these steps follows).
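For concreteness, here is a minimal sketch of how steps S1-S3 might map onto common Python libraries, assuming scikit-learn's GaussianMixture and make_blobs and SciPy's two-sample Kolmogorov-Smirnov test as stand-ins for the patent's components; all function names and parameter choices are illustrative, not the patent's reference implementation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs
from scipy.stats import ks_2samp

def fit_gmm(X, K, weights_init=None):
    """S1: fit a Gaussian mixture to unlabeled X; return centers and std devs."""
    gmm = GaussianMixture(n_components=K, covariance_type="spherical",
                          weights_init=weights_init, random_state=0).fit(X)
    return gmm.means_, np.sqrt(gmm.covariances_)  # cluster centers C, sigma

def blob_synthesize(C, sigma, n_samples):
    """S3: isotropic Gaussian blobs around each center -> labeled synthetic data."""
    return make_blobs(n_samples=n_samples, centers=C,
                      cluster_std=sigma, random_state=0)

def distribution_gap(X, D_syn):
    """S2: mean per-feature Kolmogorov-Smirnov statistic, original vs synthetic."""
    return np.mean([ks_2samp(X[:, j], D_syn[:, j]).statistic
                    for j in range(X.shape[1])])
```

Steps S4-S5 (training on the synthetic data and running the trained network on the original data) are sketched after the network layers are described below.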
Preferably, the specific process of searching for the cluster centers and standard deviations from the unlabeled original data in step S1 is as follows:
S11, the unlabeled original data are denoted X, containing N samples, each sample i containing M features; $x_{i,j}$ denotes the value of the j-th feature of the i-th sample, where $1 \le i \le N$ and $1 \le j \le M$. The unlabeled original data X are divided into K classes, $k \in \{1, 2, \cdots, K\}$;
S12, cluster the unlabeled original data X according to:

$$\hat{Y} = \mathrm{cluster}(X) = [\hat{y}_1, \hat{y}_2, \cdots, \hat{y}_N]$$

where $\hat{Y}$ is the clustering result and $\hat{y}_i \in \{1, \cdots, K\}$ is the class to which sample i belongs;
S13, calculate the cluster centers C and standard deviations σ from the probability of each data sample under the given clustering, as sketched below.
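Continuing the sketch for S13 (assuming `gmm` is a GaussianMixture fitted with covariance_type="spherical" as above), the per-sample cluster probabilities, cluster centers and standard deviations can be read off directly:

```python
import numpy as np

# Hypothetical continuation of the sketch above; `gmm` is a fitted
# GaussianMixture with spherical (isotropic) covariances.
gamma = gmm.predict_proba(X)       # shape (N, K): P(sample i belongs to cluster k)
C = gmm.means_                     # cluster centers, shape (K, M)
sigma = np.sqrt(gmm.covariances_)  # one standard deviation per cluster
```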
Preferably, the specific process in step S13 of calculating the cluster centers C and standard deviations σ from the sample probabilities under the given clustering is as follows:
S131, create a Gaussian mixture model as a linear superposition of K Gaussian models:

$$p(x) = \sum_{k=1}^{K} w_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

where $w_k$ is the superposition weight with $\sum_{k=1}^{K} w_k = 1$, $\mu_k$ is the mean vector of the k-th Gaussian component, $\Sigma_k$ is the covariance matrix of the k-th Gaussian component, and $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is the multivariate normal density function of the k-th component;
S132, a Bayesian optimizer randomly initializes the superposition weights w for the Gaussian mixture model, and the sample mean $\bar{\mu}$ and covariance matrix $\bar{\Sigma}$ of the original data are computed to initialize the component means; at this point the Gaussian mixture model has 3×K parameters to update (the weight, mean and covariance of each of the K components);
S133, calculate the probability $\gamma_{i,j}$ that each sample i belongs to each cluster j, an N×K computation repeated in every iteration:

$$\gamma_{i,j} = \frac{w_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{l=1}^{K} w_l \, \mathcal{N}(x_i \mid \mu_l, \Sigma_l)}$$

where $w_j$ is the mixing weight of the j-th component with $\sum_{j=1}^{K} w_j = 1$, X denotes the sample set, $x_i$ is the i-th sample, $\mu_j$ is the mean vector of the j-th Gaussian component, and $\Sigma_j$ is the covariance matrix of the j-th Gaussian component;
S134, the isotropic Gaussian blobs generate labeled, class-balanced synthetic data on the same scale as the original data.
Preferably, in each iteration in step S133 the parameters are updated by the following formulas:

$$w_j = \frac{1}{N} \sum_{i=1}^{N} \gamma_{i,j}$$

$$\mu_j = \frac{\sum_{i=1}^{N} \gamma_{i,j} \, x_i}{\sum_{i=1}^{N} \gamma_{i,j}}$$

$$\Sigma_j = \frac{\sum_{i=1}^{N} \gamma_{i,j} (x_i - \mu_j)(x_i - \mu_j)^{\top}}{\sum_{i=1}^{N} \gamma_{i,j}} + \epsilon I$$

where $\mu_j$ is the responsibility-weighted sample mean of the j-th component, N is the total number of data samples, I is the identity matrix, and $\epsilon I$ is a regularization term; a NumPy sketch of one such iteration follows.
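A NumPy rendering of one such iteration under the reconstruction above (E-step responsibilities followed by the three update rules); the function name and the value of ε are assumptions:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, w, mu, Sigma, eps=1e-6):
    """One EM iteration for a K-component Gaussian mixture; eps*I is the
    regularization term added to each covariance estimate."""
    N, M = X.shape
    K = len(w)
    # E-step: responsibilities gamma[i, j] = P(cluster j | sample i)
    dens = np.stack([w[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                     for j in range(K)], axis=1)            # shape (N, K)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means and covariances
    Nk = gamma.sum(axis=0)                                  # effective counts
    w_new = Nk / N
    mu_new = (gamma.T @ X) / Nk[:, None]                    # shape (K, M)
    Sigma_new = np.empty((K, M, M))
    for j in range(K):
        diff = X - mu_new[j]
        Sigma_new[j] = ((gamma[:, j, None] * diff).T @ diff) / Nk[j] \
                       + eps * np.eye(M)
    return w_new, mu_new, Sigma_new
```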
Preferably, the specific procedure in step S134 by which the isotropic Gaussian blobs generate labeled, class-balanced synthetic data on the same scale as the original data is as follows:
S1341, the isotropic Gaussian blobs generate labels from the parameters under the optimal weights of the Gaussian mixture model: a cluster center k is selected among the K clusters, the data object $d_i$ is assigned to cluster k, and the generator attaches label k to the data object $d_i$, $d_i \in D$, yielding a labeled dataset D of size N;
S1342, the data generator (the isotropic Gaussian blobs) takes $C_k$ as the center and $\sigma_k$ as the standard deviation, with $\lfloor N/K \rfloor$ as the per-cluster sample size, and generates each data object $d_i$ as:

$$d_i = C_k + \sigma_k \, \epsilon$$

where $\epsilon$ obeys the standard normal distribution $\mathcal{N}(0, I)$. The generation process is performed K times, until the isotropic Gaussian blobs have produced samples for all clusters, yielding downstream a balanced dataset D with N synthetic samples, as in the sketch below.
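A minimal sketch of S1341/S1342, assuming scikit-learn's make_blobs as the isotropic-Gaussian-blob generator: passing a per-cluster sample count of ⌊N/K⌋ keeps the classes balanced, and the returned blob indices serve directly as the synthetic labels:

```python
import numpy as np
from sklearn.datasets import make_blobs

def generate_balanced(C, sigma, N, seed=0):
    """Generate N labeled synthetic samples, floor(N/K) per cluster center."""
    K = len(C)
    counts = [N // K] * K                       # balanced per-class sample sizes
    D, labels = make_blobs(n_samples=counts, centers=C,
                           cluster_std=sigma, random_state=seed)
    return D, labels                            # labels are the synthetic tags
```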
Preferably, the specific process of obtaining the optimal superposition weights for the Gaussian mixture function by Bayesian-optimization iteration in step S2 is as follows:
S21, judge whether the data generated by the generator resemble the original data distribution through the Kolmogorov-Smirnov test, whose statistic S is:

$$S = \sup_x |F_m(x) - G_n(x)|$$

where $\sup_x$ denotes the supremum of $|F_m(x) - G_n(x)|$ over x, $F_m(x)$ is the empirical distribution function of the original data, and $G_n(x)$ is the empirical distribution function of the synthetic data:

$$F_m(x) = \frac{1}{m} \sum_{i=1}^{m} I(X_i \le x)$$

$$G_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(Y_i \le x)$$

I is the indicator function, equal to 1 when $X_i \le x$ (respectively $Y_i \le x$) and 0 otherwise; n is the synthetic sample size and m is the number of original data samples;
S22, steps S12, S13 and S21 are executed in a loop to obtain the optimal superposition weights of the Gaussian mixture model, as in the sketch below.
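One way to realize the S21/S22 loop, as a sketch: candidate superposition weights are scored by the worst per-feature two-sample KS statistic (SciPy's ks_2samp) and the best candidate is kept. For brevity the candidates are drawn from a Dirichlet distribution rather than proposed by a true Bayesian optimizer (a tool such as scikit-optimize's gp_minimize could be substituted); all names below are illustrative:

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

def ks_gap(X, D_syn):
    """Worst-case per-feature KS statistic between original and synthetic data."""
    return max(ks_2samp(X[:, j], D_syn[:, j]).statistic
               for j in range(X.shape[1]))

def search_weights(X, K, n_trials=30, seed=0):
    """Pick the superposition weights whose blob-generated data best match X."""
    rng = np.random.default_rng(seed)
    best_w, best_s = None, np.inf
    for _ in range(n_trials):
        w = rng.dirichlet(np.ones(K))           # candidate superposition weights
        gmm = GaussianMixture(n_components=K, covariance_type="spherical",
                              weights_init=w, random_state=0).fit(X)
        D_syn, _ = make_blobs(n_samples=len(X), centers=gmm.means_,
                              cluster_std=np.sqrt(gmm.covariances_),
                              random_state=0)
        s = ks_gap(X, D_syn)
        if s < best_s:
            best_w, best_s = w, s
    return best_w, best_s
```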
Preferably, the specific process of generating the labeled synthetic data in step S3 is as follows:
S31, based on the obtained optimal superposition weights of the Gaussian mixture model, run the Gaussian mixture function to obtain the cluster centers C and standard deviations σ iterated by the Gaussian mixture model under those weights;
S32, specify the required number of samples and run the isotropic Gaussian blobs in turn, once per cluster center, to generate labeled, sample-balanced synthetic data D′.
Preferably, the specific process of inputting the labeled synthetic data into the deep neural network model for training in step S4 is as follows:
S41, deploy a deep neural network model comprising a mask layer, a dense layer, a batch normalization layer and a random inactivation (dropout) layer; the mask layer is the first layer of the model, and the dense layer is provided with a number of neurons and applies a linear transformation followed by an activation function to introduce nonlinearity;
S42, the mask layer masks samples in the input sequence whose value equals a preset designated value v:

$$m_i = \begin{cases} 0, & x_i = v \\ 1, & x_i \ne v \end{cases}$$

where $m_i$ denotes the i-th element of the mask vector and $x_i$ denotes the i-th sample value in the input sequence;
S43, compute through the dense layer:

$$h = \mathrm{ReLU}(W D' + b)$$

where h is the output of the dense-layer computation, W is the learnable weight matrix of the linear transformation, b is a bias vector, and ReLU is the rectified linear unit, $\mathrm{ReLU}(z) = \max(0, z)$: for a linear transformation result z in the neural network, ReLU outputs z if z > 0 and 0 otherwise; D′ is the labeled, sample-balanced synthetic data;
S44, feed the dense-layer output into the batch normalization layer:

$$\hat{h} = \gamma \cdot \frac{h - \mu_B}{\sigma_B} + \beta$$

where $\mu_B$ and $\sigma_B$ are the mean and standard deviation of h over the current batch, and $\gamma$ and $\beta$ are learnable parameters (in practice a small constant is added to $\sigma_B$ for numerical stability);
S45, feed the batch-normalized output into the random inactivation layer:

$$\tilde{h} = d \odot \hat{h}$$

where d is a binary masking vector whose elements are 0 with probability p and 1 otherwise, with the surviving activations rescaled by $1/(1-p)$ (inverted dropout);
S46, the output $\tilde{h}$ of the random inactivation layer enters an output dense layer with k neurons and a softmax activation function, which calculates the probability that the input data belongs to each label:

$$y = \mathrm{softmax}(W_o \tilde{h} + b_o)$$

where y has dimension k × N, $W_o$ is a weight matrix and $b_o$ is a bias vector, and the softmax activation function is:

$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}$$

where $z_i$ is the i-th element of the input vector z;
S47, the predicted label probabilities are measured with the cross-entropy loss:

$$L = -\frac{1}{N} \sum_{i=1}^{N} \log \hat{p}_{i, y_i}$$

where $y_i$ denotes the class to which sample i belongs and $\hat{p}_{i, y_i}$ is the predicted probability of that class; the training loss updates the learnable parameters of the deep neural network through the Adam optimizer until the deep neural network converges. A Keras sketch of this network follows.
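A hedged Keras sketch of the S41-S47 network. Keras' Masking layer is designed for sequence inputs, so the S42 mask is applied here as explicit preprocessing instead; the hidden width (128), the mask value v and the dropout rate p are illustrative assumptions rather than values taken from the patent:

```python
import tensorflow as tf

def mask_inputs(X, v=0.0):
    """S42 (assumed preprocessing form): zero out entries equal to the
    designated mask value v, i.e. m_i = 0 where x_i = v, else 1."""
    return X * (X != v).astype(X.dtype)

def build_model(k, p=0.5):
    """S41/S43-S46: dense -> batch norm -> dropout -> k-way softmax head."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),   # S43: h = ReLU(W D' + b)
        tf.keras.layers.BatchNormalization(),            # S44: batch normalization
        tf.keras.layers.Dropout(p),                      # S45: random inactivation
        tf.keras.layers.Dense(k, activation="softmax"),  # S46: label probabilities
    ])
    model.compile(optimizer="adam",                        # S47: Adam optimizer
                  loss="sparse_categorical_crossentropy")  # cross-entropy loss
    return model
```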
Preferably, the specific procedure of step S5 is as follows:
The initial unlabeled data samples X are input into the deep neural network model pre-trained on the synthetic data in step S4, the corresponding results are output, and the downstream task is executed.
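A hypothetical end-to-end usage tying the sketches together: pre-train on the balanced synthetic data from S3 (step S4), then predict on the original unlabeled samples (step S5); the epoch and batch-size values are arbitrary:

```python
# D_syn, labels come from generate_balanced(); X is the original unlabeled data.
model = build_model(k=K)
model.fit(mask_inputs(D_syn), labels, epochs=50, batch_size=64, verbose=0)
pred = model.predict(mask_inputs(X)).argmax(axis=1)   # downstream task output
```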
The beneficial effects of the invention include:
The invention provides an unsupervised deep learning method based on synthetic data generation. A Gaussian mixture model searches for the cluster centers and standard deviations of the original unlabeled data. Bayesian optimization selects the optimal superposition weights for the Gaussian mixture model by measuring the difference between the original data distribution and data generated by isotropic Gaussian blobs from the model's cluster centers and standard deviations. The Gaussian mixture model determines the number of clusters and the standard deviations under the optimal superposition weights, from which the isotropic Gaussian blobs generate a large amount of labeled, balanced synthetic data. The synthetic data are used to train a deep neural network model, and the trained deep neural network model performs downstream tasks on the initial unlabeled data. From the statistical characteristics of the unlabeled data, this clustering-based approach can generate a large amount of labeled synthetic data with a similar distribution, providing sufficient training samples for deep neural network training in unsupervised scenarios, improving the performance of the model, achieving sample balance in the generated synthetic data, assisting the training of the deep neural network, and improving model accuracy in unsupervised learning scenarios.
According to the invention, the statistical characteristics of the unlabeled data are mined through Bayesian optimization of the superposition weights of the Gaussian mixture model, and labeled synthetic data are generated to expand the training sample size; label references are thus provided for deep neural network training, improving model performance. By accounting for the per-class sample sizes of the synthetic data and generating a consistent number of samples for each class, sample balance of the synthetic data is realized, avoiding the confusion that unbalanced generated samples cause in deep neural network learning. The structure of the deep neural network can be flexibly adjusted to the specific task, and the output of the pre-trained deep neural network supports further processing, so that the requirements of various downstream tasks can be met.
Drawings
Fig. 1 is a schematic flow chart of an unsupervised deep learning method based on synthetic data generation in the present invention.
FIG. 2 is a schematic flow chart of a Bayesian optimizer of the present invention for selecting optimal superposition weights for a Gaussian mixture model.
Detailed Description
The invention is further described in detail below with reference to fig. 1-2:
Example 1
Referring to fig. 1 and 2, an unsupervised deep learning method based on synthetic data generation includes the steps of:
S1, the Gaussian mixture model randomly obtains superposition weights from the Bayesian optimizer and searches for cluster centers and standard deviations in the unlabeled original data; isotropic Gaussian blobs use the current cluster centers and standard deviations of the Gaussian mixture model to generate synthetic data;
S2, the distribution difference between the synthetic data and the original data is measured with the Kolmogorov-Smirnov test, and the optimal superposition weights for the Gaussian mixture function are obtained by Bayesian-optimization iteration;
S3, with the cluster centers and standard deviations of the Gaussian mixture model under the optimal superposition weights as a baseline, the isotropic Gaussian blobs serve as the synthetic data generator, which produces labeled synthetic data resembling the original data distribution.
The Gaussian mixture function is run with the optimal parameters of the Gaussian mixture model, and the isotropic Gaussian blobs are then run for the desired number of samples N′ to generate the labels and the synthetic data. To ensure the balance of the generated data, the generator calls each cluster center and variance one by one, producing all classes of synthetic data with balanced sample sizes: to generate N′ samples over the cluster centers, the isotropic Gaussian blobs are called K times in total, each call selecting one $C_k$ and generating $\lfloor N'/K \rfloor$ samples, so that a balanced dataset of size N′ is obtained downstream, with the cluster index serving as the label of each synthetic sample.
S4, the labeled synthetic data are input into the deep neural network model for training; in this step an adjustable deep neural network model is designed to suit the execution requirements of the downstream task. The deep neural network comprises a mask layer, a dense layer, a batch normalization layer and a random inactivation (dropout) layer.
The combination of the above network layers can be flexibly adjusted for different specific tasks. The output finally enters an output dense layer with k neurons and a softmax activation function, which calculates the probability that the input data belongs to each label. The predicted label probabilities are measured with the cross-entropy loss, and the learnable parameters of the deep neural network model are updated by gradient descent until the deep neural network converges.
S5, the initial unlabeled data are input into the trained deep neural network model, the downstream task is executed, and the execution results are output.
Example 2
Based on embodiment 1, the specific process of searching for the cluster centers and standard deviations from the unlabeled original data in step S1 is as follows:
S11, construct a structured expression of the data: the unlabeled original data are denoted X, containing N samples, each sample i containing M features; $x_{i,j}$ denotes the value of the j-th feature of the i-th sample, where $1 \le i \le N$ and $1 \le j \le M$. The unlabeled original data X are divided into K classes, $k \in \{1, 2, \cdots, K\}$;
S12, cluster the unlabeled original data X according to:

$$\hat{Y} = \mathrm{cluster}(X) = [\hat{y}_1, \hat{y}_2, \cdots, \hat{y}_N]$$

where $\hat{Y}$ is the clustering result and $\hat{y}_i \in \{1, \cdots, K\}$ is the class to which sample i belongs;
S13, calculate the cluster centers C and standard deviations σ from the probability of each data sample under the given clustering.
In this embodiment, the specific process in step S13 of calculating the cluster centers C and standard deviations σ from the sample probabilities under the given clustering is as follows:
S131, create a Gaussian mixture model as a linear superposition of K Gaussian models:

$$p(x) = \sum_{k=1}^{K} w_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

where $w_k$ is the superposition weight with $\sum_{k=1}^{K} w_k = 1$, $\mu_k$ is the mean vector of the k-th Gaussian component, $\Sigma_k$ is the covariance matrix of the k-th Gaussian component, and $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is the multivariate normal density function of the k-th component;
S132, a Bayesian optimizer randomly initializes the superposition weights w for the Gaussian mixture model, and the sample mean $\bar{\mu}$ and covariance matrix $\bar{\Sigma}$ of the original data are computed to initialize the component means; at this point the Gaussian mixture model has 3×K parameters to update (the weight, mean and covariance of each of the K components);
S133, calculate the probability $\gamma_{i,j}$ that each sample i belongs to each cluster j, an N×K computation repeated in every iteration:

$$\gamma_{i,j} = \frac{w_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{l=1}^{K} w_l \, \mathcal{N}(x_i \mid \mu_l, \Sigma_l)}$$

where $w_j$ is the mixing weight of the j-th component with $\sum_{j=1}^{K} w_j = 1$, X denotes the sample set, $x_i$ is the i-th sample, $\mu_j$ is the mean vector of the j-th Gaussian component, and $\Sigma_j$ is the covariance matrix of the j-th Gaussian component;
S134, the isotropic Gaussian blobs generate labeled, class-balanced synthetic data on the same scale as the original data.
In each iteration in step S133, the parameters are updated by the following formulas:

$$w_j = \frac{1}{N} \sum_{i=1}^{N} \gamma_{i,j}$$

$$\mu_j = \frac{\sum_{i=1}^{N} \gamma_{i,j} \, x_i}{\sum_{i=1}^{N} \gamma_{i,j}}$$

$$\Sigma_j = \frac{\sum_{i=1}^{N} \gamma_{i,j} (x_i - \mu_j)(x_i - \mu_j)^{\top}}{\sum_{i=1}^{N} \gamma_{i,j}} + \epsilon I$$

where $\mu_j$ is the responsibility-weighted sample mean of the j-th component, N is the total number of data samples, I is the identity matrix, and $\epsilon I$ is a regularization term.
The specific procedure in step S134 by which the isotropic Gaussian blobs generate labeled, class-balanced synthetic data on the same scale as the original data is as follows:
S1341, the isotropic Gaussian blobs generate labels from the parameters under the optimal weights of the Gaussian mixture model: a cluster center k is selected among the K clusters, the data object $d_i$ is assigned to cluster k, and the generator attaches label k to the data object $d_i$, $d_i \in D$, yielding a labeled dataset D of size N;
S1342, the data generator (the isotropic Gaussian blobs) takes $C_k$ as the center and $\sigma_k$ as the standard deviation, with $\lfloor N/K \rfloor$ as the per-cluster sample size, and generates each data object $d_i$ as:

$$d_i = C_k + \sigma_k \, \epsilon$$

where $\epsilon$ obeys the standard normal distribution $\mathcal{N}(0, I)$. The generation process is performed K times, until the isotropic Gaussian blobs have produced samples for all clusters, yielding downstream a balanced dataset D with N synthetic samples.
The isotropic Gaussian blobs generate synthetic data on the same scale as the original data, while the invention focuses on the balance of the generated synthetic data. Therefore, during data generation the isotropic Gaussian blobs are called in order, based on the number of samples and the number of clusters, to produce data with balanced sample sizes. The blobs use the cluster centers C and standard deviations σ produced by the Gaussian mixture function, where $C_k$ and $\sigma_k$ denote the cluster center and standard deviation of a specific class, and generate a dataset that resembles the features of the original data and matches it in size. To ensure sample balance of the synthetic data, the isotropic Gaussian blobs are invoked in a loop, one call per cluster, as sketched below.
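Equivalently, the per-cluster loop can be written out directly in NumPy, matching the generation rule $d_i = C_k + \sigma_k \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$; this is an assumed sketch, with one generator call per cluster center to keep the classes balanced:

```python
import numpy as np

def generate_looped(C, sigma, N, seed=0):
    """Call the blob generator K times, once per cluster center C[k]."""
    rng = np.random.default_rng(seed)
    K, M = C.shape
    per_cluster = N // K                         # balanced sample size per class
    D = np.vstack([C[k] + sigma[k] * rng.standard_normal((per_cluster, M))
                   for k in range(K)])
    labels = np.repeat(np.arange(K), per_cluster)
    return D, labels
```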
Example 3
Based on embodiment 1 or embodiment 2, the specific process of obtaining the optimal superposition weights for the Gaussian mixture function by Bayesian-optimization iteration in step S2 is as follows:
S21, judge whether the data generated by the generator resemble the original data distribution through the Kolmogorov-Smirnov test, whose statistic S is:

$$S = \sup_x |F_m(x) - G_n(x)|$$

where $\sup_x$ denotes the supremum of $|F_m(x) - G_n(x)|$ over x, $F_m(x)$ is the empirical distribution function of the original data, and $G_n(x)$ is the empirical distribution function of the synthetic data:

$$F_m(x) = \frac{1}{m} \sum_{i=1}^{m} I(X_i \le x)$$

$$G_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(Y_i \le x)$$

I is the indicator function, equal to 1 when $X_i \le x$ (respectively $Y_i \le x$) and 0 otherwise; n is the synthetic sample size and m is the number of original data samples;
S22, steps S12, S13 and S21 are executed in a loop to obtain the optimal superposition weights of the Gaussian mixture model.
The specific process of generating the labeled synthetic data in step S3 is as follows:
S31, based on the obtained optimal superposition weights of the Gaussian mixture model, run the Gaussian mixture function to obtain the cluster centers C and standard deviations σ iterated by the Gaussian mixture model under those weights;
S32, specify the required number of samples and run the isotropic Gaussian blobs in turn, once per cluster center, to generate labeled, sample-balanced synthetic data D′.
Example 4
On the basis of embodiment 1, embodiment 2 or embodiment 3, the specific process of inputting the labeled synthetic data into the deep neural network model for training in step S4 is as follows:
S41, deploy a deep neural network model comprising a mask layer, a dense layer, a batch normalization layer and a random inactivation (dropout) layer; the mask layer is the first layer of the model, the dense layer is provided with a number of neurons and uses a linear transformation with a nonlinear ReLU activation function, and the combination of these network layers can be flexibly adjusted for different specific tasks.
S42, the mask layer masks samples in the input sequence whose value equals a preset designated value v:

$$m_i = \begin{cases} 0, & x_i = v \\ 1, & x_i \ne v \end{cases}$$

where $m_i$ denotes the i-th element of the mask vector and $x_i$ denotes the i-th sample value in the input sequence.
S43, compute through the dense layer:

$$h = \mathrm{ReLU}(W D' + b)$$

where h is the output of the dense-layer computation, W is the learnable weight matrix of the linear transformation, b is a bias vector, and ReLU is the rectified linear unit, $\mathrm{ReLU}(z) = \max(0, z)$: for a linear transformation result z in the neural network, ReLU outputs z if z > 0 and 0 otherwise; D′ is the labeled, sample-balanced synthetic data.
S44, the dense-layer output is fed into the batch normalization layer, which stabilizes the distribution of layer inputs across batches, accelerates the convergence of the network, and improves the generalization ability of the network:

$$\hat{h} = \gamma \cdot \frac{h - \mu_B}{\sigma_B} + \beta$$

where $\mu_B$ and $\sigma_B$ are the mean and standard deviation of h over the current batch, and $\gamma$ and $\beta$ are learnable parameters;
S45, the output of the batch normalization layer is fed into the random inactivation layer, which mitigates overfitting of the deep neural network and reduces training time:

$$\tilde{h} = d \odot \hat{h}$$

where d is a binary masking vector whose elements are 0 with probability p and 1 otherwise, with the surviving activations rescaled by $1/(1-p)$ (inverted dropout);
S46, predict labels based on the synthetic data: the output $\tilde{h}$ of the random inactivation layer enters an output dense layer with k neurons and a softmax activation function, which calculates the probability that the input data belongs to each label:

$$y = \mathrm{softmax}(W_o \tilde{h} + b_o)$$

where y has dimension k × N, $W_o$ is a weight matrix and $b_o$ is a bias vector, and the softmax activation function is:

$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}$$

where $z_i$ and $z_j$ are the i-th and j-th elements of the input vector z;
S47, train the deep neural network by back-propagation: the predicted label probabilities are measured with the cross-entropy loss:

$$L = -\frac{1}{N} \sum_{i=1}^{N} \log \hat{p}_{i, y_i}$$

where $y_i$ denotes the class to which sample i belongs and $\hat{p}_{i, y_i}$ is the predicted probability of that class; the training loss updates the learnable parameters of the deep neural network through the Adam optimizer until the deep neural network converges.
The specific process of step S5 is as follows:
The initial unlabeled data samples X are input into the deep neural network model pre-trained on the synthetic data in step S4, the corresponding results are output, and the downstream task is executed.
According to the invention, the statistical characteristics of the unlabeled data are mined through Bayesian optimization of the superposition weights of the Gaussian mixture model, and labeled synthetic data are generated to expand the training sample size; label references are thus provided for deep neural network training, improving model performance. By accounting for the per-class sample sizes of the synthetic data and generating a consistent number of samples for each class, sample balance of the synthetic data is realized, avoiding the confusion that unbalanced generated samples cause in deep neural network learning. The structure of the deep neural network can be flexibly adjusted to the specific task, and the output of the pre-trained deep neural network supports further processing, so that it can adapt to the requirements of various downstream tasks. The method is applicable to a variety of downstream tasks and supports attaching further, problem-specific data processing to the pre-training results.
In summary, an unsupervised deep learning method based on synthetic data generation is provided. A Gaussian mixture model searches for the cluster centers and standard deviations of the original unlabeled data. Bayesian optimization selects the optimal superposition weights for the Gaussian mixture model by measuring the difference between the original data distribution and data generated by isotropic Gaussian blobs from the model's cluster centers and standard deviations. The Gaussian mixture model determines the number of clusters and the standard deviations under the optimal superposition weights, from which the isotropic Gaussian blobs generate a large amount of labeled, balanced synthetic data. The synthetic data are used to train a deep neural network model, and the trained deep neural network model performs downstream tasks on the initial unlabeled data. From the statistical characteristics of the unlabeled data, this clustering-based approach can generate a large amount of labeled synthetic data with a similar distribution, providing sufficient training samples for deep neural network training in unsupervised scenarios and improving the performance of the model. By analyzing the distribution characteristics of the original data, synthetic data matched to the original distribution and balanced across classes are generated through clustering, assisting the training of the deep neural network, improving model accuracy in unsupervised learning scenarios, and effectively addressing the technical problem that imbalance in current synthetic data generation leads to low deep learning model accuracy.
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202510354534.2A (granted as CN119862442B) | 2025-03-25 | 2025-03-25 | An unsupervised deep learning method based on synthetic data generation
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202510354534.2A (granted as CN119862442B) | 2025-03-25 | 2025-03-25 | An unsupervised deep learning method based on synthetic data generation
Publications (2)
Publication Number | Publication Date |
---|---|
CN119862442A CN119862442A (en) | 2025-04-22 |
CN119862442B (en) | 2025-08-08
Family
ID=95395005
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202510354534.2A (active) | An unsupervised deep learning method based on synthetic data generation | 2025-03-25 | 2025-03-25
Country Status (1)
Country | Link |
---|---|
CN (1) | CN119862442B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111982910A (en) * | 2020-07-06 | 2020-11-24 | 华南理工大学 | Weak supervision machine vision detection method and system based on artificial defect simulation |
CN114341928A (en) * | 2019-06-26 | 2022-04-12 | 塞雷布里优公司 | Medical scanning protocol for patient data acquisition analysis in a scanner |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7039239B2 (en) * | 2002-02-07 | 2006-05-02 | Eastman Kodak Company | Method for image region classification using unsupervised and supervised learning |
CN114492574B (en) * | 2021-12-22 | 2025-04-04 | 中国矿业大学 | Unsupervised adversarial domain adaptation image classification method based on pseudo-label loss of Gaussian uniform mixture model |
CN117132003B (en) * | 2023-10-26 | 2024-02-06 | 云南师范大学 | Early prediction method for student academic performance of online learning platform based on self-training and semi-supervised learning |
CN119322954B (en) * | 2024-12-17 | 2025-04-18 | 南通市产品质量监督检验所 | A supervised data generation method for photovoltaic power plants based on generative adversarial networks |
Also Published As
Publication number | Publication date |
---|---|
CN119862442A (en) | 2025-04-22 |
Legal Events

Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication |
 | SE01 | Entry into force of request for substantive examination |
 | GR01 | Patent grant |