CN119862442B

CN119862442B - An unsupervised deep learning method based on synthetic data generation

Info

Publication number: CN119862442B
Application number: CN202510354534.2A
Authority: CN
Inventors: 陈新; 季良杰; 王宇峰
Original assignee: Sichuan Huakun Zhenyu Intelligent Technology Co ltd
Current assignee: Sichuan Huakun Zhenyu Intelligent Technology Co ltd
Priority date: 2025-03-25
Filing date: 2025-03-25
Publication date: 2025-08-08
Anticipated expiration: 2045-03-25
Also published as: CN119862442A

Abstract

The invention belongs to the technical field of unsupervised deep learning, and in particular relates to an unsupervised deep learning method based on synthetic data generation. A Gaussian mixture model finds the cluster centers and standard deviations of the unlabeled original data. Isotropic Gaussian blobs generate data from those cluster centers and standard deviations, and the difference between the distribution of the generated data and that of the original data is measured in order to select an optimal superposition weight for the Gaussian mixture model. Under the optimal superposition weight, the Gaussian mixture model determines the number of clusters and the standard deviations, from which the isotropic Gaussian blobs generate a large amount of labeled, balanced synthetic data. The synthetic data is used to train a deep neural network model, and the trained model then performs downstream tasks on the initial unlabeled data. The method achieves sample balance in the generated synthetic data, assists the training of the deep neural network, and improves model accuracy in unsupervised learning scenarios.

Description

Unsupervised deep learning method based on synthetic data generation

Technical Field

The invention belongs to the technical field of unsupervised deep learning, and in particular relates to an unsupervised deep learning method based on synthetic data generation.

Background

With the development of wireless networks, mining data and extracting its features for use in applications has become a common task in the data age; examples include unlabeled image classification in computer vision and abnormal traffic detection in networks. In the real world, however, much of this data is unlabeled. Such data poses a significant challenge for data mining because there are no labels to serve as a reference and guide for training. For data mining tasks on unlabeled data, the core idea of current common methods is to analyze the statistical properties and similarities of the data in order to find its latent structures and patterns; such methods are referred to as unsupervised training. Among these methods, deep neural networks stand out because they are flexible and can realize complex feature transformations, which makes them suitable for different unsupervised learning tasks.

A deep neural network needs a large amount of data as training samples in order to distill and summarize the regularities in the data, and the current mainstream solution is to generate synthetic data from the existing samples, thereby enlarging the set of samples involved in training. Synthetic data generation follows a paradigm in which the structure of the original data is analyzed first and the synthetic data is then generated from the analysis result; examples include data generation methods based on principal component analysis and synthetic data generation methods based on Gaussian mixture functions. Although such methods alleviate the shortage of samples in unsupervised deep learning to some extent, existing methods pay no attention to the sample balance of the generated synthetic data: the number of samples differs between classes, and the features of the different classes are poorly separated. This confuses the training process of the deep neural network, so the results are poor in practice and fail to meet the requirements of practical applications.

Therefore, how to improve the training process of the deep neural network in existing methods so as to balance the samples of the generated synthetic data, assist the training of the deep neural network, and improve model accuracy in unsupervised learning scenarios is a technical problem that currently needs to be solved.

Disclosure of Invention

The invention aims to provide an unsupervised deep learning method based on synthetic data generation, which balances the samples of the generated synthetic data, assists the training of a deep neural network, and improves model accuracy in unsupervised learning scenarios.

In order to solve the technical problems, the invention adopts the following technical scheme:

An unsupervised deep learning method based on synthetic data generation, comprising the steps of:

S1, a Gaussian mixture model obtains superposition weights from a Bayesian optimizer and finds the cluster centers and standard deviations based on the unlabeled original data, and isotropic Gaussian blobs use the cluster centers and standard deviations of the current Gaussian mixture model to generate synthetic data;

S2, measuring the distribution difference between the synthetic data and the original data, and obtaining an optimal superposition weight for the Gaussian mixture function through Bayesian optimization iterations;

S3, the isotropic Gaussian blobs, taking the cluster centers and standard deviations of the Gaussian mixture model under the optimal superposition weight as a reference, serve as a synthetic data generator, which generates labeled synthetic data whose distribution is similar to that of the original data;

S4, inputting the labeled synthetic data into a deep neural network model for model training;

S5, inputting the initial unlabeled data into the trained deep neural network model, and executing a downstream task.

Preferably, the specific process of searching the clustering center and standard deviation based on the unlabeled raw data in the step S1 is as follows:

S11, the unlabeled original data is denoted X and contains N samples, each sample i containing M features; any $x_{i,j}$ denotes the value of the j-th feature of the i-th sample, where 1 ≤ i ≤ N and 1 ≤ j ≤ M; the unlabeled original data X is divided into K classes, where K = [1, 2, ···, k];

S12, clustering the unlabeled original data X, wherein the specific formula is as follows:

$L = [l_1, l_2, \cdots, l_N]$;

Wherein $L$ is the clustering result, $l_i$ is the class to which sample i belongs, $l_i \in K$, $l_1$ is the class to which sample 1 belongs, $l_2$ is the class to which sample 2 belongs, and $l_N$ is the class to which sample N belongs;

S13, calculating the cluster centers C and standard deviations σ from the probability that each data sample belongs to a given cluster.

Preferably, the specific process of calculating the cluster centers C and standard deviations σ from the probability that each data sample belongs to a given cluster in step S13 is as follows:

S131, creating a Gaussian mixture model based on linear superposition of K Gaussian models, wherein the specific formula is as follows:

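$p(x) = \sum_{k=1}^{K} w_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$;

Wherein $w_k$ is the superposition parameter, $\sum_{k=1}^{K} w_k = 1$, $\mu_k$ is the mean vector of the k-th Gaussian component, $\Sigma_k$ is the covariance matrix of the k-th Gaussian component, and $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is the multivariate normal density function of the k-th component;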
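The mixture density of step S131 can be illustrated with a minimal sketch. It assumes isotropic components (one standard deviation per component rather than a full covariance matrix), which is a simplification chosen here for brevity; the function names `isotropic_normal_pdf` and `mixture_density` are hypothetical and not taken from the source.

```python
import math

def isotropic_normal_pdf(x, mean, sigma):
    """Density of N(mean, sigma^2 * I) at point x (x and mean are lists of floats)."""
    m = len(x)
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    norm = (2.0 * math.pi * sigma ** 2) ** (-m / 2.0)
    return norm * math.exp(-sq_dist / (2.0 * sigma ** 2))

def mixture_density(x, weights, means, sigmas):
    """Linear superposition of K Gaussian densities with superposition weights."""
    return sum(w * isotropic_normal_pdf(x, mu, s)
               for w, mu, s in zip(weights, means, sigmas))

# Example: two equally weighted 2-D components.
density = mixture_density([0.0, 0.0], [0.5, 0.5], [[0.0, 0.0], [4.0, 4.0]], [1.0, 1.0])
```

With a single component of weight 1, the mixture reduces to an ordinary normal density, which is a convenient sanity check.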

S132, the Bayesian optimizer randomly initializes the superposition parameters w for the Gaussian mixture model, and the original sample mean $\mu$ and covariance matrix $\Sigma$ are calculated from the original data to generate the sample means $\mu_k$; at this point the Gaussian mixture model has 3×K parameters to be updated;

S133, calculating the probability $\gamma_{ij}$ that each sample i belongs to each cluster j, for a total of N×K calculations, wherein the calculation formula is as follows:

$\gamma_{ij} = \dfrac{w_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{l=1}^{K} w_l \, \mathcal{N}(x_i \mid \mu_l, \Sigma_l)}$;

Wherein $w_j$ is the mixing weight of the j-th component, $\sum_{j=1}^{K} w_j = 1$, X represents the sample set, $x_i$ is the i-th sample, $\mu_j$ is the mean vector of the j-th Gaussian component, and $\Sigma_j$ is the covariance matrix of the j-th Gaussian component;
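The membership probabilities of step S133 can be sketched as follows. This is a one-dimensional illustration under the assumption of scalar samples and scalar standard deviations; the names `normal_pdf` and `responsibilities` are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of the univariate normal distribution N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def responsibilities(xs, weights, mus, sigmas):
    """Return gamma, where gamma[i][j] is the probability that sample i belongs to cluster j."""
    gamma = []
    for x in xs:
        # Weighted density of x under each component.
        joint = [w * normal_pdf(x, mu, s) for w, mu, s in zip(weights, mus, sigmas)]
        total = sum(joint)
        gamma.append([value / total for value in joint])
    return gamma

gamma = responsibilities([0.0, 5.0, 10.0], [0.5, 0.5], [0.0, 10.0], [1.0, 1.0])
```

Each row of `gamma` sums to 1, and a sample that is equidistant from two equally weighted, equally wide components is split evenly between them.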

S134, the isotropic Gaussian blob generates labeled and balanced synthetic data of the same scale as the original data.

Preferably, in each iteration in step S133, the parameters are updated by the following formula:

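$N_k = \sum_{i=1}^{N} \gamma_{ik}, \quad w_k = \dfrac{N_k}{N}, \quad \mu_k = \dfrac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} x_i, \quad \Sigma_k = \dfrac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^{\top} + \epsilon I$;

Wherein $\mu_k$ is the sample mean, $N_k$ is the sum over all samples of the probability $\gamma_{ik}$ that sample i belongs to cluster k, N is the total number of data samples, I is the identity matrix, and $\epsilon$ is a regularization term.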
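A parameter update of the kind described for each iteration of step S133 can be sketched as below. This is the textbook maximization step of the expectation-maximization algorithm for a one-dimensional mixture, with a small constant added to each variance in place of the regularization term; it is an assumption that the source's update has this form, and the name `m_step` and the default `eps` are hypothetical.

```python
def m_step(xs, gamma, eps=1e-6):
    """Update weights, means and variances from membership probabilities.

    xs is a list of scalar samples; gamma[i][j] is the probability that
    sample i belongs to cluster j. eps regularizes each variance.
    """
    n = len(xs)
    k = len(gamma[0])
    weights, mus, variances = [], [], []
    for j in range(k):
        n_j = sum(gamma[i][j] for i in range(n))           # effective cluster size
        mu_j = sum(gamma[i][j] * xs[i] for i in range(n)) / n_j
        var_j = sum(gamma[i][j] * (xs[i] - mu_j) ** 2 for i in range(n)) / n_j + eps
        weights.append(n_j / n)
        mus.append(mu_j)
        variances.append(var_j)
    return weights, mus, variances

# Hard assignments: the first two samples to cluster 0, the last two to cluster 1.
weights, mus, variances = m_step([0.0, 2.0, 10.0, 12.0],
                                 [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
```

The regularization keeps each variance strictly positive even when a cluster collapses onto a single point.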

Preferably, the specific process by which the isotropic Gaussian blob generates labeled and balanced synthetic data of the same scale as the original data in step S134 is as follows:

S1341, the isotropic Gaussian blob generates labels according to the parameters of the Gaussian mixture model under the optimal weight: a cluster k is randomly selected from the K clusters, and the data object $d_i$ is assigned to cluster k; the generator labels each data object $d_i \in D$ accordingly, obtaining a label set of size N for the data set D;

S1342, the data generator, i.e. the isotropic Gaussian blob, takes $C_k$ and $\sigma_k$ as its center and standard deviation and $N/K$ as the sample size, and generates each data object $d_i$ by the following formula:

$d_i = C_k + \sigma_k \xi_i$;

Wherein $\xi_i$ obeys the normal distribution $\mathcal{N}(0, I)$; the generation process is performed K times, until the isotropic Gaussian blob has generated samples for all clusters, finally yielding a balanced data set D with N synthetic samples.
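The balanced generation loop of steps S1341 and S1342 can be sketched as follows: one call per cluster, the same number of samples per call, and the cluster index used as the label. The function name `generate_balanced_blobs` and the fixed seed are assumptions made for illustration, and any remainder of the total sample count divided by the number of clusters is simply dropped in this sketch.

```python
import random

def generate_balanced_blobs(centers, sigmas, n_total, seed=0):
    """Draw the same number of samples around every cluster center.

    centers is a list of K points (lists of floats), sigmas a list of K
    standard deviations. Returns (data, labels).
    """
    rng = random.Random(seed)
    per_cluster = n_total // len(centers)
    data, labels = [], []
    for label, (center, sigma) in enumerate(zip(centers, sigmas)):
        for _ in range(per_cluster):
            # Isotropic Gaussian noise: same sigma in every dimension.
            data.append([c + sigma * rng.gauss(0.0, 1.0) for c in center])
            labels.append(label)
    return data, labels

data, labels = generate_balanced_blobs([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]],
                                       [1.0, 0.5, 2.0], n_total=300)
```

Because every cluster receives exactly the same number of samples, the class counts of the synthetic data are equal by construction, regardless of how unevenly the original data was distributed over the clusters.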

Preferably, the specific process of obtaining the optimal superposition weight for the gaussian mixture function based on bayesian optimization iteration in step S2 is as follows:

S21, the Kolmogorov-Smirnov test is used to judge whether the data generated by the generator is similar in distribution to the original data, wherein the statistic S of the Kolmogorov-Smirnov test is:

S=sup_x|F_m(x)-G_n(x)|;

wherein sup_x denotes the supremum of |F_m(x)-G_n(x)| over x, F_m(x) is the empirical distribution function of the original data, and G_n(x) is the empirical distribution function of the synthetic data:

$F_m(x) = \dfrac{1}{m} \sum_{i=1}^{m} I(X_i \le x), \quad G_n(x) = \dfrac{1}{n} \sum_{i=1}^{n} I(Y_i \le x)$;

I is the indicator function, which equals 1 when $X_i \le x$ (respectively $Y_i \le x$) and 0 otherwise; n is the number of synthetic data samples, and m is the number of original data samples;
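The two-sample Kolmogorov-Smirnov statistic can be computed directly from the two empirical distribution functions. The sketch below handles one-dimensional samples only; applying it feature by feature to multi-feature data is an assumption, and the name `ks_statistic` is hypothetical.

```python
def ks_statistic(original, synthetic):
    """Largest absolute gap between the two empirical distribution functions."""
    xs, ys = sorted(original), sorted(synthetic)
    m, n = len(xs), len(ys)
    i = j = 0
    s = 0.0
    while i < m and j < n:
        v = min(xs[i], ys[j])
        # Advance both pointers past every value <= v, so that i/m and j/n
        # are the empirical distribution functions evaluated at v.
        while i < m and xs[i] <= v:
            i += 1
        while j < n and ys[j] <= v:
            j += 1
        s = max(s, abs(i / m - j / n))
    return s

s_same = ks_statistic([1, 2, 3], [1, 2, 3])       # identical samples
s_apart = ks_statistic([1, 2, 3], [4, 5, 6])      # fully separated samples
```

A statistic near 0 indicates that the synthetic data follows the original distribution closely, and a statistic near 1 indicates that the two samples barely overlap. Once one sample is exhausted the gap can only shrink, so the loop may stop there.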

S22, steps S12, S13 and S21 are executed in a loop to obtain the optimal superposition weight of the Gaussian mixture model.
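The loop of step S22 can be outlined as a search over candidate superposition weights that keeps the candidate with the smallest distribution difference. The sketch below uses plain random search as a stand-in for the Bayesian optimizer, so it is not Bayesian optimization; `score_fn` is a hypothetical callable that is assumed to run steps S12, S13 and S21 for a given weight vector and return the Kolmogorov-Smirnov statistic.

```python
import random

def search_superposition_weights(score_fn, k, n_trials=100, seed=0):
    """Return (best_weights, best_score) over randomly sampled weight vectors.

    Random search stands in for the Bayesian optimizer of the method.
    """
    rng = random.Random(seed)
    best_weights, best_score = None, float("inf")
    for _ in range(n_trials):
        raw = [rng.expovariate(1.0) for _ in range(k)]
        total = sum(raw)
        weights = [r / total for r in raw]   # non-negative and summing to 1
        score = score_fn(weights)            # lower means closer distributions
        if score < best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score

# Toy score for illustration only: squared distance to a made-up target.
target = [0.3, 0.7]
toy_score = lambda w: sum((a - b) ** 2 for a, b in zip(w, target))
weights, score = search_superposition_weights(toy_score, k=2, n_trials=200)
```

A Bayesian optimizer would replace the random sampling with proposals guided by the scores observed so far, which typically needs fewer evaluations of the comparatively expensive fit-generate-test cycle.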

Preferably, the specific process of generating the tagged synthetic data in step S3 is as follows:

S31, based on the obtained optimal superposition weight of the Gaussian mixture model, running the Gaussian mixture function to obtain the cluster centers C and standard deviations σ that the Gaussian mixture model converges to under that weight;

S32, designating the required number of samples and running the isotropic Gaussian blob in turn for each cluster center to generate the labeled, sample-balanced synthetic data D'.

Preferably, in step S4, the specific process of inputting the labeled synthesis data into the deep neural network model for model training is as follows:

S41, deploying a deep neural network model comprising a mask layer, dense layers, a batch normalization layer and a dropout (random inactivation) layer, wherein the mask layer is the first layer of the deep neural network model, and each dense layer has a plurality of neurons and applies a linear transformation followed by an activation function that adds nonlinearity;

S42, the mask layer masks the samples in the input sequence whose value equals a preset specified value $v$, as follows:

$m_i = \begin{cases} 0, & x_i = v \\ 1, & \text{otherwise} \end{cases}$;

Wherein $m_i$ represents the i-th element of the mask vector and $x_i$ represents the i-th sample value in the input sequence;
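The masking of step S42 can be sketched in a few lines. The convention that masked positions are marked 0 and kept positions 1 is an assumption, as is the name `mask_vector`.

```python
def mask_vector(sequence, mask_value):
    """Return 0 for entries equal to mask_value and 1 for all other entries."""
    return [0 if x == mask_value else 1 for x in sequence]

mask = mask_vector([1.5, 0.0, 3.0, 0.0], mask_value=0.0)
```

Later layers can then skip or zero out the positions where the mask is 0, so that placeholder values do not influence training.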

S43, calculating through the dense layer, wherein the specific formula is as follows:

h = ReLU(WD' + b);

Wherein h is the output of the dense layer, W is the learnable weight matrix of the linear transformation, b is the bias vector, and ReLU is the rectified linear unit function, ReLU(z) = max(0, z), where z is the result of the linear transformation in the neural network: the ReLU outputs z if z > 0 and 0 otherwise; D' is the labeled, sample-balanced synthetic data;
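The dense layer computation can be illustrated for a single input sample. The sketch stores the weight matrix as a list of rows, one per neuron; the names `relu` and `dense_relu` are hypothetical.

```python
def relu(z):
    """Rectified linear unit: z if z > 0, otherwise 0."""
    return max(0.0, z)

def dense_relu(weights, bias, x):
    """Apply a linear transformation followed by ReLU to one sample x.

    weights is a list of rows (one row per neuron), bias a list with one
    entry per neuron.
    """
    return [relu(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

h = dense_relu([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.5], [2.0, 3.0])
```

In this example the first neuron's linear output is negative and is clipped to 0, while the second neuron's positive output passes through unchanged.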

S44, inputting the output result of the dense layer into a batch normalization layer for processing, wherein the specific formula is as follows:

$\hat{h} = \gamma \cdot \dfrac{h - \mu_B}{\sigma_B} + \beta$;

Wherein $\mu_B$ and $\sigma_B$ are the mean and standard deviation of h over the current batch, and $\gamma$ and $\beta$ are learnable parameters;
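Batch normalization of a batch of scalar activations can be sketched as below. The small constant `eps` inside the square root is a common numerical-stability addition and is an assumption here, as are the function name `batch_norm` and the default values of the scale and shift parameters.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize activations to zero mean and unit variance, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((h - mean) ** 2 for h in batch) / n
    return [gamma * (h - mean) / math.sqrt(var + eps) + beta for h in batch]

normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
```

With the default scale of 1 and shift of 0, the output has a mean of approximately 0 and a variance of approximately 1; during training the two parameters are learned together with the other network weights.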

S45, inputting the output of the batch normalization layer into the dropout layer for processing, wherein the specific formula is as follows:

$\tilde{h} = d \odot \hat{h}$;

Wherein $\hat{h}$ is the output of the batch normalization layer and d is a binary mask vector, each element of which is 0 with probability p and 1 with probability 1 − p;
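The dropout layer can be sketched as an element-wise product with a random binary mask. The sketch omits the rescaling of the surviving activations by 1/(1 − p) that many libraries apply during training; whether the source applies such rescaling is not stated, and the name `dropout` and the fixed seed are assumptions.

```python
import random

def dropout(h, p, seed=0):
    """Zero each element of h with probability p; return (output, mask)."""
    rng = random.Random(seed)
    d = [0 if rng.random() < p else 1 for _ in h]
    return [di * hi for di, hi in zip(d, h)], d

dropped, mask = dropout([1.0, 2.0, 3.0, 4.0], p=0.5)
```

Every output element is either the original activation or 0, depending on the corresponding mask element. At inference time dropout is normally disabled, which corresponds to p = 0 in this sketch.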

S46, the output $\tilde{h}$ of the dropout layer enters an output dense layer with k neurons and a softmax activation function, which calculates the probability that the input data belongs to each label:

$y = \mathrm{softmax}(W_o \tilde{h} + b_o)$;

wherein the dimension of y is k×N, $W_o$ is the weight matrix, $b_o$ is the bias vector, and the softmax activation function is:

$\mathrm{softmax}(z_i) = \dfrac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}$;

wherein $z_i$ is the i-th element of the input vector z;
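The softmax activation can be sketched as follows. Subtracting the largest input before exponentiating does not change the result but avoids overflow for large inputs; that shift is an implementation choice made here, not something stated in the source.

```python
import math

def softmax(z):
    """Map a list of scores to probabilities that sum to 1."""
    shift = max(z)                       # for numerical stability
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
```

The outputs are positive, preserve the ordering of the inputs, and sum to 1, so they can be read as the predicted probability of each label.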

S47, the predicted probabilities of the labels are measured by cross entropy:

$Loss = -\sum_{i=1}^{N} t_i \log(\hat{y}_i)$;

Wherein $t_i$ represents the class to which sample i belongs, log denotes the natural logarithm, and $\hat{y}_i$ is the predicted probability; the training loss is used by the Adam optimizer to update the learnable parameters of the deep neural network until the deep neural network converges.
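The cross-entropy loss can be sketched for integer class labels as below. Averaging over the samples and clipping probabilities at a small `eps` to avoid taking the logarithm of 0 are both assumptions of this sketch, as is the name `cross_entropy`.

```python
import math

def cross_entropy(true_labels, predicted_probs, eps=1e-12):
    """Mean negative log-probability assigned to the true class of each sample.

    true_labels[i] is the class index of sample i, and predicted_probs[i]
    is the list of predicted class probabilities for sample i.
    """
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        total -= math.log(max(probs[label], eps))
    return total / len(true_labels)

loss = cross_entropy([0, 1], [[0.9, 0.1], [0.2, 0.8]])
```

The loss is 0 when the network assigns probability 1 to the true class of every sample and grows as the predicted probability of the true class falls.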

Preferably, the specific procedure of step S5 is as follows:

The initial unlabeled data sample X is input into the deep neural network model pre-trained on the synthetic data in step S4, the corresponding result is output, and the downstream task is executed.

The beneficial effects of the invention include:

The invention provides an unsupervised deep learning method based on synthetic data generation. A Gaussian mixture model finds the cluster centers and standard deviations of the original unlabeled data. Bayesian optimization selects the optimal superposition weight for the Gaussian mixture model by measuring the difference between the distribution of the data that isotropic Gaussian blobs generate from those cluster centers and standard deviations and the distribution of the original data. Under the optimal superposition weight, the Gaussian mixture model determines the number of clusters and the standard deviations, from which the isotropic Gaussian blobs generate a large amount of labeled, balanced synthetic data. The synthetic data is used to train a deep neural network model, and the trained model performs downstream tasks on the initial unlabeled data. Based on the clustering method and the statistical characteristics of the unlabeled data, a large amount of labeled synthetic data with a distribution similar to that of the original data can be generated. This provides sufficient training samples for deep neural network model training in unsupervised scenarios and improves model performance; it also balances the samples of the generated synthetic data, assists the training of the deep neural network, and improves model accuracy in unsupervised learning scenarios.

According to the invention, the statistical characteristics of the unlabeled data are mined by optimizing the superposition weight of the Gaussian mixture model with Bayesian optimization, and labeled synthetic data is generated to enlarge the training sample size and to provide label references for deep neural network model training, thereby improving model performance. The sample size of each class of synthetic data is taken into account: the same number of samples is generated for every class, which balances the samples of the synthetic data and avoids the confusion that unbalanced generated samples cause in deep neural network model learning. The structure of the deep neural network can be flexibly adjusted for specific tasks, and the output of the pre-trained deep neural network can be further processed, so that the requirements of various downstream tasks can be met.

Drawings

Fig. 1 is a schematic flow chart of an unsupervised deep learning method based on synthetic data generation in the present invention.

FIG. 2 is a schematic flow chart of a Bayesian optimizer of the present invention for selecting optimal superposition weights for a Gaussian mixture model.

Detailed Description

The invention is further described in detail below with reference to fig. 1-2:

Example 1

Referring to fig. 1 and 2, an unsupervised deep learning method based on synthetic data generation includes the steps of:

S1, a Gaussian mixture model obtains randomly initialized superposition weights from a Bayesian optimizer and finds the cluster centers and standard deviations based on the unlabeled original data, and isotropic Gaussian blobs use the cluster centers and standard deviations of the current Gaussian mixture model to generate synthetic data;

S2, measuring the distribution difference between the synthetic data and the original data with the Kolmogorov-Smirnov test, and obtaining an optimal superposition weight for the Gaussian mixture function through Bayesian optimization iterations;

S3, the isotropic Gaussian blobs, taking the cluster centers and standard deviations of the Gaussian mixture model under the optimal superposition weight as a reference, serve as a synthetic data generator, which generates labeled synthetic data whose distribution is similar to that of the original data.

The Gaussian mixture function is run with the optimal parameters of the Gaussian mixture model, and the isotropic Gaussian blob is then run for the required number of samples N' to generate the labels and the synthetic data. To ensure that the generated data is balanced, the generator calls the cluster centers and variances one by one to generate synthetic data of each class with a balanced sample size. To generate N' samples across the K cluster centers, the isotropic Gaussian blob is called K times in total. Each call selects one $C_k$, $k \in K$, and generates $N'/K$ samples. The result is a balanced data set of size N', in which the cluster serves as the label of each synthetic sample.

S4, inputting the labeled synthetic data into a deep neural network model for model training. In this step, an adjustable deep neural network model is designed to suit the execution requirements of the downstream task. The deep neural network comprises a mask layer, dense layers, a batch normalization layer and a dropout (random inactivation) layer.

The combination of the above-described network layers may be flexibly adjusted for different specific tasks. The output result finally enters an output dense layer with k neurons and softmax activation functions, and the probability that the input data belongs to various labels is calculated. The predicted label probability result adopts cross entropy measurement as training loss, and the learnable parameters of the deep neural network model are updated by a gradient descent method until the deep neural network converges.

S5, inputting the initial unlabeled data into the trained deep neural network model, executing a downstream task, and outputting an execution result.

Example 2

Based on the embodiment 1, the specific process of searching the clustering center and standard deviation based on the unlabeled raw data in the step S1 is as follows:

S11, constructing a structured representation of the data: the unlabeled original data is denoted X and contains N samples, each sample i containing M features; any $x_{i,j}$ denotes the value of the j-th feature of the i-th sample, where 1 ≤ i ≤ N and 1 ≤ j ≤ M; the unlabeled original data X is divided into K classes, where K = [1, 2, ···, k];

S12, clustering the unlabeled original data X:

$L = [l_1, l_2, \cdots, l_N]$, $l_i \in K$, where $l_i$ is the class to which sample i belongs;

S13, calculating the cluster centers C and standard deviations σ from the probability that each data sample belongs to a given cluster;

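In this embodiment, the specific process of calculating the cluster centers C and standard deviations σ from the probability that each data sample belongs to a given cluster in step S13 is as follows:

S131, creating a Gaussian mixture model based on the linear superposition of K Gaussian models:

$p(x) = \sum_{k=1}^{K} w_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$;

S133, calculating the probability $\gamma_{ij}$ that each sample i belongs to each cluster j:

$\gamma_{ij} = \dfrac{w_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{l=1}^{K} w_l \, \mathcal{N}(x_i \mid \mu_l, \Sigma_l)}$;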

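In each iteration in step S133, the parameters are updated by the following formula:

$N_k = \sum_{i=1}^{N} \gamma_{ik}, \quad w_k = \dfrac{N_k}{N}, \quad \mu_k = \dfrac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} x_i, \quad \Sigma_k = \dfrac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^{\top} + \epsilon I$;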

The specific process by which the isotropic Gaussian blob generates labeled and balanced synthetic data of the same scale as the original data in step S134 is as follows:

$d_i = C_k + \sigma_k \xi_i, \quad \xi_i \sim \mathcal{N}(0, I)$;

The isotropic Gaussian blob generates synthetic data of the same scale as the original data, and the present invention focuses on the balance of the generated synthetic data. Therefore, during data generation, the isotropic Gaussian blob is called in sequence according to the number of samples and the number of clusters, generating data with a balanced sample size. The isotropic Gaussian blob uses the cluster centers C and standard deviations σ produced by the Gaussian mixture function as its centers and standard deviations, where $C_k$ and $\sigma_k$ denote the cluster center and standard deviation of a specific class, and generates a data set that is similar in characteristics to the original data and equal to it in size. To ensure sample balance of the synthetic data, the isotropic Gaussian blob is called in a loop.

Example 3

Based on embodiment 1 or embodiment 2, the specific procedure for obtaining the optimal superposition weight for the gaussian mixture function based on bayesian optimization iteration in step S2 is as follows:

S=sup_x|F_m(x)-G_n(x)|;

Wherein sup_x denotes the supremum of |F_m(x)-G_n(x)| over x, F_m(x) is the empirical distribution function of the original data, and G_n(x) is the empirical distribution function of the synthetic data:

$F_m(x) = \dfrac{1}{m} \sum_{i=1}^{m} I(X_i \le x), \quad G_n(x) = \dfrac{1}{n} \sum_{i=1}^{n} I(Y_i \le x)$;

I is the indicator function, which equals 1 when $X_i \le x$ (respectively $Y_i \le x$) and 0 otherwise; n is the number of synthetic data samples, and m is the number of original data samples;

The specific process of generating the labeled synthetic data in step S3 is as follows:

S32, designating the required number of samples and running the isotropic Gaussian blob in turn for each cluster center to generate the labeled, sample-balanced synthetic data D'.

Example 4

On the basis of embodiment 1, embodiment 2 or embodiment 3, the specific process of inputting the labeled synthesis data into the deep neural network model for model training in step S4 is as follows:

S41, deploying a deep neural network model comprising a mask layer, dense layers, a batch normalization layer and a dropout (random inactivation) layer, wherein the mask layer is the first layer of the deep neural network model, and each dense layer has a plurality of neurons and applies a linear transformation followed by the nonlinear ReLU activation function; the combination of these network layers can be flexibly adjusted for different specific tasks.

S42, the mask layer masks the samples in the input sequence whose value equals a preset specified value $v$, as follows:

$m_i = \begin{cases} 0, & x_i = v \\ 1, & \text{otherwise} \end{cases}$;

Wherein $m_i$ represents the i-th element of the mask vector and $x_i$ represents the i-th sample value in the input sequence.

h = ReLU(WD' + b);

Where h is the output of the dense layer, W is the learnable weight matrix of the linear transformation, b is the bias vector, and ReLU is the rectified linear unit function, ReLU(z) = max(0, z), where z is the result of the linear transformation in the neural network: the ReLU outputs z if z > 0 and 0 otherwise; D' is the labeled, sample-balanced synthetic data.

S44, inputting the output of the dense layer into the batch normalization layer for processing, wherein the batch normalization layer can disorder the data sequence, accelerate the convergence of the network and improve its generalization capability, and the specific formula is as follows:

$\hat{h} = \gamma \cdot \dfrac{h - \mu_B}{\sigma_B} + \beta$;

S45, inputting the output of the batch normalization layer into the dropout layer for processing, wherein the dropout layer can avoid overfitting of the deep neural network and reduce training time, and the specific formula is as follows:

$\tilde{h} = d \odot \hat{h}$;

S46, predicting labels based on the synthetic data: the output $\tilde{h}$ of the dropout layer enters an output dense layer with k neurons and a softmax activation function, which calculates the probability that the input data belongs to each label:

$y = \mathrm{softmax}(W_o \tilde{h} + b_o), \quad \mathrm{softmax}(z_i) = \dfrac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}$;

wherein $z_i$ and $z_j$ are the i-th and j-th elements of the input vector z;

S47, training the deep neural network by back propagation: the predicted probabilities of the labels are measured by cross entropy:

$Loss = -\sum_{i=1}^{N} t_i \log(\hat{y}_i)$;

The specific process of step S5 is as follows:

The initial unlabeled data sample X is input into the deep neural network model pre-trained on the synthetic data in step S4, the corresponding result is output, and the downstream task is executed.

According to the invention, the statistical characteristics of the unlabeled data are mined by optimizing the superposition weight of the Gaussian mixture model with Bayesian optimization, and labeled synthetic data is generated to enlarge the training sample size and to provide label references for deep neural network model training, thereby improving model performance. The sample size of each class of synthetic data is taken into account: the same number of samples is generated for every class, which balances the samples of the synthetic data and avoids the confusion that unbalanced generated samples cause in deep neural network model learning. The structure of the deep neural network can be flexibly adjusted for specific tasks, and the output of the pre-trained deep neural network can be further processed, so that the network can adapt to the requirements of various downstream tasks. The method is therefore suitable for a variety of downstream tasks, and other data processing methods can be configured for specific problems on the basis of the pre-training results.

In summary, an unsupervised deep learning method based on synthetic data generation is provided. A Gaussian mixture model finds the cluster centers and standard deviations of the original unlabeled data. Bayesian optimization selects the optimal superposition weight for the Gaussian mixture model by measuring the difference between the distribution of the data that isotropic Gaussian blobs generate from those cluster centers and standard deviations and the distribution of the original data. Under the optimal superposition weight, the Gaussian mixture model determines the number of clusters and the standard deviations, from which the isotropic Gaussian blobs generate a large amount of labeled, balanced synthetic data. The synthetic data is used to train a deep neural network model, and the trained model performs downstream tasks on the initial unlabeled data. Based on the clustering method and the statistical characteristics of the unlabeled data, a large amount of labeled synthetic data with a distribution similar to that of the original data can be generated, which provides sufficient training samples for deep neural network model training in unsupervised scenarios and improves model performance. By analyzing the distribution characteristics of the original data, the clustering method generates balanced synthetic data that matches the original data distribution, which assists the training of the deep neural network, improves model accuracy in unsupervised learning scenarios, and effectively solves the technical problem that imbalance in current synthetic data generation leads to low accuracy of deep learning models.

Claims

1. An unsupervised deep learning method based on synthetic image data generation, characterized by comprising the following steps:

S1: In an unlabeled image classification task in computer vision, the Gaussian mixture model obtains superposition weights from the Bayesian optimizer and finds the cluster centers and standard deviations based on the unlabeled original image data, and the isotropic Gaussian blobs use the cluster centers and standard deviations of the current Gaussian mixture model to generate synthetic image data;

S2: Measure the distribution difference between the synthetic image data and the original image data, and obtain an optimal superposition weight for the Gaussian mixture function through Bayesian optimization iterations;

S3: The isotropic Gaussian blobs, taking the cluster centers and standard deviations of the Gaussian mixture model under the optimal superposition weight as a reference, serve as a synthetic image data generator, which generates labeled synthetic image data with a distribution similar to that of the original image data;

S4: Inputting the labeled synthetic image data into a deep neural network model for model training;

S5: Input the initial unlabeled image data into the trained deep neural network model to perform downstream tasks;

The specific process of finding the cluster center and standard deviation based on the unlabeled original image data in step S1 is as follows:

S11: Denote the unlabeled original image data as X, which contains N samples, each sample i containing M features; any $x_{i,j}$ represents the value of the j-th feature of the i-th sample, where 1 ≤ i ≤ N and 1 ≤ j ≤ M; divide the unlabeled original image data X into K categories, where K = [1, 2, ···, k];

S12: Perform clustering on the unlabeled original image data X. The specific formula is as follows:

$L = [l_1, l_2, \cdots, l_N]$;

wherein $L$ is the clustering result, $l_i$ is the class to which sample i belongs, $l_i \in K$, $l_1$ is the class to which sample 1 belongs, $l_2$ is the class to which sample 2 belongs, and $l_N$ is the class to which sample N belongs;

S13: Calculate the cluster centers C and standard deviations σ from the probability that each image data sample belongs to a given cluster;

The specific process of calculating the cluster centers C and standard deviations σ from the probability that each image data sample belongs to a given cluster in step S13 is as follows:

S131: Create a Gaussian mixture model based on the linear superposition of K Gaussian models. The specific formula is as follows:

$p(x) = \sum_{k=1}^{K} w_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$;

wherein $w_k$ is the superposition parameter, $\sum_{k=1}^{K} w_k = 1$, $\mu_k$ is the mean vector of the k-th Gaussian component, $\Sigma_k$ is the covariance matrix of the k-th Gaussian component, and $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is the multivariate normal density function of the k-th component;

S132: The Bayesian optimizer randomly initializes the superposition parameters w for the Gaussian mixture model, and the original sample mean $\mu$ and covariance matrix $\Sigma$ are calculated from the original image data to generate the sample means $\mu_k$; at this point the Gaussian mixture model has 3×K parameters to be updated;

S133: Calculate the probability $\gamma_{ij}$ that each sample i belongs to each cluster j, for a total of N×K calculations; the calculation formula is as follows:

$\gamma_{ij} = \dfrac{w_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{l=1}^{K} w_l \, \mathcal{N}(x_i \mid \mu_l, \Sigma_l)}$;

wherein $w_j$ is the mixing weight of the j-th component, $\sum_{j=1}^{K} w_j = 1$, X represents the sample set, $x_i$ is the i-th sample, $\mu_j$ is the mean vector of the j-th Gaussian component, and $\Sigma_j$ is the covariance matrix of the j-th Gaussian component;

S134: The isotropic Gaussian blob generates labeled and balanced synthetic image data of the same scale as the original image data;

The specific process by which the isotropic Gaussian blob generates labeled and balanced synthetic image data of the same scale as the original image data in step S134 is as follows:

S1341: The isotropic Gaussian blob generator produces labels based on the optimal weight parameters of the Gaussian mixture model: a cluster center k is randomly selected from the K clusters, and the image data object $d_i$ is assigned to cluster k. The generator labels each image data object $d_i$, $d_i \in D$, and obtains a label set of size N for the image dataset D;

S1342: The isotropic Gaussian blob generator takes $C_k$ and $\sigma_k$ as the center and standard deviation and N/K as the sample size per cluster, and generates the image data objects $d_i$. The formula is as follows:

$$d_i \sim \mathcal{N}(C_k, \sigma_k^2 I)$$

where $\mathcal{N}(C_k, \sigma_k^2 I)$ is the normal distribution with mean $C_k$ and covariance $\sigma_k^2 I$. The generation process is performed K times, until the isotropic Gaussian blobs have generated samples for all clusters, yielding a balanced image dataset D with N synthetic samples for downstream use.
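
The generation loop of steps S1341 and S1342 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the per-cluster sample size is taken to be N divided by K, each feature is drawn independently with the cluster's single standard deviation (which is what "isotropic" implies), and the function name `make_balanced_blobs` is hypothetical.

```python
import random

def make_balanced_blobs(centers, stds, n_samples, seed=0):
    """Draw n_samples // K points per cluster around each center and label them by cluster index."""
    rng = random.Random(seed)
    per_cluster = n_samples // len(centers)  # assumed equal share per cluster
    data, labels = [], []
    for k, (center, std) in enumerate(zip(centers, stds)):
        for _ in range(per_cluster):
            # Isotropic: every feature uses the same standard deviation sigma_k.
            data.append([rng.gauss(c, std) for c in center])
            labels.append(k)
    return data, labels

# Illustrative cluster centers C_k and standard deviations sigma_k.
centers = [[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]]
stds = [1.0, 0.5, 2.0]
D, labels = make_balanced_blobs(centers, stds, n_samples=300)
```

Because every cluster contributes the same number of points, the resulting label set is balanced by construction, regardless of how unbalanced the original clusters were.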

2. The unsupervised deep learning method based on synthetic image data generation according to claim 1, characterized in that in step S133, in each iteration, the parameters are updated using the following formula:

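$$\mu_j = \frac{1}{N_j}\sum_{i=1}^{N} \gamma_{ij}\, x_i, \qquad \Sigma_j = \frac{1}{N_j}\sum_{i=1}^{N} \gamma_{ij}\,(x_i - \mu_j)(x_i - \mu_j)^{\top} + \lambda I, \qquad w_j = \frac{N_j}{N}$$

where $\gamma_{ij}$ is the probability, calculated in step S133, that sample i belongs to cluster j; $\mu_j$ is the sample mean of the j-th component; $N_j = \sum_{i=1}^{N} \gamma_{ij}$; N is the total number of image data samples; I is the unit matrix; and $\lambda I$ is the regularization term.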
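
A regularized parameter update of this kind can be sketched as the standard maximization step of the EM algorithm. The sketch below is an assumption about the form of the update, restricted to univariate data so that the regularization term reduces to a small constant `reg` added to each variance; the name `m_step` and the value of `reg` are hypothetical.

```python
def m_step(samples, gamma, reg=1e-6):
    """Update weights, means, and variances from responsibilities gamma[i][j]."""
    n = len(samples)
    k = len(gamma[0])
    weights, means, variances = [], [], []
    for j in range(k):
        n_j = sum(gamma[i][j] for i in range(n))  # effective sample count of component j
        mean_j = sum(gamma[i][j] * samples[i] for i in range(n)) / n_j
        var_j = sum(gamma[i][j] * (samples[i] - mean_j) ** 2 for i in range(n)) / n_j
        weights.append(n_j / n)
        means.append(mean_j)
        variances.append(var_j + reg)  # regularization keeps the variance strictly positive
    return weights, means, variances

# Hard assignments where every cluster collapses onto a single point.
samples = [-2.0, -2.0, 2.0, 2.0]
gamma = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
weights, means, variances = m_step(samples, gamma)
```

In this degenerate example the unregularized variances would be exactly zero, which is the situation the regularization term is there to prevent.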

3. The unsupervised deep learning method based on synthetic image data generation according to claim 1, wherein the specific process of obtaining the optimal superposition weight for the Gaussian mixture function based on Bayesian optimization iteration in step S2 is as follows:

S21: The Kolmogorov-Smirnov test is used to determine whether the image data generated by the generator is similar to the original image data distribution. The statistic S of the Kolmogorov-Smirnov test is:

$$S = \sup_x \left| F_m(x) - G_n(x) \right|$$

where $\sup_x$ denotes the supremum (the largest value) of $|F_m(x) - G_n(x)|$ over x, $F_m(x)$ is the empirical distribution function of the original image data, and $G_n(x)$ is the empirical distribution function of the synthetic image data:

$$F_m(x) = \frac{1}{m}\sum_{i=1}^{m} I(X_i \le x), \qquad G_n(x) = \frac{1}{n}\sum_{i=1}^{n} I(Y_i \le x)$$

where I is the indicator function, equal to 1 when its condition ($X_i \le x$ or $Y_i \le x$) holds and 0 otherwise; n is the number of synthetic image data samples; and m is the number of original image data samples.
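
The two-sample Kolmogorov-Smirnov statistic can be computed directly from the two empirical distribution functions. The sketch below assumes one-dimensional samples (for image data this would mean applying it to one feature at a time, which is an assumption, not something the claims state); the names `ecdf` and `ks_statistic` are hypothetical.

```python
def ecdf(sample, x):
    """Empirical distribution function: fraction of sample values that are <= x."""
    return sum(1 for v in sample if v <= x) / len(sample)

def ks_statistic(original, synthetic):
    """Largest absolute gap between the two empirical distribution functions."""
    # Both functions are step functions that only change at observed values,
    # so it suffices to evaluate the gap at every observed point.
    points = sorted(set(original) | set(synthetic))
    return max(abs(ecdf(original, x) - ecdf(synthetic, x)) for x in points)

original = [0.1, 0.4, 0.7, 1.0]
synthetic = [0.2, 0.5, 0.8, 1.1]   # interleaves with the original sample
shifted = [5.0, 6.0, 7.0, 8.0]     # lies entirely to the right of it

s_close = ks_statistic(original, synthetic)  # 0.25
s_far = ks_statistic(original, shifted)      # 1.0
```

A smaller statistic indicates that the synthetic sample follows the original distribution more closely, which is why it can serve as the objective when searching for the superposition weight.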

S22: Loop through steps S12, S13, and S21 to obtain the optimal superposition weight of the Gaussian mixture model.

4. The unsupervised deep learning method based on synthetic image data generation according to claim 3, wherein the specific process of generating labeled synthetic image data in step S3 is as follows:

S31: Based on the obtained optimal superposition weight, run the Gaussian mixture model to obtain the cluster center C and standard deviation σ to which it converges under this weight;

S32: Specify the required number of samples and run the isotropic Gaussian blob algorithm once per cluster center to generate the labeled, sample-balanced synthetic image data D'.

5. The unsupervised deep learning method based on synthetic image data generation according to claim 1, wherein the specific process of inputting the labeled synthetic image data into the deep neural network model for model training in step S4 is as follows:

S41: Deploy a deep neural network model comprising a mask layer, a dense layer, a batch normalization layer, and a random dropout layer. The mask layer is the first layer of the deep neural network model; the dense layer has multiple neurons and applies a linear transformation followed by an activation function that introduces nonlinearity.

S42: Using the mask layer, each value in the input sequence is compared with the preset specified value, and samples equal to that value are masked. The specific formula is as follows:

$$m_i = \begin{cases} 0, & x_i = v \\ 1, & x_i \neq v \end{cases}$$

where $m_i$ represents the i-th element of the mask vector, $x_i$ represents the i-th sample value in the input sequence, and $v$ is the preset specified value;
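
The masking rule can be sketched in a few lines. The convention that a mask element of 0 marks a masked value, the sentinel value -1.0, and the function names are all assumptions made for illustration.

```python
def compute_mask(sequence, mask_value):
    """Return 0 where the value equals mask_value (masked) and 1 elsewhere."""
    return [0 if x == mask_value else 1 for x in sequence]

def unmasked_values(sequence, mask):
    """Keep only the values whose mask element is 1."""
    return [x for x, m in zip(sequence, mask) if m == 1]

# -1.0 is used here as a hypothetical placeholder for missing entries.
sequence = [0.5, -1.0, 0.7, -1.0]
mask = compute_mask(sequence, mask_value=-1.0)
valid = unmasked_values(sequence, mask)
```

Later layers can then ignore the masked positions instead of treating the placeholder as a real measurement.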

S43: Calculation is performed through dense layers. The specific formula is as follows:

$$h = \mathrm{ReLU}(W D' + b)$$

where h is the output of the dense layer, W is the learnable weight matrix of the linear transformation, b is the bias vector, and ReLU is the rectified linear unit function, ReLU(z) = max(0, z), with z the result of the linear transformation: if z > 0, z is output; otherwise, 0 is output. D' is the labeled, sample-balanced synthetic image data.
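
As a small worked example of the dense layer, the sketch below applies the linear transformation and ReLU to a single input vector. The weight and bias values are arbitrary illustrative numbers and the function names are hypothetical.

```python
def relu(z):
    """Rectified linear unit: z if z > 0, otherwise 0."""
    return max(0.0, z)

def dense_relu(inputs, weights, bias):
    """Compute ReLU(W x + b) for one input vector; each row of weights is one neuron."""
    return [
        relu(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, bias)
    ]

W = [[1.0, -1.0], [0.5, 0.5]]
b = [0.0, -3.0]
h = dense_relu([3.0, 1.0], W, b)
# Neuron 0: 3 - 1 + 0 = 2 is passed through; neuron 1: 1.5 + 0.5 - 3 = -1 is clipped to 0.
```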

S44: Input the output result of the dense layer to the batch normalization layer for processing. The specific formula is as follows:

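$$\hat{h} = \gamma \cdot \frac{h - \mu_B}{\sigma_B} + \beta$$

where $\mu_B$ and $\sigma_B$ are the mean and standard deviation of h in the current batch, and $\gamma$ and $\beta$ are learnable parameters;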
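
Batch normalization of a single feature over one batch can be sketched as follows. The small constant `eps` added under the square root is a common numerical safeguard and is an assumption here, as are the default values of the learnable scale and shift.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a batch, then apply the learnable scale and shift."""
    mean = sum(batch) / len(batch)
    var = sum((h - mean) ** 2 for h in batch) / len(batch)
    std = math.sqrt(var + eps)  # eps avoids division by zero for constant batches
    return [gamma * (h - mean) / std + beta for h in batch]

out = batch_norm([1.0, 2.0, 3.0, 4.0])
```

With the default scale and shift, the output has a mean of approximately zero and a variance of approximately one.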

S45: The output of the batch normalization layer is input to the random dropout layer for processing. The specific formula is as follows:

$$\tilde{h} = d \odot \hat{h}$$

where $\hat{h}$ is the output of the batch normalization layer, $\odot$ denotes element-wise multiplication, and d is a binary masking vector in which each element is 0 with probability p and 1 with probability 1 − p;
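
The random dropout step can be sketched with a binary mask drawn per element. This sketch applies the mask without rescaling; deep learning frameworks commonly also divide the kept values by 1 − p during training, which is not shown. The function name and the `training` flag are hypothetical.

```python
import random

def dropout(values, p, rng, training=True):
    """Zero each element independently with probability p; pass through at inference."""
    if not training:
        return list(values)
    mask = [0 if rng.random() < p else 1 for _ in values]  # binary masking vector d
    return [v * d for v, d in zip(values, mask)]

rng = random.Random(0)
out = dropout([1.0] * 10000, p=0.3, rng=rng)
kept = sum(1 for v in out if v != 0.0) / len(out)  # close to 1 - p
```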

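S46: The output of the random dropout layer is fed into the output dense layer, which has k neurons and a softmax activation function, to calculate the probability that the input image data belongs to each class label:

$$\hat{y} = \mathrm{softmax}(W_o \tilde{h} + b_o)$$

where $\tilde{h}$ is the output of the random dropout layer, the dimension of $\hat{y}$ is k × N, $W_o$ is the weight matrix, $b_o$ is the bias vector, and the softmax activation function is:

$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}$$

where $z_i$ is the i-th element of the input vector z;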
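
The softmax function can be sketched as below. Subtracting the largest element before exponentiating is a standard numerical-stability step that leaves the result unchanged; it is an implementation choice, not part of the claims.

```python
import math

def softmax(z):
    """Map a vector of scores to probabilities that sum to 1."""
    shift = max(z)  # shifting by the maximum prevents overflow in exp
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
```

The largest score receives the largest probability, and the ordering of the scores is preserved.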

S47: Cross entropy is used as the training loss to measure how well the predicted probabilities match the class labels:

$$\mathrm{Loss} = -\sum_{i=1}^{N} y_i \log(\hat{y}_i)$$

where $y_i$ indicates the class to which sample i belongs, log denotes the logarithm, and $\hat{y}_i$ is the predicted probability. The training loss is used to update the learnable parameters of the deep neural network through the Adam optimizer until the deep neural network converges.
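
A cross-entropy loss over class labels can be sketched as follows. The sketch assumes that labels are given as class indices, that the loss is averaged over samples, and that probabilities are clipped at a small `eps` to avoid taking the logarithm of zero; all three are illustrative assumptions, and the function name is hypothetical.

```python
import math

def cross_entropy(true_classes, predicted_probs, eps=1e-12):
    """Mean negative log-probability assigned to the true class of each sample."""
    total = 0.0
    for c, probs in zip(true_classes, predicted_probs):
        total -= math.log(max(probs[c], eps))  # clip to avoid log(0)
    return total / len(true_classes)

labels = [0, 1]
loss_good = cross_entropy(labels, [[0.9, 0.1], [0.2, 0.8]])  # confident and correct
loss_bad = cross_entropy(labels, [[0.1, 0.9], [0.8, 0.2]])   # confident and wrong
```

The loss is lower when the network assigns high probability to the correct class, which is the quantity the optimizer drives down during training.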

6. The unsupervised deep learning method based on synthetic image data generation according to claim 1, wherein the specific process of step S5 is as follows:

Input the initial unlabeled image data sample X into the deep neural network model pre-trained by the synthetic image data in step S4, output the corresponding result, and execute the downstream task.