WO2019200782A1 - Sample data classification method, model training method, electronic device and storage medium - Google Patents
Sample data classification method, model training method, electronic device and storage medium
- Publication number
- WO2019200782A1 (PCT/CN2018/100157; CN2018100157W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sample
- distance
- value
- density
- samples
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/906—Clustering; Classification
Definitions
- the present application relates to the field of data processing, and in particular, to a sample data classification method, a model training method, an electronic device, and a storage medium.
- a sample data classification method comprising: calculating features of each sample in sample data; calculating a distance set of each sample according to the features of each sample; calculating a density value and a density distance value of each sample according to the distance set of each sample; determining at least one cluster center according to the density value and the density distance value of each sample; and
- clustering the sample data into a plurality of subsets based on the at least one cluster center and the features of each sample.
- a model training method comprising: acquiring sample data of each category;
- classifying the sample data of each category using the sample data classification method described in any embodiment to obtain multiple subsets of each category; calculating the relevance of each subset among the multiple subsets of each category to the category in which the subset is located; sorting the multiple subsets of each category from high to low according to the relevance of each subset to its category, obtaining multiple sorted subsets of each category; and
- sequentially reading, from the multiple sorted subsets of each category, the subsets with the same sorting position as training samples of the model, and training the model.
- An electronic device comprising a memory and a processor, the memory storing at least one instruction and the processor executing the at least one instruction to implement the sample data classification method of any embodiment and/or the model training method of any embodiment.
- A non-volatile readable storage medium storing at least one instruction that, when executed by a processor, implements the sample data classification method described in any embodiment and/or the model training method described in any embodiment.
- By the above technical solutions, the present application calculates features of each sample in the sample data; calculates a distance set of each sample according to the features of each sample, the distance set of each sample including the distance between that sample and each sample among the remaining samples corresponding to it; calculates a density value and a density distance value of each sample according to the distance set of each sample; determines at least one cluster center according to the density value and the density distance value of each sample; and clusters the sample data into a plurality of subsets based on the at least one cluster center and the features of each sample.
- The present application trains in order of task difficulty, from easy to hard, so that hard training samples are not discarded, thereby improving the adaptability of the model parameters.
- FIG. 1 is a flow chart of a first preferred embodiment of a sample data classification method of the present application.
- FIG. 2 is a flow chart of a first preferred embodiment of the method of training a model of the present application.
- FIG. 3 is a program module diagram of a first preferred embodiment of the sample data classification apparatus of the present application.
- FIG. 4 is a block diagram of a program of a first preferred embodiment of the model training device of the present application.
- FIG. 5 is a schematic structural diagram of a preferred embodiment of an electronic device in at least one example of the present application.
- FIG. 1 is a flowchart of a first preferred embodiment of the sample data classification method of the present application.
- the order of the steps in the flowchart may be changed according to different requirements, and some steps may be omitted.
- the electronic device calculates characteristics of each sample of the sample data.
- the sample data includes, but is not limited to, pre-collected data and data crawled from the network. During large-scale sample data collection, data that is weakly correlated with the category indicated by the sample data, or outright erroneous data, inevitably appears. To improve the accuracy of subsequent model training, the sample data needs to be classified so as to automatically detect the simple samples whose features are easy to learn during model training and the hard samples whose features are difficult to learn, thereby classifying the sample data.
- the features of each sample are extracted using a feature extraction model.
- the feature extraction model includes, but is not limited to, a deep convolutional neural network model.
- features are extracted from the sample data by a deep convolutional neural network. For example, in any such network (VGG-16, ResNet-50, etc.), the layer in front of the Soft-max classification layer can be regarded as a feature extractor, and the output of that layer is taken as the extracted features.
- the deep convolutional neural network model is composed of one input layer, 20 convolution layers, 6 pooling layers, 3 hidden layers, and 1 classification layer.
- the model architecture of the deep convolutional neural network model is shown in FIG. 3, where Conv a-b (for example, Conv 3-64) indicates that the dimension of that layer's convolution kernels is a×a and the number of convolution kernels in that layer is b; Maxpool2 indicates that the pooling kernel of the pooling layer has a dimension of 2×2; FC-c (for example, FC-6) indicates that the hidden layer (i.e., the fully connected layer) has c output nodes; and Soft-max indicates that the classification layer classifies the input image using the Soft-max classifier.
- the training sample is used for training learning to obtain a trained deep convolutional neural network model.
- Importing the sample data into the trained deep convolutional neural network model can accurately and automatically extract features of each sample in the sample data.
- in general, the larger the training sample size, the more accurate the features extracted by the trained deep convolutional neural network model.
- the deep convolutional neural network model can also take other forms; the present application imposes no limitation.
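For illustration, a minimal sketch of this kind of feature extractor, assuming PyTorch and torchvision are available; it uses torchvision's pretrained ResNet-50 and drops the final classification layer so the output of the layer in front of Soft-max is returned (the 2048-dimensional output is a property of ResNet-50, not something specified by this application):

```python
# Sketch: use a CNN trained for classification as a feature extractor by
# taking the output of the layer in front of the Soft-max classifier.
import torch
import torch.nn as nn
import torchvision.models as models

resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = nn.Identity()   # drop the final classification layer
resnet.eval()

with torch.no_grad():
    batch = torch.randn(4, 3, 224, 224)  # stand-in for preprocessed sample images
    features = resnet(batch)             # one 2048-dim feature vector per sample
print(features.shape)                    # torch.Size([4, 2048])
```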
- the electronic device calculates a distance set of each sample according to characteristics of each sample.
- the distance set of each sample includes the distance between that sample and each sample among its remaining samples, where the remaining samples corresponding to a sample are all samples in the sample data other than that sample. For example, if the sample data contains 3 samples, sample A, sample B and sample C, then for sample A the distances between A and B and between A and C are calculated, and likewise for samples B and C.
- the distance matrix is then a 3*2 or 2*3 matrix.
- the distance includes, but is not limited to, a Euclidean distance, a cosine distance, and the like.
- each distance value in the distance matrix is greater than zero; for example, when a calculated cosine distance is less than 0, its absolute value is taken.
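A minimal NumPy sketch of this distance-set computation, assuming the features of n samples are stacked into an (n, d) array; the function and parameter names are illustrative only:

```python
# Sketch: pairwise distance set for every sample from its feature vectors.
import numpy as np

def distance_sets(features, metric="euclidean"):
    """features: (n, d) array; returns (n, n) matrix D with D[i, j] the
    distance between sample i and sample j (the diagonal is 0)."""
    if metric == "euclidean":
        diff = features[:, None, :] - features[None, :, :]
        D = np.sqrt((diff ** 2).sum(axis=-1))
    else:  # cosine distance: 1 - cosine similarity
        unit = features / np.linalg.norm(features, axis=1, keepdims=True)
        D = 1.0 - unit @ unit.T
        D = np.abs(D)            # take absolute value of any negative distance
        np.fill_diagonal(D, 0.0)
    return D
```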
- the electronic device calculates a density value of each sample according to a distance set of each sample and calculates a density distance value of each sample.
- each distance in the distance set of each sample is compared with a distance threshold, the number of distances greater than the distance threshold is determined, and the number corresponding to each sample is taken as the density value of that sample.
- the density value of any one of the samples is calculated as ρ_i = Σ_{j≠i} χ(D_ij − d_c), where χ(x) = 1 if x > 0 and χ(x) = 0 otherwise; ρ_i represents the density value of the i-th sample in the sample data, D_ij represents the distance between the i-th sample and the j-th sample in the sample data, and d_c represents the distance threshold.
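A sketch of the density value under the counting rule just described (the number of distances greater than d_c); the names are illustrative:

```python
# Sketch of the density value: rho_i = number of distances D[i, j] > d_c,
# following the counting rule stated in the text above.
import numpy as np

def density_values(D, d_c):
    """D: (n, n) pairwise distance matrix; d_c: distance threshold."""
    mask = D > d_c
    np.fill_diagonal(mask, False)  # a sample is not compared with itself
    return mask.sum(axis=1)        # rho_i for every sample
```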
- the calculating the density distance value of each sample comprises:
- (1) for the sample with the largest density value, selecting the maximum distance from that sample's distance set as its density distance value;
- (2) for any sample in the second sample set, determining the samples whose density values are larger than that sample's density value, and, according to that sample's distance set, taking the smallest distance from it to any of those higher-density samples as its density distance value; the second sample set includes all samples of the sample data other than the sample with the largest density value.
- the density distance value of each sample is calculated as δ_i = max_j D_ij when ρ_i is the largest density value, and δ_i = min_{j: ρ_j > ρ_i} D_ij otherwise, where δ_i represents the density distance value of the i-th sample, ρ_i the density value of the i-th sample, ρ_j the density value of the j-th sample, and D_ij the distance between the i-th sample and the j-th sample.
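A sketch of this density distance rule (names illustrative):

```python
# Sketch of the density distance value: the sample with the largest density
# takes its maximum distance; every other sample takes the minimum distance
# to any sample of strictly higher density.
import numpy as np

def density_distance_values(D, rho):
    n = len(rho)
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        if higher.size == 0:           # the sample with the largest density
            delta[i] = D[i].max()
        else:
            delta[i] = D[i, higher].min()
    return delta
```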
- the electronic device determines at least one cluster center according to a density value of each sample and a density distance value of each sample.
- determining the at least one cluster center according to the density value and the density distance value of each sample comprises: first, calculating a cluster metric value for each sample.
- the cluster metric value of each sample is equal to the product of the density value of each sample and the density distance value of each sample.
- determining the at least one cluster center according to the cluster metric value of each sample includes either:
- (1) sorting the cluster metric values from large to small and selecting the samples whose metric values rank within a preset number of top positions (for example, the top three) as cluster center points; or (2) selecting the samples whose cluster metric values are greater than a threshold as cluster center points.
- the threshold is configured according to the cluster metric values of the samples; for example, the mean of the cluster metric values is calculated and taken as the threshold.
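A sketch of the cluster metric and both center-selection options described above (names illustrative):

```python
# Sketch of cluster-center selection: the cluster metric is rho * delta;
# centers are either the top-k samples or every sample above a threshold
# (here the mean metric, as in the example above).
import numpy as np

def select_centers(rho, delta, top_k=None):
    gamma = rho * delta                        # cluster metric value per sample
    if top_k is not None:                      # option 1: preset number of centers
        return np.argsort(gamma)[::-1][:top_k]
    threshold = gamma.mean()                   # option 2: mean metric as threshold
    return np.where(gamma > threshold)[0]
```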
- the electronic device clusters the sample data into a plurality of subsets based on the at least one cluster center and characteristics of each sample.
- the sample data is clustered into a plurality of subsets according to a distance set of samples corresponding to each cluster center in the at least one cluster center by using a clustering algorithm.
- the clustering algorithm includes, but is not limited to, a k-means clustering algorithm, a hierarchical clustering algorithm, and the like.
- a sample whose distance from every cluster center of the at least one cluster center exceeds a distance threshold is determined to be an erroneous sample; this effectively eliminates erroneous samples.
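A sketch of the final assignment step, combining nearest-center clustering with the erroneous-sample filtering just described; the error threshold is an assumed hyperparameter:

```python
# Sketch: assign each sample to its nearest cluster center and flag samples
# farther than a distance threshold from every center as erroneous.
import numpy as np

def assign_subsets(D, centers, error_threshold):
    dist_to_centers = D[:, centers]            # (n, k) distances to each center
    nearest = dist_to_centers.argmin(axis=1)   # index of the closest center
    is_error = dist_to_centers.min(axis=1) > error_threshold
    subsets = {int(c): [] for c in centers}
    for i, k in enumerate(nearest):
        if not is_error[i]:
            subsets[int(centers[k])].append(i)
    return subsets, np.where(is_error)[0]
```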
- in the above embodiment, the larger a sample's density value, the more samples are similar to it.
- the larger a sample's density distance value, the farther the subset containing that sample is from the other subsets. After clustering as in the above embodiment, the distance between samples within the same subset therefore becomes shorter, while the distance between samples in different subsets becomes larger.
- the denser the samples of a subset, the more similar the features of the pictures they represent, and the more similar the data in that subset is to the category represented by the sample data; these are simple samples, whose features the model easily learns.
- moreover, clustering the sample data around the cluster centers selected in the above embodiment effectively eliminates erroneous samples, thereby improving the accuracy of the parameters of the subsequently trained model.
- FIG. 2 is a flow chart of a second preferred embodiment of the model training method of the present application.
- the order of the steps in the flowchart may be changed according to different requirements, and some steps may be omitted.
- the electronic device acquires sample data of each category.
- the trained model is used to identify the category to which a picture to be detected belongs. For example, the model is a vehicle part identification model used to identify which part of a vehicle the part in the picture under test belongs to. Sample data of the various parts of the vehicle must therefore be obtained, the sample data of one part belonging to one category.
- the electronic device classifies sample data of each category to obtain multiple subsets of each category.
- the sample data of each category is classified using the sample data classification method of the first preferred embodiment.
- the processing of step S21 is the same as the data classification method in the first preferred embodiment and is not described in detail here.
- the electronic device calculates a correlation between each subset of the plurality of subsets of each category and a category of each subset.
- after clustering as in the above embodiment, for each category the distance between samples within the same subset becomes shorter, while the distance between samples in different subsets becomes larger.
- the denser the samples of a subset, the more similar the features of the pictures they represent, the more related the data in that subset is to the category in which the subset is located, and the higher the similarity; these are simple samples.
- conversely, the sparser the samples of a subset, the more diverse the pictures they represent; these are hard samples.
- the number of samples contained in each subset is taken as the relevance of that subset to the category in which it is located. For example, for one category, three subsets are obtained after clustering: a first subset, a second subset, and a third subset. If the first subset has 40 samples, the second subset has 100 samples, and the third subset has 10 samples, the value 40 is used to indicate the similarity between the first subset and the category in which the first subset is located.
- the electronic device sorts the multiple subsets of each category from high to low according to the relevance of each subset to its category, obtaining multiple sorted subsets for each category.
- for example, for one category, three subsets are obtained after clustering: a first subset, a second subset, and a third subset. If the first subset has 40 samples, the second subset has 100 samples, and the third subset has 10 samples, then the sorted subsets of that category are: the second subset, the first subset, the third subset.
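A sketch of this relevance-based sorting, where relevance is the subset's sample count; with the counts from the example above, the order [100, 40, 10] results:

```python
# Sketch: sort one category's subsets by relevance (here, sample count)
# from high to low, as in the example above.
def sort_subsets_by_relevance(subsets):
    """subsets: list of lists of sample indices for one category."""
    return sorted(subsets, key=len, reverse=True)

first, second, third = list(range(40)), list(range(100)), list(range(10))
ordered = sort_subsets_by_relevance([first, second, third])
print([len(s) for s in ordered])  # [100, 40, 10]
```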
- the electronic device sequentially reads, from a plurality of sorted subsets of each category, a subset with the same sorting position as a training sample of the model, and trains the model.
- preferably, the first subset of each category is read from the sorted subsets of each category as training samples of the model, and the model is trained; after a first termination condition is reached, the second subset of each category is read and added to the model's training samples, and training continues until all subsets of every category have been used as training samples.
- because the subsets of each category are sorted from high to low by their relevance to the category, simple samples are ranked first; simple samples are easier to train on, while hard samples are ranked later and are harder to train on. Training the model is thus divided into multiple subtasks performed in order of difficulty, from easy to hard, so that hard training samples are not discarded, the model learns the features of each category from easy to hard, and the adaptability of the model parameters improves.
- further, among the sorted subsets, the higher a subset's ranking position, the larger its corresponding weight. Samples with higher similarity thus carry larger weights, so more of their features are learned during training, improving the accuracy of the model parameters.
- for example, there are two categories, category A and category B.
- the sorted subsets in category A are subset A1 and subset A2; subset A1 has weight 1 and subset A2 has weight 0.5.
- the sorted subsets in category B are subset B1 and subset B2; subset B1 has weight 1 and subset B2 has weight 0.5.
- subsets A1 and B1 are read first to train the model. After the first termination condition is reached, subsets A2 and B2 are read and added to the model's training samples, so that subsets A1, B1, A2, and B2 all serve as training samples, and the model is trained until training ends.
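A sketch of this easy-to-hard schedule; train_until_converged is a hypothetical stand-in for the actual training step and termination condition, and the weights follow the example's 1.0/0.5 scheme:

```python
# Sketch of the easy-to-hard training schedule: after each termination
# condition, the next-ranked subset of every category is added to the
# accumulated training pool, with weights decreasing by rank position.
def curriculum_train(model, sorted_subsets_per_category, weights,
                     train_until_converged):
    """sorted_subsets_per_category: {category: [subset_1, subset_2, ...]}
    weights: one weight per rank position, e.g. [1.0, 0.5]."""
    pool = []                                   # accumulated training samples
    max_rank = max(len(s) for s in sorted_subsets_per_category.values())
    for rank in range(max_rank):
        for subsets in sorted_subsets_per_category.values():
            if rank < len(subsets):
                w = weights[rank] if rank < len(weights) else weights[-1]
                pool += [(sample, w) for sample in subsets[rank]]
        train_until_converged(model, pool)      # train to termination condition
    return model
```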
- sample pictures of each part of a vehicle are obtained, the samples of one part forming one category of pictures. The samples of each part are processed with the sample data classification method of the first preferred embodiment to obtain multiple subsets per part, the subsets of each part are sorted with the method of the second preferred embodiment, and the vehicle part identification model is trained on the sorted subsets of each part.
- training the vehicle part identification model is thus divided into multiple subtasks performed in order of difficulty from easy to hard, so that hard training samples are not discarded, the model learns the features of the sample pictures of each part from easy to hard, and the adaptability of the model parameters improves.
- the present application divides the training sample data of each category into multiple subsets by difficulty, so that the distance between samples within the same subset becomes shorter and the distance between samples in different subsets becomes larger.
- the denser the samples of a subset, the more similar the features of the pictures they represent, and the more similar the data in that subset is to the category represented by the sample data; these are simple samples, whose features the model easily learns.
- conversely, the sparser the samples of a subset, the more diverse the pictures they represent; the subset's data is considered more complicated and constitutes hard samples.
- the multiple subsets of the training sample data are then sorted from easy to hard, dividing the training of the model into multiple subtasks performed in order of difficulty, so that hard training samples are not discarded, the model learns the features of each category from easy to hard, and the adaptability of the model parameters improves.
- the sample data classification apparatus 3 includes, but is not limited to, one or more of the following modules: a calculation module 30, a determination module 31, and a clustering module 32.
- a module referred to in this application is a series of computer-readable instruction segments, stored in a memory, that can be executed by a processor of the sample data classification apparatus 3 and that perform a fixed function. The functions of the modules are detailed in the subsequent embodiments.
- the calculation module 30 calculates features of each sample of sample data.
- the sample data includes, but is not limited to, pre-collected data and data crawled from the network. During large-scale sample data collection, data that is weakly correlated with the category indicated by the sample data, or outright erroneous data, inevitably appears. To improve the accuracy of subsequent model training, the sample data needs to be classified so as to automatically detect the simple samples whose features are easy to learn during model training and the hard samples whose features are difficult to learn, thereby classifying the sample data.
- the calculation module 30 extracts features of each sample using a feature extraction model.
- the feature extraction model includes, but is not limited to, a deep convolutional neural network model.
- features are extracted from the sample data by a deep convolutional neural network. For example, in any such network (VGG-16, ResNet-50, etc.), the layer in front of the Soft-max classification layer can be regarded as a feature extractor, and the output of that layer is taken as the extracted features.
- the deep convolutional neural network model is composed of one input layer, 20 convolution layers, 6 pooling layers, 3 hidden layers, and 1 classification layer.
- the model architecture of the deep convolutional neural network model is shown in FIG. 3, where Conv a-b (for example, Conv 3-64) indicates that the dimension of that layer's convolution kernels is a×a and the number of convolution kernels in that layer is b; Maxpool2 indicates that the pooling kernel of the pooling layer has a dimension of 2×2; FC-c (for example, FC-6) indicates that the hidden layer (i.e., the fully connected layer) has c output nodes; and Soft-max indicates that the classification layer classifies the input image using the Soft-max classifier.
- the training sample is used for training learning to obtain a trained deep convolutional neural network model.
- Importing the sample data into the trained deep convolutional neural network model can accurately and automatically extract features of each sample in the sample data.
- in general, the larger the training sample size, the more accurate the features extracted by the trained deep convolutional neural network model.
- the deep convolutional neural network model can also take other forms; the present application imposes no limitation.
- the calculation module 30 calculates a distance set for each sample based on the characteristics of each sample.
- the distance set of each sample includes the distance between that sample and each sample among its remaining samples, where the remaining samples corresponding to a sample are all samples in the sample data other than that sample. For example, with 3 samples the distance matrix is a 3*2 or 2*3 matrix.
- the distance includes, but is not limited to, a Euclidean distance, a cosine distance, and the like.
- each distance value in the distance matrix is greater than zero; for example, when a calculated cosine distance is less than zero, its absolute value is taken.
- the calculation module 30 calculates a density value for each sample and calculates a density distance value for each sample based on the distance set of each sample.
- the calculation module 30 compares each distance in the distance set of each sample with a distance threshold, determines the number of distances greater than the distance threshold, and takes the number corresponding to each sample as the density value of that sample. The larger a sample's density value, the more samples are similar to it.
- the density value of any one of the samples is calculated as ρ_i = Σ_{j≠i} χ(D_ij − d_c), where χ(x) = 1 if x > 0 and χ(x) = 0 otherwise; ρ_i represents the density value of the i-th sample in the sample data, D_ij represents the distance between the i-th sample and the j-th sample in the sample data, and d_c represents the distance threshold.
- the calculation module 30 calculating the density distance value of each sample includes:
- (1) for the sample with the largest density value, selecting the maximum distance from that sample's distance set as its density distance value;
- (2) for any sample in the second sample set, determining the samples whose density values are larger than that sample's density value, and, according to that sample's distance set, taking the smallest distance from it to any of those higher-density samples as its density distance value; the second sample set includes all samples of the sample data other than the sample with the largest density value.
- the density distance value of each sample is calculated as δ_i = max_j D_ij when ρ_i is the largest density value, and δ_i = min_{j: ρ_j > ρ_i} D_ij otherwise, where δ_i represents the density distance value of the i-th sample, ρ_i the density value of the i-th sample, ρ_j the density value of the j-th sample, and D_ij the distance between the i-th sample and the j-th sample.
- the determining module 31 determines at least one cluster center according to the density value of each sample and the density distance value of each sample.
- the determination module 31 determining the at least one cluster center according to the density value and the density distance value of each sample comprises: first, calculating a cluster metric value for each sample.
- the cluster metric value of each sample is equal to the product of the density value of each sample and the density distance value of each sample.
- the determination module 31 determining the at least one cluster center according to the cluster metric value of each sample includes either:
- (1) sorting the cluster metric values from large to small and selecting the samples whose metric values rank within a preset number of top positions as cluster center points; or (2) selecting the samples whose cluster metric values are greater than a threshold as cluster center points.
- the threshold is configured according to the cluster metric values of the samples; for example, the mean of the cluster metric values is calculated and taken as the threshold.
- the clustering module 32 clusters the sample data into a plurality of subsets based on the at least one cluster center and features of each sample.
- the clustering module 32 clusters the sample data into a plurality of subsets according to a distance set of samples corresponding to each of the cluster centers in the at least one cluster center.
- the clustering algorithm includes, but is not limited to, a k-means clustering algorithm, a hierarchical clustering algorithm, and the like.
- the determination module 31 determines a sample whose distance from every cluster center of the at least one cluster center exceeds a distance threshold to be an erroneous sample; this effectively eliminates erroneous samples.
- in the above embodiment, the larger a sample's density value, the more samples are similar to it.
- the larger a sample's density distance value, the farther the subset containing that sample is from the other subsets. After clustering as in the above embodiment, the distance between samples within the same subset therefore becomes shorter, while the distance between samples in different subsets becomes larger.
- the denser the samples of a subset, the more similar the features of the pictures they represent, and the more similar the data in that subset is to the category represented by the sample data; these are simple samples, whose features the model easily learns.
- moreover, clustering the sample data around the cluster centers selected in the above embodiment effectively eliminates erroneous samples, thereby improving the accuracy of the parameters of the subsequently trained model.
- the model training apparatus 4 includes, but is not limited to, one or more of the following modules: a data acquisition module 40, a data clustering module 41, a relevance calculation module 42, a sorting module 43, and a training module 44.
- a module referred to in this application is a series of computer-readable instruction segments, stored in a memory, that can be executed by a processor of the model training apparatus 4 and that perform a fixed function. The functions of the modules are detailed in the subsequent embodiments.
- the data acquisition module 40 acquires sample data for each category.
- the trained model is used to identify the category to which a picture to be detected belongs. For example, the model is a vehicle part identification model used to identify which part of a vehicle the part in the picture under test belongs to. Sample data of the various parts of the vehicle must therefore be obtained, the sample data of one part belonging to one category.
- the data clustering module 41 classifies the sample data for each category to obtain a plurality of subsets of each category.
- the sample data of each category is classified using the sample data classification method of the first preferred embodiment.
- the data clustering module 41 implements the sample data classification method of the first preferred embodiment, which is not described in detail here.
- the relevance calculation module 42 calculates the relevance of each subset of the plurality of subsets of each category to the category in which each subset is located.
- after clustering as in the above embodiment, for each category the distance between samples within the same subset becomes shorter, while the distance between samples in different subsets becomes larger.
- the denser the samples of a subset, the more similar the features of the pictures they represent, the more related the data in that subset is to the category in which the subset is located, and the higher the similarity; these are simple samples.
- conversely, the sparser the samples of a subset, the more diverse the pictures they represent; these are hard samples.
- the number of samples contained in each subset is taken as the relevance of that subset to the category in which it is located. For example, for one category, three subsets are obtained after clustering: a first subset, a second subset, and a third subset. If the first subset has 40 samples, the second subset has 100 samples, and the third subset has 10 samples, the value 40 is used to indicate the similarity between the first subset and the category in which the first subset is located.
- the sorting module 43 sorts the multiple subsets of each category from high to low according to the relevance of each subset to its category, obtaining multiple sorted subsets for each category.
- for example, for one category, three subsets are obtained after clustering: a first subset, a second subset, and a third subset. If the first subset has 40 samples, the second subset has 100 samples, and the third subset has 10 samples, then the sorted subsets of that category are: the second subset, the first subset, the third subset.
- the training module 44 sequentially reads the subsets with the same sorting position from the plurality of sorted subsets of each category as training samples of the model, and trains the model.
- the training module 44 reads the first subset of each category from the sorted subsets of each category as training samples of the model and trains the model; after the first termination condition is reached, the second subset of each category is read and added to the model's training samples, and training continues until all subsets of every category have been used as training samples.
- the training of the model is thus divided into multiple subtasks performed in order of difficulty, from easy to hard, so that hard training samples are not discarded and the model learns the features of each category from easy to hard, improving the adaptability of the model parameters.
- further, among the sorted subsets, the higher a subset's ranking position, the larger its corresponding weight.
- for example, there are two categories, category A and category B.
- the sorted subsets in category A are subset A1 and subset A2; subset A1 has weight 1 and subset A2 has weight 0.5.
- the sorted subsets in category B are subset B1 and subset B2; subset B1 has weight 1 and subset B2 has weight 0.5.
- subsets A1 and B1 are read first to train the model. After the first termination condition is reached, subsets A2 and B2 are read and added to the model's training samples, so that subsets A1, B1, A2, and B2 all serve as training samples, and the model is trained until training ends.
- sample pictures of each part of a vehicle are obtained, the samples of one part forming one category of pictures. The samples of each part are processed with the sample data classification method of the first preferred embodiment to obtain multiple subsets per part, the subsets of each part are sorted with the method of the second preferred embodiment, and the vehicle part identification model is trained on the sorted subsets of each part.
- training the vehicle part identification model is thus divided into multiple subtasks performed in order of difficulty from easy to hard, so that hard training samples are not discarded, the model learns the features of the sample pictures of each part from easy to hard, and the adaptability of the model parameters improves.
- the present application divides the training sample data of each category into multiple subsets by difficulty, so that the distance between samples within the same subset becomes shorter and the distance between samples in different subsets becomes larger.
- the denser the samples of a subset, the more similar the features of the pictures they represent, and the more similar the data in that subset is to the category represented by the sample data; these are simple samples, whose features the model easily learns.
- conversely, the sparser the samples of a subset, the more diverse the pictures they represent; the subset's data is considered more complicated and constitutes hard samples.
- the multiple subsets of the training sample data are then sorted from easy to hard, dividing the training of the model into multiple subtasks performed in order of difficulty, so that hard training samples are not discarded, the model learns the features of each category from easy to hard, and the adaptability of the model parameters improves.
- the above integrated units implemented in the form of software program modules can be stored in a non-volatile readable storage medium.
- the software program modules are stored in a storage medium and include several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform parts of the methods of each embodiment of the present application.
- the electronic device 5 comprises at least one transmitting device 51, at least one memory 52, at least one processor 53, at least one receiving device 54, and at least one communication bus.
- the communication bus is used to implement connection and communication between these components.
- the electronic device 5 is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
- the electronic device 5 may also comprise a network device and/or a user device.
- the network device includes, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud-computing-based cloud composed of a large number of hosts or network servers, cloud computing being a kind of distributed computing: a super virtual computer consisting of a group of loosely coupled computers.
- the electronic device 5 can be, but is not limited to, any electronic product that can interact with a user through a keyboard, a touch pad, or a voice control device, for example a terminal such as a tablet computer, a smart phone, a personal digital assistant (PDA), a smart wearable device, a camera device, or a monitoring device.
- the network in which the electronic device 5 is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (VPN), and the like.
- the receiving device 54 and the sending device 51 may be wired transmission ports, or may be wireless devices, for example, including antenna devices, for performing data communication with other devices.
- the memory 52 is used to store program code.
- the memory 52 may be a circuit with a storage function that has no physical form within an integrated circuit, such as a RAM (Random-Access Memory) or a FIFO (First In First Out).
- alternatively, the memory 52 may be a memory with a physical form, such as a memory stick, a TF card (Trans-flash Card), a smart media card, a secure digital card, a flash card, or another storage device.
- the processor 53 can include one or more microprocessors, digital processors.
- the processor 53 can call program code stored in the memory 52 to perform related functions.
- for example, the modules described in FIG. 3 are program code stored in the memory 52 and executed by the processor 53 to implement a sample data classification method; and/or the modules described in FIG. 4
- are program code stored in the memory 52 and executed by the processor 53 to implement a model training method.
- the processor 53, also known as a central processing unit (CPU), is a very-large-scale integrated circuit serving as a computing core (Core) and a control core (Control Unit).
- the embodiment of the present application further provides a non-volatile readable storage medium having stored thereon computer instructions that, when executed by an electronic device including one or more processors, cause the electronic device to perform the method as described above.
- the memory 52 in the electronic device 5 stores a plurality of instructions to implement a sample data classification method, and the processor 53 can execute the plurality of instructions to implement:
- calculating features of each sample in the sample data; calculating a distance set of each sample according to the features of each sample, the distance set of each sample including the distance between that sample and each sample among the remaining samples corresponding to it; calculating a density value and a density distance value of each sample according to the distance set of each sample; determining at least one cluster center according to the density value and the density distance value of each sample; and clustering the sample data into a plurality of subsets based on the at least one cluster center and the features of each sample.
- when calculating the density value of each sample, the processor executing the plurality of instructions further includes:
- comparing each distance in the distance set of each sample with a distance threshold, determining the number of distances greater than the distance threshold, and taking the number corresponding to each sample as its density value.
- the calculating the density distance value of each sample comprises:
- for the sample with the largest density value, selecting the maximum distance from that sample's distance set as its density distance value;
- for any sample in the second sample set, determining the samples whose density values are larger than that sample's density value, and, according to that sample's distance set, taking the smallest distance from it to any of those higher-density samples as its density distance value, the second sample set including all samples of the sample data other than the sample with the largest density value.
- when determining at least one cluster center according to the density value and the density distance value of each sample, the processor executing the plurality of instructions further includes:
- calculating a cluster metric value for each sample according to its density value and density distance value; and determining at least one cluster center based on the cluster metric value of each sample.
- the cluster metric value of each sample is equal to the product of the density value of each sample and the density distance value of each sample.
- when determining the at least one cluster center according to the cluster metric value of each sample, the processor executing the plurality of instructions further includes:
- sorting the cluster metric values from large to small and selecting, from the sorted values, the samples ranking within a preset number of top positions as cluster center points; or
- selecting the samples whose cluster metric values are greater than a threshold as cluster center points.
- the processor executing the plurality of instructions further includes:
- determining a sample whose distance from every cluster center of the at least one cluster center exceeds a distance threshold to be an erroneous sample.
- the plurality of instructions corresponding to the sample data classification method of any embodiment are stored in the memory 52 and executed by the processor 53, and are not described in detail here.
- the memory 52 in the electronic device 5 stores a plurality of instructions to implement a model training method, and the processor 53 can execute the plurality of instructions to implement:
- acquiring sample data of each category; classifying the sample data of each category to obtain multiple subsets of each category; calculating the relevance of each subset among the multiple subsets of each category to the category in which the subset is located; sorting the multiple subsets of each category from high to low according to the relevance of each subset to its category, obtaining multiple sorted subsets of each category; and sequentially reading, from the multiple sorted subsets of each category, the subsets with the same sorting position as training samples of the model, and training the model.
- among the sorted subsets, a subset with a higher ranking position corresponds to a greater weight.
- the characteristic means of the present application described above can be implemented by an integrated circuit that controls the implementation of the functions of the sample data classification method in any of the above embodiments. That is, the integrated circuit of the present application is installed in the electronic device so that the electronic device performs the following functions: calculating features of each sample in the sample data; calculating a distance set of each sample according to the features of each sample, the distance set of each sample including the distance between that sample and each sample among the remaining samples corresponding to it; calculating a density value and a density distance value of each sample according to the distance set of each sample; determining at least one cluster center according to the density value and the density distance value of each sample; and clustering the sample data into a plurality of subsets based on the at least one cluster center and the features of each sample.
- the functions that can be implemented by the sample data classification method in any embodiment can thus be installed in the electronic device through the integrated circuit of the present application, so that the electronic device performs those functions; they are not detailed here.
- likewise, the characteristic means described above can be implemented by an integrated circuit that controls the implementation of the functions of the model training method in any of the above embodiments. That is, the integrated circuit of the present application is installed in the electronic device so that the electronic device performs the following functions: acquiring sample data of each category; classifying the sample data of each category to obtain multiple subsets of each category; calculating the relevance of each subset among the multiple subsets of each category to the category in which the subset is located; sorting the multiple subsets of each category from high to low according to the relevance of each subset to its category, obtaining multiple sorted subsets of each category; and sequentially reading, from the multiple sorted subsets of each category, the subsets with the same sorting position as training samples of the model, and training the model.
- the functions that can be implemented by the model training method in any embodiment can thus be installed in the electronic device through the integrated circuit of the present application, so that the electronic device performs those functions; they are not detailed here.
- the disclosed apparatus may be implemented in other ways.
- the device embodiments described above are merely illustrative.
- the division of units is only a logical functional division; in actual implementation there may be other divisions.
- for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
- the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be electrical or otherwise.
- the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
- each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
- the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
- the integrated unit, if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a non-volatile readable storage medium, with instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform the methods of the embodiments.
- the foregoing storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and the like.
Abstract
The present application provides a sample data classification method, comprising: calculating features of each sample in sample data; calculating a distance set of each sample according to the features of each sample, the distance set of each sample including the distance between that sample and each sample among the remaining samples corresponding to it; calculating a density value and a density distance value of each sample according to the distance set of each sample; determining at least one cluster center according to the density value and the density distance value of each sample; and clustering the sample data into a plurality of subsets based on the at least one cluster center and the features of each sample. The present application also provides a model training method using the sample data classification method, and an electronic device. The present application trains in order of task difficulty, from easy to hard, so that hard training samples are not discarded, thereby improving the adaptability of the model parameters.
Description
This application claims priority to Chinese patent application No. 201810350730.2, filed with the Chinese Patent Office on April 18, 2018 and entitled "Sample data classification method, model training method, electronic device and storage medium", the entire contents of which are incorporated herein by reference.
The present application relates to the field of data processing, and in particular to a sample data classification method, a model training method, an electronic device, and a storage medium.
In the process of large-scale data collection, noise samples (for example, irrelevant or erroneous sample data) inevitably appear. Algorithms that handle large numbers of erroneous labels are generally designed to be robust to noise: the model automatically detects highly relevant samples and noisy samples, discards the erroneous labels, and is then trained. The drawback of this approach is that hard training samples are difficult to distinguish from erroneous samples, so hard training samples get discarded, even though they are very important for improving model performance.
Summary of the invention
In view of the above, it is necessary to provide a sample data classification method, a model training method, an electronic device, and a storage medium that can train a vehicle part identification model in order from easy to hard, so that hard training samples are not discarded, the model learns the features of the sample pictures of each part from easy to hard, and the adaptability of the model parameters improves.
A sample data classification method, the method comprising:
calculating features of each sample in sample data;
calculating a distance set of each sample according to the features of each sample, the distance set of each sample including the distance between that sample and each sample among the remaining samples corresponding to it;
calculating a density value of each sample and a density distance value of each sample according to the distance set of each sample;
determining at least one cluster center according to the density value and the density distance value of each sample;
clustering the sample data into a plurality of subsets based on the at least one cluster center and the features of each sample.
A model training method, the method comprising:
acquiring sample data of each category;
classifying the sample data of each category using the sample data classification method described in any embodiment to obtain multiple subsets of each category;
calculating the relevance of each subset among the multiple subsets of each category to the category in which the subset is located;
sorting the multiple subsets of each category from high to low according to the relevance of each subset to its category, obtaining multiple sorted subsets of each category;
sequentially reading, from the multiple sorted subsets of each category, the subsets with the same sorting position as training samples of a model, and training the model.
An electronic device comprising a memory and a processor, the memory storing at least one instruction and the processor executing the at least one instruction to implement the sample data classification method described in any embodiment and/or the model training method described in any embodiment.
A non-volatile readable storage medium storing at least one instruction that, when executed by a processor, implements the sample data classification method described in any embodiment and/or the model training method described in any embodiment.
As can be seen from the above technical solutions, the present application calculates features of each sample in sample data; calculates a distance set of each sample according to the features of each sample, the distance set of each sample including the distance between that sample and each sample among the remaining samples corresponding to it; calculates a density value and a density distance value of each sample according to the distance set of each sample; determines at least one cluster center according to the density value and the density distance value of each sample; and clusters the sample data into a plurality of subsets based on the at least one cluster center and the features of each sample. The present application trains in order of task difficulty, from easy to hard, so that hard training samples are not discarded, thereby improving the adaptability of the model parameters.
FIG. 1 is a flowchart of a first preferred embodiment of the sample data classification method of the present application.
FIG. 2 is a flowchart of a first preferred embodiment of the model training method of the present application.
FIG. 3 is a program module diagram of a first preferred embodiment of the sample data classification apparatus of the present application.
FIG. 4 is a program module diagram of a first preferred embodiment of the model training apparatus of the present application.
FIG. 5 is a schematic structural diagram of a preferred embodiment of an electronic device in at least one example of the present application.
To enable those skilled in the art to better understand the solution of the present application, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present application; all other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
To make the above objects, features, and advantages of the present application more apparent and understandable, the present application is described in further detail below with reference to the accompanying drawings and specific embodiments.
The terms "first", "second", and "third" in the specification, claims, and drawings of the present application are used to distinguish different objects, not to describe a specific order. Furthermore, the term "comprise" and any variants thereof are intended to cover non-exclusive inclusion: a process, method, system, product, or device that comprises a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally includes other steps or units inherent to the process, method, product, or device.
FIG. 1 is a flowchart of the first preferred embodiment of the sample data classification method of the present application. The order of the steps in the flowchart may be changed according to different requirements, and some steps may be omitted.
S10: The electronic device calculates features of each sample of the sample data.
In an optional embodiment, the sample data includes, but is not limited to, pre-collected data and data crawled from the network. During large-scale sample data collection, data that is weakly correlated with the category indicated by the sample data, or outright erroneous data, inevitably appears. To improve the accuracy of subsequent model training, the sample data needs to be classified so as to automatically detect the simple samples whose features are easy to learn during model training and the hard samples whose features are difficult to learn, thereby classifying the sample data.
Preferably, the features of each sample are extracted using a feature extraction model. Further, the feature extraction model includes, but is not limited to, a deep convolutional neural network model. Features are extracted from the sample data by a deep convolutional neural network; for example, in any such network (VGG-16, ResNet-50, etc.), the layer in front of the Soft-max classification layer can be regarded as a feature extractor, and the output of that layer is taken as the extracted features.
In this embodiment, the deep convolutional neural network model is composed of 1 input layer, 20 convolution layers, 6 pooling layers, 3 hidden layers, and 1 classification layer. The model architecture of the deep convolutional neural network model is shown in FIG. 3, where Conv a-b (for example, Conv 3-64) indicates that the dimension of that layer's convolution kernels is a×a and the number of convolution kernels in that layer is b; Maxpool2 indicates that the pooling kernel of the pooling layer has a dimension of 2×2; FC-c (for example, FC-6) indicates that the hidden layer (i.e., the fully connected layer) has c output nodes; and Soft-max indicates that the classification layer classifies the input image using the Soft-max classifier.
In this embodiment, a trained deep convolutional neural network model is obtained by training on training samples. Inputting the sample data into the trained deep convolutional neural network model accurately and automatically extracts the features of each sample in the sample data. In general, the larger the training sample size, the more accurate the features extracted by the trained deep convolutional neural network model. Of course, the deep convolutional neural network model can also take other forms; the present application imposes no limitation.
S11: The electronic device calculates a distance set of each sample according to the features of each sample.
Preferably, the distance set of each sample includes the distance between that sample and each sample among its remaining samples, where the remaining samples corresponding to a sample are all samples in the sample data other than that sample. For example, if the sample data contains 3 samples, sample A, sample B, and sample C, then for sample A the distances between A and B and between A and C are calculated; for sample B, the distances between B and A and between B and C; and for sample C, the distances between C and A and between C and B. The distance matrix is then a 3*2 or 2*3 matrix.
Further, the distance includes, but is not limited to, a Euclidean distance, a cosine distance, and the like. Each distance value in the distance matrix is greater than 0; for example, when a calculated cosine distance is less than 0, its absolute value is taken.
S12: The electronic device calculates a density value of each sample and a density distance value of each sample according to the distance set of each sample.
Preferably, each distance in the distance set of each sample is compared with a distance threshold, the number of distances greater than the distance threshold is determined, and the number corresponding to each sample is taken as the density value of that sample. The larger a sample's density value, the more samples are similar to it.
Specifically, for any sample, its density value is calculated as ρ_i = Σ_{j≠i} χ(D_ij − d_c), where χ(x) = 1 if x > 0 and χ(x) = 0 otherwise; ρ_i represents the density value of the i-th sample in the sample data, D_ij represents the distance between the i-th sample and the j-th sample in the sample data, and d_c represents the distance threshold.
Preferably, calculating the density distance value of each sample includes:
(1) for the sample with the largest density value, selecting the maximum distance from that sample's distance set as its density distance value;
(2) for any sample in the second sample set, determining the samples whose density values are larger than that sample's density value, and, according to that sample's distance set, determining the smallest distance from it to any of those higher-density samples and taking that smallest distance as its density distance value; the second sample set includes all samples of the sample data other than the sample with the largest density value.
Specifically, the density distance value of each sample is calculated as δ_i = max_j D_ij when ρ_i is the largest density value, and δ_i = min_{j: ρ_j > ρ_i} D_ij otherwise, where δ_i represents the density distance value of the i-th sample, ρ_i the density value of the i-th sample, ρ_j the density value of the j-th sample, and D_ij the distance between the i-th sample and the j-th sample.
S13: The electronic device determines at least one cluster center according to the density value and the density distance value of each sample.
Preferably, determining at least one cluster center according to the density value and the density distance value of each sample includes:
A. calculating a cluster metric value for each sample according to its density value and density distance value.
Further, the cluster metric value of each sample is equal to the product of the density value of each sample and the density distance value of each sample.
B. determining at least one cluster center according to the cluster metric value of each sample.
Further, determining at least one cluster center according to the cluster metric value of each sample includes:
(1) sorting the cluster metric values from large to small and selecting, from the sorted values, the samples whose cluster metric values rank within a preset number of top positions (for example, the top three) as cluster center points; or
(2) selecting the samples whose cluster metric values are greater than a threshold as cluster center points.
Further, the threshold is configured according to the cluster metric values of the samples; for example, the mean of the cluster metric values is calculated and taken as the threshold.
S14: The electronic device clusters the sample data into a plurality of subsets based on the at least one cluster center and the features of each sample.
Preferably, the sample data is clustered into a plurality of subsets by a clustering algorithm according to the distance set of the sample corresponding to each cluster center of the at least one cluster center.
Further, the clustering algorithm includes, but is not limited to, a k-means clustering algorithm, a hierarchical clustering algorithm, and the like.
Further, a sample whose distance from every cluster center of the at least one cluster center exceeds a distance threshold is determined to be an erroneous sample; this effectively eliminates erroneous samples.
In the above embodiment, the larger a sample's density value, the more samples are similar to it, and the larger a sample's density distance value, the farther the subset containing that sample is from the other subsets. After clustering as in the above embodiment, the distance between samples within the same subset becomes shorter, while the distance between samples in different subsets becomes larger. The denser the samples of a subset, the more similar the features of the pictures they represent, and the more similar the data in that subset is to the category represented by the sample data; these are simple samples, whose features the model easily learns. Conversely, the sparser the samples of a subset, the more diverse the pictures they represent, the more complicated the subset's data is considered to be; these are hard samples. Moreover, clustering the sample data around the cluster centers selected in the above embodiment effectively eliminates erroneous samples, improving the accuracy of the parameters of the subsequently trained model.
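For illustration, a sketch of the full classification pipeline of steps S10-S14, chaining the helper functions sketched earlier (distance_sets, density_values, density_distance_values, select_centers, assign_subsets); the thresholds d_c and error_threshold are assumed hyperparameters:

```python
# Sketch: end-to-end sample data classification, reusing the functions
# sketched in the preceding sections.
import numpy as np

def classify_sample_data(features, d_c, error_threshold, top_k=None):
    D = distance_sets(features)                        # S11: pairwise distances
    rho = density_values(D, d_c)                       # S12: density values
    delta = density_distance_values(D, rho)            # S12: density distances
    centers = select_centers(rho, delta, top_k=top_k)  # S13: cluster centers
    return assign_subsets(D, centers, error_threshold) # S14: subsets + errors

features = np.random.rand(50, 128)                     # stand-in for CNN features
subsets, errors = classify_sample_data(features, d_c=0.5,
                                       error_threshold=0.9, top_k=3)
```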
FIG. 2 is a flowchart of the second preferred embodiment of the model training method of the present application. The order of the steps in the flowchart may be changed according to different requirements, and some steps may be omitted.
S20: The electronic device acquires sample data of each category.
In this embodiment, the trained model is used to identify the category to which a picture to be detected belongs. For example, the model is a vehicle part identification model used to identify which part of a vehicle the part in the picture under test belongs to. Sample data of the various parts of the vehicle must therefore be obtained, the sample data of one part belonging to one category.
S21: The electronic device classifies the sample data of each category to obtain multiple subsets of each category, using the sample data classification method of the first preferred embodiment.
In this embodiment, the processing of step S21 is the same as the data classification method of the first preferred embodiment and is not described in detail here.
S22: The electronic device calculates the relevance of each subset among the multiple subsets of each category to the category in which the subset is located.
After clustering as in the above embodiment, for each category the distance between samples within the same subset becomes shorter, while the distance between samples in different subsets becomes larger. The denser the samples of a subset, the more similar the features of the pictures they represent, the more related the data in that subset is to the category in which the subset is located, and the higher the similarity; these are simple samples. Conversely, the sparser the samples of a subset, the more diverse the pictures they represent; these are hard samples.
Preferably, for each category, the number of samples contained in each subset is taken as the relevance of that subset to the category in which it is located. For example, for one category, three subsets are obtained after clustering: a first subset, a second subset, and a third subset. If the first subset has 40 samples, the second subset has 100 samples, and the third subset has 10 samples, the value 40 is used to indicate the similarity between the first subset and the category in which it is located.
S23: The electronic device sorts the multiple subsets of each category from high to low according to the relevance of each subset to its category, obtaining multiple sorted subsets of each category.
For example, for one category, three subsets are obtained after clustering: a first subset, a second subset, and a third subset. If the first subset has 40 samples, the second subset has 100 samples, and the third subset has 10 samples, the sorted subsets of that category are: the second subset, the first subset, the third subset.
S24: The electronic device sequentially reads, from the multiple sorted subsets of each category, the subsets with the same sorting position as training samples of the model, and trains the model.
Preferably, the first subset of each category is read from the sorted subsets of each category as training samples of the model, and the model is trained; after a first termination condition is reached, the second subset of each category is read and added to the model's training samples, and training continues until all subsets of every category have been used as training samples. Because the subsets of each category are sorted from high to low by their relevance to the category, simple samples are ranked first; simple samples are easier to train on, while hard samples are ranked later and are harder to train on. Training the model is thus divided into multiple subtasks performed in order of difficulty, from easy to hard, so that hard training samples are not discarded, the model learns the features of each category from easy to hard, and the adaptability of the model parameters improves.
Further, among the multiple sorted subsets, the higher a subset's ranking position, the larger its corresponding weight. Samples with higher similarity thus carry larger weights, so more of their features are learned during training, improving the accuracy of the model parameters; subsets with high confidence can therefore be relied upon to improve the accuracy of model identification.
For example, there are two categories, category A and category B. The sorted subsets in category A are subset A1 and subset A2, with subset A1 having weight 1 and subset A2 having weight 0.5. The sorted subsets in category B are subset B1 and subset B2, with subset B1 having weight 1 and subset B2 having weight 0.5. Subsets A1 and B1 are read first to train the model; after the first termination condition is reached, subsets A2 and B2 are read and added to the model's training samples, so that subsets A1, B1, A2, and B2 all serve as training samples, and the model is trained until training ends.
Using the above method, an application scenario for training a vehicle part identification model is as follows:
First, sample pictures of each part of a vehicle are obtained, the samples of one part forming one category of pictures. The samples of each part are processed with the sample data classification method of the first preferred embodiment to obtain multiple subsets per part, the subsets of each part are sorted with the method of the second preferred embodiment, and the vehicle part identification model is trained on the sorted subsets of each part. Training the vehicle part identification model is thus divided into multiple subtasks performed in order of difficulty from easy to hard, so that hard training samples are not discarded, the model learns the features of the sample pictures of each part from easy to hard, and the adaptability of the model parameters improves.
As the above embodiments show, the present application divides the training sample data of each category into multiple subsets by difficulty, so that the distance between samples within the same subset becomes shorter and the distance between samples in different subsets becomes larger. The denser the samples of a subset, the more similar the features of the pictures they represent, and the more similar the data in that subset is to the category represented by the sample data; these are simple samples, whose features the model easily learns. Conversely, the sparser the samples of a subset, the more diverse the pictures, the more complicated the subset's data is considered to be; these are hard samples. The multiple subsets of the training sample data are then sorted from easy to hard, dividing the training of the model into multiple subtasks performed in order of difficulty, so that hard training samples are not discarded, the model learns the features of each category from easy to hard, and the adaptability of the model parameters improves.
FIG. 3 is a program module diagram of the first preferred embodiment of the sample data classification apparatus of the present application. The sample data classification apparatus 3 includes, but is not limited to, one or more of the following modules: a calculation module 30, a determination module 31, and a clustering module 32. A module referred to in this application is a series of computer-readable instruction segments, stored in a memory, that can be executed by a processor of the sample data classification apparatus 3 and that perform a fixed function. The functions of the modules are detailed in the subsequent embodiments.
The calculation module 30 calculates features of each sample of the sample data.
In an optional embodiment, the sample data includes, but is not limited to, pre-collected data and data crawled from the network. During large-scale sample data collection, data that is weakly correlated with the category indicated by the sample data, or outright erroneous data, inevitably appears. To improve the accuracy of subsequent model training, the sample data needs to be classified so as to automatically detect the simple samples whose features are easy to learn during model training and the hard samples whose features are difficult to learn, thereby classifying the sample data.
Preferably, the calculation module 30 extracts the features of each sample using a feature extraction model. Further, the feature extraction model includes, but is not limited to, a deep convolutional neural network model. Features are extracted from the sample data by a deep convolutional neural network; for example, in any such network (VGG-16, ResNet-50, etc.), the layer in front of the Soft-max classification layer can be regarded as a feature extractor, and the output of that layer is taken as the extracted features.
In this embodiment, the deep convolutional neural network model is composed of 1 input layer, 20 convolution layers, 6 pooling layers, 3 hidden layers, and 1 classification layer. The model architecture of the deep convolutional neural network model is shown in FIG. 3, where Conv a-b (for example, Conv 3-64) indicates that the dimension of that layer's convolution kernels is a×a and the number of convolution kernels in that layer is b; Maxpool2 indicates that the pooling kernel of the pooling layer has a dimension of 2×2; FC-c (for example, FC-6) indicates that the hidden layer (i.e., the fully connected layer) has c output nodes; and Soft-max indicates that the classification layer classifies the input image using the Soft-max classifier.
In this embodiment, a trained deep convolutional neural network model is obtained by training on training samples. Inputting the sample data into the trained deep convolutional neural network model accurately and automatically extracts the features of each sample in the sample data. In general, the larger the training sample size, the more accurate the features extracted by the trained deep convolutional neural network model. Of course, the deep convolutional neural network model can also take other forms; the present application imposes no limitation.
The calculation module 30 calculates a distance set of each sample according to the features of each sample.
Preferably, the distance set of each sample includes the distance between that sample and each sample among its remaining samples, where the remaining samples corresponding to a sample are all samples in the sample data other than that sample. For example, if the sample data contains 3 samples, sample A, sample B, and sample C, then for sample A the distances between A and B and between A and C are calculated; for sample B, the distances between B and A and between B and C; and for sample C, the distances between C and A and between C and B. The distance matrix is then a 3*2 or 2*3 matrix.
Further, the distance includes, but is not limited to, a Euclidean distance, a cosine distance, and the like. Each distance value in the distance matrix is greater than 0; for example, when a calculated cosine distance is less than 0, its absolute value is taken.
The calculation module 30 calculates a density value of each sample and a density distance value of each sample according to the distance set of each sample.
Preferably, the calculation module 30 compares each distance in the distance set of each sample with a distance threshold, determines the number of distances greater than the distance threshold, and takes the number corresponding to each sample as its density value. The larger a sample's density value, the more samples are similar to it.
Specifically, for any sample, its density value is calculated as ρ_i = Σ_{j≠i} χ(D_ij − d_c), where χ(x) = 1 if x > 0 and χ(x) = 0 otherwise; ρ_i represents the density value of the i-th sample in the sample data, D_ij represents the distance between the i-th sample and the j-th sample in the sample data, and d_c represents the distance threshold.
Preferably, the calculation module 30 calculating the density distance value of each sample includes:
(1) for the sample with the largest density value, selecting the maximum distance from that sample's distance set as its density distance value;
(2) for any sample in the second sample set, determining the samples whose density values are larger than that sample's density value, and, according to that sample's distance set, determining the smallest distance from it to any of those higher-density samples and taking that smallest distance as its density distance value; the second sample set includes all samples of the sample data other than the sample with the largest density value.
Specifically, the density distance value of each sample is calculated as δ_i = max_j D_ij when ρ_i is the largest density value, and δ_i = min_{j: ρ_j > ρ_i} D_ij otherwise, where δ_i represents the density distance value of the i-th sample, ρ_i the density value of the i-th sample, ρ_j the density value of the j-th sample, and D_ij the distance between the i-th sample and the j-th sample.
The determination module 31 determines at least one cluster center according to the density value and the density distance value of each sample.
Preferably, the determination module 31 determining at least one cluster center according to the density value and the density distance value of each sample includes:
A. calculating a cluster metric value for each sample according to its density value and density distance value.
Further, the cluster metric value of each sample is equal to the product of the density value of each sample and the density distance value of each sample.
B. determining at least one cluster center according to the cluster metric value of each sample.
Further, the determination module 31 determining at least one cluster center according to the cluster metric value of each sample includes:
(1) sorting the cluster metric values from large to small and selecting, from the sorted values, the samples whose cluster metric values rank within a preset number of top positions (for example, the top three) as cluster center points; or
(2) selecting the samples whose cluster metric values are greater than a threshold as cluster center points.
Further, the threshold is configured according to the cluster metric values of the samples; for example, the mean of the cluster metric values is calculated and taken as the threshold.
The clustering module 32 clusters the sample data into a plurality of subsets based on the at least one cluster center and the features of each sample.
Preferably, the clustering module 32 clusters the sample data into a plurality of subsets by a clustering algorithm according to the distance set of the sample corresponding to each cluster center of the at least one cluster center.
Further, the clustering algorithm includes, but is not limited to, a k-means clustering algorithm, a hierarchical clustering algorithm, and the like.
Further, the determination module 31 determines a sample whose distance from every cluster center of the at least one cluster center exceeds a distance threshold to be an erroneous sample; this effectively eliminates erroneous samples.
In the above embodiment, the larger a sample's density value, the more samples are similar to it, and the larger a sample's density distance value, the farther the subset containing that sample is from the other subsets. After clustering as in the above embodiment, the distance between samples within the same subset becomes shorter, while the distance between samples in different subsets becomes larger. The denser the samples of a subset, the more similar the features of the pictures they represent, and the more similar the data in that subset is to the category represented by the sample data; these are simple samples, whose features the model easily learns. Conversely, the sparser the samples of a subset, the more diverse the pictures they represent, the more complicated the subset's data is considered to be; these are hard samples. Moreover, clustering the sample data around the cluster centers selected in the above embodiment effectively eliminates erroneous samples, improving the accuracy of the parameters of the subsequently trained model.
FIG. 4 is a program module diagram of the first preferred embodiment of the model training apparatus of the present application. The model training apparatus 4 includes, but is not limited to, one or more of the following modules: a data acquisition module 40, a data clustering module 41, a relevance calculation module 42, a sorting module 43, and a training module 44. A module referred to in this application is a series of computer-readable instruction segments, stored in a memory, that can be executed by a processor of the model training apparatus 4 and that perform a fixed function. The functions of the modules are detailed in the subsequent embodiments.
The data acquisition module 40 acquires sample data of each category.
In this embodiment, the trained model is used to identify the category to which a picture to be detected belongs. For example, the model is a vehicle part identification model used to identify which part of a vehicle the part in the picture under test belongs to. Sample data of the various parts of the vehicle must therefore be obtained, the sample data of one part belonging to one category.
The data clustering module 41 classifies the sample data of each category to obtain multiple subsets of each category, using the sample data classification method of the first preferred embodiment.
In this embodiment, the data clustering module 41 implements the sample data classification method of the first preferred embodiment and is not described in detail here.
The relevance calculation module 42 calculates the relevance of each subset among the multiple subsets of each category to the category in which the subset is located.
After clustering as in the above embodiment, for each category the distance between samples within the same subset becomes shorter, while the distance between samples in different subsets becomes larger. The denser the samples of a subset, the more similar the features of the pictures they represent, the more related the data in that subset is to the category in which the subset is located, and the higher the similarity; these are simple samples. Conversely, the sparser the samples of a subset, the more diverse the pictures they represent; these are hard samples.
Preferably, for each category, the number of samples contained in each subset is taken as the relevance of that subset to the category in which it is located. For example, for one category, three subsets are obtained after clustering: a first subset, a second subset, and a third subset. If the first subset has 40 samples, the second subset has 100 samples, and the third subset has 10 samples, the value 40 is used to indicate the similarity between the first subset and the category in which it is located.
The sorting module 43 sorts the plurality of subsets of each category from high to low according to the relevance between each subset and the category, obtaining a plurality of sorted subsets for each category.
For example, for one category, three subsets are obtained after clustering: a first subset, a second subset, and a third subset. If the first subset has 40 samples, the second subset has 100 samples, and the third subset has 10 samples, the sorted subsets of the category are: the second subset, the first subset, the third subset.
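A one-line sketch of the relevance-based sort, using the subset size as the relevance value exactly as in the example above:

```python
def sort_subsets_by_relevance(subsets: list) -> list:
    """subsets: one category's subsets, each a list of samples.
    Relevance of a subset is its sample count; sort from high to low,
    so denser (simpler) subsets come first."""
    return sorted(subsets, key=len, reverse=True)

# Sizes 40, 100, 10 sort to 100, 40, 10: second, first, third subset.
sizes = [len(s) for s in sort_subsets_by_relevance([[0] * 40, [0] * 100, [0] * 10])]
print(sizes)  # [100, 40, 10]
```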
The training module 44 reads, from the plurality of sorted subsets of each category in turn, the subsets with the same sorting position as training samples of the model, and trains the model.
Preferably, the training module 44 reads the first subset of each category from the plurality of sorted subsets of each category as training samples of the model and trains the model; after a first termination condition is reached, it reads the second subset of each category, adds the second subset of each category to the training samples of the model, and continues training the model, until all subsets of every category have served as training samples. The training of the model is thus divided into a plurality of subtasks trained in order of difficulty, from easy to difficult, so that difficult training samples are not discarded, the model learns the features of each category from easy to difficult, and the adaptability of the model parameters improves.
Further, among the plurality of sorted subsets, the earlier the sorting position of a subset, the larger the weight corresponding to it. Samples of higher similarity thus carry larger weights, so the model learns more of their features during training, improving the accuracy of the model parameters. The high-confidence subsets can therefore be relied on to raise the recognition accuracy of the model.
For example, there are two categories, category A and category B. The sorted subsets of category A are subset A1 and subset A2, with subset A1 weighted 1 and subset A2 weighted 0.5; the sorted subsets of category B are subset B1 and subset B2, with subset B1 weighted 1 and subset B2 weighted 0.5. Subsets A1 and B1 are read first to train the model; after the first termination condition is reached, subsets A2 and B2 are read and added to the training samples of the model, so that subsets A1, B1, A2, and B2 all serve as training samples, and the model is trained until training ends.
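A sketch of this staged, easy-to-difficult schedule follows. The `train_stage` callback (which trains until the stage's termination condition) and the halving weight schedule beyond the 1 and 0.5 of the example are assumptions of the sketch.

```python
def curriculum_train(model, sorted_subsets_by_cat: dict, train_stage) -> None:
    """sorted_subsets_by_cat: {category: [subset_0, subset_1, ...]},
    each category's subsets already sorted from easy to difficult.
    train_stage(model, samples, weights) is supplied by the caller and
    trains until the stage's termination condition is reached."""
    n_stages = max(len(s) for s in sorted_subsets_by_cat.values())
    pool, weights = [], []
    for stage in range(n_stages):
        for subsets in sorted_subsets_by_cat.values():
            if stage < len(subsets):
                w = 1.0 / (2 ** stage)  # assumed weights: 1, 0.5, 0.25, ...
                pool.extend(subsets[stage])
                weights.extend([w] * len(subsets[stage]))
        # Earlier (easier) subsets stay in the pool as new ones are added.
        train_stage(model, pool, weights)
```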
An example application scenario for training the vehicle part recognition model is as follows:
First, sample pictures of each part of the vehicle are acquired, the samples of one part being pictures of one category. The samples of any one part are processed with the sample data classification method of the first preferred embodiment to obtain a plurality of subsets for each part; the plurality of subsets of each part are sorted with the method of the second preferred embodiment; and the vehicle part recognition model is trained on the plurality of sorted subsets of each part. The training of the vehicle part recognition model is thus divided into a plurality of subtasks trained in order of difficulty, from easy to difficult, so that difficult training samples are not discarded, the model learns the features of the sample pictures of each part from easy to difficult, and the adaptability of the model parameters improves.
As can be seen from the above embodiments, the present application classifies the training sample data of each category into a plurality of subsets according to difficulty, so that the distance between samples in the same subset becomes shorter and the distance between samples in different subsets becomes larger. The denser the samples in a subset, the more similar the features of the pictures, and the more similar the data in the subset is to the category represented by the sample data; such data are simple samples, whose features the model learns easily. Conversely, the sparser the samples in a subset, the more diverse the pictures, and the data in the subset is considered complex, constituting difficult samples. The plurality of subsets of the training sample data are then sorted from easy to difficult, so that the training of the model is divided into a plurality of subtasks trained in order of difficulty, from easy to difficult. This prevents difficult training samples from being discarded, lets the model learn the features of each category from easy to difficult, and thereby improves the adaptability of the model parameters.
The integrated unit implemented in the form of software program modules described above may be stored in a non-volatile readable storage medium. The software program modules are stored in a storage medium and include several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform part of the steps of the method described in each embodiment of the present application.
As shown in FIG. 5, the electronic device 5 includes at least one transmitting apparatus 51, at least one memory 52, at least one processor 53, at least one receiving apparatus 54, and at least one communication bus. The communication bus is used to realize connection and communication among these components.
The electronic device 5 is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions. Its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like. The electronic device 5 may also include a network device and/or a user device. The network device includes, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing, where cloud computing is a form of distributed computing: a super virtual computer composed of a group of loosely coupled computers.
The electronic device 5 may be, but is not limited to, any electronic product capable of human-machine interaction with a user via a keyboard, a touchpad, a voice control device, or the like, for example, a terminal such as a tablet computer, a smartphone, a personal digital assistant (PDA), a smart wearable device, a camera device, or a monitoring device.
The network in which the electronic device 5 is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (VPN), and the like.
The receiving apparatus 54 and the transmitting apparatus 51 may be wired transmission ports, or may be wireless devices, for example including antenna apparatuses, for data communication with other devices.
The memory 52 is used to store program code. The memory 52 may be a circuit with a storage function that has no physical form within an integrated circuit, such as RAM (Random-Access Memory) or FIFO (First In First Out) memory. Alternatively, the memory 52 may be a memory with a physical form, such as a memory module, a TF card (Trans-flash Card), a smart media card, a secure digital card, a flash card, or another storage device.
The processor 53 may include one or more microprocessors or digital processors. The processor 53 may call the program code stored in the memory 52 to perform the related functions. For example, the modules described in FIG. 3 are program code stored in the memory 52 and executed by the processor 53 to implement a sample data classification method; and/or the modules described in FIG. 4 are program code stored in the memory 52 and executed by the processor 53 to implement a model training method. The processor 53, also known as a central processing unit (CPU), is a very-large-scale integrated circuit serving as the computing core and the control unit.
An embodiment of the present application further provides a non-volatile readable storage medium having computer instructions stored thereon which, when executed by an electronic device including one or more processors, cause the electronic device to perform the sample data classification method and/or the model training method described in the method embodiments above.
With reference to FIG. 1, the memory 52 of the electronic device 5 stores a plurality of instructions to implement a sample data classification method, and the processor 53 can execute the plurality of instructions to implement:
calculating the feature of each sample in the sample data; calculating the distance set of each sample according to the feature of each sample, the distance set of each sample including the distance between the sample and each sample among the remaining samples corresponding to it; calculating the density value of each sample and the density distance value of each sample according to the distance set of each sample; determining at least one cluster center according to the density value of each sample and the density distance value of each sample; and clustering the sample data into a plurality of subsets based on the at least one cluster center and the feature of each sample.
According to a preferred embodiment of the present application, when calculating the density value of each sample, the processor executing the plurality of instructions further implements:
comparing each distance in the distance set of each sample with a distance threshold, determining the number of distances greater than the distance threshold, and taking the number of distances corresponding to each sample as the density value of the sample.
According to a preferred embodiment of the present application, the calculating the density distance value of each sample includes:
for the sample with the largest density value, selecting the maximum distance from the distance set of that sample as its density distance value;
for any sample in a second sample set, determining the samples whose density values are greater than that of the sample; determining, according to the distance set of the sample, the shortest distance to the sample among the samples of greater density, and taking this shortest distance as the density distance value of the sample, the second sample set including all samples in the sample data other than the sample with the largest density value.
According to a preferred embodiment of the present application, when determining at least one cluster center according to the density value of each sample and the density distance value of each sample, the processor executing the plurality of instructions further implements:
calculating the cluster metric value of each sample according to the density value of the sample and the density distance value of the sample;
determining at least one cluster center according to the cluster metric value of each sample.
According to a preferred embodiment of the present application, the cluster metric value of each sample equals the product of the density value of the sample and the density distance value of the sample.
According to a preferred embodiment of the present application, when determining at least one cluster center according to the cluster metric value of each sample, the processor executing the plurality of instructions further implements:
sorting the cluster metric values of the samples in descending order and selecting, from the sorted cluster metric values, the samples ranked within a preset number of positions as cluster center points;
selecting, according to the cluster metric value of each sample, the samples whose cluster metric values are greater than a threshold as cluster center points.
According to a preferred embodiment of the present application, the processor executing the plurality of instructions further implements:
determining, as error samples, the samples whose distance to each of the at least one cluster center exceeds a distance threshold.
The plurality of instructions corresponding to the sample data classification method in any embodiment are stored in the memory 52 and executed by the processor 53, and are not detailed again here.
With reference to FIG. 2, the memory 52 of the electronic device 5 stores a plurality of instructions to implement a model training method, and the processor 53 can execute the plurality of instructions to implement:
acquiring the sample data of each category; classifying the sample data of each category to obtain a plurality of subsets of each category; calculating, for each of the plurality of subsets of each category, the relevance between the subset and the category to which it belongs; sorting the plurality of subsets of each category from high to low according to the relevance between each subset and the category, obtaining a plurality of sorted subsets for each category; and reading, from the plurality of sorted subsets of each category in turn, the subsets with the same sorting position as training samples of the model, and training the model.
According to a preferred embodiment of the present application, among the plurality of sorted subsets, the earlier the sorting position of a subset, the larger the weight corresponding to it.
The characteristic means of the present application described above may be realized by an integrated circuit that controls the implementation of the functions of the sample data classification method in any of the above embodiments. That is, the integrated circuit of the present application is installed in the electronic device and causes the electronic device to perform the following functions: calculating the feature of each sample in the sample data; calculating the distance set of each sample according to the feature of each sample, the distance set of each sample including the distance between the sample and each sample among the remaining samples corresponding to it; calculating the density value of each sample and the density distance value of each sample according to the distance set of each sample; determining at least one cluster center according to the density value of each sample and the density distance value of each sample; and clustering the sample data into a plurality of subsets based on the at least one cluster center and the feature of each sample.
All functions that the sample data classification method in any embodiment can realize can be achieved by installing the integrated circuit of the present application in the electronic device so that the electronic device performs them, and they are not detailed again here.
The characteristic means of the present application described above may likewise be realized by an integrated circuit that controls the implementation of the functions of the model training method in any of the above embodiments. That is, the integrated circuit of the present application is installed in the electronic device and causes the electronic device to perform the following functions: acquiring the sample data of each category; classifying the sample data of each category to obtain a plurality of subsets of each category; calculating, for each of the plurality of subsets of each category, the relevance between the subset and the category to which it belongs; sorting the plurality of subsets of each category from high to low according to the relevance between each subset and the category, obtaining a plurality of sorted subsets for each category; and reading, from the plurality of sorted subsets of each category in turn, the subsets with the same sorting position as training samples of the model, and training the model.
All functions that the model training method in any embodiment can realize can be achieved by installing the integrated circuit of the present application in the electronic device so that the electronic device performs them, and they are not detailed again here.
It should be noted that, for brevity of description, the foregoing method embodiments are all expressed as series of combined actions, but those skilled in the art should know that the present application is not limited by the described order of actions, because according to the present application some steps may be performed in another order or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present application.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, reference may be made to the related descriptions of the other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division of the units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical or take other forms.
The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a non-volatile readable storage medium. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The above embodiments are only used to illustrate the technical solutions of the present application and not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.
Claims (20)
- A sample data classification method, characterized in that the method comprises: calculating the feature of each sample in the sample data; calculating the distance set of each sample according to the feature of each sample, the distance set of each sample comprising the distance between the sample and each sample among the remaining samples corresponding to it; calculating the density value of each sample and the density distance value of each sample according to the distance set of each sample; determining at least one cluster center according to the density value of each sample and the density distance value of each sample; and clustering the sample data into a plurality of subsets based on the at least one cluster center and the feature of each sample.
- The sample data classification method according to claim 1, characterized in that the calculating the density value of each sample comprises: comparing each distance in the distance set of each sample with a distance threshold, determining the number of distances greater than the distance threshold, and taking the number of distances corresponding to each sample as the density value of the sample.
- The sample data classification method according to claim 1, characterized in that the calculating the density distance value of each sample comprises: for the sample with the largest density value, selecting the maximum distance from the distance set of that sample as its density distance value; for any sample in a second sample set, determining the samples whose density values are greater than that of the sample, determining, according to the distance set of the sample, the shortest distance to the sample among the samples of greater density, and taking this shortest distance as the density distance value of the sample, the second sample set comprising all samples in the sample data other than the sample with the largest density value.
- The sample data classification method according to claim 1, characterized in that the determining at least one cluster center according to the density value of each sample and the density distance value of each sample comprises: calculating the cluster metric value of each sample according to the density value of the sample and the density distance value of the sample; and determining at least one cluster center according to the cluster metric value of each sample, the cluster metric value of each sample being equal to the product of the density value of the sample and the density distance value of the sample.
- The sample data classification method according to claim 4, characterized in that the determining at least one cluster center according to the cluster metric value of each sample comprises: sorting the cluster metric values of the samples in descending order and selecting, from the sorted cluster metric values, the samples ranked within a preset number of positions as cluster center points; selecting, according to the cluster metric value of each sample, the samples whose cluster metric values are greater than a threshold as cluster center points.
- The sample data classification method according to claim 1, characterized in that the method further comprises: determining, as error samples, the samples whose distance to each of the at least one cluster center exceeds a distance threshold.
- A model training method, characterized in that the method comprises: acquiring the sample data of each category; classifying the sample data of each category by using the sample data classification method according to any one of claims 1 to 6 to obtain a plurality of subsets of each category; calculating, for each of the plurality of subsets of each category, the relevance between the subset and the category to which it belongs; sorting the plurality of subsets of each category from high to low according to the relevance between each subset and the category, obtaining a plurality of sorted subsets for each category; and reading, from the plurality of sorted subsets of each category in turn, the subsets with the same sorting position as training samples of the model, and training the model.
- The model training method according to claim 7, characterized in that, among the plurality of sorted subsets, the earlier the sorting position of a subset, the larger the weight corresponding to it.
- An electronic device, characterized in that the electronic device comprises a memory and a processor, the memory being configured to store at least one instruction, and the processor being configured to execute the at least one instruction to implement the following steps: calculating the feature of each sample in the sample data; calculating the distance set of each sample according to the feature of each sample, the distance set of each sample comprising the distance between the sample and each sample among the remaining samples corresponding to it; calculating the density value of each sample and the density distance value of each sample according to the distance set of each sample; determining at least one cluster center according to the density value of each sample and the density distance value of each sample; and clustering the sample data into a plurality of subsets based on the at least one cluster center and the feature of each sample.
- The electronic device according to claim 9, characterized in that the calculating the density value of each sample comprises: comparing each distance in the distance set of each sample with a distance threshold, determining the number of distances greater than the distance threshold, and taking the number of distances corresponding to each sample as the density value of the sample.
- The electronic device according to claim 9, characterized in that the calculating the density distance value of each sample comprises: for the sample with the largest density value, selecting the maximum distance from the distance set of that sample as its density distance value; for any sample in a second sample set, determining the samples whose density values are greater than that of the sample, determining, according to the distance set of the sample, the shortest distance to the sample among the samples of greater density, and taking this shortest distance as the density distance value of the sample, the second sample set comprising all samples in the sample data other than the sample with the largest density value.
- The electronic device according to claim 9, characterized in that the determining at least one cluster center according to the density value of each sample and the density distance value of each sample comprises: calculating the cluster metric value of each sample according to the density value of the sample and the density distance value of the sample; and determining at least one cluster center according to the cluster metric value of each sample, the cluster metric value of each sample being equal to the product of the density value of the sample and the density distance value of the sample.
- The electronic device according to claim 12, characterized in that the determining at least one cluster center according to the cluster metric value of each sample comprises: sorting the cluster metric values of the samples in descending order and selecting, from the sorted cluster metric values, the samples ranked within a preset number of positions as cluster center points; selecting, according to the cluster metric value of each sample, the samples whose cluster metric values are greater than a threshold as cluster center points.
- The electronic device according to claim 9, characterized in that the processor is further configured to execute the at least one instruction to implement the following step: determining, as error samples, the samples whose distance to each of the at least one cluster center exceeds a distance threshold.
- A non-volatile readable storage medium, characterized in that the non-volatile readable storage medium stores at least one instruction which, when executed by a processor, implements the following steps: calculating the feature of each sample in the sample data; calculating the distance set of each sample according to the feature of each sample, the distance set of each sample comprising the distance between the sample and each sample among the remaining samples corresponding to it; calculating the density value of each sample and the density distance value of each sample according to the distance set of each sample; determining at least one cluster center according to the density value of each sample and the density distance value of each sample; and clustering the sample data into a plurality of subsets based on the at least one cluster center and the feature of each sample.
- The storage medium according to claim 15, characterized in that the calculating the density value of each sample comprises: comparing each distance in the distance set of each sample with a distance threshold, determining the number of distances greater than the distance threshold, and taking the number of distances corresponding to each sample as the density value of the sample.
- The storage medium according to claim 15, characterized in that the calculating the density distance value of each sample comprises: for the sample with the largest density value, selecting the maximum distance from the distance set of that sample as its density distance value; for any sample in a second sample set, determining the samples whose density values are greater than that of the sample, determining, according to the distance set of the sample, the shortest distance to the sample among the samples of greater density, and taking this shortest distance as the density distance value of the sample, the second sample set comprising all samples in the sample data other than the sample with the largest density value.
- The storage medium according to claim 15, characterized in that the determining at least one cluster center according to the density value of each sample and the density distance value of each sample comprises: calculating the cluster metric value of each sample according to the density value of the sample and the density distance value of the sample; and determining at least one cluster center according to the cluster metric value of each sample, the cluster metric value of each sample being equal to the product of the density value of the sample and the density distance value of the sample.
- The storage medium according to claim 18, characterized in that the determining at least one cluster center according to the cluster metric value of each sample comprises: sorting the cluster metric values of the samples in descending order and selecting, from the sorted cluster metric values, the samples ranked within a preset number of positions as cluster center points; selecting, according to the cluster metric value of each sample, the samples whose cluster metric values are greater than a threshold as cluster center points.
- The storage medium according to claim 15, characterized in that the at least one instruction, when executed by the processor, further implements the following step: determining, as error samples, the samples whose distance to each of the at least one cluster center exceeds a distance threshold.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810350730.2 | 2018-04-18 | ||
| CN201810350730.2A CN108595585B (zh) | 2018-04-18 | 2018-04-18 | Sample data classification method, model training method, electronic device and storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2019200782A1 true WO2019200782A1 (zh) | 2019-10-24 |
Family
ID=63613517
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2018/100157 Ceased WO2019200782A1 (zh) | 2018-04-18 | 2018-08-13 | Sample data classification method, model training method, electronic device and storage medium |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN108595585B (zh) |
| WO (1) | WO2019200782A1 (zh) |
Also Published As
| Publication number | Publication date |
|---|---|
| CN108595585A (zh) | 2018-09-28 |
| CN108595585B (zh) | 2019-11-12 |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 18915611; Country of ref document: EP; Kind code of ref document: A1 |