Detailed Description
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will explain the specific embodiments of the present invention with reference to the accompanying drawings. It is evident that the drawings in the following description are only examples of the invention, from which other drawings and other embodiments can be obtained by a person skilled in the art without inventive effort.
For the sake of simplicity of the drawing, the parts relevant to the present invention are shown only schematically in the figures, which do not represent the actual structure thereof as a product. Additionally, in order to simplify the drawing for ease of understanding, components having the same structure or function in some of the drawings are shown schematically with only one of them, or only one of them is labeled. Herein, "a" means not only "only this one" but also "more than one" case.
According to an embodiment of the present invention, as shown in fig. 1, an incremental data set acceleration generating method includes the steps of:
S001, loading a category label file and all progress record files, wherein the category label file contains the category names and corresponding numbers of the files used for training; the progress record files correspond to the intermediate data sets, and each progress record file records the mark information of the original files corresponding to all intermediate data in one intermediate data set;
S002, checking whether the category label file has changed;
S003, when the category label file is detected to be unchanged, reading all progress record files and obtaining the original file information corresponding to all intermediate data in each intermediate data set;
S004, generating a task list of all current original files in an out-of-order manner;
S005, obtaining the target tasks to be deleted according to the task list and the progress record files, and deleting the target tasks from the task list;
S006, obtaining the final task list, and generating the corresponding intermediate data from the tasks in the task list using the threads in the thread pool.
First, when training convolutional neural networks, the training framework often uses an intermediate data format in order to maintain higher data reading efficiency; that is, the original files need to be converted into intermediate data for training. The intermediate data set is the storage format of the intermediate data, and a typical intermediate data set stores 1000 intermediate data items.
Each progress record file corresponds to an intermediate data set; each piece of intermediate data in that set is generated from an original picture, and the progress record file records the mark information of the original files from which the intermediate data were generated. The category label file corresponds to all of the intermediate data sets and records their classification in a table. For example, in a TensorFlow training problem with 10 classes, 10 class labels, namely the class names and their numbers 0-9, are recorded in the category label file.
Specifically, the category label file and all progress record files are loaded. The category label file records the category names and corresponding numbers of the files used for training; for example, if the training files contain pictures of cats and pictures of dogs, the category label file assigns the cat pictures the number 1 and the dog pictures the number 2. The progress record files correspond to the intermediate data sets; each progress record file records the mark information of the original files corresponding to all intermediate data in one intermediate data set, and one intermediate data set comprises 1000 intermediate data items generated from 1000 original files. The progress record file is thus used to record the mark information of the original files corresponding to the 1000 intermediate data items in its intermediate data set.
Next, it is checked whether the category label file has changed. There are various ways in which the category label file can change, including a change to a category or its corresponding number, the addition of a category, the removal of a category, and so on. When the information in the category label file is found to be unchanged, all progress record files are read, and the information on the original pictures from which the intermediate data were generated is obtained. A task list of all current original files is then generated, that is, a task list of all original pictures intended to produce intermediate data. Then, from this task list and from the progress record files, which record the mark information of the original files corresponding to all intermediate data in each intermediate data set, the target tasks to be deleted are obtained and removed from the list. The final task list is thereby obtained, and the corresponding intermediate data is generated from the tasks in the task list using the threads in the thread pool.
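The simplified flow of steps S004-S006 can be sketched in Python (the `build_task_list` helper, the file names, and the dictionary layout are illustrative assumptions for exposition, not the invention's actual data structures):

```python
import random

def build_task_list(original_files, progress_records):
    # S004: generate the task list of all current original files out of order
    tasks = list(original_files)
    random.shuffle(tasks)
    # S005: files already recorded in a progress record are target tasks
    done = set()
    for recorded_originals in progress_records.values():
        done.update(recorded_originals)
    # S006: the final task list keeps only files still needing conversion
    return [t for t in tasks if t not in done]

# usage: one progress record already covers cat_01 and dog_01
files = ["cat_01.jpg", "cat_02.jpg", "dog_01.jpg", "dog_02.jpg"]
records = {"set_0000": {"cat_01.jpg", "dog_01.jpg"}}
remaining = build_task_list(files, records)
```

Note that shuffling before deletion preserves the out-of-order property of the surviving tasks.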
According to an embodiment of the present invention, as shown in fig. 2, an incremental data set acceleration generating method includes the steps of:
S101, loading a category label file and all progress record files, wherein the category label file contains the category names and corresponding numbers of the files used for training; the progress record files correspond to the intermediate data sets, and each progress record file records the mark information of the original files corresponding to all intermediate data in one intermediate data set;
S102, checking whether the category label file has changed;
S103, when the category label file is detected to be unchanged, reading all progress record files and obtaining the original file information corresponding to all intermediate data in each intermediate data set;
S104, generating a task list of all current original files in an out-of-order manner;
S105, comparing the original files corresponding to all intermediate data in each intermediate data set, as read from its progress record file, against the task list to determine whether they are all present;
S106, marking an intermediate data set whose corresponding original files are all present in the task list as a first intermediate data set, and marking an intermediate data set for which some corresponding original file is absent from the task list as a second intermediate data set;
S107, when all the intermediate data sets are first intermediate data sets, selecting one intermediate data set for deletion, and deleting from the task list the original files corresponding to all intermediate data in the remaining intermediate data sets;
S108, when a second intermediate data set exists among the intermediate data sets, deleting the second intermediate data set, and deleting from the task list the original files corresponding to all intermediate data in the first intermediate data sets;
S109, acquiring the final task list, and generating the corresponding intermediate data from the tasks in the task list using the threads in the thread pool.
Specifically, the category label file and all progress record files are loaded. The category label file records the category names and corresponding numbers of the files used for training; for example, if the training files contain pictures of cats and pictures of dogs, the category label file assigns the cat pictures the number 1 and the dog pictures the number 2. The progress record files correspond to the intermediate data sets; each progress record file records the mark information of the original files corresponding to all intermediate data in one intermediate data set, and one intermediate data set comprises 1000 intermediate data items generated from 1000 original files (one original file generates one piece of intermediate data). The progress record file is thus used to record the mark information of the original files corresponding to the 1000 intermediate data items in its intermediate data set.
Next, it is checked whether the category label file has changed. There are various ways in which the category label file can change, including a change to a category or its corresponding number, the addition of a category, the removal of a category, and so on. When the information in the category label file is found to be unchanged, all progress record files are read and the information on the original pictures from which the intermediate data were generated is obtained; a task list of all current original files, that is, a task list of all original pictures intended to produce intermediate data, is then generated.
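The change check on the category label file can be sketched as follows (the JSON layout of the label file and the `labels_changed` helper are illustrative assumptions; the embodiment does not fix a concrete file format):

```python
import json

def labels_changed(label_path, cached_labels):
    # Load the category label file, e.g. {"cat": 1, "dog": 2}, and report
    # whether any class was added, removed, or renumbered since the last run.
    with open(label_path, "r", encoding="utf-8") as f:
        labels = json.load(f)
    return labels != cached_labels, labels
```

Any added, removed, or renumbered class counts as a change and triggers full regeneration of the intermediate data sets.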
The task list is then compared with the original picture information recorded in all progress record files to determine whether the original files corresponding to all the intermediate data in each intermediate data set are present in the task list. Two cases arise. This embodiment marks an intermediate data set whose corresponding original files are all present in the task list as a first intermediate data set, and marks an intermediate data set for which some corresponding original file is absent from the task list as a second intermediate data set. For example, if the original pictures corresponding to all the intermediate data in the set obtained from a certain progress record file are all in the task list, that set is marked as a first intermediate data set; if the progress record file records original pictures of which, say, two are absent from the task list, that set is marked as a second intermediate data set.
In this embodiment, when all the intermediate data sets are first intermediate data sets, one intermediate data set is selected for deletion; the original files corresponding to all intermediate data in the remaining intermediate data sets then constitute target tasks, so these target tasks must also be deleted from the task list.
When a second intermediate data set exists among the intermediate data sets, the second intermediate data set is deleted; the original files corresponding to all intermediate data in the remaining first intermediate data sets constitute target tasks, so these target tasks must be deleted from the task list.
For example, if the original picture information corresponding to all the intermediate data in the four data sets A, B, C, and D is present in the task list, then one of the intermediate data sets, say A, is deleted, and the original picture tasks corresponding to the remaining intermediate data sets B, C, and D are deleted from the task list. That is, the original picture tasks corresponding to B, C, and D are the target tasks and must be removed from the task list. Because B, C, and D already contain intermediate data, their original files need not be regenerated, which increases the processing speed and avoids the time wasted on repeated generation. The reason one intermediate data set is deleted is as follows. If all intermediate data sets are first intermediate data sets, no original file has been deleted or modified; only new original pictures have been added (note that the new pictures belong to the existing categories). The training data given to the framework, that is, the intermediate data, must be randomly shuffled, and if no intermediate data set were deleted in this situation, only the new original pictures would be available for shuffling, which would defeat the purpose of the shuffle. Therefore, in this case, one intermediate data set is deleted, all of the original pictures corresponding to the deleted set are regenerated, and they are mixed with the new original pictures in the task list, so that the previous pictures are shuffled together with the new ones and the shuffling requirement of the training data is satisfied.
In addition, when a second intermediate data set exists among the intermediate data sets, that is, when the original files corresponding to the intermediate data of some set are not all present in the task list, the second intermediate data set is deleted and the original files corresponding to all intermediate data in the first intermediate data sets are deleted from the task list. For example, suppose there are four intermediate data sets A, B, C, and D; the original picture tasks corresponding to all intermediate data in A, B, and C are present in the task list, while some original picture tasks of D are not, so D is the second intermediate data set. In this case D is deleted, and the original picture tasks corresponding to the intermediate data in A, B, and C are deleted from the task list, so the intermediate data of A, B, and C need not be regenerated, thereby reducing repeated data generation and accelerating the processing.
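Steps S105-S108 reduce to a small classification plus one branch. A sketch follows (the helper names and the choice of the first complete set as the sacrificial one are illustrative assumptions; the embodiment permits random or designated selection):

```python
def classify_sets(progress_records, task_set):
    # S105-S106: a set whose recorded originals are all in the task list
    # is a "first" intermediate data set; otherwise it is a "second" one.
    first, second = [], []
    for name, originals in progress_records.items():
        (first if originals <= task_set else second).append(name)
    return first, second

def select_deletions(first, second):
    # S107: every set complete -> sacrifice one set so that old and new
    # pictures can be shuffled together.
    # S108: otherwise delete exactly the stale ("second") sets.
    return first[:1] if not second else second
```

With four complete sets A-D this returns only A for deletion; if D is missing an original, only D is deleted.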
Finally, the processed task list is obtained; the tasks in the task list call threads, concurrent processing is performed through the plurality of called threads, and the original files still listed in the task list are converted into intermediate data.
In another embodiment, as shown in fig. 3, an incremental data set acceleration generating method includes the steps of:
S201, loading a category label file and all progress record files, wherein the category label file contains a category name and a corresponding number of a file for training; the progress record files correspond to the intermediate data sets, and each progress record file records the mark information of the original file corresponding to all intermediate data in one intermediate data set;
S202, checking whether the category label file is changed or not; if yes, go to step S203, otherwise go to step S204;
S203, when a change of the category label file is detected, deleting all intermediate data sets, generating a task list of all current original files out of order as the final task list, and proceeding to step S210;
S204, when the category label file is detected to be unchanged, reading all progress record files in a deserializing mode, and obtaining original file information corresponding to all intermediate data in each intermediate data set;
S205, generating task lists of all current original files in an out-of-order manner;
S206, comparing the original files corresponding to all intermediate data in each intermediate data set, as read from its progress record file, against the task list to determine whether they are all present;
S207, marking an intermediate data set whose corresponding original files are all present in the task list as a first intermediate data set, and marking an intermediate data set for which some corresponding original file is absent from the task list as a second intermediate data set;
S208, when all the intermediate data sets are first intermediate data sets, selecting one intermediate data set for deletion, and deleting from the task list the original files corresponding to all intermediate data in the remaining intermediate data sets;
S209, when a second intermediate data set exists among the intermediate data sets, deleting the second intermediate data set, and deleting from the task list the original files corresponding to all intermediate data in the first intermediate data sets;
S210, acquiring the final task list, and generating the corresponding intermediate data from the tasks in the task list using the threads in the thread pool.
In this embodiment, the progress record file in step S201 is a serialized file of the MD5 values of the original files corresponding to all intermediate data in one intermediate data set.
Specifically, this embodiment further optimizes the embodiment of fig. 2. The progress record file stores, in serialized form, the MD5 values of the original files corresponding to all intermediate data in its intermediate data set. For example, if an intermediate data set contains 1000 intermediate data items corresponding to 1000 original pictures, the progress record file records the MD5 values of those 1000 original pictures. To serialize the list of MD5 values, each MD5 value of an original file task is converted from a character string into a binary stream and written to the file. When the category label file is found to be unchanged, all progress record files are read by deserialization, that is, the data is read back in binary form, which increases the file reading speed. Conversely, when the category label file is found to have changed, that is, the classifications or numbers in the category label file have changed, for example a classification has been removed or a number has changed, all intermediate data sets are deleted, and all original files must be regenerated into intermediate files.
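If the progress record files are ordinary Python pickle (.pkl) files, the binary write and the deserializing read of step S204 can be sketched like this (the helper names and the pickled layout, a plain list of hex digests, are illustrative assumptions):

```python
import pickle

def save_progress(path, md5_list):
    # Serialize the MD5 list of one finished intermediate data set
    # as a binary stream.
    with open(path, "wb") as f:
        pickle.dump(md5_list, f)

def load_progress(path):
    # S204: deserialize one progress record file in binary mode,
    # recovering the MD5 values of the already-converted originals.
    with open(path, "rb") as f:
        return pickle.load(f)
```

Reading the binary stream back directly, rather than parsing text, is what makes loading fast.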
Preferably, in any of the above embodiments, when all the intermediate data sets are first intermediate data sets, the intermediate data set to be deleted is chosen either at random or by designation. For example, suppose 1000 intermediate data items generated from original pictures form one intermediate data set, all original pictures corresponding to that set are in the task list, and all original pictures corresponding to another set generated from 1000 other original pictures are also in the task list. Since the original files of both sets are in the task list, one set may be deleted at random, or a designated set may be deleted, such as the first intermediate data set or the last intermediate data set. When all the original pictures corresponding to the intermediate data are in the task list, no original pictures have been deleted or modified and only new original files have been added; but during training the data must be randomly shuffled, so one intermediate data set is deleted, the original picture tasks of that set are randomly shuffled together with the newly added original file tasks, and the original picture tasks corresponding to the other intermediate data sets are deleted from the task list. Random shuffling is thus achieved, incremental generation is realized, the intermediate data need not be regenerated in full, and the data processing speed is improved.
Preferably, in any of the above embodiments, the corresponding intermediate data is generated from the tasks in the task list using the threads in the thread pool, where the number of threads is set mainly according to the number of CPU cores. Too few threads delay the data processing, while too many threads make the overhead of switching between threads very large and instead slow the processing down. Based on thread concurrency, the threads execute in parallel across the CPU cores; for example, with a 4-core CPU, multiple threads are created (each thread corresponding to the roughly 1000 task objects of one 1000-entry original file task list), placed in the thread pool, and used to process the original file tasks in the task list. This approach provides a very good acceleration when processing TB-scale data. The invention stores 1000 objects per intermediate data set, which is the officially recommended amount and in practice an appropriate scheme, so that each called task processes about 1000 tasks in parallel with the others.
Compared with generating the intermediate data sets from scratch, the thread pool technique accelerates the generation process, and the degree of acceleration can be adjusted appropriately according to the configuration of the machine, so that the speed can be improved by more than four times.
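The thread-pool arrangement can be sketched with Python's standard `concurrent.futures` (the `generate_one` callback stands in for the real intermediate-data writer and is an illustrative assumption):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def generate_all(chunks, generate_one):
    # Tie the worker count to the CPU core count, as the text recommends:
    # too few workers stall the processing, too many inflate the
    # thread-switching overhead.
    workers = os.cpu_count() or 4
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # one pool task per chunk of (about 1000) original files
        return list(pool.map(generate_one, chunks))
```

Because the work is file I/O bound, threads rather than processes are a reasonable fit despite Python's GIL.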
This embodiment designs an incremental data set generating method in a brand-new manner: by adding the record information, the whole data set need not be regenerated when the data set is adjusted, which achieves a very good acceleration effect in the generation of TB-scale data sets.
In another embodiment of the present invention, the following files are briefly introduced:
The pkl file is a file-information recording file and corresponds to the progress record file in the above embodiments. It is paired one-to-one with a tfrecord file. A tfrecord file records a corresponding original data set, such as a picture data source, and the pkl file records the corresponding MD5 values of those data source files.
The intermediate data set (tfrecord) refers to the file holding the intermediate data, namely the tfrecord file of TensorFlow. This file serves as the direct input for training and is converted from the original files.
The label file (corresponding to the category label file in the above embodiments): a plurality of tfrecord files correspond to one label file. The label file records, in a table, the classifications of the corresponding data sets together with the serial numbers of those classifications.
The implementation steps of this embodiment are as follows:
In the first step, loading the generated information from the pkl files and loading the label information;
In the second step, detecting modification of the label file: if the label file has been modified, deleting all tfrecord files and proceeding to the seventh step; if the label file is unchanged, proceeding to the third step;
In the third step, deserializing the pkl information files (obtaining the generated information);
In the fourth step, loading the full task list (the files that need to be converted into tfrecord files);
In the fifth step, comparing against the already completed files and deleting the completed tasks;
In the sixth step, in the case where the file set has only grown, enlarging the scope and randomly shuffling the data set list;
In the seventh step, accelerating the generation of the intermediate data sets with the thread pool.
Specifically, in the first step, a pkl file (the progress record file of the above embodiments) corresponds to a tfrecord file (the intermediate data set of the above embodiments). The pkl file records the list of MD5 values of the pictures contained in that data set file; it in effect records the completed progress. Loading the pkl file therefore reveals which original picture files already have intermediate data in the tfrecord file.
In the second step, the loaded label file is checked for changes. The label file holds the correspondence between all classification names and their corresponding numbers; if it is found to have changed after loading, the training set has changed substantially and the whole training set must be regenerated.
For the third step, the pkl file records the MD5 values of all original files in the tfrecord file as a serialized list. The serialization and deserialization proceed as follows (since the computing environment is an x86 Ubuntu PC, the serialization does not need to account for big-endian versus little-endian alignment). Serialization of the pkl file: the MD5 values of the task files in the tfrecord are traversed, and each MD5 value is converted directly from a character string into a binary stream and written to the corresponding file. Because all records are of fixed length, there is no variable-length framing problem: after each MD5 value is serialized, the byte 0x00 is written as the start marker of the next record. Deserialization: the serialization file is read back and parsed according to the same rules. Because binary reading and storage are used throughout, the speed is very considerable.
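The fixed-length rule described above can be sketched as follows (the two helper names are illustrative assumptions; endianness is ignored, matching the single x86 environment the text assumes):

```python
import binascii

def serialize_md5_list(md5_hex_list):
    # Each 32-character hex digest becomes a fixed 16-byte binary record;
    # the byte 0x00 written after it marks the start of the next record,
    # so no variable-length framing is needed.
    out = bytearray()
    for h in md5_hex_list:
        out += binascii.unhexlify(h)
        out += b"\x00"
    return bytes(out)

def deserialize_md5_list(blob):
    # Inverse: read 16 data bytes, skip the 1-byte marker, repeat.
    digests, i = [], 0
    while i < len(blob):
        digests.append(binascii.hexlify(blob[i:i + 16]).decode())
        i += 17
    return digests
```

The fixed record length is what makes the parse a simple stride over the buffer.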
For the fourth step, a conventional Python operation traverses the file directory list to obtain the full set of file tasks (the files to be converted into corresponding tfrecord files).
For the fifth step, the task list is adjusted so that files whose intermediate data has already been generated are removed and need not be generated again. However, TensorFlow requires the training data to be sufficiently shuffled, so the task list is shuffled with a random seed when it is generated. Among all the tfrecord files, there are tfrecord files for which every original picture can be found in the task list, and tfrecord files for which not all original pictures are present in the task list (the case, in the above embodiments, where a second intermediate data set exists). Different measures are taken for these two kinds of tfrecord files:
If all the original picture files of a tfrecord file are in the task list, the intermediate data of those original pictures has already been generated, and the original picture tasks can be removed from the task list;
If some original picture recorded for a tfrecord file does not appear in the task list, that picture has been deleted from the original files; the tfrecord file must be regenerated, and after regeneration a new file replaces it in the original position.
The sixth step is in fact a supplementary description of the fifth step. When the original picture files of every tfrecord file appear in the task list (the case, in the above embodiments, where all intermediate data sets are first intermediate data sets), no file has been deleted or modified in this generation; only new files have been added. However, the training data given to TensorFlow should be randomly shuffled, and if only the newly added data were available for shuffling, the purpose of the random shuffle would be defeated. Therefore, when this situation occurs, the method described herein deletes the last tfrecord file, mixes its corresponding original files into the newly added task list, and shuffles again.
For the seventh step, the first six steps yield the corresponding original file list, which is the complete task list of files for which tfrecord files must be generated. Generation of the tfrecord files then proceeds with the following thread pool acceleration. TensorFlow officially recommends storing about 1000 objects per tfrecord file; based on this characteristic, this embodiment also stores 1000 files per tfrecord file, opens one task per 1000 files, and places the tasks in a thread pool for the system to schedule and execute concurrently. Regarding concurrent scheduling: too few threads delay the data processing, while too many threads increase the switching overhead between threads and thus reduce efficiency. The concurrency of the threads rests on execution in each CPU core, so the maximum number of threads is set to the number of CPU cores, which is the appropriate thread count. This arrangement achieves the maximum multithreading performance of the system.
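The 1000-objects-per-file rule turns the final task list into pool tasks by simple slicing; a sketch (the `shard_tasks` helper is an illustrative assumption):

```python
def shard_tasks(task_list, shard_size=1000):
    # One thread-pool task per (about) 1000-entry slice of the final
    # task list, matching the officially recommended ~1000 objects
    # per tfrecord file; the last shard may be shorter.
    return [task_list[i:i + shard_size]
            for i in range(0, len(task_list), shard_size)]
```

Each resulting shard is then handed to one worker thread to be written out as a single tfrecord file.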
Based on the same technical concept, the invention also provides an incremental dataset acceleration generating system, specifically, an embodiment of the incremental dataset acceleration generating system of the invention is shown in fig. 4, and includes:
The data loading module 1 is used for loading a category label file and all progress record files, wherein the category label file contains a category name and a corresponding number of a file for training; the progress record files correspond to the intermediate data sets, and each progress record file records the mark information of the original file corresponding to all intermediate data in one intermediate data set;
a checking module 2, configured to check whether the class tag file loaded in the data loading module 1 has a change;
The reading module 3 is used for reading all progress record files when the checking module 2 checks that the category label file is not changed, and obtaining original file information corresponding to all intermediate data in each intermediate data set;
the list generation module 4 is used for generating task lists of all current original files in an out-of-order manner;
The task deleting module 11 is configured to obtain a target task to be deleted according to the task list and the progress record file, and delete the target task in the task list;
a list acquisition module 8, configured to acquire a final task list;
and the thread calling module 9 is used for generating corresponding intermediate data according to the threads in the task calling thread pool in the task list acquired by the list acquisition module 8.
Specifically, the data loading module 1 loads the category label file and all progress record files. The category label file records the category names and corresponding numbers of the files used for training; for example, if the training files contain pictures of cats and pictures of dogs, the category label file assigns the cat pictures the number 1 and the dog pictures the number 2. The progress record files correspond to the intermediate data sets; each progress record file records the mark information of the original files corresponding to all intermediate data in one intermediate data set, and one intermediate data set contains 1000 intermediate data items generated from 1000 original files (one original file generates one piece of intermediate data). The progress record file is thus used to record the mark information of the original files corresponding to the 1000 intermediate data items in its intermediate data set.
The loaded progress record files correspond to the intermediate data sets and record the mark information of the original files corresponding to the data in each set, which later provides the basis for matching original file tasks in the task list against the intermediate data. The loaded category label file records, in a table, the classification categories and the serial numbers of the classification labels of the several intermediate data sets, which provides a precondition for how the data in the task list is handled later. Generating the task list of all original files out of order allows the progress record files to be matched against the task list, so that it can be judged whether all of a set's original file tasks can be matched before the data is processed. The intermediate data sets are then marked so that, in each of the various cases, it can be determined how to handle the original file tasks corresponding to their intermediate data.
Next, it is checked whether the category label file has changed. There are various ways it can change: a category or its corresponding number may be modified, a category may be added, or a category may be removed. When the information in the category label file is found to be unchanged, all progress record files are read to obtain the information of the original pictures from which the intermediate data were generated, and then the task list of all current original files, that is, the task list of all original pictures from which intermediate data are to be generated, is generated.
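The change check and the out-of-order task list generation can be sketched as below. The comparison against a previously stored copy of the labels is an assumption about how "changed" is detected; the seeded shuffle is only for reproducible illustration.

```python
import random

def category_labels_changed(old_labels, new_labels):
    # Changed if any category was added, removed, or renumbered.
    return old_labels != new_labels

def build_shuffled_task_list(original_files, seed=None):
    # One task per current original file, deliberately in random order,
    # so the intermediate data later generated end up shuffled.
    tasks = list(original_files)
    random.Random(seed).shuffle(tasks)
    return tasks
```

Shuffling at list-generation time is what lets the later deletion logic guarantee that training data remain randomly ordered.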
Then, by comparing the task list of the original pictures used to generate intermediate data with the progress record files, each of which corresponds to an intermediate data set and records the mark information of the original files for all intermediate data in that set, the target tasks to be deleted are obtained and removed from the task list.
Finally, the processed task list is obtained; the tasks in it call threads from a thread pool and are processed concurrently by the several called threads, so that intermediate data are generated from the original files in the task list that still require intermediate data.
Preferably, each progress record file is a serialized file of the MD5 values of the original files corresponding to all intermediate data in an intermediate data set. For example, if an intermediate data set contains 1000 pieces of intermediate data corresponding to 1000 original pictures, the progress record file records the MD5 values of those 1000 original pictures: the MD5 value of each original file whose task must generate intermediate data is converted from a character string into a binary stream, and the list of MD5 values is serialized. When the category label file is found to be unchanged, all progress record files are read by deserialization, that is, the data are read back in binary form, which increases the file reading speed.
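A minimal sketch of this preferred MD5-based progress record, using Python's `hashlib` for the MD5 values and `pickle` for the binary serialization/deserialization (the choice of `pickle` is an assumption; any binary serializer would do):

```python
import hashlib
import pickle

def file_md5(path):
    # MD5 of one original file, streamed in chunks to bound memory use.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def save_progress(md5_list, record_path):
    # Serialize the list of MD5 strings into a binary stream.
    with open(record_path, "wb") as f:
        pickle.dump(md5_list, f)

def load_progress(record_path):
    # Deserialize by reading the data back in binary form,
    # which is faster than parsing a text representation.
    with open(record_path, "rb") as f:
        return pickle.load(f)
```

Because the record stores content hashes rather than file names, a renamed but unmodified original file still matches its existing intermediate data.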
According to an embodiment of the present invention, as shown in fig. 5, an incremental data set acceleration generating system includes:
The data loading module 1 is used for loading a category label file and all progress record files, wherein the category label file contains a category name and a corresponding number of a file for training; the progress record files correspond to the intermediate data sets, and each progress record file records the mark information of the original file corresponding to all intermediate data in one intermediate data set;
the checking module 2 is used for checking whether the category label file loaded by the data loading module 1 has changed;
The reading module 3 is used for reading all progress record files when the checking module 2 checks that the category label file is not changed, and obtaining original file information corresponding to all intermediate data in each intermediate data set;
the list generation module 4 is used for generating task lists of all current original files in an out-of-order manner;
the task deletion module 11 includes:
the comparison data sub-module 5 is used for comparing whether the original files corresponding to all intermediate data in each intermediate data set, as read from the progress record files loaded by the data loading module 1, are all present in the task list generated by the list generation module 4;
the data marking sub-module 6 is used for marking an intermediate data set whose corresponding original files are all present in the task list as a first intermediate data set, and marking an intermediate data set whose corresponding original files are not all present in the task list as a second intermediate data set;
the data operation sub-module 7 is used for, when all intermediate data sets are first intermediate data sets marked by the data marking sub-module 6, selecting one intermediate data set for deletion and deleting from the task list the tasks of the original files corresponding to all intermediate data in the remaining intermediate data sets; and for, when a second intermediate data set marked by the data marking sub-module 6 exists among the intermediate data sets, deleting the second intermediate data set and deleting from the task list the tasks of the original files corresponding to all intermediate data in the first intermediate data sets;
the data operation sub-module 7 is further configured to delete all intermediate data sets when the checking module 2 detects that the category label file has changed;
a list acquisition module 8, configured to acquire a final task list;
and the thread calling module 9 is used for calling threads from the thread pool to generate the corresponding intermediate data for the tasks in the task list acquired by the list acquisition module 8.
The thread setting module 10 is configured to set the number of concurrent threads in the thread pool according to the number of cores of the CPU.
In this embodiment, based on the above embodiment, when all intermediate data sets are marked as first intermediate data sets, one of them is selected and deleted, and the original-file tasks corresponding to the intermediate data in the other intermediate data sets are deleted, so that intermediate data already generated do not need to be regenerated from the tasks in the task list. When a marked second intermediate data set exists among the intermediate data sets, the second intermediate data set is deleted and the original-file tasks corresponding to the intermediate data in the first intermediate data sets are deleted. This not only removes from the task list the tasks whose intermediate data do not need to be regenerated, but also ensures that, when the deleted intermediate data set is regenerated, its original-file tasks are generated out of order together with the newly added tasks, achieving the purpose of out-of-order processing. Finally, the original-file tasks for the intermediate data still to be generated in the task list are processed concurrently by the configured threads.
The task list is compared with the original-picture information recorded in all progress record files to see whether the original files corresponding to all intermediate data in each intermediate data set are present in the task list, which divides the sets into two cases. This embodiment marks an intermediate data set whose corresponding original files are all present in the task list as a first data set, and marks an intermediate data set whose corresponding original files are not all present in the task list as a second data set. For example, if the original pictures corresponding to all intermediate data in the intermediate data set obtained from a certain progress record file are all in the task list, that set is marked as a first data set; if the progress record file records the original pictures from which the intermediate data were generated but two of those original pictures are absent from the task list, that set is marked as a second data set.
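The marking step is essentially a set-containment test per progress record. A minimal sketch, assuming the progress records are given as a mapping from set name to the list of original-file marks and the task list carries the same marks:

```python
def mark_data_sets(progress_records, task_list):
    """Split intermediate data sets into 'first' (every recorded
    original file still appears in the task list) and 'second'
    (at least one recorded original file is missing)."""
    tasks = set(task_list)
    first, second = [], []
    for name, marks in progress_records.items():
        (first if set(marks) <= tasks else second).append(name)
    return first, second
```

A set whose original files were deleted or modified will have marks that no longer appear in the task list, so it naturally falls into the "second" group.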
Then, when all intermediate data sets are first intermediate data sets, one data set is selected for deletion. For example, if the original-picture information corresponding to all intermediate data in the four data sets A, B, C and D is present in the task list, one intermediate data set, say A, is deleted; the original-file tasks corresponding to all intermediate data in the remaining intermediate data sets B, C and D are then the target tasks and must be deleted from the task list. Because B, C and D already contain intermediate data, their corresponding original files do not need to be generated again, which increases the processing speed and avoids the time wasted by repeated generation. The reason one intermediate data set is selected for deletion is as follows: if all intermediate data sets are first intermediate data sets, no original file has been deleted or modified and only new original pictures have been added (note that the new pictures belong to the existing categories), yet the training data, that is, the intermediate data, must be randomly shuffled. If no intermediate data set were deleted, only the new original pictures could be shuffled, defeating the purpose of out-of-order processing. Therefore, in this case, all original pictures corresponding to one deleted intermediate data set are completely regenerated and mixed with the new original pictures in the task list, so that not only the new original pictures but also part of the earlier original pictures are shuffled together, satisfying the out-of-order requirement for training data.
In addition, when a second intermediate data set exists, that is, when the original files corresponding to its intermediate data are not all present in the task list, the second intermediate data set is deleted, and the original-file tasks corresponding to all intermediate data in the first intermediate data sets are deleted from the task list. For example, suppose there are four intermediate data sets A, B, C and D: the original-picture tasks corresponding to all intermediate data in A, B and C are present in the task list, while those of D are not all present. D is then the second intermediate data set and must be deleted; the original-picture tasks corresponding to the intermediate data in the three sets A, B and C are the target tasks and must be deleted from the task list. In this way the intermediate data of A, B and C do not need to be regenerated, repeated data generation is reduced, and the processing speed is increased.
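Both pruning cases can be combined into one sketch. This is an illustrative reading of the scheme, not the patent's implementation: set names, the mapping format, and the random choice of the set to delete are assumptions.

```python
import random

def prune_tasks(progress_records, task_list, seed=None):
    """Return (remaining task list, intermediate sets to delete).

    Case 1 (all sets 'first'): drop one set at random, so its files
    are regenerated and mixed with the new files for shuffling, and
    delete the tasks covered by the remaining sets.
    Case 2 (some set 'second'): delete every second set and delete
    the tasks covered by every first set.
    """
    tasks = set(task_list)
    first = [n for n, m in progress_records.items() if set(m) <= tasks]
    second = [n for n in progress_records if n not in first]
    if second:
        sets_to_delete = second
        keep_sets = first
    else:
        sets_to_delete = [random.Random(seed).choice(first)] if first else []
        keep_sets = [n for n in first if n not in sets_to_delete]
    covered = set()
    for n in keep_sets:
        covered.update(progress_records[n])
    remaining = [t for t in task_list if t not in covered]
    return remaining, sets_to_delete
```

In case 1 the remaining task list contains the new files plus the files of the one deleted set, which is exactly what gets shuffled and regenerated.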
In addition, the corresponding intermediate data are generated by threads called from the thread pool for the tasks in the task list. The number of threads is set mainly according to the number of CPU cores: setting too few threads delays the data processing, while setting too many makes the overhead of switching between threads very large and actually reduces the processing speed. Based on thread concurrency, execution proceeds in parallel on each CPU core; for example, with a 4-core CPU, several threads are set per core (each thread handling task objects from the task list of the original files), placed in the thread pool after configuration, and called to process the original-file tasks that generate intermediate data in the task list. This approach provides a very good acceleration when the data to be processed reach the TB level: each intermediate data set stores on the order of 1000 pieces of intermediate data, and processing the original files of the task list concurrently in this way speeds up the data processing.
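Sizing the pool from the core count and dispatching the tasks can be sketched with Python's standard `concurrent.futures`. The `threads_per_core` factor and the per-task callback are illustrative assumptions, not values fixed by the text.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def generate_all(task_list, generate_one, threads_per_core=2):
    # Size the pool from the CPU core count: too few threads leave
    # cores idle, too many make context-switch overhead dominate.
    workers = max(1, (os.cpu_count() or 1) * threads_per_core)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves task order while executing concurrently.
        return list(pool.map(generate_one, task_list))
```

For I/O-bound generation (reading original pictures, writing intermediate data) threads suffice; CPU-bound generation in Python would instead use a process pool for true parallelism.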
Preferably, when the category label file is found to have changed, that is, when a classification or a number in the category label file has changed, for example a classification has been removed or a number has been modified, all intermediate data sets are deleted through the data operation sub-module, and intermediate files must be regenerated from all original files.
Preferably, in any of the above embodiments, when all intermediate data sets are first intermediate data sets, the intermediate data set to be deleted is selected either randomly or by designation. For example, one intermediate data set may be deleted at random: if 1000 pieces of intermediate data generated from 1000 original pictures form one intermediate data set whose original pictures are all in the task list of intermediate data to be generated, and the original pictures corresponding to the intermediate data generated from another 1000 original pictures are likewise all in that task list, then the original files of both intermediate data sets are in the task list, and one of them can be deleted at random; alternatively a designated intermediate data set can be deleted, such as the first or the last one. When all original pictures corresponding to the intermediate data of every set are in the task list, no original picture has been deleted or modified and only new original files have been added; but since the training data must be randomly shuffled, one intermediate data set is deleted so that the original-picture tasks of that set are randomly shuffled together with the newly added original-file tasks, while the original-picture tasks of the other intermediate data sets are deleted from the task list. Random shuffling is thereby achieved, incremental generation is realized, the intermediate data need not be regenerated in full, and the data processing speed is improved.
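The two selection modes can be captured in a few lines; the mode names and the index convention (0 for the first set, -1 for the last) are hypothetical conveniences for this sketch:

```python
import random

def pick_set_to_delete(set_names, mode="random", index=0, seed=None):
    # mode="random": pick any one intermediate data set;
    # mode="designated": pick by position, e.g. index=0 for the
    # first set or index=-1 for the last set.
    if mode == "random":
        return random.Random(seed).choice(set_names)
    return set_names[index]
```

Random selection maximizes the variety of regenerated data across runs, while designated selection keeps the choice deterministic and reproducible.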
It should be noted that the above embodiments can be freely combined as needed. The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and adaptations may be made by those skilled in the art without departing from the principles of the present invention, and such modifications are intended to fall within the scope of the present invention.