
CN117272006A - An active learning training acceleration method and system based on database index technology - Google Patents

An active learning training acceleration method and system based on database index technology

Info

Publication number
CN117272006A
Authority
CN
China
Prior art keywords
sample
active learning
module
untrained
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311052101.9A
Other languages
Chinese (zh)
Other versions
CN117272006B (en)
Inventor
侯捷
伍赛
陈刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202311052101.9A priority Critical patent/CN117272006B/en
Publication of CN117272006A publication Critical patent/CN117272006A/en
Application granted granted Critical
Publication of CN117272006B publication Critical patent/CN117272006B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/217 - Validation; Performance evaluation; Active pattern learning techniques
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/22 - Indexing; Data structures therefor; Storage structures
    • G06F 16/2228 - Indexing structures
    • G06F 16/2272 - Management thereof
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an active learning training acceleration method and system based on database indexing technology. The method comprises the following steps: input each training sample into a sample feature extraction module, which outputs a feature vector; input the feature vectors into an active learning evaluation module, which produces sample scores and calls a high-performance index module to sort them; the high-performance index module stores the sample scores and maintains their ordering relation; the integrated active learning algorithm module calls the high-performance index module to pre-screen samples, selects the samples to be trained with an active learning algorithm, and feeds them into the deep learning model to be trained; these steps repeat until training is complete. By combining active learning with an efficient data index structure, the invention accelerates the active learning training process and improves model training efficiency and performance. The method optimizes how data are stored and indexed, reduces the complexity of data operations, and thereby raises the overall efficiency of the training process, which can be accelerated on the basis of an efficient database index structure and a lightweight active learning algorithm.

Description

Active learning training acceleration method and system based on database index technology
Technical Field
The invention relates to the technical field of artificial intelligence (AI) and database indexing, and in particular to an active learning training acceleration method and system based on database indexing technology.
Background
In recent years, deep neural networks have developed rapidly in various fields and have become a core technology for many tasks. However, training these deep neural networks typically requires a large amount of high-quality labeled data to achieve optimal performance. Taking ChatGPT as an example, its complete training requires a corpus on the scale of tens of millions. Although large datasets for deep learning model training already exist, their quality and quantity still do not meet the requirements of deep neural network model training. Therefore active learning, a technology that can effectively select and label the most valuable samples, has been widely used in deep learning model training to reduce labeling costs and improve model performance.
Active learning is an effective learning paradigm that selects the most informative samples for labeling in a targeted manner, thereby reducing labeling cost and improving model performance. In active learning, the key to sample selection is the ability to quickly and efficiently find the samples that are most useful to the current model's training. In recent years, researchers have proposed many methods and techniques for active learning to improve the efficiency and performance of sample selection.
At present, although research on active learning has achieved notable results, conventional sample selection methods still face several open problems when processing large-scale datasets:
1. Repeated training: in conventional deep learning model training, the model may be trained repeatedly on many similar data samples, which makes it difficult to efficiently retrieve the samples that are most useful to the model. More intelligent and efficient sample selection strategies are therefore needed to optimize the model's training efficiency and performance.
2. Inefficient data scanning and position adjustment: traditional active learning sample selection methods can suffer from inefficient data scanning and data-position adjustment on large-scale datasets. This reduces training efficiency on a large dataset and increases the computational cost of training.
3. Maintenance overhead: when a data index requires extensive updates, such as insertions or deletions of data, data must be moved and reordered, which creates significant maintenance overhead for the dataset. More efficient data indexing and updating methods are therefore needed to reduce maintenance costs and keep the dataset stable.
In summary, in the conventional deep learning training process the model is trained repeatedly on a large number of similar data samples and thus struggles to reach its best effect, and conventional active learning sample selection faces inefficient data scanning and position adjustment on large-scale datasets, reducing training efficiency on large datasets and increasing the computational cost of training.
Disclosure of Invention
In order to solve the problems in the background technology, the invention provides an active learning training acceleration method and system based on a database indexing technology.
The technical scheme adopted by the invention is as follows:
1. an active learning training acceleration method based on database indexing technology, comprising:
step 1), an active learning training acceleration model is established, wherein the active learning training acceleration model comprises a sample feature extraction module, an active learning evaluation module, a high-performance index module and an integrated active learning algorithm module.
Step 2) inputting each untrained sample and trained sample into a sample feature extraction module, and outputting feature vectors of each untrained sample and trained sample by the sample feature extraction module.
And 3) inputting the feature vectors of each untrained sample and trained sample into an active learning evaluation module, outputting the sample scores of each untrained sample by the active learning evaluation module, and sequencing the sample scores of each untrained sample from high to low by calling a high-performance index module.
And 4) the high-performance index module stores the sample scores of the untrained samples, and simultaneously, the high-performance index module maintains the ordering order relation of the sample scores of the untrained samples.
And 5) the integrated active learning algorithm module performs pre-screening on each untrained sample by calling the high-performance index module to obtain a plurality of boundary samples, then selects a plurality of samples to be trained from each boundary sample by using an active learning algorithm, and inputs each sample to be trained into a deep learning model to be trained for training.
Step 6) repeating steps 1) -5) until training is completed.
In step 2), the sample feature extraction module is specifically a self-supervised DINO (DETR with Improved deNoising anchOr boxes) model. Before feature extraction, the training samples must undergo a uniform, conventional preprocessing operation.
In step 3), the feature vectors of each untrained sample and trained sample are input into the active learning evaluation module. The module first uses the k-nearest-neighbors (KNN) algorithm to retrieve and select, for the feature vector of each trained sample, the feature vectors of its nearest-neighbor untrained samples; the number of times each untrained sample's feature vector is selected then becomes its sample score.
In the initialization stage of the active learning evaluation module, the initial sample score of the feature vector of each untrained sample is 0:

$\{S_{x'}\}_{\text{init}} = 0, \quad x' \in D_{\text{unlabeled}}$

Each subsequent training round iteratively updates the sample score of the feature vector of each untrained sample, as follows:

$S^{t}_{x'_i} = S^{t-1}_{x'_i} + 1, \quad \forall x_i \in D_{\text{labeled}},\ x'_i \in \mathrm{KNN}(x_i)$

where $S^{t}_{x'_i}$ and $S^{t-1}_{x'_i}$ denote the sample score of the untrained sample's feature vector after the current and the previous iteration, respectively; $x_i$ and $x'_i$ denote the feature vector of a trained sample and the feature vector of the nearest-neighbor untrained sample selected by the KNN algorithm; and $D_{\text{labeled}}$ denotes the set of feature vectors of the trained samples.
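The scoring rule above can be illustrated in a few lines. The following is a minimal sketch (not the patent's implementation) that uses brute-force Euclidean KNN and hypothetical toy vectors: every untrained feature vector earns one point each time it appears among the k nearest untrained neighbors of a trained feature vector.

```python
import numpy as np

def knn_frequency_scores(labeled_feats, unlabeled_feats, k=2):
    # {S_x'}_init = 0 for every untrained sample
    scores = np.zeros(len(unlabeled_feats))
    for x in labeled_feats:
        # Euclidean distances from this trained sample to all untrained ones
        dists = np.linalg.norm(unlabeled_feats - x, axis=1)
        for idx in np.argsort(dists)[:k]:   # its k nearest untrained neighbors
            scores[idx] += 1                # S^t = S^{t-1} + 1
    return scores

labeled = np.array([[0.0, 0.0], [10.0, 10.0]])
unlabeled = np.array([[0.1, 0.1], [9.9, 9.9], [5.0, 5.0]])
print(knn_frequency_scores(labeled, unlabeled, k=1).tolist())
```

With k=1 each trained sample votes for its single nearest untrained neighbor, so the two untrained samples close to the trained ones score 1 and the outlier scores 0.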
In step 4), the high-performance index module comprises a semi-ordered index structure and a state information record table. The semi-ordered index structure consists of a number of data blocks that are kept in an ordered relation with one another; each data block stores a number of data items that are kept unordered within the block, and each data item comprises an untrained sample and its sample score. When a data item is inserted into the index, the data blocks are traversed in order from the index head until the first block whose storage range covers the item's score is found, and the item is then stored in a free slot of that block.
The state information record table records the state information of every data item in every data block, i.e. the update state of each untrained sample's feature vector; the update state is one of: the sample score is being updated, has not been updated, or has been updated.
The high-performance index module maintains the sample-score ordering on top of this update-optimized database index structure, so that the other modules can quickly retrieve the best samples.
In step 4), the high-performance index module maintains the sorting order relation of the untrained samples' sample scores using the state information of each data item in each data block. Specifically, when a data item's state information is in the being-updated state, there is no need to traverse the index to update the item's sample score; only the increment contributed by the current update is written to the state information record table. When a data item's state information is in the not-updated state, the state is changed to the being-updated state, the index is traversed to update the item's sample score, and the sample's position in the index is adjusted.
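The semi-ordered index and its lazy maintenance path can be sketched as follows. This is a simplified illustration under assumptions (block boundaries, method names, and the `pending` table are invented for the sketch), not the patent's implementation; the point is that blocks stay ordered while items inside a block do not, and that repeat updates touch only the state table rather than the index.

```python
import bisect

class SemiOrderedIndex:
    """Sketch: blocks ordered by score boundary, items unordered within a
    block; a state table absorbs repeat updates without index traversal."""
    def __init__(self, upper_bounds):
        self.upper_bounds = upper_bounds           # ascending block boundaries
        self.blocks = [[] for _ in upper_bounds]   # unordered items per block
        self.state = {}                            # sample_id -> "updating"
        self.pending = {}                          # sample_id -> deferred increments

    def insert(self, sample_id, score):
        # scan from the index head for the first block whose range covers the score
        i = min(bisect.bisect_left(self.upper_bounds, score), len(self.blocks) - 1)
        self.blocks[i].append((sample_id, score))

    def bump(self, sample_id):
        if self.state.get(sample_id) == "updating":
            # 'being updated': record the increment only, no index traversal
            self.pending[sample_id] = self.pending.get(sample_id, 0) + 1
        else:
            # 'not updated': mark it; a full traversal/relocation would follow here
            self.state[sample_id] = "updating"
            self.pending[sample_id] = 1

idx = SemiOrderedIndex(upper_bounds=[10, 20, 30])
idx.insert("a", 5); idx.insert("b", 25)
idx.bump("b"); idx.bump("b")
print([len(b) for b in idx.blocks], idx.pending["b"])
```

Because items within a block are unordered, an insert costs only a block scan plus an append, while the block ordering still lets a reader pull the highest-scoring blocks first.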
In step 5), the integrated active learning algorithm module pre-screens the untrained samples by calling the high-performance index module to obtain a number of boundary samples. Specifically, the higher-scoring untrained samples are pre-screened as boundary samples subject to the maximization screening target:

$D_{\text{selected}} = \arg\max_{D \subseteq D_{\text{unlabeled}}} \sum_{x \in D} \big( I(x) + C(x) \big)$

where $D_{\text{selected}}$ denotes the pre-screened boundary sample set, which contains a number of boundary samples; $x$ denotes an untrained sample; and $I(\cdot)$ and $C(\cdot)$ denote the information-content function and the contribution function, respectively. $I(\cdot)$ is obtained from the confidence produced by model inference, and $C(\cdot)$ is computed from the correlation between the untrained sample and the trained samples.
Given a training data set $D_{\text{labeled}} = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is an input sample and $y_i$ its corresponding label, and a model $f(x; \theta)$, where $\theta$ are the parameters of the trained model and $x$ is the training data input to the model, the objective is to select from $D_{\text{unlabeled}}$ the sample set $D_{\text{selected}}$ that maximizes the total information content and contribution.
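Under an additive reading of the screening target (an assumption; the patent does not give I() and C() in closed form), the maximizing subset of a fixed size is simply the samples with the highest combined per-sample score, so pre-screening reduces to a top-k selection:

```python
import numpy as np

def prescreen(info, contrib, budget):
    # The objective sums per-sample terms I(x) + C(x), so the best subset of
    # size `budget` is the top-`budget` samples by combined score.
    # `info` and `contrib` are illustrative arrays standing in for I() and C().
    combined = np.asarray(info) + np.asarray(contrib)
    return sorted(np.argsort(-combined)[:budget].tolist())

print(prescreen(info=[0.2, 0.9, 0.5], contrib=[0.1, 0.0, 0.6], budget=2))
```

This is exactly the operation the high-performance index accelerates: reading the highest-scoring samples without re-sorting the whole pool.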
In step 5), the active learning algorithm is specifically a least-confidence algorithm or a maximum-entropy algorithm.
The integrated active learning algorithm module applies an existing active learning algorithm to further sort out the most valuable samples within $D_{\text{selected}}$. In this module, the index module is used to select the top-scoring batch of samples as $D_{\text{selected}}$, and the existing active learning algorithm is then run on that batch. Because $D_{\text{selected}}$ contains far less data than $D_{\text{unlabeled}}$, the method reduces the screening cost and speeds up the training process.
For the samples that participate in training, the integrated active learning algorithm module calls the interface of the active learning evaluation module to update the scores of the unlabeled samples. In addition, the module adjusts the order relation of the samples in the high-performance index module through the active learning evaluation module.
2. Active learning training acceleration model based on active learning training acceleration method:
the model includes a sample feature extraction module for feature extraction of each of the untrained sample and the trained sample input sample to obtain a feature vector.
The model comprises an active learning evaluation module which performs active learning evaluation based on the feature vector output by the sample feature extraction module to obtain a sample score and calls the high-performance index module to perform sample order relation adjustment.
The model includes a high performance indexing module for saving and updating sample scores and ordering.
The model comprises an integrated active learning algorithm module which calls a high-performance index module to conduct pre-screening and then conduct final screening to obtain a sample to be trained.
The beneficial effects of the invention are as follows:
the invention combines active learning and high-efficiency data index structure to accelerate active learning training process, and provides a new active learning training strategy which can improve the efficiency and performance of model training. Meanwhile, the storage and index modes of the data are optimized, the complexity of data operation is reduced, maintenance expenditure is reduced, the overall efficiency of the training process is improved, and the training process can be accelerated based on an efficient database index structure and a lightweight active learning algorithm.
Drawings
FIG. 1 is a flow chart of the steps of the invention;
FIG. 2 is a high performance index module design of the present invention.
Detailed Description
The invention will be described in further detail with reference to the accompanying drawings and specific examples.
As shown in fig. 1, the active learning training acceleration method based on the database indexing technology includes:
step 1), an active learning training acceleration model is established, wherein the active learning training acceleration model comprises a sample feature extraction module, an active learning evaluation module, a high-performance index module and an integrated active learning algorithm module.
Step 2) inputting each untrained sample and trained sample into a sample feature extraction module, and outputting feature vectors of each untrained sample and trained sample by the sample feature extraction module.
In step 2), the sample feature extraction module is specifically a self-supervised DINO (DETR with Improved deNoising anchOr boxes) model. Before feature extraction, the training samples must undergo a uniform, conventional preprocessing operation.
And 3) inputting the feature vectors of each untrained sample and trained sample into an active learning evaluation module, outputting the sample scores of each untrained sample by the active learning evaluation module, and sequencing the sample scores of each untrained sample from high to low by calling a high-performance index module.
In step 3), the feature vectors of each untrained sample and trained sample are input into the active learning evaluation module. The module first uses the k-nearest-neighbors (KNN) algorithm to retrieve and select, for the feature vector of each trained sample, the feature vectors of its nearest-neighbor untrained samples; the number of times each untrained sample's feature vector is selected then becomes its sample score.
In the initialization stage of the active learning evaluation module, the initial sample score of the feature vector of each untrained sample is 0:

$\{S_{x'}\}_{\text{init}} = 0, \quad x' \in D_{\text{unlabeled}}$

Each subsequent training round iteratively updates the sample score of the feature vector of each untrained sample, as follows:

$S^{t}_{x'_i} = S^{t-1}_{x'_i} + 1, \quad \forall x_i \in D_{\text{labeled}},\ x'_i \in \mathrm{KNN}(x_i)$

where $S^{t}_{x'_i}$ and $S^{t-1}_{x'_i}$ denote the sample score of the untrained sample's feature vector after the current and the previous iteration, respectively; $x_i$ and $x'_i$ denote the feature vector of a trained sample and the feature vector of the nearest-neighbor untrained sample selected by the KNN algorithm; and $D_{\text{labeled}}$ denotes the set of feature vectors of the trained samples.
And 4) the high-performance index module stores the sample scores of the untrained samples, and simultaneously, the high-performance index module maintains the ordering order relation of the sample scores of the untrained samples.
In step 4), the high-performance index module comprises a semi-ordered index structure and a state information record table. The semi-ordered index structure consists of a number of data blocks that are kept in an ordered relation with one another; each data block stores a number of data items that are kept unordered within the block, and each data item comprises an untrained sample and its sample score. When a data item is inserted into the index, the data blocks are traversed in order from the index head until the first block whose storage range covers the item's score is found, and the item is then stored in a free slot of that block.
The state information record table records the state information of every data item in every data block, i.e. the update state of each untrained sample's feature vector; the update state is one of: the sample score is being updated, has not been updated, or has been updated.
The high-performance index module maintains the sample-score ordering on top of this update-optimized database index structure, so that the other modules can quickly retrieve the best samples.
In step 4), the high-performance index module maintains the sorting order relation of the untrained samples' sample scores using the state information of each data item in each data block. Specifically, when a data item's state information is in the being-updated state, there is no need to traverse the index to update the item's sample score; only the increment contributed by the current update is written to the state information record table. When a data item's state information is in the not-updated state, the state is changed to the being-updated state, the index is traversed to update the item's sample score, and the sample's position in the index is adjusted.
And 5) the integrated active learning algorithm module performs pre-screening on each untrained sample by calling the high-performance index module to obtain a plurality of boundary samples, then selects a plurality of samples to be trained from each boundary sample by using an active learning algorithm, and inputs each sample to be trained into a deep learning model to be trained for training.
In step 5), the integrated active learning algorithm module pre-screens the untrained samples by calling the high-performance index module to obtain a number of boundary samples. Specifically, the higher-scoring untrained samples are pre-screened as boundary samples subject to the maximization screening target:

$D_{\text{selected}} = \arg\max_{D \subseteq D_{\text{unlabeled}}} \sum_{x \in D} \big( I(x) + C(x) \big)$

where $D_{\text{selected}}$ denotes the pre-screened boundary sample set, which contains a number of boundary samples; $x$ denotes an untrained sample; and $I(\cdot)$ and $C(\cdot)$ denote the information-content function and the contribution function, respectively. $I(\cdot)$ is obtained from the confidence produced by model inference, and $C(\cdot)$ is computed from the correlation between the untrained sample and the trained samples.
Given a training data set $D_{\text{labeled}} = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is an input sample and $y_i$ its corresponding label, and a model $f(x; \theta)$, where $\theta$ are the parameters of the trained model and $x$ is the training data input to the model, the objective is to select from $D_{\text{unlabeled}}$ the sample set $D_{\text{selected}}$ that maximizes the total information content and contribution.
In step 5), the active learning algorithm is specifically a least-confidence algorithm or a maximum-entropy algorithm.
The integrated active learning algorithm module applies an existing active learning algorithm to further sort out the most valuable samples within $D_{\text{selected}}$. In this module, the index module is used to select the top-scoring batch of samples as $D_{\text{selected}}$, and the existing active learning algorithm is then run on that batch. Because $D_{\text{selected}}$ contains far less data than $D_{\text{unlabeled}}$, the method reduces the screening cost and speeds up the training process.
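The two standard final-screening criteria named above are straightforward to state over softmax outputs. The following is a hedged sketch with hypothetical probability values, not the patent's code: least-confidence takes the samples whose top class probability is lowest, maximum-entropy takes those whose predictive distribution is most uniform.

```python
import numpy as np

def least_confidence(probs, batch):
    # lowest top-class probability = least confident = most valuable
    return np.argsort(probs.max(axis=1))[:batch]

def max_entropy(probs, batch):
    # highest predictive entropy = most uncertain
    ent = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return np.argsort(-ent)[:batch]

# hypothetical softmax outputs for three samples in D_selected
p = np.array([[0.9, 0.1], [0.5, 0.5], [0.6, 0.4]])
print(sorted(least_confidence(p, 2).tolist()))
```

On these toy values both criteria agree that the 0.5/0.5 sample is the most uncertain; in general they can rank samples differently on multi-class outputs.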
For the samples that participate in training, the integrated active learning algorithm module calls the interface of the active learning evaluation module to update the scores of the unlabeled samples. In addition, the module adjusts the order relation of the samples in the high-performance index module through the active learning evaluation module.
Step 6) repeating steps 1) -5) until training is completed.
The active learning training acceleration model comprises: a sample feature extraction module for extracting features from each input untrained sample and trained sample to obtain feature vectors; an active learning evaluation module that performs active learning evaluation on the feature vectors output by the sample feature extraction module to obtain sample scores, and calls the high-performance index module to adjust the sample order relation; a high-performance index module for saving and updating the sample scores and their ordering; and an integrated active learning algorithm module that calls the high-performance index module for pre-screening and then performs final screening to obtain the samples to be trained.
As shown in fig. 2, the high-performance index module consists of a semi-ordered index structure and a state information record table. The former is a chained structure in which each data node is a block capable of holding multiple data items; the data items stored within each block are unordered, while the blocks themselves are ordered. The state information record table records the update state of each data item, and each thread in the high-performance index module accesses and updates the data nodes through it.
The untrained samples x and the trained sample set $D_{\text{train}}$ are characterized as feature vectors by a pre-trained sample representation model. The KNN neighbors of each trained sample are then computed from the feature vectors, and the sample score is computed from the frequency with which an untrained sample x appears as a neighbor. The process maintains each sample's score with the semi-ordered index structure; after the scores of all untrained samples are obtained, the highest-scoring samples x' are read from the index in order as boundary samples and combined into $D_{\text{selected}}$. Finally, within $D_{\text{selected}}$, an existing active learning algorithm selects the batch of samples with the greatest model benefit, which are then labeled and used in model training.
For the active learning scenario, the invention provides an active learning algorithm that can perceive the model's inter-class boundary on a specific dataset, compute the training value of each unlabeled sample from that boundary, select the batch of samples with the highest value, and apply the active learning algorithm to them, thereby intelligently selecting the optimal samples, ultimately reducing the number of samples required for training and accelerating the overall active learning training process.

Claims (8)

1.一种基于数据库索引技术的主动学习训练加速方法,其特征在于,包括:1. A method for accelerating active learning training based on database indexing technology, characterized in that it includes: 步骤1)建立主动学习训练加速模型,主动学习训练加速模型包括样本特征提取模块、主动学习评价模块、高性能索引模块和集成主动学习算法模块;Step 1) Establish an active learning training acceleration model, which includes a sample feature extraction module, an active learning evaluation module, a high-performance indexing module, and an integrated active learning algorithm module. 步骤2)将各个未训练样本和已训练样本输入样本特征提取模块中,样本特征提取模块输出各个未训练样本和已训练样本的特征向量;Step 2) Input each untrained sample and trained sample into the sample feature extraction module, and the sample feature extraction module outputs the feature vectors of each untrained sample and trained sample. 步骤3)将各个未训练样本和已训练样本的特征向量输入主动学习评价模块中,主动学习评价模块输出各个未训练样本的样本评分,通过调用高性能索引模块对各个未训练样本的样本评分从高到低进行排序;Step 3) Input the feature vectors of each untrained sample and trained sample into the active learning evaluation module. The active learning evaluation module outputs the sample score of each untrained sample. The high-performance indexing module is called to sort the sample scores of each untrained sample from high to low. 步骤4)高性能索引模块对各个未训练样本的样本评分进行保存,同时高性能索引模块进行各个未训练样本的样本评分的排序次序关系维护;Step 4) The high-performance indexing module saves the sample scores of each untrained sample, and at the same time maintains the sorting order of the sample scores of each untrained sample. 步骤5)集成主动学习算法模块通过调用高性能索引模块对各个未训练样本进行预筛选获得若干边界样本,然后集成主动学习算法模块使用主动学习算法在各个边界样本中挑选出若干待训练样本,将各个待训练样本输入待训练的深度学习模型中进行训练;Step 5) The integrated active learning algorithm module obtains several boundary samples by calling the high-performance indexing module to pre-screen each untrained sample. Then, the integrated active learning algorithm module uses the active learning algorithm to select several training samples from each boundary sample and inputs each training sample into the deep learning model to be trained for training. 
步骤6)重复步骤1)-5)直至训练完成。Step 6) Repeat steps 1)-5) until training is complete. 2.根据权利要求1所述的基于数据库索引技术的主动学习训练加速方法,其特征在于:所述的步骤2)中,样本特征提取模块具体为自监督DINO模型。2. The active learning training acceleration method based on database indexing technology according to claim 1, characterized in that: in step 2), the sample feature extraction module is specifically a self-supervised DINO model. 3.根据权利要求1所述的基于数据库索引技术的主动学习训练加速方法,其特征在于:所述的步骤3)中,将各个未训练样本和已训练样本的特征向量输入主动学习评价模块中,主动学习评价模块首先使用k最近邻KNN算法进行检索并挑选出每个已训练样本的特征向量的最近邻的未训练样本的特征向量,然后获得每个未训练样本的特征向量被选中的频次作为各自的样本评分;3. The active learning training acceleration method based on database indexing technology according to claim 1, characterized in that: in step 3), the feature vectors of each untrained sample and trained sample are input into the active learning evaluation module. The active learning evaluation module first uses the k nearest neighbor KNN algorithm to search and select the feature vectors of the nearest untrained samples of each trained sample's feature vector, and then obtains the frequency of the feature vector of each untrained sample being selected as its respective sample score. 在主动学习评价模块初始化阶段,每个未训练样本的特征向量的初始样本评分为0,随后每轮训练均对每个未训练样本的特征向量的样本评分进行迭代更新,具体如下:In the initialization phase of the active learning evaluation module, the initial sample score of the feature vector of each untrained sample is 0. 
Subsequently, in each training round, the sample score of each untrained sample's feature vector is iteratively updated, as follows:

s_t(x′_i) = s_{t−1}(x′_i) + 1, for each x_i ∈ D_labeled,

where s_t(x′_i) and s_{t−1}(x′_i) denote the sample score of the untrained sample's feature vector after the current and the previous iterative update, respectively; x_i and x′_i denote a trained sample's feature vector and the nearest-neighbor untrained sample's feature vector selected for it by the KNN algorithm, respectively; and D_labeled denotes the set of the trained samples' feature vectors.
4. The active learning training acceleration method based on database indexing technology according to claim 1, characterized in that in step 4) the high-performance indexing module comprises a semi-ordered index structure and a state-information record table; the semi-ordered index structure contains a number of data blocks that maintain an ordered relationship among one another, each data block stores a number of data items that are mutually unordered, and each data item comprises one untrained sample and its sample score.
The state-information record table records the state information of every data item in every data block, i.e., the update state of each untrained sample's feature vector; the update state is one of three states: the sample score of the feature vector is being updated, has not been updated, or has already been updated.
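The frequency-based scoring of claims 2-3 can be pictured with the following sketch. This is an illustrative sketch only, not the patented implementation: it assumes brute-force Euclidean nearest-neighbor search in NumPy, and the names (`update_scores`, `labeled`, `unlabeled`) are hypothetical.

```python
import numpy as np

def update_scores(labeled, unlabeled, scores, k=1):
    """One round of the claim-3 update: every trained (labeled)
    feature vector votes for its k nearest untrained (unlabeled)
    feature vectors; an untrained sample's score is the number of
    times it has been selected."""
    for x in labeled:
        # Euclidean distance from this labeled vector to all unlabeled ones.
        d = np.linalg.norm(unlabeled - x, axis=1)
        # Increment the selection frequency of the k nearest neighbours.
        for j in np.argsort(d)[:k]:
            scores[j] += 1
    return scores
```

A brute-force scan is quadratic; a real system would delegate the neighbor search to an ANN index, which is exactly where the patent's high-performance indexing module comes in.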
5. The active learning training acceleration method based on database indexing technology according to claim 4, characterized in that in step 4) the high-performance indexing module maintains the sorting order relationship among the untrained samples' scores according to the state information of each data item in each data block. Specifically, when a data item's state is "being updated", there is no need to traverse the index to update the item's sample score; only the frequency increment produced by the current update is recorded in the state-information record table. When a data item's state is "not yet updated", the item's state is switched to "being updated" and the index is traversed to update the item's sample score.
6.
The active learning training acceleration method based on database indexing technology according to claim 1, characterized in that in step 5) the integrated active learning algorithm module calls the high-performance indexing module to pre-screen the untrained samples and obtain a number of boundary samples; specifically, subject to maximizing the screening objective, a number of untrained samples with higher sample scores are pre-screened as boundary samples. The screening objective to be maximized is as follows:

max over D_selected of Σ_{x ∈ D_selected} [ I(x) + C(x) ],

where D_selected denotes the pre-screened set of boundary samples, which contains a number of boundary samples; x denotes an untrained sample; and I(·) and C(·) denote the information-content function and the contribution function, respectively.
7. The active learning training acceleration method based on database indexing technology according to claim 1, characterized in that in step 5) the active learning algorithm is specifically a minimum-confidence algorithm, a maximum-entropy algorithm, or the like.
8. An active learning training acceleration model based on the active learning training acceleration method of any one of claims 1-7, characterized in that it comprises:
a sample feature extraction module for extracting features from each untrained and trained input sample to obtain feature vectors;
an active learning evaluation module that performs active learning evaluation on the feature vectors output by the sample feature extraction module to obtain sample scores and calls the high-performance indexing module to adjust the sample order relationship;
a high-performance indexing module for saving, updating, and sorting the sample scores; and
an integrated active learning algorithm module that calls the high-performance indexing module for pre-screening and then performs the final screening to obtain the to-be-trained samples.
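The semi-ordered index of claims 4-5 and the two-stage selection of claims 6-7 can be sketched together as follows. All names (`SemiOrderedIndex`, `bump`, `flush`, `least_confidence_pick`) are hypothetical illustrations, not the patent's implementation; block splitting and concurrency control, which a production index would need, are omitted.

```python
import numpy as np

class SemiOrderedIndex:
    """Claim 4: data blocks are ordered by score range; the data
    items inside a block are unordered."""
    def __init__(self, bounds):
        self.bounds = bounds                    # ascending score cut points
        self.blocks = [dict() for _ in bounds]  # each block: sample_id -> score
        self.state = {}                         # state table: sample_id -> buffered delta

    def _block_of(self, score):
        # First block whose upper bound exceeds the score.
        for i, b in enumerate(self.bounds):
            if score < b:
                return i
        return len(self.bounds) - 1

    def insert(self, sid, score=0):
        self.blocks[self._block_of(score)][sid] = score

    def bump(self, sid, delta=1):
        # Claim 5's lazy path: while an item counts as "being updated",
        # only buffer the frequency increment in the state table
        # instead of traversing the index on every update.
        self.state[sid] = self.state.get(sid, 0) + delta

    def flush(self):
        # Apply buffered increments, moving items between blocks so
        # the block-level ordering is restored.
        for sid, delta in self.state.items():
            for blk in self.blocks:
                if sid in blk:
                    new = blk.pop(sid) + delta
                    self.blocks[self._block_of(new)][sid] = new
                    break
        self.state.clear()

    def top(self, n):
        # Claim 6's pre-screen: the highest-scored samples live in the
        # high-end blocks, so only items within a block need sorting.
        out = []
        for blk in reversed(self.blocks):
            out += sorted(blk.items(), key=lambda kv: -kv[1])
            if len(out) >= n:
                break
        return out[:n]

def least_confidence_pick(boundary_ids, probs, n):
    # Claim 7's minimum-confidence strategy: among the pre-screened
    # boundary samples, pick those whose top predicted class
    # probability is lowest.
    conf = probs.max(axis=1)
    return sorted(boundary_ids, key=lambda sid: conf[sid])[:n]
```

Because only block order is maintained, an update usually touches one block instead of re-sorting all samples, which is the source of the claimed acceleration.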
CN202311052101.9A 2023-08-21 2023-08-21 A method and system for accelerating active learning training based on database indexing technology Active CN117272006B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311052101.9A CN117272006B (en) 2023-08-21 2023-08-21 A method and system for accelerating active learning training based on database indexing technology

Publications (2)

Publication Number Publication Date
CN117272006A true CN117272006A (en) 2023-12-22
CN117272006B CN117272006B (en) 2026-01-06

Family

ID=89199847

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311052101.9A Active CN117272006B (en) 2023-08-21 2023-08-21 A method and system for accelerating active learning training based on database indexing technology

Country Status (1)

Country Link
CN (1) CN117272006B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110737812A (en) * 2019-09-20 2020-01-31 浙江大学 A search engine user satisfaction evaluation method integrating semi-supervised learning and active learning
CN111310799A (en) * 2020-01-20 2020-06-19 中国人民大学 Active learning algorithm based on historical evaluation result
US20200250527A1 (en) * 2019-02-04 2020-08-06 Google Llc Systems and Methods for Active Learning
CN112069310A (en) * 2020-06-18 2020-12-11 中国科学院计算技术研究所 Text classification method and system based on active learning strategy
US20220300518A1 (en) * 2021-03-19 2022-09-22 International Business Machines Corporation Indexing based on feature importance
CN115130570A (en) * 2022-06-23 2022-09-30 腾讯科技(深圳)有限公司 Training method, application method, device and equipment of pulsar search model
US20220366282A1 (en) * 2021-05-06 2022-11-17 Thomson Reuters Enterprise Centre Gmbh Systems and Methods for Active Curriculum Learning
CN115661506A (en) * 2022-09-14 2023-01-31 天津大学 An active learning image classification method and a visual interactive system
CN116453617A (en) * 2023-03-30 2023-07-18 湖南大学 Multi-target optimization molecule generation method and system combining active learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YANG Wenzhu; TIAN Xiaoxiao; WANG Sile; ZHANG Xizhong: "Research Progress on Active Learning Algorithms", Journal of Hebei University (Natural Science Edition), no. 02, 25 March 2017 (2017-03-25) *
XIE Huosheng; LIU Min: "An Ensemble Co-Training Algorithm Based on Active Learning", Journal of Shandong University (Engineering Science), no. 03, 20 June 2012 (2012-06-20) *
DENG Siyu: "A PageRank-Based Active Learning Algorithm", CAAI Transactions on Intelligent Systems, vol. 14, no. 3, 31 May 2019 (2019-05-31), page 551 *

Also Published As

Publication number Publication date
CN117272006B (en) 2026-01-06

Similar Documents

Publication Publication Date Title
CN111738301B (en) A Long Tail Distribution Image Data Recognition Method Based on Two-Channel Learning
Zhong et al. Blockqnn: Efficient block-wise neural network architecture generation
Zhong et al. Practical block-wise neural network architecture generation
CN106980641B (en) Unsupervised Hash quick picture retrieval system and unsupervised Hash quick picture retrieval method based on convolutional neural network
CN114299362B (en) A small sample image classification method based on k-means clustering
CN113378913B (en) A semi-supervised node classification method based on self-supervised learning
CN112700060A (en) Station terminal load prediction method and prediction device
CN116503676B (en) An image classification method and system based on knowledge distillation and small sample incremental learning
Tan et al. Clonal particle swarm optimization and its applications
CN116452862B (en) Image classification method based on domain generalization learning
Fujino et al. Deep convolutional networks for human sketches by means of the evolutionary deep learning
CN110909158B (en) Text classification method based on improved firefly algorithm and K nearest neighbor
CN114676275A (en) Deep Supervised Hash Retrieval Method and System Based on Greedy Asymmetric Loss
WO2022169625A1 (en) Improved fine-tuning strategy for few shot learning
CN106022293B (en) A Pedestrian Re-identification Method Based on Adaptive Shared Niche Evolutionary Algorithm
CN117272006B (en) A method and system for accelerating active learning training based on database indexing technology
CN116630718A (en) A Prototype-Based Low Perturbation Image-like Incremental Learning Algorithm
CN119830957A (en) Quick searching method for optimal embedded position of large model
Hu et al. Automatic channel pruning by neural network based on improved poplar optimisation algorithm
CN118038277B (en) A robot scene recognition method based on lifelong learning
Ferdinand et al. Feature expansion and enhanced compression for class incremental learning
CN118132679A (en) Vector retrieval method based on residual quantization
CN113313255B (en) An unsupervised domain adaptation method based on neural network architecture search
Huisman et al. A preliminary study on the feature representations of transfer learning and gradient-based meta-learning techniques
Shi et al. Selecting useful knowledge from previous tasks for future learning in a single network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant