
CN117272006A - An active learning training acceleration method and system based on database index technology - Google Patents

An active learning training acceleration method and system based on database index technology

Info

Publication number
CN117272006A
Authority
CN
China
Prior art keywords
sample
active learning
module
untrained
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311052101.9A
Other languages
Chinese (zh)
Other versions
CN117272006B (en)
Inventor
侯捷
伍赛
陈刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202311052101.9A priority Critical patent/CN117272006B/en
Publication of CN117272006A publication Critical patent/CN117272006A/en
Application granted granted Critical
Publication of CN117272006B publication Critical patent/CN117272006B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/217 - Validation; Performance evaluation; Active pattern learning techniques
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/22 - Indexing; Data structures therefor; Storage structures
    • G06F 16/2228 - Indexing structures
    • G06F 16/2272 - Management thereof
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an active learning training acceleration method and system based on database indexing technology. The method comprises the following steps: input each training sample into a sample feature extraction module, which outputs a feature vector; input the feature vectors into an active learning evaluation module, which produces sample scores and calls a high-performance index module to sort them; the high-performance index module stores the sample scores and maintains their ordering relation; the integrated active learning algorithm module calls the high-performance index module to pre-screen samples, selects the samples to be trained with an active learning algorithm, and feeds them into the deep learning model to be trained; these steps repeat until training is complete. By combining active learning with an efficient data index structure, the invention accelerates the active learning training process and improves model training efficiency and performance. The method optimizes how data are stored and indexed, reduces the complexity of data operations, and thereby raises the overall efficiency of the training process, which can be accelerated on the basis of an efficient database index structure and a lightweight active learning algorithm.

Description

Active learning training acceleration method and system based on database index technology
Technical Field
The invention relates to the technical field of artificial intelligence (AI) and database indexing, and in particular to an active learning training acceleration method and system based on database indexing technology.
Background
In recent years, deep neural networks have developed rapidly in various fields and have become a core technology for many tasks. However, training these deep neural networks typically requires a large amount of high-quality labeled data to achieve optimal performance. Taking ChatGPT as an example, its complete training requires a corpus on the scale of tens of millions. Although large datasets for deep learning model training already exist, their quality and quantity still do not meet the requirements of deep neural network model training. Therefore active learning, a technology that can effectively select and label the most valuable samples, has been widely used in deep learning model training to reduce labeling costs and improve model performance.
Active learning is an effective learning paradigm that selects the most informative samples for labeling in a targeted manner, thereby reducing labeling cost and improving model performance. In active learning, the key to sample selection is the ability to quickly and efficiently find the samples that are most useful to the current model's training. In recent years, researchers have proposed many methods and techniques for active learning to improve the efficiency and performance of sample selection.
At present, although research on active learning has achieved notable results, conventional sample selection methods still face several open problems when processing large-scale datasets:
1. Repeated training: in conventional deep learning model training, the model may be trained repeatedly on many similar data samples, which makes it difficult to efficiently retrieve the samples that are most useful to the model. More intelligent and efficient sample selection strategies are therefore needed to optimize the model's training efficiency and performance.
2. Inefficient data scanning and position adjustment: traditional active learning sample selection methods can suffer from inefficient data scanning and data-position adjustment on large-scale datasets. This reduces training efficiency on a large dataset and increases the computational cost of training.
3. Maintenance overhead: when a data index requires extensive updates, such as insertions or deletions of data, data must be moved and reordered, which creates significant maintenance overhead for the dataset. More efficient data indexing and updating methods are therefore needed to reduce maintenance costs and keep the dataset stable.
In summary, in the conventional deep learning training process the model is trained repeatedly on a large number of similar data samples and thus struggles to reach its best effect, and conventional active learning sample selection faces inefficient data scanning and position adjustment on large-scale datasets, reducing training efficiency on large datasets and increasing the computational cost of training.
Disclosure of Invention
In order to solve the problems in the background technology, the invention provides an active learning training acceleration method and system based on a database indexing technology.
The technical scheme adopted by the invention is as follows:
1. an active learning training acceleration method based on database indexing technology, comprising:
step 1), an active learning training acceleration model is established, wherein the active learning training acceleration model comprises a sample feature extraction module, an active learning evaluation module, a high-performance index module and an integrated active learning algorithm module.
Step 2) inputting each untrained sample and trained sample into a sample feature extraction module, and outputting feature vectors of each untrained sample and trained sample by the sample feature extraction module.
And 3) inputting the feature vectors of each untrained sample and trained sample into an active learning evaluation module, outputting the sample scores of each untrained sample by the active learning evaluation module, and sequencing the sample scores of each untrained sample from high to low by calling a high-performance index module.
And 4) the high-performance index module stores the sample scores of the untrained samples, and simultaneously, the high-performance index module maintains the ordering order relation of the sample scores of the untrained samples.
And 5) the integrated active learning algorithm module performs pre-screening on each untrained sample by calling the high-performance index module to obtain a plurality of boundary samples, then selects a plurality of samples to be trained from each boundary sample by using an active learning algorithm, and inputs each sample to be trained into a deep learning model to be trained for training.
Step 6) repeating steps 1) -5) until training is completed.
In step 2), the sample feature extraction module is specifically a self-supervised DINO (DETR with Improved deNoising anchOr boxes) model. Before feature extraction, the training samples must undergo a uniform, conventional preprocessing operation.
In step 3), the feature vectors of each untrained sample and trained sample are input into the active learning evaluation module. The module first uses the k-nearest-neighbors (KNN) algorithm to retrieve and select, for the feature vector of each trained sample, the feature vectors of its nearest-neighbor untrained samples; the number of times each untrained sample's feature vector is selected then becomes its sample score.
In the initialization stage of the active learning evaluation module, the initial sample score of the feature vector of each untrained sample is 0:

$\{S_{x'}\}_{\text{init}} = 0, \quad x' \in D_{\text{unlabeled}}$

Each subsequent training round iteratively updates the sample score of the feature vector of each untrained sample, as follows:

$S^{t}_{x'_i} = S^{t-1}_{x'_i} + 1, \quad \forall x_i \in D_{\text{labeled}},\ x'_i \in \mathrm{KNN}(x_i)$

where $S^{t}_{x'_i}$ and $S^{t-1}_{x'_i}$ denote the sample score of the untrained sample's feature vector after the current and the previous iteration, respectively; $x_i$ and $x'_i$ denote the feature vector of a trained sample and the feature vector of the nearest-neighbor untrained sample selected by the KNN algorithm; and $D_{\text{labeled}}$ denotes the set of feature vectors of the trained samples.
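The scoring rule above can be illustrated in a few lines. The following is a minimal sketch (not the patent's implementation) that uses brute-force Euclidean KNN and hypothetical toy vectors: every untrained feature vector earns one point each time it appears among the k nearest untrained neighbors of a trained feature vector.

```python
import numpy as np

def knn_frequency_scores(labeled_feats, unlabeled_feats, k=2):
    # {S_x'}_init = 0 for every untrained sample
    scores = np.zeros(len(unlabeled_feats))
    for x in labeled_feats:
        # Euclidean distances from this trained sample to all untrained ones
        dists = np.linalg.norm(unlabeled_feats - x, axis=1)
        for idx in np.argsort(dists)[:k]:   # its k nearest untrained neighbors
            scores[idx] += 1                # S^t = S^{t-1} + 1
    return scores

labeled = np.array([[0.0, 0.0], [10.0, 10.0]])
unlabeled = np.array([[0.1, 0.1], [9.9, 9.9], [5.0, 5.0]])
print(knn_frequency_scores(labeled, unlabeled, k=1).tolist())
```

With k=1 each trained sample votes for its single nearest untrained neighbor, so the two untrained samples close to the trained ones score 1 and the outlier scores 0.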
In step 4), the high-performance index module comprises a semi-ordered index structure and a state information record table. The semi-ordered index structure consists of a number of data blocks that are kept in an ordered relation with one another; each data block stores a number of data items that are kept unordered within the block, and each data item comprises an untrained sample and its sample score. When a data item is inserted into the index, the data blocks are traversed in order from the index head until the first block whose storage range covers the item's score is found, and the item is then stored in a free slot of that block.
The state information record table records the state information of every data item in every data block, i.e. the update state of each untrained sample's feature vector; the update state is one of: the sample score is being updated, has not been updated, or has been updated.
The high-performance index module maintains the sample-score ordering on top of this update-optimized database index structure, so that the other modules can quickly retrieve the best samples.
In step 4), the high-performance index module maintains the sorting order relation of the untrained samples' sample scores using the state information of each data item in each data block. Specifically, when a data item's state information is in the being-updated state, there is no need to traverse the index to update the item's sample score; only the increment contributed by the current update is written to the state information record table. When a data item's state information is in the not-updated state, the state is changed to the being-updated state, the index is traversed to update the item's sample score, and the sample's position in the index is adjusted.
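The semi-ordered index and its lazy maintenance path can be sketched as follows. This is a simplified illustration under assumptions (block boundaries, method names, and the `pending` table are invented for the sketch), not the patent's implementation; the point is that blocks stay ordered while items inside a block do not, and that repeat updates touch only the state table rather than the index.

```python
import bisect

class SemiOrderedIndex:
    """Sketch: blocks ordered by score boundary, items unordered within a
    block; a state table absorbs repeat updates without index traversal."""
    def __init__(self, upper_bounds):
        self.upper_bounds = upper_bounds           # ascending block boundaries
        self.blocks = [[] for _ in upper_bounds]   # unordered items per block
        self.state = {}                            # sample_id -> "updating"
        self.pending = {}                          # sample_id -> deferred increments

    def insert(self, sample_id, score):
        # scan from the index head for the first block whose range covers the score
        i = min(bisect.bisect_left(self.upper_bounds, score), len(self.blocks) - 1)
        self.blocks[i].append((sample_id, score))

    def bump(self, sample_id):
        if self.state.get(sample_id) == "updating":
            # 'being updated': record the increment only, no index traversal
            self.pending[sample_id] = self.pending.get(sample_id, 0) + 1
        else:
            # 'not updated': mark it; a full traversal/relocation would follow here
            self.state[sample_id] = "updating"
            self.pending[sample_id] = 1

idx = SemiOrderedIndex(upper_bounds=[10, 20, 30])
idx.insert("a", 5); idx.insert("b", 25)
idx.bump("b"); idx.bump("b")
print([len(b) for b in idx.blocks], idx.pending["b"])
```

Because items within a block are unordered, an insert costs only a block scan plus an append, while the block ordering still lets a reader pull the highest-scoring blocks first.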
In step 5), the integrated active learning algorithm module pre-screens the untrained samples by calling the high-performance index module to obtain a number of boundary samples. Specifically, the higher-scoring untrained samples are pre-screened as boundary samples subject to the maximization screening target:

$D_{\text{selected}} = \arg\max_{D \subseteq D_{\text{unlabeled}}} \sum_{x \in D} \big( I(x) + C(x) \big)$

where $D_{\text{selected}}$ denotes the pre-screened boundary sample set, which contains a number of boundary samples; $x$ denotes an untrained sample; and $I(\cdot)$ and $C(\cdot)$ denote the information-content function and the contribution function, respectively. $I(\cdot)$ is obtained from the confidence produced by model inference, and $C(\cdot)$ is computed from the correlation between the untrained sample and the trained samples.
Given a training data set $D_{\text{labeled}} = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is an input sample and $y_i$ its corresponding label, and a model $f(x; \theta)$, where $\theta$ are the parameters of the trained model and $x$ is the training data input to the model, the objective is to select from $D_{\text{unlabeled}}$ the sample set $D_{\text{selected}}$ that maximizes the total information content and contribution.
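Under an additive reading of the screening target (an assumption; the patent does not give I() and C() in closed form), the maximizing subset of a fixed size is simply the samples with the highest combined per-sample score, so pre-screening reduces to a top-k selection:

```python
import numpy as np

def prescreen(info, contrib, budget):
    # The objective sums per-sample terms I(x) + C(x), so the best subset of
    # size `budget` is the top-`budget` samples by combined score.
    # `info` and `contrib` are illustrative arrays standing in for I() and C().
    combined = np.asarray(info) + np.asarray(contrib)
    return sorted(np.argsort(-combined)[:budget].tolist())

print(prescreen(info=[0.2, 0.9, 0.5], contrib=[0.1, 0.0, 0.6], budget=2))
```

This is exactly the operation the high-performance index accelerates: reading the highest-scoring samples without re-sorting the whole pool.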
In step 5), the active learning algorithm is specifically a least-confidence algorithm or a maximum-entropy algorithm.
The integrated active learning algorithm module applies an existing active learning algorithm to further sort out the most valuable samples within $D_{\text{selected}}$. In this module, the index module is used to select the top-scoring batch of samples as $D_{\text{selected}}$, and the existing active learning algorithm is then run on that batch. Because $D_{\text{selected}}$ contains far less data than $D_{\text{unlabeled}}$, the method reduces the screening cost and speeds up the training process.
For the samples that participate in training, the integrated active learning algorithm module calls the interface of the active learning evaluation module to update the scores of the unlabeled samples. In addition, the module adjusts the order relation of the samples in the high-performance index module through the active learning evaluation module.
2. Active learning training acceleration model based on active learning training acceleration method:
the model includes a sample feature extraction module for feature extraction of each of the untrained sample and the trained sample input sample to obtain a feature vector.
The model comprises an active learning evaluation module which performs active learning evaluation based on the feature vector output by the sample feature extraction module to obtain a sample score and calls the high-performance index module to perform sample order relation adjustment.
The model includes a high performance indexing module for saving and updating sample scores and ordering.
The model comprises an integrated active learning algorithm module which calls a high-performance index module to conduct pre-screening and then conduct final screening to obtain a sample to be trained.
The beneficial effects of the invention are as follows:
the invention combines active learning and high-efficiency data index structure to accelerate active learning training process, and provides a new active learning training strategy which can improve the efficiency and performance of model training. Meanwhile, the storage and index modes of the data are optimized, the complexity of data operation is reduced, maintenance expenditure is reduced, the overall efficiency of the training process is improved, and the training process can be accelerated based on an efficient database index structure and a lightweight active learning algorithm.
Drawings
FIG. 1 is a flow chart of the steps of the invention;
FIG. 2 is a high performance index module design of the present invention.
Detailed Description
The invention will be described in further detail with reference to the accompanying drawings and specific examples.
As shown in fig. 1, the active learning training acceleration method based on the database indexing technology includes:
step 1), an active learning training acceleration model is established, wherein the active learning training acceleration model comprises a sample feature extraction module, an active learning evaluation module, a high-performance index module and an integrated active learning algorithm module.
Step 2) inputting each untrained sample and trained sample into a sample feature extraction module, and outputting feature vectors of each untrained sample and trained sample by the sample feature extraction module.
In step 2), the sample feature extraction module is specifically a self-supervised DINO (DETR with Improved deNoising anchOr boxes) model. Before feature extraction, the training samples must undergo a uniform, conventional preprocessing operation.
And 3) inputting the feature vectors of each untrained sample and trained sample into an active learning evaluation module, outputting the sample scores of each untrained sample by the active learning evaluation module, and sequencing the sample scores of each untrained sample from high to low by calling a high-performance index module.
In step 3), the feature vectors of each untrained sample and trained sample are input into the active learning evaluation module. The module first uses the k-nearest-neighbors (KNN) algorithm to retrieve and select, for the feature vector of each trained sample, the feature vectors of its nearest-neighbor untrained samples; the number of times each untrained sample's feature vector is selected then becomes its sample score.
In the initialization stage of the active learning evaluation module, the initial sample score of the feature vector of each untrained sample is 0:

$\{S_{x'}\}_{\text{init}} = 0, \quad x' \in D_{\text{unlabeled}}$

Each subsequent training round iteratively updates the sample score of the feature vector of each untrained sample, as follows:

$S^{t}_{x'_i} = S^{t-1}_{x'_i} + 1, \quad \forall x_i \in D_{\text{labeled}},\ x'_i \in \mathrm{KNN}(x_i)$

where $S^{t}_{x'_i}$ and $S^{t-1}_{x'_i}$ denote the sample score of the untrained sample's feature vector after the current and the previous iteration, respectively; $x_i$ and $x'_i$ denote the feature vector of a trained sample and the feature vector of the nearest-neighbor untrained sample selected by the KNN algorithm; and $D_{\text{labeled}}$ denotes the set of feature vectors of the trained samples.
And 4) the high-performance index module stores the sample scores of the untrained samples, and simultaneously, the high-performance index module maintains the ordering order relation of the sample scores of the untrained samples.
In step 4), the high-performance index module comprises a semi-ordered index structure and a state information record table. The semi-ordered index structure consists of a number of data blocks that are kept in an ordered relation with one another; each data block stores a number of data items that are kept unordered within the block, and each data item comprises an untrained sample and its sample score. When a data item is inserted into the index, the data blocks are traversed in order from the index head until the first block whose storage range covers the item's score is found, and the item is then stored in a free slot of that block.
The state information record table records the state information of every data item in every data block, i.e. the update state of each untrained sample's feature vector; the update state is one of: the sample score is being updated, has not been updated, or has been updated.
The high-performance index module maintains the sample-score ordering on top of this update-optimized database index structure, so that the other modules can quickly retrieve the best samples.
In step 4), the high-performance index module maintains the sorting order relation of the untrained samples' sample scores using the state information of each data item in each data block. Specifically, when a data item's state information is in the being-updated state, there is no need to traverse the index to update the item's sample score; only the increment contributed by the current update is written to the state information record table. When a data item's state information is in the not-updated state, the state is changed to the being-updated state, the index is traversed to update the item's sample score, and the sample's position in the index is adjusted.
And 5) the integrated active learning algorithm module performs pre-screening on each untrained sample by calling the high-performance index module to obtain a plurality of boundary samples, then selects a plurality of samples to be trained from each boundary sample by using an active learning algorithm, and inputs each sample to be trained into a deep learning model to be trained for training.
In step 5), the integrated active learning algorithm module pre-screens the untrained samples by calling the high-performance index module to obtain a number of boundary samples. Specifically, the higher-scoring untrained samples are pre-screened as boundary samples subject to the maximization screening target:

$D_{\text{selected}} = \arg\max_{D \subseteq D_{\text{unlabeled}}} \sum_{x \in D} \big( I(x) + C(x) \big)$

where $D_{\text{selected}}$ denotes the pre-screened boundary sample set, which contains a number of boundary samples; $x$ denotes an untrained sample; and $I(\cdot)$ and $C(\cdot)$ denote the information-content function and the contribution function, respectively. $I(\cdot)$ is obtained from the confidence produced by model inference, and $C(\cdot)$ is computed from the correlation between the untrained sample and the trained samples.
Given a training data set $D_{\text{labeled}} = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is an input sample and $y_i$ its corresponding label, and a model $f(x; \theta)$, where $\theta$ are the parameters of the trained model and $x$ is the training data input to the model, the objective is to select from $D_{\text{unlabeled}}$ the sample set $D_{\text{selected}}$ that maximizes the total information content and contribution.
In step 5), the active learning algorithm is specifically a least-confidence algorithm or a maximum-entropy algorithm.
The integrated active learning algorithm module applies an existing active learning algorithm to further sort out the most valuable samples within $D_{\text{selected}}$. In this module, the index module is used to select the top-scoring batch of samples as $D_{\text{selected}}$, and the existing active learning algorithm is then run on that batch. Because $D_{\text{selected}}$ contains far less data than $D_{\text{unlabeled}}$, the method reduces the screening cost and speeds up the training process.
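The two standard final-screening criteria named above are straightforward to state over softmax outputs. The following is a hedged sketch with hypothetical probability values, not the patent's code: least-confidence takes the samples whose top class probability is lowest, maximum-entropy takes those whose predictive distribution is most uniform.

```python
import numpy as np

def least_confidence(probs, batch):
    # lowest top-class probability = least confident = most valuable
    return np.argsort(probs.max(axis=1))[:batch]

def max_entropy(probs, batch):
    # highest predictive entropy = most uncertain
    ent = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return np.argsort(-ent)[:batch]

# hypothetical softmax outputs for three samples in D_selected
p = np.array([[0.9, 0.1], [0.5, 0.5], [0.6, 0.4]])
print(sorted(least_confidence(p, 2).tolist()))
```

On these toy values both criteria agree that the 0.5/0.5 sample is the most uncertain; in general they can rank samples differently on multi-class outputs.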
For the samples that participate in training, the integrated active learning algorithm module calls the interface of the active learning evaluation module to update the scores of the unlabeled samples. In addition, the module adjusts the order relation of the samples in the high-performance index module through the active learning evaluation module.
Step 6) repeating steps 1) -5) until training is completed.
The active learning training acceleration model comprises: a sample feature extraction module for extracting features from each input untrained sample and trained sample to obtain feature vectors; an active learning evaluation module that performs active learning evaluation on the feature vectors output by the sample feature extraction module to obtain sample scores, and calls the high-performance index module to adjust the sample order relation; a high-performance index module for saving and updating the sample scores and their ordering; and an integrated active learning algorithm module that calls the high-performance index module for pre-screening and then performs final screening to obtain the samples to be trained.
As shown in fig. 2, the high-performance index module consists of a semi-ordered index structure and a state information record table. The former is a chained structure in which each data node is a block capable of holding multiple data items; the data items stored within each block are unordered, while the blocks themselves are ordered. The state information record table records the update state of each data item, and each thread in the high-performance index module accesses and updates the data nodes through it.
The untrained samples x and the trained sample set $D_{\text{train}}$ are characterized as feature vectors by a pre-trained sample representation model. The KNN neighbors of each trained sample are then computed from the feature vectors, and the sample score is computed from the frequency with which an untrained sample x appears as a neighbor. The process maintains each sample's score with the semi-ordered index structure; after the scores of all untrained samples are obtained, the highest-scoring samples x' are read from the index in order as boundary samples and combined into $D_{\text{selected}}$. Finally, within $D_{\text{selected}}$, an existing active learning algorithm selects the batch of samples with the greatest model benefit, which are then labeled and used in model training.
For the active learning scenario, the invention provides an active learning algorithm that can perceive the model's inter-class boundary on a specific dataset, compute the training value of each unlabeled sample from that boundary, select the batch of samples with the highest value, and apply the active learning algorithm to them, thereby intelligently selecting the optimal samples, ultimately reducing the number of samples required for training and accelerating the overall active learning training process.

Claims (8)

1.一种基于数据库索引技术的主动学习训练加速方法,其特征在于,包括:1. A method for accelerating active learning training based on database indexing technology, characterized in that it includes: 步骤1)建立主动学习训练加速模型,主动学习训练加速模型包括样本特征提取模块、主动学习评价模块、高性能索引模块和集成主动学习算法模块;Step 1) Establish an active learning training acceleration model, which includes a sample feature extraction module, an active learning evaluation module, a high-performance indexing module, and an integrated active learning algorithm module. 步骤2)将各个未训练样本和已训练样本输入样本特征提取模块中,样本特征提取模块输出各个未训练样本和已训练样本的特征向量;Step 2) Input each untrained sample and trained sample into the sample feature extraction module, and the sample feature extraction module outputs the feature vectors of each untrained sample and trained sample. 步骤3)将各个未训练样本和已训练样本的特征向量输入主动学习评价模块中,主动学习评价模块输出各个未训练样本的样本评分,通过调用高性能索引模块对各个未训练样本的样本评分从高到低进行排序;Step 3) Input the feature vectors of each untrained sample and trained sample into the active learning evaluation module. The active learning evaluation module outputs the sample score of each untrained sample. The high-performance indexing module is called to sort the sample scores of each untrained sample from high to low. 步骤4)高性能索引模块对各个未训练样本的样本评分进行保存,同时高性能索引模块进行各个未训练样本的样本评分的排序次序关系维护;Step 4) The high-performance indexing module saves the sample scores of each untrained sample, and at the same time maintains the sorting order of the sample scores of each untrained sample. 步骤5)集成主动学习算法模块通过调用高性能索引模块对各个未训练样本进行预筛选获得若干边界样本,然后集成主动学习算法模块使用主动学习算法在各个边界样本中挑选出若干待训练样本,将各个待训练样本输入待训练的深度学习模型中进行训练;Step 5) The integrated active learning algorithm module obtains several boundary samples by calling the high-performance indexing module to pre-screen each untrained sample. Then, the integrated active learning algorithm module uses the active learning algorithm to select several training samples from each boundary sample and inputs each training sample into the deep learning model to be trained for training. 
步骤6)重复步骤1)-5)直至训练完成。Step 6) Repeat steps 1)-5) until training is complete. 2.根据权利要求1所述的基于数据库索引技术的主动学习训练加速方法,其特征在于:所述的步骤2)中,样本特征提取模块具体为自监督DINO模型。2. The active learning training acceleration method based on database indexing technology according to claim 1, characterized in that: in step 2), the sample feature extraction module is specifically a self-supervised DINO model. 3.根据权利要求1所述的基于数据库索引技术的主动学习训练加速方法,其特征在于:所述的步骤3)中,将各个未训练样本和已训练样本的特征向量输入主动学习评价模块中,主动学习评价模块首先使用k最近邻KNN算法进行检索并挑选出每个已训练样本的特征向量的最近邻的未训练样本的特征向量,然后获得每个未训练样本的特征向量被选中的频次作为各自的样本评分;3. The active learning training acceleration method based on database indexing technology according to claim 1, characterized in that: in step 3), the feature vectors of each untrained sample and trained sample are input into the active learning evaluation module. The active learning evaluation module first uses the k nearest neighbor KNN algorithm to search and select the feature vectors of the nearest untrained samples of each trained sample's feature vector, and then obtains the frequency of the feature vector of each untrained sample being selected as its respective sample score. 在主动学习评价模块初始化阶段,每个未训练样本的特征向量的初始样本评分为0,随后每轮训练均对每个未训练样本的特征向量的样本评分进行迭代更新,具体如下:In the initialization phase of the active learning evaluation module, the initial sample score of the feature vector of each untrained sample is 0. 
Subsequently, in each training round, the sample score of each untrained sample's feature vector is iteratively updated, as follows:

s_t(x′_i) = s_{t−1}(x′_i) + 1, for each x_i ∈ D_labeled,

where s_t(x′_i) and s_{t−1}(x′_i) denote the sample score of the untrained sample's feature vector after the current and the previous iterative update, respectively; x_i and x′_i denote a trained sample's feature vector and the nearest-neighbor untrained sample's feature vector selected for it by the KNN algorithm, respectively; and D_labeled denotes the set of the trained samples' feature vectors.
4. The active learning training acceleration method based on database indexing technology according to claim 1, characterized in that in step 4) the high-performance indexing module comprises a semi-ordered index structure and a state-information record table; the semi-ordered index structure contains a number of data blocks that maintain an ordered relationship among one another, each data block stores a number of data items that are mutually unordered, and each data item comprises one untrained sample and its sample score.
The state-information record table records the state information of every data item in every data block, i.e., the update state of each untrained sample's feature vector; the update state is one of three states: the sample score of the feature vector is being updated, has not been updated, or has already been updated.
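The frequency-based scoring of claims 2-3 can be pictured with the following sketch. This is an illustrative sketch only, not the patented implementation: it assumes brute-force Euclidean nearest-neighbor search in NumPy, and the names (`update_scores`, `labeled`, `unlabeled`) are hypothetical.

```python
import numpy as np

def update_scores(labeled, unlabeled, scores, k=1):
    """One round of the claim-3 update: every trained (labeled)
    feature vector votes for its k nearest untrained (unlabeled)
    feature vectors; an untrained sample's score is the number of
    times it has been selected."""
    for x in labeled:
        # Euclidean distance from this labeled vector to all unlabeled ones.
        d = np.linalg.norm(unlabeled - x, axis=1)
        # Increment the selection frequency of the k nearest neighbours.
        for j in np.argsort(d)[:k]:
            scores[j] += 1
    return scores
```

A brute-force scan is quadratic; a real system would delegate the neighbor search to an ANN index, which is exactly where the patent's high-performance indexing module comes in.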
5. The active learning training acceleration method based on database indexing technology according to claim 4, characterized in that in step 4) the high-performance indexing module maintains the sorting order relationship among the untrained samples' scores according to the state information of each data item in each data block. Specifically, when a data item's state is "being updated", there is no need to traverse the index to update the item's sample score; only the frequency increment produced by the current update is recorded in the state-information record table. When a data item's state is "not yet updated", the item's state is switched to "being updated" and the index is traversed to update the item's sample score.
6.
The active learning training acceleration method based on database indexing technology according to claim 1, characterized in that in step 5) the integrated active learning algorithm module calls the high-performance indexing module to pre-screen the untrained samples and obtain a number of boundary samples; specifically, subject to maximizing the screening objective, a number of untrained samples with higher sample scores are pre-screened as boundary samples. The screening objective to be maximized is as follows:

max over D_selected of Σ_{x ∈ D_selected} [ I(x) + C(x) ],

where D_selected denotes the pre-screened set of boundary samples, which contains a number of boundary samples; x denotes an untrained sample; and I(·) and C(·) denote the information-content function and the contribution function, respectively.
7. The active learning training acceleration method based on database indexing technology according to claim 1, characterized in that in step 5) the active learning algorithm is specifically a minimum-confidence algorithm, a maximum-entropy algorithm, or the like.
8. An active learning training acceleration model based on the active learning training acceleration method of any one of claims 1-7, characterized in that it comprises:
a sample feature extraction module for extracting features from each untrained and trained input sample to obtain feature vectors;
an active learning evaluation module that performs active learning evaluation on the feature vectors output by the sample feature extraction module to obtain sample scores and calls the high-performance indexing module to adjust the sample order relationship;
a high-performance indexing module for saving, updating, and sorting the sample scores; and
an integrated active learning algorithm module that calls the high-performance indexing module for pre-screening and then performs the final screening to obtain the to-be-trained samples.
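The semi-ordered index of claims 4-5 and the two-stage selection of claims 6-7 can be sketched together as follows. All names (`SemiOrderedIndex`, `bump`, `flush`, `least_confidence_pick`) are hypothetical illustrations, not the patent's implementation; block splitting and concurrency control, which a production index would need, are omitted.

```python
import numpy as np

class SemiOrderedIndex:
    """Claim 4: data blocks are ordered by score range; the data
    items inside a block are unordered."""
    def __init__(self, bounds):
        self.bounds = bounds                    # ascending score cut points
        self.blocks = [dict() for _ in bounds]  # each block: sample_id -> score
        self.state = {}                         # state table: sample_id -> buffered delta

    def _block_of(self, score):
        # First block whose upper bound exceeds the score.
        for i, b in enumerate(self.bounds):
            if score < b:
                return i
        return len(self.bounds) - 1

    def insert(self, sid, score=0):
        self.blocks[self._block_of(score)][sid] = score

    def bump(self, sid, delta=1):
        # Claim 5's lazy path: while an item counts as "being updated",
        # only buffer the frequency increment in the state table
        # instead of traversing the index on every update.
        self.state[sid] = self.state.get(sid, 0) + delta

    def flush(self):
        # Apply buffered increments, moving items between blocks so
        # the block-level ordering is restored.
        for sid, delta in self.state.items():
            for blk in self.blocks:
                if sid in blk:
                    new = blk.pop(sid) + delta
                    self.blocks[self._block_of(new)][sid] = new
                    break
        self.state.clear()

    def top(self, n):
        # Claim 6's pre-screen: the highest-scored samples live in the
        # high-end blocks, so only items within a block need sorting.
        out = []
        for blk in reversed(self.blocks):
            out += sorted(blk.items(), key=lambda kv: -kv[1])
            if len(out) >= n:
                break
        return out[:n]

def least_confidence_pick(boundary_ids, probs, n):
    # Claim 7's minimum-confidence strategy: among the pre-screened
    # boundary samples, pick those whose top predicted class
    # probability is lowest.
    conf = probs.max(axis=1)
    return sorted(boundary_ids, key=lambda sid: conf[sid])[:n]
```

Because only block order is maintained, an update usually touches one block instead of re-sorting all samples, which is the source of the claimed acceleration.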
CN202311052101.9A 2023-08-21 2023-08-21 A method and system for accelerating active learning training based on database indexing technology Active CN117272006B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311052101.9A CN117272006B (en) 2023-08-21 2023-08-21 A method and system for accelerating active learning training based on database indexing technology

Publications (2)

Publication Number Publication Date
CN117272006A true CN117272006A (en) 2023-12-22
CN117272006B CN117272006B (en) 2026-01-06

Family

ID=89199847

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311052101.9A Active CN117272006B (en) 2023-08-21 2023-08-21 A method and system for accelerating active learning training based on database indexing technology

Country Status (1)

Country Link
CN (1) CN117272006B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110737812A (en) * 2019-09-20 2020-01-31 浙江大学 A search engine user satisfaction evaluation method integrating semi-supervised learning and active learning
CN111310799A (en) * 2020-01-20 2020-06-19 中国人民大学 Active learning algorithm based on historical evaluation result
US20200250527A1 (en) * 2019-02-04 2020-08-06 Google Llc Systems and Methods for Active Learning
CN112069310A (en) * 2020-06-18 2020-12-11 中国科学院计算技术研究所 Text classification method and system based on active learning strategy
US20220300518A1 (en) * 2021-03-19 2022-09-22 International Business Machines Corporation Indexing based on feature importance
CN115130570A (en) * 2022-06-23 2022-09-30 腾讯科技(深圳)有限公司 Training method, application method, device and equipment of pulsar search model
US20220366282A1 (en) * 2021-05-06 2022-11-17 Thomson Reuters Enterprise Centre Gmbh Systems and Methods for Active Curriculum Learning
CN115661506A (en) * 2022-09-14 2023-01-31 天津大学 An active learning image classification method and a visual interactive system
CN116453617A (en) * 2023-03-30 2023-07-18 湖南大学 Multi-target optimization molecule generation method and system combining active learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YANG Wenzhu; TIAN Xiaoxiao; WANG Sile; ZHANG Xizhong: "Research Progress on Active Learning Algorithms", Journal of Hebei University (Natural Science Edition), no. 02, 25 March 2017 (2017-03-25) *
XIE Huosheng; LIU Min: "An Ensemble Co-Training Algorithm Based on Active Learning", Journal of Shandong University (Engineering Science), no. 03, 20 June 2012 (2012-06-20) *
DENG Siyu: "A PageRank-Based Active Learning Algorithm", CAAI Transactions on Intelligent Systems, vol. 14, no. 3, 31 May 2019 (2019-05-31), page 551 *

Also Published As

Publication number Publication date
CN117272006B (en) 2026-01-06

Similar Documents

Publication Publication Date Title
CN111738301B (en) A Long Tail Distribution Image Data Recognition Method Based on Two-Channel Learning
Zhong et al. Blockqnn: Efficient block-wise neural network architecture generation
Zhong et al. Practical block-wise neural network architecture generation
CN106980641B (en) Unsupervised Hash quick picture retrieval system and unsupervised Hash quick picture retrieval method based on convolutional neural network
CN114299362B (en) A small sample image classification method based on k-means clustering
CN113378913B (en) A semi-supervised node classification method based on self-supervised learning
CN112700060A (en) Station terminal load prediction method and prediction device
CN116503676B (en) An image classification method and system based on knowledge distillation and small sample incremental learning
Tan et al. Clonal particle swarm optimization and its applications
CN116452862B (en) Image classification method based on domain generalization learning
Fujino et al. Deep convolutional networks for human sketches by means of the evolutionary deep learning
CN110909158B (en) Text classification method based on improved firefly algorithm and K nearest neighbor
CN114676275A (en) Deep Supervised Hash Retrieval Method and System Based on Greedy Asymmetric Loss
WO2022169625A1 (en) Improved fine-tuning strategy for few shot learning
CN106022293B (en) A Pedestrian Re-identification Method Based on Adaptive Shared Niche Evolutionary Algorithm
CN117272006B (en) A method and system for accelerating active learning training based on database indexing technology
CN116630718A (en) A Prototype-Based Low Perturbation Image-like Incremental Learning Algorithm
CN119830957A (en) Quick searching method for optimal embedded position of large model
Hu et al. Automatic channel pruning by neural network based on improved poplar optimisation algorithm
CN118038277B (en) A robot scene recognition method based on lifelong learning
Ferdinand et al. Feature expansion and enhanced compression for class incremental learning
CN118132679A (en) Vector retrieval method based on residual quantization
CN113313255B (en) An unsupervised domain adaptation method based on neural network architecture search
Huisman et al. A preliminary study on the feature representations of transfer learning and gradient-based meta-learning techniques
Shi et al. Selecting useful knowledge from previous tasks for future learning in a single network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant