
CN119622341A - Data processing method, computer device and readable storage medium - Google Patents


Info

Publication number
CN119622341A
CN119622341A
Authority
CN
China
Prior art keywords
image
local
description
text
data
Prior art date
Legal status
Pending
Application number
CN202411777279.4A
Other languages
Chinese (zh)
Inventor
张睿
Current Assignee
China Telecom Cloud Technology Co Ltd
Original Assignee
China Telecom Cloud Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by China Telecom Cloud Technology Co Ltd filed Critical China Telecom Cloud Technology Co Ltd
Priority to CN202411777279.4A
Publication of CN119622341A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 - Details of database functions independent of the retrieved data types
    • G06F 16/903 - Querying
    • G06F 16/9032 - Query formulation
    • G06F 16/90332 - Natural language query formulation or dialogue systems
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 - Details of database functions independent of the retrieved data types
    • G06F 16/903 - Querying
    • G06F 16/90335 - Query processing
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/10 - Text processing
    • G06F 40/12 - Use of codes for handling textual entities
    • G06F 40/126 - Character encoding
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 - Computing arrangements using knowledge-based models
    • G06N 5/04 - Inference or reasoning models
    • G06N 5/041 - Abduction
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/70 - Labelling scene content, e.g. deriving syntactic or semantic representations
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 - Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract


The present application relates to a data processing method, apparatus, computer device, computer-readable storage medium, and computer program product. The method includes: obtaining an image-text knowledge dataset comprising multiple groups of data, each group including an image, a knowledge text corresponding to the image, and a query question corresponding to the knowledge text; for each group of data, obtaining a global description of the target image contained in the current data, performing target detection on the target image based on the keywords contained in the global description to obtain target detection frames, and obtaining a local image set based on the target detection frames; obtaining a local description of each local image in the local image set to obtain a local description set, the global description, the local image set, the local description set, and the current data constituting an image-text matching pair; constructing positive and negative sample pairs based on the image-text matching pair corresponding to each group of data; and training a multimodal model based on the positive and negative sample pairs. This improves the accuracy of the model's output.

Description

Data processing method, computer device, and readable storage medium
Technical Field
The present application relates to the field of machine learning technology, and in particular, to a data processing method, apparatus, computer device, computer readable storage medium, and computer program product.
Background
In recent years, with the development of deep learning technology, language models based on deep learning have become mainstream. Among them, multimodal models allow users to ask questions that combine images and text. However, the reply accuracy of current multimodal model output is not high.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a data processing method, apparatus, computer device, computer readable storage medium, and computer program product capable of improving reply accuracy.
In a first aspect, the present application provides a data processing method, including:
acquiring an image-text knowledge dataset, wherein the image-text knowledge dataset comprises a plurality of groups of data, and each group of data comprises an image, a knowledge text corresponding to the image, and a query question corresponding to the knowledge text;
for each group of data, acquiring a global description of a target image contained in the current data, performing target detection on the target image based on keywords contained in the global description to obtain a target detection frame, and acquiring a local image set based on the target detection frame;
acquiring a local description of each local image in the local image set to obtain a local description set, the global description, the local image set, the local description set, and the current data constituting an image-text matching pair;
and constructing positive and negative sample pairs based on the image-text matching pair corresponding to each group of data, wherein the positive and negative sample pairs are used for training the multimodal model.
In one embodiment, acquiring the local image set based on the target detection frame includes:
cropping the target image based on the target detection frame to obtain a target local image;
acquiring the transverse split count and the longitudinal split count from a preset image segmentation rule, and splitting the target image into at least one regional local image based on the transverse split count and the longitudinal split count;
constructing the local image set from the target local image and the regional local images.
In one embodiment, constructing positive and negative sample pairs based on the image-text matching pairs corresponding to each group of data includes:
for the image-text matching pair corresponding to each group of data, obtaining the target knowledge text and the target local description set contained in the current image-text matching pair, encoding the target knowledge text to obtain a knowledge text vector, and encoding each local description in the target local description set to obtain a local text vector of each local description;
splitting the current image-text matching pair into positive and negative sample groups based on the knowledge text vector and the local text vector of each local description;
and constructing positive and negative sample pairs based on the positive and negative sample groups obtained by splitting the image-text matching pairs corresponding to each group of data.
In one embodiment, splitting the current image-text matching pair into positive and negative sample groups based on the knowledge text vector and the local text vector of each local description includes:
for each local description, calculating the cosine distance between the knowledge text vector and the local text vector of the current local description; taking the current local description as a local description positive sample if the cosine distance is greater than or equal to a preset cosine distance threshold, and as a local description negative sample if it is less than the threshold, all local description positive samples forming a local description positive sample set and all local description negative samples forming a local description negative sample set;
constructing a positive sample group based on the local description positive sample set, the local image corresponding to each local description in that set, the corresponding global description, and the corresponding data;
and constructing a negative sample group based on the local description negative sample set and the local image corresponding to each local description in that set.
In one embodiment, constructing the positive and negative sample pairs based on the positive and negative sample groups obtained by splitting the image-text matching pairs corresponding to each group of data includes:
for the image-text matching pair corresponding to each group of data, acquiring the target positive and negative sample group obtained by splitting the current image-text matching pair, and taking the positive sample group therein as a positive sample pair;
and randomly selecting target data from all data in the image-text knowledge dataset except the data corresponding to the current image-text matching pair, and constructing a negative sample pair based on the negative sample group in the target positive and negative sample group, the global description of the image contained in the target data, and the target data.
In one embodiment, training the multimodal model based on the positive and negative sample pairs includes:
performing contrastive learning on the multimodal model to be trained based on the positive and negative sample pairs to obtain a basic multimodal model;
generating an instruction fine-tuning dataset based on the image-text matching pair corresponding to each group of data and a preset instruction template;
and performing instruction fine-tuning training on the basic multimodal model based on the instruction fine-tuning dataset to obtain a trained multimodal model.
In one embodiment, the method further comprises:
acquiring data to be warehoused, wherein the data to be warehoused comprises a warehousing image, a warehousing knowledge text corresponding to the warehousing image, and a warehousing query question corresponding to the warehousing knowledge text;
acquiring a global description of the warehousing image as the warehousing global description, performing target detection on the warehousing image based on keywords contained in the warehousing global description, obtaining a warehousing local image set based on the detection result, and acquiring a local description of each local image in the warehousing local image set to obtain a warehousing local description set;
encoding the warehousing global description to obtain a warehousing global description vector, encoding the warehousing knowledge text to obtain a warehousing knowledge text vector, and encoding each local description in the warehousing local description set to obtain warehousing local text vectors;
calculating the cosine distance between the warehousing knowledge text vector and each warehousing local text vector, and sorting the local descriptions in the warehousing local description set in ascending order of cosine distance to obtain a preset number of target local descriptions;
extracting a local image embedding vector of the local image corresponding to each target local description, and extracting a global image embedding vector of the warehousing image;
and constructing a knowledge group based on the warehousing global description vector, the target local descriptions, the local image embedding vectors, the global image embedding vector, and the warehousing knowledge text vector, and storing the knowledge group in a knowledge base.
In one embodiment, the method further comprises:
receiving a retrieval request, wherein the retrieval request comprises a query image and a query question;
acquiring a global description of the query image;
searching a knowledge base, based on the global description of the query image, for at least one target knowledge group whose degree of similarity meets a preset condition;
randomly selecting an image local description instruction from an image local description preset instruction set, and generating a plurality of local descriptions based on the image local description instruction, the global description, the query question, and the knowledge text in the target knowledge group;
expanding the query question based on the plurality of local descriptions to obtain an expanded question;
and retrieving within the at least one target knowledge group based on the query question and the expanded question to obtain a final knowledge group, and generating a retrieval reply based on the final knowledge group.
In a second aspect, the present application also provides a data processing apparatus, including:
an acquisition module, configured to acquire an image-text knowledge dataset, wherein the image-text knowledge dataset comprises a plurality of groups of data, and each group of data comprises an image, a knowledge text corresponding to the image, and a query question corresponding to the knowledge text;
a construction module, configured to, for each group of data, acquire a global description of the target image contained in the current data, perform target detection on the target image based on keywords contained in the global description to obtain a target detection frame, acquire a local image set based on the target detection frame, and acquire a local description of each local image to obtain a local description set, the global description, the local image set, the local description set, and the current data constituting an image-text matching pair;
a training module, configured to construct positive and negative sample pairs based on the image-text matching pair corresponding to each group of data, and train the multimodal model based on the positive and negative sample pairs.
In a third aspect, the present application also provides a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring an image-text knowledge dataset, wherein the image-text knowledge dataset comprises a plurality of groups of data, and each group of data comprises an image, a knowledge text corresponding to the image, and a query question corresponding to the knowledge text;
for each group of data, acquiring a global description of a target image contained in the current data, performing target detection on the target image based on keywords contained in the global description to obtain a target detection frame, and acquiring a local image set based on the target detection frame;
acquiring a local description of each local image in the local image set to obtain a local description set, the global description, the local image set, the local description set, and the current data constituting an image-text matching pair;
and constructing positive and negative sample pairs based on the image-text matching pair corresponding to each group of data, and training the multimodal model based on the positive and negative sample pairs.
In a fourth aspect, the present application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
acquiring an image-text knowledge dataset, wherein the image-text knowledge dataset comprises a plurality of groups of data, and each group of data comprises an image, a knowledge text corresponding to the image, and a query question corresponding to the knowledge text;
for each group of data, acquiring a global description of a target image contained in the current data, performing target detection on the target image based on keywords contained in the global description to obtain a target detection frame, and acquiring a local image set based on the target detection frame;
acquiring a local description of each local image in the local image set to obtain a local description set, the global description, the local image set, the local description set, and the current data constituting an image-text matching pair;
and constructing positive and negative sample pairs based on the image-text matching pair corresponding to each group of data, and training the multimodal model based on the positive and negative sample pairs.
In a fifth aspect, the application also provides a computer program product comprising a computer program which, when executed by a processor, performs the steps of:
acquiring an image-text knowledge dataset, wherein the image-text knowledge dataset comprises a plurality of groups of data, and each group of data comprises an image, a knowledge text corresponding to the image, and a query question corresponding to the knowledge text;
for each group of data, acquiring a global description of a target image contained in the current data, performing target detection on the target image based on keywords contained in the global description to obtain a target detection frame, and acquiring a local image set based on the target detection frame;
acquiring a local description of each local image in the local image set to obtain a local description set, the global description, the local image set, the local description set, and the current data constituting an image-text matching pair;
and constructing positive and negative sample pairs based on the image-text matching pair corresponding to each group of data, and training the multimodal model based on the positive and negative sample pairs.
The data processing method, apparatus, computer device, computer-readable storage medium, and computer program product acquire an image-text knowledge dataset comprising a plurality of groups of data, each group comprising an image, a knowledge text corresponding to the image, and a query question corresponding to the knowledge text. For each group of data, a global description of the target image contained in the current data is acquired; target detection is performed on the target image based on keywords contained in the global description to obtain a target detection frame; a local image set is acquired based on the target detection frame; and a local description of each local image in the local image set is acquired to obtain a local description set. The global description, the local image set, the local description set, and the current data form an image-text matching pair. Positive and negative sample pairs are constructed based on the image-text matching pair corresponding to each group of data and are used for training a multimodal model. Image detail information is extracted through fine-grained description, and the original image-text knowledge is data-enhanced from multiple angles, from global to local, so that a multimodal model trained on the data-enhanced image-text knowledge dataset can output more accurate replies.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the related art, the drawings that are needed in the description of the embodiments of the present application or the related technologies will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other related drawings may be obtained according to these drawings without inventive effort to those of ordinary skill in the art.
FIG. 1 is a flow diagram of a data processing method in one embodiment;
FIG. 2 is an exemplary diagram of data in one embodiment;
FIG. 3 is a diagram of an example of generation of a global description in one embodiment;
FIG. 4 is an exemplary diagram of keyword extraction in one embodiment;
FIG. 5 is an exemplary diagram of a region local image in one embodiment;
FIG. 6 is a flow chart of a data processing method according to another embodiment;
FIG. 7 is a flow chart of data warehousing in one embodiment;
FIG. 8 is a flow diagram of a search in one embodiment;
fig. 9 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In one embodiment, as shown in fig. 1, a data processing method is provided. The method is illustrated here as applied to a terminal; it can be understood that it may also be applied to a server, or to a system comprising a terminal and a server and implemented through interaction between the two. In this embodiment, the method includes the following steps:
Step 102: obtain an image-text knowledge dataset, wherein the image-text knowledge dataset comprises a plurality of groups of data, and each group of data comprises an image, a knowledge text corresponding to the image, and a query question corresponding to the knowledge text.
Optionally, an image-text knowledge dataset of a professional field may be collected. The dataset comprises a plurality of groups of data, each group comprising at least one image of the professional field, a knowledge text corresponding to the image, and a query question corresponding to the knowledge text.
To build a massive image-text knowledge dataset, the data is typically acquired from multiple sources, so an image may contain multiple elements unrelated to the text content. Fig. 2 illustrates an image and a knowledge text from one group of data: the knowledge text mainly explains the meaning of a warning light in the image, but the image also contains other information such as average fuel consumption and the speedometer; the warning light is indicated by an arrow in fig. 2.
Step 104: for each group of data, acquire a global description of the target image contained in the current data; perform target detection on the target image based on keywords contained in the global description to obtain a target detection frame; acquire a local image set based on the target detection frame; acquire a local description of each local image in the local image set to obtain a local description set. The global description, the local image set, the local description set, and the current data form an image-text matching pair.
Optionally, a native visual-text multimodal model M to be fine-tuned, such as BLIPv2 or GLM-4V-9B, may be prepared in advance. For brevity, the visual-text multimodal model mentioned in the embodiments of the present application is simply referred to as a multimodal model. For each group of data, the global description of the target image contained in the current data can be obtained through the multimodal model M.
Specifically, for each group of data, the target image contained in the current data can be processed with the multimodal model M to generate a global description T_g. A hint word P_g such as "describe this figure" can be used, and the generation of T_g can be expressed as:
T_g = M(I, P_g)
For example, referring to fig. 3, the image shown there is processed by the multimodal model M with the hint word "describe the image", producing the global description: "The image shows a vehicle with its engine hood open. In the middle of the image is a silver engine with TURBO written on it. In the upper-right area of the picture is a red container holding a liquid, possibly coolant or brake fluid. There are also other automotive parts and pipes, but their specific functions are not clear."
For each group of data, after the global description of the target image contained in the current data is obtained, keywords can be extracted from the global description.
Optionally, the natural language submodel M_L of the multimodal model M may be used to extract the keyword set K from the global description T_g. The hint word P_k may be "extract the potential keywords from the content in the <quote></quote> tag", where the content in the <quote></quote> tag is the global description T_g. The process can be expressed as follows, with an example shown in fig. 4:
K = M_L(T_g, P_k)
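A minimal sketch of these two steps, assuming generic callables standing in for the multimodal model M and its language submodel M_L (no specific model API is implied):

```python
from typing import Callable, List

def global_description(vlm: Callable[[bytes, str], str], image: bytes) -> str:
    """T_g = M(I, P_g): describe the whole image."""
    prompt_g = "Describe this figure"  # hint word P_g
    return vlm(image, prompt_g)

def extract_keywords(llm: Callable[[str], str], t_g: str) -> List[str]:
    """K = M_L(T_g, P_k): pull candidate keywords out of the description."""
    prompt_k = (
        "Extract the potential keywords from the content in the "
        f"<quote>{t_g}</quote> tag. Return one keyword per line."
    )
    reply = llm(prompt_k)
    return [line.strip() for line in reply.splitlines() if line.strip()]
```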
Optionally, target detection is performed on the target image with each keyword in the keyword set K by means of the Grounded-SAM model M_sam, obtaining target detection frames; the keywords correspond one-to-one with the target detection frames, yielding the target detection frame set B.
Optionally, after the target detection frame set B is obtained, the target image is cropped using the frames in B to obtain target local images, with the target detection frames corresponding one-to-one to the target local images; the set formed by the target local images can be used as a local image set I_loc.
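The detection-and-crop step might look like the following sketch; `detect` is a stand-in for an open-vocabulary detector such as Grounded-SAM, and its phrase-in, boxes-out signature is an assumption for illustration:

```python
from typing import Callable, List, Tuple
from PIL import Image

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)

def crop_detected_targets(
    image: Image.Image,
    keywords: List[str],
    detect: Callable[[Image.Image, str], List[Box]],  # assumed detector interface
) -> List[Image.Image]:
    """One crop per detection frame; each keyword maps onto the frames it fires."""
    crops = []
    for kw in keywords:
        for box in detect(image, kw):
            crops.append(image.crop(box))
    return crops
```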
Optionally, the multimodal model M may be used to process each local image in the local image set I_loc to generate a corresponding local description, resulting in the local description set T_loc. The above describes the processing of one group of data in the image-text knowledge dataset D; all groups can be processed in the same way to obtain the image-text matching pair corresponding to each group of data.
Through the above data enhancement, a new image-text knowledge dataset D* can be obtained. D* consists of a plurality of image-text matching pairs; the i-th matching pair comprises the image I_i contained in the i-th group of data, the local image set I_i^loc, the query question Q_i contained in the i-th group of data, the knowledge text T_i contained in the i-th group of data, the global description T_i^g, and the local description set T_i^loc.
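One possible container for such a matching pair, with field names chosen to mirror the symbols above (an illustrative layout, not a structure prescribed by the patent):

```python
from dataclasses import dataclass, field
from typing import List
from PIL import Image

@dataclass
class MatchingPair:
    image: Image.Image                # I_i
    local_images: List[Image.Image]   # I_i^loc
    query: str                        # Q_i
    knowledge_text: str               # T_i
    global_desc: str                  # T_i^g
    local_descs: List[str] = field(default_factory=list)  # T_i^loc
```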
Step 106: construct positive and negative sample pairs based on the image-text matching pair corresponding to each group of data, the positive and negative sample pairs being used for training a multimodal retrieval-augmented generation model.
Optionally, each image-text matching pair in the image-text knowledge dataset D* can be split into positive and negative sample groups; positive and negative sample pairs are constructed based on the groups corresponding to each matching pair, and contrastive learning can then be performed on the multimodal model to be trained based on these pairs to obtain a trained multimodal model for actual inference.
Optionally, the multimodal model may be a multimodal retrieval-augmented generation (RAG) model, and the method provided by the embodiments of the present application can improve the generation performance of a multimodal RAG model. A multimodal RAG model is an artificial intelligence model that combines retrieval and generation capabilities and is primarily used to process and generate data involving multiple modalities (e.g., text, images, sound). Some typical usage scenarios are: 1) image-to-text generation, which generates descriptive text from an input image, such as automatically generating a title, description, or story for the image; 2) visual question answering, in which the model processes a user's questions about image content and generates accurate answers; 3) semantic retrieval of images and text, in which the model retrieves images semantically related to a text query from an image database, or retrieves related text information from an image query. The positive and negative sample pairs can be used for contrastive learning of the multimodal model to be trained in combination with its usage scenario.
In this embodiment, an image-text knowledge dataset is acquired, comprising a plurality of groups of data, each group comprising an image, a knowledge text corresponding to the image, and a query question corresponding to the knowledge text. For each group of data, a global description of the target image contained in the current data is acquired; target detection is performed on the target image based on keywords contained in the global description to obtain a target detection frame; a local image set is acquired based on the target detection frame; and a local description of each local image is acquired to obtain a local description set. The global description, the local image set, the local description set, and the current data form an image-text matching pair, and positive and negative sample pairs constructed from these matching pairs are used for training a multimodal model. Image detail information is extracted through fine-grained description, and the original image-text knowledge is data-enhanced from multiple angles, from global to local, so that a multimodal model trained on the enhanced dataset can output more accurate replies.
In some embodiments, acquiring the local image set based on the target detection frame comprises: cropping the target image based on the target detection frame to obtain a target local image; acquiring the transverse split count and the longitudinal split count from a preset image segmentation rule; splitting the target image into at least one regional local image based on these counts; and constructing the local image set from the target local image and the regional local images.
Optionally, after the target detection frame set B is obtained, the target image is cropped using the frames in B to obtain target local images, the frames corresponding one-to-one to the target local images; the set formed by the target local images can be used as the target local image set. Because some potential targets may not be detected during the above target detection process, embodiments of the present application supplement them as follows.
Optionally, the preset image segmentation rule may be an h_g × w_g grid: the image is split evenly into h_g parts along the y-axis (h_g is the longitudinal split count) and into w_g parts along the x-axis (w_g is the transverse split count). Based on these counts, the target image can be split into h_g × w_g regional local images, which form the regional local image set. As shown in fig. 5, h_g is 3 and w_g is 3, so that figure yields 9 regional local images.
Optionally, after the target local image set and the regional local image set are obtained, the local image set can be constructed as the union of the two.
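A minimal grid-splitting sketch for the h_g × w_g rule (3 × 3 as in fig. 5):

```python
from typing import List
from PIL import Image

def grid_split(image: Image.Image, h_g: int = 3, w_g: int = 3) -> List[Image.Image]:
    """Split the image into h_g rows (y-axis) by w_g columns (x-axis)."""
    width, height = image.size
    tiles = []
    for row in range(h_g):
        for col in range(w_g):
            left = col * width // w_g
            top = row * height // h_g
            right = (col + 1) * width // w_g
            bottom = (row + 1) * height // h_g
            tiles.append(image.crop((left, top, right, bottom)))
    return tiles
```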
In this embodiment, the target image is cropped based on the target detection frame to obtain the target local image; the transverse and longitudinal split counts are acquired from a preset image segmentation rule; the target image is split into at least one regional local image based on these counts; and the local image set is constructed from the target local image and the regional local images. Targets possibly missed during target detection are thus supplemented, improving the data enhancement effect.
In some embodiments, constructing positive and negative sample pairs based on the image-text matching pair corresponding to each group of data comprises: acquiring the target knowledge text and the target local description set contained in the current image-text matching pair; encoding the target knowledge text to obtain a knowledge text vector; encoding each local description in the target local description set to obtain a local text vector for each local description; splitting the current image-text matching pair into positive and negative sample groups based on the knowledge text vector and the local text vectors; and constructing positive and negative sample pairs based on the positive and negative sample groups obtained by splitting the matching pair corresponding to each group of data.
The local description set may contain irrelevant descriptions; for example, local descriptions corresponding to regional local images are quite likely to be independent of the knowledge text. Therefore, for the image-text matching pair corresponding to each group of data, the target knowledge text and the target local description set contained in the current matching pair can be obtained; the text embedding model M_te of the multimodal model M encodes the target knowledge text into a knowledge text vector V_i^T and encodes each local description in the target local description set into a local text vector. Positive and negative sample pairs are then constructed based on the positive and negative sample groups obtained by splitting the matching pair corresponding to each group of data.
The above embodiment provides an implementation for constructing the positive and negative sample pairs, on which contrastive learning of the multimodal model to be trained can be performed. Because the positive and negative sample pairs are constructed from the data-enhanced image-text knowledge dataset, the trained multimodal model can output more accurate replies.
In some embodiments, splitting the current image-text matching pair into positive and negative sample groups based on the knowledge text vector and the local text vector of each local description comprises: for each local description, calculating the cosine distance between the knowledge text vector and the local text vector of the current local description; taking the current local description as a local description positive sample if the cosine distance is greater than or equal to a preset cosine distance threshold, and as a local description negative sample if it is smaller, all local description positive samples forming a local description positive sample set and all local description negative samples forming a local description negative sample set; constructing a positive sample group based on the local description positive sample set, the local image corresponding to each local description in that set, the corresponding global description, and the corresponding data; and constructing a negative sample group based on the local description negative sample set and the local image corresponding to each local description in that set.
Optionally, after the text embedding model M_te of the multimodal model M encodes the target knowledge text into the knowledge text vector V_i^T and each local description in the target local description set into a local text vector, the cosine distance between V_i^T and each local text vector is calculated. A local description whose cosine distance meets the preset cosine distance threshold is classified as a local description positive sample, and all such samples form the local description positive sample set T_i^loc_p; otherwise it is classified as a local description negative sample, and all such samples form the local description negative sample set T_i^loc_n. The local image positive sample set corresponding to T_i^loc_p and the local image negative sample set corresponding to T_i^loc_n can then be obtained. Thus, positive and negative sample groups are formed within the i-th image-text matching pair. With this allocation, the dataset D* is further refined into the image-text knowledge dataset D**.
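A sketch of the split; the patent's "cosine distance" threshold test is read here as a cosine-similarity threshold (similar local descriptions become positives), and that reading, like the threshold value, is an assumption:

```python
import numpy as np

def split_local_descs(v_t: np.ndarray, local_vecs: list,
                      threshold: float = 0.6) -> tuple:
    """Return index lists (positives, negatives) against knowledge vector V_i^T."""
    pos, neg = [], []
    for j, v in enumerate(local_vecs):
        # cosine similarity between knowledge text vector and local text vector
        sim = float(np.dot(v_t, v) / (np.linalg.norm(v_t) * np.linalg.norm(v)))
        (pos if sim >= threshold else neg).append(j)
    return pos, neg
```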
The above embodiment provides an implementation for splitting an image-text matching pair into positive and negative sample groups, which is the basis for constructing the positive and negative sample pairs on which the multimodal model to be trained undergoes contrastive learning. Because the positive and negative sample pairs are constructed from the data-enhanced image-text knowledge dataset, the trained multimodal model can output more accurate replies.
In some embodiments, constructing positive and negative sample pairs based on the positive and negative sample groups obtained by splitting the image-text matching pair corresponding to each group of data comprises: for each group of data, acquiring the target positive and negative sample group obtained by splitting the current image-text matching pair and taking the positive sample group therein as a positive sample pair; and randomly selecting target data from all data in the image-text knowledge dataset except the data corresponding to the current matching pair, and constructing a negative sample pair based on the negative sample group in the target positive and negative sample group, the global description of the image contained in the target data, and the target data.
Optionally, for the image-text knowledge dataset D**, the i-th image-text matching pair is split to obtain a positive sample group, in which the images and texts form positive sample pairs with one another, and a negative sample group, whose members, together with randomly selected other data and the global descriptions of the images contained in that data, form negative sample pairs with one another.
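A sketch of the pairing logic under assumed dictionary keys (pos_images, pos_texts, neg_images, and global_desc are illustrative names, not fields defined by the patent):

```python
import random

def build_pairs(groups: list) -> tuple:
    pos_pairs, neg_pairs = [], []
    for i, g in enumerate(groups):
        # every (image, text) combination inside the positive group is a positive pair
        for img in g["pos_images"]:
            for txt in g["pos_texts"]:
                pos_pairs.append((img, txt))
        # negatives: pair negative-group images with data drawn from other groups
        others = [h for j, h in enumerate(groups) if j != i]
        for img in g["neg_images"]:
            other = random.choice(others)
            neg_pairs.append((img, other["global_desc"]))
    return pos_pairs, neg_pairs
```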
The above embodiment provides an implementation for constructing positive and negative sample pairs once each image-text matching pair has been split into positive and negative sample groups; contrastive learning can then be performed on the multimodal model to be trained based on these pairs so that it outputs more accurate replies.
In some embodiments, the data processing method provided by the embodiments of the present application further comprises: performing contrastive learning on the multimodal model to be trained based on the positive and negative sample pairs to obtain a basic multimodal model; generating an instruction fine-tuning dataset based on the image-text matching pair corresponding to each group of data and a preset instruction template; and performing instruction fine-tuning training on the basic multimodal model based on the instruction fine-tuning dataset to obtain a trained multimodal model.
Contrastive learning is a self-supervised learning technique that learns feature representations by comparing positive sample pairs with negative sample pairs; its core idea is to pull positive pairs closer together and push negative pairs farther apart.
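A standard InfoNCE-style loss sketch in PyTorch illustrates this pull/push principle; it is a common contrastive objective, not necessarily the exact loss used in this application:

```python
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor,
             negatives: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """anchor/positive: shape (d,); negatives: shape (n, d)."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negatives, dim=-1)
    # logit 0 is the positive pair; logits 1..n are the negatives
    logits = torch.cat([(a * p).sum(-1, keepdim=True), n @ a]) / tau
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```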
Optionally, LoRA, a parameter-efficient fine-tuning (PEFT) technique, can be applied: it not only trains the model efficiently but also avoids the poor fit that can result from lacking enough data to train the full model. Embodiments of the present application can adopt LoRA for parameter fine-tuning.
Instruction fine-tuning enables a model to learn to follow instructions to complete specific tasks and improves its responsiveness to specific instructions or task requests. Because local image extraction cannot be performed during retrieval for real-time response reasons, the fine-tuned model must be able to extract feature information from local image regions through instructions and keywords. To achieve this, embodiments of the present application preset an instruction template and reshape the dataset D** into an instruction fine-tuning dataset based on that template.
For example, the preset instruction template may be:

## Summary
<global description>
## Knowledge
<knowledge text>
## Keywords
<knowledge text keyword 1>
<knowledge text keyword 2>
...
<knowledge text keyword n>
## Question
<query question>
## Instruction
<image local description preset instruction>
## Details
<local description 1>
<local description 2>
...
<local description m>
Here, <image local description preset instruction> is drawn from a preset instruction set, while the other contents can be obtained from the dataset D**. The <image local description preset instruction> may be one of the following instructions (including but not limited to):
1. Combining the question and the knowledge, please help me extract the main key content of the picture.
2. Based on the question, analyze the picture content using your knowledge and find the key information points in it.
3. Based on the question, find the key elements in the picture.
4. Generate a brief description of the picture content, highlighting the important content related to the question and the knowledge.
5. Recommend relevant key search descriptions for me according to the picture content, the knowledge, and the question.
6. Determine the subject or focus of the picture and summarize it with keywords.
Optionally, after the instruction fine-tuning dataset is obtained, LoRA can be adopted to perform instruction fine-tuning training on the basic multimodal model, yielding the trained multimodal model M*.
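A sketch of rendering one D** record into the template above; the dict keys are illustrative, and the commented LoRA configuration uses Hugging Face peft with hyperparameters chosen only for illustration:

```python
TEMPLATE = """## Summary
{global_desc}
## Knowledge
{knowledge}
## Keywords
{keywords}
## Question
{question}
## Instruction
{instruction}
## Details
{details}"""

def render(record: dict, instruction: str) -> str:
    # record keys mirror the matching-pair fields; illustrative names only
    return TEMPLATE.format(
        global_desc=record["global_desc"],
        knowledge=record["knowledge_text"],
        keywords="\n".join(record["keywords"]),
        question=record["query"],
        instruction=instruction,
        details="\n".join(record["local_descs"]),
    )

# LoRA configuration sketch (Hugging Face peft); values are illustrative:
# from peft import LoraConfig, get_peft_model
# config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05)
# model = get_peft_model(base_model, config)
```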
The above embodiment introduces the training process of the multimodal model. Fine-grained description fine-tuning gives the multimodal model the ability to extract local key information and align local multimodal features, improving its ability to extract image-text embedding vectors for knowledge in a given field and enhancing the fine granularity of the image and text embedding vectors.
In some embodiments, a data processing method is provided, see fig. 6. Each group of data in the image-text knowledge dataset comprises an image, a knowledge text corresponding to the image, and a query question corresponding to the knowledge text; such data is therefore also called multimodal data. The method first generates a global description of the image content; then extracts keywords from the global description and performs image target detection based on the keywords; then crops the target images and splits the image into preset regions; then generates descriptions of the local image content; then performs contrastive learning on combinations of fine-grained image-text positive and negative samples; then performs instruction fine-tuning based on the image-local-description instruction fine-tuning template; and finally outputs the fine-tuned multimodal model.
In some embodiments, the data processing method provided by the embodiments of the present application further comprises: acquiring data to be warehoused, comprising a warehousing image, a warehousing knowledge text corresponding to the warehousing image, and a warehousing query question corresponding to the warehousing knowledge text; acquiring a global description of the warehousing image as the warehousing global description; performing target detection on the warehousing image based on keywords contained in the warehousing global description and obtaining a warehousing local image set based on the detection result; acquiring a warehousing local description set; encoding the warehousing global description to obtain a warehousing global description vector, encoding the warehousing knowledge text to obtain a warehousing knowledge text vector, and encoding each description in the warehousing local description set to obtain warehousing local text vectors; calculating the cosine distance between the warehousing knowledge text vector and each warehousing local text vector and sorting the warehousing local description set in ascending order of cosine distance to obtain a preset number of target local descriptions; extracting the local image embedding vector of the local image corresponding to each target local description and the global image embedding vector of the warehousing image; and constructing a knowledge group from the warehousing global description vector, the target local descriptions, the local image embedding vectors, the global image embedding vector, and the warehousing knowledge text vector, and storing it in a knowledge base.
Optionally, the steps of acquiring the global description of the warehousing image, performing target detection on the warehousing image based on keywords contained in the warehousing global description, obtaining the warehousing local image set based on the detection result, and acquiring the local description of each local image in that set may be implemented with the Grounded-SAM model and the multimodal model M*; the detailed process follows the foregoing embodiments and is not repeated here.
Specifically, the text embedding model of the multimodal model M* encodes the warehousing global description into the warehousing global description vector, encodes the warehousing knowledge text into the warehousing knowledge text vector, and encodes each local description in the warehousing local description set into a warehousing local text vector. The cosine distance between the warehousing knowledge text vector and each warehousing local text vector is calculated, and the local descriptions in the warehousing local description set are sorted in ascending order of cosine distance. The TOP-K nearest local descriptions (the value of K is set according to circumstances) are taken as the target local descriptions. The image embedding model of the multimodal model M* then extracts the local image embedding vector of the local image corresponding to each target local description, as well as the global image embedding vector of the warehousing image. The warehousing global description vector, the target local descriptions, the local image embedding vectors, the global image embedding vector, the warehousing knowledge text vector, the data to be warehoused, the warehousing global description, and the local images corresponding to the target local descriptions form a knowledge group, which is stored in the knowledge base.
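A warehousing sketch under assumed encoder callables (stand-ins for the text and image embedding heads of M*) and illustrative dict keys:

```python
import numpy as np

def cosine_dist(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_knowledge_group(encode_text, encode_image, record: dict, k: int = 3) -> dict:
    v_knowledge = encode_text(record["knowledge_text"])
    # rank local descriptions by distance to the knowledge text vector
    scored = [(cosine_dist(v_knowledge, encode_text(d)), d, im)
              for d, im in zip(record["local_descs"], record["local_images"])]
    scored.sort(key=lambda t: t[0])   # ascending distance = most similar first
    top = scored[:k]                  # TOP-K target local descriptions
    return {
        "global_desc_vec": encode_text(record["global_desc"]),
        "knowledge_text_vec": v_knowledge,
        "target_local_descs": [d for _, d, _ in top],
        "local_image_vecs": [encode_image(im) for _, _, im in top],
        "global_image_vec": encode_image(record["image"]),
        "raw": record,                # the original to-be-warehoused data
    }
```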
In some embodiments, referring to fig. 7, the warehousing step comprises: for the multimodal data to be warehoused, first generating a global description of the image content; then extracting keywords from the global description and performing image target detection based on the keywords; then cropping the target images and splitting the image into preset regions; then generating descriptions of the local image content; then extracting embedding vectors for all texts, calculating the cosine distance between the knowledge text embedding vector and all generated text embedding vectors, and taking the TOP-K nearest generated texts and their corresponding local images; and finally extracting embedding vectors for the original image and the selected local images and writing the resulting embedding vectors into the database.
The above embodiment provides a multimodal fine-granularity embedding-vector matching mechanism that optimizes the current mainstream warehousing strategy.
As for retrieval, the current mainstream strategy first expands the query question and then searches the knowledge base with the original query question and several expanded ones. Although expanding the query question can increase the hit rate, if the knowledge base is large and the query question is poorly phrased, the search results may be unsatisfactory and the generated answer inaccurate. The present application therefore optimizes the current retrieval strategy with a scheme of coarse global matching on the image description followed by fine matching on the local descriptions and the query question.
In some embodiments, the data processing method provided by the embodiments of the present application further comprises: receiving a retrieval request comprising a query image and a query question; acquiring a global description of the query image; searching the knowledge base, based on the global description of the query image, for at least one target knowledge group whose degree of similarity meets a preset condition; randomly selecting an image local description instruction from the image local description preset instruction set and generating a plurality of local descriptions based on that instruction, the global description, the query question, and the knowledge text in the target knowledge group; expanding the query question based on the plurality of local descriptions to obtain an expanded question; and retrieving within the at least one target knowledge group based on the query question and the expanded question to obtain a final knowledge group, then generating a retrieval reply based on the final knowledge group.
The query image may be processed by the multimodal model M* to generate a global description of the query image, and the text embedding model of M* then encodes that global description into a query text vector.
At least one target knowledge group whose degree of similarity meets the preset condition can be found in the knowledge base, up to TOP-N (the value of N is set according to circumstances). Specifically, the global description vector of each knowledge group in the knowledge base is obtained, the cosine distance between the query text vector and each global description vector is calculated, the knowledge groups are sorted in ascending order of cosine distance, and the TOP-N groups at the front are taken as the target knowledge groups.
As described above, the image local description preset instruction set includes, but is not limited to: 1. Combining the question and the knowledge, please help me extract the main key content of the picture. 2. Based on the question, analyze the picture content using your knowledge and find the key information points in it. 3. Based on the question, find the key elements in the picture. 4. Generate a brief description of the picture content, highlighting the important content related to the question and the knowledge. 5. Recommend relevant key search descriptions for me according to the picture content, the knowledge, and the question. 6. Determine the subject or focus of the picture and summarize it with keywords.
The plurality of local descriptions may be generated with the trained multimodal model M* from the randomly selected image local description instruction, the global description of the query image, the query question, and the knowledge text in the target knowledge group. The multimodal model M* can then be directed, via instructions, to expand the query question based on the plurality of local descriptions. Finally, the query question and the expanded question are used to retrieve within the at least one target knowledge group to obtain the final knowledge group; the knowledge text in the final knowledge group serves as the knowledge entry from which the retrieval reply is generated. Because generating local descriptions no longer requires preprocessing local images, retrieval remains timely while retrieval precision and fine granularity improve, ultimately improving the retrieval reply generation effect.
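A sketch of the two-stage retrieval, coarse global matching then fine matching against the original and expanded questions; describe_image, expand_question, and the knowledge-group keys are stand-ins for the components described above:

```python
import numpy as np

def cosine_dist(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(encode_text, describe_image, expand_question, knowledge_base: list,
             query_image, question: str, n: int = 5) -> dict:
    # 1) coarse: global description of the query image vs. stored global vectors
    q_vec = encode_text(describe_image(query_image))
    groups = sorted(knowledge_base,
                    key=lambda g: cosine_dist(q_vec, g["global_desc_vec"]))[:n]
    # 2) fine: expand the question from generated local descriptions, then rank
    #    the TOP-N groups against the original and expanded questions
    queries = [question] + expand_question(question, groups)
    q_vecs = [encode_text(q) for q in queries]
    best = min(groups, key=lambda g: min(cosine_dist(v, g["knowledge_text_vec"])
                                         for v in q_vecs))
    return best  # final knowledge group; its knowledge text feeds the reply
```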
In some embodiments, referring to fig. 8, the retrieval step comprises: receiving a retrieval request comprising a query image and a query question; generating a global description of the image content and extracting its embedding vector; retrieving TOP-N knowledge groups from the knowledge base; then generating local descriptions via an instruction and expanding the query question; extracting the embedding vectors of the expanded questions; retrieving knowledge entries within the TOP-N knowledge groups; and generating an answer in combination with the retrieved knowledge entries.
In the above embodiment, the image is globally described by the multimodal model, TOP-N similar knowledge groups are then searched in the knowledge base, and finally the relevant answer is located within the knowledge groups through the query question and the expanded question, improving the retrieval reply generation effect.
It should be understood that, although the steps in the flowcharts related to the above embodiments are sequentially shown as indicated by arrows, these steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of steps or a plurality of stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of the steps or stages is not necessarily performed sequentially, but may be performed alternately or alternately with at least some of the other steps or stages.
Based on the same inventive concept, the embodiment of the application also provides a data processing device for realizing the above related data processing method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation of one or more embodiments of the data processing device provided below may refer to the limitation of the data processing method hereinabove, and will not be repeated herein.
In one exemplary embodiment, there is provided a data processing apparatus including:
The acquisition module is used for acquiring an image-text knowledge data set, wherein the image-text knowledge data set comprises a plurality of groups of data, and each group of data comprises an image, a knowledge text corresponding to the image and a query problem corresponding to the knowledge text;
The construction module is used for acquiring, for each group of data, a global description of a target image contained in the current data, performing target detection on the target image based on keywords contained in the global description to obtain a target detection frame, acquiring a local image set based on the target detection frame, and acquiring a local description of each local image in the local image set to obtain a local description set; the global description, the local image set, the local description set and the current data constitute an image-text matching pair;
The training module is used for constructing positive and negative sample pairs based on the image-text matching pairs corresponding to each group of data, and training the multi-modal model based on the positive and negative sample pairs.
In some embodiments, the construction module is specifically configured to:
cutting the target image based on the target detection frame to obtain a target local image;
Acquiring the number of transverse divisions and the number of longitudinal divisions from a preset image segmentation rule, and dividing the target image into at least one regional local image based on the number of transverse divisions and the number of longitudinal divisions;
a local image set is constructed based on the target local image and the regional local images.
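As an illustration of this cutting-and-splitting step, here is a minimal sketch using Pillow; the 2x2 default grid, the function name, and the box format (left, top, right, bottom) are assumptions of the example rather than details fixed by the application.

    from PIL import Image

    def build_local_image_set(image_path, box, rows=2, cols=2):
        # Crop the target detection frame, then grid-split the full image.
        img = Image.open(image_path)
        w, h = img.size

        # Target local image: crop by the detection frame (left, top, right, bottom).
        target_local = img.crop(box)

        # Regional local images: split the image into rows x cols tiles per the
        # preset transverse/longitudinal division counts.
        tile_w, tile_h = w // cols, h // rows
        region_locals = [
            img.crop((c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h))
            for r in range(rows) for c in range(cols)
        ]
        return [target_local] + region_locals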
In some embodiments, the training module is specifically configured to:
For the image-text matching pair corresponding to each group of data, obtaining the target knowledge text and the target local description set contained in the current image-text matching pair; encoding the target knowledge text to obtain a knowledge text vector; and encoding each local description in the target local description set to obtain a local text vector of each local description;
splitting the current image-text matching pair into positive and negative sample groups based on the knowledge text vector and the local text vector of each local description;
and constructing positive and negative sample pairs based on the positive and negative sample groups obtained by splitting the image-text matching pairs corresponding to each group of data.
In some embodiments, the training module is specifically configured to:
For each local description, calculating the cosine distance between the knowledge text vector and the local text vector of the current local description; if the cosine distance is greater than or equal to a preset cosine distance threshold, taking the current local description as a local description positive sample, and if the cosine distance is less than the preset cosine distance threshold, taking the current local description as a local description negative sample;
constructing a positive sample group based on the local description positive sample set, the local image corresponding to each local description in the local description positive sample set, the corresponding global description and the corresponding data;
and constructing a negative sample group based on the local description negative sample set and the local image corresponding to each local description in the local description negative sample set.
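A minimal sketch of this thresholding split is given below. The threshold value is arbitrary, and "cosine distance" is implemented here as cosine similarity, which matches the comparison direction described above (values at or above the threshold are positives); both choices are assumptions of the example.

    import numpy as np

    def split_by_threshold(knowledge_vec, local_descs, local_vecs, threshold=0.6):
        # Split local descriptions into positive/negative samples by their
        # similarity to the knowledge text vector (threshold is an assumption).
        positives, negatives = [], []
        for desc, vec in zip(local_descs, local_vecs):
            sim = float(np.dot(knowledge_vec, vec) /
                        (np.linalg.norm(knowledge_vec) * np.linalg.norm(vec)))
            (positives if sim >= threshold else negatives).append(desc)
        return positives, negatives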
In some embodiments, the training module is specifically configured to:
for the image-text matching pair corresponding to each group of data, acquiring the target positive and negative sample groups obtained by splitting the current image-text matching pair, and taking the positive sample group among the target positive and negative sample groups as a positive sample pair;
and randomly selecting target data from all data in the image-text knowledge data set except the data corresponding to the current image-text matching pair, and constructing a negative sample pair based on the negative sample group among the target positive and negative sample groups, the global description of the image contained in the target data, and the target data.
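The pairing logic might then look like the following sketch; the dictionary fields (pos_group, neg_group, global_desc, data) and the uniform random choice of the other item are illustrative assumptions.

    import random

    def build_sample_pairs(dataset):
        # Build one positive pair and one negative pair per image-text matching
        # pair; field names are illustrative assumptions.
        pairs = []
        for i, item in enumerate(dataset):
            # Positive pair: the positive sample group of the current pair.
            pairs.append({"label": 1, "sample": item["pos_group"]})
            # Negative pair: the negative sample group combined with a randomly
            # chosen *other* item's data and the global description of its image.
            other = random.choice([d for j, d in enumerate(dataset) if j != i])
            pairs.append({"label": 0,
                          "sample": (item["neg_group"], other["global_desc"], other["data"])})
        return pairs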
In some embodiments, the data processing apparatus further comprises a warehousing module for:
acquiring data to be warehoused, wherein the data to be warehoused comprises a warehousing image, a warehousing knowledge text corresponding to the warehousing image, and a warehousing query question corresponding to the warehousing knowledge text;
acquiring a global description of the warehousing image as a warehousing global description, performing target detection on the warehousing image based on keywords contained in the warehousing global description, acquiring a warehousing local image set based on the detection result, and acquiring a local description of each local image in the warehousing local image set to obtain a warehousing local description set;
encoding the warehousing global description to obtain a warehousing global description vector, encoding the warehousing knowledge text to obtain a warehousing knowledge text vector, and encoding each local description in the warehousing local description set to obtain warehousing local text vectors;
calculating the cosine distance between the warehousing knowledge text vector and each warehousing local text vector, and sorting the local descriptions in the warehousing local description set in ascending order of cosine distance to obtain the preset number of top-ranked target local descriptions;
extracting a local image embedding vector of the local image corresponding to each target local description, and extracting a global image embedding vector of the warehousing image;
and constructing a knowledge group based on the warehousing global description vector, the target local descriptions, the local image embedding vectors, the global image embedding vector, and the warehousing knowledge text vector, and storing the knowledge group into a knowledge base.
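A compact sketch of assembling such a knowledge group is shown below; the field names, the hypothetical text and image encoder wrappers, and the default of keeping three target local descriptions are assumptions for illustration.

    import numpy as np

    def cosine_distance(a, b):
        # 1 - cosine similarity: smaller means more similar.
        return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def build_knowledge_group(text_encoder, image_encoder, global_desc,
                              knowledge_text, local_descs, local_images,
                              warehousing_image, top_k=3):
        # Assemble one knowledge group for warehousing (hypothetical encoders).
        text_vec = text_encoder.embed_text(knowledge_text)
        local_vecs = [text_encoder.embed_text(d) for d in local_descs]

        # Keep the top_k local descriptions closest to the knowledge text
        # (ascending cosine distance = most similar first).
        ranked = sorted(zip(local_descs, local_images, local_vecs),
                        key=lambda t: cosine_distance(text_vec, t[2]))[:top_k]

        return {
            "global_desc_vec": text_encoder.embed_text(global_desc),
            "target_local_descs": [d for d, _, _ in ranked],
            "local_image_vecs": [image_encoder.embed_image(img) for _, img, _ in ranked],
            "global_image_vec": image_encoder.embed_image(warehousing_image),
            "knowledge_text_vec": text_vec,
            "knowledge_text": knowledge_text,
        }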
In some embodiments, the data processing apparatus further comprises a retrieval module for:
Receiving a retrieval request, wherein the retrieval request comprises a query image and a query question;
acquiring a global description of the query image;
retrieving, based on the global description of the query image, at least one target knowledge group whose similarity meets a preset condition from a knowledge base;
randomly selecting an image local description instruction from a preset image local description instruction set, and generating a plurality of local descriptions based on the image local description instruction, the global description, the query question, and the knowledge text in the target knowledge group;
expanding the query question based on the plurality of local descriptions to obtain an expanded question;
and retrieving in the at least one target knowledge group based on the query question and the expanded question to obtain a final knowledge group, and generating a retrieval reply based on the final knowledge group.
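For the expansion step specifically, one possible sketch is the following; the prompt wording and the mm_model.generate call are hypothetical, not the application's actual interface.

    def expand_question(mm_model, query_question, local_descs):
        # Derive one expanded question per local description (prompt wording
        # and the generate() call are hypothetical).
        expanded = []
        for desc in local_descs:
            prompt = ("Given the image detail: '%s', rewrite the question '%s' "
                      "so that it asks about this detail." % (desc, query_question))
            expanded.append(mm_model.generate(prompt))
        return expanded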
Each of the modules in the above data processing apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor in the computer device in the form of hardware, or may be stored in a memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
In one exemplary embodiment, a computer device is provided; the computer device may be a server whose internal structure may be as shown in FIG. 9. The computer device includes a processor, a memory, an input/output interface (I/O), and a communication interface. The processor, the memory, and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store the image-text knowledge data set. The input/output interface of the computer device is used to exchange information between the processor and an external device. The communication interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements a data processing method.
It will be appreciated by those skilled in the art that the structure shown in FIG. 9 is merely a block diagram of part of the structure relevant to the solution of the present application and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by a computer program stored on a non-volatile computer-readable storage medium, which, when executed, may include the flows of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. The volatile memory may include Random Access Memory (RAM) or an external cache memory, and the like. By way of illustration and not limitation, RAM may take various forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database and the like. The processor referred to in the embodiments provided herein may be, but is not limited to, a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or an artificial intelligence (AI) processor.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered within the scope described in this specification.
The foregoing embodiments represent only a few implementations of the application and are described in detail, but they are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and improvements without departing from the concept of the application, and these all fall within the protection scope of the application. Accordingly, the protection scope of the application shall be subject to the appended claims.

Claims (10)

1. A data processing method, characterized in that the method comprises:
acquiring an image-text knowledge data set, wherein the image-text knowledge data set comprises a plurality of groups of data, and each group of data comprises an image, a knowledge text corresponding to the image, and a query question corresponding to the knowledge text;
for each group of data, acquiring a global description of a target image contained in the current data, performing target detection on the target image based on keywords contained in the global description to obtain a target detection frame, and acquiring a local image set based on the target detection frame; acquiring a local description of each local image in the local image set to obtain a local description set; the global description, the local image set, the local description set and the current data constituting an image-text matching pair;
constructing positive and negative sample pairs based on the image-text matching pair corresponding to each group of data, and training a multimodal model based on the positive and negative sample pairs.

2. The method according to claim 1, characterized in that acquiring the local image set based on the target detection frame comprises:
cutting the target image based on the target detection frame to obtain a target local image;
acquiring the number of transverse divisions and the number of longitudinal divisions from a preset image segmentation rule, and dividing the target image into at least one regional local image based on the number of transverse divisions and the number of longitudinal divisions;
constructing the local image set based on the target local image and the regional local images.

3. The method according to claim 1, characterized in that constructing positive and negative sample pairs based on the image-text matching pair corresponding to each group of data comprises:
for the image-text matching pair corresponding to each group of data, acquiring the target knowledge text and the target local description set contained in the current image-text matching pair; encoding the target knowledge text to obtain a knowledge text vector; encoding each local description in the target local description set to obtain a local text vector of each local description;
splitting the current image-text matching pair into positive and negative sample groups based on the knowledge text vector and the local text vector of each local description;
constructing positive and negative sample pairs based on the positive and negative sample groups obtained by splitting the image-text matching pair corresponding to each group of data.

4. The method according to claim 3, characterized in that splitting the current image-text matching pair into positive and negative sample groups based on the knowledge text vector and the local text vector of each local description comprises:
for each local description, calculating the cosine distance between the knowledge text vector and the local text vector of the current local description; if the cosine distance is greater than or equal to a preset cosine distance threshold, taking the current local description as a local description positive sample; if the cosine distance is less than the preset cosine distance threshold, taking the current local description as a local description negative sample; all local description positive samples constituting a local description positive sample set, and all local description negative samples constituting a local description negative sample set;
constructing a positive sample group based on the local description positive sample set, the local image corresponding to each local description in the local description positive sample set, the corresponding global description, and the corresponding data;
constructing a negative sample group based on the local description negative sample set and the local image corresponding to each local description in the local description negative sample set.

5. The method according to claim 3, characterized in that constructing positive and negative sample pairs based on the positive and negative sample groups obtained by splitting the image-text matching pair corresponding to each group of data comprises:
for the image-text matching pair corresponding to each group of data, acquiring the target positive and negative sample groups obtained by splitting the current image-text matching pair, and taking the positive sample group among the target positive and negative sample groups as a positive sample pair;
randomly selecting target data from all data in the image-text knowledge data set except the data corresponding to the current image-text matching pair, and constructing a negative sample pair based on the negative sample group among the target positive and negative sample groups, the global description of the image contained in the target data, and the target data.

6. The method according to claim 1, characterized in that training the multimodal model based on the positive and negative sample pairs comprises:
performing contrastive learning on the multimodal model to be trained based on the positive and negative sample pairs to obtain a basic multimodal model;
generating an instruction fine-tuning data set based on the image-text matching pair corresponding to each group of data and a preset instruction template;
performing instruction fine-tuning training on the basic multimodal model based on the instruction fine-tuning data set to obtain a trained multimodal model.

7. The method according to claim 1, characterized in that the method further comprises:
acquiring data to be warehoused, wherein the data to be warehoused comprises a warehousing image, a warehousing knowledge text corresponding to the warehousing image, and a warehousing query question corresponding to the warehousing knowledge text;
acquiring a global description of the warehousing image as a warehousing global description, performing target detection on the warehousing image based on keywords contained in the warehousing global description, and acquiring a warehousing local image set based on the detection result; acquiring a local description of each local image in the warehousing local image set to obtain a warehousing local description set;
encoding the warehousing global description to obtain a warehousing global description vector; encoding the warehousing knowledge text to obtain a warehousing knowledge text vector; encoding each local description in the warehousing local description set to obtain warehousing local text vectors;
calculating the cosine distance between the warehousing knowledge text vector and each warehousing local text vector, and sorting the local descriptions in the warehousing local description set in ascending order of cosine distance to obtain the preset number of top-ranked target local descriptions;
extracting a local image embedding vector of the local image corresponding to each target local description; extracting a global image embedding vector of the warehousing image;
constructing a knowledge group based on the warehousing global description vector, the target local descriptions, the local image embedding vectors, the global image embedding vector, and the warehousing knowledge text vector, and storing the knowledge group into a knowledge base.

8. The method according to claim 1, characterized in that the method further comprises:
receiving a retrieval request, wherein the retrieval request comprises a query image and a query question;
acquiring a global description of the query image;
retrieving, based on the global description of the query image, at least one target knowledge group whose similarity meets a preset condition from a knowledge base;
randomly selecting an image local description instruction from a preset image local description instruction set, and generating a plurality of local descriptions with the trained multimodal model based on the image local description instruction, the global description, the query question, and the knowledge text in the target knowledge group;
expanding the query question based on the plurality of local descriptions to obtain an expanded question;
retrieving in the at least one target knowledge group based on the query question and the expanded question to obtain a final knowledge group, and generating a retrieval reply based on the final knowledge group.

9. A computer device, comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method according to any one of claims 1 to 8 when executing the computer program.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 8.
CN202411777279.4A 2024-12-05 2024-12-05 Data processing method, computer device and readable storage medium Pending CN119622341A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411777279.4A CN119622341A (en) 2024-12-05 2024-12-05 Data processing method, computer device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411777279.4A CN119622341A (en) 2024-12-05 2024-12-05 Data processing method, computer device and readable storage medium

Publications (1)

Publication Number Publication Date
CN119622341A true CN119622341A (en) 2025-03-14

Family

ID=94890504

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411777279.4A Pending CN119622341A (en) 2024-12-05 2024-12-05 Data processing method, computer device and readable storage medium

Country Status (1)

Country Link
CN (1) CN119622341A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120256489A (en) * 2025-06-06 2025-07-04 北京观微科技有限公司 Geographic positioning method, device, equipment and medium based on multimodal large model

Similar Documents

Publication Publication Date Title
Taipalus Vector database management systems: Fundamental concepts, use-cases, and current challenges
Moreira et al. Image provenance analysis at scale
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
Kuo et al. Unsupervised semantic feature discovery for image object retrieval and tag refinement
KR100903961B1 (en) High-Dimensional Data Indexing and Retrieval Using Signature Files and Its System
CN106897374B (en) A personalized recommendation method based on nearest neighbor query of trajectory big data
CN114332893B (en) Table structure recognition method, device, computer equipment and storage medium
CN116975615A (en) Task prediction method and device based on video multi-mode information
CN117556067B (en) Data retrieval method, device, computer equipment and storage medium
CN111651635A (en) Video retrieval method based on natural language description
CN117194710B (en) Multi-granularity video retrieval method and device
Tian et al. Multi-scale hierarchical residual network for dense captioning
CN117435685A (en) Document retrieval method, document retrieval device, computer equipment, storage medium and product
CN118861211B (en) Multi-mode data retrieval method and device based on measurement index
CN118761844A (en) Information recommendation method, system and device
Tan et al. 3D detection transformer: Set prediction of objects using point clouds
CN110275990B (en) Method and device for generating keys and values stored in KV
CN117994623A (en) Image feature vector acquisition method
CN114329010B (en) A method for generating image scene graph based on knowledge graph
CN119622341A (en) Data processing method, computer device and readable storage medium
Bornia et al. Towards a semantic video analysis using deep learning and ontology
CN119990340B (en) Knowledge graph-based network generation method, inference method, system and terminal
CN120259674B (en) Weakly supervised point cloud semantic segmentation method and system based on multi-scale feature extraction and classification
Doulamis et al. 3D modelling of cultural heritage objects from photos posted over the Twitter
US20240338553A1 (en) Recommending backgrounds based on user intent

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination