Disclosure of Invention
In view of the foregoing, it is desirable to provide a data processing method, apparatus, computer device, computer readable storage medium, and computer program product capable of improving reply accuracy.
In a first aspect, the present application provides a data processing method, including:
acquiring an image-text knowledge data set, wherein the image-text knowledge data set comprises a plurality of groups of data, and each group of data comprises an image, a knowledge text corresponding to the image and a query question corresponding to the knowledge text;
for each group of data, acquiring a global description of a target image contained in the current data, performing target detection on the target image based on keywords contained in the global description to obtain target detection boxes, acquiring a local image set based on the target detection boxes, acquiring a local description of each local image in the local image set to obtain a local description set, and forming an image-text matching pair from the global description, the local image set, the local description set and the current data;
and constructing positive and negative sample pairs based on the image-text matching pairs corresponding to each group of data, wherein the positive and negative sample pairs are used for training the multimodal model.
In one embodiment, acquiring a local image set based on the target detection boxes includes:
cropping the target image based on the target detection boxes to obtain target local images;
acquiring the number of horizontal splits and the number of vertical splits in a preset image splitting rule, and splitting the target image into at least one regional local image based on the number of horizontal splits and the number of vertical splits;
and constructing a local image set based on the target local images and the regional local images.
In one embodiment, constructing positive and negative sample pairs based on the image-text matching pairs corresponding to each group of data includes:
for the image-text matching pair corresponding to each group of data, acquiring the target knowledge text and the target local description set contained in the current image-text matching pair, encoding the target knowledge text to obtain a knowledge text vector, and encoding each local description in the target local description set to obtain a local text vector of each local description;
splitting the current image-text matching pair into positive and negative sample groups based on the knowledge text vector and the local text vector of each local description;
and constructing positive and negative sample pairs based on the positive and negative sample groups obtained by splitting the image-text matching pairs corresponding to each group of data.
In one embodiment, splitting the current image-text matching pair into positive and negative sample groups based on the knowledge text vector and the local text vector of each local description includes:
for each local description, calculating the cosine distance between the knowledge text vector and the local text vector of the current local description; taking the current local description as a local description positive sample if the cosine distance is greater than or equal to a preset cosine distance threshold, and taking the current local description as a local description negative sample if the cosine distance is less than the preset cosine distance threshold;
constructing a positive sample group based on the local description positive sample set, the local image corresponding to each local description in the local description positive sample set, the corresponding global description and the corresponding data;
and constructing a negative sample group based on the local description negative sample set and the local image corresponding to each local description in the local description negative sample set.
In one embodiment, constructing the positive and negative sample pairs based on the positive and negative sample groups obtained by splitting the image-text matching pairs corresponding to each group of data includes:
for the image-text matching pair corresponding to each group of data, acquiring the target positive and negative sample groups obtained by splitting the current image-text matching pair, and taking the positive sample group in the target positive and negative sample groups as a positive sample pair;
and randomly selecting target data from all data except the data corresponding to the current image-text matching pair in the image-text knowledge data set, and constructing a negative sample pair based on the negative sample group in the target positive and negative sample groups, the global description of the image contained in the target data, and the target data.
In one embodiment, training the multimodal model based on the positive and negative sample pairs includes:
performing contrastive learning on the multimodal model to be trained based on the positive and negative sample pairs to obtain a basic multimodal model;
generating an instruction fine-tuning data set based on the image-text matching pairs corresponding to each group of data and a preset instruction template;
and performing instruction fine-tuning training on the basic multimodal model based on the instruction fine-tuning data set to obtain a trained multimodal model.
In one embodiment, the method further comprises:
acquiring data to be stored, wherein the data to be stored comprises a storage image, a storage knowledge text corresponding to the storage image and a storage query question corresponding to the storage knowledge text;
acquiring a global description of the storage image as a storage global description, performing target detection on the storage image based on keywords contained in the storage global description, acquiring a storage local image set based on the detection result, and acquiring a local description of each local image in the storage local image set to obtain a storage local description set;
encoding the storage global description to obtain a storage global description vector, encoding the storage knowledge text to obtain a storage knowledge text vector, and encoding each local description in the storage local description set to obtain storage local text vectors;
calculating the cosine distance between the storage knowledge text vector and each storage local text vector, and sorting the local descriptions in the storage local description set in ascending order of cosine distance to obtain a preset number of target local descriptions;
extracting a local image embedding vector of the local image corresponding to each target local description, and extracting a global image embedding vector of the storage image;
and constructing a knowledge group based on the storage global description vector, the target local descriptions, the local image embedding vectors, the global image embedding vector and the storage knowledge text vector, and storing the knowledge group into a knowledge base.
In one embodiment, the method further comprises:
receiving a retrieval request, wherein the retrieval request comprises a query image and a query question;
acquiring a global description of the query image;
retrieving, from a knowledge base, at least one target knowledge group whose similarity meets a preset condition based on the global description of the query image;
randomly selecting an image local description instruction from an image local description preset instruction set, and generating a plurality of local descriptions based on the image local description instruction, the global description, the query question and the knowledge text in the target knowledge group;
expanding the query question based on the plurality of local descriptions to obtain expanded questions;
and retrieving within the at least one target knowledge group based on the query question and the expanded questions to obtain a final knowledge group, and generating a retrieval reply based on the final knowledge group.
In a second aspect, the present application also provides a data processing apparatus, including:
The acquisition module is used for acquiring an image-text knowledge data set, wherein the image-text knowledge data set comprises a plurality of groups of data, and each group of data comprises an image, a knowledge text corresponding to the image and a query question corresponding to the knowledge text;
The construction module is used for, for each group of data, acquiring a global description of a target image contained in the current data, performing target detection on the target image based on keywords contained in the global description to obtain target detection boxes, acquiring a local image set based on the target detection boxes, acquiring a local description of each local image in the local image set to obtain a local description set, and forming an image-text matching pair from the global description, the local image set, the local description set and the current data;
The training module is used for constructing positive and negative sample pairs based on the image-text matching pairs corresponding to each group of data, and training the multimodal model based on the positive and negative sample pairs.
In a third aspect, the present application also provides a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring an image-text knowledge data set, wherein the image-text knowledge data set comprises a plurality of groups of data, and each group of data comprises an image, a knowledge text corresponding to the image and a query question corresponding to the knowledge text;
for each group of data, acquiring a global description of a target image contained in the current data, performing target detection on the target image based on keywords contained in the global description to obtain target detection boxes, acquiring a local image set based on the target detection boxes, acquiring a local description of each local image in the local image set to obtain a local description set, and forming an image-text matching pair from the global description, the local image set, the local description set and the current data;
and constructing positive and negative sample pairs based on the image-text matching pairs corresponding to each group of data, and training the multimodal model based on the positive and negative sample pairs.
In a fourth aspect, the present application also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
acquiring an image-text knowledge data set, wherein the image-text knowledge data set comprises a plurality of groups of data, and each group of data comprises an image, a knowledge text corresponding to the image and a query question corresponding to the knowledge text;
for each group of data, acquiring a global description of a target image contained in the current data, performing target detection on the target image based on keywords contained in the global description to obtain target detection boxes, acquiring a local image set based on the target detection boxes, acquiring a local description of each local image in the local image set to obtain a local description set, and forming an image-text matching pair from the global description, the local image set, the local description set and the current data;
and constructing positive and negative sample pairs based on the image-text matching pairs corresponding to each group of data, and training the multimodal model based on the positive and negative sample pairs.
In a fifth aspect, the application also provides a computer program product comprising a computer program which, when executed by a processor, performs the steps of:
acquiring an image-text knowledge data set, wherein the image-text knowledge data set comprises a plurality of groups of data, and each group of data comprises an image, a knowledge text corresponding to the image and a query question corresponding to the knowledge text;
for each group of data, acquiring a global description of a target image contained in the current data, performing target detection on the target image based on keywords contained in the global description to obtain target detection boxes, acquiring a local image set based on the target detection boxes, acquiring a local description of each local image in the local image set to obtain a local description set, and forming an image-text matching pair from the global description, the local image set, the local description set and the current data;
and constructing positive and negative sample pairs based on the image-text matching pairs corresponding to each group of data, and training the multimodal model based on the positive and negative sample pairs.
With the above data processing method, apparatus, computer device, computer readable storage medium and computer program product, an image-text knowledge data set is acquired, wherein the image-text knowledge data set comprises a plurality of groups of data, and each group of data comprises an image, a knowledge text corresponding to the image and a query question corresponding to the knowledge text. For each group of data, a global description of the target image contained in the current data is acquired; target detection is performed on the target image based on keywords contained in the global description to obtain target detection boxes; a local image set is acquired based on the target detection boxes; a local description of each local image in the local image set is acquired to obtain a local description set; and the global description, the local image set, the local description set and the current data form an image-text matching pair. Positive and negative sample pairs are constructed based on the image-text matching pairs corresponding to each group of data and are used for training the multimodal model. Image detail information is extracted through fine-grained description, the original image-text knowledge is data-enhanced from multiple angles, from global to local, and a multimodal model trained on the data-enhanced image-text knowledge data set can output more accurate replies.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In one embodiment, as shown in fig. 1, a data processing method is provided. The method is described here as applied to a terminal by way of example; it is understood that the method may also be applied to a server, or to a system including a terminal and a server and implemented through interaction between the terminal and the server. In this embodiment, the method includes the following steps:
Step 102, obtaining an image-text knowledge data set, wherein the image-text knowledge data set comprises a plurality of groups of data, and each group of data comprises an image, a knowledge text corresponding to the image and a query question corresponding to the knowledge text.
Optionally, an image-text knowledge data set of a professional field may be collected, the image-text knowledge data set comprising a plurality of groups of data, each group of data comprising at least one image of the professional field, a knowledge text corresponding to the image, and a query question corresponding to the knowledge text.
To build a massive image-text knowledge data set, the data is typically acquired from multiple sources, so an image may contain multiple elements unrelated to the text content. As shown in fig. 2, which illustrates the image and knowledge text of one group of data, the knowledge text mainly explains the meaning of a warning light in the image, but the image may also contain other information such as the average fuel consumption and the speedometer; the warning light is indicated by an arrow in fig. 2.
Step 104, for each group of data, acquiring a global description of a target image contained in the current data; performing target detection on the target image based on keywords contained in the global description to obtain target detection boxes; acquiring a local image set based on the target detection boxes; acquiring a local description of each local image in the local image set to obtain a local description set; and forming an image-text matching pair from the global description, the local image set, the local description set and the current data.
Optionally, a native visual-text multimodal model M to be fine-tuned, such as BLIPv2 or GLM-4V-9B, may be prepared in advance. For brevity, the visual-text multimodal model mentioned in the embodiments of the present application is simply referred to as a multimodal model. For each group of data, the global description of the target image contained in the current data can be obtained through the multimodal model M.
Specifically, for each group of data, the target image contained in the current data can be processed using the multimodal model M to generate the global description T_g. In this process a hint word P_g is used; P_g may be "describe this image", and the generation of T_g can be represented by the following expression:
T_g = M(I, P_g)
For example, referring to fig. 3, the image shown in fig. 3 is processed by the multimodal model M with the hint word "describe the image", and the resulting global description is: "The image shows the state of a vehicle after the engine hood has been opened. A silver engine sits in the middle of the image, bearing a silver badge on which TURBO is written. In the upper right of the picture there is a red container holding a liquid, possibly coolant or brake fluid. In addition, there are other automotive parts and pipes, but their specific functions are not clear."
For each group of data, after the global description of the target image contained in the current data is obtained, keywords can be extracted from the global description.
Optionally, the natural-language submodel M_L of the multimodal model M can be used to extract the keyword set K from the global description T_g. The hint word P_k may be "extract the potential keywords from the content in the <quote></quote> tag", where the content in the <quote></quote> tag is the global description T_g. This process can be represented by the following expression, an example of which is shown in fig. 4:
K = M_L(T_g, P_k)
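By way of illustration only, the following Python sketch shows how the two generation steps above might be orchestrated. The model wrapper and its generate method are hypothetical placeholders standing in for the multimodal model M and its natural-language submodel M_L; they are assumptions, not an API defined by this application.

from PIL import Image

def generate_global_description(model, image: Image.Image) -> str:
    # T_g = M(I, P_g): prompt the multimodal model with the hint word P_g.
    return model.generate(image=image, prompt="describe this image")

def extract_keywords(language_model, global_description: str) -> list[str]:
    # K = M_L(T_g, P_k): wrap T_g in <quote> tags inside the hint word P_k.
    prompt = ("extract the potential keywords from the content in the "
              f"<quote>{global_description}</quote> tag")
    reply = language_model.generate(prompt=prompt)
    # Assume the submodel returns a comma-separated keyword list.
    return [k.strip() for k in reply.split(",") if k.strip()]

The keyword set K returned here would then be handed to the detection step described next.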
Optionally, target detection is performed on the target image with each keyword in the keyword set K by means of the Grounded-SAM model M_sam, so as to obtain target detection boxes, where the keywords correspond one-to-one with the target detection boxes, yielding a target detection box set B.
Optionally, after the target detection box set B is obtained, the target image is cropped using the target detection boxes in B to obtain target local images, where the target detection boxes correspond one-to-one with the target local images; the set formed by the target local images can serve as the local image set I_loc.
Optionally, the multimodal model M can be used to process each local image in the local image set I_loc to generate a corresponding local description, resulting in the local description set T_loc. The above describes the processing of one group of data in the image-text knowledge data set D; all groups of data can be processed in the same way to obtain the image-text matching pair corresponding to each group of data.
Through the above data enhancement, a new image-text knowledge data set D^* can be obtained. The image-text knowledge data set D^* is composed of a plurality of image-text matching pairs, and the i-th image-text matching pair comprises the image I_i contained in the i-th group of data, a local image set I_i^loc, the query question Q_i contained in the i-th group of data, the knowledge text T_i contained in the i-th group of data, the global description T_i^g and the local description set T_i^loc.
Step 106, constructing positive and negative sample pairs based on the image-text matching pairs corresponding to each group of data, wherein the positive and negative sample pairs are used for training the multimodal model for multimodal retrieval-augmented generation.
Optionally, each image-text matching pair in the image-text knowledge data set D^* can be split to obtain positive and negative sample groups; positive and negative sample pairs are constructed based on the positive and negative sample groups corresponding to each image-text matching pair, and contrastive learning can then be performed on the multimodal model to be trained based on the positive and negative sample pairs, obtaining a trained multimodal model for actual inference.
Optionally, the multimodal model may be a multimodal retrieval-augmented generation (Retrieval-Augmented Generation, RAG) model, and the method provided by the embodiments of the present application can improve the generation performance of the multimodal RAG model. A multimodal RAG model is an artificial intelligence model that combines retrieval and generation capabilities and is mainly used to process and generate data of multiple modalities (e.g., text, images, sound, etc.). Typical usage scenarios of a multimodal RAG model include: 1) image-to-text generation, which generates descriptive text from an input image, such as automatically generating a title, description or story for the image; 2) visual question answering, in which the multimodal RAG model processes a user's questions about image content and generates accurate answers; 3) semantic retrieval of images and text, in which the multimodal RAG model retrieves images semantically related to a text query provided by the user from an image database, or retrieves related text information from an image query. The positive and negative sample pairs can be used to perform contrastive learning on the multimodal model to be trained in combination with its usage scenario.
In this embodiment, an image-text knowledge data set is acquired, wherein the image-text knowledge data set comprises a plurality of groups of data, and each group of data comprises an image, a knowledge text corresponding to the image and a query question corresponding to the knowledge text. For each group of data, a global description of the target image contained in the current data is acquired; target detection is performed on the target image based on keywords contained in the global description to obtain target detection boxes; a local image set is acquired based on the target detection boxes; a local description of each local image in the local image set is acquired to obtain a local description set; and the global description, the local image set, the local description set and the current data form an image-text matching pair. Positive and negative sample pairs are constructed based on the image-text matching pairs corresponding to each group of data and are used for training the multimodal model. Image detail information is extracted through fine-grained description, the original image-text knowledge is data-enhanced from multiple angles, from global to local, and a multimodal model trained on the data-enhanced image-text knowledge data set can output more accurate replies.
In some embodiments, acquiring the local image set based on the target detection boxes comprises: cropping the target image based on the target detection boxes to obtain target local images; acquiring the number of horizontal splits and the number of vertical splits from a preset image splitting rule, and splitting the target image into at least one regional local image based on the number of horizontal splits and the number of vertical splits; and constructing a local image set based on the target local images and the regional local images.
Optionally, after the target detection box set B is obtained, the target image is cropped using the target detection boxes in B to obtain target local images, where the target detection boxes correspond one-to-one with the target local images; the set formed by the target local images can serve as the target local image set. Because some potential targets may not be detected during the above target detection process, embodiments of the present application supplement them as follows.
Optionally, the preset image splitting rule may be an h_g × w_g grid, where h_g is the number of vertical splits (the image is evenly split into h_g parts along its y-axis) and, similarly, w_g is the number of horizontal splits (the image is evenly split into w_g parts along its x-axis). Based on the numbers of horizontal and vertical splits, the target image can be split into h_g × w_g regional local images, which form a regional local image set. As shown in fig. 5, h_g is 3 and w_g is 3, so the figure yields 9 regional local images.
Optionally, after the target local image set and the regional local image set are obtained, the local image set can be constructed as the union of the two.
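A minimal Python sketch of this local-image construction, assuming the detection boxes have already been produced (the Grounded-SAM step is abstracted away) and are given as (left, upper, right, lower) tuples:

from PIL import Image

def build_local_image_set(image: Image.Image, boxes, h_g: int = 3, w_g: int = 3):
    # Target local images: one crop per detection box.
    target_locals = [image.crop(box) for box in boxes]
    # Regional local images: evenly split the image into an h_g x w_g grid
    # to supplement targets the detector may have missed.
    width, height = image.size
    cell_w, cell_h = width / w_g, height / h_g
    region_locals = [
        image.crop((int(col * cell_w), int(row * cell_h),
                    int((col + 1) * cell_w), int((row + 1) * cell_h)))
        for row in range(h_g) for col in range(w_g)
    ]
    # The local image set is the union of the two groups.
    return target_locals + region_locals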
In this embodiment, the target image is cropped based on the target detection boxes to obtain target local images; the numbers of horizontal and vertical splits are obtained from the preset image splitting rule; the target image is split into at least one regional local image based on these numbers; and the local image set is constructed from the target local images and the regional local images. Targets possibly missed during detection are thus supplemented, improving the data enhancement effect.
In some embodiments, constructing positive and negative sample pairs based on the image-text matching pairs corresponding to each group of data comprises: for the image-text matching pair corresponding to each group of data, acquiring the target knowledge text and the target local description set contained in the current image-text matching pair; encoding the target knowledge text to obtain a knowledge text vector, and encoding each local description in the target local description set to obtain a local text vector of each local description; splitting the current image-text matching pair into positive and negative sample groups based on the knowledge text vector and the local text vectors; and constructing positive and negative sample pairs based on the positive and negative sample groups obtained by splitting the image-text matching pairs corresponding to each group of data.
Optionally, there may be irrelevant descriptions in the local description set; for example, the local descriptions corresponding to regional local images are quite likely to be unrelated to the knowledge text. Therefore, for the image-text matching pair corresponding to each group of data, the target knowledge text and the target local description set contained in the current image-text matching pair can be acquired; the text embedding model M_te of the multimodal model M encodes the target knowledge text into a knowledge text vector V_i^T and encodes each local description in the target local description set into a local text vector. Positive and negative sample pairs are then constructed based on the positive and negative sample groups obtained by splitting the image-text matching pairs corresponding to each group of data.
The above embodiment provides an implementation for constructing the positive and negative sample pairs, based on which contrastive learning can be performed on the multimodal model to be trained. Because the positive and negative sample pairs are constructed from the data-enhanced image-text knowledge data set, the trained multimodal model can output more accurate replies.
In some embodiments, splitting the current image-text matching pair into positive and negative sample groups based on the knowledge text vector and the local text vector of each local description comprises: for each local description, calculating the cosine distance between the knowledge text vector and the local text vector of the current local description; taking the current local description as a local description positive sample if the cosine distance is greater than or equal to a preset cosine distance threshold, and taking it as a local description negative sample if the cosine distance is less than the threshold; forming a local description positive sample set from all local description positive samples and a local description negative sample set from all local description negative samples; constructing a positive sample group based on the local description positive sample set, the local image corresponding to each local description in the local description positive sample set, the corresponding global description and the corresponding data; and constructing a negative sample group based on the local description negative sample set and the local image corresponding to each local description in the local description negative sample set.
Optionally, after the text embedding model M_te of the multimodal model M encodes the target knowledge text into the knowledge text vector V_i^T and encodes each local description in the target local description set into a local text vector, the cosine distance between V_i^T and each local text vector is calculated. A local description whose cosine distance is greater than or equal to the preset cosine distance threshold is classified as a local description positive sample, and all local description positive samples form the local description positive sample set T_i^loc_p; otherwise it is classified as a local description negative sample, and all local description negative samples form the local description negative sample set T_i^loc_n. The local image positive sample set corresponding to T_i^loc_p and the local image negative sample set corresponding to T_i^loc_n can then be obtained. In this way, a positive sample group and a negative sample group are formed within the i-th image-text matching pair. Under this scheme, the data set D^* is further refined into an image-text knowledge data set D^{**}.
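A minimal sketch of this split. Note that the application's "cosine distance" is read here as the cosine similarity score, so that a higher score means a closer description and the greater-than-or-equal threshold rule selects positives; this reading, and the threshold value, are assumptions.

import numpy as np

def split_local_descriptions(knowledge_vec, local_vecs, local_descs, threshold=0.5):
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    positives, negatives = [], []
    for desc, vec in zip(local_descs, local_vecs):
        score = cosine(knowledge_vec, vec)
        # Scores at or above the preset threshold count as positive samples.
        (positives if score >= threshold else negatives).append(desc)
    return positives, negatives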
The above embodiment provides an implementation for splitting an image-text matching pair into positive and negative sample groups; this process is the basis for constructing positive and negative sample pairs, based on which contrastive learning can then be performed on the multimodal model to be trained. Because the positive and negative sample pairs are constructed from the data-enhanced image-text knowledge data set, the trained multimodal model can output more accurate replies.
In some embodiments, constructing positive and negative sample pairs based on the positive and negative sample groups obtained by splitting the image-text matching pairs corresponding to each group of data comprises: for the image-text matching pair corresponding to each group of data, acquiring the target positive and negative sample groups obtained by splitting the current image-text matching pair, and taking the positive sample group in the target positive and negative sample groups as a positive sample pair; and randomly selecting target data from all data except the data corresponding to the current image-text matching pair in the image-text knowledge data set, and constructing a negative sample pair based on the negative sample group in the target positive and negative sample groups, the global description of the image contained in the target data, and the target data.
Optionally, for the image-text knowledge data set D^{**}, the i-th image-text matching pair is split to obtain a positive sample group, within which the images and texts form positive sample pairs with each other, and a negative sample group, whose members form negative sample pairs with the randomly selected other data and with the global descriptions of the images contained in that other data.
The above embodiment provides an implementation of how to construct positive and negative sample pairs after each image-text matching pair is split into positive and negative sample groups; contrastive learning can then be performed on the multimodal model to be trained based on these pairs so that it outputs more accurate replies.
In some embodiments, the data processing method provided by the embodiments of the present application further comprises: performing contrastive learning on the multimodal model to be trained based on the positive and negative sample pairs to obtain a basic multimodal model; generating an instruction fine-tuning data set based on the image-text matching pairs corresponding to each group of data and a preset instruction template; and performing instruction fine-tuning training on the basic multimodal model based on the instruction fine-tuning data set to obtain a trained multimodal model.
Contrastive learning is a self-supervised learning technique that learns feature representations by comparing positive sample pairs with negative sample pairs; its core idea is to pull positive pairs closer together and push negative pairs farther apart.
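As an illustrative sketch only (the application does not name a specific contrastive loss), an InfoNCE-style objective over one positive pair and a set of negative pairs could be written as:

import torch
import torch.nn.functional as F

def info_nce_loss(anchor, positive, negatives, temperature=0.07):
    # anchor, positive: (d,) embeddings; negatives: (n, d) embeddings.
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos_logit = (anchor * positive).sum(-1, keepdim=True) / temperature  # (1,)
    neg_logits = negatives @ anchor / temperature                        # (n,)
    logits = torch.cat([pos_logit, neg_logits]).unsqueeze(0)  # positive first
    labels = torch.zeros(1, dtype=torch.long)  # the positive sits at index 0
    return F.cross_entropy(logits, labels)

Minimizing this loss pulls the anchor toward its positive and away from the negatives, which is exactly the behaviour described above.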
Optionally, LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning (Parameter-Efficient Fine-Tuning, PEFT) technique. Applying LoRA not only trains the model efficiently but also avoids the poor fit that can result from full-parameter training without sufficient data. The embodiments of the present application can adopt LoRA for parameter fine-tuning.
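A minimal sketch using the open-source peft library; the rank, scaling factor and target modules below are illustrative assumptions, not values specified by this application:

from peft import LoraConfig, get_peft_model

def wrap_with_lora(base_model):
    config = LoraConfig(
        r=8,                                   # low-rank dimension
        lora_alpha=16,                         # scaling factor
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],   # assumed attention projections
    )
    # Only the injected low-rank adapters are trainable; base weights stay frozen.
    return get_peft_model(base_model, config)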
Instruction fine-tuning enables a model to learn to follow instructions to complete specific tasks and improves its responsiveness to specific instructions or task requests. Because, for real-time response reasons, local image extraction cannot be performed during retrieval, the fine-tuned model must be able to extract feature information of local image regions through instructions and keywords. To achieve this, the embodiments of the present application preset an instruction template and, based on this template, reform the data set D^{**} into an instruction fine-tuning data set.
For example, the preset instruction template may be:
## Summary
<global description>
## Knowledge
<knowledge text>
## Keywords
<knowledge text keyword 1>
<knowledge text keyword 2>
...
<knowledge text keyword n>
## Question
<query question>
## Instruction
<image local description preset instruction>
## Details
<local description 1>
<local description 2>
...
<local description m>
Wherein <image local description preset instruction> is drawn from a preset instruction set, and the other contents can be obtained from the data set D^{**}. The <image local description preset instruction> may be one of the following instructions (including but not limited to):
1. Combining the question and the knowledge, please help me extract the main key content in the picture.
2. According to the question, analyze the picture content using your knowledge and find the key information points in it.
3. Based on the question, find the key elements in the picture.
4. Generate a brief description of the picture content, highlighting the important content related to the question and the knowledge.
5. Recommend relevant key search descriptions for me according to the picture content, the knowledge and the question.
6. Determine the subject or focus of the picture and summarize it with keywords.
Optionally, after the instruction fine-tuning data set is obtained, LoRA can be used to perform instruction fine-tuning training on the basic multimodal model, obtaining a trained multimodal model M^*.
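For illustration, one sample of the instruction fine-tuning data set might be assembled from an image-text matching pair as follows; the dictionary field names are hypothetical:

import random

TEMPLATE = """## Summary
{global_description}
## Knowledge
{knowledge_text}
## Keywords
{keywords}
## Question
{query_question}
## Instruction
{instruction}
## Details
{local_descriptions}"""

def build_instruction_sample(pair: dict, instruction_set: list[str]) -> str:
    return TEMPLATE.format(
        global_description=pair["global_description"],
        knowledge_text=pair["knowledge_text"],
        keywords="\n".join(pair["keywords"]),
        query_question=pair["query_question"],
        # One instruction is sampled at random from the preset instruction set.
        instruction=random.choice(instruction_set),
        local_descriptions="\n".join(pair["local_descriptions"]),
    )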
The above embodiment introduces the training process of the multimodal model. Fine-grained description fine-tuning gives the multimodal model the capabilities of local key-information extraction and local multimodal feature alignment, which improves its ability to extract image-text embedding vectors for knowledge in a given field and enhances the fine granularity of the image and text embedding vectors.
In some embodiments, a data processing method is provided, see fig. 6. Each group of data in the image-text knowledge data set includes an image, a knowledge text corresponding to the image and a query question corresponding to the knowledge text; such data is therefore also called multimodal data. The method first generates a global description of the image content; then extracts keywords from the global description and performs image target detection based on the keywords; then crops the target images and splits the preset regions of the image; then generates descriptions of the local image content; then performs contrastive learning based on fine-grained image-text positive and negative sample pairs; then performs instruction fine-tuning based on the image local description instruction fine-tuning template; and finally outputs the fine-tuned multimodal model.
In some embodiments, the data processing method provided by the embodiments of the present application further comprises: acquiring data to be stored, wherein the data to be stored comprises a storage image, a storage knowledge text corresponding to the storage image and a storage query question corresponding to the storage knowledge text; acquiring a global description of the storage image as a storage global description, performing target detection on the storage image based on keywords contained in the storage global description, and acquiring a storage local image set based on the detection result; acquiring a local description of each local image in the storage local image set to obtain a storage local description set; encoding the storage global description to obtain a storage global description vector, encoding the storage knowledge text to obtain a storage knowledge text vector, and encoding each local description in the storage local description set to obtain storage local text vectors; calculating the cosine distance between the storage knowledge text vector and each storage local text vector, and sorting the local descriptions in the storage local description set in ascending order of cosine distance to obtain a preset number of target local descriptions; extracting a local image embedding vector of the local image corresponding to each target local description, and extracting a global image embedding vector of the storage image; and constructing a knowledge group based on the storage global description vector, the target local descriptions, the local image embedding vectors, the global image embedding vector and the storage knowledge text vector, and storing the knowledge group into a knowledge base.
Optionally, the steps of obtaining the global description of the storage image, performing target detection on the storage image based on keywords contained in the storage global description, obtaining the storage local image set based on the detection result, and obtaining the local description of each local image in the storage local image set can be implemented with the Grounded-SAM model and the multimodal model M^*; the detailed process follows the foregoing embodiments and is not repeated here.
Wherein the text embedding model of the multimodal model M^* is used to encode the storage global description into a storage global description vector, encode the storage knowledge text into a storage knowledge text vector, and encode each local description in the storage local description set into a storage local text vector. The cosine distance between the storage knowledge text vector and each storage local text vector is calculated, and the local descriptions in the storage local description set are sorted in ascending order of cosine distance; the TOP-K nearest local descriptions (the value of K is set according to the circumstances) are taken as the target local descriptions. The image embedding model of the multimodal model M^* is then used to extract the local image embedding vector of the local image corresponding to each target local description and the global image embedding vector of the storage image. The storage global description vector, the target local descriptions, the local image embedding vectors, the global image embedding vector, the storage knowledge text vector, the data to be stored, the storage global description and the local images corresponding to the target local descriptions form a knowledge group, which is stored into the knowledge base.
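A minimal sketch of the TOP-K selection and knowledge-group assembly, assuming plain numpy vectors and illustrative field names:

import numpy as np

def build_knowledge_group(global_desc_vec, knowledge_text_vec,
                          local_descs, local_text_vecs,
                          local_image_vecs, global_image_vec, k=3):
    def cosine_distance(a, b):
        return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    dists = [cosine_distance(knowledge_text_vec, v) for v in local_text_vecs]
    top_k = np.argsort(dists)[:k]  # ascending order: nearest descriptions first
    return {
        "global_description_vector": global_desc_vec,
        "target_local_descriptions": [local_descs[i] for i in top_k],
        "local_image_embeddings": [local_image_vecs[i] for i in top_k],
        "global_image_embedding": global_image_vec,
        "knowledge_text_vector": knowledge_text_vec,
    }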
In some embodiments, referring to fig. 7, the warehousing step includes: for multimodal data to be stored, first generating a global description of the image content; then extracting keywords from the global description and performing image target detection based on the keywords; then cropping the target images and splitting the preset regions of the image; then generating descriptions of the local image content; then extracting embedding vectors for all texts, calculating the cosine distances between the knowledge text embedding vector and all generated text embedding vectors, and taking the TOP-K nearest generated texts and the corresponding local images; then extracting embedding vectors for the original image and the selected local images; and finally storing the effective embedding vectors into the database.
The above embodiment provides a multimodal fine-grained embedding vector matching mechanism that optimizes the current mainstream warehousing strategy.
In terms of retrieval strategies, the current mainstream strategy is to first expand the query question and then search the knowledge base with the original query question and several expanded query questions. Although expanding the query question can increase the hit rate, if the knowledge base is large and the query question is poorly phrased, the retrieval results may be unsatisfactory and the generated results inaccurate. Therefore, the present application optimizes the current retrieval strategy with a scheme of coarse matching on the global image description followed by fine matching on the local descriptions and the query question.
In some embodiments, the data processing method provided by the embodiments of the present application further comprises: receiving a retrieval request, wherein the retrieval request comprises a query image and a query question; acquiring a global description of the query image; retrieving, from the knowledge base, at least one target knowledge group whose similarity meets a preset condition based on the global description of the query image; randomly selecting an image local description instruction from the image local description preset instruction set, and generating a plurality of local descriptions based on the image local description instruction, the global description, the query question and the knowledge text in the target knowledge group; expanding the query question based on the plurality of local descriptions to obtain expanded questions; and retrieving within the at least one target knowledge group based on the query question and the expanded questions to obtain a final knowledge group, and generating a retrieval reply based on the final knowledge group.
Wherein the query image can be processed by the multimodal model M^* to generate the global description of the query image, and the text embedding model of the multimodal model M^* processes the global description to generate a query text vector.
At least one target knowledge group whose similarity meets the preset condition, up to TOP-N (the value of N is set according to the circumstances), can be retrieved from the knowledge base. Specifically, the global description vector contained in each knowledge group in the knowledge base can be obtained; the cosine distance between the query text vector and each global description vector is calculated; the knowledge groups in the knowledge base are sorted in ascending order of cosine distance; and the TOP-N knowledge groups ranked first are taken as the target knowledge groups.
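A minimal sketch of this coarse matching stage, reusing the illustrative knowledge-group fields from the warehousing sketch above:

import numpy as np

def coarse_retrieve(query_text_vec, knowledge_groups, n=5):
    def cosine_distance(a, b):
        return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(
        knowledge_groups,
        key=lambda g: cosine_distance(query_text_vec, g["global_description_vector"]),
    )
    return ranked[:n]  # the TOP-N most similar knowledge groups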
As described above, the image local description preset instruction set includes, but is not limited to, the following instructions: 1. Combining the question and the knowledge, please help me extract the main key content in the picture. 2. According to the question, analyze the picture content using your knowledge and find the key information points in it. 3. Based on the question, find the key elements in the picture. 4. Generate a brief description of the picture content, highlighting the important content related to the question and the knowledge. 5. Recommend relevant key search descriptions for me according to the picture content, the knowledge and the question. 6. Determine the subject or focus of the picture and summarize it with keywords.
Wherein the plurality of local descriptions can be generated by the trained multimodal model M^* based on the randomly selected image local description instruction, the global description of the query image, the query question and the knowledge text in the target knowledge group. The multimodal model M^* can be directed, through extensible instructions, to expand the query question based on the plurality of local descriptions. Finally, retrieval is performed within the at least one target knowledge group using the query question and the expanded questions to obtain the final knowledge group; the knowledge text in the final knowledge group serves as the knowledge entry, and the retrieval reply is generated based on the knowledge entry. Because local description generation no longer requires preprocessing of local images, retrieval timeliness is ensured while retrieval precision and fine granularity are improved, which ultimately improves the retrieval reply generation effect.
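A minimal sketch of this expansion-and-fine-retrieval stage; the expansion prompt, the embed helper and the max-similarity scoring rule are all illustrative assumptions:

import numpy as np

def fine_retrieve(model, embed, query_question, local_descriptions, target_groups):
    # Expand the query question using the generated local descriptions.
    expanded = [
        model.generate(prompt=(f"Rewrite the question '{query_question}' "
                               f"so that it also covers: {desc}"))
        for desc in local_descriptions
    ]
    questions = [query_question] + expanded
    # Score each candidate group against every question; keep the best match.
    def best_score(group):
        g_vec = group["knowledge_text_vector"]
        return max(float(np.dot(embed(q), g_vec)) for q in questions)
    final_group = max(target_groups, key=best_score)
    # The knowledge text of the final group serves as the knowledge entry.
    return final_group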
In some embodiments, referring to fig. 8, the retrieval step includes: receiving a retrieval request including a query image and a query question; generating a global description of the image content; extracting the embedding vector of the global description; retrieving TOP-N knowledge groups from the knowledge base; then generating local descriptions through an instruction and expanding the query question; extracting the embedding vectors of the expanded questions; retrieving knowledge entries within the TOP-N knowledge groups; and generating an answer in combination with the retrieved knowledge entries.
In the above embodiment, the image is globally described through the multimodal model, TOP-N similar knowledge groups are then retrieved from the knowledge base, and finally the relevant answers are located within the knowledge groups through the query question and the expanded questions, improving the retrieval reply generation effect.
It should be understood that, although the steps in the flowcharts related to the above embodiments are sequentially shown as indicated by arrows, these steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of steps or a plurality of stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of the steps or stages is not necessarily performed sequentially, but may be performed alternately or alternately with at least some of the other steps or stages.
Based on the same inventive concept, the embodiment of the application also provides a data processing device for realizing the above related data processing method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation of one or more embodiments of the data processing device provided below may refer to the limitation of the data processing method hereinabove, and will not be repeated herein.
In one exemplary embodiment, there is provided a data processing apparatus including:
The acquisition module is used for acquiring an image-text knowledge data set, wherein the image-text knowledge data set comprises a plurality of groups of data, and each group of data comprises an image, a knowledge text corresponding to the image and a query question corresponding to the knowledge text;
The construction module is used for, for each group of data, acquiring a global description of a target image contained in the current data, performing target detection on the target image based on keywords contained in the global description to obtain target detection boxes, acquiring a local image set based on the target detection boxes, acquiring a local description of each local image in the local image set to obtain a local description set, and forming an image-text matching pair from the global description, the local image set, the local description set and the current data;
The training module is used for constructing positive and negative sample pairs based on the image-text matching pairs corresponding to each group of data, and training the multimodal model based on the positive and negative sample pairs.
In some embodiments, the acquisition module is specifically configured to:
cropping the target image based on the target detection boxes to obtain target local images;
acquiring the number of horizontal splits and the number of vertical splits in a preset image splitting rule, and splitting the target image into at least one regional local image based on the number of horizontal splits and the number of vertical splits;
and constructing a local image set based on the target local images and the regional local images.
In some embodiments, the training module is specifically configured to:
for the image-text matching pair corresponding to each group of data, acquiring the target knowledge text and the target local description set contained in the current image-text matching pair, encoding the target knowledge text to obtain a knowledge text vector, and encoding each local description in the target local description set to obtain a local text vector of each local description;
splitting the current image-text matching pair into positive and negative sample groups based on the knowledge text vector and the local text vector of each local description;
and constructing positive and negative sample pairs based on the positive and negative sample groups obtained by splitting the image-text matching pairs corresponding to each group of data.
In some embodiments, the training module is specifically configured to:
for each local description, calculating the cosine distance between the knowledge text vector and the local text vector of the current local description; taking the current local description as a local description positive sample if the cosine distance is greater than or equal to a preset cosine distance threshold, and taking the current local description as a local description negative sample if the cosine distance is less than the preset cosine distance threshold;
constructing a positive sample group based on the local description positive sample set, the local image corresponding to each local description in the local description positive sample set, the corresponding global description and the corresponding data;
and constructing a negative sample group based on the local description negative sample set and the local image corresponding to each local description in the local description negative sample set.
In some embodiments, the training module is specifically configured to:
for the image-text matching pair corresponding to each group of data, acquiring the target positive and negative sample groups obtained by splitting the current image-text matching pair, and taking the positive sample group in the target positive and negative sample groups as a positive sample pair;
and randomly selecting target data from all data except the data corresponding to the current image-text matching pair in the image-text knowledge data set, and constructing a negative sample pair based on the negative sample group in the target positive and negative sample groups, the global description of the image contained in the target data, and the target data.
In some embodiments, the data processing apparatus further comprises a warehousing module configured to:
acquiring data to be stored, wherein the data to be stored comprises a storage image, a storage knowledge text corresponding to the storage image and a storage query question corresponding to the storage knowledge text;
acquiring a global description of the storage image as a storage global description, performing target detection on the storage image based on keywords contained in the storage global description, acquiring a storage local image set based on the detection result, and acquiring a local description of each local image in the storage local image set to obtain a storage local description set;
encoding the storage global description to obtain a storage global description vector, encoding the storage knowledge text to obtain a storage knowledge text vector, and encoding each local description in the storage local description set to obtain storage local text vectors;
calculating the cosine distance between the storage knowledge text vector and each storage local text vector, and sorting the local descriptions in the storage local description set in ascending order of cosine distance to obtain a preset number of target local descriptions;
extracting a local image embedding vector of the local image corresponding to each target local description, and extracting a global image embedding vector of the storage image;
and constructing a knowledge group based on the storage global description vector, the target local descriptions, the local image embedding vectors, the global image embedding vector and the storage knowledge text vector, and storing the knowledge group into a knowledge base.
In some embodiments, the data processing apparatus further comprises a retrieval module for:
receiving a retrieval request, wherein the retrieval request comprises a query image and a query question;
acquiring a global description of the query image;
retrieving, from a knowledge base, at least one target knowledge group whose similarity meets a preset condition based on the global description of the query image;
randomly selecting an image local description instruction from an image local description preset instruction set, and generating a plurality of local descriptions based on the image local description instruction, the global description, the query question and the knowledge text in the target knowledge group;
expanding the query question based on the plurality of local descriptions to obtain expanded questions;
and retrieving within the at least one target knowledge group based on the query question and the expanded questions to obtain a final knowledge group, and generating a retrieval reply based on the final knowledge group.
Each of the modules in the above-described data processing apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one exemplary embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 9. The computer device includes a processor, a memory, an input/output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs and a database. The internal memory provides an environment for the operation of the operating system and the computer programs in the non-volatile storage medium. The database of the computer device is used for storing the image-text knowledge data set. The input/output interface of the computer device is used to exchange information between the processor and external devices. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a data processing method.
It will be appreciated by persons skilled in the art that the structure shown in fig. 9 is merely a block diagram of part of the structure relevant to the solution of the present application and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown, combine some of the components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by a computer program stored on a non-transitory computer readable storage medium, which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database or other medium used in the embodiments provided herein may include at least one of non-volatile memory and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory can include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take various forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of relational and non-relational databases. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided in the present application may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, data processing logic units based on quantum computing, artificial intelligence (AI) processors, or the like, but are not limited thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered to be within the scope of this specification.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.