CN119227816A - A method for generating image description text based on large language model - Google Patents
- Publication number
- CN119227816A (application CN202411756885.8A)
- Authority
- CN
- China
- Prior art keywords
- image
- text
- description
- language model
- representing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/761—Proximity, similarity or dissimilarity measures
Abstract
The invention discloses an image description text generation method based on a large language model. The method constructs an image description text generation model to obtain the entity information and the entity-relation scene graph of an image; the entity information and the scene graph are then fed into a large language model to obtain an initial description of the image, which is further polished ("rendered") to obtain the final rendered text. The invention can accurately identify and describe each object in an image and clearly express the complex interaction relationships between objects. The chain-of-thought style of description generation not only improves the accuracy of the generated descriptions but also makes them better match human cognitive habits. Through step-by-step guidance, the resulting description conveys the overall information and intent of the image more effectively.
Description
Technical Field
The invention belongs to the technical field of image text processing, and particularly relates to an image description text generation method based on a large language model.
Background
Image description generation is a complex task that combines computer vision and natural language processing, with the core goal of enabling a computer to accurately identify and understand the content of an image and to generate a description in natural language. Progress in this technology is significant for academic research and has broad prospects in practical applications, such as automatic picture annotation, assisting visually impaired people, and improving the performance of image search engines. With continued technical innovation and optimization, image description generation will play an increasingly important role.
The first challenge in image description generation is recognizing the objects in an image. This relies on computer vision techniques such as convolutional neural networks (CNNs), which progressively refine the understanding of the elements in an image through layer-by-layer feature extraction but remain limited. The datasets currently used for the image captioning task, while containing large numbers of images and descriptions, still have limitations. For example, certain items may appear infrequently in a dataset, so a model does not adequately learn their features and descriptions during training. In addition, in some images items may be occluded by or overlap with other items, which makes them harder to identify.
Furthermore, image description generation requires that the computer not only recognize the objects in an image but also understand how these objects interact. For example, in a picture in which a cat sits on a table, the computer needs to recognize not only the "cat" and the "table" but also understand the relationship "cat on table". If the interaction relationships between objects cannot be accurately identified, the model cannot generate an accurate image description.
Based on this, recurrent neural networks (RNNs) and long short-term memory networks (LSTMs) have played an important role: they can generate fluent, grammatical descriptions from the information in an image. More recently, research has introduced Transformer models such as BERT and GPT, which exhibit greater effectiveness and flexibility in generating natural language text.
However, the text descriptions generated by current image captioning systems tend to be stiff, overly literal, and lacking in vividness. Human image descriptions, by contrast, are often not limited to the information in the image itself but are enriched by imagination and association. The image description generation task therefore requires not only generating an accurate description but also making that description more vivid and concrete.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides an image description text generation method based on a large language model, which comprises the following steps:
step S1, acquiring a plurality of images and their corresponding labels, and constructing a memory bank from the images and labels;
step S2, constructing an image description text generation model, wherein the model comprises an object extraction module, a relation extraction module, an initial description text generation module and a text rendering module;
step S3, importing the image I to be described and the entity information obtained by the object extraction module in step S2 into the relation extraction module to obtain the corresponding entity-relation scene graph;
step S4, importing the image I, the entity information from step S2 and the entity-relation scene graph from step S3 into the initial description text generation module to obtain the corresponding initial description;
step S5, importing the image I and the initial description from step S4 into the text rendering module to obtain the final rendered text.
Further, the step S1 specifically includes:
Each acquired image is encoded by an image encoder to obtain its image embedding, expressed as:
E_i = Enc(I_i), i = 1, 2, …, N;
E = {E_1, E_2, …, E_N};
wherein Enc(·) denotes the image encoding operation, I_i denotes the i-th sample image, E_i denotes the i-th image embedding, N denotes the total number of images in the memory bank, and E denotes the set of image embeddings;
the image embeddings and their corresponding labels are stored in the form of key-value pairs, with the image embedding as key and the label as value; the set of key-value pairs forms the memory bank, expressed as:
(k_i, v_i) = (E_i, label_i);
M = {(k_1, v_1), (k_2, v_2), …, (k_N, v_N)};
wherein M denotes the memory bank, k_i denotes the i-th key, v_i denotes the i-th value, E_i denotes the i-th image embedding, and label_i denotes the label corresponding to the i-th image embedding.
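The memory-bank construction of step S1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: `toy_encoder` is a stand-in for the real image encoder Enc(·), and the bank is a plain list of (key, value) pairs.

```python
import numpy as np

def build_memory_bank(images, labels, encoder):
    """Step S1 sketch: store (image-embedding, label) key-value pairs."""
    bank = []
    for img, label in zip(images, labels):
        key = encoder(img)          # key  k_i: the image embedding
        bank.append((key, label))   # value v_i: the corresponding label
    return bank

def toy_encoder(img):
    """Stand-in for the real image encoder: mean-pool pixels to a 3-vector."""
    return img.reshape(-1, 3).mean(axis=0)

images = [np.full((4, 4, 3), c) for c in (0.1, 0.5, 0.9)]
labels = ["cat", "table", "lamp"]
bank = build_memory_bank(images, labels, toy_encoder)
print(len(bank))  # 3
```

In practice the encoder would be a pretrained vision model, and the bank would typically be an indexed matrix of embeddings rather than a Python list.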
Further, the object extraction module in step S2 includes an image segmentation module and an image encoder;
Inputting the image I to be described into the image segmentation module yields the bounding boxes and masks of all objects in the image;
each mask is binarized: pixel values in the region containing the object are set to 1 and pixel values elsewhere are set to 0; the mask is then multiplied element-wise with the image I to obtain the matting (cut-out) of each object, expressed as:
{B_i, M_i} = Seg(I), i = 1, 2, …, n;
C_i = M_i ⊙ I;
wherein Seg(·) denotes the image segmentation operation, B_i and M_i denote the bounding box and mask of the i-th object, C_i is the matting of the i-th object, n is the number of objects in the image, and C = {C_1, C_2, …, C_n} denotes the set of object mattings;
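The binarize-and-multiply matting step can be sketched with NumPy; `extract_matting` and the toy 2×2 image are illustrative names and data, not taken from the patent.

```python
import numpy as np

def extract_matting(image, masks):
    """Binarize each mask (object pixels -> 1, background -> 0), then
    multiply element-wise with the image to cut each object out."""
    mattings = []
    for m in masks:
        binary = (m > 0).astype(image.dtype)          # 1 where the object is
        mattings.append(image * binary[..., None])    # broadcast over channels
    return mattings

# Toy 2x2 RGB image and a mask covering only the top-left pixel.
image = np.arange(2 * 2 * 3, dtype=float).reshape(2, 2, 3)
mask = np.array([[1, 0], [0, 0]])
cut = extract_matting(image, [mask])[0]
print(cut[0, 0])  # the kept object pixel, equal to image[0, 0]
print(cut[1, 1])  # a background pixel, zeroed out
```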
the matting of each object is passed through the image encoder to obtain its corresponding image embedding, giving the total set of object embeddings, expressed as:
F_i = Enc(C_i), F = {F_1, F_2, …, F_n};
the embedding F_i of each object is then compared by cosine similarity against the keys of all image embeddings in the memory bank M of step S1, expressed as:
s_{i,j} = cos(F_i, k_j) = (F_i · k_j) / (‖F_i‖ ‖k_j‖);
wherein cos(·, ·) denotes the cosine similarity function and s_{i,j} denotes the similarity between the i-th object embedding and the j-th key;
The key k_j with the highest similarity is found and its corresponding value v_j is retrieved; this identifies the image in the memory bank most similar to the input image I, and the retrieved label is assigned to the object as the result of object recognition, i.e. the entity information O, expressed as:
j = argmax_j s_{i,j};
o_i = v_j;
O = {o_1, o_2, …, o_n};
wherein j is the index of the key with the highest similarity, argmax denotes the maximum-selection function, and o_n denotes the label corresponding to the n-th object in image I.
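Retrieval by cosine similarity over the memory-bank keys (the argmax over s_{i,j}) might look like the sketch below; `retrieve_label` is an assumed helper name and the two-entry bank is toy data.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_label(query, bank):
    """Return the value (label) of the key most similar to the query:
    j = argmax_j cos(F, k_j), then o = v_j."""
    sims = [cosine(query, key) for key, _ in bank]
    j = int(np.argmax(sims))
    return bank[j][1]

bank = [(np.array([1.0, 0.0]), "cat"),
        (np.array([0.0, 1.0]), "table")]
print(retrieve_label(np.array([0.9, 0.1]), bank))  # cat
```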
Further, step S3 specifically constructs a relevant prompt P_1 based on the identified entity information and inputs it into a large language model to obtain the interaction relations among the objects in the image I to be described, including the relative positions, actions and states of the objects;
wherein the prompt P_1 designed for the large language model is:
P_1 = {P_cond, O, P_task};
wherein P_cond denotes the interaction-condition prompt used to set the generation conditions, O denotes the entity information, and P_task denotes the interaction-relation task prompt used to set the generation requirements;
the image and the prompt P_1 are input into the large language model to generate the object-relation scene graph G, expressed as:
G = LLM(V(I), T(P_1));
wherein LLM is the large language model, V is the visual encoder, and T is the text encoder;
further, in the process of constructing the object-relation scene graph, the objects are ranked by importance: each object is scored, the objects are sorted by score, and it is determined which objects are the main subjects of the image and which are secondary;
the constraints on the ranking are that an object in the foreground scores higher than one in the background, and/or an object closer to the center of the image I scores higher, and/or an object occupying a larger area of the image I scores higher, and/or an object with more interactions with other objects scores higher;
according to these constraints, an importance score S_i is calculated for each object o_i, expressed as:
S_i = −α · dist(o_i) + β · area(o_i) + deg(o_i);
wherein dist(·) is a calculation function giving the Euclidean distance between the object's position center and the center of the whole image I, area(·) is a calculation function giving the area of the mask corresponding to the object, α and β are hyperparameters, and deg(·) is a calculation function giving the degree of the object in the object-relation scene graph;
the objects are sorted from highest to lowest importance score S_i, and the object list is updated according to this ordering to obtain the updated objects O';
The object position center is calculated as follows:
according to the bounding box acquired in step S2, assume that the position coordinates of the four vertices of the bounding box corresponding to an object o_i are:
(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4);
the coordinates of its center position are then:
(x_c, y_c) = ((x_1 + x_2 + x_3 + x_4) / 4, (y_1 + y_2 + y_3 + y_4) / 4);
assume that the position coordinates of the geometric center of the whole image I are (x_0, y_0);
then:
dist(o_i) = √((x_c − x_0)² + (y_c − y_0)²).
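The center and distance computation, together with one score combining the stated cues, can be sketched as follows. The exact weighting in the patent's score formula is not recoverable here, so `importance_score` is an assumed combination that is merely consistent with the stated constraints: closer to the image center, larger mask area, and higher scene-graph degree all raise the score.

```python
import math

def box_center(box):
    """Center of a bounding box given as (x1, y1, x2, y2) corner coordinates."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def importance_score(box, mask_area, degree, image_center,
                     alpha=1.0, beta=1.0, gamma=1.0):
    """Hypothetical score: distance to the image center lowers the score,
    while mask area and scene-graph degree raise it."""
    cx, cy = box_center(box)
    d = math.hypot(cx - image_center[0], cy - image_center[1])
    return alpha / (1.0 + d) + beta * mask_area + gamma * degree

# A centered object should outrank an identical object in a corner.
central = importance_score((0, 0, 4, 4), 0.5, 1, (2, 2))
corner = importance_score((0, 0, 1, 1), 0.5, 1, (2, 2))
print(central > corner)  # True
```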
further, the step S4 specifically includes:
constructing a second prompt P_2; through P_2, the generated object-relation scene graph G and the updated objects O' are input, together with the image I to be described, into the initial description text generation module, which generates the initial description D based on the large language model, expressed as:
P_2 = {P'_cond, G, O', P'_task};
D = LLM(V(I), T(P_2));
wherein P'_cond denotes the initial-description condition prompt and P'_task denotes the initial-description task prompt.
Further, the step S5 specifically includes:
Constructing a third prompt P_3 that guides the text rendering module, based on the large language model, to further reason over and expand the initial description D; combined with the image I to be described, the final rendered text D* is obtained, expressed as:
P_3 = {P''_cond, D, P''_task};
D* = LLM(V(I), T(P_3));
wherein P''_cond denotes the rendering-condition prompt and P''_task denotes the rendering-task prompt.
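The three-stage prompt chain of steps S3–S5 can be sketched as a pipeline; `llm(image, prompt)` is a hypothetical multimodal-model call, and the prompt strings are illustrative, not the patent's actual prompts P_1–P_3.

```python
def caption_pipeline(image, entities, llm):
    """Chain-of-thought captioning in three LLM calls (steps S3-S5).
    `llm(image, prompt)` stands in for a multimodal large-language-model
    call; it is an assumed interface, not a specific API."""
    # S3: relation extraction -> object-relation scene graph
    p1 = f"Entities: {entities}. Describe the relations (positions, actions, states) among them."
    scene_graph = llm(image, p1)
    # S4: initial description conditioned on the scene graph
    p2 = f"Scene graph: {scene_graph}. Write an initial description of the image."
    initial = llm(image, p2)
    # S5: polish ("render") the initial description into vivid final text
    p3 = f"Initial description: {initial}. Expand it into a vivid, detailed final description."
    return llm(image, p3)

# Stub LLM so the control flow can be exercised without a real model.
stub = lambda image, prompt: "LLM(" + prompt[:20] + "...)"
final = caption_pipeline("img.png", ["cat", "table"], stub)
print(final.startswith("LLM("))  # True
```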
The invention has the positive progress effects that:
1) The invention can accurately identify and describe each object in an image and clearly express the complex interaction relationships between objects. The chain-of-thought style of description generation improves the accuracy of the generated descriptions and makes them better match human cognitive habits. Through step-by-step guidance, the resulting description conveys the overall information and intent of the image more effectively.
2) Further, the invention segments the image into multiple objects and ranks their importance, determining which objects are the main subjects of the image and which are secondary. Once the description priority and importance ranking are determined, the main objects are described first and in detail, specifying their characteristics, positions, actions and other information. For example, when describing an image containing a person and a landscape, the person's appearance, posture and activity are described in detail first, followed by the main features of the landscape. Secondary objects are only briefly mentioned, without detailed elaboration. This style of description gives the image content a clear hierarchy with well-placed emphasis.
Drawings
FIG. 1 is a flow chart of steps of a method for generating image description text based on a large language model of the present invention.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the disclosure of this specification, which describes embodiments of the invention with reference to specific examples. The invention may also be practiced or applied in other, different embodiments, and the details in this specification may be modified or varied in various ways without departing from the spirit and scope of the invention.
Referring to fig. 1, a method for generating image description text based on a large language model, in an example, includes the steps of:
step S1, acquiring a plurality of images and their corresponding labels, and constructing a memory bank from the images and labels;
step S2, constructing an image description text generation model, wherein the model comprises an object extraction module, a relation extraction module, an initial description text generation module and a text rendering module;
step S3, importing the image I to be described and the entity information obtained by the object extraction module in step S2 into the relation extraction module to obtain the corresponding entity-relation scene graph;
step S4, importing the image I, the entity information from step S2 and the entity-relation scene graph from step S3 into the initial description text generation module to obtain the corresponding initial description;
step S5, importing the image I and the initial description from step S4 into the text rendering module to obtain the final rendered text.
Further, in an example, step S1 is specifically:
Each acquired image is encoded by an image encoder to obtain its image embedding, expressed as:
E_i = Enc(I_i), i = 1, 2, …, N;
E = {E_1, E_2, …, E_N};
wherein Enc(·) denotes the image encoding operation, I_i denotes the i-th sample image, E_i denotes the i-th image embedding, N denotes the total number of images in the memory bank, and E denotes the set of image embeddings;
the image embeddings and their corresponding labels are stored in the form of key-value pairs, with the image embedding as key and the label as value; the set of key-value pairs forms the memory bank, expressed as:
(k_i, v_i) = (E_i, label_i);
M = {(k_1, v_1), (k_2, v_2), …, (k_N, v_N)};
wherein M denotes the memory bank, k_i denotes the i-th key, v_i denotes the i-th value, E_i denotes the i-th image embedding, and label_i denotes the label corresponding to the i-th image embedding.
Further, the object extraction module in step S2 includes an image segmentation module and an image encoder;
Inputting the image I to be described into the image segmentation module yields the bounding boxes and masks of all objects in the image;
each mask is binarized: pixel values in the region containing the object are set to 1 and pixel values elsewhere are set to 0; the mask is then multiplied element-wise with the image I to obtain the matting (cut-out) of each object, expressed as:
{B_i, M_i} = Seg(I), i = 1, 2, …, n;
C_i = M_i ⊙ I;
wherein Seg(·) denotes the image segmentation operation, B_i and M_i denote the bounding box and mask of the i-th object, C_i is the matting of the i-th object, n is the number of objects in the image, and C = {C_1, C_2, …, C_n} denotes the set of object mattings;
the matting of each object is passed through the image encoder to obtain its corresponding image embedding, giving the total set of object embeddings, expressed as:
F_i = Enc(C_i), F = {F_1, F_2, …, F_n};
the embedding F_i of each object is then compared by cosine similarity against the keys of all image embeddings in the memory bank M of step S1, expressed as:
s_{i,j} = cos(F_i, k_j) = (F_i · k_j) / (‖F_i‖ ‖k_j‖);
wherein cos(·, ·) denotes the cosine similarity function and s_{i,j} denotes the similarity between the i-th object embedding and the j-th key;
The key k_j with the highest similarity is found and its corresponding value v_j is retrieved; this identifies the image in the memory bank most similar to the input image I, and the retrieved label is assigned to the object as the result of object recognition, i.e. the entity information O, expressed as:
j = argmax_j s_{i,j};
o_i = v_j;
O = {o_1, o_2, …, o_n};
wherein j is the index of the key with the highest similarity, argmax denotes the maximum-selection function, and o_n denotes the label corresponding to the n-th object in image I.
Further, in one example, step S3 specifically constructs a relevant prompt P_1 based on the identified entity information and inputs it into a large language model to obtain the interaction relations among the objects in the image I to be described, including the relative positions, actions and states of the objects;
wherein the prompt P_1 designed for the large language model is:
P_1 = {P_cond, O, P_task};
wherein P_cond denotes the interaction-condition prompt used to set the generation conditions, O denotes the entity information, and P_task denotes the interaction-relation task prompt used to set the generation requirements;
the image and the prompt P_1 are input into the large language model to generate the object-relation scene graph G, expressed as:
G = LLM(V(I), T(P_1));
wherein LLM is the large language model, V is the visual encoder, and T is the text encoder;
further, in one example, in the process of building the object-relation scene graph, the objects are ranked by importance: each object is scored, the objects are sorted by score, and it is determined which objects are the main subjects of the image and which are secondary;
the constraints on the ranking are that an object in the foreground scores higher than one in the background, and/or an object closer to the center of the image I scores higher, and/or an object occupying a larger area of the image I scores higher, and/or an object with more interactions with other objects scores higher;
according to these constraints, an importance score S_i is calculated for each object o_i, expressed as:
S_i = −α · dist(o_i) + β · area(o_i) + deg(o_i);
wherein dist(·) is a calculation function giving the Euclidean distance between the object's position center and the center of the whole image I, area(·) is a calculation function giving the area of the mask corresponding to the object, α and β are hyperparameters, and deg(·) is a calculation function giving the degree of the object in the object-relation scene graph;
the objects are sorted from highest to lowest importance score S_i, and the object list is updated according to this ordering to obtain the updated objects O';
The object position center is calculated as follows:
according to the bounding box acquired in step S2, assume that the position coordinates of the four vertices of the bounding box corresponding to an object o_i are:
(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4);
the coordinates of its center position are then:
(x_c, y_c) = ((x_1 + x_2 + x_3 + x_4) / 4, (y_1 + y_2 + y_3 + y_4) / 4);
assume that the position coordinates of the geometric center of the whole image I are (x_0, y_0);
then:
dist(o_i) = √((x_c − x_0)² + (y_c − y_0)²).
further, in an example, step S4 is specifically:
constructing a second prompt P_2; through P_2, the generated object-relation scene graph G and the updated objects O' are input, together with the image I to be described, into the initial description text generation module, which generates the initial description D based on the large language model, expressed as:
P_2 = {P'_cond, G, O', P'_task};
D = LLM(V(I), T(P_2));
wherein P'_cond denotes the initial-description condition prompt and P'_task denotes the initial-description task prompt.
Further, in an example, step S5 is specifically:
Constructing a third prompt P_3 that guides the text rendering module, based on the large language model, to further reason over and expand the initial description D; combined with the image I to be described, the final rendered text D* is obtained, expressed as:
P_3 = {P''_cond, D, P''_task};
D* = LLM(V(I), T(P_3));
wherein P''_cond denotes the rendering-condition prompt and P''_task denotes the rendering-task prompt.
The present invention has been described in detail above with reference to the embodiments shown in the drawings, and those skilled in the art may make various modifications to the invention based on the above description. Accordingly, certain details of the embodiments are not to be construed as limiting the invention, which is defined solely by the appended claims.
Claims (6)
1. An image description text generation method based on a large language model is characterized by comprising the following steps:
step S1, acquiring a plurality of images and their corresponding labels, and constructing a memory bank from the images and labels;
step S2, constructing an image description text generation model, wherein the model comprises an object extraction module, a relation extraction module, an initial description text generation module and a text rendering module;
step S3, importing the image I to be described and the entity information obtained by the object extraction module in step S2 into the relation extraction module to obtain an object-relation scene graph;
step S4, importing the image I, the entity information from step S2 and the entity-relation scene graph from step S3 into the initial description text generation module to obtain an initial description;
step S5, importing the image I and the initial description from step S4 into the text rendering module to obtain a final rendered text.
2. The method for generating image description text based on large language model as claimed in claim 1, wherein step S1 specifically comprises:
Each acquired image is encoded by an image encoder to obtain its image embedding, expressed as:
E_i = Enc(I_i), i = 1, 2, …, N; E = {E_1, E_2, …, E_N};
wherein Enc(·) denotes the image encoding operation, I_i denotes the i-th sample image, E_i denotes the i-th image embedding, N denotes the total number of images in the memory bank, and E denotes the set of image embeddings;
the image embeddings and their corresponding labels are stored in the form of key-value pairs, with the image embedding as key and the label as value; the set of key-value pairs forms the memory bank, expressed as:
(k_i, v_i) = (E_i, label_i); M = {(k_1, v_1), …, (k_N, v_N)};
wherein M denotes the memory bank, k_i denotes the i-th key, v_i denotes the i-th value, E_i denotes the i-th image embedding, and label_i denotes the label corresponding to the i-th image embedding.
3. The method for generating image description text based on a large language model according to claim 2, wherein the object extraction module in step S2 includes an image segmentation module and an image encoder;
Inputting the image I to be described into the image segmentation module yields the bounding boxes and masks of all objects in the image;
each mask is binarized: pixel values in the region containing the object are set to 1 and pixel values elsewhere are set to 0; the mask is then multiplied element-wise with the image I to obtain the matting of each object, expressed as:
{B_i, M_i} = Seg(I), i = 1, 2, …, n; C_i = M_i ⊙ I;
wherein Seg(·) denotes the image segmentation operation, B_i and M_i denote the bounding box and mask of the i-th object, C_i is the matting of the i-th object, n is the number of objects in the image, and C = {C_1, C_2, …, C_n} denotes the set of object mattings;
the matting of each object is passed through the image encoder to obtain its corresponding image embedding, giving the total set of object embeddings, expressed as:
F_i = Enc(C_i), F = {F_1, F_2, …, F_n};
the embedding F_i of each object is then compared by cosine similarity against the keys of all image embeddings in the memory bank M of step S1, expressed as:
s_{i,j} = cos(F_i, k_j) = (F_i · k_j) / (‖F_i‖ ‖k_j‖);
wherein cos(·, ·) denotes the cosine similarity function and s_{i,j} denotes the similarity between the i-th object embedding and the j-th key;
The key k_j with the highest similarity is found and its corresponding value v_j is retrieved; this identifies the image in the memory bank most similar to the input image I, and the retrieved label is assigned to the object as the result of object recognition, i.e. the entity information O, expressed as:
j = argmax_j s_{i,j}; o_i = v_j; O = {o_1, o_2, …, o_n};
wherein j is the index of the key with the highest similarity, argmax denotes the maximum-selection function, and o_n denotes the label corresponding to the n-th object in image I.
4. The method for generating image description text based on a large language model as set forth in claim 3, wherein step S3 specifically constructs a relevant prompt P_1 based on the identified entity information and inputs it into a large language model to obtain the interaction relations among the objects in the image I to be described, including the relative positions, actions and states of the objects;
wherein the prompt P_1 designed for the large language model is:
P_1 = {P_cond, O, P_task};
wherein P_cond denotes the interaction-condition prompt used to set the generation conditions, O denotes the entity information, and P_task denotes the interaction-relation task prompt used to set the generation requirements;
the image and the prompt P_1 are input into the large language model to generate the object-relation scene graph G, expressed as:
G = LLM(V(I), T(P_1));
wherein LLM is the large language model, V is the visual encoder, and T is the text encoder;
further, in the process of constructing the object-relation scene graph, the objects are ranked by importance: each object is scored, the objects are sorted by score, and it is determined which objects are the main subjects of the image and which are secondary;
the constraints on the ranking are that an object in the foreground scores higher than one in the background, and/or an object closer to the center of the image I scores higher, and/or an object occupying a larger area of the image I scores higher, and/or an object with more interactions with other objects scores higher;
according to these constraints, an importance score S_i is calculated for each object o_i, expressed as:
S_i = −α · dist(o_i) + β · area(o_i) + deg(o_i);
wherein dist(·) is a calculation function giving the Euclidean distance between the object's position center and the center of the whole image I, area(·) is a calculation function giving the area of the mask corresponding to the object, α and β are hyperparameters, and deg(·) is a calculation function giving the degree of the object in the object-relation scene graph;
the objects are sorted from highest to lowest importance score S_i, and the object list is updated according to this ordering to obtain the updated objects O';
The object position center is calculated as follows:

According to the bounding box acquired in step S2, assume that the position coordinates of the four vertices of the bounding box corresponding to an object o_i are (x_1, y_1), (x_2, y_2), (x_3, y_3) and (x_4, y_4);

the coordinates of its center position are:

x_c = (x_1 + x_2 + x_3 + x_4) / 4, y_c = (y_1 + y_2 + y_3 + y_4) / 4;

assume that the position coordinates of the geometric center of the whole image I are (x_I, y_I); then:

Dist(o_i) = sqrt((x_c − x_I)² + (y_c − y_I)²);
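The importance ranking described above can be sketched as follows. The combined weighting form (an inverse-distance term plus weighted area and degree terms), the hyperparameter values, and the two sample objects are assumptions for illustration:

```python
import math

def bbox_center(box):
    """Center of a bounding box given its four vertex coordinates."""
    xs, ys = zip(*box)
    return sum(xs) / 4.0, sum(ys) / 4.0

def importance_score(box, mask_area, degree, image_center,
                     alpha=1.0, beta=1.0, gamma=1.0):
    """Higher score: closer to the image center, larger mask area,
    more interactions (scene-graph degree). alpha/beta/gamma are
    hyperparameters weighting the three constraints."""
    cx, cy = bbox_center(box)
    dist = math.hypot(cx - image_center[0], cy - image_center[1])
    return alpha / (1.0 + dist) + beta * mask_area + gamma * degree

# Rank two hypothetical objects in a 100x100 image (center (50, 50)):
# a centered dog with two relations vs. background grass with one.
objs = {
    "dog":   dict(box=[(40, 40), (60, 40), (60, 60), (40, 60)],
                  mask_area=0.30, degree=2),
    "grass": dict(box=[(0, 80), (100, 80), (100, 100), (0, 100)],
                  mask_area=0.20, degree=1),
}
ranked = sorted(objs,
                key=lambda k: importance_score(image_center=(50, 50), **objs[k]),
                reverse=True)
print(ranked)  # → ['dog', 'grass']: the centered, well-connected dog ranks first
```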
5. The method for generating image description text based on a large language model as claimed in claim 1, wherein step S4 specifically comprises:

constructing a second related prompt P_2; through the second related prompt P_2, the generated object-relation scene graph G and the updated objects O are input, together with the image I requiring text description, into the large-language-model-based initial description text generation module to generate the initial description D_0, expressed as:

D_0 = LLM(E_v(I), E_t(P_2), G, O)

wherein P_2 comprises P_2^c, representing the initial-description condition-related prompt, and P_2^t, representing the initial-description task-related prompt.
6. The method for generating image description text based on a large language model as claimed in claim 1, wherein step S5 specifically comprises:

constructing a third related prompt P_3 to guide the large-language-model-based text polishing module to further reason over and expand the initial description D_0, combined with the image I requiring text description, to obtain the final polished text D_r, expressed as:

D_r = LLM(E_v(I), E_t(P_3), D_0)

wherein P_3 comprises P_3^c, representing the polished-text condition-related prompt, and P_3^t, representing the polished-text task-related prompt.
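The two-stage prompting of steps S4 and S5 can be sketched as below. `call_llm` is a stand-in for any multimodal large-language-model API (not a real endpoint), and the prompt wording is purely illustrative:

```python
def call_llm(image, prompt, context=""):
    """Stand-in for a multimodal LLM call; a real system would send the
    image, prompt, and context to a model endpoint and return its completion."""
    return f"[LLM output for prompt: {prompt!r}]"

def generate_description(image, scene_graph, ranked_objects):
    # Step S4: second prompt = condition part + task part -> initial description
    p2 = ("Conditions: use the scene graph and the object importance ranking. "
          "Task: write an initial description of the image.")
    initial = call_llm(image, p2, context=f"{scene_graph} {ranked_objects}")

    # Step S5: third prompt guides further reasoning and expansion -> polished text
    p3 = ("Conditions: stay faithful to the initial description. "
          "Task: expand and polish it into the final description.")
    return call_llm(image, p3, context=initial)

final_text = generate_description("image.jpg", scene_graph="G",
                                  ranked_objects=["dog", "grass"])
print(final_text)
```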
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411756885.8A CN119227816B (en) | 2024-12-03 | 2024-12-03 | A method for generating image description text based on large language model |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN119227816A true CN119227816A (en) | 2024-12-31 |
| CN119227816B CN119227816B (en) | 2025-04-18 |
Family
ID=94046770
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202411756885.8A Active CN119227816B (en) | 2024-12-03 | 2024-12-03 | A method for generating image description text based on large language model |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN119227816B (en) |
Citations (11)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2000028467A1 (en) * | 1998-11-06 | 2000-05-18 | The Trustees Of Columbia University In The City Of New York | Image description system and method |
| US6847980B1 (en) * | 1999-07-03 | 2005-01-25 | Ana B. Benitez | Fundamental entity-relationship models for the generic audio visual data signal description |
| CN111598183A (en) * | 2020-05-22 | 2020-08-28 | 上海海事大学 | Multi-feature fusion image description method |
| CN111858954A (en) * | 2020-06-29 | 2020-10-30 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Task-Oriented Text Generative Image Network Model |
| US20210374349A1 (en) * | 2020-09-21 | 2021-12-02 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method for text generation, device and storage medium |
| CN116384403A (en) * | 2023-04-19 | 2023-07-04 | 东北大学 | A Scene Graph Based Multimodal Social Media Named Entity Recognition Method |
| WO2023134073A1 (en) * | 2022-01-11 | 2023-07-20 | 平安科技(深圳)有限公司 | Artificial intelligence-based image description generation method and apparatus, device, and medium |
| CN116977774A (en) * | 2023-04-21 | 2023-10-31 | 腾讯科技(深圳)有限公司 | Image generation method, device, equipment and medium |
| CN117036545A (en) * | 2023-07-17 | 2023-11-10 | 北京林业大学 | Image scene feature-based image description text generation method and system |
| CN118865388A (en) * | 2024-07-09 | 2024-10-29 | 杭州电子科技大学 | Image detailed description method based on large model fusion and refined scene graph thinking chain |
| CN118966165A (en) * | 2024-08-09 | 2024-11-15 | 哈尔滨思和信息技术股份有限公司 | Text generation method, device, electronic device and storage medium |
Non-Patent Citations (3)
| Title |
|---|
| WENTIAN ZHAO ET AL: "Boosting Entity-Aware Image Captioning With Multi-Modal Knowledge Graph", IEEE Transactions on Multimedia, 11 August 2023 (2023-08-11), pages 2659 - 2670 * |
| LAN Hong; LIU Qinyi: "Scene graph to image generation model with graph attention network", Journal of Image and Graphics, no. 08, 12 August 2020 (2020-08-12), pages 83 - 95 * |
| LI Zhixin; WEI Haiyang; HUANG Feicheng; ZHANG Canlong; MA Huifang; SHI Zhongzhi: "Image caption generation combining visual features and scene semantics", Chinese Journal of Computers, no. 09, 15 September 2020 (2020-09-15), pages 38 - 54 * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN119227816B (en) | 2025-04-18 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN114092707B (en) | Image-text visual question answering method, system and storage medium | |
| CN115861995B (en) | Visual question-answering method and device, electronic equipment and storage medium | |
| WO2022007685A1 (en) | Method and device for text-based image generation | |
| CN117033609B (en) | Text visual question-answering method, device, computer equipment and storage medium | |
| CN108804530B (en) | Subtitling areas of an image | |
| EP3757905A1 (en) | Deep neural network training method and apparatus | |
| WO2022007823A1 (en) | Text data processing method and device | |
| CN116258147B (en) | A Multimodal Comment Sentiment Analysis Method and System Based on Heterogeneous Graph Convolution | |
| CN120148072B (en) | A re-identification model training method and system based on noise robust cue learning framework | |
| CN113095072B (en) | Text processing method and device | |
| CN114398471A (en) | Visual question-answering method based on deep reasoning attention mechanism | |
| WO2021137942A1 (en) | Pattern generation | |
| CN118865388A (en) | Image detailed description method based on large model fusion and refined scene graph thinking chain | |
| WO2021129410A1 (en) | Method and device for text processing | |
| CN113221882A (en) | Image text aggregation method and system for curriculum field | |
| CN114840680A (en) | Entity relationship joint extraction method, device, storage medium and terminal | |
| CN114627312A (en) | Zero sample image classification method, system, equipment and storage medium | |
| CN115223171B (en) | Text recognition method, device, equipment and storage medium | |
| Choi | CNN output optimization for more balanced classification | |
| CN115858816A (en) | Construction method and system of intelligent agent cognitive map for public security field | |
| CN119227816B (en) | A method for generating image description text based on large language model | |
| CN117934854A (en) | Traffic scene continuous semantic segmentation method based on diffusion model and dual generator | |
| CN117609536A (en) | Language-guided referential expression understanding reasoning network system and reasoning method | |
| CN113722477B (en) | Internet citizen emotion recognition method and system based on multitask learning and electronic equipment | |
| CN116955565A (en) | Method and system for generating diversity problem based on syntactic dependency graph joint embedding |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||