CN119227816A - A method for generating image description text based on large language model - Google Patents
- Publication number
- CN119227816A (application CN202411756885.8A)
- Authority
- CN
- China
- Prior art keywords
- image
- text
- description
- language model
- representing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/761—Proximity, similarity or dissimilarity measures
Abstract
The invention discloses an image description text generation method based on a large language model. The method constructs an image description text generation model to obtain the entity information and the entity-relation scene graph of an image; the entity information and the scene graph are then fed into a large language model to obtain an initial description of the image, which is further polished ("rendered") to obtain the final rendered text. The invention can accurately identify and describe each object in an image and clearly express the complex interaction relationships between objects. The chain-of-thought style of description generation not only improves the accuracy of the generated descriptions but also makes them better match human cognitive habits. Through step-by-step guidance, the resulting description conveys the overall information and intent of the image more effectively.
Description
Technical Field
The invention belongs to the technical field of image text processing, and particularly relates to an image description text generation method based on a large language model.
Background
Image description generation is a complex task that combines computer vision and natural language processing, with the core goal of enabling a computer to accurately identify and understand the content of an image and to generate a description in natural language. Progress in this technology is significant for academic research and has broad prospects in practical applications, such as automatic picture annotation, assisting visually impaired people, and improving the performance of image search engines. With continued technical innovation and optimization, image description generation will play an increasingly important role.
The first challenge in image description generation is recognizing the objects in an image. This relies on computer vision techniques such as convolutional neural networks (CNNs), which progressively refine the understanding of the elements in an image through layer-by-layer feature extraction but remain limited. The datasets currently used for the image captioning task, while containing large numbers of images and descriptions, still have limitations. For example, certain items may appear infrequently in a dataset, so a model does not adequately learn their features and descriptions during training. In addition, in some images items may be occluded by or overlap with other items, which makes them harder to identify.
Furthermore, image description generation requires that the computer not only recognize the objects in an image but also understand how these objects interact. For example, in a picture in which a cat sits on a table, the computer needs to recognize not only the "cat" and the "table" but also understand the relationship "cat on table". If the interaction relationships between objects cannot be accurately identified, the model cannot generate an accurate image description.
Based on this, recurrent neural networks (RNNs) and long short-term memory networks (LSTMs) have played an important role: they can generate fluent, grammatical descriptions from the information in an image. More recently, research has introduced Transformer models such as BERT and GPT, which exhibit greater effectiveness and flexibility in generating natural language text.
However, the text descriptions generated by current image captioning systems tend to be stiff, overly literal, and lacking in vividness. Human image descriptions, by contrast, are often not limited to the information in the image itself but are enriched by imagination and association. The image description generation task therefore requires not only generating an accurate description but also making that description more vivid and concrete.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides an image description text generation method based on a large language model, which comprises the following steps:
step S1, acquiring a plurality of images and their corresponding labels, and constructing a memory bank from the images and labels;
step S2, constructing an image description text generation model, wherein the model comprises an object extraction module, a relation extraction module, an initial description text generation module and a text rendering module;
step S3, importing the image I to be described and the entity information obtained by the object extraction module in step S2 into the relation extraction module to obtain the corresponding entity-relation scene graph;
step S4, importing the image I, the entity information from step S2 and the entity-relation scene graph from step S3 into the initial description text generation module to obtain the corresponding initial description;
step S5, importing the image I and the initial description from step S4 into the text rendering module to obtain the final rendered text.
Further, the step S1 specifically includes:
Each acquired image is encoded by an image encoder to obtain its image embedding, expressed as:
E_i = Enc(I_i), i = 1, 2, …, N;
E = {E_1, E_2, …, E_N};
wherein Enc(·) denotes the image encoding operation, I_i denotes the i-th sample image, E_i denotes the i-th image embedding, N denotes the total number of images in the memory bank, and E denotes the set of image embeddings;
the image embeddings and their corresponding labels are stored in the form of key-value pairs, with the image embedding as key and the label as value; the set of key-value pairs forms the memory bank, expressed as:
(k_i, v_i) = (E_i, label_i);
M = {(k_1, v_1), (k_2, v_2), …, (k_N, v_N)};
wherein M denotes the memory bank, k_i denotes the i-th key, v_i denotes the i-th value, E_i denotes the i-th image embedding, and label_i denotes the label corresponding to the i-th image embedding.
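The memory-bank construction of step S1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: `toy_encoder` is a stand-in for the real image encoder Enc(·), and the bank is a plain list of (key, value) pairs.

```python
import numpy as np

def build_memory_bank(images, labels, encoder):
    """Step S1 sketch: store (image-embedding, label) key-value pairs."""
    bank = []
    for img, label in zip(images, labels):
        key = encoder(img)          # key  k_i: the image embedding
        bank.append((key, label))   # value v_i: the corresponding label
    return bank

def toy_encoder(img):
    """Stand-in for the real image encoder: mean-pool pixels to a 3-vector."""
    return img.reshape(-1, 3).mean(axis=0)

images = [np.full((4, 4, 3), c) for c in (0.1, 0.5, 0.9)]
labels = ["cat", "table", "lamp"]
bank = build_memory_bank(images, labels, toy_encoder)
print(len(bank))  # 3
```

In practice the encoder would be a pretrained vision model, and the bank would typically be an indexed matrix of embeddings rather than a Python list.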
Further, the object extraction module in step S2 includes an image segmentation module and an image encoder;
Inputting the image I to be described into the image segmentation module yields the bounding boxes and masks of all objects in the image;
each mask is binarized: pixel values in the region containing the object are set to 1 and pixel values elsewhere are set to 0; the mask is then multiplied element-wise with the image I to obtain the matting (cut-out) of each object, expressed as:
{B_i, M_i} = Seg(I), i = 1, 2, …, n;
C_i = M_i ⊙ I;
wherein Seg(·) denotes the image segmentation operation, B_i and M_i denote the bounding box and mask of the i-th object, C_i is the matting of the i-th object, n is the number of objects in the image, and C = {C_1, C_2, …, C_n} denotes the set of object mattings;
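The binarize-and-multiply matting step can be sketched with NumPy; `extract_matting` and the toy 2×2 image are illustrative names and data, not taken from the patent.

```python
import numpy as np

def extract_matting(image, masks):
    """Binarize each mask (object pixels -> 1, background -> 0), then
    multiply element-wise with the image to cut each object out."""
    mattings = []
    for m in masks:
        binary = (m > 0).astype(image.dtype)          # 1 where the object is
        mattings.append(image * binary[..., None])    # broadcast over channels
    return mattings

# Toy 2x2 RGB image and a mask covering only the top-left pixel.
image = np.arange(2 * 2 * 3, dtype=float).reshape(2, 2, 3)
mask = np.array([[1, 0], [0, 0]])
cut = extract_matting(image, [mask])[0]
print(cut[0, 0])  # the kept object pixel, equal to image[0, 0]
print(cut[1, 1])  # a background pixel, zeroed out
```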
the matting of each object is passed through the image encoder to obtain its corresponding image embedding, giving the total set of object embeddings, expressed as:
F_i = Enc(C_i), F = {F_1, F_2, …, F_n};
the embedding F_i of each object is then compared by cosine similarity against the keys of all image embeddings in the memory bank M of step S1, expressed as:
s_{i,j} = cos(F_i, k_j) = (F_i · k_j) / (‖F_i‖ ‖k_j‖);
wherein cos(·, ·) denotes the cosine similarity function and s_{i,j} denotes the similarity between the i-th object embedding and the j-th key;
The key k_j with the highest similarity is found and its corresponding value v_j is retrieved; this identifies the image in the memory bank most similar to the input image I, and the retrieved label is assigned to the object as the result of object recognition, i.e. the entity information O, expressed as:
j = argmax_j s_{i,j};
o_i = v_j;
O = {o_1, o_2, …, o_n};
wherein j is the index of the key with the highest similarity, argmax denotes the maximum-selection function, and o_n denotes the label corresponding to the n-th object in image I.
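Retrieval by cosine similarity over the memory-bank keys (the argmax over s_{i,j}) might look like the sketch below; `retrieve_label` is an assumed helper name and the two-entry bank is toy data.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_label(query, bank):
    """Return the value (label) of the key most similar to the query:
    j = argmax_j cos(F, k_j), then o = v_j."""
    sims = [cosine(query, key) for key, _ in bank]
    j = int(np.argmax(sims))
    return bank[j][1]

bank = [(np.array([1.0, 0.0]), "cat"),
        (np.array([0.0, 1.0]), "table")]
print(retrieve_label(np.array([0.9, 0.1]), bank))  # cat
```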
Further, step S3 specifically constructs a relevant prompt P_1 based on the identified entity information and inputs it into a large language model to obtain the interaction relations among the objects in the image I to be described, including the relative positions, actions and states of the objects;
wherein the prompt P_1 designed for the large language model is:
P_1 = {P_cond, O, P_task};
wherein P_cond denotes the interaction-condition prompt used to set the generation conditions, O denotes the entity information, and P_task denotes the interaction-relation task prompt used to set the generation requirements;
the image and the prompt P_1 are input into the large language model to generate the object-relation scene graph G, expressed as:
G = LLM(V(I), T(P_1));
wherein LLM is the large language model, V is the visual encoder, and T is the text encoder;
further, in the process of constructing the object-relation scene graph, the objects are ranked by importance: each object is scored, the objects are sorted by score, and it is determined which objects are the main subjects of the image and which are secondary;
the constraints on the ranking are that an object in the foreground scores higher than one in the background, and/or an object closer to the center of the image I scores higher, and/or an object occupying a larger area of the image I scores higher, and/or an object with more interactions with other objects scores higher;
according to these constraints, an importance score S_i is calculated for each object o_i, expressed as:
S_i = −α · dist(o_i) + β · area(o_i) + deg(o_i);
wherein dist(·) is a calculation function giving the Euclidean distance between the object's position center and the center of the whole image I, area(·) is a calculation function giving the area of the mask corresponding to the object, α and β are hyperparameters, and deg(·) is a calculation function giving the degree of the object in the object-relation scene graph;
the objects are sorted from highest to lowest importance score S_i, and the object list is updated according to this ordering to obtain the updated objects O';
The object position center is calculated as follows:
according to the bounding box acquired in step S2, assume that the position coordinates of the four vertices of the bounding box corresponding to an object o_i are:
(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4);
the coordinates of its center position are then:
(x_c, y_c) = ((x_1 + x_2 + x_3 + x_4) / 4, (y_1 + y_2 + y_3 + y_4) / 4);
assume that the position coordinates of the geometric center of the whole image I are (x_0, y_0);
then:
dist(o_i) = √((x_c − x_0)² + (y_c − y_0)²).
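The center and distance computation, together with one score combining the stated cues, can be sketched as follows. The exact weighting in the patent's score formula is not recoverable here, so `importance_score` is an assumed combination that is merely consistent with the stated constraints: closer to the image center, larger mask area, and higher scene-graph degree all raise the score.

```python
import math

def box_center(box):
    """Center of a bounding box given as (x1, y1, x2, y2) corner coordinates."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def importance_score(box, mask_area, degree, image_center,
                     alpha=1.0, beta=1.0, gamma=1.0):
    """Hypothetical score: distance to the image center lowers the score,
    while mask area and scene-graph degree raise it."""
    cx, cy = box_center(box)
    d = math.hypot(cx - image_center[0], cy - image_center[1])
    return alpha / (1.0 + d) + beta * mask_area + gamma * degree

# A centered object should outrank an identical object in a corner.
central = importance_score((0, 0, 4, 4), 0.5, 1, (2, 2))
corner = importance_score((0, 0, 1, 1), 0.5, 1, (2, 2))
print(central > corner)  # True
```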
further, the step S4 specifically includes:
constructing a second prompt P_2; through P_2, the generated object-relation scene graph G and the updated objects O' are input, together with the image I to be described, into the initial description text generation module, which generates the initial description D based on the large language model, expressed as:
P_2 = {P'_cond, G, O', P'_task};
D = LLM(V(I), T(P_2));
wherein P'_cond denotes the initial-description condition prompt and P'_task denotes the initial-description task prompt.
Further, the step S5 specifically includes:
Constructing a third prompt P_3 that guides the text rendering module, based on the large language model, to further reason over and expand the initial description D; combined with the image I to be described, the final rendered text D* is obtained, expressed as:
P_3 = {P''_cond, D, P''_task};
D* = LLM(V(I), T(P_3));
wherein P''_cond denotes the rendering-condition prompt and P''_task denotes the rendering-task prompt.
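The three-stage prompt chain of steps S3–S5 can be sketched as a pipeline; `llm(image, prompt)` is a hypothetical multimodal-model call, and the prompt strings are illustrative, not the patent's actual prompts P_1–P_3.

```python
def caption_pipeline(image, entities, llm):
    """Chain-of-thought captioning in three LLM calls (steps S3-S5).
    `llm(image, prompt)` stands in for a multimodal large-language-model
    call; it is an assumed interface, not a specific API."""
    # S3: relation extraction -> object-relation scene graph
    p1 = f"Entities: {entities}. Describe the relations (positions, actions, states) among them."
    scene_graph = llm(image, p1)
    # S4: initial description conditioned on the scene graph
    p2 = f"Scene graph: {scene_graph}. Write an initial description of the image."
    initial = llm(image, p2)
    # S5: polish ("render") the initial description into vivid final text
    p3 = f"Initial description: {initial}. Expand it into a vivid, detailed final description."
    return llm(image, p3)

# Stub LLM so the control flow can be exercised without a real model.
stub = lambda image, prompt: "LLM(" + prompt[:20] + "...)"
final = caption_pipeline("img.png", ["cat", "table"], stub)
print(final.startswith("LLM("))  # True
```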
The invention has the positive progress effects that:
1) The invention can accurately identify and describe each object in an image and clearly express the complex interaction relationships between objects. The chain-of-thought style of description generation improves the accuracy of the generated descriptions and makes them better match human cognitive habits. Through step-by-step guidance, the resulting description conveys the overall information and intent of the image more effectively.
2) Further, the invention segments the image into multiple objects and ranks their importance, determining which objects are the main subjects of the image and which are secondary. Once the description priority and importance ranking are determined, the main objects are described first and in detail, specifying their characteristics, positions, actions and other information. For example, when describing an image containing a person and a landscape, the person's appearance, posture and activity are described in detail first, followed by the main features of the landscape. Secondary objects are only briefly mentioned, without detailed elaboration. This style of description gives the image content a clear hierarchy with well-placed emphasis.
Drawings
FIG. 1 is a flow chart of steps of a method for generating image description text based on a large language model of the present invention.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the disclosure of this specification, which describes embodiments of the invention with reference to specific examples. The invention may also be practiced or applied in other, different embodiments, and the details in this specification may be modified or varied in various ways without departing from the spirit and scope of the invention.
Referring to fig. 1, a method for generating image description text based on a large language model, in an example, includes the steps of:
step S1, acquiring a plurality of images and their corresponding labels, and constructing a memory bank from the images and labels;
step S2, constructing an image description text generation model, wherein the model comprises an object extraction module, a relation extraction module, an initial description text generation module and a text rendering module;
step S3, importing the image I to be described and the entity information obtained by the object extraction module in step S2 into the relation extraction module to obtain the corresponding entity-relation scene graph;
step S4, importing the image I, the entity information from step S2 and the entity-relation scene graph from step S3 into the initial description text generation module to obtain the corresponding initial description;
step S5, importing the image I and the initial description from step S4 into the text rendering module to obtain the final rendered text.
Further, in an example, step S1 is specifically:
Each acquired image is encoded by an image encoder to obtain its image embedding, expressed as:
E_i = Enc(I_i), i = 1, 2, …, N;
E = {E_1, E_2, …, E_N};
wherein Enc(·) denotes the image encoding operation, I_i denotes the i-th sample image, E_i denotes the i-th image embedding, N denotes the total number of images in the memory bank, and E denotes the set of image embeddings;
the image embeddings and their corresponding labels are stored in the form of key-value pairs, with the image embedding as key and the label as value; the set of key-value pairs forms the memory bank, expressed as:
(k_i, v_i) = (E_i, label_i);
M = {(k_1, v_1), (k_2, v_2), …, (k_N, v_N)};
wherein M denotes the memory bank, k_i denotes the i-th key, v_i denotes the i-th value, E_i denotes the i-th image embedding, and label_i denotes the label corresponding to the i-th image embedding.
Further, the object extraction module in step S2 includes an image segmentation module and an image encoder;
Inputting the image I to be described into the image segmentation module yields the bounding boxes and masks of all objects in the image;
each mask is binarized: pixel values in the region containing the object are set to 1 and pixel values elsewhere are set to 0; the mask is then multiplied element-wise with the image I to obtain the matting (cut-out) of each object, expressed as:
{B_i, M_i} = Seg(I), i = 1, 2, …, n;
C_i = M_i ⊙ I;
wherein Seg(·) denotes the image segmentation operation, B_i and M_i denote the bounding box and mask of the i-th object, C_i is the matting of the i-th object, n is the number of objects in the image, and C = {C_1, C_2, …, C_n} denotes the set of object mattings;
the matting of each object is passed through the image encoder to obtain its corresponding image embedding, giving the total set of object embeddings, expressed as:
F_i = Enc(C_i), F = {F_1, F_2, …, F_n};
the embedding F_i of each object is then compared by cosine similarity against the keys of all image embeddings in the memory bank M of step S1, expressed as:
s_{i,j} = cos(F_i, k_j) = (F_i · k_j) / (‖F_i‖ ‖k_j‖);
wherein cos(·, ·) denotes the cosine similarity function and s_{i,j} denotes the similarity between the i-th object embedding and the j-th key;
The key k_j with the highest similarity is found and its corresponding value v_j is retrieved; this identifies the image in the memory bank most similar to the input image I, and the retrieved label is assigned to the object as the result of object recognition, i.e. the entity information O, expressed as:
j = argmax_j s_{i,j};
o_i = v_j;
O = {o_1, o_2, …, o_n};
wherein j is the index of the key with the highest similarity, argmax denotes the maximum-selection function, and o_n denotes the label corresponding to the n-th object in image I.
Further, in one example, step S3 specifically constructs a relevant prompt P_1 based on the identified entity information and inputs it into a large language model to obtain the interaction relations among the objects in the image I to be described, including the relative positions, actions and states of the objects;
wherein the prompt P_1 designed for the large language model is:
P_1 = {P_cond, O, P_task};
wherein P_cond denotes the interaction-condition prompt used to set the generation conditions, O denotes the entity information, and P_task denotes the interaction-relation task prompt used to set the generation requirements;
the image and the prompt P_1 are input into the large language model to generate the object-relation scene graph G, expressed as:
G = LLM(V(I), T(P_1));
wherein LLM is the large language model, V is the visual encoder, and T is the text encoder;
further, in one example, in the process of building the object-relation scene graph, the objects are ranked by importance: each object is scored, the objects are sorted by score, and it is determined which objects are the main subjects of the image and which are secondary;
the constraints on the ranking are that an object in the foreground scores higher than one in the background, and/or an object closer to the center of the image I scores higher, and/or an object occupying a larger area of the image I scores higher, and/or an object with more interactions with other objects scores higher;
according to these constraints, an importance score S_i is calculated for each object o_i, expressed as:
S_i = −α · dist(o_i) + β · area(o_i) + deg(o_i);
wherein dist(·) is a calculation function giving the Euclidean distance between the object's position center and the center of the whole image I, area(·) is a calculation function giving the area of the mask corresponding to the object, α and β are hyperparameters, and deg(·) is a calculation function giving the degree of the object in the object-relation scene graph;
the objects are sorted from highest to lowest importance score S_i, and the object list is updated according to this ordering to obtain the updated objects O';
The object position center is calculated as follows:
according to the bounding box acquired in step S2, assume that the position coordinates of the four vertices of the bounding box corresponding to an object o_i are:
(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4);
the coordinates of its center position are then:
(x_c, y_c) = ((x_1 + x_2 + x_3 + x_4) / 4, (y_1 + y_2 + y_3 + y_4) / 4);
assume that the position coordinates of the geometric center of the whole image I are (x_0, y_0);
then:
dist(o_i) = √((x_c − x_0)² + (y_c − y_0)²).
further, in an example, step S4 is specifically:
constructing a second prompt P_2; through P_2, the generated object-relation scene graph G and the updated objects O' are input, together with the image I to be described, into the initial description text generation module, which generates the initial description D based on the large language model, expressed as:
P_2 = {P'_cond, G, O', P'_task};
D = LLM(V(I), T(P_2));
wherein P'_cond denotes the initial-description condition prompt and P'_task denotes the initial-description task prompt.
Further, in an example, step S5 is specifically:
Constructing a third prompt P_3 that guides the text rendering module, based on the large language model, to further reason over and expand the initial description D; combined with the image I to be described, the final rendered text D* is obtained, expressed as:
P_3 = {P''_cond, D, P''_task};
D* = LLM(V(I), T(P_3));
wherein P''_cond denotes the rendering-condition prompt and P''_task denotes the rendering-task prompt.
The present invention has been described in detail above with reference to the embodiments shown in the drawings, and those skilled in the art may make various modifications to the invention based on the above description. Accordingly, certain details of the embodiments are not to be construed as limiting the invention, which is defined solely by the appended claims.
Claims (6)
1. An image description text generation method based on a large language model is characterized by comprising the following steps:
step S1, acquiring a plurality of images and their corresponding labels, and constructing a memory bank from the images and labels;
step S2, constructing an image description text generation model, wherein the model comprises an object extraction module, a relation extraction module, an initial description text generation module and a text rendering module;
step S3, importing the image I to be described and the entity information obtained by the object extraction module in step S2 into the relation extraction module to obtain an object-relation scene graph;
step S4, importing the image I, the entity information from step S2 and the entity-relation scene graph from step S3 into the initial description text generation module to obtain an initial description;
step S5, importing the image I and the initial description from step S4 into the text rendering module to obtain a final rendered text.
2. The method for generating image description text based on large language model as claimed in claim 1, wherein step S1 specifically comprises:
Each acquired image is encoded by an image encoder to obtain its image embedding, expressed as:
E_i = Enc(I_i), i = 1, 2, …, N; E = {E_1, E_2, …, E_N};
wherein Enc(·) denotes the image encoding operation, I_i denotes the i-th sample image, E_i denotes the i-th image embedding, N denotes the total number of images in the memory bank, and E denotes the set of image embeddings;
the image embeddings and their corresponding labels are stored in the form of key-value pairs, with the image embedding as key and the label as value; the set of key-value pairs forms the memory bank, expressed as:
(k_i, v_i) = (E_i, label_i); M = {(k_1, v_1), …, (k_N, v_N)};
wherein M denotes the memory bank, k_i denotes the i-th key, v_i denotes the i-th value, E_i denotes the i-th image embedding, and label_i denotes the label corresponding to the i-th image embedding.
3. The method for generating image description text based on a large language model according to claim 2, wherein the object extraction module in step S2 includes an image segmentation module and an image encoder;
Inputting the image I to be described into the image segmentation module yields the bounding boxes and masks of all objects in the image;
each mask is binarized: pixel values in the region containing the object are set to 1 and pixel values elsewhere are set to 0; the mask is then multiplied element-wise with the image I to obtain the matting of each object, expressed as:
{B_i, M_i} = Seg(I), i = 1, 2, …, n; C_i = M_i ⊙ I;
wherein Seg(·) denotes the image segmentation operation, B_i and M_i denote the bounding box and mask of the i-th object, C_i is the matting of the i-th object, n is the number of objects in the image, and C = {C_1, C_2, …, C_n} denotes the set of object mattings;
the matting of each object is passed through the image encoder to obtain its corresponding image embedding, giving the total set of object embeddings, expressed as:
F_i = Enc(C_i), F = {F_1, F_2, …, F_n};
the embedding F_i of each object is then compared by cosine similarity against the keys of all image embeddings in the memory bank M of step S1, expressed as:
s_{i,j} = cos(F_i, k_j) = (F_i · k_j) / (‖F_i‖ ‖k_j‖);
wherein cos(·, ·) denotes the cosine similarity function and s_{i,j} denotes the similarity between the i-th object embedding and the j-th key;
The key k_j with the highest similarity is found and its corresponding value v_j is retrieved; this identifies the image in the memory bank most similar to the input image I, and the retrieved label is assigned to the object as the result of object recognition, i.e. the entity information O, expressed as:
j = argmax_j s_{i,j}; o_i = v_j; O = {o_1, o_2, …, o_n};
wherein j is the index of the key with the highest similarity, argmax denotes the maximum-selection function, and o_n denotes the label corresponding to the n-th object in image I.
4. The method for generating image description text based on a large language model as set forth in claim 3, wherein step S3 specifically constructs a relevant prompt P_1 based on the identified entity information and inputs it into a large language model to obtain the interaction relations among the objects in the image I to be described, including the relative positions, actions and states of the objects;
wherein the prompt P_1 designed for the large language model is:
P_1 = {P_cond, O, P_task};
wherein P_cond denotes the interaction-condition prompt used to set the generation conditions, O denotes the entity information, and P_task denotes the interaction-relation task prompt used to set the generation requirements;
the image and the prompt P_1 are input into the large language model to generate the object-relation scene graph G, expressed as:
G = LLM(V(I), T(P_1));
wherein LLM is the large language model, V is the visual encoder, and T is the text encoder;
further, in the process of constructing the object-relation scene graph, the objects are ranked by importance: each object is scored, the objects are sorted by score, and it is determined which objects are the main subjects of the image and which are secondary;
the constraints on the ranking are that an object in the foreground scores higher than one in the background, and/or an object closer to the center of the image I scores higher, and/or an object occupying a larger area of the image I scores higher, and/or an object with more interactions with other objects scores higher;
according to these constraints, an importance score S_i is calculated for each object o_i, expressed as:
S_i = −α · dist(o_i) + β · area(o_i) + deg(o_i);
wherein dist(·) is a calculation function giving the Euclidean distance between the object's position center and the center of the whole image I, area(·) is a calculation function giving the area of the mask corresponding to the object, α and β are hyperparameters, and deg(·) is a calculation function giving the degree of the object in the object-relation scene graph;
the objects are sorted from highest to lowest importance score S_i, and the object list is updated according to this ordering to obtain the updated objects O';
The object position center is calculated as follows:

According to the bounding box acquired in step S2, assume that the position coordinates of the four vertices of the bounding box corresponding to an object o_i are (x_1, y_1), (x_2, y_2), (x_3, y_3) and (x_4, y_4);

the coordinates of its center position are:

x_c = (x_1 + x_2 + x_3 + x_4) / 4, y_c = (y_1 + y_2 + y_3 + y_4) / 4;

assume that the position coordinates of the geometric center of the whole image I are (x_I, y_I); then:

Dist(o_i) = sqrt((x_c − x_I)² + (y_c − y_I)²);
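The importance ranking described above can be sketched as follows. The combined weighting form (an inverse-distance term plus weighted area and degree terms), the hyperparameter values, and the two sample objects are assumptions for illustration:

```python
import math

def bbox_center(box):
    """Center of a bounding box given its four vertex coordinates."""
    xs, ys = zip(*box)
    return sum(xs) / 4.0, sum(ys) / 4.0

def importance_score(box, mask_area, degree, image_center,
                     alpha=1.0, beta=1.0, gamma=1.0):
    """Higher score: closer to the image center, larger mask area,
    more interactions (scene-graph degree). alpha/beta/gamma are
    hyperparameters weighting the three constraints."""
    cx, cy = bbox_center(box)
    dist = math.hypot(cx - image_center[0], cy - image_center[1])
    return alpha / (1.0 + dist) + beta * mask_area + gamma * degree

# Rank two hypothetical objects in a 100x100 image (center (50, 50)):
# a centered dog with two relations vs. background grass with one.
objs = {
    "dog":   dict(box=[(40, 40), (60, 40), (60, 60), (40, 60)],
                  mask_area=0.30, degree=2),
    "grass": dict(box=[(0, 80), (100, 80), (100, 100), (0, 100)],
                  mask_area=0.20, degree=1),
}
ranked = sorted(objs,
                key=lambda k: importance_score(image_center=(50, 50), **objs[k]),
                reverse=True)
print(ranked)  # → ['dog', 'grass']: the centered, well-connected dog ranks first
```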
5. The method for generating image description text based on a large language model as claimed in claim 1, wherein step S4 specifically comprises:

constructing a second related prompt P_2; through the second related prompt P_2, the generated object-relation scene graph G and the updated objects O are input, together with the image I requiring text description, into the large-language-model-based initial description text generation module to generate the initial description D_0, expressed as:

D_0 = LLM(E_v(I), E_t(P_2), G, O)

wherein P_2 comprises P_2^c, representing the initial-description condition-related prompt, and P_2^t, representing the initial-description task-related prompt.
6. The method for generating image description text based on a large language model as claimed in claim 1, wherein step S5 specifically comprises:

constructing a third related prompt P_3 to guide the large-language-model-based text polishing module to further reason over and expand the initial description D_0, combined with the image I requiring text description, to obtain the final polished text D_r, expressed as:

D_r = LLM(E_v(I), E_t(P_3), D_0)

wherein P_3 comprises P_3^c, representing the polished-text condition-related prompt, and P_3^t, representing the polished-text task-related prompt.
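The two-stage prompting of steps S4 and S5 can be sketched as below. `call_llm` is a stand-in for any multimodal large-language-model API (not a real endpoint), and the prompt wording is purely illustrative:

```python
def call_llm(image, prompt, context=""):
    """Stand-in for a multimodal LLM call; a real system would send the
    image, prompt, and context to a model endpoint and return its completion."""
    return f"[LLM output for prompt: {prompt!r}]"

def generate_description(image, scene_graph, ranked_objects):
    # Step S4: second prompt = condition part + task part -> initial description
    p2 = ("Conditions: use the scene graph and the object importance ranking. "
          "Task: write an initial description of the image.")
    initial = call_llm(image, p2, context=f"{scene_graph} {ranked_objects}")

    # Step S5: third prompt guides further reasoning and expansion -> polished text
    p3 = ("Conditions: stay faithful to the initial description. "
          "Task: expand and polish it into the final description.")
    return call_llm(image, p3, context=initial)

final_text = generate_description("image.jpg", scene_graph="G",
                                  ranked_objects=["dog", "grass"])
print(final_text)
```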
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411756885.8A CN119227816B (en) | 2024-12-03 | 2024-12-03 | A method for generating image description text based on large language model |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN119227816A true CN119227816A (en) | 2024-12-31 |
| CN119227816B CN119227816B (en) | 2025-04-18 |
Family
ID=94046770
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202411756885.8A Active CN119227816B (en) | 2024-12-03 | 2024-12-03 | A method for generating image description text based on large language model |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN119227816B (en) |
Citations (11)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2000028467A1 (en) * | 1998-11-06 | 2000-05-18 | The Trustees Of Columbia University In The City Of New York | Image description system and method |
| US6847980B1 (en) * | 1999-07-03 | 2005-01-25 | Ana B. Benitez | Fundamental entity-relationship models for the generic audio visual data signal description |
| CN111598183A (en) * | 2020-05-22 | 2020-08-28 | 上海海事大学 | Multi-feature fusion image description method |
| CN111858954A (en) * | 2020-06-29 | 2020-10-30 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Task-Oriented Text Generative Image Network Model |
| US20210374349A1 (en) * | 2020-09-21 | 2021-12-02 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method for text generation, device and storage medium |
| CN116384403A (en) * | 2023-04-19 | 2023-07-04 | 东北大学 | A Scene Graph Based Multimodal Social Media Named Entity Recognition Method |
| WO2023134073A1 (en) * | 2022-01-11 | 2023-07-20 | 平安科技(深圳)有限公司 | Artificial intelligence-based image description generation method and apparatus, device, and medium |
| CN116977774A (en) * | 2023-04-21 | 2023-10-31 | 腾讯科技(深圳)有限公司 | Image generation method, device, equipment and medium |
| CN117036545A (en) * | 2023-07-17 | 2023-11-10 | 北京林业大学 | Image scene feature-based image description text generation method and system |
| CN118865388A (en) * | 2024-07-09 | 2024-10-29 | 杭州电子科技大学 | Image detailed description method based on large model fusion and refined scene graph thinking chain |
| CN118966165A (en) * | 2024-08-09 | 2024-11-15 | 哈尔滨思和信息技术股份有限公司 | Text generation method, device, electronic device and storage medium |
Non-Patent Citations (3)
| Title |
|---|
| WENTIAN ZHAO ET AL: "Boosting Entity-Aware Image Captioning With Multi-Modal Knowledge Graph", IEEE Transactions on Multimedia, 11 August 2023 (2023-08-11), pages 2659 - 2670 * |
| LAN Hong; LIU Qinyi: "Scene graph to image generation model with graph attention network", Journal of Image and Graphics, no. 08, 12 August 2020 (2020-08-12), pages 83 - 95 * |
| LI Zhixin; WEI Haiyang; HUANG Feicheng; ZHANG Canlong; MA Huifang; SHI Zhongzhi: "Image caption generation combining visual features and scene semantics", Chinese Journal of Computers, no. 09, 15 September 2020 (2020-09-15), pages 38 - 54 * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN119227816B (en) | 2025-04-18 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN114092707B (en) | Image-text visual question answering method, system and storage medium | |
| CN115861995B (en) | Visual question-answering method and device, electronic equipment and storage medium | |
| WO2022007685A1 (en) | Method and device for text-based image generation | |
| CN117033609B (en) | Text visual question-answering method, device, computer equipment and storage medium | |
| CN108804530B (en) | Subtitling areas of an image | |
| EP3757905A1 (en) | Deep neural network training method and apparatus | |
| WO2022007823A1 (en) | Text data processing method and device | |
| CN116258147B (en) | A Multimodal Comment Sentiment Analysis Method and System Based on Heterogeneous Graph Convolution | |
| CN120148072B (en) | A re-identification model training method and system based on noise robust cue learning framework | |
| CN113095072B (en) | Text processing method and device | |
| CN114398471A (en) | Visual question-answering method based on deep reasoning attention mechanism | |
| WO2021137942A1 (en) | Pattern generation | |
| CN118865388A (en) | Image detailed description method based on large model fusion and refined scene graph thinking chain | |
| WO2021129410A1 (en) | Method and device for text processing | |
| CN113221882A (en) | Image text aggregation method and system for curriculum field | |
| CN114840680A (en) | Entity relationship joint extraction method, device, storage medium and terminal | |
| CN114627312A (en) | Zero sample image classification method, system, equipment and storage medium | |
| CN115223171B (en) | Text recognition method, device, equipment and storage medium | |
| Choi | CNN output optimization for more balanced classification | |
| CN115858816A (en) | Construction method and system of intelligent agent cognitive map for public security field | |
| CN119227816B (en) | A method for generating image description text based on large language model | |
| CN117934854A (en) | Traffic scene continuous semantic segmentation method based on diffusion model and dual generator | |
| CN117609536A (en) | Language-guided referential expression understanding reasoning network system and reasoning method | |
| CN113722477B (en) | Internet citizen emotion recognition method and system based on multitask learning and electronic equipment | |
| CN116955565A (en) | Method and system for generating diversity problem based on syntactic dependency graph joint embedding |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||