CN119106098A

CN119106098A - A video plot question-answering method and device based on RAG

Info

Publication number: CN119106098A
Application number: CN202411160716.8A
Authority: CN
Inventors: 方律; 周凌洋
Original assignee: Hefei Feier Intelligent Technology Co ltd
Current assignee: Anhui Feiling Technology Co.,Ltd.
Priority date: 2024-08-22
Filing date: 2024-08-22
Publication date: 2024-12-10

Abstract

A video plot question-and-answer method and device based on RAG, relating to the field of RAG. In the method, user content input by a user for video content is obtained; the user content is converted into a corresponding user vector; a first similarity between the user vector and a first target vector and a second similarity between the user vector and a second target vector are calculated, and a preset vector database includes multiple target vectors; if it is determined that the first similarity is greater than the second similarity, the first target text content corresponding to the first target vector is determined according to the preset vector database; the first target text content and the user content are input into a preset RAG model to generate reply content for the user content. By implementing the technical solution provided by the present application, user content and video files can be processed by vectorization, so that user questions can be answered for video files.

Description

Video scenario question answering method and device based on RAG

Technical Field

The application relates to the field of RAG (random access gateway), in particular to a video scenario question-answering method and device based on RAG.

Background

With the rapid growth of internet technology and explosive growth of video content, users are increasingly dependent on video platforms in acquiring information and entertainment content. The video content not only enriches the daily life of people, but also becomes an important knowledge acquisition way. However, in a large amount of video data, it is often difficult for a user to quickly acquire specific information as needed. The demand encourages the wide application of the video scenario question-answering system, and helps users to quickly acquire related information and answers in the process of watching videos.

Currently, related art is primarily dependent on rule matching and simple text retrieval techniques for user problems with user inputs. The related art can only process plain text contents, has limited parsing and understanding ability for complex multi-modal data (e.g., video), and cannot answer user-entered questions for video contents.

Therefore, there is a need for a video scenario question-answering method and device based on RAG.

Disclosure of Invention

The application provides a video scenario question-answering method and device based on RAG, which can analyze and understand video files and answer user questions aiming at the video files by vectorizing user contents and the video files.

The first aspect of the application provides a video scenario question-answering method based on RAG, which comprises the steps of obtaining user content input by a user aiming at video content, converting the user content into corresponding user vectors, calculating first similarity between the user vectors and first target vectors and second similarity between the user vectors and second target vectors, wherein the first target vectors are any target vectors in a preset vector database, the second target vectors are any target vectors except the first target vectors in the preset vector database, the preset vector database comprises a plurality of target vectors, the preset vector database is constructed in advance according to the video content, if the first similarity is larger than the second similarity, determining first target text content corresponding to the first target vectors according to the preset vector database, and inputting the first target text content and the user content into a preset RAG model to generate reply content aiming at the user content.

By adopting the technical scheme, the matching and understanding of the user input content and the video content can be realized by converting the content input by the user into the user vector and calculating the similarity between the user vector and the target vector in the preset vector database. And determining the best matched target vector and corresponding target text content by comparing the similarity. And generating reply content for the user content by inputting the user input content and the target text content into a preset RAG model. Thus, the video processing capability can be enhanced, and the video processing capability can more accurately and intelligently respond to the demands and questions of the user.

Optionally, before the user content input by the user is obtained, the method further comprises the steps of obtaining a video file corresponding to the video content, converting the video file into an audio file by adopting a transcoding tool, performing role separation on the audio file to obtain a role and audio content corresponding to the role, converting the audio content into text content, setting a corresponding role tag for the text content, extracting the video file by adopting a preset multi-mode big model, extracting a key frame, converting the key frame into background description and scenario description for the video file, integrating the text content, the role tag corresponding to the text content, the background description and the scenario description according to time sequence to obtain target text content, converting the target text content into a target vector, and storing the target vector, the target text content and the corresponding relation between the target vector and the target text content into a preset vector database.

By adopting the technical scheme, the video file is converted into the audio file by adopting the transcoding tool, so that preparation is made for subsequent audio processing and text conversion. And performing role separation on the audio files, and separating out the audio contents of different roles. Thus, the audio content of different roles in the video can be acquired. And converting the separated audio content into text content, setting a corresponding role label for the text content, and identifying which role the text content belongs to. This allows the text content of different roles to be distinguished in subsequent processing. And processing the video file by adopting a preset multi-mode large model, and extracting key frames of the video. These key frames are then translated into a background description and scenario description for the video file. And integrating the text content, the role labels, the background description and the scenario description according to the time sequence to obtain the target text content. In this way, the individual elements can be integrated together to form complete text describing the video content (target text content). And converting the target text content into a target vector, and storing the corresponding relation between the target vector and the target text content in a preset vector database. In this way, the target text content and the corresponding vector form thereof can be associated and stored, and subsequent processing is convenient.

The method comprises the steps of obtaining a history chat record corresponding to user content, extracting the history chat record to obtain history key information, splicing the user content and the history key information to obtain spliced target user content, and converting the target user content into the user vector.

By adopting the technical scheme, the user content currently input by the user is spliced with the history key information by acquiring the history chat record of the user and extracting the history key information, and the current request of the user is combined with the past interaction context, so that more comprehensive and accurate semantic representation is obtained. And converting the target user content into a representation form of the user vector, and facilitating subsequent similarity calculation and matching. Through vectorization, user content can be converted into a numerical representation that can be calculated and compared, thereby achieving more accurate similarity matching and reply generation.

Optionally, the historical chat record is extracted to obtain the historical key information, which specifically comprises the steps of carrying out entity identification on the historical chat record to obtain a corresponding historical entity, carrying out entity identification on the user content to obtain a corresponding content entity, and if the historical entity and the content entity are the same entity, determining the historical chat record corresponding to the historical entity as the historical key information.

By adopting the technical scheme, the historical entity and the content entity are compared and matched. If it is determined that the historical entity and the content entity are the same entity, i.e., they represent the same thing, then a historical chat record containing the historical entity may be determined as historical key information. The historical key information provides important contextual information related to the user's content, helping to better understand the user's current request or intent.

Optionally, extracting the video file by using a preset multi-mode large model to extract a key frame, and specifically comprises determining a plurality of image frames in the video file, calculating image similarity of a first image frame and a second image frame, wherein the first image frame and the second image frame are any two adjacent image frames in the plurality of image frames, and determining that the first image frame and the second image frame are key frames if the image similarity is smaller than a preset similarity threshold.

By adopting the technical scheme, the key frames in the video file can be determined by calculating the image similarity and applying the preset similarity threshold. Key frame extraction provides the basis for subsequent video content analysis tasks.

The method comprises the steps of obtaining a background description and a scenario description of a video file, wherein the background description and the scenario description of the video file are obtained through the steps of carrying out feature extraction on the key frame to obtain a corresponding visual feature vector, inputting the visual feature vector into a preset image description generating model to obtain text description of the key frame, carrying out keyword extraction on the text description to determine a corresponding keyword, inputting the keyword into a preset keyword library to obtain a description type corresponding to the keyword, wherein the description type comprises the background description and the scenario description, and the preset keyword library comprises the corresponding relation between the keyword and the description type.

By adopting the technical scheme, the visual feature vector is obtained by extracting the features of the key frames, and the visual feature vector is input into the preset image description generation model, so that the text description with strong pertinence and accuracy can be generated. The textual descriptions generated using the visual feature vectors may better reflect the video content, providing more detailed and rich background and scenario information. By extracting keywords and matching with a preset keyword library, the background description and the scenario description can be accurately distinguished. This refined description classification helps to better understand the different aspects of the video content, enhancing the overall content understanding capability.

Optionally, after inputting the first target text content and the user content into a preset RAG model and generating reply content for the user content, the method further comprises storing the user content and the reply content into a historical chat record.

By adopting the technical scheme, the user content and the reply content are stored in the history chat record, so that the history chat record can be referred in the subsequent dialogue, and more relevant and continuous replies are provided.

The application provides a video scenario question-answering device based on RAG, which comprises an acquisition module and a processing module, wherein the acquisition module is used for acquiring user content input by a user aiming at video content, the processing module is used for converting the user content into a corresponding user vector, the processing module is further used for calculating first similarity between the user vector and a first target vector and second similarity between the user vector and a second target vector, the first target vector is any target vector in a preset vector database, the second target vector is any target vector except the first target vector in the preset vector database, the preset vector database comprises a plurality of target vectors, the preset vector database is constructed in advance according to the video content, the processing module is further used for determining a first target text corresponding to the first target text according to the preset vector database if the first similarity is determined to be greater than the second similarity, and the processing module is further used for generating the first reply text aiming at the user content and the first target text input by the user content.

In a third aspect the application provides an electronic device comprising a processor, a memory for storing instructions, a user interface and a network interface, both for communicating with other devices, the processor being for executing instructions stored in the memory to cause the electronic device to perform a method as claimed in any one of the preceding claims.

In a fourth aspect of the application there is provided a computer readable storage medium storing instructions which, when executed, perform a method as claimed in any one of the preceding claims.

In summary, one or more technical solutions provided in the embodiments of the present application at least have the following technical effects or advantages:

1. The method can realize matching and understanding of the user input content and the video content by converting the user input content into the user vector and calculating the similarity between the user vector and the target vector in the preset vector database. And determining the best matched target vector and corresponding target text content by comparing the similarity. And generating reply content for the user content by inputting the user input content and the target text content into a preset RAG model. Thus, the video processing capability can be enhanced, and the video processing capability can more accurately and intelligently respond to the demands and questions of the user.

2. The video file is converted to an audio file using a transcoding tool, ready for subsequent audio processing and text conversion. And performing role separation on the audio files, and separating out the audio contents of different roles. Thus, the audio content of different roles in the video can be acquired. And converting the separated audio content into text content, setting a corresponding role label for the text content, and identifying which role the text content belongs to. This allows the text content of different roles to be distinguished in subsequent processing. And processing the video file by adopting a preset multi-mode large model, and extracting key frames of the video. These key frames are then translated into a background description and scenario description for the video file. And integrating the text content, the role labels, the background description and the scenario description according to the time sequence to obtain the target text content. In this way, the individual elements can be integrated together to form complete text describing the video content (target text content). And converting the target text content into a target vector, and storing the corresponding relation between the target vector and the target text content in a preset vector database. In this way, the target text content and the corresponding vector form thereof can be associated and stored, and subsequent processing is convenient.

3. By acquiring the historical chat record of the user and extracting the historical key information, the user content currently input by the user is spliced with the historical key information, and the current request of the user is combined with the past interaction context, so that more comprehensive and accurate semantic representation is obtained. And converting the target user content into a representation form of the user vector, and facilitating subsequent similarity calculation and matching. Through vectorization, user content can be converted into a numerical representation that can be calculated and compared, thereby achieving more accurate similarity matching and reply generation.

Drawings

Fig. 1 is a schematic flow chart of a video scenario question-answering method based on RAG according to an embodiment of the present application;

Fig. 2 is a schematic block diagram of a video scenario question-answering device based on RAG according to an embodiment of the present application;

Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Reference numerals illustrate 201, acquisition module, 202, processing module, 300, electronic device, 301, processor, 302, communication bus, 303, user interface, 304, network interface, 305, memory.

Detailed Description

In order that those skilled in the art will better understand the technical solutions in the present specification, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments.

In describing embodiments of the present application, words such as "for example" or "for example" are used to mean serving as examples, illustrations, or descriptions. Any embodiment or design described herein as "such as" or "for example" in embodiments of the application should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "or" for example "is intended to present related concepts in a concrete fashion.

In the description of embodiments of the application, the term "plurality" means two or more. For example, a plurality of systems means two or more systems, and a plurality of screen terminals means two or more screen terminals. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating an indicated technical feature. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.

The application provides a video scenario question-answering method based on RAG, and referring to FIG. 1, FIG. 1 is a flow diagram of the video scenario question-answering method based on RAG provided by the embodiment of the application. The method is applied to the server and comprises the following steps of S101 to S105:

Step S101, user content input by a user for video content is acquired.

Before step S101, the method further comprises the steps of obtaining a video file, converting the video file into an audio file by using a transcoding tool, performing role separation on the audio file to obtain audio content corresponding to roles, converting the audio content into text content, setting corresponding role labels for the text content, extracting the video file by using a preset multi-mode big model, extracting key frames, converting the key frames into background description and scenario description aiming at the video file, integrating the text content, the role labels, the background description and the scenario description corresponding to the text content according to time sequence to obtain target text content, converting the target text content into target vectors, and storing the target vectors, the target text content and the corresponding relations between the target vectors and the target text content into a preset vector database.

Specifically, the server obtains the video file uploaded or designated by the user through a network interface or a local file system. After the video file is obtained, the server calls a pre-configured transcoding tool to process the video file. The transcoding tool is FFmpeg of open source. The server converts the video file to an audio file through a command line interface or API of the transcoding tool. And the server performs role separation on the converted audio file. Character separation separates the sounds of different persons from the mixed audio signal. The server loads a pre-trained character separation model, such as a Deep convolutional network (Deep CNN), and uses the character separation model to infer the audio file. The character separation model automatically identifies voiceprints of different speakers in audio by analyzing the characteristics of the audio, such as frequency spectrum, fundamental frequency, formants, and the like, and separates the voiceprints into different audio tracks. The separated audio tracks correspond to different roles, each corresponding to a respective audio content, the audio content of each role being stored separately.

After the character audio content separation is completed, the server converts the audio content of each character into text content using voice recognition technology. The server invokes a pre-trained speech recognition model, such as DEEPSPEECH, WAV Letter, to recognize the audio content of each character. The speech recognition model converts speech signals into corresponding text sequences by analyzing acoustic and linguistic features of the audio content. The text content obtained through recognition corresponds to the corresponding character audio one by one, and character-text content pairs are formed. The server stores the text content, and sets a corresponding role tag for each text content segment for subsequent processing.

The server also needs to extract key information, including background descriptions and scenario descriptions, from the video file while acquiring character text content. This step is achieved by invoking a preset multi-modal large model. The preset multi-mode large model is an AI model for processing multiple mode information such as vision, voice and text simultaneously, for example ViLBERT. The server first breaks the video file into a series of image frames using a video analysis tool (e.g., openCV), and then selects key frames that can represent the main content of the video through a key frame extraction algorithm.

In one possible implementation, a preset multi-mode large model is adopted to extract a video file, and a key frame is extracted, and the method specifically comprises the steps of determining a plurality of image frames in the video file, calculating image similarity of a first image frame and a second image frame, wherein the first image frame and the second image frame are any two adjacent image frames in the plurality of image frames, and determining the first image frame and the second image frame as the key frame if the image similarity is smaller than a preset similarity threshold value.

Specifically, the server needs to decode the video file and acquire all the image frames therein. After all the image frames are acquired, the server needs to calculate the similarity between any two adjacent image frames. The similarity is used for measuring the similarity of two images on visual contents, and the image similarity calculation method comprises histogram comparison, feature point matching, perceptual hash and the like. In the scheme, the server calculates the image similarity by adopting a preset multi-mode large model. The preset multi-mode large model is a model based on deep learning, and can process data of multiple modes such as texts, images, audios and the like at the same time and learn association and mapping relations among the data. The server inputs two adjacent image frames (a first image frame and a second image frame) into a preset multi-mode large model, the preset multi-mode large model extracts high-level semantic features of the first image frame and the second image frame through a Convolutional Neural Network (CNN) technology, and then similarity between the two image frames is obtained through feature comparison and similarity calculation. The similarity is typically a real number between 0 and 1, indicating the degree of similarity of the two images. The larger the similarity, the closer the contents of the two images are, and the smaller the similarity, the larger the difference in contents of the two images is. After calculating the similarity of all the adjacent image frames, the server needs to determine which image frames are key frames according to the size of the similarity. The key frame is selected according to the image frames with larger content difference between the front image frame and the rear image frame, namely, the image frames with lower similarity. The server is internally preset with a preset similarity threshold value for judging whether the two image frames are different enough or not. If the similarity between the first image frame and the second image frame is less than the preset similarity threshold, the two image frames are considered to have larger changes in visual content and can be used as key frames. For example, assuming that the preset similarity threshold is 0.8, if the similarity of the first image frame and the second image frame is 0.7, which is smaller than the preset similarity threshold, both image frames are marked as key frames. The server traverses all adjacent image frame pairs, and performs threshold comparison on the similarity of the image frame pairs to finally obtain a group of key frames.

After the key frame is selected, the server inputs the key frame into a preset multi-mode large model for analysis. And carrying out feature extraction on the key frames by a preset multi-mode large model, and obtaining visual features of the images through a Convolutional Neural Network (CNN). The visual features of the preset multi-mode large model are aligned and fused with the text features to form multi-mode semantic representation. Based on the semantic representation, a pre-set multi-modal large model may generate a natural language description of the image content of the key frame, i.e., converting the key frame into a background description and a scenario description. Background descriptions generally relate to static information of scenes, environments, objects, etc. in an image, while scenario descriptions relate to dynamic information of actions, events, interactions, etc. of persons in an image.

In one possible implementation, the method for converting the key frame into the background description and the scenario description of the video file specifically comprises the steps of extracting features of the key frame to obtain corresponding visual feature vectors, inputting the visual feature vectors into a preset image description generating model to obtain text description of the key frame, extracting keywords of the text description to determine corresponding keywords, inputting the keywords into a preset keyword library to obtain description types corresponding to the keywords, wherein the description types comprise background description and scenario description, and the preset keyword library comprises corresponding relations between the keywords and the description types.

Specifically, the server inputs the image data of each key frame into a preset multi-mode large model, and the preset multi-mode large model converts the input image into a visual feature vector with a fixed length (4096 dimensions, for example). The visual feature vector highly concentrates the semantic content information of the image. The extracted visual feature vectors of all key frames are temporarily stored in a memory. Meanwhile, the server inputs the visual feature vectors of the key frames into a preset image description generation model in batches. The preset image description generation model adopts an encoder-decoder structure, takes a visual characteristic vector as input, and generates a corresponding natural language description text (text description). Meanwhile, the server establishes a mapping relation between the generated text description and the corresponding key frame. And the server calls a preset keyword extraction module for processing the text description generated by each key frame. The module uses algorithms based on TF-IDF, textRank, etc. to extract several keywords from the text description, which can generally be highly generalized to the core content of the description. And the server establishes a mapping relation between the extracted keywords and the corresponding key frames. And the server queries the keywords extracted from each key frame in a preset keyword library. Description types (background description or scenario description) corresponding to various common keywords are manually defined in a preset keyword library in advance. For example, "trees," "buildings," etc. generally correspond to background descriptions, and "run," "talk," etc. generally correspond to scenario descriptions. By matching the keywords, it can be determined whether a key frame is more likely to belong to the background description or the scenario description. Through the above processing steps, each key frame in the video gets two pieces of information, a background description (key word) and a scenario description (key word).

For example, assuming that a key frame is a picture of two people chatting in a park, the background description generated by the multimodal model may be "this is a sunny park with green trees, benches and fountains", while the scenario description may be "two young people chatting on benches, happy with ease, and seemingly discussing some interesting topics".

And finally, integrating all information extracted from the video, including text content, character labels, background descriptions, scenario descriptions and the like, according to a time sequence by the server to form a complete target text content. Since both the previous character separation and key frame extraction preserve time information, the server can align them according to the time stamps, ensuring that the integrated text content is synchronized with the video content on the time axis. The integrated target text content represents the complete semantic information of the video, but the server also needs to convert it into a semantic vector representation, i.e., a target vector, in order to facilitate subsequent retrieval and generation. The server encodes the target text content using natural language models, such as BERT, GPT, etc., and maps it into a high-dimensional semantic space. The encoding process fully utilizes self-attention mechanism in the transform structure, and can capture long-distance dependence and context information in text content. The target vector obtained after encoding is a real-value vector with fixed length, and contains semantic features of target text content.

In order to facilitate subsequent retrieval and matching, the server stores the target vector, the target text content and the correspondence between the target text content and the target text content in a preset vector database. The preset vector database is a database specially used for storing and retrieving high-dimensional vectors, such as Faiss, annoy, etc. The server takes the target vector as a key, takes the target text content as a value, and stores the target text content into a preset vector database in the form of key value pairs. When searching, the preset vector database can rapidly calculate the similarity between the query vector input by the user and all the target vectors, and returns the target text content most relevant to the query.

In step S101, in the video scenario question-answering method, the acquisition of the user content input by the user is the start point and the basis of the entire question-answering flow. The user content refers to a question or query about the video scenario, which is presented by the user, and is the basis for the server to perform subsequent processing and generate an answer. The server needs to provide an interactive interface for the user to enter questions or queries. The interface can be in various forms such as web pages, mobile applications, chat windows and the like, and the main purpose is to enable users to conveniently and intuitively present own problems. The interactive interface typically includes an input box for the user to enter questions in text form. The input box has a proper size and style to ensure that the user can clearly see the content entered by himself. For example, the server may provide a search box on the web page and display a prompt text "please enter your questions about the video scenario" above the search box. The user can enter his/her question in the search box, such as "who is the principal angle of the movie?

And step S102, converting the user content into corresponding user vectors.

In step S102, the user content is converted into a corresponding user vector, which concretely comprises the steps of obtaining a history chat record corresponding to the user content, extracting the history chat record to obtain history key information, splicing the user content and the history key information to obtain spliced target user content, and converting the target user content into the user vector.

In one possible implementation manner, the method for extracting the historical chat record to obtain the historical key information specifically comprises the steps of carrying out entity identification on the historical chat record to obtain a corresponding historical entity, carrying out entity identification on user content to obtain a corresponding content entity, and if the historical entity and the content entity are the same entity, determining the historical chat record corresponding to the historical entity as the historical key information.

Specifically, when a user inputs a question or inquires about content, the server processes it as user content. In order to better understand the context and background of the user content, the server needs to obtain a history chat record corresponding to the user content. The server identifies the current user through a user authentication mechanism and retrieves the user's historical chat log from the chat log database. The chat log database stores a complete history of each user interaction, including metadata of the user's questions, answers to the questions, timestamps, and the like.

After the history chat record is obtained, the server needs to extract the history key information most relevant to the current user content from the history chat record. This is achieved by entity recognition techniques. Entity identification may identify an entity (e.g., person name, place name, organization name, etc.) from text and determine its type. And the server respectively carries out entity identification on the historical chat record and the current user content to obtain a historical entity and a content entity.

The server performs the same processing and entity identification steps on the user content input by the current user, so as to obtain the content entities mentioned in the current input of the user, and form a content entity list. The server matches the historical entity list with the content entity list. The matching method is based on string equality, and the server checks whether the same entity exists in both lists.

For example, assume that the server obtains a history entity list of [ "Zhang Sang", "Beijing", "Microsoft corporation", "2022" ], and a content entity list of [ "Zhang Sang", "apple corporation", "san Francisco" ]. By matching, the server finds that the entity "Zhang Sanu" appears in both lists, meaning that the "Zhang Sanu" currently being discussed by the user is likely the same person as the "Zhang Sanu" mentioned in the previous chat. This reflects from the side that the user's current input is semantically associated with some of the chat content before.

For entities that match successfully, the server further locates their original locations in the historical chat log. Taking the above "Zhang Sano" as an example, the server will extract all historical chat fragments that contain the entity "Zhang Sano", which fragments are likely to contain background information related to the current discussion. According to the configured strategy (such as time distance, keyword relevance and the like), the server selects one or more most relevant fragments from all relevant fragments to be used as the current input historical key information. The server splices the extracted historical key information with the user content input by the current user to form a complete user input (target user content) containing the context information. The spliced target user content can replace the original user input and is used for subsequent tasks such as semantic understanding, dialogue generation and the like.

Finally, the server converts the target user content into a semantic vector representation, i.e., a user vector. The server encodes the target user content through the BERT language model. The BERT language model captures contextual information and long-range dependencies in text through a self-attention mechanism using a Transformer architecture. After inputting the target user content into the BERT language model, the BERT language model generates a fixed-length vector representation, i.e., a user vector, which includes semantic features of the target user content. The user vectors encode semantic information of the user content in a compact, continuous form for subsequent retrieval and matching. The server can use the user vector to calculate the similarity with the target vector in the preset vector database to find out the background description and scenario description most relevant to the user content, so as to generate an accurate and consistent question-answer result.

Step S103, calculating first similarity between the user vector and a first target vector and second similarity between the user vector and a second target vector, wherein the first target vector is any target vector in a preset vector database, and the second target vector is any target vector except the first target vector in the preset vector database, and the preset vector database comprises a plurality of target vectors.

In step S103, the server first obtains any two target vectors from the preset vector database, which are respectively referred to as a first target vector and a second target vector. The preset vector database is constructed before, a large number of target vectors are stored in the preset vector database, and each target vector corresponds to target text content of a video scenario, and the key information comprises background description, scenario description and the like. After the server selects the first target vector and the second target vector, the server can calculate the similarity between the first target vector and the second target vector and the user vector. The application adopts cosine similarity to calculate similarity, and sets a user vector as u, a first target vector as v1, a second target vector as v2, and a calculation formula of the cosine similarity is as follows:

cos(u,v)=(u·v)/(||u||*||v||)

Where u·v represents the dot product of the two vectors, and u and v represent the L2 norms of the two vectors, respectively. The range of cosine similarity is [ -1,1], with a larger value indicating that the directions of the two vectors are closer, i.e., the similarity is higher.

The server calculates cosine similarity cos (u, v 1) of the user vector u and the first target vector v1, and cosine similarity cos (u, v 2) of the user vector u and the second target vector v2, respectively. The two similarity values reflect semantic relevance between the user content and the video scenario description corresponding to the two target vectors. Through similarity calculation, the server can quickly find out the video scenario segment most relevant to the user content, and provide important reference information for subsequent question and answer generation.

Step S104, if the first similarity is larger than the second similarity, determining first target text content corresponding to the first target vector according to a preset vector database.

In step S104, the server first needs to compare the magnitudes of the similarity (first similarity) between the user vector and the first target vector and the similarity (second similarity) between the user vector and the second target vector. The server compares the similarity through the conditional judgment statement. If the first similarity is greater than the second similarity, it is indicated that the user vector matches the first target vector more, i.e., the user content is more relevant to the first target text content corresponding to the first target vector. At this time, the server executes the subsequent operation, and determines the first target text content corresponding to the first target vector according to the preset vector database. The first target text content includes a background description, a scenario description, etc. of the segment. When the first similarity is determined to be greater than the second similarity, the server needs to search text content corresponding to the first target vector in a preset vector database. First, the server searches in the index of the preset vector database by taking the first target vector as a query condition. After obtaining the first target text content corresponding to the first target vector, the server uses the first target text content as an output result of the step S104 to be used for a subsequent question-answer generating task. The first target text content typically contains rich video scenario information, such as text content, corresponding character labels, background descriptions, and scenario descriptions, that can provide important contextual cues and knowledge support for questions and answers.

Step S105, inputting the first target text content and the user content into a preset RAG model to generate reply content aiming at the user content.

In step S105, the server inputs the first target text content and the user content together into a preset RAG model, and generates reply content for the user content. The preset RAG model is a method based on search enhancement generation, combines two tasks of information search and text generation, and can generate answers which are related to user questions and accord with the context according to the searched related text information. The core components of the preset RAG model include retrievers (retrievers) and generators (generators). The retriever is responsible for retrieving a plurality of documents most relevant to the user questions from a large-scale text library, and the generator generates final answer text according to the documents and the user questions.

In this example, the server has found the first target text content that is most relevant to the user content by similarity calculation. Thus, the task of the retriever has been completed, the server directly inputs the first target text content and the user content into the generator of the RAG model.

The generator of the preset RAG model adopts a sequence-to-sequence (Seq 2 Seq) model based on a transducer, such as BART, T5 and the like. The generator receives two inputs, one being the first target text content, the context information generated as an answer, and the other being the user content, the question to be answered. The server needs to convert the first target text content and the user content into an input format acceptable to the preset RAG model. The preset RAG model adopts a sequence format similar to' question:. N context:. N answer:. Wherein the question part is user content, the context part is first target text content, and the answer part is left blank and is generated by the preset RAG model.

For example, if the user content is "is the man's principal angle last and is the woman's principal angle together. ", the input sequence may be structured as:

"problem: is the man principal angle finally and the woman principal angle together? at the end of the movie, the men's and women's principal angles end up untangling, thanks to the happiness of living together. An n answer "

The generator portion of the pre-set RAG model is typically based on a transducer architecture, such as BART. The preset RAG model encodes the input sequence using a self-attention mechanism based encoder, converting it into a set of vector representations. The server transmits the constructed input sequence to an encoder of a preset RAG model. The encoder first tokenizes the sequence, breaks it down into individual words or subwords, and maps it into corresponding embedded vectors. Then, the encoder transforms and combines the embedded vectors through a multi-layer self-attention mechanism and a feedforward neural network, extracts semantic information therein, and generates a set of context vectors. The context vector not only captures the semantics of the user problem, but also fuses the background information provided by the first target text content. The context vector generated by the encoder is transferred to a decoder of a preset RAG model. The decoder also generates a complete answer sequence step by step from the context vector and the generated answer segments by a self-attention mechanism and a cross-attention mechanism based on the transducer architecture. The decoding process typically uses an autoregressive approach, i.e., each word is generated with the previously generated word as input, to predict the next most likely word. This loop prediction process continues until a special end-marker (e.g., "[ EOS ]") is generated. After post-processing, the server obtains a format-specific, easy-to-read reply text to the user's content. Finally, the server returns the answer text to the user, completing the task of step S105.

For example, if the answer generated by the preset RAG model for the input sequence is "Yes", the server processes it thereafter as "Yes, [ EOS ]. ", and returned to the user. The final answer seen by the user is yes. The man's principal angle and the woman's principal angle eventually get undone the misunderstanding and live happily together. "

In one possible implementation, after inputting the first target text content and the user content into a preset RAG model and generating reply content to the user content, the method further includes storing the user content and the reply content into a historical chat record.

Specifically, the server stores the user content and the generated reply content in a historical chat log database for subsequent query and context processing.

Referring to fig. 2, the application further provides a video scenario question-answering device based on a RAG, the device is a server, the server comprises an acquisition module 201 and a processing module 202, the acquisition module 201 is used for acquiring user content input by a user for video content, the processing module 202 is used for converting the user content into a corresponding user vector, the processing module 202 is also used for calculating first similarity between the user vector and a first target vector and second similarity between the user vector and a second target vector, the first target vector is any target vector in a preset vector database, the second target vector is any target vector except the first target vector in the preset vector database, the preset vector database is constructed in advance according to the video content, the processing module 202 is also used for determining first target text content corresponding to the first target vector according to the preset vector database if the first similarity is determined to be greater than the second similarity, and the processing module 202 is also used for inputting the first target text content and the user content into a preset RAG model to generate reply content for the user content.

In one possible implementation, before the obtaining module 201 obtains the user content input by the user, the method further includes the obtaining module 201 obtaining the video file, the processing module 202 converting the video file into the audio file by using a transcoding tool, the processing module 202 performing role separation on the audio file to obtain the audio content corresponding to the roles, the processing module 202 converting the audio content into text content and setting corresponding role labels for the text content, the processing module 202 extracting the video file by using a preset multi-mode big model, extracting key frames, converting the key frames into background descriptions and scenario descriptions for the video file, the processing module 202 integrating the text content, the role labels, the background descriptions and the scenario descriptions corresponding to the text content in time sequence to obtain target text content, and the processing module 202 converting the target text content into target vectors, and storing the target vectors, the target text content and the corresponding relations between the target vectors and the target text content into a preset vector database.

In a possible implementation manner, the processing module 202 converts the user content into a corresponding user vector, and specifically includes the obtaining module 201 obtaining a historical chat record corresponding to the user content, the processing module 202 extracting the historical chat record to obtain historical key information, the processing module 202 splicing the user content and the historical key information to obtain a spliced target user content, and the processing module 202 converting the target user content into the user vector.

In one possible implementation, the processing module 202 extracts the historical chat record to obtain the historical key information, and specifically includes that the processing module 202 performs entity identification on the historical chat record to obtain a corresponding historical entity and performs entity identification on user content to obtain a corresponding content entity, and if the processing module 202 determines that the historical entity and the content entity are the same entity, the historical chat record corresponding to the historical entity is determined to be the historical key information.

In a possible implementation manner, the processing module 202 extracts the video file by using a preset multi-mode large model, and extracts the key frames, and specifically includes the processing module 202 determining a plurality of image frames in the video file, the processing module 202 calculating image similarity of a first image frame and a second image frame, where the first image frame and the second image frame are any two adjacent image frames in the plurality of image frames, and the processing module 202 determining that the first image frame and the second image frame are the key frames if the image similarity is determined to be less than a preset similarity threshold.

In one possible implementation manner, the processing module 202 converts the key frame into a background description and a scenario description for the video file, and specifically includes that the processing module 202 performs feature extraction on the key frame to obtain a corresponding visual feature vector, the processing module 202 inputs the visual feature vector into a preset image description generating model to obtain a text description for the key frame, the processing module 202 performs keyword extraction on the text description to determine a corresponding keyword, the processing module 202 inputs the keyword into a preset keyword library to obtain a description type corresponding to the keyword, the description type comprises the background description and the scenario description, and the preset keyword library comprises a corresponding relation between the keyword and the description type.

In one possible implementation, after the processing module 202 inputs the first target text content and the user content into the preset RAG model and generates the reply content to the user content, the method further includes the processing module 202 storing the user content and the reply content into the historical chat log.

It should be noted that, when the apparatus provided in the foregoing embodiment implements the functions thereof, only the division of the foregoing functional modules is used as an example, in practical application, the foregoing functional allocation may be implemented by different functional modules, that is, the internal structure of the device is divided into different functional modules, so as to implement all or part of the functions described above. In addition, the embodiments of the apparatus and the method provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the embodiments of the method are detailed in the method embodiments, which are not repeated herein.

The application further provides electronic equipment. Referring to fig. 3, fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 300 may include at least one processor 301, at least one network interface 304, a user interface 303, a memory 305, and at least one communication bus 302.

Wherein the communication bus 302 is used to enable connected communication between these components.

The user interface 303 may include a Display screen (Display), a Camera (Camera), and the optional user interface 303 may further include a standard wired interface, and a wireless interface.

The network interface 304 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), among others.

Wherein the processor 301 may include one or more processing cores. The processor 301 utilizes various interfaces and lines to connect various portions of the overall server, perform various functions of the server and process data by executing or executing instructions, programs, code sets, or instruction sets stored in the memory 305, and invoking data stored in the memory 305. Alternatively, the processor 301 may be implemented in at least one hardware form of digital signal Processing (DIGITAL SIGNAL Processing, DSP), field-Programmable gate array (Field-Programmable GATE ARRAY, FPGA), programmable logic array (Programmable Logic Array, PLA). The processor 301 may integrate one or a combination of several of a central processing unit (Central Processing Unit, CPU), an image processor (Graphics Processing Unit, GPU), and a modem, etc. The CPU mainly processes an operating system, a user interface, an application program and the like, the GPU is used for rendering and drawing contents required to be displayed by the display screen, and the modem is used for processing wireless communication. It will be appreciated that the modem may not be integrated into the processor 301 and may be implemented by a single chip.

The Memory 305 may include a random access Memory (Random Access Memory, RAM) or a Read-Only Memory (Read-Only Memory). Optionally, the memory 305 includes a non-transitory computer readable medium (non-transitory computer-readable storage medium). Memory 305 may be used to store instructions, programs, code, sets of codes, or sets of instructions. The memory 305 may include a stored program area that may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the respective method embodiments described above, etc., and a stored data area that may store data, etc., involved in the respective method embodiments described above. Memory 305 may also optionally be at least one storage device located remotely from the aforementioned processor 301. Referring to fig. 3, an operating system, a network communication module, a user interface module, and an application program of a RAG-based video processing method may be included in the memory 305 as a computer storage medium.

In the electronic device 300 shown in fig. 3, the user interface 303 is primarily used to provide an input interface for a user to obtain data entered by the user, while the processor 301 may be used to invoke an application program in the memory 305 storing a RAG-based video processing method, which when executed by the one or more processors 301, causes the electronic device 300 to perform the method as described in one or more of the embodiments above. It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present application is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all of the preferred embodiments, and that the acts and modules referred to are not necessarily required for the present application.

The application also provides a computer readable storage medium storing instructions. When executed by the one or more processors 301, causes the electronic device 300 to perform the methods as described in one or more of the embodiments above.

In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.

In the several embodiments provided by the present application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, such as a division of units, merely a division of logic functions, and there may be additional divisions in actual implementation, such as multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some service interface, device or unit indirect coupling or communication connection, electrical or otherwise.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable memory. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in whole or in part in the form of a software product stored in a memory, comprising several instructions for causing a computer device (which may be a personal computer, a server or a network device, etc.) to perform all or part of the steps of the method of the various embodiments of the present application. The memory includes various media capable of storing program codes, such as a USB flash disk, a mobile hard disk, a magnetic disk or an optical disk.

The foregoing is merely exemplary embodiments of the present disclosure and is not intended to limit the scope of the present disclosure. That is, equivalent changes and modifications are contemplated by the teachings of this disclosure, which fall within the scope of the present disclosure. Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure.

This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a scope and spirit of the disclosure being indicated by the claims.

Claims

1. A video plot question-answering method based on RAG, characterized in that the method comprises:

Obtain user content input by the user for the video content;

Converting the user content into a corresponding user vector;

Calculating a first similarity between the user vector and a first target vector and a second similarity between the user vector and a second target vector, wherein the first target vector is any target vector in a preset vector database, and the second target vector is any target vector in the preset vector database except the first target vector, wherein the preset vector database includes a plurality of the target vectors, and the preset vector database is pre-constructed according to the video content;

If it is determined that the first similarity is greater than the second similarity, determining the first target text content corresponding to the first target vector according to the preset vector database;

The first target text content and the user content are input into a preset RAG model to generate reply content for the user content.

2. The method according to claim 1, characterized in that before obtaining the user content input by the user for the video content, the method further comprises:

Obtaining a video file corresponding to the video content;

Using a transcoding tool to convert the video file into an audio file;

Separating the audio file into roles to obtain roles and audio content corresponding to the roles;

Convert the audio content into text content and set corresponding role tags for the text content;

The video file is extracted using a preset multimodal macro model to extract key frames, and the key frames are converted into background descriptions and plot descriptions for the video file;

Integrate the text content, the character label corresponding to the text content, the background description, and the plot description in chronological order to obtain target text content;

The target text content is converted into a target vector, and the target vector, the target text content, and the corresponding relationship between the target vector and the target text content are stored in a preset vector database.

3. The method according to claim 1, wherein converting the user content into a corresponding user vector comprises:

Obtaining historical chat records corresponding to the user content;

Extracting the historical chat records to obtain historical key information;

splicing the user content and the historical key information to obtain spliced target user content;

The target user content is converted into the user vector.

4. The method according to claim 3 is characterized in that the extracting of the historical chat records to obtain historical key information specifically includes:

Performing entity recognition on the historical chat records to obtain corresponding historical entities, and performing entity recognition on the user content to obtain corresponding content entities;

If it is determined that the historical entity and the content entity are the same entity, the historical chat record corresponding to the historical entity is determined as the historical key information.

5. The method according to claim 2, characterized in that the step of extracting the key frames from the video file using a preset multimodal large model specifically comprises:

Determining a plurality of image frames in the video file;

Calculating an image similarity between a first image frame and a second image frame, where the first image frame and the second image frame are any two adjacent image frames among the plurality of image frames;

If it is determined that the image similarity is less than a preset similarity threshold, the first image frame and the second image frame are determined to be key frames.

6. The method according to claim 2, characterized in that the step of converting the key frames into a background description and a plot description for the video file specifically comprises:

Extracting features from the key frame to obtain a corresponding visual feature vector;

Inputting the visual feature vector into a preset image description generation model to obtain a text description for the key frame;

Extract keywords from the text description to determine corresponding keywords;

The keyword is input into a preset keyword library to obtain a description type corresponding to the keyword, wherein the description type includes a background description and a plot description, and the preset keyword library includes a correspondence between the keyword and the description type.

7. The method according to claim 1, characterized in that after inputting the first target text content and the user content into a preset RAG model to generate reply content for the user content, the method further comprises:

The user content and the reply content are stored in the historical chat record.

8. A video processing device based on RAG, characterized in that the device comprises an acquisition module (201) and a processing module (202), wherein:

The acquisition module (201) is used to acquire user content input by the user for the video content;

The processing module (202) is used to convert the user content into a corresponding user vector;

The processing module (202) is further used to calculate a first similarity between the user vector and a first target vector and a second similarity between the user vector and a second target vector, wherein the first target vector is any target vector in a preset vector database, and the second target vector is any target vector in the preset vector database except the first target vector, wherein the preset vector database includes a plurality of the target vectors, and the preset vector database is pre-constructed according to the video content;

The processing module (202) is further configured to determine, if it is determined that the first similarity is greater than the second similarity, a first target text content corresponding to the first target vector according to the preset vector database;

The processing module (202) is further used to input the first target text content and the user content into a preset RAG model to generate reply content for the user content.

9. An electronic device, characterized in that it comprises a processor (301), a memory (305), a user interface (303) and a network interface (304), wherein the memory (305) is used to store instructions, the user interface (303) and the network interface (304) are used to communicate with other devices, and the processor (301) is used to execute the instructions stored in the memory (305) so that the electronic device (300) executes the method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores instructions, and when the instructions are executed, the method according to any one of claims 1 to 7 is executed.