
CN119106098A - A video plot question-answering method and device based on RAG - Google Patents


Info

Publication number
CN119106098A
Authority
CN
China
Prior art keywords
content
user
vector
target
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411160716.8A
Other languages
Chinese (zh)
Inventor
方律
周凌洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Feiling Technology Co.,Ltd.
Original Assignee
Hefei Feier Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei Feier Intelligent Technology Co., Ltd.
Priority to CN202411160716.8A
Publication of CN119106098A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3347 Query execution using vector based model
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/04 Inference or reasoning models
    • G06N 5/041 Abduction
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract


A video plot question-and-answer method and device based on RAG, relating to the field of RAG. In the method, user content input by a user for video content is obtained; the user content is converted into a corresponding user vector; a first similarity between the user vector and a first target vector and a second similarity between the user vector and a second target vector are calculated, and a preset vector database includes multiple target vectors; if it is determined that the first similarity is greater than the second similarity, the first target text content corresponding to the first target vector is determined according to the preset vector database; the first target text content and the user content are input into a preset RAG model to generate reply content for the user content. By implementing the technical solution provided by the present application, user content and video files can be processed by vectorization, so that user questions can be answered for video files.

Description

Video scenario question answering method and device based on RAG
Technical Field
The application relates to the field of RAG (Retrieval-Augmented Generation), and in particular to a video scenario question-answering method and device based on RAG.
Background
With the rapid development of internet technology and the explosive growth of video content, users increasingly rely on video platforms to obtain information and entertainment. Video content not only enriches people's daily lives but has also become an important way to acquire knowledge. However, within the large amount of video data, it is often difficult for a user to quickly obtain specific information as needed. This demand has driven the wide application of video scenario question-answering systems, which help users quickly obtain related information and answers while watching videos.
Currently, the related art relies primarily on rule matching and simple text retrieval to handle questions entered by users. The related art can only process plain-text content, has limited ability to parse and understand complex multi-modal data (e.g., video), and therefore cannot answer user questions about video content.
Therefore, there is a need for a video scenario question-answering method and device based on RAG.
Disclosure of Invention
The application provides a video scenario question-answering method and device based on RAG, which can parse and understand video files and answer user questions about those video files by vectorizing the user content and the video files.
A first aspect of the application provides a video scenario question-answering method based on RAG, which comprises the steps of: obtaining user content input by a user for video content; converting the user content into a corresponding user vector; calculating a first similarity between the user vector and a first target vector and a second similarity between the user vector and a second target vector, wherein the first target vector is any target vector in a preset vector database, the second target vector is any target vector in the preset vector database other than the first target vector, the preset vector database comprises a plurality of target vectors, and the preset vector database is constructed in advance according to the video content; if it is determined that the first similarity is greater than the second similarity, determining first target text content corresponding to the first target vector according to the preset vector database; and inputting the first target text content and the user content into a preset RAG model to generate reply content for the user content.
By adopting the technical scheme, matching and understanding between the user input content and the video content can be achieved by converting the content input by the user into a user vector and calculating the similarity between the user vector and the target vectors in the preset vector database. The best-matching target vector and its corresponding target text content are determined by comparing the similarities. Reply content for the user content is generated by inputting the user input content and the target text content into the preset RAG model. In this way, the video processing capability is enhanced, and the system can respond to users' demands and questions more accurately and intelligently.
Optionally, before the user content input by the user is obtained, the method further comprises the steps of obtaining a video file corresponding to the video content, converting the video file into an audio file by adopting a transcoding tool, performing role separation on the audio file to obtain a role and audio content corresponding to the role, converting the audio content into text content, setting a corresponding role tag for the text content, extracting the video file by adopting a preset multi-mode big model, extracting a key frame, converting the key frame into background description and scenario description for the video file, integrating the text content, the role tag corresponding to the text content, the background description and the scenario description according to time sequence to obtain target text content, converting the target text content into a target vector, and storing the target vector, the target text content and the corresponding relation between the target vector and the target text content into a preset vector database.
By adopting the technical scheme, the video file is converted into the audio file by adopting the transcoding tool, so that preparation is made for subsequent audio processing and text conversion. And performing role separation on the audio files, and separating out the audio contents of different roles. Thus, the audio content of different roles in the video can be acquired. And converting the separated audio content into text content, setting a corresponding role label for the text content, and identifying which role the text content belongs to. This allows the text content of different roles to be distinguished in subsequent processing. And processing the video file by adopting a preset multi-mode large model, and extracting key frames of the video. These key frames are then translated into a background description and scenario description for the video file. And integrating the text content, the role labels, the background description and the scenario description according to the time sequence to obtain the target text content. In this way, the individual elements can be integrated together to form complete text describing the video content (target text content). And converting the target text content into a target vector, and storing the corresponding relation between the target vector and the target text content in a preset vector database. In this way, the target text content and the corresponding vector form thereof can be associated and stored, and subsequent processing is convenient.
Optionally, converting the user content into the corresponding user vector specifically comprises the steps of obtaining a historical chat record corresponding to the user content, extracting the historical chat record to obtain historical key information, splicing the user content and the historical key information to obtain spliced target user content, and converting the target user content into the user vector.
By adopting the technical scheme, the user content currently input by the user is spliced with the history key information by acquiring the history chat record of the user and extracting the history key information, and the current request of the user is combined with the past interaction context, so that more comprehensive and accurate semantic representation is obtained. And converting the target user content into a representation form of the user vector, and facilitating subsequent similarity calculation and matching. Through vectorization, user content can be converted into a numerical representation that can be calculated and compared, thereby achieving more accurate similarity matching and reply generation.
Optionally, the historical chat record is extracted to obtain the historical key information, which specifically comprises the steps of carrying out entity identification on the historical chat record to obtain a corresponding historical entity, carrying out entity identification on the user content to obtain a corresponding content entity, and if the historical entity and the content entity are the same entity, determining the historical chat record corresponding to the historical entity as the historical key information.
By adopting the technical scheme, the historical entity and the content entity are compared and matched. If it is determined that the historical entity and the content entity are the same entity, i.e., they represent the same thing, then a historical chat record containing the historical entity may be determined as historical key information. The historical key information provides important contextual information related to the user's content, helping to better understand the user's current request or intent.
Optionally, extracting the video file by using a preset multi-mode large model to extract a key frame, and specifically comprises determining a plurality of image frames in the video file, calculating image similarity of a first image frame and a second image frame, wherein the first image frame and the second image frame are any two adjacent image frames in the plurality of image frames, and determining that the first image frame and the second image frame are key frames if the image similarity is smaller than a preset similarity threshold.
By adopting the technical scheme, the key frames in the video file can be determined by calculating the image similarity and applying the preset similarity threshold. Key frame extraction provides the basis for subsequent video content analysis tasks.
Optionally, converting the key frame into a background description and a scenario description for the video file specifically comprises the steps of: performing feature extraction on the key frame to obtain a corresponding visual feature vector; inputting the visual feature vector into a preset image description generation model to obtain a text description of the key frame; performing keyword extraction on the text description to determine a corresponding keyword; and inputting the keyword into a preset keyword library to obtain a description type corresponding to the keyword, wherein the description type comprises the background description and the scenario description, and the preset keyword library comprises the corresponding relation between keywords and description types.
By adopting the technical scheme, the visual feature vector is obtained by extracting the features of the key frames, and the visual feature vector is input into the preset image description generation model, so that the text description with strong pertinence and accuracy can be generated. The textual descriptions generated using the visual feature vectors may better reflect the video content, providing more detailed and rich background and scenario information. By extracting keywords and matching with a preset keyword library, the background description and the scenario description can be accurately distinguished. This refined description classification helps to better understand the different aspects of the video content, enhancing the overall content understanding capability.
Optionally, after inputting the first target text content and the user content into a preset RAG model and generating reply content for the user content, the method further comprises storing the user content and the reply content into a historical chat record.
By adopting the technical scheme, the user content and the reply content are stored in the history chat record, so that the history chat record can be referred in the subsequent dialogue, and more relevant and continuous replies are provided.
A second aspect of the application provides a video scenario question-answering device based on RAG, which comprises an acquisition module and a processing module, wherein the acquisition module is used for acquiring user content input by a user for video content; the processing module is used for converting the user content into a corresponding user vector; the processing module is further used for calculating a first similarity between the user vector and a first target vector and a second similarity between the user vector and a second target vector, wherein the first target vector is any target vector in a preset vector database, the second target vector is any target vector in the preset vector database other than the first target vector, the preset vector database comprises a plurality of target vectors, and the preset vector database is constructed in advance according to the video content; the processing module is further used for determining first target text content corresponding to the first target vector according to the preset vector database if it is determined that the first similarity is greater than the second similarity; and the processing module is further used for inputting the first target text content and the user content into a preset RAG model to generate reply content for the user content.
In a third aspect, the application provides an electronic device comprising a processor, a memory for storing instructions, a user interface, and a network interface, the user interface and the network interface being used for communicating with other devices, and the processor being configured to execute the instructions stored in the memory to cause the electronic device to perform the method of any one of the above.
In a fourth aspect, the application provides a computer-readable storage medium storing instructions which, when executed, perform the method of any one of the above.
In summary, one or more technical solutions provided in the embodiments of the present application at least have the following technical effects or advantages:
1. The method can realize matching and understanding of the user input content and the video content by converting the user input content into a user vector and calculating the similarity between the user vector and the target vectors in the preset vector database. The best-matching target vector and its corresponding target text content are determined by comparing the similarities. Reply content for the user content is generated by inputting the user input content and the target text content into the preset RAG model. In this way, the video processing capability is enhanced, and the system can respond to users' demands and questions more accurately and intelligently.
2. The video file is converted to an audio file using a transcoding tool, ready for subsequent audio processing and text conversion. And performing role separation on the audio files, and separating out the audio contents of different roles. Thus, the audio content of different roles in the video can be acquired. And converting the separated audio content into text content, setting a corresponding role label for the text content, and identifying which role the text content belongs to. This allows the text content of different roles to be distinguished in subsequent processing. And processing the video file by adopting a preset multi-mode large model, and extracting key frames of the video. These key frames are then translated into a background description and scenario description for the video file. And integrating the text content, the role labels, the background description and the scenario description according to the time sequence to obtain the target text content. In this way, the individual elements can be integrated together to form complete text describing the video content (target text content). And converting the target text content into a target vector, and storing the corresponding relation between the target vector and the target text content in a preset vector database. In this way, the target text content and the corresponding vector form thereof can be associated and stored, and subsequent processing is convenient.
3. By acquiring the historical chat record of the user and extracting the historical key information, the user content currently input by the user is spliced with the historical key information, and the current request of the user is combined with the past interaction context, so that more comprehensive and accurate semantic representation is obtained. And converting the target user content into a representation form of the user vector, and facilitating subsequent similarity calculation and matching. Through vectorization, user content can be converted into a numerical representation that can be calculated and compared, thereby achieving more accurate similarity matching and reply generation.
Drawings
Fig. 1 is a schematic flow chart of a video scenario question-answering method based on RAG according to an embodiment of the present application;
Fig. 2 is a schematic block diagram of a video scenario question-answering device based on RAG according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Reference numerals: 201, acquisition module; 202, processing module; 300, electronic device; 301, processor; 302, communication bus; 303, user interface; 304, network interface; 305, memory.
Detailed Description
In order that those skilled in the art will better understand the technical solutions in the present specification, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments.
In describing embodiments of the present application, words such as "for example" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "for example" is intended to present related concepts in a concrete fashion.
In the description of embodiments of the application, the term "plurality" means two or more. For example, a plurality of systems means two or more systems, and a plurality of screen terminals means two or more screen terminals. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating an indicated technical feature. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
The application provides a video scenario question-answering method based on RAG, and referring to FIG. 1, FIG. 1 is a flow diagram of the video scenario question-answering method based on RAG provided by the embodiment of the application. The method is applied to the server and comprises the following steps of S101 to S105:
Step S101, user content input by a user for video content is acquired.
Before step S101, the method further comprises the steps of obtaining a video file, converting the video file into an audio file by using a transcoding tool, performing role separation on the audio file to obtain audio content corresponding to roles, converting the audio content into text content, setting corresponding role labels for the text content, extracting the video file by using a preset multi-mode big model, extracting key frames, converting the key frames into background description and scenario description aiming at the video file, integrating the text content, the role labels, the background description and the scenario description corresponding to the text content according to time sequence to obtain target text content, converting the target text content into target vectors, and storing the target vectors, the target text content and the corresponding relations between the target vectors and the target text content into a preset vector database.
Specifically, the server obtains the video file uploaded or designated by the user through a network interface or a local file system. After the video file is obtained, the server calls a pre-configured transcoding tool to process the video file. The transcoding tool is, for example, the open-source FFmpeg. The server converts the video file into an audio file through the command line interface or API of the transcoding tool. The server then performs character separation on the converted audio file. Character separation separates the voices of different persons from the mixed audio signal. The server loads a pre-trained character separation model, such as a deep convolutional network (Deep CNN), and uses it to perform inference on the audio file. The character separation model automatically identifies the voiceprints of different speakers in the audio by analyzing characteristics of the audio such as frequency spectrum, fundamental frequency, and formants, and separates them into different audio tracks. The separated audio tracks correspond to different characters, each with its own audio content, and the audio content of each character is stored separately.
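As an illustrative sketch of the transcoding step only (not of the character separation), the server-side call to FFmpeg can be made as follows; the file paths, sample rate, and channel layout are assumptions chosen for downstream speech processing rather than requirements of the present application:

```python
import subprocess

def extract_audio(video_path: str, audio_path: str) -> None:
    """Transcode a video file to a mono 16 kHz WAV file with FFmpeg."""
    subprocess.run(
        [
            "ffmpeg", "-y",      # overwrite the output file if it exists
            "-i", video_path,    # input video file
            "-vn",               # drop the video stream
            "-ac", "1",          # mono audio
            "-ar", "16000",      # 16 kHz sample rate
            audio_path,
        ],
        check=True,
    )

# Example (paths are placeholders):
# extract_audio("movie.mp4", "movie.wav")
```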
After the character audio separation is completed, the server converts the audio content of each character into text content using speech recognition technology. The server invokes a pre-trained speech recognition model, such as DeepSpeech or Wav2Letter, to recognize the audio content of each character. The speech recognition model converts the speech signal into a corresponding text sequence by analyzing the acoustic and linguistic features of the audio content. The recognized text content corresponds one-to-one with the corresponding character audio, forming character-text content pairs. The server stores the text content and sets a corresponding character tag for each text content segment for subsequent processing.
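A minimal speech-to-text sketch is shown below; the open-source Whisper package is used purely as a stand-in for the DeepSpeech/Wav2Letter class of models named above, and the model size and output fields are assumptions of this illustration:

```python
import whisper  # pip install openai-whisper; stand-in for DeepSpeech/Wav2Letter

def transcribe_character_track(audio_path: str, character: str):
    """Transcribe one separated character track and attach its character tag."""
    model = whisper.load_model("base")  # model size is an assumption
    result = model.transcribe(audio_path)
    # Segment timestamps are kept so the text can later be merged with
    # key-frame descriptions in time order.
    return [
        {"character": character, "start": seg["start"],
         "end": seg["end"], "text": seg["text"]}
        for seg in result["segments"]
    ]
```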
The server also needs to extract key information, including background descriptions and scenario descriptions, from the video file while acquiring character text content. This step is achieved by invoking a preset multi-modal large model. The preset multi-mode large model is an AI model for processing multiple mode information such as vision, voice and text simultaneously, for example ViLBERT. The server first breaks the video file into a series of image frames using a video analysis tool (e.g., openCV), and then selects key frames that can represent the main content of the video through a key frame extraction algorithm.
In one possible implementation, a preset multi-mode large model is adopted to extract a video file, and a key frame is extracted, and the method specifically comprises the steps of determining a plurality of image frames in the video file, calculating image similarity of a first image frame and a second image frame, wherein the first image frame and the second image frame are any two adjacent image frames in the plurality of image frames, and determining the first image frame and the second image frame as the key frame if the image similarity is smaller than a preset similarity threshold value.
Specifically, the server needs to decode the video file and acquire all the image frames therein. After all the image frames are acquired, the server needs to calculate the similarity between any two adjacent image frames. The similarity is used for measuring the similarity of two images on visual contents, and the image similarity calculation method comprises histogram comparison, feature point matching, perceptual hash and the like. In the scheme, the server calculates the image similarity by adopting a preset multi-mode large model. The preset multi-mode large model is a model based on deep learning, and can process data of multiple modes such as texts, images, audios and the like at the same time and learn association and mapping relations among the data. The server inputs two adjacent image frames (a first image frame and a second image frame) into a preset multi-mode large model, the preset multi-mode large model extracts high-level semantic features of the first image frame and the second image frame through a Convolutional Neural Network (CNN) technology, and then similarity between the two image frames is obtained through feature comparison and similarity calculation. The similarity is typically a real number between 0 and 1, indicating the degree of similarity of the two images. The larger the similarity, the closer the contents of the two images are, and the smaller the similarity, the larger the difference in contents of the two images is. After calculating the similarity of all the adjacent image frames, the server needs to determine which image frames are key frames according to the size of the similarity. The key frame is selected according to the image frames with larger content difference between the front image frame and the rear image frame, namely, the image frames with lower similarity. The server is internally preset with a preset similarity threshold value for judging whether the two image frames are different enough or not. If the similarity between the first image frame and the second image frame is less than the preset similarity threshold, the two image frames are considered to have larger changes in visual content and can be used as key frames. For example, assuming that the preset similarity threshold is 0.8, if the similarity of the first image frame and the second image frame is 0.7, which is smaller than the preset similarity threshold, both image frames are marked as key frames. The server traverses all adjacent image frame pairs, and performs threshold comparison on the similarity of the image frame pairs to finally obtain a group of key frames.
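As a simplified, non-limiting sketch of this key-frame selection logic, the following code uses OpenCV histogram correlation as the inter-frame similarity measure in place of the preset multi-modal large model; the 0.8 threshold follows the example above:

```python
import cv2

def extract_key_frames(video_path: str, threshold: float = 0.8):
    """Mark adjacent frame pairs whose similarity falls below the threshold.

    Histogram correlation stands in for the multi-modal model's similarity
    score; like that score, larger values mean more similar frames.
    """
    cap = cv2.VideoCapture(video_path)
    key_frames, prev_hist, prev_frame, index = [], None, None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            similarity = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
            if similarity < threshold:
                # Both frames of a low-similarity pair are kept, as described above.
                key_frames.append((index - 1, prev_frame))
                key_frames.append((index, frame))
        prev_hist, prev_frame = hist, frame
        index += 1
    cap.release()
    return key_frames
```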
After the key frame is selected, the server inputs the key frame into a preset multi-mode large model for analysis. And carrying out feature extraction on the key frames by a preset multi-mode large model, and obtaining visual features of the images through a Convolutional Neural Network (CNN). The visual features of the preset multi-mode large model are aligned and fused with the text features to form multi-mode semantic representation. Based on the semantic representation, a pre-set multi-modal large model may generate a natural language description of the image content of the key frame, i.e., converting the key frame into a background description and a scenario description. Background descriptions generally relate to static information of scenes, environments, objects, etc. in an image, while scenario descriptions relate to dynamic information of actions, events, interactions, etc. of persons in an image.
In one possible implementation, the method for converting the key frame into the background description and the scenario description of the video file specifically comprises the steps of extracting features of the key frame to obtain corresponding visual feature vectors, inputting the visual feature vectors into a preset image description generating model to obtain text description of the key frame, extracting keywords of the text description to determine corresponding keywords, inputting the keywords into a preset keyword library to obtain description types corresponding to the keywords, wherein the description types comprise background description and scenario description, and the preset keyword library comprises corresponding relations between the keywords and the description types.
Specifically, the server inputs the image data of each key frame into a preset multi-mode large model, and the preset multi-mode large model converts the input image into a visual feature vector with a fixed length (4096 dimensions, for example). The visual feature vector highly concentrates the semantic content information of the image. The extracted visual feature vectors of all key frames are temporarily stored in a memory. Meanwhile, the server inputs the visual feature vectors of the key frames into a preset image description generation model in batches. The preset image description generation model adopts an encoder-decoder structure, takes a visual characteristic vector as input, and generates a corresponding natural language description text (text description). Meanwhile, the server establishes a mapping relation between the generated text description and the corresponding key frame. And the server calls a preset keyword extraction module for processing the text description generated by each key frame. The module uses algorithms based on TF-IDF, textRank, etc. to extract several keywords from the text description, which can generally be highly generalized to the core content of the description. And the server establishes a mapping relation between the extracted keywords and the corresponding key frames. And the server queries the keywords extracted from each key frame in a preset keyword library. Description types (background description or scenario description) corresponding to various common keywords are manually defined in a preset keyword library in advance. For example, "trees," "buildings," etc. generally correspond to background descriptions, and "run," "talk," etc. generally correspond to scenario descriptions. By matching the keywords, it can be determined whether a key frame is more likely to belong to the background description or the scenario description. Through the above processing steps, each key frame in the video gets two pieces of information, a background description (key word) and a scenario description (key word).
For example, assuming that a key frame is a picture of two people chatting in a park, the background description generated by the multimodal model may be "this is a sunny park with green trees, benches and fountains", while the scenario description may be "two young people chatting on benches, happy with ease, and seemingly discussing some interesting topics".
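A minimal sketch of the keyword-library lookup is shown below; the keyword lists, the tokenization, and the fallback type are assumptions of this illustration rather than the contents of the actual preset keyword library:

```python
import re

# Hypothetical preset keyword library: keyword -> description type.
KEYWORD_LIBRARY = {
    "tree": "background description", "building": "background description",
    "park": "background description", "fountain": "background description",
    "run": "scenario description", "talk": "scenario description",
    "chat": "scenario description", "discuss": "scenario description",
}

def classify_caption(caption: str) -> str:
    """Route a key-frame caption to a description type by keyword vote."""
    tokens = re.findall(r"[a-z]+", caption.lower())
    votes = [KEYWORD_LIBRARY[t] for t in tokens if t in KEYWORD_LIBRARY]
    # Fallback type when nothing matches is an assumption, not from the patent.
    return max(set(votes), key=votes.count) if votes else "scenario description"

# classify_caption("A sunny park with a fountain") -> "background description"
```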
Finally, the server integrates all the information extracted from the video, including the text content, character tags, background descriptions, and scenario descriptions, in time order to form complete target text content. Since both the earlier character separation and key frame extraction preserve time information, the server can align them by timestamp, ensuring that the integrated text content is synchronized with the video content on the time axis. The integrated target text content represents the complete semantic information of the video, but the server also needs to convert it into a semantic vector representation, i.e., a target vector, to facilitate subsequent retrieval and generation. The server encodes the target text content using a natural language model, such as BERT or GPT, and maps it into a high-dimensional semantic space. The encoding process fully utilizes the self-attention mechanism of the Transformer structure and can capture long-distance dependencies and contextual information in the text content. The target vector obtained after encoding is a fixed-length real-valued vector that contains the semantic features of the target text content.
In order to facilitate subsequent retrieval and matching, the server stores the target vector, the target text content and the correspondence between the target text content and the target text content in a preset vector database. The preset vector database is a database specially used for storing and retrieving high-dimensional vectors, such as Faiss, annoy, etc. The server takes the target vector as a key, takes the target text content as a value, and stores the target text content into a preset vector database in the form of key value pairs. When searching, the preset vector database can rapidly calculate the similarity between the query vector input by the user and all the target vectors, and returns the target text content most relevant to the query.
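As an illustrative sketch of the vector-database construction, the following code encodes target text content with a sentence-transformers model and indexes it with Faiss; the specific checkpoint and the inner-product index are assumptions (the description above only requires a BERT/GPT-class encoder and a Faiss/Annoy-style database):

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Model choice is an assumption; any BERT-class sentence encoder would do.
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def build_vector_database(target_texts):
    """Encode each target text chunk and index it for similarity search."""
    vectors = encoder.encode(target_texts, normalize_embeddings=True)
    vectors = np.asarray(vectors, dtype="float32")
    # With normalized vectors, inner product equals cosine similarity.
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    # Keep the vector position -> text mapping alongside the index, mirroring
    # the key-value correspondence described above.
    id_to_text = {i: text for i, text in enumerate(target_texts)}
    return index, id_to_text
```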
In step S101 of the video scenario question-answering method, acquiring the user content input by the user is the starting point and basis of the entire question-answering flow. The user content refers to a question or query about the video scenario posed by the user, and is the basis for the server's subsequent processing and answer generation. The server needs to provide an interactive interface for the user to enter questions or queries. The interface can take various forms, such as a web page, a mobile application, or a chat window; the main purpose is to let users conveniently and intuitively pose their own questions. The interactive interface typically includes an input box for the user to enter questions in text form. The input box has an appropriate size and style to ensure that the user can clearly see the entered content. For example, the server may provide a search box on the web page and display the prompt text "Please enter your question about the video scenario" above the search box. The user can enter his or her question in the search box, such as "Who is the protagonist of the movie?".
And step S102, converting the user content into corresponding user vectors.
In step S102, the user content is converted into a corresponding user vector, which concretely comprises the steps of obtaining a history chat record corresponding to the user content, extracting the history chat record to obtain history key information, splicing the user content and the history key information to obtain spliced target user content, and converting the target user content into the user vector.
In one possible implementation manner, the method for extracting the historical chat record to obtain the historical key information specifically comprises the steps of carrying out entity identification on the historical chat record to obtain a corresponding historical entity, carrying out entity identification on user content to obtain a corresponding content entity, and if the historical entity and the content entity are the same entity, determining the historical chat record corresponding to the historical entity as the historical key information.
Specifically, when a user inputs a question or inquires about content, the server processes it as user content. In order to better understand the context and background of the user content, the server needs to obtain a history chat record corresponding to the user content. The server identifies the current user through a user authentication mechanism and retrieves the user's historical chat log from the chat log database. The chat log database stores a complete history of each user interaction, including metadata of the user's questions, answers to the questions, timestamps, and the like.
After the history chat record is obtained, the server needs to extract the history key information most relevant to the current user content from the history chat record. This is achieved by entity recognition techniques. Entity identification may identify an entity (e.g., person name, place name, organization name, etc.) from text and determine its type. And the server respectively carries out entity identification on the historical chat record and the current user content to obtain a historical entity and a content entity.
The server performs the same processing and entity identification steps on the user content input by the current user, so as to obtain the content entities mentioned in the current input of the user, and form a content entity list. The server matches the historical entity list with the content entity list. The matching method is based on string equality, and the server checks whether the same entity exists in both lists.
For example, assume that the server obtains a historical entity list of ["Zhang San", "Beijing", "Microsoft Corporation", "2022"] and a content entity list of ["Zhang San", "Apple Inc.", "San Francisco"]. By matching, the server finds that the entity "Zhang San" appears in both lists, meaning that the "Zhang San" currently being discussed by the user is likely the same person as the "Zhang San" mentioned in the previous chat. This indicates that the user's current input is semantically associated with some of the earlier chat content.
For entities that match successfully, the server further locates their original positions in the historical chat record. Taking "Zhang San" above as an example, the server extracts all historical chat fragments that contain the entity "Zhang San"; these fragments are likely to contain background information related to the current discussion. According to the configured strategy (such as time distance and keyword relevance), the server selects one or more of the most relevant fragments from all related fragments as the historical key information for the current input. The server splices the extracted historical key information with the user content input by the current user to form a complete user input (target user content) containing the context information. The spliced target user content can replace the original user input and is used for subsequent tasks such as semantic understanding and dialogue generation.
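A minimal sketch of this entity-overlap selection is shown below; spaCy is used only as a stand-in NER component, and the splicing format is an assumption:

```python
import spacy

# NER model choice is an assumption; the Chinese pipeline requires
# `python -m spacy download zh_core_web_sm`.
nlp = spacy.load("zh_core_web_sm")

def build_target_user_content(user_content, history):
    """Keep history turns that share a named entity with the current input,
    then splice them in front of the user content (splicing format assumed)."""
    content_entities = {ent.text for ent in nlp(user_content).ents}
    key_info = [
        turn for turn in history
        if content_entities & {ent.text for ent in nlp(turn).ents}
    ]
    return "\n".join(key_info + [user_content])
```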
Finally, the server converts the target user content into a semantic vector representation, i.e., a user vector. The server encodes the target user content through the BERT language model. The BERT language model captures contextual information and long-range dependencies in text through a self-attention mechanism using a Transformer architecture. After inputting the target user content into the BERT language model, the BERT language model generates a fixed-length vector representation, i.e., a user vector, which includes semantic features of the target user content. The user vectors encode semantic information of the user content in a compact, continuous form for subsequent retrieval and matching. The server can use the user vector to calculate the similarity with the target vector in the preset vector database to find out the background description and scenario description most relevant to the user content, so as to generate an accurate and consistent question-answer result.
Step S103, calculating first similarity between the user vector and a first target vector and second similarity between the user vector and a second target vector, wherein the first target vector is any target vector in a preset vector database, and the second target vector is any target vector except the first target vector in the preset vector database, and the preset vector database comprises a plurality of target vectors.
In step S103, the server first obtains any two target vectors from the preset vector database, referred to respectively as the first target vector and the second target vector. The preset vector database is constructed beforehand and stores a large number of target vectors, each corresponding to the target text content of a video scenario segment, whose key information includes the background description, the scenario description, and the like. After selecting the first target vector and the second target vector, the server can calculate their similarity with the user vector. The application adopts cosine similarity as the similarity measure. Let the user vector be u, the first target vector be v1, and the second target vector be v2; the cosine similarity is calculated as follows:
cos(u,v)=(u·v)/(||u||*||v||)
where u·v represents the dot product of the two vectors, and ||u|| and ||v|| represent the L2 norms of the two vectors, respectively. The range of cosine similarity is [-1, 1]; a larger value indicates that the directions of the two vectors are closer, i.e., the similarity is higher.
The server calculates cosine similarity cos (u, v 1) of the user vector u and the first target vector v1, and cosine similarity cos (u, v 2) of the user vector u and the second target vector v2, respectively. The two similarity values reflect semantic relevance between the user content and the video scenario description corresponding to the two target vectors. Through similarity calculation, the server can quickly find out the video scenario segment most relevant to the user content, and provide important reference information for subsequent question and answer generation.
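A small NumPy sketch of the cosine-similarity computation defined above (assuming the user vector and the target vectors come from the same encoder; the toy vectors are placeholders):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """cos(u, v) = (u . v) / (||u|| * ||v||), as in the formula above."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy example: the user vector is compared against two target vectors;
# the one pointing in a closer direction scores higher.
u  = np.array([0.2, 0.8, 0.1])
v1 = np.array([0.25, 0.75, 0.05])
v2 = np.array([0.9, 0.1, 0.4])
first_similarity = cosine_similarity(u, v1)   # close to 1
second_similarity = cosine_similarity(u, v2)  # noticeably smaller
```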
Step S104, if the first similarity is larger than the second similarity, determining first target text content corresponding to the first target vector according to a preset vector database.
In step S104, the server first needs to compare the magnitudes of the similarity (first similarity) between the user vector and the first target vector and the similarity (second similarity) between the user vector and the second target vector. The server compares the similarity through the conditional judgment statement. If the first similarity is greater than the second similarity, it is indicated that the user vector matches the first target vector more, i.e., the user content is more relevant to the first target text content corresponding to the first target vector. At this time, the server executes the subsequent operation, and determines the first target text content corresponding to the first target vector according to the preset vector database. The first target text content includes a background description, a scenario description, etc. of the segment. When the first similarity is determined to be greater than the second similarity, the server needs to search text content corresponding to the first target vector in a preset vector database. First, the server searches in the index of the preset vector database by taking the first target vector as a query condition. After obtaining the first target text content corresponding to the first target vector, the server uses the first target text content as an output result of the step S104 to be used for a subsequent question-answer generating task. The first target text content typically contains rich video scenario information, such as text content, corresponding character labels, background descriptions, and scenario descriptions, that can provide important contextual cues and knowledge support for questions and answers.
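Continuing the sketch, once the similarities are computed, the best-scoring target vector's text can be looked up from the stored id-to-text mapping; a brute-force version over normalized vectors is shown (the Faiss index from the earlier sketch would perform the same search more efficiently):

```python
import numpy as np

def retrieve_best_text(user_vector: np.ndarray,
                       target_vectors: np.ndarray,
                       id_to_text: dict) -> str:
    """Return the target text whose vector is most similar to the user vector.

    Assumes all vectors are L2-normalized, so the dot product equals cosine
    similarity; target_vectors has shape (num_targets, dim).
    """
    scores = target_vectors @ user_vector
    best = int(np.argmax(scores))
    return id_to_text[best]
```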
Step S105, inputting the first target text content and the user content into a preset RAG model to generate reply content aiming at the user content.
In step S105, the server inputs the first target text content and the user content together into a preset RAG model to generate reply content for the user content. The preset RAG model is based on retrieval-augmented generation, which combines the two tasks of information retrieval and text generation and can generate answers that are relevant to the user question and consistent with the context based on the retrieved related text. The core components of the preset RAG model are a retriever and a generator. The retriever is responsible for retrieving the documents most relevant to the user question from a large-scale text library, and the generator generates the final answer text based on these documents and the user question.
In this example, the server has found the first target text content that is most relevant to the user content by similarity calculation. Thus, the task of the retriever has been completed, the server directly inputs the first target text content and the user content into the generator of the RAG model.
The generator of the preset RAG model adopts a Transformer-based sequence-to-sequence (Seq2Seq) model, such as BART or T5. The generator receives two inputs: the first target text content, which serves as the context for answer generation, and the user content, which is the question to be answered. The server needs to convert the first target text content and the user content into an input format acceptable to the preset RAG model. The preset RAG model adopts a sequence format similar to "question: ...\n context: ...\n answer:", where the question part is the user content, the context part is the first target text content, and the answer part is left blank to be generated by the preset RAG model.
For example, if the user content is "Does the male lead end up with the female lead at the end?", the input sequence may be constructed as:
"problem: is the man principal angle finally and the woman principal angle together? at the end of the movie, the men's and women's principal angles end up untangling, thanks to the happiness of living together. An n answer "
The generator part of the preset RAG model is typically based on the Transformer architecture, such as BART. The preset RAG model encodes the input sequence using an encoder based on the self-attention mechanism, converting it into a set of vector representations. The server passes the constructed input sequence to the encoder of the preset RAG model. The encoder first tokenizes the sequence, breaking it down into individual words or subwords, and maps them to corresponding embedding vectors. Then, the encoder transforms and combines the embedding vectors through multiple layers of self-attention and feed-forward neural networks, extracts the semantic information therein, and generates a set of context vectors. The context vectors not only capture the semantics of the user question but also fuse the background information provided by the first target text content. The context vectors generated by the encoder are passed to the decoder of the preset RAG model. The decoder, also based on the Transformer architecture, generates the complete answer sequence step by step from the context vectors and the already generated answer fragments through self-attention and cross-attention mechanisms. The decoding process typically uses an autoregressive approach, i.e., each word is generated by taking the previously generated words as input to predict the next most likely word. This iterative prediction continues until a special end marker (e.g., "[EOS]") is generated. After post-processing, the server obtains a well-formatted, easy-to-read reply text to the user content. Finally, the server returns the answer text to the user, completing the task of step S105.
For example, if the answer generated by the preset RAG model for the above input sequence is "Yes, the male lead and the female lead eventually resolve their misunderstanding and live happily together. [EOS]", the server strips the end marker during post-processing and returns the reply to the user. The final answer seen by the user is: "Yes, the male lead and the female lead eventually resolve their misunderstanding and live happily together."
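A minimal generation sketch using a Hugging Face Seq2Seq model is shown below; the "question/context/answer" prompt format follows the description above, while the checkpoint choice and generation parameters are assumptions of this illustration:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "google/flan-t5-base"  # checkpoint choice is an assumption
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def generate_reply(user_content: str, target_text: str) -> str:
    """Build the question/context/answer prompt and decode a reply."""
    prompt = f"question: {user_content}\ncontext: {target_text}\nanswer:"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Example call (text mirrors the movie example above):
# generate_reply(
#     "Does the male lead end up with the female lead at the end?",
#     "At the end of the movie, the male lead and the female lead resolve "
#     "their misunderstanding and live happily together.",
# )
```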
In one possible implementation, after inputting the first target text content and the user content into a preset RAG model and generating reply content to the user content, the method further includes storing the user content and the reply content into a historical chat record.
Specifically, the server stores the user content and the generated reply content in a historical chat log database for subsequent query and context processing.
Referring to fig. 2, the application further provides a video scenario question-answering device based on a RAG, the device is a server, the server comprises an acquisition module 201 and a processing module 202, the acquisition module 201 is used for acquiring user content input by a user for video content, the processing module 202 is used for converting the user content into a corresponding user vector, the processing module 202 is also used for calculating first similarity between the user vector and a first target vector and second similarity between the user vector and a second target vector, the first target vector is any target vector in a preset vector database, the second target vector is any target vector except the first target vector in the preset vector database, the preset vector database is constructed in advance according to the video content, the processing module 202 is also used for determining first target text content corresponding to the first target vector according to the preset vector database if the first similarity is determined to be greater than the second similarity, and the processing module 202 is also used for inputting the first target text content and the user content into a preset RAG model to generate reply content for the user content.
In one possible implementation, before the obtaining module 201 obtains the user content input by the user, the method further includes the obtaining module 201 obtaining the video file, the processing module 202 converting the video file into the audio file by using a transcoding tool, the processing module 202 performing role separation on the audio file to obtain the audio content corresponding to the roles, the processing module 202 converting the audio content into text content and setting corresponding role labels for the text content, the processing module 202 extracting the video file by using a preset multi-mode big model, extracting key frames, converting the key frames into background descriptions and scenario descriptions for the video file, the processing module 202 integrating the text content, the role labels, the background descriptions and the scenario descriptions corresponding to the text content in time sequence to obtain target text content, and the processing module 202 converting the target text content into target vectors, and storing the target vectors, the target text content and the corresponding relations between the target vectors and the target text content into a preset vector database.
In a possible implementation manner, the processing module 202 converts the user content into a corresponding user vector, and specifically includes the obtaining module 201 obtaining a historical chat record corresponding to the user content, the processing module 202 extracting the historical chat record to obtain historical key information, the processing module 202 splicing the user content and the historical key information to obtain a spliced target user content, and the processing module 202 converting the target user content into the user vector.
In one possible implementation, the processing module 202 extracts the historical chat record to obtain the historical key information, and specifically includes that the processing module 202 performs entity identification on the historical chat record to obtain a corresponding historical entity and performs entity identification on user content to obtain a corresponding content entity, and if the processing module 202 determines that the historical entity and the content entity are the same entity, the historical chat record corresponding to the historical entity is determined to be the historical key information.
In a possible implementation manner, the processing module 202 extracting key frames from the video file by using a preset multimodal large model specifically includes: the processing module 202 determines a plurality of image frames in the video file; the processing module 202 calculates an image similarity between a first image frame and a second image frame, where the first image frame and the second image frame are any two adjacent image frames among the plurality of image frames; and if the processing module 202 determines that the image similarity is less than a preset similarity threshold, the first image frame and the second image frame are determined to be key frames.
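The adjacent-frame comparison could be realized with a plain colour-histogram similarity, as in the sketch below using OpenCV; the embodiment does not prescribe a particular similarity measure, so the histogram correlation and the 0.9 threshold are illustrative assumptions.

```python
import cv2

def extract_key_frames(video_path: str, similarity_threshold: float = 0.9) -> list:
    # Returns indices of frames whose similarity with the previous frame falls below the threshold.
    capture = cv2.VideoCapture(video_path)
    key_frame_indices = []
    previous_hist = None
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        # 3-D colour histogram as a cheap frame descriptor.
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8], [0, 256, 0, 256, 0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if previous_hist is not None:
            image_similarity = cv2.compareHist(previous_hist, hist, cv2.HISTCMP_CORREL)
            if image_similarity < similarity_threshold:
                # Both adjacent frames mark a scene change and are treated as key frames.
                key_frame_indices.extend([index - 1, index])
        previous_hist = hist
        index += 1
    capture.release()
    return sorted(set(key_frame_indices))
```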
In one possible implementation manner, the processing module 202 converting the key frame into a background description and a plot description for the video file specifically includes: the processing module 202 performs feature extraction on the key frame to obtain a corresponding visual feature vector; the processing module 202 inputs the visual feature vector into a preset image description generation model to obtain a text description for the key frame; the processing module 202 extracts keywords from the text description to determine corresponding keywords; and the processing module 202 inputs the keywords into a preset keyword library to obtain the description type corresponding to the keywords, where the description type includes the background description and the plot description, and the preset keyword library includes the correspondence between keywords and description types.
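Classifying a key-frame caption into a background description or a plot description by keyword lookup could be sketched as follows; the keyword extractor is an injected callable, and the small keyword library and fallback rule are invented examples rather than contents of the actual preset keyword library.

```python
def classify_key_frame_caption(caption: str, extract_keywords) -> str:
    # Illustrative preset keyword library mapping keywords to description types.
    preset_keyword_library = {
        "street": "background description",
        "night": "background description",
        "forest": "background description",
        "argue": "plot description",
        "escape": "plot description",
        "reveal": "plot description",
    }
    for keyword in extract_keywords(caption):
        description_type = preset_keyword_library.get(keyword.lower())
        if description_type is not None:
            return description_type
    return "background description"   # fallback when no keyword matches
```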
In one possible implementation, after the processing module 202 inputs the first target text content and the user content into the preset RAG model and generates the reply content for the user content, the following is further included: the processing module 202 stores the user content and the reply content in the historical chat record.
It should be noted that, when the apparatus provided in the foregoing embodiment implements its functions, the division into the above functional modules is merely used as an example; in practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus embodiment provided above and the method embodiments belong to the same concept; for the specific implementation process, reference is made to the method embodiments, which are not repeated here.
The application further provides an electronic device. Referring to fig. 3, fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 300 may include at least one processor 301, at least one network interface 304, a user interface 303, a memory 305, and at least one communication bus 302.
The communication bus 302 is used to implement connection and communication between these components.
The user interface 303 may include a display screen (Display) and a camera (Camera); optionally, the user interface 303 may further include a standard wired interface and a wireless interface.
Optionally, the network interface 304 may include a standard wired interface and a wireless interface (e.g., a WI-FI interface).
The processor 301 may include one or more processing cores. The processor 301 connects various parts of the entire server through various interfaces and lines, and performs the various functions of the server and processes data by running or executing the instructions, programs, code sets or instruction sets stored in the memory 305 and by invoking the data stored in the memory 305. Optionally, the processor 301 may be implemented in at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 301 may integrate one or a combination of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs and the like; the GPU is used for rendering and drawing the content to be displayed on the display screen; and the modem is used for handling wireless communication. It will be appreciated that the modem may also not be integrated into the processor 301 and may instead be implemented by a separate chip.
The memory 305 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). Optionally, the memory 305 includes a non-transitory computer-readable storage medium. The memory 305 may be used to store instructions, programs, code, code sets or instruction sets. The memory 305 may include a program storage area and a data storage area: the program storage area may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the respective method embodiments described above, and the like; the data storage area may store the data involved in the respective method embodiments described above, and the like. Optionally, the memory 305 may also be at least one storage device located remotely from the aforementioned processor 301. Referring to fig. 3, the memory 305, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and an application program of a RAG-based video processing method.
In the electronic device 300 shown in fig. 3, the user interface 303 is mainly used to provide an input interface for the user and to obtain the data input by the user, while the processor 301 may be used to invoke the application program of the RAG-based video processing method stored in the memory 305; when the application program is executed by the one or more processors 301, the electronic device 300 is caused to perform the method as described in one or more of the above embodiments. It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but those skilled in the art should understand that the present application is not limited by the described order of actions, since some steps may be performed in other orders or concurrently according to the present application. Further, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present application.
The application also provides a computer-readable storage medium storing instructions. When the instructions are executed by the one or more processors 301, the electronic device 300 is caused to perform the method as described in one or more of the above embodiments.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for the parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division of units is merely a division of logical functions, and there may be other divisions in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some service interfaces, devices or units, and may be in electrical or other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
If the integrated unit is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present application, in essence or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the various embodiments of the present application. The memory includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a magnetic disk, or an optical disk.
The foregoing is merely an exemplary embodiment of the present disclosure and is not intended to limit the scope of the present disclosure. That is, any equivalent changes and modifications made in accordance with the teachings of this disclosure fall within the scope of the present disclosure. Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure.
This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a scope and spirit of the disclosure being indicated by the claims.

Claims (10)

1. A RAG-based video plot question-answering method, characterized in that the method comprises:
acquiring user content input by a user for video content;
converting the user content into a corresponding user vector;
calculating a first similarity between the user vector and a first target vector and a second similarity between the user vector and a second target vector, wherein the first target vector is any target vector in a preset vector database, the second target vector is any target vector in the preset vector database other than the first target vector, the preset vector database comprises a plurality of the target vectors, and the preset vector database is constructed in advance according to the video content;
if it is determined that the first similarity is greater than the second similarity, determining first target text content corresponding to the first target vector according to the preset vector database;
inputting the first target text content and the user content into a preset RAG model to generate reply content for the user content.
2. The method according to claim 1, characterized in that before acquiring the user content input by the user for the video content, the method further comprises:
acquiring a video file corresponding to the video content;
converting the video file into an audio file by using a transcoding tool;
performing role separation on the audio file to obtain roles and audio content corresponding to the roles;
converting the audio content into text content and setting corresponding role labels for the text content;
extracting key frames from the video file by using a preset multimodal large model, and converting the key frames into a background description and a plot description for the video file;
integrating the text content, the role labels corresponding to the text content, the background description and the plot description in time order to obtain target text content;
converting the target text content into a target vector, and storing the target vector, the target text content and the correspondence between the target vector and the target text content into a preset vector database.
3. The method according to claim 1, characterized in that converting the user content into a corresponding user vector specifically comprises:
acquiring a historical chat record corresponding to the user content;
extracting the historical chat record to obtain historical key information;
splicing the user content and the historical key information to obtain spliced target user content;
converting the target user content into the user vector.
4. The method according to claim 3, characterized in that extracting the historical chat record to obtain the historical key information specifically comprises:
performing entity recognition on the historical chat record to obtain a corresponding historical entity, and performing entity recognition on the user content to obtain a corresponding content entity;
if it is determined that the historical entity and the content entity are the same entity, determining the historical chat record corresponding to the historical entity as the historical key information.
5. The method according to claim 2, characterized in that extracting key frames from the video file by using a preset multimodal large model specifically comprises:
determining a plurality of image frames in the video file;
calculating an image similarity between a first image frame and a second image frame, wherein the first image frame and the second image frame are any two adjacent image frames among the plurality of image frames;
if it is determined that the image similarity is less than a preset similarity threshold, determining the first image frame and the second image frame as key frames.
6. The method according to claim 2, characterized in that converting the key frames into a background description and a plot description for the video file specifically comprises:
performing feature extraction on the key frame to obtain a corresponding visual feature vector;
inputting the visual feature vector into a preset image description generation model to obtain a text description for the key frame;
extracting keywords from the text description to determine corresponding keywords;
inputting the keywords into a preset keyword library to obtain the description type corresponding to the keywords, wherein the description type comprises a background description and a plot description, and the preset keyword library comprises a correspondence between keywords and description types.
7. The method according to claim 1, characterized in that after inputting the first target text content and the user content into the preset RAG model to generate the reply content for the user content, the method further comprises:
storing the user content and the reply content in a historical chat record.
8. A RAG-based video processing device, characterized in that the device comprises an acquisition module (201) and a processing module (202), wherein:
the acquisition module (201) is configured to acquire user content input by a user for video content;
the processing module (202) is configured to convert the user content into a corresponding user vector;
the processing module (202) is further configured to calculate a first similarity between the user vector and a first target vector and a second similarity between the user vector and a second target vector, wherein the first target vector is any target vector in a preset vector database, the second target vector is any target vector in the preset vector database other than the first target vector, the preset vector database comprises a plurality of the target vectors, and the preset vector database is constructed in advance according to the video content;
the processing module (202) is further configured to determine, if it is determined that the first similarity is greater than the second similarity, first target text content corresponding to the first target vector according to the preset vector database;
the processing module (202) is further configured to input the first target text content and the user content into a preset RAG model to generate reply content for the user content.
9. An electronic device, characterized by comprising a processor (301), a memory (305), a user interface (303) and a network interface (304), wherein the memory (305) is configured to store instructions, the user interface (303) and the network interface (304) are configured to communicate with other devices, and the processor (301) is configured to execute the instructions stored in the memory (305), so that the electronic device (300) performs the method according to any one of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores instructions which, when executed, perform the method according to any one of claims 1-7.
CN202411160716.8A 2024-08-22 2024-08-22 A video plot question-answering method and device based on RAG Pending CN119106098A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411160716.8A CN119106098A (en) 2024-08-22 2024-08-22 A video plot question-answering method and device based on RAG

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411160716.8A CN119106098A (en) 2024-08-22 2024-08-22 A video plot question-answering method and device based on RAG

Publications (1)

Publication Number Publication Date
CN119106098A true CN119106098A (en) 2024-12-10

Family

ID=93719785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411160716.8A Pending CN119106098A (en) 2024-08-22 2024-08-22 A video plot question-answering method and device based on RAG

Country Status (1)

Country Link
CN (1) CN119106098A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120067396A (en) * 2025-04-25 2025-05-30 浪潮智能终端有限公司 Retrieval method and device for video content

Similar Documents

Publication Publication Date Title
CN113205817B (en) Speech semantic recognition method, system, device and medium
CN109509470B (en) Voice interaction method and device, computer readable storage medium and terminal equipment
CN113239169B (en) Answer generation method, device, equipment and storage medium based on artificial intelligence
CN114417097B (en) Emotion prediction method and system based on time convolution and self-attention
CN111858876B (en) Knowledge base generation method, text searching method and device
US12111866B2 (en) Term weight generation method, apparatus, device and medium
CN118051635B (en) Conversational image retrieval method and device based on large language model
CN117010907A (en) Multi-mode customer service method and system based on voice and image recognition
CN113392265A (en) Multimedia processing method, device and equipment
CN116662495A (en) Question and answer processing method, method and device for training question and answer processing model
CN119047494B (en) Neural network text translation enhancement method and system in multilingual cross-language environment
CN113836273A (en) Legal consultation method based on complex context and related equipment
CN119128096A (en) A knowledge base question answering method, device and computer readable storage medium
CN116561271A (en) Question and answer processing method and device
CN119106098A (en) A video plot question-answering method and device based on RAG
CN114048757A (en) A sign language synthesis method, device, computer equipment and storage medium
CN117421413A (en) Question-answer pair generation method and device and electronic equipment
CN118467780A (en) Film and television search recommendation method, system, equipment and medium based on large model
CN110942775B (en) Data processing method and device, electronic equipment and storage medium
CN115705705A (en) Video identification method, device, server and storage medium based on machine learning
CN117725153B (en) Text matching method, device, electronic equipment and storage medium
CN115017325B (en) Text-based entity linking, recognition method, electronic device, and storage medium
CN118227910B (en) Media resource aggregation method, device, equipment and storage medium
CN120450058B (en) Digital human interaction method and device and electronic equipment
CN119204208B (en) User question-answering method, device, equipment, medium, and product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20250410

Address after: Room 808, Shangyuan Building, 3335 Xiyou Road, High tech Zone, Hefei City, Anhui Province, China 230000

Applicant after: Anhui Feiling Technology Co.,Ltd.

Country or region after: China

Address before: 6th Floor, Pilot Building No.1, China Shenggu Industrial Park, High tech Zone, Hefei City, Anhui Province, China 230000

Applicant before: Hefei Feier Intelligent Technology Co.,Ltd.

Country or region before: China