
WO2025026012A1 - Video retrieval method - Google Patents

Video retrieval method

Info

Publication number
WO2025026012A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
target
matching
feature sequence
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/CN2024/104568
Other languages
French (fr)
Chinese (zh)
Inventor
李攀登
谢晨伟
赵黎明
郑赟
赵德丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Publication of WO2025026012A1 publication Critical patent/WO2025026012A1/en
Anticipated expiration legal-status Critical
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures

Definitions

  • the embodiments of the present specification relate to the field of data processing technology, and in particular to a video retrieval method.
  • an embodiment of this specification provides a video retrieval method.
  • One or more embodiments of this specification also relate to a video retrieval device, a computing device, a computer-readable storage medium and a computer program to solve the technical defects existing in the prior art.
  • a video retrieval method including:
  • the search text and the target candidate video are input into the video matching model to obtain a matching weight corresponding to the target candidate video output by the video matching model, wherein the target candidate video is any one of the at least one candidate video, and the matching weight is determined based on a first matching result and a second matching result between the search text and the target candidate video, wherein the first matching result is used to characterize the matching degree between the search text and each target object in the target candidate video, and the second matching result is used to characterize the matching degree between the search text and the video semantics of the target candidate video;
  • At least one target video is determined from the at least one candidate video based on the matching weights corresponding to the candidate videos.
  • a video retrieval device including:
  • An acquisition module configured to acquire a search text and at least one candidate video
  • An input module is configured to input the search text and the target candidate video into a video matching model, and obtain a matching weight corresponding to the target candidate video output by the video matching model, wherein the target candidate video is any one of the at least one candidate video, and the matching weight is determined based on a first matching result and a second matching result between the search text and the target candidate video, wherein the first matching result is used to characterize the matching degree between the search text and each target object in the target candidate video, and the second matching result is used to characterize the matching degree between the search text and the video semantics of the target candidate video;
  • the determination module is configured to determine at least one target video from the at least one candidate video based on the matching weight corresponding to each candidate video.
  • a computing device including:
  • the memory is used to store computer executable instructions
  • the processor is used to execute the computer executable instructions.
  • the steps of the above-mentioned video retrieval method are implemented.
  • a computer-readable storage medium which stores computer-executable instructions, and when the instructions are executed by a processor, the steps of the above-mentioned video retrieval method are implemented.
  • a computer program is provided, wherein when the computer program is executed in a computer, the computer is caused to execute the steps of the above-mentioned video retrieval method.
  • the video retrieval method provided in this specification includes: obtaining a retrieval text and at least one candidate video; inputting the search text and the target candidate video into a video matching model to obtain a matching weight corresponding to the target candidate video output by the video matching model, wherein the target candidate video is any one of the at least one candidate video, and the matching weight is determined based on a first matching result and a second matching result between the search text and the target candidate video, the first matching result being used to characterize the matching degree between the search text and each target object in the target candidate video, and the second matching result being used to characterize the matching degree between the search text and the video semantics of the target candidate video; based on the matching weight corresponding to each candidate video, determining at least one target video in the at least one candidate video.
  • the first matching result between the search text and the target candidate video is calculated to determine the matching degree between the search text and each target object in the target candidate video
  • the second matching result between the search text and the target candidate video is calculated to determine the matching degree between the search text and the video semantics of the target candidate video.
  • the matching weight of the target candidate video is determined by combining the first matching result and the second matching result, and the accuracy of determining the matching weight of the target candidate video is improved by analyzing the local content and the overall content of the target candidate video; then, the target video is determined in each candidate video according to the matching weight corresponding to each candidate video, thereby improving the accuracy of text-based video retrieval.
  • FIG1 is an architecture diagram of a video retrieval system provided by an embodiment of the present specification
  • FIG2 is a flow chart of a video retrieval method provided by an embodiment of the present specification.
  • FIG3 is a schematic diagram of a model architecture of a video matching model provided by an embodiment of the present specification
  • FIG4a is a schematic diagram of an interactive interface of a video retrieval method provided by an embodiment of this specification.
  • FIG4b is a schematic diagram of an interactive interface of another video retrieval method provided by an embodiment of the present specification.
  • FIG5 is a flow chart of a processing process of a video retrieval method provided by an embodiment of the present specification
  • FIG6 is a schematic diagram of the structure of a video retrieval device provided by an embodiment of this specification.
  • FIG. 7 is a structural block diagram of a computing device provided by an embodiment of the present specification.
  • Although the terms first, second, etc. may be used to describe various information in one or more embodiments of this specification, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from each other.
  • the first may also be referred to as the second, and similarly, the second may also be referred to as the first.
  • The word "if" as used herein may be interpreted as "at the time of", "when", or "in response to determining".
  • The user information involved includes but is not limited to user device information, user personal information, etc.
  • The data involved includes but is not limited to data used for analysis, stored data, displayed data, etc.
  • the collection, use and processing of relevant data must comply with relevant laws, regulations and standards, and corresponding operation entrances shall be provided for users to choose to authorize or refuse.
  • the large model in one or more embodiments of this specification specifically refers to a deep learning model with large-scale model parameters, which usually contains hundreds of millions, tens of billions, or even hundreds of billions of model parameters.
  • the large model can also be called a foundation model.
  • the large model is pre-trained with large-scale unlabeled corpus to produce a pre-trained model with more than 100 million parameters.
  • This model can adapt to a wide range of downstream tasks and has good generalization capabilities, such as large-scale language models (LLM), multi-modal pre-training models, etc.
  • large models only need a small number of samples to fine-tune the pre-trained model and can be applied to different tasks.
  • Large models can be widely used in natural language processing (NLP), computer vision, and other fields, and are applied to computer vision tasks such as visual question answering (VQA), image caption (IC), and image generation, as well as natural language processing tasks such as text-based sentiment classification, text summary generation, and machine translation.
  • the main application scenarios of the large model include digital assistants, intelligent robots, search, online education, office software, e-commerce, intelligent design, etc.
  • Video text retrieval: Traditional image retrieval tasks are only applicable to searches between images, and their application scenarios are limited. Video text retrieval can achieve cross-modal search between natural language and video, and its model is generally trained on a large number of video-text data pairs.
  • CLIP is a multimodal pre-training model. It can process images and texts simultaneously and combine the two for tasks such as classification, retrieval, and generation. During the training process, CLIP uses a large amount of image and text data. Its pre-training goal is to learn to align the representations of text and images so that similar text and images are closer in the embedding space. To achieve this goal, CLIP uses a contrastive learning method to maximize the cosine similarity of similar text and images, while minimizing the cosine similarity of dissimilar text and images, thereby training an embedding space that can align data of different modalities.
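  • To make the contrastive objective concrete, the following is a minimal sketch (not CLIP's actual source) of a symmetric contrastive loss over a batch of paired embeddings; the function name, the temperature value, and the use of PyTorch are illustrative assumptions.

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss (illustrative, not CLIP's code).
# Assumes two encoders have already produced matched image/text embeddings.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (batch, dim) embeddings of matched image-text pairs."""
    image_emb = F.normalize(image_emb, dim=-1)            # cosine similarity = dot product
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature        # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0))                  # i-th image matches i-th text
    # Symmetric cross entropy: pull matched pairs together, push mismatched pairs apart.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Example: loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```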
  • Transformer: A model that mainly uses self-attention to capture long-range dependencies in a sequence.
  • Transformer contains an encoder and a decoder.
  • the encoder is used to generate a representation of the input sequence.
  • the decoder is used to generate a representation of the target sequence.
  • Both the encoder and the decoder are composed of multiple layers of self-attention and feedforward neural networks.
  • the self-attention mechanism in Transformer is a mechanism that calculates the representation of each element in the sequence, which can help the model focus on the part of the input sequence related to the current position and generate an importance weight vector. For example, when a sentence is input, the meaning of each word in the sentence is related to the other words in the sentence.
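  • As an illustration of the self-attention mechanism described above, the following is a minimal single-head sketch in PyTorch; the class name and the single-head simplification are assumptions and omit the multi-head and feedforward parts of a full Transformer layer.

```python
# Minimal single-head self-attention sketch (illustrative simplification of a Transformer layer).
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); every position attends to every position in the sequence.
        q, k, v = self.q(x), self.k(x), self.v(x)
        weights = torch.softmax(q @ k.transpose(-2, -1) / x.size(-1) ** 0.5, dim=-1)
        return weights @ v  # importance-weighted mixture over the whole sequence

# Example: SelfAttention(512)(torch.randn(2, 16, 512)).shape == (2, 16, 512)
```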
  • Prototype: A representative representation vector.
  • an event can be represented by an event-level vector (prototype).
  • Video text retrieval aims to search for semantically related videos based on natural language input by users, which requires appropriate matching modeling between video and text data.
  • the inherent modal difference phenomenon increases the difficulty of associating multimodal data.
  • multiple unimodal pre-trained models are usually used to extract features, and then metric learning strategies are used to strengthen modal alignment in the joint space.
  • videos contain rich visual elements
  • text descriptions may only correspond to part of the video content, and existing methods lack perception of both the overall and the local content of the video, so the accuracy of video retrieval is insufficient to meet user needs, which reduces the user experience.
  • a video retrieval method is provided.
  • This specification also relates to a video retrieval system, a video retrieval device, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.
  • FIG. 1 shows an architecture diagram of a video retrieval system provided by an embodiment of the present specification.
  • the video retrieval system may include a client 100 and a server 200;
  • the client 100 is used to send a video retrieval instruction to the server 200;
  • the server 200 is used to receive a video search instruction sent by the client 100; obtain a search text and at least one candidate video in response to the video search instruction; input the search text and the target candidate video into a video matching model to obtain a matching weight corresponding to the target candidate video output by the video matching model, wherein the target candidate video is any one of the at least one candidate video, and the matching weight is determined based on a first matching result and a second matching result between the search text and the target candidate video, wherein the first matching result is used to characterize the matching degree between the search text and each target object in the target candidate video, and the second matching result is used to characterize the matching degree between the search text and the video semantics of the target candidate video; determine at least one target video in the at least one candidate video based on the matching weight corresponding to each candidate video; and send at least one target video to the client 100;
  • the client 100 is further configured to receive at least one target video sent by the server 200 .
  • a user can trigger a video retrieval instruction through a client and send the video retrieval instruction to a server.
  • the server obtains the retrieval text carried in the video retrieval instruction and obtains multiple candidate videos from a preset database, inputs the retrieval text and the candidate videos into a video matching model to determine a target video associated with the retrieval text among multiple candidate videos, and feeds the target video back to the client.
  • the video retrieval system may include multiple clients 100 and a server 200, wherein the client 100 may be referred to as a client-side device and the server 200 may be referred to as a cloud-side device.
  • a communication connection may be established between multiple clients 100 through the server 200.
  • the server 200 is used to provide video retrieval services between multiple clients 100.
  • Multiple clients 100 may serve as a sender or a receiver, respectively, and achieve communication through the server 200.
  • the user can interact with the server 200 through the client 100 to receive data sent by other clients 100, send data to other clients 100, etc.
  • the user may publish a data stream to the server 200 through the client 100, and the server 200 determines at least one target video according to the data stream and pushes the at least one target video to other clients that establish communication.
  • the client 100 and the server 200 are connected via a network.
  • the network provides a medium for a communication link between the client 100 and the server 200.
  • the network may include various connection types, such as wired or wireless communication links or optical fiber cables, etc.
  • the data transmitted by the client 100 may need to be encoded, transcoded, compressed, etc. before being released to the server 200.
  • the client 100 may be a browser, an APP (Application), or a web application such as an H5 (HyperText Markup Language5, the fifth edition of Hypertext Markup Language) application, or a light application (also known as a mini-program, a lightweight application) or a cloud application, etc.
  • the client 100 may be developed based on the software development kit (SDK, Software Development Kit) of the corresponding service provided by the server 200, for example based on the real-time communication (RTC, Real Time Communication) SDK, etc.
  • the electronic device may have a display screen and support information browsing, etc., such as a personal mobile terminal, e.g., a mobile phone, a tablet computer, or a personal computer.
  • applications may also be configured in the electronic device, such as human-computer dialogue applications, model training applications, text processing applications, web browser applications, shopping applications, search applications, instant messaging tools, email clients, social platform software, etc.
  • the server 200 may include servers that provide various services, such as servers that provide communication services to multiple clients, servers for background training that support models used on clients, and servers that process data sent by clients. It should be noted that the server 200 can be implemented as a distributed server cluster consisting of multiple servers, or as a single server.
  • the server can also be a server of a distributed system, or a server combined with a blockchain.
  • the server can also be a cloud server for basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content distribution networks (CDN, Content Delivery Network), and big data and artificial intelligence platforms, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology.
  • the video retrieval method provided in the embodiments of this specification is generally executed by the server, but in other embodiments of this specification, the client may also have similar functions to the server, thereby executing the video retrieval method provided in the embodiments of this specification.
  • Alternatively, the video retrieval method may be executed by the client and the server together.
  • FIG. 2 shows a flow chart of a video retrieval method provided by an embodiment of the present specification, which specifically includes the following steps:
  • Step 202 Obtain a search text and at least one candidate video.
  • Although each short video platform makes corresponding video recommendations based on the text entered by the user, the recommended videos may not match the user's search text or may match it only poorly, thereby reducing the user's viewing experience.
  • the video retrieval method provided in an embodiment of the present specification can recommend videos matching the retrieval text to the user based on the retrieval text input by the user.
  • the search text refers to the text used to search for videos, and the search text may be a keyword, a phrase, a sentence, or a text paragraph, etc.
  • the candidate video refers to a video stored in a preset database, which is video data for searching videos based on the search text.
  • the user can manually enter the search text to be searched in the search bar, or input the search text by voice, thereby generating a video search instruction.
  • the client sends the video search instruction to the server.
  • After receiving the video search instruction, the server obtains the search text carried in the video search instruction and obtains the candidate videos in the preset database.
  • the client sends the video retrieval instruction generated for "cooking noodles” to the server.
  • the server obtains the retrieval text "cooking noodles” carried in the video retrieval instruction and obtains multiple candidate videos from the preset database.
  • the search text and the multiple candidate videos can be matched to determine the matching degree between the search text and the multiple candidate videos, and based on the matching degree between the search text and the multiple candidate videos, determine the video that the user needs to retrieve.
  • the video matching model can be used to process the search text and each candidate video.
  • Step 204 input the search text and the target candidate video into a video matching model to obtain a matching weight corresponding to the target candidate video output by the video matching model.
  • the target candidate video refers to any one of the candidate videos.
  • the video matching model is used to determine the matching weight between the search text and each candidate video.
  • the matching weight is used to characterize the matching degree between the search text and each candidate video, which can be expressed by a numerical value, such as 0.5, 0.8, or 50%, 80%, etc.
  • the acquired search text and target candidate video are input into the video matching model, and the matching weight corresponding to the target candidate video output by the video matching model can be obtained.
  • the video matching model includes a coding layer, a first matching layer, and a second matching layer;
  • Inputting the search text and the target candidate video into a video matching model, and obtaining a matching weight corresponding to the target candidate video output by the video matching model comprises:
  • the target object feature sequence, the text feature sequence and the first matching result are input into the second matching layer, a second matching result is determined according to the target object feature sequence and the text feature sequence, and a matching weight corresponding to the target candidate video is determined based on the first matching result and the second matching result.
  • the encoding layer is used to encode the search text and each candidate video to obtain a text feature sequence corresponding to the search text;
  • the first matching layer is used to determine the first matching result between the search text and each candidate video, and the first matching result is used to characterize the matching degree between the search text and each target object in each candidate video, and the target object may refer to a person, an object, etc. in the video frame;
  • the second matching layer is used to determine the second matching result between the search text and each candidate video, and the second matching result is used to characterize the matching degree between the search text and the video semantics of each candidate video.
  • the retrieval text and the target candidate video are input into the video matching model.
  • the retrieval text and the target candidate video need to be input into the encoding layer of the video matching model, and the retrieval text and the target candidate video are encoded respectively to obtain the text feature sequence corresponding to the retrieval text and the target video frame block feature sequence corresponding to the target candidate video.
  • the text feature sequence refers to a feature sequence composed of text feature vectors corresponding to the search text, and the text feature vectors include word feature vectors and identifier feature vectors;
  • the target video frame block feature sequence refers to a feature sequence composed of target video frame block feature vectors corresponding to the target candidate video, and the target video frame block feature vectors include key frame block feature vectors and global frame feature vectors.
  • the retrieved text and the target candidate video belong to data of different modalities
  • the retrieved text and the target candidate video can be encoded using a text encoder and a video encoder respectively.
  • the encoding layer includes a text encoder and a video encoder
  • the target candidate video is input into the video encoder to obtain a target video frame block feature sequence of the target candidate video.
  • the text encoder refers to an encoder used to encode the retrieved text
  • the video encoder refers to an encoder used to encode the candidate video.
  • the search text and the target candidate video are input into the encoding layer of the video matching model, that is, the search text is input into the text encoder, and the target candidate video is input into the video encoder, thereby obtaining the text feature sequence corresponding to the search text and the target video frame block feature sequence corresponding to the target candidate video.
  • obtaining the text feature sequence of the search text includes:
  • a text feature sequence of the search text is obtained.
  • Preset text identifiers refer to identifiers that are pre-set to determine the start and end positions of the search text, including the start identifier [start] and the end identifier [end].
  • the identifier feature vector is the vector representation of the start identifier and the end identifier, where the identifier feature vector corresponding to the end identifier is also the global feature vector of the search text.
  • the search text is segmented and recognized to obtain each word in the search text, and each word is encoded to obtain a word feature vector corresponding to each word, a preset text identifier is obtained, the preset identifier is encoded, an identifier feature vector corresponding to each preset identifier is obtained, and a text feature sequence is constructed based on each word feature vector and each identifier feature vector.
  • The text feature sequence obtained based on the above method can be expressed as Y_i = {y_S, y_1, …, y_M, y_E} ∈ R^((M+2)×D), where y_S is the start identifier feature vector, y_1, …, y_M are the word feature vectors, y_E is the end identifier feature vector and also the global feature vector of the retrieved text Y_i, M is the number of words, and D is the feature dimension, usually set to 512; this specification does not limit the feature dimension of the feature vector.
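  • The assembly of this text feature sequence can be sketched as follows, assuming a simple embedding table as the word encoder; the class name, the learned [start]/[end] embeddings, and the use of PyTorch are illustrative stand-ins for the text encoder described above.

```python
# Hypothetical sketch of assembling the text feature sequence Y_i = {y_S, y_1, ..., y_M, y_E}.
# The embedding table stands in for the text encoder; all names are illustrative.
import torch
import torch.nn as nn

class TextSequenceBuilder(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)    # word feature vectors y_1 .. y_M
        self.start_emb = nn.Parameter(torch.randn(dim))  # y_S: [start] identifier feature vector
        self.end_emb = nn.Parameter(torch.randn(dim))    # y_E: [end] identifier / global feature vector

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (M,) indices of the words obtained by segmenting the search text.
        words = self.word_emb(token_ids)                                     # (M, D)
        return torch.cat([self.start_emb[None], words, self.end_emb[None]])  # (M + 2, D)

# Example: TextSequenceBuilder(30000)(torch.tensor([4, 17, 9])).shape == (5, 512)
```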
  • obtaining a target video frame block feature sequence of the target candidate video includes:
  • a target video frame block feature sequence of the target candidate video is obtained.
  • the preset number refers to the number of video frames that need to be sampled in advance; the key frame refers to the video frame sampled from the target candidate video; the key frame block feature sequence refers to the feature sequence composed of the key frame block feature vectors corresponding to the key frame.
  • the preset classification embedding character refers to a special classification embedding character that is pre-set and inserted before the key frame. It is used in the classification task and can be specifically represented by [CLS].
  • the global frame feature vector refers to the vector representation of the preset classification embedding character.
  • In practice, the preset number of key frames is obtained, the preset number of key frames are sampled from the target candidate video, each sampled key frame is segmented to obtain multiple video frame blocks, and then each video frame block is encoded to obtain the key frame block feature vector corresponding to each video frame block, so that a key frame block feature sequence corresponding to each key frame is generated.
  • a preset classification embedding symbol is obtained, and the preset classification embedding symbol is encoded to obtain a global frame feature vector corresponding to the preset classification embedding symbol. Based on each key frame block feature sequence and each global frame feature vector, a target video frame block feature sequence is constructed.
  • The target video frame block feature sequence obtained based on the above method can be expressed as V = {v_cls^l, v_1^l, …, v_K^l}_{l=1..L} ∈ R^(L×(K+1)×D), where v_cls^l is the global frame feature vector, v_k^l is the k-th key frame block feature vector of the l-th frame, {v_k^l}_{k=1..K} is the key frame block feature sequence of the l-th frame, K is the number of video frame blocks per frame, L is the preset number, and D is the feature dimension, usually set to 512 dimensions; this specification does not limit the feature dimension of the feature vector.
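  • A corresponding sketch of the target video frame block feature sequence is given below, assuming the key frames have already been sampled and split into flattened blocks; the linear patch projection stands in for whatever video encoder is actually used, and all names are hypothetical.

```python
# Hypothetical sketch of the target video frame block feature sequence for L sampled key frames,
# each split into K flattened blocks; the linear projection stands in for the real video encoder.
import torch
import torch.nn as nn

class FrameBlockEncoder(nn.Module):
    def __init__(self, patch_pixels: int, dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(patch_pixels, dim)    # encodes each video frame block
        self.cls = nn.Parameter(torch.randn(dim))   # preset classification embedding [CLS]

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (L, K, patch_pixels) — key frames already segmented into flattened blocks.
        blocks = self.proj(frames)                   # (L, K, D) key frame block feature vectors
        cls = self.cls.expand(frames.size(0), 1, -1) # one global frame feature vector per key frame
        return torch.cat([cls, blocks], dim=1)       # (L, K + 1, D)

# Example: FrameBlockEncoder(3 * 16 * 16)(torch.randn(8, 196, 3 * 16 * 16)).shape == (8, 197, 512)
```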
  • the text feature sequence and the target video frame block feature sequence are input into the first matching layer of the video matching model to obtain the phrase feature sequence corresponding to the text feature sequence and the target object feature sequence corresponding to the target video frame block feature sequence, so that the first matching result can be determined based on the phrase feature sequence and the target object feature sequence.
  • the first matching layer includes a spatial prototype generator and a target phrase matcher
  • the phrase feature sequence and the target object feature sequence are input into the target phrase matcher to obtain a first matching result.
  • the spatial prototype generator is used to aggregate each text feature vector in the text feature sequence into a phrase feature vector, and to aggregate each target video frame block feature vector in the target video frame block feature sequence into a target object feature vector.
  • the target phrase matcher is used to perform feature matching between each target object feature vector and each phrase feature vector.
  • The phrase feature sequence refers to the feature sequence composed of the phrase feature vectors aggregated from the text feature sequence.
  • the target object feature sequence refers to the feature sequence composed of the feature vectors of each target object aggregated from the target video frame block feature sequence.
  • the text feature sequence and the target video frame block feature sequence are input into the spatial prototype generator, and the phrase feature sequence corresponding to the text feature sequence and the target object feature sequence corresponding to the target video frame block feature sequence can be obtained. Then, the obtained phrase feature sequence and target object feature sequence are input into the target phrase matcher, and the first matching result for characterizing the matching degree between the search text and each target object in the target candidate video can be obtained.
  • the implementation method of obtaining the phrase feature sequence is as follows:
  • obtaining a phrase feature sequence corresponding to the text feature sequence includes:
  • the weight corresponding to each text feature vector can be predicted, and the text feature vectors can be aggregated according to the weight corresponding to each text feature vector to generate a phrase feature sequence.
  • the target text feature vector refers to any one of the text feature vectors in the text feature sequence.
  • the predicted text weight is used to characterize the importance of the text feature vector in obtaining effective information.
  • The phrase feature sequence can be expressed as P_p ∈ R^(N_p×D), where N_p is the number of phrase feature vectors and D is the feature dimension; each phrase feature vector is generated by aggregating the text feature vectors according to their predicted text weights, as sketched below.
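  • A minimal sketch of the weight-predicted aggregation follows; it applies equally to the phrase prototypes (from text feature vectors) and to the target object prototypes (from target video frame block feature vectors). The softmax normalization over inputs and the module name are assumptions, since the exact aggregation rule is not spelled out above.

```python
# Hypothetical sketch of weight-predicted aggregation into prototypes (phrase or target object
# feature vectors); the softmax normalization over inputs is an assumed aggregation rule.
import torch
import torch.nn as nn

class PrototypeGenerator(nn.Module):
    def __init__(self, dim: int, num_prototypes: int):
        super().__init__()
        self.weight_pred = nn.Linear(dim, num_prototypes)   # predicted weight per vector, per prototype

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (N, D) text feature vectors or target video frame block feature vectors.
        w = torch.softmax(self.weight_pred(feats), dim=0)    # (N, num_prototypes), normalized over inputs
        return w.t() @ feats                                  # (num_prototypes, D) aggregated prototypes

# Example: phrases = PrototypeGenerator(512, 4)(torch.randn(12, 512))  # N_p = 4 phrase feature vectors
```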
  • the method for obtaining the target object feature sequence is as follows:
  • Obtaining the target object feature sequence corresponding to the target video frame block feature sequence includes:
  • a target object feature sequence is obtained.
  • the weights corresponding to the feature vectors of each target video frame block can be predicted. According to the weights corresponding to the feature vectors of each target video frame block, the feature vectors of each target video frame block are filtered and aggregated to generate a target object feature sequence.
  • the first frame block feature vector refers to any one of the target video frame block feature vectors in the target video frame block feature sequence.
  • the predicted frame weight is used to characterize the importance of the target video frame block feature vector in obtaining effective information; N_o is the number of target object feature vectors.
  • any target video frame block feature vector is selected as the first frame block feature vector in the target video frame block feature sequence, the predicted frame weight of the first frame block feature vector is determined, and the first object feature vector is generated based on the first frame block feature vector and the predicted frame weight of the first frame block feature vector, and then, the target object feature sequence is formed based on the generated first object feature vectors.
  • The target object feature sequence can be expressed as P_o ∈ R^(N_o×D); each target object feature vector is generated by aggregating the target video frame block feature vectors according to their predicted frame weights.
  • redundant text feature vectors and redundant target video frame block feature vectors can be filtered, and a phrase feature sequence corresponding to the text feature sequence and a target object feature sequence corresponding to the target video frame block feature sequence can be aggregated to improve the accuracy of determining the target video.
  • the phrase feature sequence and the target object feature sequence may be input into a target phrase matcher to obtain a first matching result.
  • obtaining a first matching result includes:
  • a first matching result is generated.
  • the feature vector of the object to be processed refers to any one of the feature vectors of the target object in the feature sequence of the target object.
  • the first target similarity specifically refers to the maximum similarity among the first similarities
  • the second target similarity specifically refers to the maximum similarity among the second similarities.
  • the initial matching result refers to the matching result between the feature vector of the object to be processed and the feature vectors of each phrase, and is used to determine the phrase feature vector with the maximum similarity to the feature vector of the object to be processed among the feature vectors of each phrase.
  • In practice, any target object feature vector is selected in the target object feature sequence for processing, that is, as the object feature vector to be processed; the first similarity between the object feature vector to be processed and each phrase feature vector in the phrase feature sequence, and the second similarity between the object feature vector to be processed and each target video frame block feature vector in the target video frame block feature sequence, are calculated; the maximum first similarity is determined as the target first similarity, and the maximum second similarity is determined as the target second similarity; the final similarity of the object feature vector to be processed is then determined according to the target first similarity and the target second similarity, and this similarity is determined as the initial matching result of the object feature vector to be processed. Further, based on the same method, the initial matching results of each target object feature vector in the target object feature sequence are obtained, and the first matching result is generated from these initial matching results, as sketched below.
  • In this way, the matching degree between the search text and each target object in the target candidate video can be obtained.
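  • A sketch of this object-phrase matching step follows. Cosine similarity, the product used to fuse the two maxima, and the mean pooling over object vectors are placeholder assumptions, since only the max-similarity selection is specified above.

```python
# Hypothetical sketch of the first matching result s_op: for each target object feature vector,
# take the maximum similarity to the phrase feature vectors and to the frame block feature vectors.
# Cosine similarity, the product fusion, and the mean pooling are placeholder assumptions.
import torch
import torch.nn.functional as F

def first_matching_result(objects: torch.Tensor, phrases: torch.Tensor,
                          blocks: torch.Tensor) -> torch.Tensor:
    # objects: (N_o, D), phrases: (N_p, D), blocks: (N_blocks, D)
    objects, phrases, blocks = (F.normalize(t, dim=-1) for t in (objects, phrases, blocks))
    sim_phrase = (objects @ phrases.t()).max(dim=1).values  # target first similarity per object
    sim_block = (objects @ blocks.t()).max(dim=1).values    # target second similarity per object
    per_object = sim_phrase * sim_block                      # assumed fusion of the two maxima
    return per_object.mean()                                 # assumed pooling into s_op
```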
  • the target object feature sequence, the text feature sequence and the first matching result are input into the second matching layer to obtain the second matching result between the search text and the target candidate video, so that the matching weight corresponding to the target candidate video can be determined according to the first matching result and the second matching result.
  • the second matching layer includes a temporal prototype generator and a semantic matcher
  • Inputting the target object feature sequence, the text feature sequence, and the first matching result into the second matching layer, and determining a second matching result according to the target object feature sequence and the text feature sequence, comprising:
  • a global feature vector is determined in the text feature sequence, and the semantic feature sequence and the global feature vector are input into the semantic matcher to obtain a second matching result.
  • the temporal prototype generator is used to aggregate the target object feature vectors in the target object feature sequence into a semantic feature vector.
  • the semantic feature sequence refers to a feature sequence composed of the semantic feature vectors aggregated from the target object feature sequence.
  • the semantic matcher is used to perform feature matching between the global feature vector and the semantic feature vectors.
  • In practice, the target object feature sequence is input into the temporal prototype generator to obtain a semantic feature sequence corresponding to the target object feature sequence; then, a global feature vector is determined in the text feature sequence, and the global feature vector and the semantic feature sequence are input into the semantic matcher to obtain a second matching result between the search text and the target candidate video.
  • the implementation method of obtaining the semantic feature sequence corresponding to the target object feature sequence is as follows:
  • obtaining a semantic feature sequence corresponding to the target object feature sequence includes:
  • a frame decoder is provided in the temporal prototype generator, which can decode each target object feature vector to generate a key frame feature vector at the frame level, and further, can generate a semantic feature vector based on each key frame feature vector.
  • the key frame feature sequence refers to the feature sequence composed of the key frame feature vectors obtained by decoding the target object feature vector.
  • The key frame feature sequence can be expressed as P_f ∈ R^(L×D), where L is the preset number of key frames.
  • the semantic feature sequence refers to a feature sequence composed of semantic feature vectors generated by the interaction of each key frame feature vector in the key frame feature sequence.
  • The semantic feature sequence can be expressed as P_e ∈ R^(N_e×D), where N_e is the number of semantic feature vectors.
  • a frame decoder is used to decode each target object feature vector in the target object feature sequence, and the spatial relationship between each target object feature vector is analyzed to obtain each key frame feature vector to form a corresponding key frame feature sequence.
  • the association relationship between each key frame feature vector in the key frame feature sequence is determined, and at least one semantic feature vector is generated according to each association relationship, and a semantic feature sequence is generated based on the at least one semantic feature vector.
  • the attention mechanism can be used to analyze the spatial relationship between the feature vectors of each target object.
  • the frame query feature vector Qf is randomly initialized, each target object feature vector Po is linearly transformed, and the feature vectors after linear transformation are used as the frame key feature vector Ko and the frame value feature vector Vo , respectively.
  • the spatial relationship between the feature vectors of each target object is obtained, and based on the spatial relationship between the feature vectors of each target object, a key frame feature sequence is generated.
  • the mask attention calculation can be implemented by the following formula (1):
  • P_f = softmax(Q_f · K_o^T / √D + M_f) · V_o    (1)
  • where P_f is the key frame feature vector, Q_f is the frame query feature vector, K_o is the frame key feature vector, V_o is the frame value feature vector, M_f is the attention mask, and softmax(·) is the normalized exponential function.
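  • A sketch of the frame decoder realizing formula (1) is given below; the scaled dot-product form and the shape of the attention mask are assumptions, and the mask is passed in unchanged because its construction is not reproduced above.

```python
# Hypothetical sketch of the frame decoder realizing formula (1): learned frame queries attend
# over the target object feature vectors under an attention mask M_f.
import torch
import torch.nn as nn

class FrameDecoder(nn.Module):
    def __init__(self, dim: int, num_frames: int):
        super().__init__()
        self.q_f = nn.Parameter(torch.randn(num_frames, dim))  # randomly initialized frame queries Q_f
        self.k_proj = nn.Linear(dim, dim)                       # linear transform -> frame key features K_o
        self.v_proj = nn.Linear(dim, dim)                       # linear transform -> frame value features V_o

    def forward(self, p_o: torch.Tensor, m_f: torch.Tensor) -> torch.Tensor:
        # p_o: (N_o, D) target object feature vectors; m_f: (num_frames, N_o) attention mask.
        k_o, v_o = self.k_proj(p_o), self.v_proj(p_o)
        attn = torch.softmax(self.q_f @ k_o.t() / p_o.size(-1) ** 0.5 + m_f, dim=-1)
        return attn @ v_o                                        # P_f: (num_frames, D) key frame features

# Example: FrameDecoder(512, 8)(torch.randn(20, 512), torch.zeros(8, 20)).shape == (8, 512)
```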
  • the attention mechanism can be used to dynamically analyze the semantic relationship between the feature vectors of each key frame.
  • the semantic query feature vector Qe is randomly initialized, each key frame feature vector Pf is linearly transformed, and the feature vectors after linear transformation are respectively used as the semantic key feature vector Kf and the semantic value feature vector Vf .
  • the association relationship between the feature vectors of each key frame is obtained, and based on the association relationship between the feature vectors of each key frame, a semantic feature sequence is generated.
  • the number of semantic feature vectors can be customized based on the actual application situation, for example, the number of semantic feature vectors can be set to 2, 3, etc.
  • The corresponding calculation can be expressed as P_e = softmax(Q_e · K_f^T / √D) · V_f, where P_e is the semantic feature vector, Q_e is the semantic query feature vector, K_f is the semantic key feature vector, V_f is the semantic value feature vector, and softmax(·) is the normalized exponential function.
  • the spatial relationship between the feature vectors of each target object is analyzed to generate a key frame feature sequence, and the semantic relationship between the feature vectors of each key frame is analyzed to generate a semantic feature sequence, so that the video semantic diversity of the target candidate video can be known.
  • the target video can be retrieved in combination with the video semantics of the candidate video, which can improve the accuracy of retrieving the target video.
  • the search text needs to be matched with the video semantics of the target candidate video to obtain a second matching result between the search text and the target candidate video.
  • obtaining a second matching result includes:
  • a second matching result between the search text and the target candidate video is determined.
  • the semantic similarity between the global feature vector and each semantic feature vector in the semantic feature sequence is calculated, the maximum semantic similarity is determined among the semantic similarities between the global feature vector and each semantic feature vector, and the maximum semantic similarity is determined as the second matching result.
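  • The semantic matching step can be sketched as follows; the use of cosine similarity is an assumption, since the specification only states that the maximum semantic similarity is taken as the second matching result.

```python
# Hypothetical sketch of the semantic matcher: the second matching result s_es is the maximum
# similarity between the global text feature vector y_E and the semantic feature vectors P_e.
import torch
import torch.nn.functional as F

def second_matching_result(global_text: torch.Tensor, semantics: torch.Tensor) -> torch.Tensor:
    # global_text: (D,) global feature vector of the search text; semantics: (N_e, D).
    sims = F.normalize(semantics, dim=-1) @ F.normalize(global_text, dim=0)
    return sims.max()  # s_es
```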
  • the matching degree between the search text and the video semantics of the target candidate video can be known.
  • a matching weight of the target candidate video is determined based on the first matching result and the second matching result.
  • the first matching result and the second matching result may be weighted and calculated, and the calculation result may be determined as the matching weight of the target candidate video.
  • In the weighted calculation, s denotes the matching weight, s_es denotes the second matching result, s_op denotes the first matching result, and α denotes the spatial matching factor; α can be manually adjusted based on actual application conditions, and the specific numerical setting is not limited in this specification (see the sketch below).
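  • One possible weighted combination is sketched below; whether α weights the first matching result in a convex combination, as assumed here, or enters the calculation in another form is not specified above.

```python
# Hypothetical sketch of fusing the two results into the matching weight s; the convex
# combination weighted by the spatial matching factor alpha is an assumed form.
def matching_weight(s_op: float, s_es: float, alpha: float = 0.5) -> float:
    return alpha * s_op + (1.0 - alpha) * s_es  # s: matching weight of the target candidate video
```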
  • the video retrieval method provided in this specification determines the matching degree between the retrieval text and each target object in the target candidate video by calculating the first matching result between the retrieval text and the target candidate video, and determines the matching degree between the retrieval text and the video semantics of the target candidate video by calculating the second matching result between the retrieval text and the target candidate video. Then, the matching weight of the target candidate video is determined by combining the first matching result and the second matching result, and the accuracy of determining the matching weight of the target candidate video is improved by analyzing the local content and the overall content of the target candidate video.
  • FIG. 3 shows a schematic diagram of the model architecture of a video matching model provided by an embodiment of the present specification.
  • the video matching model includes an encoding layer, a first matching layer, and a second matching layer.
  • the encoding layer includes a video encoder and a text encoder
  • the first matching layer includes a spatial prototype generator and a target phrase matcher
  • the second matching layer includes a temporal prototype generator and a semantic matcher.
  • the search text is input into the text encoder of the encoding layer to obtain the text feature sequence corresponding to the search text
  • the target candidate video is input into the video encoder of the encoding layer to obtain the target video frame block feature sequence corresponding to the target candidate video
  • the text feature sequence and the target video frame block feature sequence are input into the spatial prototype generator of the first matching layer to obtain the phrase feature sequence corresponding to the text feature sequence and the target object feature sequence corresponding to the target video frame block feature sequence
  • the phrase feature sequence and the target object feature sequence are input into the target phrase matcher of the first matching layer to obtain the first matching result
  • the target object feature sequence is input into the temporal prototype generator of the second matching layer to obtain the semantic feature sequence corresponding to the target object feature sequence
  • the global feature vector and the semantic feature sequence in the text feature sequence are input into the semantic matcher of the second matching layer to obtain the second matching result
  • the second matching layer calculates the matching weight of the target candidate video based on the first matching result and the second matching result.
  • the video matching model is trained by the following method:
  • the model parameters of the video matching model are adjusted according to the model loss value, and the video matching model is continuously trained until a training stop condition is reached.
  • the training data sample pair refers to the text-video data pair obtained in the training data sample pair set, which is the training sample of the video matching model, including the positive training data sample pair and the negative training data sample pair;
  • the training data sample pair set refers to a set of text-video data pairs composed of the text content collected from input speech or input text and the videos retrieved according to that text content;
  • the matching weight label refers to the actual matching weight corresponding to the training data sample pair;
  • the predicted matching weight refers to the matching weight output by the video matching model after the training data sample pair is input into the video matching model;
  • the model loss value is used to measure the difference between the matching weight label and the predicted matching weight.
  • the text data in the training data sample pair is obtained by the above-mentioned acquisition method of obtaining the retrieved text, and the video obtained based on the text data retrieval and the corresponding text data constitute the positive training data sample pair, and the other text-video data pairs in the same batch in the training data sample pair set constitute the negative training data sample pair.
  • the training data sample pair is input into the video matching model, and the video matching model is used to predict the matching weight of the training data sample pair.
  • Because the video matching model has not yet been fully trained, there will be a deviation between the predicted matching weight and the actual matching weight label, and the model parameters of the video matching model need to be adjusted accordingly.
  • the model loss value of the video matching model is calculated according to the output predicted matching weight and matching weight label.
  • the loss function for calculating the model loss value can be a 0-1 loss function, a square loss function, a cross entropy loss function, etc. in actual applications.
  • the cross entropy function is selected as the loss function for calculating the model loss value, and the model parameters of the video matching model are adjusted according to the model loss value.
  • the adjusted model parameters are used for the next batch of training data sample pairs to continue training the video matching model until the stop condition of the model training is reached.
  • the model training stopping conditions include that the model loss value is less than a preset threshold and/or the training rounds reach a preset round.
  • the preset threshold is 0.3.
  • the preset training rounds are 30 rounds.
  • When the training rounds of the training data sample pairs reach 30 rounds, the video matching model training is considered to be completed.
  • When the two training stop conditions, a preset threshold and a preset number of training rounds, are both set, the model loss value and the training rounds are monitored simultaneously, and the video matching model training is considered completed when a stop condition is met.
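  • The training procedure described above can be sketched as follows; the optimizer, the binary cross-entropy stand-in for the cross entropy loss, the learning rate, and the data loader interface are all assumptions.

```python
# Hypothetical sketch of the training loop described above; the optimizer, the binary
# cross-entropy stand-in for the cross entropy loss, and the loader interface are assumptions.
import torch

def train(model, loader, epochs: int = 30, loss_threshold: float = 0.3, lr: float = 1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.BCELoss()                       # stand-in for the cross entropy loss
    for epoch in range(epochs):                        # preset training rounds (e.g. 30)
        epoch_loss = 0.0
        for text, video, weight_label in loader:       # positive/negative training data sample pairs
            predicted = model(text, video)             # predicted matching weight in [0, 1]
            loss = loss_fn(predicted, weight_label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        if epoch_loss / max(len(loader), 1) < loss_threshold:  # preset threshold (e.g. 0.3)
            break
```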
  • the video retrieval method provided in this specification analyzes and processes the retrieval text and the target candidate video through a trained video matching model to obtain the matching weight corresponding to the target candidate video, thereby improving the accuracy of determining the matching weight of the target candidate video.
  • Step 206 Determine at least one target video from the at least one candidate video based on the matching weight corresponding to each candidate video.
  • the target video refers to a video that has a high degree of matching with the search text and can be used to be displayed on the client based on the search text.
  • the target video corresponding to the search text can be determined in the candidate videos based on the matching weights corresponding to the candidate videos.
  • determining at least one target video from the at least one candidate video includes:
  • the candidate videos are sorted according to the matching weights corresponding to the candidate videos to obtain a candidate video list, and a target video is determined in the candidate video list based on a preset number of videos.
  • the preset matching weight threshold refers to the preset minimum value of the matching weight, which is used to measure the matching degree between each candidate video and the search text.
  • the preset number of videos refers to the preset number of selectable target videos.
  • a preset matching weight threshold can be obtained, and the candidate videos whose matching weights are greater than or equal to the preset matching weight threshold are determined as target videos.
  • a preset number of videos can also be obtained, and then, according to the matching weight corresponding to each candidate video, each candidate video is sorted to obtain a candidate video list, and a preset number of candidate videos are selected from the candidate video list as target videos.
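  • Both selection strategies can be sketched in a few lines; the function name and argument interface are illustrative assumptions.

```python
# Hypothetical sketch of both selection strategies: keep candidates above a preset matching
# weight threshold, or sort by matching weight and keep a preset number of videos.
def select_target_videos(candidates, weights, threshold=None, top_n=None):
    # candidates: video identifiers; weights: matching weights aligned with candidates.
    ranked = sorted(zip(candidates, weights), key=lambda cw: cw[1], reverse=True)
    if threshold is not None:
        ranked = [(c, w) for c, w in ranked if w >= threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return [c for c, _ in ranked]

# Example: select_target_videos(["v1", "v2", "v3"], [0.9, 0.4, 0.7], threshold=0.5) -> ["v1", "v3"]
```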
  • the implementation method for determining the target video can be set according to actual application conditions, and this specification does not limit it here.
  • the video retrieval method provided in this specification after obtaining the matching weights corresponding to each candidate video, determines the target video among the candidate videos based on the matching weights corresponding to each candidate video, thereby improving the accuracy of determining the target video.
  • Figure 4a shows a schematic diagram of an interactive interface of a video retrieval method provided according to an embodiment of the present specification.
  • a user can enter a search text to be searched in the search bar of the client to generate a video retrieval instruction.
  • the server uses the above-mentioned video retrieval method to determine at least one target video in a preset database, and feeds back and displays the target video to the client, and its display interface can be shown in Figure 4a.
  • FIG. 4b shows a schematic diagram of an interactive interface of another video retrieval method provided by an embodiment of the present specification. As shown in FIG. 4b, the user can enter the search text to be searched in the search bar of the client to generate a video retrieval instruction. After receiving the video retrieval instruction, the server uses the above-mentioned video retrieval method to determine at least one target video in the preset database, and feeds back and displays the target video to the client; the display interface can be as shown in FIG. 4b.
  • FIG. 4 a and FIG. 4 b are only exemplary illustrations.
  • the video retrieval method provided in this specification includes: obtaining a retrieval text and at least one candidate video; inputting the retrieval text and the target candidate video into a video matching model, and obtaining a matching weight corresponding to the target candidate video output by the video matching model, wherein the target candidate video is any one of the at least one candidate video, and the matching weight is determined based on a first matching result and a second matching result between the retrieval text and the target candidate video, the first matching result being used to characterize the matching degree between the retrieval text and each target object in the target candidate video, and the second matching result being used to characterize the matching degree between the retrieval text and the video semantics of the target candidate video; based on the matching weight corresponding to each candidate video, at least one target video is determined in the at least one candidate video.
  • the first matching result between the search text and the target candidate video is calculated to determine the matching degree between the search text and each target object in the target candidate video
  • the second matching result between the search text and the target candidate video is calculated to determine the matching degree between the search text and the video semantics of the target candidate video.
  • the matching weight of the target candidate video is determined by combining the first matching result and the second matching result, and the accuracy of determining the matching weight of the target candidate video is improved by analyzing the local content and the overall content of the target candidate video; then, the target video is determined in each candidate video according to the matching weight corresponding to each candidate video, thereby improving the accuracy of text-based video retrieval.
  • Figure 5 shows a processing flow chart of a video retrieval method provided by an embodiment of this specification, which specifically includes the following steps:
  • Step 502: Obtain the search text "cook noodles" and at least one candidate video.
  • Step 504: Input the search text "cook noodles" into the text encoder of the video matching model to obtain the text feature sequence of the search text "cook noodles", and input the target candidate video into the video encoder of the video matching model to obtain the target video frame block feature sequence of the target candidate video.
  • Step 506: Input the text feature sequence and the target video frame block feature sequence into the spatial prototype generator of the video matching model to obtain a phrase feature sequence corresponding to the text feature sequence and a target object feature sequence corresponding to the target video frame block feature sequence.
  • Step 508: Input the phrase feature sequence and the target object feature sequence into the target phrase matcher of the video matching model to obtain a first matching result.
  • Step 510: Input the target object feature sequence into the temporal prototype generator of the video matching model to obtain a semantic feature sequence corresponding to the target object feature sequence.
  • Step 512: Determine a global feature vector in the text feature sequence, and input the semantic feature sequence and the global feature vector into the semantic matcher of the video matching model to obtain a second matching result.
  • Step 514: Determine a matching weight corresponding to the target candidate video based on the first matching result and the second matching result.
  • Step 516: Determine at least one target video from the at least one candidate video based on the matching weight corresponding to each candidate video.
  • the first matching result between the search text "cook noodles” and the target candidate video is calculated to determine the matching degree between the search text "cook noodles” and each target object in the target candidate video
  • the second matching result between the search text "cook noodles” and the target candidate video is calculated to determine the matching degree between the search text "cook noodles” and the video semantics of the target candidate video.
  • the matching weight of the target candidate video is determined by combining the first matching result and the second matching result, and the accuracy of determining the matching weight of the target candidate video is improved by analyzing the local content and the overall content of the target candidate video; then, the target video is determined in each candidate video according to the matching weight corresponding to each candidate video, thereby improving the accuracy of text-based video retrieval.
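  • the following Python sketch mirrors the processing flow of steps 502 to 516 at a very high level; every component is a placeholder passed in from outside, since the actual encoders, prototype generators and matchers of the video matching model are learned networks whose internals are not reproduced here, and the fusion rule is only an assumed example.
    # Hypothetical end-to-end sketch of the matching flow in steps 502-516.
    def match(search_text, candidate_video,
              text_encoder, video_encoder,
              spatial_prototype_generator, target_phrase_matcher,
              temporal_prototype_generator, semantic_matcher,
              fuse):
        # Step 504: encode both modalities.
        text_features = text_encoder(search_text)
        frame_block_features = video_encoder(candidate_video)

        # Step 506: derive phrase features and target object features.
        phrase_features, object_features = spatial_prototype_generator(
            text_features, frame_block_features)

        # Step 508: first matching result (text vs. target objects).
        first_result = target_phrase_matcher(phrase_features, object_features)

        # Steps 510-512: semantic features and second matching result
        # (global text feature vs. video semantics).
        semantic_features = temporal_prototype_generator(object_features)
        global_text_feature = text_features[-1]   # assumed to be the [end] vector
        second_result = semantic_matcher(semantic_features, global_text_feature)

        # Step 514: combine both results into the matching weight.
        return fuse(first_result, second_result)

    # Example call with trivial stand-ins, only to show the call pattern:
    weight = match("cook noodles", "video.mp4",
                   text_encoder=lambda t: [0.1, 0.2, 0.3],
                   video_encoder=lambda v: [[0.4, 0.5]],
                   spatial_prototype_generator=lambda tf, vf: (tf, vf),
                   target_phrase_matcher=lambda p, o: 0.6,
                   temporal_prototype_generator=lambda o: o,
                   semantic_matcher=lambda s, g: 0.8,
                   fuse=lambda a, b: (a + b) / 2)
    print(weight)  # 0.7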
  • FIG6 shows a schematic diagram of the structure of a video retrieval device provided by an embodiment of this specification. As shown in FIG6, the device includes:
  • An acquisition module 602 is configured to acquire a search text and at least one candidate video;
  • the input module 604 is configured to input the search text and the target candidate video into the video matching model, and obtain the matching weight corresponding to the target candidate video output by the video matching model, wherein the target candidate video is any one of the at least one candidate video, and the matching weight is determined based on the first matching result and the second matching result between the search text and the target candidate video, wherein the first matching result is used to characterize the matching degree between the search text and each target object in the target candidate video, and the second matching result is used to characterize the matching degree between the search text and the video semantics of the target candidate video;
  • the determination module 606 is configured to determine at least one target video from the at least one candidate video based on the matching weight corresponding to each candidate video.
  • the video matching model includes a coding layer, a first matching layer and a second matching layer;
  • the input module is further configured as follows:
  • the target object feature sequence, the text feature sequence and the first matching result are input into the second matching layer, a second matching result is determined according to the target object feature sequence and the text feature sequence, and a matching weight corresponding to the target candidate video is determined based on the first matching result and the second matching result.
  • the encoding layer includes a text encoder and a video encoder
  • the input module is further configured as follows:
  • the target candidate video is input into the video encoder to obtain a target video frame block feature sequence of the target candidate video.
  • the first matching layer includes a spatial prototype generator and a target phrase matcher
  • the input module is further configured as follows:
  • the phrase feature sequence and the target object feature sequence are input into the target phrase matcher to obtain a first matching result.
  • the input module is further configured as:
  • the feature vector is any one of the text feature vectors in the text feature sequence
  • the input module is further configured as:
  • a target object feature sequence is obtained.
  • the input module is further configured as:
  • a first matching result is generated.
  • the second matching layer includes a temporal prototype generator and a semantic matcher
  • the input module is further configured as follows:
  • a global feature vector is determined in the text feature sequence, and the semantic feature sequence and the global feature vector are input into the semantic matcher to obtain a second matching result.
  • the input module is further configured as:
  • the input module is further configured as:
  • a second matching result between the search text and the target candidate video is determined.
  • the determining module is further configured to:
  • the candidate videos are sorted according to the matching weights corresponding to the candidate videos to obtain a candidate video list, and a target video is determined in the candidate video list based on a preset number of videos.
  • the device further comprises a training module configured to:
  • the model parameters of the video matching model are adjusted according to the model loss value, and the video matching model is continuously trained until a training stop condition is reached.
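  • as a rough, non-authoritative illustration of the training loop just described, the following sketch repeatedly computes a loss value and updates the model until a stop condition is reached; the model_step callable and the stop conditions (a step budget and a loss floor) are assumptions for this example, not the claimed training procedure.
    def train(model_step, max_steps=1000, loss_floor=1e-3):
        # model_step() is assumed to perform one forward pass, compute the model
        # loss value and adjust the model parameters, returning the loss.
        loss = float("inf")
        for step in range(max_steps):
            loss = model_step()
            if loss <= loss_floor:   # one possible training stop condition
                break
        return loss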
  • the video retrieval device includes: an acquisition module, configured to acquire a search text and at least one candidate video; an input module, configured to input the search text and the target candidate video into a video matching model, and obtain a matching weight corresponding to the target candidate video output by the video matching model, wherein the target candidate video is any one of the at least one candidate video, and the matching weight is determined based on a first matching result and a second matching result between the search text and the target candidate video, the first matching result being used to characterize the matching degree between the search text and each target object in the target candidate video, and the second matching result being used to characterize the matching degree between the search text and the video semantics of the target candidate video; a determination module, configured to determine at least one target video in the at least one candidate video based on the matching weight corresponding to each candidate video.
  • by calculating the first matching result between the search text and the target candidate video, the matching degree between the search text and each target object in the target candidate video is determined, and by calculating the second matching result between the search text and the target candidate video, the matching degree between the search text and the video semantics of the target candidate video is determined.
  • the matching weight of the target candidate video is then determined by combining the first matching result and the second matching result.
  • the above is a schematic scheme of a video retrieval device of this embodiment. It should be noted that the technical scheme of the video retrieval device and the technical scheme of the above video retrieval method belong to the same concept, and the details not described in detail in the technical scheme of the video retrieval device can be referred to the description of the technical scheme of the above video retrieval method.
  • Fig. 7 shows a block diagram of a computing device 700 according to an embodiment of the present specification.
  • the components of the computing device 700 include but are not limited to a memory 710 and a processor 720.
  • the processor 720 is connected to the memory 710 via a bus 730, and the database 750 is used to store data.
  • the computing device 700 also includes an access device 740 that enables the computing device 700 to communicate via one or more networks 760.
  • examples of such networks include a public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet.
  • the access device 740 may include one or more of any type of network interface, wired or wireless (e.g., a network interface card (NIC)), such as an IEEE 802.11 wireless local area network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, or a near field communication (NFC) interface.
  • the above components of the computing device 700 and other components not shown in FIG. 7 may also be connected to each other, for example, through a bus. It should be understood that the computing device structure block diagram shown in FIG. 7 is only for illustrative purposes and is not intended to limit the scope of the present specification. Those skilled in the art may add or replace other components as needed.
  • the computing device 700 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., a tablet computer, a personal digital assistant, a laptop computer, a notebook computer, a netbook, etc.), a mobile phone (e.g., a smart phone), a wearable computing device (e.g., a smart watch, smart glasses, etc.), or other types of mobile devices, or a stationary computing device such as a desktop computer or a personal computer (PC).
  • the computing device 700 may also be a mobile or stationary server.
  • the processor 720 is configured to execute computer-executable instructions which, when executed by the processor, implement the steps of the above-mentioned video retrieval method.
  • the above is a schematic scheme of a computing device of this embodiment. It should be noted that the technical scheme of the computing device and the technical scheme of the above video retrieval method belong to the same concept, and the details not described in detail in the technical scheme of the computing device can be referred to the description of the technical scheme of the above video retrieval method.
  • An embodiment of the present specification further provides a computer-readable storage medium storing computer-executable instructions, which can implement the steps of the above-mentioned video retrieval method when executed by a processor.
  • the above is a schematic scheme of a computer-readable storage medium of this embodiment. It should be noted that the technical scheme of the storage medium and the technical scheme of the above video retrieval method belong to the same concept, and the details not described in detail in the technical scheme of the storage medium can be referred to the description of the technical scheme of the above video retrieval method.
  • An embodiment of the present specification further provides a computer program, wherein when the computer program is executed in a computer, the computer is caused to execute the steps of the above-mentioned video retrieval method.
  • the above is a schematic scheme of a computer program of this embodiment. It should be noted that the technical scheme of the computer program and the technical scheme of the above video retrieval method belong to the same concept, and the details not described in detail in the technical scheme of the computer program can be referred to the description of the technical scheme of the above video retrieval method.
  • the computer instructions include computer program codes, which may be in source code form, object code form, executable file or some intermediate form, etc.
  • the computer readable medium may include: any entity or device capable of carrying the computer program code, recording medium, USB flash drive, mobile hard disk, magnetic disk, optical disk, computer memory, read-only memory (ROM), random access memory (RAM), electric carrier signal, telecommunication signal and software distribution medium, etc.

Abstract

Embodiments of the present description provide a video retrieval method. The video retrieval method comprises: obtaining a retrieval text and at least one candidate video; inputting the retrieval text and a target candidate video into a video matching model to obtain a matching weight corresponding to the target candidate video output by the video matching model, wherein the target candidate video is any one of the at least one candidate video, the matching weight is determined based on a first matching result and a second matching result between the retrieval text and the target candidate video, the first matching result is used for representing the matching degree between the retrieval text and each target object in the target candidate video, and the second matching result is used for representing the matching degree between the retrieval text and video semantics of the target candidate video; and determining at least one target video among the at least one candidate video on the basis of the matching weight corresponding to each candidate video. Local content and overall content of the candidate video are analyzed to improve the accuracy of text-based video retrieval.

Description

Video Retrieval Method

This application claims the priority of the Chinese patent application filed with the China Patent Office on August 1, 2023, with application number 202310961671.3 and the application name "Video Retrieval Method", the entire contents of which are incorporated into this application by reference.

Technical Field

The embodiments of the present specification relate to the field of data processing technology, and in particular to a video retrieval method.

Background Art

With the development of various video platforms, people usually spend their free time watching videos. Due to different personal viewing needs, if users are not interested in the videos recommended by a video platform while watching, they will conduct independent text-based searches to obtain videos. In current practical applications, searching for semantically related videos based on the natural language input by users often requires appropriate matching modeling between video and text data, or requires training multiple single-modal pre-trained models to extract features and perform processing such as feature fusion.

However, since videos contain rich visual elements, a text description may only correspond to part of the video content, lacking perception of the overall and local content of the video, so that the accuracy of video retrieval is insufficient to meet user needs and the user experience is reduced. Therefore, a method is urgently needed to solve the above problems.

Summary of the Invention

In view of this, an embodiment of this specification provides a video retrieval method. One or more embodiments of this specification also relate to a video retrieval device, a computing device, a computer-readable storage medium and a computer program, so as to solve the technical defects existing in the prior art.

According to a first aspect of the embodiments of this specification, a video retrieval method is provided, including:

obtaining a search text and at least one candidate video;

inputting the search text and a target candidate video into a video matching model to obtain a matching weight corresponding to the target candidate video output by the video matching model, wherein the target candidate video is any one of the at least one candidate video, and the matching weight is determined based on a first matching result and a second matching result between the search text and the target candidate video, the first matching result being used to characterize the matching degree between the search text and each target object in the target candidate video, and the second matching result being used to characterize the matching degree between the search text and the video semantics of the target candidate video;

determining at least one target video from the at least one candidate video based on the matching weights corresponding to the candidate videos.

According to a second aspect of the embodiments of this specification, a video retrieval device is provided, including:

an acquisition module, configured to acquire a search text and at least one candidate video;

an input module, configured to input the search text and a target candidate video into a video matching model, and obtain a matching weight corresponding to the target candidate video output by the video matching model, wherein the target candidate video is any one of the at least one candidate video, and the matching weight is determined based on a first matching result and a second matching result between the search text and the target candidate video, the first matching result being used to characterize the matching degree between the search text and each target object in the target candidate video, and the second matching result being used to characterize the matching degree between the search text and the video semantics of the target candidate video;

a determination module, configured to determine at least one target video from the at least one candidate video based on the matching weight corresponding to each candidate video.

According to a third aspect of the embodiments of this specification, a computing device is provided, including:

a memory and a processor;

the memory is used to store computer-executable instructions, and the processor is used to execute the computer-executable instructions; when the computer-executable instructions are executed by the processor, the steps of the above-mentioned video retrieval method are implemented.

According to a fourth aspect of the embodiments of this specification, a computer-readable storage medium is provided, which stores computer-executable instructions; when the instructions are executed by a processor, the steps of the above-mentioned video retrieval method are implemented.

According to a fifth aspect of the embodiments of this specification, a computer program is provided, wherein when the computer program is executed in a computer, the computer is caused to execute the steps of the above-mentioned video retrieval method.

The video retrieval method provided in this specification includes: obtaining a search text and at least one candidate video; inputting the search text and a target candidate video into a video matching model to obtain a matching weight corresponding to the target candidate video output by the video matching model, wherein the target candidate video is any one of the at least one candidate video, and the matching weight is determined based on a first matching result and a second matching result between the search text and the target candidate video, the first matching result being used to characterize the matching degree between the search text and each target object in the target candidate video, and the second matching result being used to characterize the matching degree between the search text and the video semantics of the target candidate video; and determining at least one target video from the at least one candidate video based on the matching weight corresponding to each candidate video.

In one embodiment of the present specification, the first matching result between the search text and the target candidate video is calculated to determine the matching degree between the search text and each target object in the target candidate video, and the second matching result between the search text and the target candidate video is calculated to determine the matching degree between the search text and the video semantics of the target candidate video. The matching weight of the target candidate video is then determined by combining the first matching result and the second matching result, so that the accuracy of determining the matching weight of the target candidate video is improved by analyzing both the local content and the overall content of the target candidate video; the target video is then determined among the candidate videos according to the matching weight corresponding to each candidate video, thereby improving the accuracy of text-based video retrieval.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an architecture diagram of a video retrieval system provided by an embodiment of the present specification;

FIG. 2 is a flow chart of a video retrieval method provided by an embodiment of the present specification;

FIG. 3 is a schematic diagram of the model architecture of a video matching model provided by an embodiment of the present specification;

FIG. 4a is a schematic diagram of an interactive interface of a video retrieval method provided by an embodiment of this specification;

FIG. 4b is a schematic diagram of an interactive interface of another video retrieval method provided by an embodiment of the present specification;

FIG. 5 is a flow chart of the processing procedure of a video retrieval method provided by an embodiment of the present specification;

FIG. 6 is a schematic diagram of the structure of a video retrieval device provided by an embodiment of this specification;

FIG. 7 is a structural block diagram of a computing device provided by an embodiment of the present specification.

DETAILED DESCRIPTION

Many specific details are set forth in the following description to facilitate a full understanding of this specification. However, this specification can be implemented in many other ways than those described herein, and those skilled in the art can make similar generalizations without departing from the meaning of this specification; therefore, this specification is not limited to the specific implementations disclosed below.

The terms used in one or more embodiments of this specification are only for the purpose of describing specific embodiments, and are not intended to limit one or more embodiments of this specification. The singular forms "a", "said" and "the" used in one or more embodiments of this specification and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" used in one or more embodiments of this specification refers to and includes any or all possible combinations of one or more associated listed items.

It should be understood that although the terms first, second, etc. may be used to describe various information in one or more embodiments of this specification, this information should not be limited to these terms. These terms are only used to distinguish the same type of information from each other. For example, without departing from the scope of one or more embodiments of this specification, the first may also be referred to as the second, and similarly, the second may also be referred to as the first. Depending on the context, the word "if" as used herein may be interpreted as "at the time of" or "when" or "in response to determining".

In addition, it should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, stored data, displayed data, etc.) involved in one or more embodiments of this specification are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data must comply with relevant laws, regulations and standards, with corresponding operation entrances provided for users to choose to authorize or refuse.

The large model in one or more embodiments of this specification specifically refers to a deep learning model with large-scale model parameters, which usually contains hundreds of millions, tens of billions, or even hundreds of billions of model parameters. A large model can also be called a foundation model; it is pre-trained on large-scale unlabeled corpora to produce a pre-trained model with more than a hundred million parameters. Such a model can adapt to a wide range of downstream tasks and has good generalization capability, for example large language models (Large Language Model, LLM) and multi-modal pre-training models.

In actual applications, a large model only needs a small number of samples to fine-tune the pre-trained model before being applied to different tasks. Large models can be widely used in natural language processing (Natural Language Processing, NLP), computer vision and other fields, and can specifically be applied to computer vision tasks such as visual question answering (Visual Question Answering, VQA), image captioning (Image Caption, IC) and image generation, as well as natural language processing tasks such as text-based sentiment classification, text summary generation and machine translation. The main application scenarios of large models include digital assistants, intelligent robots, search, online education, office software, e-commerce, intelligent design, etc.

First, the terms involved in one or more embodiments of this specification are explained.

Video-text retrieval: traditional image retrieval tasks are only applicable to searches between images, and their application scenarios are limited. Video-text retrieval can achieve cross-modal search between natural language and video, and its model is generally trained on a large number of video-text data pairs.

CLIP: a multi-modal pre-training model. It can process images and text simultaneously and combine the two for tasks such as classification, retrieval and generation. During training, CLIP uses a large amount of image and text data; its pre-training goal is to learn aligned representations of text and images, so that similar text and images are closer in the embedding space. To achieve this goal, CLIP adopts a contrastive learning method, maximizing the cosine similarity of similar text and images while minimizing the cosine similarity of dissimilar text and images, thereby training an embedding space that can align data of different modalities.
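
The following Python sketch is a rough illustration of the contrastive objective described above, not the actual CLIP implementation: for a batch of paired text and image embeddings, matched pairs lie on the diagonal of the cosine-similarity matrix, and a symmetric cross-entropy loss pushes their similarity above that of mismatched pairs. The batch size, dimensionality and random embeddings are assumptions made only for this example.

    import numpy as np

    def clip_style_loss(text_emb, image_emb, temperature=0.07):
        # L2-normalise so that dot products are cosine similarities.
        t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
        v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)

        logits = t @ v.T / temperature          # [batch, batch] similarity matrix
        labels = np.arange(len(t))              # matched pairs are on the diagonal

        def cross_entropy(logits, labels):
            logits = logits - logits.max(axis=1, keepdims=True)
            log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
            return -log_probs[np.arange(len(labels)), labels].mean()

        # Symmetric loss: text-to-image and image-to-text directions.
        return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

    # Example: a random batch of 4 pairs with 512-dimensional embeddings.
    rng = np.random.default_rng(0)
    print(clip_style_loss(rng.normal(size=(4, 512)), rng.normal(size=(4, 512))))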

Transformer: a model that mainly uses self-attention to capture long-term dependencies in a sequence. A Transformer contains an encoder and a decoder: the encoder is used to generate a representation of the input sequence, and the decoder is used to generate a representation of the target sequence. Both the encoder and the decoder are composed of multiple layers of self-attention and feed-forward neural networks. In addition, the self-attention mechanism in the Transformer is a mechanism for computing the representation of each element in the sequence; it helps the model focus on the parts of the input sequence that are related to the current position and generates an importance weight vector. For example, when a sentence is input, the meaning of each word in the sentence is related to the other words in the sentence.
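
As a minimal sketch of the self-attention mechanism just described (a single head with no learned parameters), the following example computes, for each position in a sequence, a softmax-weighted mixture of all positions; the random projection matrices only stand in for the learned query, key and value projections.

    import numpy as np

    def self_attention(x, w_q, w_k, w_v):
        # x: [sequence_length, d_model]; w_q/w_k/w_v: projection matrices (random stand-ins).
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        scores = q @ k.T / np.sqrt(k.shape[-1])                    # pairwise relevance
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
        return weights @ v     # each position becomes a weighted mix of all positions

    rng = np.random.default_rng(0)
    d = 8
    tokens = rng.normal(size=(5, d))                               # a 5-token "sentence"
    out = self_attention(tokens, *(rng.normal(size=(d, d)) for _ in range(3)))
    print(out.shape)                                               # (5, 8)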

Prototype: a representative representation vector. For example, an event can be represented by an event-level vector (prototype).

Understanding multimodal information is an important way for humans to perceive the world. As a basic task in multimodal understanding, video-text retrieval has attracted great interest from researchers with the rapid development of short video platforms. Video-text retrieval aims to search for semantically related videos based on the natural language input by users, which requires appropriate matching modeling between video and text data. However, the inherent modal difference phenomenon increases the difficulty of associating multimodal data. To address this problem, in practical applications multiple single-modal pre-trained models are usually used to extract features, and then metric learning strategies are used to strengthen modal alignment in the joint space. However, there are large differences in the initial distributions of multiple single-modal offline features, which inevitably brings feature fusion challenges and affects the retrieval results.

In addition, since videos contain rich visual elements, a text description may only correspond to part of the video content and lacks perception of the overall and local content of the video, making the accuracy of video retrieval insufficient to meet user needs and reducing the user experience.

In this specification, a video retrieval method is provided. This specification also relates to a video retrieval system, a video retrieval device, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.

Referring to FIG. 1, FIG. 1 shows an architecture diagram of a video retrieval system provided by an embodiment of the present specification. The video retrieval system may include a client 100 and a server 200;

the client 100 is used to send a video retrieval instruction to the server 200;

the server 200 is used to receive the video retrieval instruction sent by the client 100; obtain a search text and at least one candidate video in response to the video retrieval instruction; input the search text and a target candidate video into a video matching model to obtain a matching weight corresponding to the target candidate video output by the video matching model, wherein the target candidate video is any one of the at least one candidate video, and the matching weight is determined based on a first matching result and a second matching result between the search text and the target candidate video, the first matching result being used to characterize the matching degree between the search text and each target object in the target candidate video, and the second matching result being used to characterize the matching degree between the search text and the video semantics of the target candidate video; determine at least one target video from the at least one candidate video based on the matching weight corresponding to each candidate video; and send the at least one target video to the client 100;

the client 100 is further configured to receive the at least one target video sent by the server 200.

Applying the solution of the embodiments of the present specification, a user can trigger a video retrieval instruction through the client and send the video retrieval instruction to the server. After receiving the video retrieval instruction, the server obtains the search text carried in the video retrieval instruction, obtains multiple candidate videos from a preset database, and inputs the search text and the candidate videos into the video matching model, so as to determine, among the multiple candidate videos, the target video associated with the search text and feed the target video back to the client.

The video retrieval system may include multiple clients 100 and a server 200, where a client 100 may be referred to as an end-side device and the server 200 may be referred to as a cloud-side device. Communication connections can be established between multiple clients 100 through the server 200. In the video retrieval scenario, the server 200 is used to provide video retrieval services between multiple clients 100; multiple clients 100 may respectively serve as a sender or a receiver and communicate through the server 200.

A user can interact with the server 200 through the client 100 to receive data sent by other clients 100, or to send data to other clients 100, and so on. In the video retrieval scenario, the user may publish a data stream to the server 200 through the client 100, and the server 200 determines at least one target video according to the data stream and pushes the at least one target video to other clients with which communication has been established.

The client 100 and the server 200 are connected via a network. The network provides the medium of the communication link between the client 100 and the server 200. The network may include various connection types, such as wired or wireless communication links or optical fiber cables. The data transmitted by the client 100 may need to be encoded, transcoded, compressed or otherwise processed before being published to the server 200.

The client 100 may be a browser, an APP (Application), a web application such as an H5 (HyperText Markup Language 5) application, a light application (also known as a mini-program, a lightweight application) or a cloud application, etc. The client 100 may be developed based on a software development kit (SDK, Software Development Kit) of the corresponding service provided by the server 200, for example based on a real-time communication (RTC, Real Time Communication) SDK. The client 100 may be deployed in an electronic device and may need to rely on the device, or on certain APPs in the device, to run. The electronic device may, for example, have a display screen and support information browsing, and may be a personal mobile terminal such as a mobile phone, a tablet computer or a personal computer. Various other types of applications may also be configured in the electronic device, such as human-computer dialogue applications, model training applications, text processing applications, web browser applications, shopping applications, search applications, instant messaging tools, email clients, social platform software, etc.

The server 200 may include servers that provide various services, such as a server that provides communication services for multiple clients, a server for background training that supports the models used on the clients, or a server that processes the data sent by the clients. It should be noted that the server 200 can be implemented as a distributed server cluster consisting of multiple servers, or as a single server. The server may also be a server of a distributed system, or a server combined with a blockchain. The server may also be a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN, Content Delivery Network), and big data and artificial intelligence platforms, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology.

It is worth noting that the video retrieval method provided in the embodiments of this specification is generally executed by the server; however, in other embodiments of this specification, the client may also have functions similar to those of the server, so as to execute the video retrieval method provided in the embodiments of this specification. In other embodiments, the video retrieval method provided in the embodiments of this specification may also be executed jointly by the client and the server.

Referring to FIG. 2, FIG. 2 shows a flow chart of a video retrieval method provided by an embodiment of the present specification, which specifically includes the following steps:

Step 202: Obtain a search text and at least one candidate video.

In practical applications, various short video platforms are increasingly favored by the public, and people often pass the time by refreshing and watching videos. However, while watching videos, there are cases where a user is not interested in the video content recommended by the short video platform; in this case, the user will often search for videos in the search bar.

In this scenario, although each short video platform makes corresponding video recommendations based on the text entered by the user, the videos recommended by the short video platform may not match the text entered in the user's search, or the matching degree may be low, thereby reducing the user's viewing experience.

The video retrieval method provided in an embodiment of the present specification can recommend, based on the search text input by the user, videos that match the search text.

The search text refers to the text used to retrieve videos; the search text may be a keyword, a phrase, a sentence, a text paragraph, etc. A candidate video refers to a video stored in a preset database, that is, the video data from which videos are retrieved based on the search text.

For example, if videos related to "cooking noodles" need to be retrieved from database A, then "cooking noodles" is the search text, and each video stored in database A is a candidate video.

Specifically, when a user needs to retrieve videos, the user can manually enter the search text in the search bar, or input the search text by voice, thereby generating a video retrieval instruction. The client sends the video retrieval instruction to the server. After receiving the video retrieval instruction, the server obtains the search text carried in the video retrieval instruction and obtains the candidate videos in the preset database.

For example, if a user wants to obtain videos related to "cooking noodles", the user can enter the word "cooking noodles", or text related to "cooking noodles", in the search bar of the client; the client then sends the video retrieval instruction generated for "cooking noodles" to the server. After receiving the video retrieval instruction, the server obtains the search text "cooking noodles" carried in the video retrieval instruction and obtains multiple candidate videos from the preset database.

After the search text and the multiple candidate videos are obtained, the search text and the multiple candidate videos can be matched to determine the matching degree between the search text and the multiple candidate videos, and the video that the user needs to retrieve is determined based on the matching degree between the search text and the multiple candidate videos.

In order to improve the retrieval efficiency of video-text retrieval, a video matching model can be used to process the search text and each candidate video.

Step 204: Input the search text and a target candidate video into a video matching model to obtain a matching weight corresponding to the target candidate video output by the video matching model.

The target candidate video refers to any one of the candidate videos. The video matching model is used to determine the matching weight between the search text and each candidate video; the matching weight is used to characterize the matching degree between the search text and each candidate video and can be expressed by a numerical value, such as 0.5 or 0.8, or 50%, 80%, etc.

Specifically, the obtained search text and the target candidate video are input into the video matching model, and the matching weight corresponding to the target candidate video output by the video matching model is obtained.

In a specific implementation provided in this specification, the video matching model includes an encoding layer, a first matching layer and a second matching layer;

inputting the search text and the target candidate video into the video matching model and obtaining the matching weight corresponding to the target candidate video output by the video matching model includes:

inputting the search text and the target candidate video into the encoding layer to obtain a text feature sequence of the search text and a target video frame block feature sequence of the target candidate video;

inputting the text feature sequence and the target video frame block feature sequence into the first matching layer to obtain a phrase feature sequence corresponding to the text feature sequence and a target object feature sequence corresponding to the target video frame block feature sequence, and determining a first matching result according to the phrase feature sequence and the target object feature sequence;

inputting the target object feature sequence, the text feature sequence and the first matching result into the second matching layer, determining a second matching result according to the target object feature sequence and the text feature sequence, and determining a matching weight corresponding to the target candidate video based on the first matching result and the second matching result.

The encoding layer is used to encode the search text and each candidate video to obtain the text feature sequence corresponding to the search text; the first matching layer is used to determine the first matching result between the search text and each candidate video, where the first matching result is used to characterize the matching degree between the search text and each target object in each candidate video, and a target object may refer to a person, an object, etc. in a video frame; the second matching layer is used to determine the second matching result between the search text and each candidate video, where the second matching result is used to characterize the matching degree between the search text and the video semantics of each candidate video.

Specifically, to input the search text and the target candidate video into the video matching model, the search text and the target candidate video first need to be input into the encoding layer of the video matching model, where the search text and the target candidate video are encoded respectively to obtain the text feature sequence corresponding to the search text and the target video frame block feature sequence corresponding to the target candidate video.

The text feature sequence refers to a feature sequence composed of the text feature vectors corresponding to the search text, where the text feature vectors include word feature vectors and identifier feature vectors; the target video frame block feature sequence refers to a feature sequence composed of the target video frame block feature vectors corresponding to the target candidate video, where the target video frame block feature vectors include key frame block feature vectors and global frame feature vectors.

Since the search text and the target candidate video are data of different modalities, the search text and the target candidate video can be encoded using a text encoder and a video encoder, respectively.

In a specific implementation provided in this specification, the encoding layer includes a text encoder and a video encoder;

inputting the search text and the target candidate video into the video matching model to obtain the text feature sequence of the search text and the target video frame block feature sequence of the target candidate video includes:

inputting the search text into the text encoder to obtain the text feature sequence of the search text;

inputting the target candidate video into the video encoder to obtain the target video frame block feature sequence of the target candidate video.

The text encoder refers to an encoder used to encode the search text, and the video encoder refers to an encoder used to encode the candidate videos.

Inputting the search text and the target candidate video into the encoding layer of the video matching model means inputting the search text into the text encoder and inputting the target candidate video into the video encoder, thereby obtaining the text feature sequence corresponding to the search text and the target video frame block feature sequence corresponding to the target candidate video.

Specifically, in a specific implementation provided in this specification, obtaining the text feature sequence of the search text includes:

recognizing the search text and obtaining the words in the search text;

encoding each word to obtain a word feature vector corresponding to each word;

obtaining preset text identifiers, and encoding each preset text identifier to obtain an identifier feature vector corresponding to each preset text identifier;

obtaining the text feature sequence of the search text based on each word feature vector and each identifier feature vector.

The preset text identifiers refer to identifiers that are set in advance and used to determine the start position and the end position of the search text, including a start identifier and an end identifier, which can specifically be represented by [start] and [end]. The identifier feature vectors are the vector representations of the start identifier and the end identifier, where the identifier feature vector corresponding to the end identifier is also the global feature vector of the search text.

The search text is segmented and recognized to obtain each word in the search text, and each word is encoded to obtain the word feature vector corresponding to each word; the preset text identifiers are obtained and encoded to obtain the identifier feature vector corresponding to each preset identifier; and the text feature sequence is constructed based on each word feature vector and each identifier feature vector.

For example, for a search text Yi, the text feature sequence obtained based on the above method can be expressed as {yS, y1, y2, …, yM, yE} ∈ R^((M+2)×D), where yS is the start identifier feature vector, y1, …, yM are the word feature vectors, yE is the end identifier feature vector and is also the global feature vector of the search text Yi, M is the number of words, and D is the feature dimension, which is usually set to 512; this specification does not limit the feature dimension of the feature vectors.
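
A minimal sketch of how such a text feature sequence could be assembled is given below; the embedding function is a random stand-in for the learned text encoder, and the shapes only illustrate the {yS, y1, …, yM, yE} layout described above.

    import numpy as np

    D = 512                                   # feature dimension used in the example above
    rng = np.random.default_rng(0)
    embed = lambda token: rng.normal(size=D)  # stand-in for the learned text encoder

    def text_feature_sequence(search_text):
        words = search_text.split()                               # word segmentation
        word_vectors = [embed(w) for w in words]                  # y1 ... yM
        y_start, y_end = embed("[start]"), embed("[end]")         # identifier vectors
        # y_end also serves as the global feature vector of the search text.
        return np.stack([y_start, *word_vectors, y_end])          # shape (M + 2, D)

    seq = text_feature_sequence("cook noodles")
    print(seq.shape)   # (4, 512): [start], "cook", "noodles", [end]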

在本说明书提供的一具体实施方式中,获得所述目标候选视频的目标视频帧块特征序列,包括:In a specific implementation manner provided in this specification, obtaining a target video frame block feature sequence of the target candidate video includes:

从所述目标候选视频中采样预设数量的关键帧;Sampling a preset number of key frames from the target candidate video;

切分目标关键帧,获得所述目标关键帧对应的至少一个视频帧块,其中,所述目标关键帧为所述预设数量的关键帧中的任一个;Segmenting a target key frame to obtain at least one video frame block corresponding to the target key frame, wherein the target key frame is any one of the preset number of key frames;

对各视频帧块进行编码,获得所述目标关键帧的关键帧块特征序列;Encoding each video frame block to obtain a key frame block feature sequence of the target key frame;

获取预设分类嵌入符,对所述预设分类嵌入符进行编码,获得所述预设分类嵌入符对应的全局帧特征向量;Obtaining a preset classification embedding symbol, encoding the preset classification embedding symbol, and obtaining a global frame feature vector corresponding to the preset classification embedding symbol;

基于各关键帧的关键帧块特征序列和各全局帧特征向量,获得所述目标候选视频的目标视频帧块特征序列。Based on the key frame block feature sequence of each key frame and each global frame feature vector, a target video frame block feature sequence of the target candidate video is obtained.

The preset number is a pre-set number of video frames to be sampled; a key frame is a video frame sampled from the target candidate video; the key frame block feature sequence is the feature sequence composed of the key frame block feature vectors corresponding to a key frame.

预设分类嵌入符,是指预先设置的用于插入至关键帧前的特殊分类嵌入字符,被用于分类任务中,具体可以用[CLS]表示。全局帧特征向量,是指预设分类嵌入符的向量表示。The preset classification embedding character refers to a special classification embedding character that is pre-set and inserted before the key frame. It is used in the classification task and can be specifically represented by [CLS]. The global frame feature vector refers to the vector representation of the preset classification embedding character.

A preset number is obtained, the preset number of key frames is sampled from the target candidate video, and each sampled key frame is segmented to obtain a plurality of video frame blocks of that key frame; each video frame block is then encoded to obtain the key frame block feature vector corresponding to that video frame block, and a key frame block feature sequence corresponding to each key frame is generated. A preset classification embedding symbol is obtained and encoded to obtain the global frame feature vector corresponding to the preset classification embedding symbol. The target video frame block feature sequence is constructed based on the key frame block feature sequences and the global frame feature vectors.

For example, for the target candidate video X_i, the target video frame block feature sequence obtained by the above method can be expressed as the sequence formed, for each of the L sampled key frames, by the global frame feature vector of that frame followed by its K key frame block feature vectors, where K is the number of video frame blocks into which each frame is divided, L is the preset number, and each vector has feature dimension D, usually set to 512; this specification does not limit the feature dimension of the feature vectors.
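The following non-limiting sketch illustrates one plausible way of uniformly sampling L key frames and splitting each frame into K non-overlapping blocks prior to encoding; the frame count, patch size and tensor layout are example assumptions rather than values mandated by this specification.

```python
import torch

def sample_key_frames(video: torch.Tensor, L: int = 12) -> torch.Tensor:
    """video: (T, C, H, W) -> (L, C, H, W) uniformly sampled key frames."""
    T = video.shape[0]
    idx = torch.linspace(0, T - 1, L).round().long()
    return video[idx]

def split_into_blocks(frame: torch.Tensor, patch: int = 32) -> torch.Tensor:
    """frame: (C, H, W) -> (K, C*patch*patch) blocks, K = (H/patch)*(W/patch)."""
    C, H, W = frame.shape
    blocks = frame.reshape(C, H // patch, patch, W // patch, patch)
    return blocks.permute(1, 3, 0, 2, 4).reshape(-1, C * patch * patch)

video = torch.rand(100, 3, 224, 224)       # a toy 100-frame video
frames = sample_key_frames(video, L=12)    # L = 12 key frames
blocks = split_into_blocks(frames[0])      # K = 49 blocks, each later encoded with a [CLS] embedding prepended
print(frames.shape, blocks.shape)          # (12, 3, 224, 224) (49, 3072)
```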

进一步地,在对检索文本和目标候选视频分别进行编码,并获得检索文本对应的文本特征序列和目标候选视频对应的目标视频帧块特征序列后,将文本特征序列和目标视频帧块特征序列输入至视频匹配模型的第一匹配层,以获得文本特征序列对应的短语特征序列和目标视频帧块特征序列对应的目标对象特征序列,从而可以根据短语特征序列和目标对象特征序列,确定第一匹配结果。Furthermore, after respectively encoding the retrieval text and the target candidate video and obtaining the text feature sequence corresponding to the retrieval text and the target video frame block feature sequence corresponding to the target candidate video, the text feature sequence and the target video frame block feature sequence are input into the first matching layer of the video matching model to obtain the phrase feature sequence corresponding to the text feature sequence and the target object feature sequence corresponding to the target video frame block feature sequence, so that the first matching result can be determined based on the phrase feature sequence and the target object feature sequence.

在本说明书提供的一具体实施方式中,所述第一匹配层包括空间原型生成器和目标短语匹配器;In a specific implementation provided in this specification, the first matching layer includes a spatial prototype generator and a target phrase matcher;

将所述文本特征序列和所述目标视频帧块特征序列输入至所述第一匹配层,获得所述文本特征序列对应的短语特征序列和所述目标视频帧块特征序列对应的目标对象特征序列,根据所述短语特征序列和所述目标对象特征序列确定第一匹配结果,包括:Inputting the text feature sequence and the target video frame block feature sequence into the first matching layer, obtaining a phrase feature sequence corresponding to the text feature sequence and a target object feature sequence corresponding to the target video frame block feature sequence, and determining a first matching result according to the phrase feature sequence and the target object feature sequence, including:

将所述文本特征序列和所述目标视频帧块特征序列输入至所述空间原型生成器,获得所述文本特征序列对应的短语特征序列和所述目标视频帧块特征序列对应的目标对象特征序列;Inputting the text feature sequence and the target video frame block feature sequence into the spatial prototype generator to obtain a phrase feature sequence corresponding to the text feature sequence and a target object feature sequence corresponding to the target video frame block feature sequence;

将所述短语特征序列和所述目标对象特征序列输入至所述目标短语匹配器,获得第一匹配结果。The phrase feature sequence and the target object feature sequence are input into the target phrase matcher to obtain a first matching result.

其中,空间原型生成器,用于将文本特征序列中的各文本特征向量聚合成短语特征向量,将目标视频帧块特征序列中的各目标视频帧块特征向量聚合成目标对象特征向量。The spatial prototype generator is used to aggregate each text feature vector in the text feature sequence into a phrase feature vector, and to aggregate each target video frame block feature vector in the target video frame block feature sequence into a target object feature vector.

目标短语匹配器,用于将各目标对象特征向量与各短语特征向量进行特征匹配。The target phrase matcher is used to perform feature matching between each target object feature vector and each phrase feature vector.

A phrase feature sequence is a feature sequence composed of the phrase feature vectors aggregated from the text feature sequence; a target object feature sequence is a feature sequence composed of the target object feature vectors aggregated from the target video frame block feature sequence.

具体地,将文本特征序列和目标视频帧块特征序列输入至空间原型生成器,可以获得文本特征序列对应的短语特征序列,以及目标视频帧块特征序列对应的目标对象特征序列,进而,将获得的短语特征序列和目标对象特征序列输入至目标短语匹配器,可以获得用于表征检索文本与目标候选视频中各目标对象匹配度的第一匹配结果。具体地,获得短语特征序列的实现方式如下:Specifically, the text feature sequence and the target video frame block feature sequence are input into the spatial prototype generator, and the phrase feature sequence corresponding to the text feature sequence and the target object feature sequence corresponding to the target video frame block feature sequence can be obtained. Then, the obtained phrase feature sequence and target object feature sequence are input into the target phrase matcher, and the first matching result for characterizing the matching degree between the search text and each target object in the target candidate video can be obtained. Specifically, the implementation method of obtaining the phrase feature sequence is as follows:

In a specific implementation provided in this specification, obtaining the phrase feature sequence corresponding to the text feature sequence includes:

确定目标文本特征向量的预测文本权重,其中,所述目标文本特征向量为所述文本特征序列中各文本特征向量的任一个;Determining a predicted text weight of a target text feature vector, wherein the target text feature vector is any one of the text feature vectors in the text feature sequence;

基于所述目标文本特征向量和所述预测文本权重,生成目标短语特征向量;Generate a target phrase feature vector based on the target text feature vector and the predicted text weight;

基于各目标短语特征向量,获得短语特征序列。Based on each target phrase feature vector, a phrase feature sequence is obtained.

由于各文本特征向量中可能会存在冗余特征向量,因此,在对各文本特征向量进行聚合的过程中,并不会对所有的文本特征向量进行聚合,为了筛选各文本特征向量中的有效文本特征向量,可以预测各文本特征向量对应的权重,根据各文本特征向量对应的权重,对各文本特征向量进行聚合,以生成短语特征序列。Since there may be redundant feature vectors in each text feature vector, not all text feature vectors will be aggregated during the process of aggregating the text feature vectors. In order to screen the valid text feature vectors in each text feature vector, the weight corresponding to each text feature vector can be predicted, and the text feature vectors can be aggregated according to the weight corresponding to each text feature vector to generate a phrase feature sequence.

The target text feature vector is any one of the text feature vectors in the text feature sequence. A predicted text weight characterizes how important the corresponding text feature vector is for obtaining effective information, and N_p is the number of phrase feature vectors.

Specifically, any text feature vector in the text feature sequence is selected as the target text feature vector, the predicted text weight of the target text feature vector is determined, and a target phrase feature vector is generated based on the target text feature vector and its predicted text weight; the phrase feature sequence is then composed of the N_p generated target phrase feature vectors, its generation process being a weighted aggregation of the text feature vectors according to their predicted text weights.
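A minimal sketch of such a weighted aggregation is given below, assuming that the predicted text weights come from a learned linear layer followed by a softmax over the words; the exact weighting scheme of this specification is not reproduced, and the layer names and sizes are illustrative only.

```python
import torch

torch.manual_seed(0)
D, M, Np = 512, 6, 4                    # feature dim, word count, phrase count (example values)
text_feats = torch.rand(M, D)           # word feature vectors y_1 ... y_M

weight_head = torch.nn.Linear(D, Np)    # predicts, for each word, one weight per phrase slot
w = weight_head(text_feats).t().softmax(dim=-1)   # (Np, M): importance of word m for phrase j
phrase_feats = w @ text_feats                      # (Np, D): phrase feature sequence
print(phrase_feats.shape)               # torch.Size([4, 512])
```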

进一步地,获得目标对象特征序列的实现方式如下:Furthermore, the method for obtaining the target object feature sequence is as follows:

In a specific implementation provided in this specification, obtaining the target object feature sequence corresponding to the target video frame block feature sequence includes:

确定第一帧块特征向量的预测帧权重,其中,所述第一帧块特征向量为所述目标视频帧块特征序列中各目标视频帧块特征向量的任一个;Determine a prediction frame weight of a first frame block feature vector, wherein the first frame block feature vector is any one of the target video frame block feature vectors in the target video frame block feature sequence;

基于所述第一帧块特征向量和所述预测帧权重,生成第一对象特征向量;generating a first object feature vector based on the first frame block feature vector and the predicted frame weight;

基于各第一对象特征向量,获得目标对象特征序列。Based on each first object feature vector, a target object feature sequence is obtained.

由于各目标视频帧块特征向量中同样可能会存在冗余特征向量,甚至可能会干扰跨模态对齐,因此,在对各目标视频帧块特征向量进行聚合的过程中,也不会对所有的目标视频帧块特征向量进行聚合,为了筛选各目标视频帧块特征向量中的有效目标视频帧块特征向量,可以预测各目标视频帧块特征向量对应的权重,根据各目标视频帧块特征向量对应的权重,对各目标视频帧块特征向量进行过滤和聚合,以生成目标对象特征序列。Since redundant feature vectors may also exist in the feature vectors of each target video frame block, which may even interfere with cross-modal alignment, not all target video frame block feature vectors will be aggregated during the process of aggregating the feature vectors of each target video frame block. In order to screen out the valid target video frame block feature vectors in the feature vectors of each target video frame block, the weights corresponding to the feature vectors of each target video frame block can be predicted. According to the weights corresponding to the feature vectors of each target video frame block, the feature vectors of each target video frame block are filtered and aggregated to generate a target object feature sequence.

The first frame block feature vector is any one of the target video frame block feature vectors in the target video frame block feature sequence. A predicted frame weight characterizes how important the corresponding target video frame block feature vector is for obtaining effective information, and N_o is the number of target object feature vectors.

Specifically, any target video frame block feature vector in the target video frame block feature sequence is selected as the first frame block feature vector, the predicted frame weight of the first frame block feature vector is determined, and a first object feature vector is generated based on the first frame block feature vector and its predicted frame weight; the target object feature sequence is then composed of the N_o generated first object feature vectors, its generation process being a weighted aggregation of the target video frame block feature vectors according to their predicted frame weights.
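The video side can be sketched in the same spirit, this time with an explicit filtering step that discards low-weight frame block features before aggregation, reflecting the filter-and-aggregate behaviour described above; the keep ratio, layer names and prototype count are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
D, num_blocks, No = 512, 49, 8                 # feature dim, block count, object count (example values)
block_feats = torch.rand(num_blocks, D)        # target video frame block feature vectors

score_head = torch.nn.Linear(D, 1)             # predicts a frame weight per block
scores = score_head(block_feats).squeeze(-1)   # (num_blocks,)

keep = scores.topk(k=num_blocks // 2).indices  # filter: keep only the higher-weight half
kept_feats, kept_scores = block_feats[keep], scores[keep]

proto_head = torch.nn.Linear(D, No)            # assigns kept blocks to object prototypes
assign = proto_head(kept_feats).t().softmax(dim=-1)   # (No, kept)
assign = assign * kept_scores.sigmoid()                # re-weight by the predicted frame weights
object_feats = assign @ kept_feats                     # (No, D): target object feature sequence
print(object_feats.shape)                      # torch.Size([8, 512])
```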

通过上述方式,可以过滤冗余的文本特征向量和冗余的目标视频帧块特征向量,并聚合生成文本特征序列对应的短语特征序列,以及目标视频帧块特征序列对应的目标对象特征序列,从而提高确定目标视频的准确性。In this way, redundant text feature vectors and redundant target video frame block feature vectors can be filtered, and a phrase feature sequence corresponding to the text feature sequence and a target object feature sequence corresponding to the target video frame block feature sequence can be aggregated to improve the accuracy of determining the target video.

在生成短语特征序列和目标对象特征序列后,可以将短语特征序列和目标对象特征序列输入至目标短语匹配器,以获得第一匹配结果。After the phrase feature sequence and the target object feature sequence are generated, the phrase feature sequence and the target object feature sequence may be input into a target phrase matcher to obtain a first matching result.

在本说明书提供的一具体实施方式中,获得第一匹配结果,包括:In a specific implementation manner provided in this specification, obtaining a first matching result includes:

在所述目标对象特征序列中选取待处理对象特征向量; Selecting a feature vector of an object to be processed from the target object feature sequence;

计算所述待处理对象特征向量与所述短语特征序列中各短语特征向量之间的第一相似度,计算所述待处理对象特征向量与所述目标视频帧块特征序列中各目标视频帧块特征向量之间的第二相似度;Calculating a first similarity between the feature vector of the object to be processed and each phrase feature vector in the phrase feature sequence, and calculating a second similarity between the feature vector of the object to be processed and each target video frame block feature vector in the target video frame block feature sequence;

在各第一相似度中确定目标第一相似度,在各第二相似度中确定目标第二相似度;Determine a target first similarity among each first similarity, and determine a target second similarity among each second similarity;

基于所述目标第一相似度和所述目标第二相似度,确定所述待处理对象特征向量的初始匹配结果;Determining an initial matching result of the feature vector of the object to be processed based on the first target similarity and the second target similarity;

基于各待处理对象特征向量的初始匹配结果,生成第一匹配结果。Based on the initial matching results of the feature vectors of the objects to be processed, a first matching result is generated.

其中,待处理对象特征向量,是指目标对象特征序列中各目标对象特征向量中的任意一个。目标第一相似度,具体是指各第一相似度中的最大相似度,目标第二相似度,具体是指各第二相似度中的最大相似度。初始匹配结果,是指待处理对象特征向量与各短语特征向量之间的匹配结果,用于在各短语特征向量中确定待处理对象特征向量相似度最大的短语特征向量。The feature vector of the object to be processed refers to any one of the feature vectors of the target object in the feature sequence of the target object. The first target similarity specifically refers to the maximum similarity among the first similarities, and the second target similarity specifically refers to the maximum similarity among the second similarities. The initial matching result refers to the matching result between the feature vector of the object to be processed and the feature vectors of each phrase, and is used to determine the phrase feature vector with the maximum similarity to the feature vector of the object to be processed among the feature vectors of each phrase.

Specifically, after the phrase feature sequence and the target object feature sequence are generated, any target object feature vector in the target object feature sequence is selected for processing, that is, taken as the object feature vector to be processed. The first similarities between the object feature vector to be processed and the phrase feature vectors in the phrase feature sequence are calculated, as are the second similarities between the object feature vector to be processed and the target video frame block feature vectors in the target video frame block feature sequence. The maximum first similarity is determined as the target first similarity, and the maximum second similarity is determined as the target second similarity. The final similarity of the object feature vector to be processed is then determined from the target first similarity and the target second similarity, and this similarity is taken as the initial matching result of the object feature vector to be processed. In the same manner, the initial matching result of each target object feature vector in the target object feature sequence is calculated, and the result of adding the initial matching results directly, or adding them after weighting, is taken as the first matching result.
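A compact sketch of this matching logic is shown below; cosine similarity, the product used to combine the two target similarities, and the direct summation over objects are assumptions chosen for illustration, since the specification leaves these details open.

```python
import torch
import torch.nn.functional as F

def first_matching_result(object_feats, phrase_feats, block_feats):
    """object_feats: (No, D), phrase_feats: (Np, D), block_feats: (num_blocks, D)."""
    obj = F.normalize(object_feats, dim=-1)
    phr = F.normalize(phrase_feats, dim=-1)
    blk = F.normalize(block_feats, dim=-1)

    first_sims = obj @ phr.t()     # similarities to every phrase feature vector
    second_sims = obj @ blk.t()    # similarities to every frame block feature vector

    target_first = first_sims.max(dim=-1).values     # target first similarity per object
    target_second = second_sims.max(dim=-1).values   # target second similarity per object

    initial = target_first * target_second           # one possible per-object combination
    return initial.sum()                             # add the initial matching results directly

s_op = first_matching_result(torch.rand(8, 512), torch.rand(4, 512), torch.rand(49, 512))
print(float(s_op))
```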

通过计算检索文本与目标候选视频之间的第一匹配结果,可以获知检索文本与目标候选视频中各目标对象的匹配度。By calculating the first matching result between the search text and the target candidate video, the matching degree between the search text and each target object in the target candidate video can be obtained.

Further, after the first matching result between the search text and the target candidate video is obtained, the target object feature sequence, the text feature sequence and the first matching result are input into the second matching layer to obtain the second matching result between the search text and the target candidate video, so that the matching weight corresponding to the target candidate video can be determined according to the first matching result and the second matching result.

在本说明书提供的一具体实施方式中,所述第二匹配层包括时序原型生成器和语义匹配器;In a specific implementation provided in this specification, the second matching layer includes a temporal prototype generator and a semantic matcher;

将所述目标对象特征序列、所述文本特征序列和所述第一匹配结果输入至所述第二匹配层,根据所述目标对象特征序列和所述文本特征序列确定第二匹配结果,包括:Inputting the target object feature sequence, the text feature sequence, and the first matching result into the second matching layer, and determining a second matching result according to the target object feature sequence and the text feature sequence, comprising:

将所述目标对象特征序列输入至所述时序原型生成器,获得所述目标对象特征序列对应的语义特征序列;Inputting the target object feature sequence into the temporal prototype generator to obtain a semantic feature sequence corresponding to the target object feature sequence;

在所述文本特征序列中确定全局特征向量,将所述语义特征序列和所述全局特征向量输入至所述语义匹配器,获得第二匹配结果。A global feature vector is determined in the text feature sequence, and the semantic feature sequence and the global feature vector are input into the semantic matcher to obtain a second matching result.

其中,时序原型生成器,用于将目标对象特征序列中的各目标对象特征向量聚合成语义特征向量。语义特征序列,是指由目标对象特征序列聚合成的各语义特征向量所构成的特征序列。语义匹配器,用于将全局特征向量与各语义特征向量进行特征匹配。The temporal prototype generator is used to aggregate the target object feature vectors in the target object feature sequence into a semantic feature vector. The semantic feature sequence refers to a feature sequence composed of the semantic feature vectors aggregated from the target object feature sequence. The semantic matcher is used to perform feature matching between the global feature vector and the semantic feature vectors.

具体地,将目标对象特征序列输入至时序原型生成器,可以获得目标对象特征序列对应的语义特征序列,进而,在文本特征向量中确定全局特征向量,并将全局特征向量和语义特征序列输入至语义匹配器,以获得检索文本和目标候选视频之间的第二匹配结果。进一步地,获得目标对象特征序列对应的语义特征序列的实现方式如下:Specifically, the target object feature sequence is input into the temporal prototype generator to obtain a semantic feature sequence corresponding to the target object feature sequence, and then a global feature vector is determined in the text feature vector, and the global feature vector and the semantic feature sequence are input into the semantic matcher to obtain a second matching result between the search text and the target candidate video. Furthermore, the implementation method of obtaining the semantic feature sequence corresponding to the target object feature sequence is as follows:

在本说明书提供的一具体实施方式中,获得所述目标对象特征序列对应的语义特征序列,包括:In a specific implementation provided in this specification, obtaining a semantic feature sequence corresponding to the target object feature sequence includes:

解码所述目标对象特征序列,获得关键帧特征序列;Decoding the target object feature sequence to obtain a key frame feature sequence;

确定所述关键帧特征序列中各关键帧特征向量之间的关联关系;Determining the correlation relationship between the key frame feature vectors in the key frame feature sequence;

基于所述关联关系,生成至少一个语义特征向量;Based on the association relationship, generating at least one semantic feature vector;

基于各语义特征向量,获得语义特征序列。Based on each semantic feature vector, a semantic feature sequence is obtained.

在时序原型生成器中设置有帧解码器,可以对各目标对象特征向量进行解码,生成帧级别的关键帧特征向量,进而,可以基于各关键帧特征向量进行生成语义特征向量。A frame decoder is provided in the temporal prototype generator, which can decode each target object feature vector to generate a key frame feature vector at the frame level, and further, can generate a semantic feature vector based on each key frame feature vector.

The key frame feature sequence is the feature sequence composed of the key frame feature vectors obtained by decoding the target object feature vectors. The semantic feature sequence is the feature sequence composed of the semantic feature vectors generated through interaction among the key frame feature vectors in the key frame feature sequence, where N_e is the number of semantic feature vectors.

具体地,利用帧解码器对目标对象特征序列中的各目标对象特征向量进行解码,分析各目标对象特征向量之间的空间关系,获得各关键帧特征向量,以构成对应的关键帧特征序列。确定关键帧特征序列中各关键帧特征向量之间的关联关系,并根据各关联关系生成至少一个语义特征向量,基于至少一个语义特征向量生成语义特征序列。Specifically, a frame decoder is used to decode each target object feature vector in the target object feature sequence, and the spatial relationship between each target object feature vector is analyzed to obtain each key frame feature vector to form a corresponding key frame feature sequence. The association relationship between each key frame feature vector in the key frame feature sequence is determined, and at least one semantic feature vector is generated according to each association relationship, and a semantic feature sequence is generated based on the at least one semantic feature vector.

进一步地,可以利用注意力机制进行分析各目标对象特征向量之间的空间关系。具体地,随机初始化帧查询特征向量Qf,将各目标对象特征向量Po进行线性变换,并将线性变换后的特征向量分别作为帧键特征向量Ko,以及帧值特征向量Vo,通过掩码注意力计算,获得各目标对象特征向量之间的空间关系,基于各目标对象特征向量之间的空间关系,生成关键帧特征序列。Furthermore, the attention mechanism can be used to analyze the spatial relationship between the feature vectors of each target object. Specifically, the frame query feature vector Qf is randomly initialized, each target object feature vector Po is linearly transformed, and the feature vectors after linear transformation are used as the frame key feature vector Ko and the frame value feature vector Vo , respectively. Through mask attention calculation, the spatial relationship between the feature vectors of each target object is obtained, and based on the spatial relationship between the feature vectors of each target object, a key frame feature sequence is generated.

Furthermore, the masked attention computation can be implemented by the following formula (1):

P_f = softmax(Q_f K_o^T / √D + M_f) V_o    (1)

where P_f is the key frame feature vector, Q_f is the frame query feature vector, K_o is the frame key feature vector, V_o is the frame value feature vector, M_f is the attention mask, and softmax(·) is the normalized exponential function.
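The masked cross-attention of formula (1) can be sketched as follows; the √D scaling and the additive, all-zero example mask are standard conventions assumed for illustration rather than details taken from this specification.

```python
import torch

def masked_attention(Qf, Ko, Vo, Mf):
    """Qf: (L, D) frame queries; Ko, Vo: (No, D) keys/values; Mf: (L, No) additive attention mask."""
    D = Qf.shape[-1]
    scores = Qf @ Ko.t() / D ** 0.5 + Mf        # as in formula (1)
    return scores.softmax(dim=-1) @ Vo          # P_f

L, No, D = 12, 8, 512
Qf = torch.rand(L, D)                           # randomly initialised frame query feature vectors
Ko = Vo = torch.rand(No, D)                     # linearly transformed object features (shared here for brevity)
Mf = torch.zeros(L, No)                         # zero mask: every object feature may be attended to
Pf = masked_attention(Qf, Ko, Vo, Mf)
print(Pf.shape)                                 # torch.Size([12, 512])
```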

类似地,可以利用注意力机制进行动态分析各关键帧特征向量之间的语义关系。具体地,随机初始化语义查询特征向量Qe,将各关键帧特征向量Pf进行线性变换,并将线性变换后的特征向量分别作为语义键特征向量Kf,以及语义值特征向量Vf,通过动态注意力计算,获得各关键帧特征向量之间的关联关系,并基于各关键帧特征向量之间的关联关系,生成语义特征序列。其中,语义特征向量的数量可以基于实际应用情况进行自定义设置,例如,将语义特征向量的数量设置为2个、3个等。Similarly, the attention mechanism can be used to dynamically analyze the semantic relationship between the feature vectors of each key frame. Specifically, the semantic query feature vector Qe is randomly initialized, each key frame feature vector Pf is linearly transformed, and the feature vectors after linear transformation are respectively used as the semantic key feature vector Kf and the semantic value feature vector Vf . Through dynamic attention calculation, the association relationship between the feature vectors of each key frame is obtained, and based on the association relationship between the feature vectors of each key frame, a semantic feature sequence is generated. Among them, the number of semantic feature vectors can be customized based on the actual application situation, for example, the number of semantic feature vectors can be set to 2, 3, etc.

The dynamic attention computation can be implemented by the following formula (2):

P_e = softmax(Q_e K_f^T / √D) V_f    (2)

where P_e is the semantic feature vector, Q_e is the semantic query feature vector, K_f is the semantic key feature vector, V_f is the semantic value feature vector, and softmax(·) is the normalized exponential function.
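Formula (2) is a plain cross-attention from the semantic queries onto the decoded frame features; a sketch that chains it after the frame-decoding step might look as follows, with the number of semantic prototypes N_e set to 3 purely as an example.

```python
import torch

def cross_attention(Q, K, V):
    """softmax(Q K^T / sqrt(D)) V, as in formula (2)."""
    return (Q @ K.t() / Q.shape[-1] ** 0.5).softmax(dim=-1) @ V

L, Ne, D = 12, 3, 512
Pf = torch.rand(L, D)            # key frame feature sequence from the frame decoder
Qe = torch.rand(Ne, D)           # randomly initialised semantic query feature vectors
Kf = torch.nn.Linear(D, D)(Pf)   # linear transform -> semantic key feature vectors
Vf = torch.nn.Linear(D, D)(Pf)   # linear transform -> semantic value feature vectors

Pe = cross_attention(Qe, Kf, Vf) # semantic feature sequence, one vector per semantic prototype
print(Pe.shape)                  # torch.Size([3, 512])
```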

通过上述方法,对各目标对象特征向量之间的空间关系进行分析,生成关键帧特征序列,对各关键帧特征向量之间的语义关系进行分析,生成语义特征序列,从而可以获知目标候选视频的视频语义多样性。 在后续检索目标视频的过程中,结合候选视频的视频语义进行检索目标视频,可以提高检索目标视频的准确性。Through the above method, the spatial relationship between the feature vectors of each target object is analyzed to generate a key frame feature sequence, and the semantic relationship between the feature vectors of each key frame is analyzed to generate a semantic feature sequence, so that the video semantic diversity of the target candidate video can be known. In the subsequent process of retrieving the target video, the target video can be retrieved in combination with the video semantics of the candidate video, which can improve the accuracy of retrieving the target video.

在生成语义特征序列后,需要将检索文本与目标候选视频的视频语义进行匹配,以获得检索文本与目标候选视频之间的第二匹配结果。After the semantic feature sequence is generated, the search text needs to be matched with the video semantics of the target candidate video to obtain a second matching result between the search text and the target candidate video.

在本说明书提供的一具体实施方式中,获得第二匹配结果,包括:In a specific implementation manner provided in this specification, obtaining a second matching result includes:

确定所述全局特征向量与所述语义特征序列中各语义特征向量之间的语义相似度;Determining the semantic similarity between the global feature vector and each semantic feature vector in the semantic feature sequence;

基于各语义特征向量的语义相似度,确定所述检索文本与所述目标候选视频之间的第二匹配结果。Based on the semantic similarity of each semantic feature vector, a second matching result between the search text and the target candidate video is determined.

具体地,计算全局特征向量与语义特征序列中各语义特征向量之间的语义相似度,在全局特征向量与各语义特征向量之间的语义相似度中,确定最大语义相似度,并将该最大语义相似度确定为第二匹配结果。Specifically, the semantic similarity between the global feature vector and each semantic feature vector in the semantic feature sequence is calculated, the maximum semantic similarity is determined among the semantic similarities between the global feature vector and each semantic feature vector, and the maximum semantic similarity is determined as the second matching result.
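A sketch of this second matching step, assuming cosine similarity between the global text feature vector and each semantic feature vector followed by a maximum, is given below.

```python
import torch
import torch.nn.functional as F

def second_matching_result(global_text_vec, semantic_feats):
    """global_text_vec: (D,), semantic_feats: (Ne, D) -> maximum semantic similarity."""
    sims = F.cosine_similarity(global_text_vec.unsqueeze(0), semantic_feats, dim=-1)
    return sims.max()

s_es = second_matching_result(torch.rand(512), torch.rand(3, 512))
print(float(s_es))
```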

通过计算检索文本与目标候选视频之间的第二匹配结果,可以获知检索文本与目标候选视频的视频语义的匹配度。By calculating the second matching result between the search text and the target candidate video, the matching degree between the search text and the video semantics of the target candidate video can be known.

在获得检索文本与目标候选视频之间的第一匹配结果和第二匹配结果后,基于第一匹配结果和第二匹配结果,确定目标候选视频的匹配权重。After obtaining a first matching result and a second matching result between the search text and the target candidate video, a matching weight of the target candidate video is determined based on the first matching result and the second matching result.

Specifically, a weighted calculation may be performed on the first matching result and the second matching result, and the calculation result is determined as the matching weight of the target candidate video. The specific calculation may refer to the following formula (3):

s = s_es + β·s_op    (3)

where s is the matching weight, s_es is the second matching result, s_op is the first matching result, and β is the spatial matching factor. β can be manually adjusted according to the actual application; the specific value is not limited in this specification.

本说明书提供的视频检索方法,通过计算检索文本和目标候选视频之间的第一匹配结果,确定检索文本与目标候选视频中各目标对象的匹配度,计算检索文本和目标候选视频之间的第二匹配结果,确定检索文本与目标候选视频的视频语义之间的匹配度。进而,结合第一匹配结果和第二匹配结果,确定目标候选视频的匹配权重,通过对目标候选视频的局部内容和整体内容进行分析,提高确定目标候选视频匹配权重的准确性。The video retrieval method provided in this specification determines the matching degree between the retrieval text and each target object in the target candidate video by calculating the first matching result between the retrieval text and the target candidate video, and determines the matching degree between the retrieval text and the video semantics of the target candidate video by calculating the second matching result between the retrieval text and the target candidate video. Then, the matching weight of the target candidate video is determined by combining the first matching result and the second matching result, and the accuracy of determining the matching weight of the target candidate video is improved by analyzing the local content and the overall content of the target candidate video.

Referring to FIG. 3, FIG. 3 shows a schematic diagram of the model architecture of a video matching model provided according to an embodiment of this specification. As shown in FIG. 3, the video matching model includes an encoding layer, a first matching layer and a second matching layer. The encoding layer includes a video encoder and a text encoder, the first matching layer includes a spatial prototype generator and a target phrase matcher, and the second matching layer includes a temporal prototype generator and a semantic matcher.

进一步地,在实际应用的过程中,将检索文本输入至编码层的文本编码器中,可以获得检索文本对应的文本特征序列,将目标候选视频输入至编码层的视频编码器中,可以获得目标候选视频对应的目标视频帧块特征序列;将文本特征序列和目标视频帧块特征序列输入至第一匹配层的空间原型生成器,可以获得文本特征序列对应的短语特征序列和目标视频帧块特征序列对应的目标对象特征序列,将短语特征序列和目标对象特征序列输入至第一匹配层的目标短语匹配器,可以获得第一匹配结果;进而,将目标对象特征序列输入至第二匹配层的时序原型生成器,可以获得目标对象特征序列对应的语义特征序列,将文本特征序列中的全局特征向量和语义特征序列输入至第二匹配层的语义匹配器,可以获得第二匹配结果;最后,第二匹配层基于第一匹配结果和第二匹配结果,计算获得目标候选视频的匹配权重。通过对目标候选视频的局部内容和整体内容进行结合分析,提高确定目标候选视频匹配权重的准确性。Furthermore, in the actual application process, the search text is input into the text encoder of the encoding layer to obtain the text feature sequence corresponding to the search text, and the target candidate video is input into the video encoder of the encoding layer to obtain the target video frame block feature sequence corresponding to the target candidate video; the text feature sequence and the target video frame block feature sequence are input into the spatial prototype generator of the first matching layer to obtain the phrase feature sequence corresponding to the text feature sequence and the target object feature sequence corresponding to the target video frame block feature sequence; the phrase feature sequence and the target object feature sequence are input into the target phrase matcher of the first matching layer to obtain the first matching result; further, the target object feature sequence is input into the temporal prototype generator of the second matching layer to obtain the semantic feature sequence corresponding to the target object feature sequence; the global feature vector and the semantic feature sequence in the text feature sequence are input into the semantic matcher of the second matching layer to obtain the second matching result; finally, the second matching layer calculates the matching weight of the target candidate video based on the first matching result and the second matching result. By combining the analysis of the local content and the overall content of the target candidate video, the accuracy of determining the matching weight of the target candidate video is improved.

更进一步地,为了提升视频匹配模型的准确性,所述视频匹配模型通过下述方法训练获得:Furthermore, in order to improve the accuracy of the video matching model, the video matching model is trained by the following method:

获取训练数据样本对,和所述训练数据样本对对应的匹配权重标签;Obtaining training data sample pairs and matching weight labels corresponding to the training data sample pairs;

将所述训练数据样本对输入至所述视频匹配模型,获得预测匹配权重;Inputting the training data sample pairs into the video matching model to obtain predicted matching weights;

根据所述匹配权重标签和所述预测匹配权重,计算所述视频匹配模型的模型损失值;Calculating a model loss value of the video matching model according to the matching weight label and the predicted matching weight;

The model parameters of the video matching model are adjusted according to the model loss value, and the video matching model is continuously trained until a training stop condition is reached.

The training data sample pair is a text-video data pair obtained from a set of training data sample pairs and serves as a training sample of the video matching model, and includes positive training data sample pairs and negative training data sample pairs. The set of training data sample pairs is a set of text-video data pairs composed of text content collected from input speech or input text and the videos retrieved according to that text content. The matching weight label is the actual matching weight corresponding to a training data sample pair. The predicted matching weight is the matching weight output by the video matching model when the training data sample pair is input into it. The model loss value is the difference between the matching weight label and the predicted matching weight, and is used to measure the difference between the matching weight label and the predicted matching weight.

Specifically, the text data in the training data sample pairs is obtained in the same manner as the search text described above; a video retrieved based on the text data, together with its corresponding text data, forms a positive training data sample pair, and the other text-video data pairs of the same batch in the set of training data sample pairs form negative training data sample pairs. The training data sample pairs are input into the video matching model, which predicts the matching weights of the training data sample pairs. At this stage the video matching model has not yet been trained, so there is a deviation between the predicted matching weight and the actual matching weight label, and the model parameters of the video matching model need to be adjusted accordingly. Specifically, the model loss value of the video matching model is calculated from the output predicted matching weight and the matching weight label; in practice, the loss function used to calculate the model loss value may be a 0-1 loss function, a squared loss function, a cross-entropy loss function and so on, and in this specification the cross-entropy function is preferably selected as the loss function. The model parameters of the video matching model are adjusted according to the model loss value, and the adjusted parameters are used to continue training the video matching model on the next batch of training data sample pairs until the stop condition of model training is reached.
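A highly simplified training step consistent with this description is sketched below, assuming the common symmetric cross-entropy over a batch of text-video pairs in which the diagonal entries are the positive pairs; the ToyMatcher stand-in, the temperature, the optimiser and the batch size are all illustrative assumptions and not the model of this specification.

```python
import torch
import torch.nn.functional as F

class ToyMatcher(torch.nn.Module):
    """Stand-in for the video matching model: maps pre-extracted features to matching weights."""
    def __init__(self, dim=512):
        super().__init__()
        self.text_proj = torch.nn.Linear(dim, dim)
        self.video_proj = torch.nn.Linear(dim, dim)

    def forward(self, text_feats, video_feats):
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        return t @ v.t() / 0.05              # (B, B) predicted matching weights

model = ToyMatcher()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

text_feats, video_feats = torch.rand(8, 512), torch.rand(8, 512)   # one batch of paired samples
weights = model(text_feats, video_feats)
labels = torch.arange(8)                     # positive pairs lie on the diagonal
loss = 0.5 * (F.cross_entropy(weights, labels) + F.cross_entropy(weights.t(), labels))

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())                           # training stops once the loss or round-limit condition is met
```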

具体地,模型训练停止条件包括模型损失值小于预设阈值和/或训练轮次达到预设的轮次。Specifically, the model training stopping conditions include that the model loss value is less than a preset threshold and/or the training rounds reach a preset round.

在本说明书提供的一具体实施方式中,以通过模型损失值小于预设阈值为训练停止条件为例,预设阈值为0.3,当模型损失值小于0.3时,则认为视频匹配模型训练完成。In a specific implementation provided in this specification, taking the training stop condition of the model loss value being less than a preset threshold as an example, the preset threshold is 0.3. When the model loss value is less than 0.3, it is considered that the video matching model training is completed.

在本说明书提供的另一具体实施方式中,以预设的训练轮次作为训练停止条件为例,预设的训练轮次为30轮,当训练数据样本对的训练轮次达到30轮后,则认为视频匹配模型训练完成。In another specific implementation provided in the present specification, taking the preset training rounds as the training stop condition as an example, the preset training rounds are 30 rounds. When the training rounds of the training data sample pair reach 30 rounds, the video matching model training is considered to be completed.

在本说明书提供的又一具体实施方式中,设置预设阈值和预设训练轮次两个训练停止条件,同时监测模型损失值和训练轮次,当模型损失值或训练轮次中任意一项满足训练停止条件时,则认为视频匹配模型训练完成。In another specific implementation manner provided in the present specification, two training stop conditions, a preset threshold and a preset training round, are set, and the model loss value and the training rounds are monitored simultaneously. When either the model loss value or the training rounds meets the training stop condition, the video matching model training is considered completed.

本说明书提供的视频检索方法,通过训练好的视频匹配模型,对检索文本和目标候选视频进行分析处理,以获得目标候选视频对应的匹配权重,提高确定目标候选视频匹配权重的准确性。The video retrieval method provided in this specification analyzes and processes the retrieval text and the target candidate video through a trained video matching model to obtain the matching weight corresponding to the target candidate video, thereby improving the accuracy of determining the matching weight of the target candidate video.

步骤206:基于各候选视频对应的匹配权重,在所述至少一个候选视频中确定至少一个目标视频。 Step 206: Determine at least one target video from the at least one candidate video based on the matching weight corresponding to each candidate video.

其中,目标视频,是指与检索文本匹配度较高的视频,可用于基于检索文本显示于客户端。The target video refers to a video that has a high degree of matching with the search text and can be used to be displayed on the client based on the search text.

具体地,在确定各候选视频对应的匹配权重后,基于各候选视频对应的匹配权重,即可在各候选视频中确定检索文本对应的目标视频。Specifically, after determining the matching weights corresponding to the candidate videos, the target video corresponding to the search text can be determined in the candidate videos based on the matching weights corresponding to the candidate videos.

在本说明书提供的一具体实施方式中,基于各候选视频对应的匹配权重,在所述至少一个候选视频中确定至少一个目标视频,包括:In a specific implementation provided in this specification, based on the matching weights corresponding to the candidate videos, determining at least one target video from the at least one candidate video includes:

将匹配权重大于或等于预设匹配权重阈值的候选视频,确定为目标视频;或Determine the candidate video whose matching weight is greater than or equal to a preset matching weight threshold as the target video; or

根据各候选视频对应的匹配权重对各候选视频进行排序,获得候选视频列表,基于预设视频数量在所述候选视频列表中确定目标视频。The candidate videos are sorted according to the matching weights corresponding to the candidate videos to obtain a candidate video list, and a target video is determined in the candidate video list based on a preset number of videos.

其中,预设匹配权重阈值,是指预先设置的匹配权重的最小值,用于衡量各候选视频与检索文本的匹配度。预设视频数量,是指预先设置的可选取的目标视频的数量。The preset matching weight threshold refers to the preset minimum value of the matching weight, which is used to measure the matching degree between each candidate video and the search text. The preset number of videos refers to the preset number of selectable target videos.

具体地,在确定各候选视频对应的匹配权重后,可获取预设匹配权重阈值,在各候选视频中确定匹配权重大于或等于预设匹配权重阈值的候选视频,并将匹配权重大于或等于预设匹配权重阈值的候选视频确定为目标视频。Specifically, after determining the matching weight corresponding to each candidate video, a preset matching weight threshold can be obtained, and candidate videos whose matching weights are greater than or equal to the preset matching weight threshold are determined among the candidate videos, and candidate videos whose matching weights are greater than or equal to the preset matching weight threshold are determined as target videos.

进一步地,在确定各候选视频对应的匹配权重后,与也可以获取预设视频数量,进而,根据各候选视频对应的匹配权重,将各候选视频进行排序,以获得候选视频列表,在候选视频列表中选取预设视频数量的候选视频作为目标视频。Furthermore, after determining the matching weight corresponding to each candidate video, a preset number of videos can also be obtained, and then, according to the matching weight corresponding to each candidate video, each candidate video is sorted to obtain a candidate video list, and a preset number of candidate videos are selected from the candidate video list as target videos.

对于确定目标视频的实现方式,可以根据实际应用情况进行设定,本说明书在此不做限定。The implementation method for determining the target video can be set according to actual application conditions, and this specification does not limit it here.
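Both selection strategies can be expressed in a few lines of Python; the threshold value and the preset number of videos below are arbitrary example settings.

```python
def select_by_threshold(weights, threshold=0.6):
    """Keep every candidate whose matching weight reaches the preset matching weight threshold."""
    return [i for i, w in enumerate(weights) if w >= threshold]

def select_top_k(weights, k=5):
    """Sort candidates by matching weight and keep the preset number of videos."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:k]

weights = [0.82, 0.31, 0.67, 0.55, 0.91]     # matching weights of five candidate videos
print(select_by_threshold(weights))          # [0, 2, 4]
print(select_top_k(weights, k=2))            # [4, 0]
```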

本说明书提供的视频检索方法,在获得各候选视频对应的匹配权重后,基于各候选视频对应的匹配权重,在各候选视频中确定目标视频,提高确定目标视频的准确性。The video retrieval method provided in this specification, after obtaining the matching weights corresponding to each candidate video, determines the target video among the candidate videos based on the matching weights corresponding to each candidate video, thereby improving the accuracy of determining the target video.

参见图4a,图4a示出了根据本说明书一个实施例提供的一种视频检索方法的交互界面示意图。如图4a所示,用户可以在客户端的搜索栏中进行输入需要进行搜索的检索文本,从而生成视频检索指令,服务端在接收到视频检索指令后,利用上述视频检索方法,在预设数据库中确定至少一个目标视频,并将目标视频反馈并显示至客户端,其显示界面可如图4a所示。Referring to Figure 4a, Figure 4a shows a schematic diagram of an interactive interface of a video retrieval method provided according to an embodiment of the present specification. As shown in Figure 4a, a user can enter a search text to be searched in the search bar of the client to generate a video retrieval instruction. After receiving the video retrieval instruction, the server uses the above-mentioned video retrieval method to determine at least one target video in a preset database, and feeds back and displays the target video to the client, and its display interface can be shown in Figure 4a.

Referring to FIG. 4b, FIG. 4b shows a schematic diagram of an interaction interface of another video retrieval method provided according to an embodiment of this specification. As shown in FIG. 4b, the user can enter the search text to be searched in the search bar of the client, so as to generate a video retrieval instruction. After receiving the video retrieval instruction, the server uses the above video retrieval method to determine at least one target video in the preset database, and feeds back and displays the target video to the client; the display interface may be as shown in FIG. 4b.

需要进行说明的是,用户进行输入检索文本的方式并不局限于文字输入,也可以使用语音输入,图4a和图4b仅作示例性说明。It should be noted that the way in which the user inputs the search text is not limited to text input, and voice input may also be used. FIG. 4 a and FIG. 4 b are only exemplary illustrations.

本说明书提供的视频检索方法,包括:获取检索文本和至少一个候选视频;将所述检索文本和目标候选视频输入至视频匹配模型,获得所述视频匹配模型输出的所述目标候选视频对应的匹配权重,其中,所述目标候选视频为所述至少一个候选视频中的任一个,所述匹配权重基于所述检索文本与所述目标候选视频之间的第一匹配结果和第二匹配结果确定,所述第一匹配结果用于表征所述检索文本与所述目标候选视频中各目标对象的匹配度,所述第二匹配结果用于表征所述检索文本与所述目标候选视频的视频语义之间的匹配度;基于各候选视频对应的匹配权重,在所述至少一个候选视频中确定至少一个目标视频。The video retrieval method provided in this specification includes: obtaining a retrieval text and at least one candidate video; inputting the retrieval text and the target candidate video into a video matching model, and obtaining a matching weight corresponding to the target candidate video output by the video matching model, wherein the target candidate video is any one of the at least one candidate video, and the matching weight is determined based on a first matching result and a second matching result between the retrieval text and the target candidate video, the first matching result being used to characterize the matching degree between the retrieval text and each target object in the target candidate video, and the second matching result being used to characterize the matching degree between the retrieval text and the video semantics of the target candidate video; based on the matching weight corresponding to each candidate video, at least one target video is determined in the at least one candidate video.

本说明书一个实施例,通过计算检索文本和目标候选视频之间的第一匹配结果,确定检索文本与目标候选视频中各目标对象的匹配度,计算检索文本和目标候选视频之间的第二匹配结果,确定检索文本与目标候选视频的视频语义之间的匹配度。进而,结合第一匹配结果和第二匹配结果,确定目标候选视频的匹配权重,通过对目标候选视频的局部内容和整体内容进行分析,提高确定目标候选视频匹配权重的准确性;再根据各候选视频对应的匹配权重,在各候选视频中确定目标视频,提高基于文本检索视频的准确性。In one embodiment of the present specification, the first matching result between the search text and the target candidate video is calculated to determine the matching degree between the search text and each target object in the target candidate video, and the second matching result between the search text and the target candidate video is calculated to determine the matching degree between the search text and the video semantics of the target candidate video. Then, the matching weight of the target candidate video is determined by combining the first matching result and the second matching result, and the accuracy of determining the matching weight of the target candidate video is improved by analyzing the local content and the overall content of the target candidate video; then, the target video is determined in each candidate video according to the matching weight corresponding to each candidate video, thereby improving the accuracy of text-based video retrieval.

下述结合附图5,以本说明书提供的视频检索方法在基于文本检索视频的应用为例,对所述视频检索方法进行进一步说明。其中,图5示出了本说明书一个实施例提供的一种视频检索方法的处理过程流程图,具体包括以下步骤:The following is combined with Figure 5, taking the application of the video retrieval method provided by this specification in text-based video retrieval as an example to further illustrate the video retrieval method. Among them, Figure 5 shows a processing flow chart of a video retrieval method provided by an embodiment of this specification, which specifically includes the following steps:

步骤502:获取检索文本“煮面”和至少一个候选视频。Step 502: Obtain the search text "cook noodles" and at least one candidate video.

步骤504:将所述检索文本“煮面”输入至视频匹配模型的文本编码器,获得所述检索文本“煮面”的文本特征序列,将目标候选视频输入至所述视频匹配模型的视频编码器,获得所述目标候选视频的目标视频帧块特征序列。 Step 504: Input the search text "cook noodles" into the text encoder of the video matching model to obtain the text feature sequence of the search text "cook noodles", input the target candidate video into the video encoder of the video matching model to obtain the target video frame block feature sequence of the target candidate video.

步骤506:将所述文本特征序列和所述目标视频帧块特征序列输入至所述视频匹配模型的空间原型生成器,获得所述文本特征序列对应的短语特征序列和所述目标视频帧块特征序列对应的目标对象特征序列。Step 506: Input the text feature sequence and the target video frame block feature sequence into the spatial prototype generator of the video matching model to obtain a phrase feature sequence corresponding to the text feature sequence and a target object feature sequence corresponding to the target video frame block feature sequence.

步骤508:将所述短语特征序列和所述目标对象特征序列输入至所述视频匹配模型的目标短语匹配器,获得第一匹配结果。Step 508: Input the phrase feature sequence and the target object feature sequence into the target phrase matcher of the video matching model to obtain a first matching result.

步骤510:将所述目标对象特征序列输入至所述视频匹配模型的时序原型生成器,获得所述目标对象特征序列对应的语义特征序列。Step 510: Input the target object feature sequence into the temporal prototype generator of the video matching model to obtain a semantic feature sequence corresponding to the target object feature sequence.

步骤512:在所述文本特征序列中确定全局特征向量,将所述语义特征序列和所述全局特征向量输入至所述视频匹配模型的语义匹配器,获得第二匹配结果。Step 512: Determine a global feature vector in the text feature sequence, input the semantic feature sequence and the global feature vector into the semantic matcher of the video matching model, and obtain a second matching result.

步骤514:基于所述第一匹配结果和所述第二匹配结果确定所述目标候选视频对应的匹配权重。Step 514: Determine a matching weight corresponding to the target candidate video based on the first matching result and the second matching result.

步骤516:基于各候选视频对应的匹配权重,在所述至少一个候选视频中确定至少一个目标视频。Step 516: Determine at least one target video from the at least one candidate video based on the matching weight corresponding to each candidate video.

本说明书一个实施例,通过计算检索文本“煮面”和目标候选视频之间的第一匹配结果,确定检索文本“煮面”与目标候选视频中各目标对象的匹配度,计算检索文本“煮面”和目标候选视频之间的第二匹配结果,确定检索文本“煮面”与目标候选视频的视频语义之间的匹配度。进而,结合第一匹配结果和第二匹配结果,确定目标候选视频的匹配权重,通过对目标候选视频的局部内容和整体内容进行分析,提高确定目标候选视频匹配权重的准确性;再根据各候选视频对应的匹配权重,在各候选视频中确定目标视频,提高基于文本检索视频的准确性。In one embodiment of the present specification, the first matching result between the search text "cook noodles" and the target candidate video is calculated to determine the matching degree between the search text "cook noodles" and each target object in the target candidate video, and the second matching result between the search text "cook noodles" and the target candidate video is calculated to determine the matching degree between the search text "cook noodles" and the video semantics of the target candidate video. Then, the matching weight of the target candidate video is determined by combining the first matching result and the second matching result, and the accuracy of determining the matching weight of the target candidate video is improved by analyzing the local content and the overall content of the target candidate video; then, the target video is determined in each candidate video according to the matching weight corresponding to each candidate video, thereby improving the accuracy of text-based video retrieval.

与上述方法实施例相对应,本说明书还提供了视频检索装置实施例,图6示出了本说明书一个实施例提供的一种视频检索装置的结构示意图。如图6所示,该装置包括:Corresponding to the above method embodiment, this specification also provides a video retrieval device embodiment. FIG6 shows a schematic diagram of the structure of a video retrieval device provided by an embodiment of this specification. As shown in FIG6, the device includes:

获取模块602,被配置为获取检索文本和至少一个候选视频;An acquisition module 602 is configured to acquire a search text and at least one candidate video;

The input module 604 is configured to input the search text and a target candidate video into a video matching model and obtain a matching weight, output by the video matching model, corresponding to the target candidate video, wherein the target candidate video is any one of the at least one candidate video, the matching weight is determined based on a first matching result and a second matching result between the search text and the target candidate video, the first matching result is used to characterize the matching degree between the search text and each target object in the target candidate video, and the second matching result is used to characterize the matching degree between the search text and the video semantics of the target candidate video;

确定模块606,被配置为基于各候选视频对应的匹配权重,在所述至少一个候选视频中确定至少一个目标视频。The determination module 606 is configured to determine at least one target video from the at least one candidate video based on the matching weight corresponding to each candidate video.

可选的,所述视频匹配模型包括编码层、第一匹配层和第二匹配层;Optionally, the video matching model includes a coding layer, a first matching layer and a second matching layer;

所述输入模块,进一步被配置为:The input module is further configured as follows:

将所述检索文本和所述目标候选视频输入至所述编码层,获得所述检索文本的文本特征序列和所述目标候选视频的目标视频帧块特征序列;Inputting the search text and the target candidate video into the coding layer to obtain a text feature sequence of the search text and a target video frame block feature sequence of the target candidate video;

将所述文本特征序列和所述目标视频帧块特征序列输入至所述第一匹配层,获得所述文本特征序列对应的短语特征序列和所述目标视频帧块特征序列对应的目标对象特征序列,根据所述短语特征序列和所述目标对象特征序列确定第一匹配结果;Inputting the text feature sequence and the target video frame block feature sequence into the first matching layer, obtaining a phrase feature sequence corresponding to the text feature sequence and a target object feature sequence corresponding to the target video frame block feature sequence, and determining a first matching result according to the phrase feature sequence and the target object feature sequence;

将所述目标对象特征序列、所述文本特征序列和所述第一匹配结果输入至所述第二匹配层,根据所述目标对象特征序列和所述文本特征序列确定第二匹配结果,并基于所述第一匹配结果和所述第二匹配结果确定所述目标候选视频对应的匹配权重。The target object feature sequence, the text feature sequence and the first matching result are input into the second matching layer, a second matching result is determined according to the target object feature sequence and the text feature sequence, and a matching weight corresponding to the target candidate video is determined based on the first matching result and the second matching result.

可选的,所述编码层包括文本编码器和视频编码器;Optionally, the encoding layer includes a text encoder and a video encoder;

所述输入模块,进一步被配置为:The input module is further configured as follows:

将所述检索文本输入至所述文本编码器,获得所述检索文本的文本特征序列;Inputting the search text into the text encoder to obtain a text feature sequence of the search text;

将所述目标候选视频输入至所述视频编码器,获得所述目标候选视频的目标视频帧块特征序列。The target candidate video is input into the video encoder to obtain a target video frame block feature sequence of the target candidate video.

可选的,所述第一匹配层包括空间原型生成器和目标短语匹配器;Optionally, the first matching layer includes a spatial prototype generator and a target phrase matcher;

所述输入模块,进一步被配置为:The input module is further configured as follows:

将所述文本特征序列和所述目标视频帧块特征序列输入至所述空间原型生成器,获得所述文本特征序列对应的短语特征序列和所述目标视频帧块特征序列对应的目标对象特征序列;Inputting the text feature sequence and the target video frame block feature sequence into the spatial prototype generator to obtain a phrase feature sequence corresponding to the text feature sequence and a target object feature sequence corresponding to the target video frame block feature sequence;

将所述短语特征序列和所述目标对象特征序列输入至所述目标短语匹配器,获得第一匹配结果。The phrase feature sequence and the target object feature sequence are input into the target phrase matcher to obtain a first matching result.

Optionally, the input module is further configured as follows:

Determine the predicted text weight of a target text feature vector, wherein the target text feature vector is any one of the text feature vectors in the text feature sequence;

基于所述目标文本特征向量和所述预测文本权重,生成目标短语特征向量;Generate a target phrase feature vector based on the target text feature vector and the predicted text weight;

基于各目标短语特征向量,获得短语特征序列。Based on each target phrase feature vector, a phrase feature sequence is obtained.

可选的,所述输入模块,进一步被配置为:Optionally, the input module is further configured as:

确定第一帧块特征向量的预测帧权重,其中,所述第一帧块特征向量为所述目标视频帧块特征序列中各目标视频帧块特征向量的任一个;Determine a prediction frame weight of a first frame block feature vector, wherein the first frame block feature vector is any one of the target video frame block feature vectors in the target video frame block feature sequence;

基于所述第一帧块特征向量和所述预测帧权重,生成第一对象特征向量;generating a first object feature vector based on the first frame block feature vector and the predicted frame weight;

基于各第一对象特征向量,获得目标对象特征序列。Based on each first object feature vector, a target object feature sequence is obtained.

Optionally, the input module is further configured to:

Select a feature vector of an object to be processed from the target object feature sequence;

Calculate a first similarity between the feature vector of the object to be processed and each phrase feature vector in the phrase feature sequence, and calculate a second similarity between the feature vector of the object to be processed and each target video frame block feature vector in the target video frame block feature sequence;

Determine a target first similarity among the first similarities, and determine a target second similarity among the second similarities;

Determine an initial matching result of the feature vector of the object to be processed based on the target first similarity and the target second similarity;

Generate a first matching result based on the initial matching results of the feature vectors of the objects to be processed.
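
The similarity bookkeeping described above can be made concrete as below. The sketch takes the maximum similarity on each side as the "target" similarity and averages the per-object scores into the first matching result; max selection and mean aggregation are assumptions, since the original only states that a target similarity is determined and that the initial results are combined.

```python
import torch
import torch.nn.functional as F

def first_matching_result(object_feats, phrase_feats, frame_block_feats):
    # cosine similarity of every object vector to every phrase / frame block vector
    obj = F.normalize(object_feats, dim=-1)
    sim_to_phrases = obj @ F.normalize(phrase_feats, dim=-1).T       # first similarities  [M, K]
    sim_to_blocks = obj @ F.normalize(frame_block_feats, dim=-1).T   # second similarities [M, N]
    target_first = sim_to_phrases.max(dim=-1).values   # target first similarity per object
    target_second = sim_to_blocks.max(dim=-1).values   # target second similarity per object
    initial = 0.5 * (target_first + target_second)     # initial matching result per object
    return initial.mean()                              # first matching result
```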

Optionally, the second matching layer includes a temporal prototype generator and a semantic matcher;

The input module is further configured to:

Input the target object feature sequence into the temporal prototype generator to obtain a semantic feature sequence corresponding to the target object feature sequence;

Determine a global feature vector in the text feature sequence, and input the semantic feature sequence and the global feature vector into the semantic matcher to obtain a second matching result.
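
The second matching layer can then be wired analogously; taking the first token of the text feature sequence (a [CLS]-style token) as the global feature vector is an assumption, as are the module names.

```python
import torch.nn as nn

class SecondMatchingLayer(nn.Module):
    def __init__(self, temporal_generator: nn.Module, semantic_matcher: nn.Module):
        super().__init__()
        self.temporal_generator = temporal_generator  # temporal prototype generator
        self.semantic_matcher = semantic_matcher      # semantic matcher

    def forward(self, object_feats, text_feats):
        semantic_feats = self.temporal_generator(object_feats)  # semantic feature sequence
        global_vec = text_feats[0]                               # assumed global ([CLS]) feature vector
        return self.semantic_matcher(semantic_feats, global_vec)
```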

Optionally, the input module is further configured to:

Decode the target object feature sequence to obtain a key frame feature sequence;

Determine association relationships between the key frame feature vectors in the key frame feature sequence;

Generate at least one semantic feature vector based on the association relationships;

Obtain a semantic feature sequence based on each semantic feature vector.
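
One plausible reading of these steps, sketched below with a recent PyTorch: learned queries cross-attend to the target object feature sequence to "decode" key frame features, and self-attention between the key frames models their association relationships, yielding the semantic feature vectors. The attention-based realization is an assumption; the original does not name a mechanism.

```python
import torch
import torch.nn as nn

class TemporalPrototypeGenerator(nn.Module):
    def __init__(self, dim: int, num_keyframes: int, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_keyframes, dim))  # learned key-frame queries
        self.decode_attn = nn.MultiheadAttention(dim, heads)  # decode key frames from object features
        self.assoc_attn = nn.MultiheadAttention(dim, heads)   # associations between key frames

    def forward(self, object_feats: torch.Tensor) -> torch.Tensor:
        # object_feats: [M, D] -> key frame feature sequence [F, D]
        keyframes, _ = self.decode_attn(self.queries, object_feats, object_feats)
        # self-attention captures the relations between key frame vectors
        semantic_feats, _ = self.assoc_attn(keyframes, keyframes, keyframes)
        return semantic_feats  # semantic feature sequence
```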

Optionally, the input module is further configured to:

Determine a semantic similarity between the global feature vector and each semantic feature vector in the semantic feature sequence;

Determine a second matching result between the search text and the target candidate video based on the semantic similarity of each semantic feature vector.
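
Concretely, the semantic matcher can be as simple as a cosine similarity between the global feature vector and each semantic feature vector, aggregated by taking the maximum; max-aggregation is again an assumption.

```python
import torch
import torch.nn.functional as F

def second_matching_result(global_vec: torch.Tensor, semantic_feats: torch.Tensor) -> torch.Tensor:
    # semantic similarity of the global text vector to each semantic vector
    sims = F.normalize(semantic_feats, dim=-1) @ F.normalize(global_vec, dim=-1)
    return sims.max()  # second matching result between the search text and the video
```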

Optionally, the determining module is further configured to:

Determine a candidate video whose matching weight is greater than or equal to a preset matching weight threshold as a target video; or

Sort the candidate videos according to their corresponding matching weights to obtain a candidate video list, and determine the target video from the candidate video list based on a preset number of videos.
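
Both selection strategies are straightforward to state in code; the threshold and the preset number of videos below are placeholder values.

```python
def select_by_threshold(weights: dict, threshold: float = 0.7) -> list:
    # keep every candidate whose matching weight reaches the preset threshold
    return [video for video, w in weights.items() if w >= threshold]

def select_top_k(weights: dict, k: int = 5) -> list:
    # rank candidates by matching weight and keep the preset number of videos
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [video for video, _ in ranked[:k]]
```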

Optionally, the device further includes a training module, configured to:

Obtain training data sample pairs and matching weight labels corresponding to the training data sample pairs;

Input the training data sample pairs into the video matching model to obtain predicted matching weights;

Calculate a model loss value of the video matching model according to the matching weight labels and the predicted matching weights;

Adjust model parameters of the video matching model according to the model loss value, and continue training the video matching model until a training stop condition is met.
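
A bare-bones training loop matching this description might look as follows; the mean-squared-error loss on the matching weight labels, the Adam optimizer, and the fixed epoch count standing in for the stop condition are all assumptions, since the original names none of them.

```python
import torch
import torch.nn as nn

def train_video_matching_model(model, data_loader, epochs: int = 10, lr: float = 1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()  # assumed loss between predicted weight and weight label
    for _ in range(epochs):  # stands in for the training stop condition
        for search_text, candidate_video, weight_label in data_loader:  # sample pairs + labels
            predicted_weight = model(search_text, candidate_video)
            loss = loss_fn(predicted_weight, weight_label)  # model loss value
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()  # adjust the model parameters
    return model
```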

The video retrieval device provided in this specification includes: an acquisition module, configured to acquire a search text and at least one candidate video; an input module, configured to input the search text and a target candidate video into a video matching model and obtain a matching weight, output by the video matching model, corresponding to the target candidate video, where the target candidate video is any one of the at least one candidate video, the matching weight is determined based on a first matching result and a second matching result between the search text and the target candidate video, the first matching result is used to characterize the degree of matching between the search text and each target object in the target candidate video, and the second matching result is used to characterize the degree of matching between the search text and the video semantics of the target candidate video; and a determining module, configured to determine at least one target video from the at least one candidate video based on the matching weight corresponding to each candidate video.

In one embodiment of this specification, a first matching result between the search text and a target candidate video is calculated to determine the degree of matching between the search text and each target object in the target candidate video, and a second matching result between the search text and the target candidate video is calculated to determine the degree of matching between the search text and the video semantics of the target candidate video. The matching weight of the target candidate video is then determined by combining the first matching result and the second matching result; by analyzing both the local content and the overall content of the target candidate video, the accuracy of the determined matching weight is improved. Finally, the target video is determined among the candidate videos according to the matching weight corresponding to each candidate video, which improves the accuracy of text-based video retrieval.

The above is a schematic solution of a video retrieval device of this embodiment. It should be noted that the technical solution of the video retrieval device and the technical solution of the above video retrieval method belong to the same concept; for details not described in the technical solution of the video retrieval device, reference may be made to the description of the technical solution of the above video retrieval method.

Fig. 7 shows a block diagram of a computing device 700 according to an embodiment of this specification. The components of the computing device 700 include, but are not limited to, a memory 710 and a processor 720. The processor 720 is connected to the memory 710 via a bus 730, and a database 750 is used to store data.

The computing device 700 also includes an access device 740 that enables the computing device 700 to communicate via one or more networks 760. Examples of these networks include a public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet. The access device 740 may include one or more of any type of wired or wireless network interface (for example, a network interface card (NIC)), such as an IEEE 802.11 wireless local area network (WLAN) interface, a Worldwide Interoperability for Microwave Access (Wi-MAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, or a near field communication (NFC) interface.

In one embodiment of this specification, the above components of the computing device 700 and other components not shown in Fig. 7 may also be connected to each other, for example, via a bus. It should be understood that the block diagram of the computing device shown in Fig. 7 is for illustrative purposes only and is not intended to limit the scope of this specification. Those skilled in the art may add or replace other components as needed.

The computing device 700 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (for example, a tablet computer, a personal digital assistant, a laptop computer, a notebook computer, or a netbook), a mobile phone (for example, a smartphone), a wearable computing device (for example, a smart watch or smart glasses), or another type of mobile device, or a stationary computing device such as a desktop computer or a personal computer (PC). The computing device 700 may also be a mobile or stationary server.

The processor 720 is configured to execute computer-executable instructions which, when executed by the processor, implement the steps of the above video retrieval method.

The above is a schematic solution of a computing device of this embodiment. It should be noted that the technical solution of the computing device and the technical solution of the above video retrieval method belong to the same concept; for details not described in the technical solution of the computing device, reference may be made to the description of the technical solution of the above video retrieval method.

An embodiment of this specification further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the above video retrieval method.

The above is a schematic solution of a computer-readable storage medium of this embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the above video retrieval method belong to the same concept; for details not described in the technical solution of the storage medium, reference may be made to the description of the technical solution of the above video retrieval method.

An embodiment of this specification further provides a computer program which, when executed in a computer, causes the computer to perform the steps of the above video retrieval method.

The above is a schematic solution of a computer program of this embodiment. It should be noted that the technical solution of the computer program and the technical solution of the above video retrieval method belong to the same concept; for details not described in the technical solution of the computer program, reference may be made to the description of the technical solution of the above video retrieval method.

The above describes specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the specific or sequential order shown to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.

The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like.

It should be noted that, for ease of description, the foregoing method embodiments are all expressed as a series of action combinations, but those skilled in the art should know that the embodiments of this specification are not limited by the described order of actions, because according to the embodiments of this specification, some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the embodiments of this specification.

In the above embodiments, the description of each embodiment has its own emphasis. For parts that are not described in detail in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.

The preferred embodiments of this specification disclosed above are only intended to help explain this specification. The optional embodiments do not describe all the details exhaustively, nor do they limit the invention to the specific implementations described. Obviously, many modifications and changes can be made based on the content of the embodiments of this specification. This specification selects and specifically describes these embodiments in order to better explain the principles and practical applications of the embodiments of this specification, so that those skilled in the art can well understand and use this specification. This specification is limited only by the claims and their full scope and equivalents.

Claims (14)

1. A video retrieval method, comprising:
obtaining a search text and at least one candidate video;
inputting the search text and a target candidate video into a video matching model, and obtaining a matching weight, output by the video matching model, corresponding to the target candidate video, wherein the target candidate video is any one of the at least one candidate video, the matching weight is determined based on a first matching result and a second matching result between the search text and the target candidate video, the first matching result is used to characterize the degree of matching between the search text and each target object in the target candidate video, and the second matching result is used to characterize the degree of matching between the search text and the video semantics of the target candidate video; and
determining at least one target video from the at least one candidate video based on the matching weight corresponding to each candidate video.

2. The method according to claim 1, wherein the video matching model comprises an encoding layer, a first matching layer, and a second matching layer; and
inputting the search text and the target candidate video into the video matching model and obtaining the matching weight, output by the video matching model, corresponding to the target candidate video comprises:
inputting the search text and the target candidate video into the encoding layer to obtain a text feature sequence of the search text and a target video frame block feature sequence of the target candidate video;
inputting the text feature sequence and the target video frame block feature sequence into the first matching layer to obtain a phrase feature sequence corresponding to the text feature sequence and a target object feature sequence corresponding to the target video frame block feature sequence, and determining a first matching result according to the phrase feature sequence and the target object feature sequence; and
inputting the target object feature sequence, the text feature sequence, and the first matching result into the second matching layer, determining a second matching result according to the target object feature sequence and the text feature sequence, and determining the matching weight corresponding to the target candidate video based on the first matching result and the second matching result.

3. The method according to claim 2, wherein the encoding layer comprises a text encoder and a video encoder; and
inputting the search text and the target candidate video into the video matching model to obtain the text feature sequence of the search text and the target video frame block feature sequence of the target candidate video comprises:
inputting the search text into the text encoder to obtain the text feature sequence of the search text; and
inputting the target candidate video into the video encoder to obtain the target video frame block feature sequence of the target candidate video.

4. The method according to claim 2, wherein the first matching layer comprises a spatial prototype generator and a target phrase matcher; and
inputting the text feature sequence and the target video frame block feature sequence into the first matching layer, obtaining the phrase feature sequence corresponding to the text feature sequence and the target object feature sequence corresponding to the target video frame block feature sequence, and determining the first matching result according to the phrase feature sequence and the target object feature sequence comprises:
inputting the text feature sequence and the target video frame block feature sequence into the spatial prototype generator to obtain the phrase feature sequence corresponding to the text feature sequence and the target object feature sequence corresponding to the target video frame block feature sequence; and
inputting the phrase feature sequence and the target object feature sequence into the target phrase matcher to obtain the first matching result.

5. The method according to claim 4, wherein obtaining the phrase feature sequence corresponding to the text feature sequence comprises:
determining a predicted text weight of a target text feature vector, wherein the target text feature vector is any one of the text feature vectors in the text feature sequence;
generating a target phrase feature vector based on the target text feature vector and the predicted text weight; and
obtaining the phrase feature sequence based on each target phrase feature vector.

6. The method according to claim 4, wherein obtaining the target object feature sequence corresponding to the target video frame block feature sequence comprises:
determining a predicted frame weight of a first frame block feature vector, wherein the first frame block feature vector is any one of the target video frame block feature vectors in the target video frame block feature sequence;
generating a first object feature vector based on the first frame block feature vector and the predicted frame weight; and
obtaining the target object feature sequence based on each first object feature vector.

7. The method according to claim 4, wherein obtaining the first matching result comprises:
selecting a feature vector of an object to be processed from the target object feature sequence;
calculating a first similarity between the feature vector of the object to be processed and each phrase feature vector in the phrase feature sequence, and calculating a second similarity between the feature vector of the object to be processed and each target video frame block feature vector in the target video frame block feature sequence;
determining a target first similarity among the first similarities, and determining a target second similarity among the second similarities;
determining an initial matching result of the feature vector of the object to be processed based on the target first similarity and the target second similarity; and
generating the first matching result based on the initial matching results of the feature vectors of the objects to be processed.

8. The method according to claim 2, wherein the second matching layer comprises a temporal prototype generator and a semantic matcher; and
inputting the target object feature sequence, the text feature sequence, and the first matching result into the second matching layer and determining the second matching result according to the target object feature sequence and the text feature sequence comprises:
inputting the target object feature sequence into the temporal prototype generator to obtain a semantic feature sequence corresponding to the target object feature sequence; and
determining a global feature vector in the text feature sequence, and inputting the semantic feature sequence and the global feature vector into the semantic matcher to obtain the second matching result.

9. The method according to claim 8, wherein obtaining the semantic feature sequence corresponding to the target object feature sequence comprises:
decoding the target object feature sequence to obtain a key frame feature sequence;
determining association relationships between the key frame feature vectors in the key frame feature sequence;
generating at least one semantic feature vector based on the association relationships; and
obtaining the semantic feature sequence based on each semantic feature vector.

10. The method according to claim 8, wherein obtaining the second matching result comprises:
determining a semantic similarity between the global feature vector and each semantic feature vector in the semantic feature sequence; and
determining the second matching result between the search text and the target candidate video based on the semantic similarity of each semantic feature vector.

11. The method according to claim 1, wherein determining at least one target video from the at least one candidate video based on the matching weight corresponding to each candidate video comprises:
determining a candidate video whose matching weight is greater than or equal to a preset matching weight threshold as a target video; or
sorting the candidate videos according to their corresponding matching weights to obtain a candidate video list, and determining the target video from the candidate video list based on a preset number of videos.

12. The method according to claim 1, wherein the video matching model is obtained by training as follows:
obtaining training data sample pairs and matching weight labels corresponding to the training data sample pairs;
inputting the training data sample pairs into the video matching model to obtain predicted matching weights;
calculating a model loss value of the video matching model according to the matching weight labels and the predicted matching weights; and
adjusting model parameters of the video matching model according to the model loss value, and continuing to train the video matching model until a training stop condition is met.

13. A computing device, comprising:
a memory and a processor;
wherein the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions which, when executed by the processor, implement the steps of the video retrieval method according to any one of claims 1 to 12.

14. A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the video retrieval method according to any one of claims 1 to 12.

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202310961671.3 2023-08-01
CN202310961671.3A CN119441537A (en) 2023-08-01 2023-08-01 Video Retrieval Methods

Publications (1)

Publication Number Publication Date
WO2025026012A1 true WO2025026012A1 (en) 2025-02-06

Family

ID=94393264

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2024/104568 Pending WO2025026012A1 (en) 2023-08-01 2024-07-09 Video retrieval method

Country Status (2)

Country Link
CN (1) CN119441537A (en)
WO (1) WO2025026012A1 (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210224601A1 (en) * 2019-03-05 2021-07-22 Tencent Technology (Shenzhen) Company Limited Video sequence selection method, computer device, and storage medium
CN113094550A (en) * 2020-01-08 2021-07-09 百度在线网络技术(北京)有限公司 Video retrieval method, device, equipment and medium
CN112487239A (en) * 2020-11-27 2021-03-12 北京百度网讯科技有限公司 Video retrieval method, model training method, device, equipment and storage medium
CN116166843A (en) * 2023-03-02 2023-05-26 北京中科闻歌科技股份有限公司 Text video cross-modal retrieval method and device based on fine granularity perception

Also Published As

Publication number Publication date
CN119441537A (en) 2025-02-14


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24847971

Country of ref document: EP

Kind code of ref document: A1