
CN116644208B - Video retrieval method, device, electronic equipment and computer readable storage medium - Google Patents

Video retrieval method, device, electronic equipment and computer readable storage medium

Info

Publication number
CN116644208B
Authority
CN
China
Prior art keywords
video
video segment
vector
similarity
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310621588.1A
Other languages
Chinese (zh)
Other versions
CN116644208A (en)
Inventor
舒畅
陈又新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202310621588.1A priority Critical patent/CN116644208B/en
Publication of CN116644208A publication Critical patent/CN116644208A/en
Application granted granted Critical
Publication of CN116644208B publication Critical patent/CN116644208B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/732Query formulation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Library & Information Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract


The present invention relates to artificial intelligence technology and discloses a video retrieval method, comprising: segmenting each preset video file by shot to obtain a first video segment set; sequentially performing semantic segmentation on each first video segment to obtain a second video segment set of the corresponding video file; using a pre-trained CLIP+LSTM model to extract the video segment features of each second video segment, and fusing all the video segment features to obtain the video features of the corresponding video file; receiving a text to be retrieved, using the pre-trained CLIP+LSTM model to extract the text features of the text to be retrieved, sequentially calculating the feature similarity between the text features and the video features of each preset video file, and selecting the video file corresponding to the feature similarity that meets a preset similarity condition as the target video file. The present invention also proposes a video retrieval device, equipment, and medium. The present invention can improve the accuracy of medical video retrieval in the field of smart medical care.

Description

Video retrieval method, device, electronic equipment and computer readable storage medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a video retrieval method, a video retrieval device, an electronic device, and a computer readable storage medium.
Background
With the development of Internet and video technology, more and more smart medical platforms release medical science popularization videos on the Internet, pushing medical knowledge to users in an intuitive form. As the volume of released medical science popularization videos keeps increasing, how to retrieve the videos a user needs accurately and quickly has become an important concern for major medical platforms.
Video retrieval methods commonly used in the industry include the following:
First, matching the query text input by the user against the text title of the video;
Second, extracting the tags of the video, and matching the query text against the tags of the video;
Third, recognizing the text information corresponding to the video through ASR (Automatic Speech Recognition) or OCR (Optical Character Recognition) technology, and then matching the query text against the recognized video text information.
Each of the above methods is essentially matching between text (the query text) and text (video tags, video titles, video text information), i.e., matching data within the same representation space. In this way, the information carried by the images and pictures of a medical video is inevitably lost, so the accuracy of video retrieval still needs to be improved.
Disclosure of Invention
The invention provides a video retrieval method, a video retrieval device, electronic equipment and a computer readable storage medium, and mainly aims to improve the accuracy of medical video retrieval in the intelligent medical field.
In order to achieve the above object, the present invention provides a video retrieval method, including:
Performing a shot-based segmentation operation on each preset video file to obtain a first video segment set corresponding to each preset video file;
carrying out semantic segmentation on each first video segment in the first video segment set in turn to obtain a second video segment set of a corresponding video file;
sequentially extracting video segment characteristics of each second video segment in the second video segment set by using a pre-trained CLIP+LSTM model, and fusing all the video segment characteristics to obtain video characteristics of a corresponding video file;
receiving a text to be searched, extracting text characteristics of the text to be searched by using the pre-trained CLIP+LSTM model, sequentially calculating characteristic similarity between the text characteristics and video characteristics of each preset video file, and selecting a video file corresponding to the characteristic similarity meeting the preset similarity condition as a target video file.
Optionally, performing semantic segmentation on each first video segment in the first video segment set in turn to obtain the second video segment set of the corresponding video file includes:
identifying the text of each first video segment, and splitting the text of each first video segment into clauses;
performing sentence vector conversion on each clause of each first video segment to obtain a sentence vector set corresponding to the first video segment;
Calculating adjacent window similarity and skip window similarity between every two sentence vectors in the sentence vector set to obtain corresponding vector similarity, and dividing clauses corresponding to the vector similarity meeting a preset similarity threshold into one second video segment;
and collecting all the second video segments to obtain a second video segment set corresponding to the preset video file.
Optionally, the performing sentence vector conversion on each clause of each first video segment to obtain a sentence vector set corresponding to the first video segment includes:
sequentially segmenting each clause, and carrying out word vector conversion on each segmented word;
adding word vectors corresponding to each clause to obtain a word vector matrix of each clause;
and carrying out pooling operation on each word vector matrix to obtain sentence vectors corresponding to each clause.
Optionally, calculating the adjacent window similarity and the skip window similarity between every two sentence vectors in the sentence vector set to obtain the corresponding vector similarity, and dividing the clauses corresponding to vector similarities meeting the preset similarity threshold into one second video segment, includes:
Step A, taking the first sentence vector in the sentence vector set as the starting point;
Step B, calculating the adjacent window similarity between the starting point and the sentence vector adjacent to the starting point, and judging whether the adjacent window similarity is greater than a preset similarity threshold;
when the adjacent window similarity is greater than the preset similarity threshold, executing step C, taking the starting point and the sentence vector adjacent to the starting point as a temporary video segment;
Step C1, removing the sentence vectors of the temporary video segment from the sentence vector set, and judging whether the sentence vector set after removal is empty;
when the sentence vector set after removal is empty, executing C11, dividing the temporary video segment into one second video segment, and jumping to step E1;
when the sentence vector set after removal is not empty, executing C12, taking the first sentence vector in the remaining set as the new starting point, calculating the adjacent window similarity and the skip window similarity between this starting point and the vectors in the temporary video segment, taking the weighted average of the adjacent window similarity and the skip window similarity as the vector similarity, and judging whether the vector similarity is greater than the preset similarity threshold;
when the vector similarity is greater than the preset similarity threshold, executing C121, adding the starting point to the temporary video segment, and returning to step C1;
when the vector similarity is not greater than the preset similarity threshold, executing C122, dividing the temporary video segment into one second video segment, removing the vectors corresponding to the second video segment from the vector set, and jumping to step E;
when the adjacent window similarity is not greater than the preset similarity threshold, executing step D, dividing the starting point into one second video segment, and removing the starting point from the sentence vector set;
Step E, judging whether the sentence vector set after removal is empty;
when the sentence vector set after removal is empty, executing step E1, collecting the second video segments to obtain the second video segment set;
when the sentence vector set after removal is not empty, returning to step A.
Optionally, extracting video segment features of each second video segment in the second video segment set in turn by using a pre-trained CLIP+LSTM model, and fusing all the video segment features to obtain video features of a corresponding video file, including:
Extracting video frames of each second video segment in sequence according to the time sequence to obtain a video frame set of each second video segment;
Sequentially extracting frame feature vectors of each video frame in the video frame set by utilizing the CLIP part in the pre-trained CLIP+LSTM model;
Carrying out convolution operation on all frame feature vectors of each second video segment by utilizing an LSTM part in the pre-trained CLIP+LSTM model to obtain video segment features of the corresponding second video segment;
and carrying out pooling operation on all video segment characteristics corresponding to the preset video file to obtain the video characteristics of the preset video file.
Optionally, extracting the text feature of the text to be retrieved by using the pre-trained CLIP+LSTM model includes:
Word segmentation is carried out on the text to be searched to obtain one or more search word segments, and word vectors of each search word segment are obtained;
splicing word vectors of each search word by utilizing a CLIP part in the pre-trained CLIP+LSTM model to obtain a text vector matrix;
sequentially selecting a search word as a target word, and calculating a key value of the target word according to a word vector of the target word and the text vector matrix;
selecting a preset number of search terms as feature terms according to the sequence of the key values from large to small;
and splicing the word vectors of the feature words to obtain the text features of the text to be searched.
Optionally, the calculating the key value of the target word according to the word vector of the target word and the text vector matrix includes:
calculating the key value of the target word by using the following key value algorithm:

K = ||w · W^T|| / (||W|| · ||w||)

wherein K is the key value, W is the text vector matrix, ^T is the matrix transpose symbol, || · || is the modulus symbol, and w is the word vector of the target word.
In order to solve the above problems, the present invention also provides a video retrieval apparatus, the apparatus comprising:
The shot segmentation module is used for performing a shot-based segmentation operation on each preset video file to obtain a first video segment set corresponding to each preset video file;
the semantic segmentation module is used for sequentially carrying out semantic segmentation on each first video segment in the first video segment set to obtain a second video segment set of the corresponding video file;
the video feature extraction module is used for sequentially extracting video segment features of each second video segment in the second video segment set by utilizing a pre-trained CLIP+LSTM model, and fusing all the video segment features to obtain video features of corresponding video files;
And the text and video feature comparison module is used for receiving the text to be searched, extracting the text features of the text to be searched by utilizing the pre-trained CLIP+LSTM model, sequentially calculating the feature similarity between the text features and the video features of each preset video file, and selecting the video file corresponding to the feature similarity meeting the preset similarity condition as a target video file.
In order to solve the above-described problems, the present invention also provides a computer-readable storage medium having stored therein at least one computer program that is executed by a processor in an electronic device to implement the video retrieval method described above.
According to the embodiment of the invention, each preset video file is segmented first by shot and then by semantics, so that the video file is accurately refined. Video segment features are extracted from the refined second video segments, and the video features of the final video file are obtained from these segment features, so the video features embody the image characteristics of the video file. Finally, the video file whose feature similarity with the text features of the text to be retrieved satisfies the preset similarity condition is selected as the target video file. Because this retrieval is based on the video features of the video file, the retrieval accuracy for video files is improved.
Drawings
Fig. 1 is a flow chart of a video retrieval method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a detailed implementation of one of the steps in the video searching method according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating another step in the video searching method according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating another step in the video searching method according to an embodiment of the present invention;
FIG. 5 is a functional block diagram of a video retrieval device according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of an electronic device for implementing the video retrieval method according to an embodiment of the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The embodiment of the application provides a video retrieval method. The execution subject of the video retrieval method includes, but is not limited to, at least one of a server, a terminal, and the like that can be configured to execute the method provided by the embodiment of the application. In other words, the video retrieval method may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server may be an independent server, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and basic cloud computing services such as big data and artificial intelligence platforms.
Referring to fig. 1, a flowchart of a video retrieval method according to an embodiment of the invention is shown.
In this embodiment, the video retrieval method includes:
S1, performing a shot-based segmentation operation on each preset video file to obtain a first video segment set corresponding to each preset video file;
In the embodiment of the invention, an intelligent medical platform is taken as an example for explanation. The intelligent medical platform provides medical assistance and medical knowledge popularization for ordinary users by maintaining and releasing a series of medical science popularization videos. The preset video files refer to the medical videos maintained by the intelligent medical platform, for example, videos on common disease prevention knowledge, family first-aid common knowledge, public health protection, and medical hot events.
It will be appreciated that, in general, a video file is composed of one or more groups of shots, each shot expressing an independent meaning. Each preset video file is therefore segmented by shot, so that one preset video file is divided into a plurality of video segments, each containing the image information of one shot. This is convenient to operate and makes it easier to subsequently acquire the characteristics of each preset video file.
In the embodiment of the present invention, each preset video file may be segmented by using the shot segmentation tooling provided in OpenCV.
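The patent does not name a specific OpenCV API for this step; as one common illustration, shot boundaries can be detected by comparing colour-histogram differences between consecutive frames. The following is a minimal Python sketch under that assumption (the histogram size and threshold are illustrative values, not values from the patent):

    import cv2

    def shot_boundaries(video_path, threshold=0.5):
        """Return frame indices where a new shot starts, via HSV-histogram distance.

        The threshold is an illustrative assumption; the patent only says that
        OpenCV's shot segmentation tooling is used.
        """
        cap = cv2.VideoCapture(video_path)
        boundaries, prev_hist, idx = [0], None, 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
            hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
            cv2.normalize(hist, hist)
            if prev_hist is not None:
                # Bhattacharyya distance: near 0 for similar frames, near 1 at cuts.
                if cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > threshold:
                    boundaries.append(idx)
            prev_hist, idx = hist, idx + 1
        cap.release()
        return boundaries

Splitting the file at the returned indices yields the first video segment set for that preset video file.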
S2, carrying out semantic segmentation on each first video segment in the first video segment set in sequence to obtain a second video segment set of a corresponding video file;
It will be appreciated that a single shot may contain a large amount of semantic information, for example a long shot. Each first video segment can therefore be subdivided, so that the semantics of the second video segments obtained by the final cut are purer and the granularity is not too coarse, which improves the accuracy of the video features extracted for the corresponding video file based on the final cut video segments.
In detail, referring to fig. 2, the step S2 includes:
S21, identifying the text of each first video segment, and splitting the text of each first video segment into clauses;
s22, performing sentence vector conversion on each clause of each first video segment to obtain a sentence vector set corresponding to the first video segment;
s23, calculating adjacent window similarity and skip window similarity between every two sentence vectors in the sentence vector set to obtain corresponding vector similarity, and dividing a clause corresponding to the vector similarity meeting a preset similarity threshold into a second video segment;
S24, collecting all the second video segments to obtain a second video segment set corresponding to the preset video file.
In the embodiment of the invention, the ASR technology can be utilized to acquire the video segment text of each first video segment, and after the video segment text is divided into sentences, semantic segmentation is carried out on the first video segment by taking the sentence as a unit.
In another optional embodiment of the present invention, an acoustic model may be used to perform speech recognition on the speech information corresponding to the first video segment to obtain the video segment text of the first video. The acoustic model is pre-constructed by modeling each word, yielding a database that contains a plurality of words and the standard speech corresponding to each word. The user speech in the first video segment is collected at each moment of the speech information to obtain the user's speech at each moment, and probability matching is performed between this speech and the standard speech in the database to obtain the video segment text.
In an alternative embodiment of the present invention, sentence vector conversion may be performed on each of the clauses by the following method:
sequentially segmenting each clause, and carrying out word vector conversion on each segmented word;
adding word vectors corresponding to each clause to obtain a word vector matrix of each clause;
and carrying out pooling operation on each word vector matrix to obtain sentence vectors corresponding to each clause.
In the embodiment of the invention, the word segmentation processing can be performed on the clause by adopting a preset standard dictionary to obtain a plurality of segmented words, wherein the standard dictionary comprises a plurality of standard segmented words. The clause may also be segmented using a segmentation tool, such as jieba segmentation.
In the embodiment of the invention, a Word2Vec model, NLP (Natural Language Processing) models, or other models with a word-vector conversion function can be used to convert each segmented word into a word vector.
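As one concrete possibility combining the two steps above, the following sketch segments clauses with jieba and trains word vectors with gensim's Word2Vec; the example clauses and all parameters are illustrative assumptions, since the patent fixes neither:

    import jieba
    from gensim.models import Word2Vec

    # Split each clause into words with jieba, then learn word vectors on the corpus.
    clauses = ["常见疾病预防知识", "家庭医疗急救常识"]      # example clause texts
    tokenized = [jieba.lcut(c) for c in clauses]
    model = Word2Vec(sentences=tokenized, vector_size=100, min_count=1)
    word_vecs = [[model.wv[w] for w in clause] for clause in tokenized]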
In an alternative embodiment of the present invention, the word vector matrix may be pooled using a k-max pooling method, where the value of k may be predefined, for example as 35% of the number of words in the clause. Since clauses contain different numbers of words, 35% of the word count of each clause is rounded up, and the top k maximum values are taken in each pooling block; for example, if clause 1 contains only 1 word, 1 × 35% rounds up to 1, while if clause 2 contains 5 words, 5 × 35% rounds up to 2.
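A minimal sketch of this sentence-vector step, assuming the word vectors of a clause are already stacked into a matrix; averaging the top-k values per dimension is an assumption, as the patent only says "pooling":

    import math
    import numpy as np

    def sentence_vector(word_vectors, ratio=0.35):
        """k-max pool a (num_words, dim) word-vector matrix into one sentence vector.

        k = ceil(ratio * num_words), per the 35% example above.
        """
        matrix = np.asarray(word_vectors)
        k = max(1, math.ceil(ratio * matrix.shape[0]))
        top_k = np.sort(matrix, axis=0)[-k:]   # k largest values in each dimension
        return top_k.mean(axis=0)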
In detail, referring to fig. 3, calculating the adjacent window similarity and the skip window similarity between every two sentence vectors in the sentence vector set to obtain the corresponding vector similarity, and dividing the clauses corresponding to vector similarities meeting the preset similarity threshold into one second video segment, includes:
Step A, taking the first sentence vector in the sentence vector set as the starting point;
Step B, calculating the adjacent window similarity between the starting point and the sentence vector adjacent to the starting point, and judging whether the adjacent window similarity is greater than a preset similarity threshold;
when the adjacent window similarity is greater than the preset similarity threshold, executing step C, taking the starting point and the sentence vector adjacent to the starting point as a temporary video segment;
Step C1, removing the sentence vectors of the temporary video segment from the sentence vector set, and judging whether the sentence vector set after removal is empty;
when the sentence vector set after removal is empty, executing C11, dividing the temporary video segment into one second video segment, and jumping to step E1;
when the sentence vector set after removal is not empty, executing C12, taking the first sentence vector in the remaining set as the new starting point, calculating the adjacent window similarity and the skip window similarity between this starting point and the vectors in the temporary video segment, taking the weighted average of the adjacent window similarity and the skip window similarity as the vector similarity, and judging whether the vector similarity is greater than the preset similarity threshold;
when the vector similarity is greater than the preset similarity threshold, executing C121, adding the starting point to the temporary video segment, and returning to step C1;
when the vector similarity is not greater than the preset similarity threshold, executing C122, dividing the temporary video segment into one second video segment, removing the vectors corresponding to the second video segment from the vector set, and jumping to step E;
when the adjacent window similarity is not greater than the preset similarity threshold, executing step D, dividing the starting point into one second video segment, and removing the starting point from the sentence vector set;
Step E, judging whether the sentence vector set after removal is empty;
when the sentence vector set after removal is empty, executing step E1, collecting the second video segments to obtain the second video segment set;
when the sentence vector set after removal is not empty, returning to step A.
In the embodiment of the invention, the adjacent window similarity is the similarity between two adjacent sentence vectors, and the skip window similarity is the similarity between two sentence vectors separated by one position. For example, if the sentence vectors of a first video segment are S1, S2, S3, S4, and S5, then there is an adjacent window similarity between S1 and S2, a skip window similarity between S1 and S3, an adjacent window similarity between S2 and S3, a skip window similarity between S2 and S4, and a skip window similarity between S3 and S5.
In the embodiment of the invention, the adjacent window similarity or the skip window similarity between every two sentence vectors can be calculated using a pre-trained MLP (Multilayer Perceptron) model.
In the embodiment of the invention, different weights can be assigned in advance to the adjacent window similarity and the skip window similarity, and the vector similarity is finally obtained by taking the weighted average of the two.
In the embodiment of the invention, the preset similarity threshold can be set according to actual business conditions.
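The steps A to E above can be summarised in the following simplified sketch, in which cosine similarity stands in for the pre-trained MLP scorer, and the weights and threshold are illustrative assumptions:

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def semantic_segments(vectors, threshold=0.6, w_adj=0.6, w_skip=0.4):
        """Greedily group sentence vectors into second video segments (steps A-E)."""
        vectors = list(vectors)
        segments = []
        while vectors:                                   # steps A / E
            start = vectors.pop(0)
            if not vectors or cosine(start, vectors[0]) <= threshold:
                segments.append([start])                 # step D
                continue
            temp = [start, vectors.pop(0)]               # step C
            while vectors:                               # steps C1 / C12
                cand = vectors[0]
                adj = cosine(cand, temp[-1])             # adjacent window similarity
                skip = cosine(cand, temp[-2])            # skip window similarity
                if w_adj * adj + w_skip * skip > threshold:
                    temp.append(vectors.pop(0))          # step C121
                else:
                    break                                # step C122
            segments.append(temp)
        return segments

Each returned group of sentence vectors corresponds to the clauses of one second video segment.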
For example, assume a first video segment contains 4 clauses whose sentence vectors are S1, S2, S3, and S4. Performing semantic segmentation on this first video segment may yield, among others, the following division results:
First division result: three second video segments S1, S2+S3, and S4. Here the adjacent window similarity between S1 and S2 is smaller than the preset similarity threshold, so the clause of S1 forms an independent second video segment; the adjacent window similarity between S2 and S3 is greater than the threshold, but the vector similarity combining the adjacent window similarity between S3 and S4 and the skip window similarity between S2 and S4 is smaller than the threshold, so S2 and S3 form one second video segment and S4 forms another.
Second division result: three second video segments S1+S2, S3, and S4. Here the adjacent window similarity between S1 and S2 is greater than the threshold, but the vector similarity combining the adjacent window similarity between S2 and S3 and the skip window similarity between S1 and S3 is smaller than the threshold, so S1 and S2 form one second video segment; the adjacent window similarity between S3 and S4 is smaller than the threshold, so the clause of S3 and the clause of S4 each form an independent second video segment.
Third division result: two second video segments S1+S2+S3 and S4. Here the adjacent window similarity between S1 and S2 is greater than the threshold, the vector similarity combining the adjacent window similarity between S2 and S3 and the skip window similarity between S1 and S3 is greater than the threshold, and the vector similarity combining the adjacent window similarity between S3 and S4 and the skip window similarity between S2 and S4 is smaller than the threshold, so S1, S2, and S3 form one second video segment and S4 forms another.
Note that the above are only examples; there may be other division results for S1, S2, S3, and S4.
S3, extracting video segment characteristics of each second video segment in the second video segment set in sequence by utilizing a pre-trained CLIP+LSTM model, and fusing all the video segment characteristics to obtain video characteristics of a corresponding video file;
in the embodiment of the invention, the Pre-trained CLIP+LSTM model comprises a CLIP part (Contrastive Language-Image Pre-training) and an LSTM part (Long Short-Term Memory network).
In detail, extracting video segment features of each second video segment in the second video segment set sequentially by using a pre-trained CLIP+LSTM model, and fusing all the video segment features to obtain video features of a corresponding video file, includes:
Extracting video frames of each second video segment in sequence according to the time sequence to obtain a video frame set of each second video segment;
Sequentially extracting frame feature vectors of each video frame in the video frame set by utilizing the CLIP part in the pre-trained CLIP+LSTM model;
Carrying out convolution operation on all frame feature vectors of each second video segment by utilizing an LSTM part in the pre-trained CLIP+LSTM model to obtain video segment features of the corresponding second video segment;
and carrying out pooling operation on all video segment characteristics corresponding to the preset video file to obtain the video characteristics of the preset video file.
In an optional embodiment of the present invention, 4 video frames are extracted from each second video segment according to the time sequence, so as to form a video frame set of the second video segment.
According to the embodiment of the invention, the video frames of each second video segment are extracted in turn, taking the second video segment as the unit. The video segment features of the corresponding second video segment are extracted from the frame feature vectors of its video frames using the pre-trained CLIP+LSTM model, and the video features of the corresponding video file are finally obtained from the video segment features of all the second video segments. These video features embody the image and picture characteristics of the video.
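As an illustration of this pipeline, the following PyTorch sketch assumes clip_image_encoder is any frozen CLIP-style image tower returning one embedding per frame; the patent fixes neither a particular CLIP checkpoint nor the LSTM dimensions, and using the final LSTM hidden state and mean-pooling are assumptions of this sketch:

    import torch
    import torch.nn as nn

    class VideoSegmentEncoder(nn.Module):
        """Encode one second video segment: CLIP frame features -> LSTM -> segment feature."""

        def __init__(self, clip_image_encoder, feat_dim=512, hidden_dim=512):
            super().__init__()
            self.clip = clip_image_encoder        # frozen CLIP-style image tower (assumed)
            self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

        def forward(self, frames):                # frames: (num_frames, 3, H, W)
            with torch.no_grad():
                feats = self.clip(frames)         # (num_frames, feat_dim)
            _, (h_n, _) = self.lstm(feats.unsqueeze(0))
            return h_n[-1].squeeze(0)             # segment feature: (hidden_dim,)

    def video_feature(segment_features):
        """Mean-pool all segment features of one file into its video feature."""
        return torch.stack(segment_features).mean(dim=0)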
S4, receiving a text to be searched, extracting the text features of the text to be searched by using the pre-trained CLIP+LSTM model, sequentially calculating the feature similarity between the text features and the video features of each preset video file, and selecting the video file corresponding to the feature similarity that meets the preset similarity condition as the target video file.
In the embodiment of the invention, the text features of the text to be searched are extracted with the same model, namely the pre-trained CLIP+LSTM model. This operation maps the text features of the text to be searched and the video features of the video files into the same representation space, which facilitates comparison and calculation between the text features and the video features.
In detail, referring to fig. 4, extracting the text features of the text to be retrieved by using the pre-trained CLIP+LSTM model includes:
s41, word segmentation is carried out on the text to be searched to obtain one or more search word segments, and word vectors of each search word segment are obtained;
S42, splicing word vectors of each search word by utilizing the CLIP part in the pre-trained CLIP+LSTM model to obtain a text vector matrix;
s43, sequentially selecting a search word as a target word, and calculating a key value of the target word according to a word vector of the target word and the text vector matrix;
s44, selecting a preset number of search terms as characteristic terms according to the sequence of the key values from large to small;
S45, splicing the word vectors of the feature words to obtain the text features of the text to be searched.
In detail, since the text to be searched contains a large number of search words but not every search word is characteristic of the text, the search words need to be screened. The search words are selected one by one as the target word, and the key value of each target word is calculated from the word vector of the target word and the text vector matrix, so that the feature words that are representative of the text to be searched can be screened out according to the key values, yielding the text features of the text to be searched.
Specifically, the calculating the key value of the target word according to the word vector of the target word and the text vector matrix includes:
calculating the key value of the target word by using the following key value algorithm:

K = ||w · W^T|| / (||W|| · ||w||)

wherein K is the key value, W is the text vector matrix, ^T is the matrix transpose symbol, || · || is the modulus symbol, and w is the word vector of the target word.
In the embodiment of the invention, the preset number of search terms are selected from the plurality of search terms as characteristic terms according to the order of the key value of each search term from large to small.
For example, suppose the search words include search word A, search word B, and search word C, with key values of 80, 70, and 30, respectively. If the preset number is 2, search word A and search word B are selected as feature words in descending order of key value, and the word vectors of search word A and search word B are spliced to obtain the text features of the text to be searched.
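A sketch of this key-value screening, using the formula given above; the exact form of that formula is reconstructed from the variable definitions, since the published equation image is not reproduced in this text:

    import numpy as np

    def key_value(W, w):
        """K = ||w · W^T|| / (||W|| · ||w||), per the reconstructed formula above."""
        return float(np.linalg.norm(w @ W.T) / (np.linalg.norm(W) * np.linalg.norm(w) + 1e-9))

    def text_feature(word_vectors, preset_number=2):
        """Splice the word vectors of the top-N retrieval words by key value."""
        W = np.asarray(word_vectors)              # text vector matrix, (num_words, dim)
        keys = [key_value(W, w) for w in W]
        order = sorted(range(len(keys)), key=lambda i: -keys[i])[:preset_number]
        # Keep the original word order when splicing the selected word vectors.
        return np.concatenate([W[i] for i in sorted(order)])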
In the embodiment of the invention, the feature similarity can be obtained by calculating the cosine similarity between the text feature and the video feature of each video file.
Further, the cosine similarity between the text features and the video segment features of each second video segment can also be calculated, so as to match the second video segment closest to the text features, further improving the accuracy of video retrieval.
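Assuming the text feature and each video feature have been mapped to the same dimension by the CLIP+LSTM model, the final ranking step can be sketched as follows (the mapping of file names to feature vectors is an illustrative structure):

    import numpy as np

    def retrieve(text_feat, video_features, top_k=1):
        """Rank preset video files by cosine similarity to the text feature.

        video_features: mapping of file name -> video feature vector.
        """
        def cos(a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

        sims = {name: cos(text_feat, feat) for name, feat in video_features.items()}
        return sorted(sims, key=sims.get, reverse=True)[:top_k]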
According to the embodiment of the invention, each preset video file is segmented first by shot and then by semantics, so that the video file is accurately refined. Video segment features are extracted from the refined second video segments, and the video features of the final video file are obtained from these segment features, so the video features embody the image characteristics of the video file. Finally, the video file whose feature similarity with the text features of the text to be retrieved satisfies the preset similarity condition is selected as the target video file. Because this retrieval is based on the video features of the video file, the retrieval accuracy for video files is improved.
Fig. 5 is a functional block diagram of a video search device according to an embodiment of the present invention.
The video retrieval apparatus 100 of the present invention may be installed in an electronic device. According to the functions implemented, the video retrieval apparatus 100 comprises a shot segmentation module 101, a semantic segmentation module 102, a video feature extraction module 103, and a text-to-video feature comparison module 104. A module of the invention, which may also be referred to as a unit, refers to a series of computer program segments that are stored in the memory of the electronic device, can be executed by the processor of the electronic device, and perform a fixed function.
In the present embodiment, the functions concerning the respective modules/units are as follows:
The shot segmentation module 101 is configured to perform a shot-based segmentation operation on each preset video file to obtain a first video segment set corresponding to each preset video file;
The semantic segmentation module 102 is configured to perform semantic segmentation on each first video segment in the first video segment set in sequence to obtain a second video segment set of a corresponding video file;
The video feature extraction module 103 is configured to sequentially extract video segment features of each second video segment in the second video segment set by using a pre-trained CLIP+LSTM model, and fuse all the video segment features to obtain video features of a corresponding video file;
The text-to-video feature comparison module 104 is configured to receive a text to be retrieved, extract text features of the text to be retrieved by using the pre-trained CLIP+LSTM model, sequentially calculate feature similarities between the text features and video features of each preset video file, and select a video file corresponding to the feature similarity that satisfies a preset similarity condition as a target video file.
In detail, each module of the video searching apparatus 100 in the embodiment of the present invention adopts the same technical means as the video searching method described in fig. 1 to 4, and can produce the same technical effects, which are not repeated here.
Fig. 6 is a schematic structural diagram of an electronic device for implementing a video retrieval method according to an embodiment of the present invention.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as a video retrieval program, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, including flash memory, a mobile hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a magnetic memory, a magnetic disk, an optical disk, and the like. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1. The memory 11 may in other embodiments also be an external storage device of the electronic device 1, such as a plug-in mobile hard disk, a smart media card (SMC), a Secure Digital (SD) card, or a flash card provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only for storing application software installed in the electronic device 1 and various types of data, such as the code of the video retrieval program, but also for temporarily storing data that has been output or is to be output.
The processor 10 may be composed of integrated circuits in some embodiments, for example, a single packaged integrated circuit, or may be composed of multiple integrated circuits packaged with the same or different functions, including one or more central processing units (Central Processing Unit, CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The processor 10 is the control unit of the electronic device; it connects the respective parts of the entire electronic device using various interfaces and lines, and executes various functions of the electronic device 1 and processes data by running or executing the programs or modules (e.g., the video retrieval program) stored in the memory 11 and calling the data stored in the memory 11.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be classified as an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection and communication between the memory 11 and the at least one processor 10, etc.
Fig. 6 shows only an electronic device with certain components; it will be understood by those skilled in the art that the structure shown in fig. 6 does not constitute a limitation of the electronic device 1, which may comprise fewer or more components than shown, combine certain components, or arrange the components differently.
For example, although not shown, the electronic device 1 may further include a power source (such as a battery) for supplying power to each component, and preferably, the power source may be logically connected to the at least one processor 10 through a power management device, so that functions of charge management, discharge management, power consumption management, and the like are implemented through the power management device. The power supply may also include one or more of any of a direct current or alternating current power supply, recharging device, power failure detection circuit, power converter or inverter, power status indicator, etc. The electronic device 1 may further include various sensors, bluetooth modules, wi-Fi modules, etc., which will not be described herein.
Further, the electronic device 1 may also comprise a network interface, optionally the network interface may comprise a wired interface and/or a wireless interface (e.g. WI-FI interface, bluetooth interface, etc.), typically used for establishing a communication connection between the electronic device 1 and other electronic devices.
The electronic device 1 may optionally further comprise a user interface, which may be a display, an input unit such as a keyboard, a standard wired interface, or a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display may also be referred to as a display screen or display unit, as appropriate, for displaying information processed in the electronic device 1 and for displaying a visual user interface.
It should be understood that the embodiments described are for illustrative purposes only, and the scope of the patent application is not limited to this configuration.
The video retrieval program stored in the memory 11 in the electronic device 1 is a combination of instructions that, when executed in the processor 10, may implement:
Performing a shot-based segmentation operation on each preset video file to obtain a first video segment set corresponding to each preset video file;
carrying out semantic segmentation on each first video segment in the first video segment set in turn to obtain a second video segment set of a corresponding video file;
sequentially extracting video segment characteristics of each second video segment in the second video segment set by using a pre-trained CLIP+LSTM model, and fusing all the video segment characteristics to obtain video characteristics of a corresponding video file;
receiving a text to be searched, extracting text characteristics of the text to be searched by using the pre-trained CLIP+LSTM model, sequentially calculating characteristic similarity between the text characteristics and video characteristics of each preset video file, and selecting a video file corresponding to the characteristic similarity meeting the preset similarity condition as a target video file.
Further, the modules/units integrated in the electronic device 1 may be stored in a computer-readable storage medium if implemented in the form of software functional units and sold or used as separate products. The computer-readable storage medium may be volatile or nonvolatile. For example, the computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a read-only memory (ROM).
The present invention also provides a computer readable storage medium storing a computer program which, when executed by a processor of an electronic device, can implement:
Performing a shot-based segmentation operation on each preset video file to obtain a first video segment set corresponding to each preset video file;
carrying out semantic segmentation on each first video segment in the first video segment set in turn to obtain a second video segment set of a corresponding video file;
sequentially extracting video segment characteristics of each second video segment in the second video segment set by using a pre-trained CLIP+LSTM model, and fusing all the video segment characteristics to obtain video characteristics of a corresponding video file;
receiving a text to be searched, extracting text characteristics of the text to be searched by using the pre-trained CLIP+LSTM model, sequentially calculating characteristic similarity between the text characteristics and video characteristics of each preset video file, and selecting a video file corresponding to the characteristic similarity meeting the preset similarity condition as a target video file.
In addition, each functional module in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units can be realized in a form of hardware or a form of hardware and a form of software functional modules.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The blockchain is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. The blockchain, essentially a decentralized database, is a string of data blocks generated in association using cryptographic methods; each data block contains information from a batch of network transactions, used for verifying the validity (anti-counterfeiting) of its information and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
The embodiments of the present application may acquire and process the relevant data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technique, and application system that uses a digital computer or a digital-computer-controlled machine to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.
Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means through software or hardware. Terms such as "first" and "second" are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims (9)

1.一种视频检索方法,其特征在于,所述方法包括:1. A video retrieval method, characterized in that the method comprises: 通过镜头对每个预设的视频文件执行分段操作,得到每个所述预设的视频文件对应的第一视频段集合;Performing a segmentation operation on each preset video file through a lens to obtain a first video segment set corresponding to each preset video file; 依次对所述第一视频段集合中的每个第一视频段进行语义分割,得到对应视频文件的第二视频段集合;Sequentially performing semantic segmentation on each first video segment in the first video segment set to obtain a second video segment set corresponding to the video file; 利用预先训练好的CLIP+LSTM模型,依次提取所述第二视频段集合中的每个第二视频段的视频段特征,融合所有所述视频段特征得到对应视频文件的视频特征;Using a pre-trained CLIP+LSTM model, sequentially extracting video segment features of each second video segment in the second video segment set, and fusing all the video segment features to obtain video features of the corresponding video file; 接收待检索文本,利用所述预先训练好的CLIP+LSTM模型,提取所述待检索文本的文本特征,依次计算所述文本特征与每个所述预设的视频文件的视频特征之间的特征相似度,选择满足预设相似度条件的特征相似度对应视频文件作为目标视频文件;Receiving a text to be retrieved, extracting text features of the text to be retrieved using the pre-trained CLIP+LSTM model, sequentially calculating feature similarities between the text features and video features of each of the preset video files, and selecting a video file corresponding to the feature similarity that meets a preset similarity condition as a target video file; 其中,所述依次对所述第一视频段集合中的每个第一视频段进行语义分割,得到对应视频文件的第二视频段集合,包括:识别每个所述第一视频段的文本,并对每个所述第一视频段的文本进行分句;对每个所述第一视频段的每个所述分句进行句向量转换,得到对应第一视频段的句向量集合;计算所述句向量集合中的每两个句向量之间的邻窗相似度和跳窗相似度,得到对应的向量相似度,将满足预设相似度阈值的向量相似度对应的分句划分为一个所述第二视频段;汇集所有所述第二视频段,得到对应预设视频文件的第二视频段集合。Among them, the semantic segmentation is performed on each first video segment in the first video segment set in turn to obtain a second video segment set corresponding to the video file, including: identifying the text of each first video segment and dividing the text of each first video segment into sentences; performing sentence vector conversion on each sentence of each first video segment to obtain a sentence vector set corresponding to the first video segment; calculating the adjacent window similarity and the jump window similarity between every two sentence vectors in the sentence vector set to obtain the corresponding vector similarity, and dividing the sentences corresponding to the vector similarity that meets the preset similarity threshold into a second video segment; and collecting all the second video segments to obtain a second video segment set corresponding to the preset video file. 2.如权利要求1所述的视频检索方法,其特征在于,所述对每个所述第一视频段的每个所述分句进行句向量转换,得到对应第一视频段的句向量集合,包括:2. The video retrieval method according to claim 1, wherein the step of converting each sentence of each first video segment into a sentence vector to obtain a sentence vector set corresponding to the first video segment comprises: 依次对每个所述分句进行分词,并对每个所述分词进行词向量转换;Segment each of the sentences in turn, and convert each of the segmented words into a word vector; 将每个所述分句对应的词向量相加,得到每个所述分句的词向量矩阵;Add the word vectors corresponding to each of the clauses to obtain a word vector matrix for each of the clauses; 对每个所述词向量矩阵进行池化操作,得到每个所述分句对应的句向量。A pooling operation is performed on each of the word vector matrices to obtain a sentence vector corresponding to each of the clauses. 3.如权利要求1所述的视频检索方法,其特征在于,所述计算所述句向量集合中的每两个句向量之间的邻窗相似度和跳窗相似度,得到对应的向量相似度,将满足预设相似度阈值的向量相似度对应的分句划分为一个所述第二视频段,包括:3. 
3. The video retrieval method according to claim 1, wherein calculating the adjacent-window similarity and the skip-window similarity between every two sentence vectors in the sentence vector set to obtain the corresponding vector similarity, and grouping the sentences whose vector similarity meets the preset similarity threshold into one second video segment, comprises:
step A: taking the first sentence vector in the sentence vector set as the starting point;
step B: calculating the adjacent-window similarity between the starting point and the sentence vector adjacent to the starting point, and determining whether the adjacent-window similarity is greater than a preset similarity threshold;
when the adjacent-window similarity is greater than the preset similarity threshold, executing step C: taking the starting point and the sentence vector adjacent to the starting point as a temporary video segment;
step C1: removing the sentence vectors of the temporary video segment from the sentence vector set, and determining whether the sentence vector set after removal is empty;
when the sentence vector set after removal is empty, executing step C11: taking the temporary video segment as one second video segment, and jumping to step E1;
when the sentence vector set after removal is not empty, executing step C12: taking the first sentence vector in the sentence vector set as the starting point, calculating the adjacent-window similarity and the skip-window similarity between the starting point and the vectors in the temporary video segment, computing the weighted average of the adjacent-window similarity and the skip-window similarity to obtain the vector similarity, and determining whether the vector similarity is greater than the preset similarity threshold;
when the vector similarity is greater than the preset similarity threshold, executing step C121: adding the starting point to the temporary video segment, and returning to step C1;
when the vector similarity is not greater than the preset similarity threshold, executing step C122: taking the temporary video segment as one second video segment, removing the vectors corresponding to the second video segment from the sentence vector set, and jumping to step E;
when the adjacent-window similarity is not greater than the preset similarity threshold, executing step D: taking the starting point as one second video segment, and removing the starting point from the sentence vector set;
step E: determining whether the sentence vector set after removal is empty;
when the sentence vector set after removal is empty, executing step E1: collecting the second video segments to obtain the second video segment set;
when the sentence vector set after removal is not empty, returning to step A.
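Claim 3 describes a greedy pass over the sentence vectors: a temporary segment grows while the next sentence stays similar enough to it, and is closed into a second video segment otherwise. The Python sketch below follows that reading; the use of cosine similarity, the equal weighting of the adjacent-window and skip-window scores, and comparing the candidate against the last two vectors of the temporary segment are assumptions the claim leaves open.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def semantic_segments(sent_vecs, threshold=0.6, w_adj=0.5, w_skip=0.5):
    """Greedy reading of claim 3; threshold and weights are assumed values.

    Returns groups of sentence indices, each group one second video segment.
    """
    segments, i, n = [], 0, len(sent_vecs)
    while i < n:
        if i + 1 == n:                        # lone trailing sentence
            segments.append([i])
            break
        # Step B: adjacent-window similarity between the start and its neighbour.
        if cosine(sent_vecs[i], sent_vecs[i + 1]) <= threshold:
            segments.append([i])              # Step D: start becomes its own segment
            i += 1
            continue
        temp = [i, i + 1]                     # Step C: temporary video segment
        j = i + 2
        while j < n:
            # Step C12: adjacent-window score against the last vector of the
            # temporary segment, skip-window score against the one before it.
            adj = cosine(sent_vecs[j], sent_vecs[temp[-1]])
            skip = cosine(sent_vecs[j], sent_vecs[temp[-2]])
            sim = w_adj * adj + w_skip * skip  # weighted average -> vector similarity
            if sim > threshold:
                temp.append(j)                # Step C121: absorb and keep growing
                j += 1
            else:
                break                         # Step C122: close the segment here
        segments.append(temp)
        i = j
    return segments
```

Mapping each index group back to its sentence timestamps then gives the boundaries of the second video segments.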
4. The video retrieval method according to claim 1, wherein using the pre-trained CLIP+LSTM model to sequentially extract the video segment features of each second video segment in the second video segment set, and fusing all the video segment features to obtain the video features of the corresponding video file, comprises:
extracting video frames from each second video segment in chronological order to obtain a video frame set for each second video segment;
using the CLIP part of the pre-trained CLIP+LSTM model to sequentially extract a frame feature vector for each video frame in the video frame set;
using the LSTM part of the pre-trained CLIP+LSTM model to perform a convolution operation on all the frame feature vectors of each second video segment to obtain the video segment feature of the corresponding second video segment; and
performing a pooling operation on all the video segment features corresponding to the preset video file to obtain the video features of the preset video file.

5. The video retrieval method according to claim 1, wherein extracting the text features of the text to be retrieved using the pre-trained CLIP+LSTM model comprises:
segmenting the text to be retrieved into words to obtain one or more retrieval words, and obtaining a word vector for each retrieval word;
using the CLIP part of the pre-trained CLIP+LSTM model to concatenate the word vectors of the retrieval words to obtain a text vector matrix;
sequentially selecting one retrieval word as the target word, and calculating the key value of the target word from the word vector of the target word and the text vector matrix;
selecting a preset number of retrieval words as feature words in descending order of key value; and
concatenating the word vectors of the feature words to obtain the text features of the text to be retrieved.

6. The video retrieval method according to claim 5, wherein calculating the key value of the target word from the word vector of the target word and the text vector matrix comprises:
calculating the key value of the target word with a key-value algorithm in which k is the key value, A is the text vector matrix, the superscript T denotes matrix transposition, ‖·‖ denotes taking the norm, and v is the word vector of the target word.
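The key-value formula of claim 6 is published as an image, so its exact form is not recoverable from this text; what survives is its ingredients: the text vector matrix A, matrix transposition, a norm, and the target word vector v. The sketch below implements one natural reading, k = ‖vAᵀ‖ / (‖v‖·‖A‖), and wires it into the feature-word selection of claim 5; treat that normalization, and the top-k count, as assumptions.

```python
import numpy as np

def key_value(A, v):
    """One plausible reading of claim 6's key-value algorithm (assumed form).

    A: text vector matrix, shape (num_words, dim), one row per retrieval word
    v: word vector of the target word, shape (dim,)
    """
    # k = ||v A^T|| / (||v|| * ||A||); the published formula is an image,
    # so this normalization is an editorial assumption.
    return float(np.linalg.norm(A @ v) /
                 (np.linalg.norm(v) * np.linalg.norm(A) + 1e-12))

def text_feature(words, vectors, top_k=5):
    """Claim 5: rank retrieval words by key value, keep the top_k feature
    words, and concatenate their word vectors into the text feature."""
    A = np.stack(vectors)
    scores = [key_value(A, v) for v in vectors]
    order = np.argsort(scores)[::-1][:top_k]        # descending key value
    feature_words = [words[i] for i in order]
    feature_vec = np.concatenate([vectors[i] for i in order])
    return feature_words, feature_vec
```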
7. A video retrieval device for implementing the video retrieval method according to any one of claims 1 to 6, characterized in that the device comprises:
a shot segmentation module, configured to perform a shot-based segmentation operation on each preset video file to obtain the first video segment set corresponding to each preset video file;
a semantic segmentation module, configured to sequentially perform semantic segmentation on each first video segment in the first video segment set to obtain the second video segment set of the corresponding video file;
a video feature extraction module, configured to use the pre-trained CLIP+LSTM model to sequentially extract the video segment features of each second video segment in the second video segment set, and fuse all the video segment features to obtain the video features of the corresponding video file; and
a text-video feature comparison module, configured to receive the text to be retrieved, extract the text features of the text to be retrieved using the pre-trained CLIP+LSTM model, sequentially calculate the feature similarity between the text features and the video features of each preset video file, and select the video file whose feature similarity meets the preset similarity condition as the target video file.

8. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, the computer program being executed by the at least one processor to enable the at least one processor to perform the video retrieval method according to any one of claims 1 to 6.

9. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, the video retrieval method according to any one of claims 1 to 6 is implemented.
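At retrieval time, the comparison module of claim 7 reduces to ranking the stored video features by their similarity to the query's text feature. A minimal sketch, assuming both features live in a shared space of equal dimension (as a CLIP-style model provides), that the feature similarity is cosine similarity, and that the "preset similarity condition" is a floor plus best-match selection:

```python
import numpy as np

def retrieve(text_feature, video_features, min_sim=0.3):
    """Select the target video file for a query (sketch of claim 7's
    text-video feature comparison module; min_sim is an assumed condition).

    text_feature:   (dim,) feature vector of the text to be retrieved
    video_features: dict video_file_id -> (dim,) video feature
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    best_id, best_sim = None, min_sim
    for vid, feat in video_features.items():
        sim = cos(text_feature, feat)
        if sim > best_sim:                 # keep the best match above the floor
            best_id, best_sim = vid, sim
    return best_id                          # None if nothing meets the condition
```

For example, `retrieve(q, {'v1': f1, 'v2': f2})` returns the id of whichever stored video feature is most similar to the query feature q, or None when neither clears the assumed floor.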
CN202310621588.1A 2023-05-30 2023-05-30 Video retrieval method, device, electronic equipment and computer readable storage medium Active CN116644208B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310621588.1A CN116644208B (en) 2023-05-30 2023-05-30 Video retrieval method, device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310621588.1A CN116644208B (en) 2023-05-30 2023-05-30 Video retrieval method, device, electronic equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN116644208A CN116644208A (en) 2023-08-25
CN116644208B 2025-10-17

Family

ID=87643017

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310621588.1A Active CN116644208B (en) 2023-05-30 2023-05-30 Video retrieval method, device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN116644208B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117729391B (en) * 2023-09-27 2025-03-25 书行科技(北京)有限公司 A video segmentation method, device, computer equipment, medium and product

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114332679A (en) * 2021-12-07 2022-04-12 腾讯科技(深圳)有限公司 Video processing method, device, equipment, storage medium and computer program product
CN115357754A (en) * 2022-07-11 2022-11-18 武汉理工大学 Deep learning-based large-scale short video retrieval method, system and equipment

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190042584A1 (en) * 2016-02-09 2019-02-07 Carrier Corporation Performing multiple queries within a robust video search and retrieval mechanism
US11880408B2 (en) * 2020-09-10 2024-01-23 Adobe Inc. Interacting with hierarchical clusters of video segments using a metadata search
CN113052149B (en) * 2021-05-20 2021-08-13 平安科技(深圳)有限公司 Video abstract generation method and device, computer equipment and medium

Also Published As

Publication number Publication date
CN116644208A (en) 2023-08-25

Similar Documents

Publication Publication Date Title
CN113157927B (en) Text classification method, device, electronic device and readable storage medium
CN111062871B (en) Image processing method and device, computer equipment and readable storage medium
CN115114408B (en) Multimodal sentiment classification method, device, equipment and storage medium
CN115221276B (en) Chinese image-text retrieval model training method, device, equipment and medium based on CLIP
WO2021139191A1 (en) Method for data labeling and apparatus for data labeling
CN113157739B (en) Cross-modal retrieval methods, devices, electronic equipment and storage media
CN115599953B (en) Training method, retrieval method and related equipment of video text retrieval model
CN115146792A (en) Multi-task learning model training method, device, electronic device and storage medium
CN112347739B (en) Applicable rule analysis method, device, electronic device and storage medium
CN114677526A (en) Image classification method, device, equipment and medium
CN116719904A (en) Information query methods, devices, equipment and storage media based on the combination of images and text
CN116450829A (en) Medical text classification method, device, equipment and medium
TWI749441B Retrieval method and apparatus, and storage medium thereof
CN114840684A (en) Map construction method, device and equipment based on medical entity and storage medium
CN116644208B (en) Video retrieval method, device, electronic equipment and computer readable storage medium
CN115344772B (en) Web page text extraction method, device, equipment and storage medium
CN114723523B (en) Product recommendation method, device, equipment and medium based on user capability image
CN116737996A (en) Multi-modal video retrieval method, device, equipment and media based on multi-encoder
CN113887198B (en) Project splitting method, device, equipment and storage medium based on topic prediction
CN115205758A (en) Intelligent conversion method and device based on video and text, electronic equipment and medium
CN116701680B (en) Intelligent matching methods, devices, and equipment based on text and images
CN116737842B (en) Entity relationship display method and device, electronic equipment and computer storage medium
CN116306656B (en) Entity relation extraction method, device, equipment and storage medium
CN115409041B (en) Unstructured data extraction method, device, equipment and storage medium
CN116597362A (en) Method, device, electronic equipment and medium for identifying hotspot video segments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant