CN116644208B

CN116644208B - Video retrieval method, device, electronic equipment and computer readable storage medium

Info

Publication number: CN116644208B
Application number: CN202310621588.1A
Authority: CN
Inventors: 舒畅; 陈又新
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2023-05-30
Filing date: 2023-05-30
Publication date: 2025-10-17
Anticipated expiration: 2043-05-30
Also published as: CN116644208A

Abstract

The present invention relates to artificial intelligence technology and discloses a video retrieval method, comprising: dividing each preset video file by lens to obtain a first video segment set, sequentially performing semantic segmentation on each first video segment to obtain a second video segment set corresponding to the video file, using a pre-trained CLIP+LSTM model to extract the video segment features of each second video segment, fusing all the video segment features to obtain the video features of the corresponding video file, receiving a text to be retrieved, using the pre-trained CLIP+LSTM model to extract the text features of the text to be retrieved, sequentially calculating the feature similarity between the text features and the video features of each preset video file, and selecting the video file corresponding to the feature similarity that meets the preset similarity condition as the target video file. The present invention also proposes a video retrieval device, equipment, and medium. The present invention can improve the accuracy of medical video retrieval in the field of smart medical care.

Description

Video retrieval method, device, electronic equipment and computer readable storage medium

Technical Field

The present invention relates to the field of artificial intelligence technologies, and in particular, to a video retrieval method, a video retrieval device, an electronic device, and a computer readable storage medium.

Background

With the development of Internet and video technology, more and more intelligent medical platforms release medical science popularization videos on the Internet, so that medical science popularization knowledge is intuitively pushed to users. Because of the increasing release amount of medical science popularization videos, how to accurately and rapidly search videos required by users is an important problem of concern of large medical platforms.

The video retrieval method commonly used in the industry comprises the following steps:

firstly, matching a query text input by a user with a text title of a video;

Secondly, extracting the label of the video, and matching the query text with the label of the video;

Third, through ASR (Automatic Speech Recognition ) or OCR (Optical Character Recognition, optical character recognition) technology, the text information corresponding to the video is recognized, and then the query text is matched with the recognized video text information.

The above method is essentially matching between text (query text) and text (video tag, video title, video text information), i.e. the data in the same characterization space is matched, and in this way, the information of medical video images and pictures is always lost, so the accuracy of the video retrieval is still to be improved.

Disclosure of Invention

The invention provides a video retrieval method, a video retrieval device, electronic equipment and a computer readable storage medium, and mainly aims to improve the accuracy of medical video retrieval in the intelligent medical field.

In order to achieve the above object, the present invention provides a video retrieval method, including:

Performing segmentation operation on each preset video file through a lens to obtain a first video segment set corresponding to each preset video file;

carrying out semantic segmentation on each first video segment in the first video segment set in turn to obtain a second video segment set of a corresponding video file;

sequentially extracting video segment characteristics of each second video segment in the second video segment set by using a pre-trained CLIP+LSTM model, and fusing all the video segment characteristics to obtain video characteristics of a corresponding video file;

receiving a text to be searched, extracting text characteristics of the text to be searched by using the pre-trained CLIP+LSTM model, sequentially calculating characteristic similarity between the text characteristics and video characteristics of each preset video file, and selecting a video file corresponding to the characteristic similarity meeting the preset similarity condition as a target video file.

Optionally, the semantic segmentation is performed on each first video segment in the first video segment set in turn to obtain a second video segment set of the corresponding video file, which includes:

Identifying the text of each first video segment, and carrying out clause on the text of each first video segment;

performing sentence vector conversion on each clause of each first video segment to obtain a sentence vector set corresponding to the first video segment;

Calculating adjacent window similarity and skip window similarity between every two sentence vectors in the sentence vector set to obtain corresponding vector similarity, and dividing clauses corresponding to the vector similarity meeting a preset similarity threshold into one second video segment;

and collecting all the second video segments to obtain a second video segment set corresponding to the preset video file.

Optionally, the performing sentence vector conversion on each clause of each first video segment to obtain a sentence vector set corresponding to the first video segment includes:

sequentially segmenting each clause, and carrying out word vector conversion on each segmented word;

adding word vectors corresponding to each clause to obtain a word vector matrix of each clause;

and carrying out pooling operation on each word vector matrix to obtain sentence vectors corresponding to each clause.

Optionally, the calculating the adjacent window similarity and the skip window similarity between every two sentence vectors in the sentence vector set to obtain corresponding vector similarity, and dividing the clause corresponding to the vector similarity meeting the preset similarity threshold into the second video segment includes:

Step A, taking the first sentence vector in the sentence vector set as a starting point;

Step B, calculating the similarity of adjacent windows between the starting point and sentence vectors adjacent to the starting point, and judging whether the similarity of the adjacent windows is larger than a preset similarity threshold;

c, when the similarity of the adjacent windows is larger than the preset semantic similarity threshold, executing the step C, and taking the starting point and the sentence vector adjacent to the starting point as a temporary video segment;

Step C1, eliminating sentence vectors in the temporary video segment from the sentence vector set, and judging whether the sentence vector set after eliminating the vectors is empty or not;

when the sentence vector set after eliminating the vectors is empty, executing C11, dividing the temporary video segment into a second video segment, and jumping to the step E1;

When the sentence vector set after eliminating the vectors is not empty, executing C12, taking the first sentence vector in the sentence vector set as a starting point, calculating adjacent window similarity and skip window similarity between the starting point and vectors in the temporary video segment, weighting and averaging the adjacent window similarity and the skip window similarity to obtain vector similarity, and judging whether the vector similarity is larger than a preset similarity threshold;

when the vector similarity is greater than the preset similarity threshold, executing C121, adding the starting point into the temporary video segment, and returning to the step C1;

When the vector similarity is not greater than the preset similarity threshold, executing C122, dividing the temporary video segment into a second video segment, eliminating vectors corresponding to the second video segment from the vector set, and skipping to the step E;

When the similarity of the adjacent windows is not greater than the preset similarity threshold, executing the step D, dividing the starting point into a second video segment, and eliminating the starting point from the sentence vector set;

e, judging whether the sentence vector set after eliminating the vectors is empty or not;

E1, when the sentence vector set after eliminating the vectors is empty, executing the step E1, and collecting the second video segments to obtain a second video segment set;

And (C) returning to the step A when the sentence vector set after eliminating the vectors is not empty.

Optionally, extracting video segment features of each second video segment in the second video segment set in turn by using a pre-trained clip+lstm model, and fusing all the video segment features to obtain video features of a corresponding video file, including:

Extracting video frames of each second video segment in sequence according to the time sequence to obtain a video frame set of each second video segment;

Sequentially extracting frame feature vectors of each video frame in the video frame set by utilizing the CLIP part in the pre-trained clip+LSTM model;

Carrying out convolution operation on all frame feature vectors of each second video segment by utilizing an LSTM part in the pre-trained CLIP+LSTM model to obtain video segment features of the corresponding second video segment;

and carrying out pooling operation on all video segment characteristics corresponding to the preset video file to obtain the video characteristics of the preset video file.

Optionally, extracting the text feature of the text to be retrieved by using the pre-trained clip+lstm model includes:

Word segmentation is carried out on the text to be searched to obtain one or more search word segments, and word vectors of each search word segment are obtained;

splicing word vectors of each search word by utilizing a CLIP part in the pre-trained CLIP+LSTM model to obtain a text vector matrix;

sequentially selecting a search word as a target word, and calculating a key value of the target word according to a word vector of the target word and the text vector matrix;

selecting a preset number of search terms as feature terms according to the sequence of the key values from large to small;

And splicing word vectors of the feature word segmentation to obtain text features of the text to be searched.

Optionally, the calculating the key value of the target word according to the word vector of the target word and the text vector matrix includes:

calculating the key value of the target word by using the following key value algorithm:

Wherein K is the key value, W is the text vector matrix, T is a matrix transpose symbol, W is a modulo symbol, Word vectors that segment the target word.

In order to solve the above problems, the present invention also provides a video retrieval apparatus, the apparatus comprising:

the system comprises a shot segmentation module, a video segmentation module and a video segmentation module, wherein the shot segmentation module is used for executing segmentation operation on each preset video file through a shot to obtain a first video segment set corresponding to each preset video file;

the semantic segmentation module is used for sequentially carrying out semantic segmentation on each first video segment in the first video segment set to obtain a second video segment set of the corresponding video file;

the video feature extraction module is used for sequentially extracting video segment features of each second video segment in the second video segment set by utilizing a pre-trained CLIP+LSTM model, and fusing all the video segment features to obtain video features of corresponding video files;

And the text and video feature comparison module is used for receiving the text to be searched, extracting the text features of the text to be searched by utilizing the pre-trained CLIP+LSTM model, sequentially calculating the feature similarity between the text features and the video features of each preset video file, and selecting the video file corresponding to the feature similarity meeting the preset similarity condition as a target video file.

In order to solve the above-described problems, the present invention also provides a computer-readable storage medium having stored therein at least one computer program that is executed by a processor in an electronic device to implement the video retrieval method described above.

According to the embodiment of the invention, the preset video file is segmented according to the lens and segmented according to the semantics, so that the video file is accurately refined, the video segment characteristics of the refined second video segment are extracted, the video characteristics of the final video file are obtained based on the video segment characteristics, the video characteristics embody the image characteristics of the video file, finally, the video file meeting the preset similarity condition is selected as the target video file in a mode of calculating the characteristic similarity between the text characteristics of the text to be searched and the video characteristics of the video file, and the searching mode is based on the video characteristics of the video file, so that the searching accuracy of the video file is improved.

Drawings

Fig. 1 is a flow chart of a video retrieval method according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a detailed implementation of one of the steps in the video searching method according to an embodiment of the present invention;

FIG. 3 is a flowchart illustrating another step in the video searching method according to an embodiment of the present invention;

FIG. 4 is a flowchart illustrating another step in the video searching method according to an embodiment of the present invention;

FIG. 5 is a functional block diagram of a video retrieval device according to an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of an electronic device for implementing the video retrieval method according to an embodiment of the present invention.

The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.

Detailed Description

It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

The embodiment of the application provides a video retrieval method. The execution subject of the video retrieval method includes, but is not limited to, at least one of a server, a terminal, and the like, which can be configured to execute the method provided by the embodiment of the application. In other words, the video retrieval method may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server may be an independent server, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (ContentDelivery Network, CDN), and basic cloud computing services such as big data and artificial intelligence platforms.

Referring to fig. 1, a flowchart of a video retrieval method according to an embodiment of the invention is shown.

In this embodiment, the video retrieval method includes:

s1, executing segmentation operation on each preset video file through a lens to obtain a first video segment set corresponding to each preset video file;

in the embodiment of the invention, an intelligent medical platform is taken as an example for explanation. The intelligent medical platform is used for providing medical assistance and medical knowledge science popularization for common users by maintaining and releasing a series of medical science popularization videos. The preset video file refers to a medical video maintained by the intelligent medical platform, for example, a common disease prevention knowledge video, a family medical aid common knowledge video, a public health protection video, a medical hot event video and the like.

It will be appreciated that, in general, a video text is composed of one or more groups of shots, each shot representing an independent meaning, each of the preset video files is segmented according to shots, and one of the preset video files is segmented into a plurality of video segments, each of the video segments containing image information of one of the shots. Thus, the operation is convenient, and the characteristics of each preset video file can be better acquired later.

In the embodiment of the present invention, each preset video file may be segmented by using the shot segmentation tool disclosed in OPENCV.

S2, carrying out semantic segmentation on each first video segment in the first video segment set in sequence to obtain a second video segment set of a corresponding video file;

it will be appreciated that if a shot contains a lot of semantic information, e.g. a long shot. Each first video segment can be subdivided, so that the semantics of the second video segment obtained through final cutting are purer, granularity is not overlarge, and the accuracy of extracting video features of a corresponding video file based on the final cut video segment is improved.

In detail, referring to fig. 2, the step S2 includes:

s21, identifying the text of each first video segment, and carrying out clause on the text of each first video segment;

s22, performing sentence vector conversion on each clause of each first video segment to obtain a sentence vector set corresponding to the first video segment;

s23, calculating adjacent window similarity and skip window similarity between every two sentence vectors in the sentence vector set to obtain corresponding vector similarity, and dividing a clause corresponding to the vector similarity meeting a preset similarity threshold into a second video segment;

S24, collecting all the second video segments to obtain a second video segment set corresponding to the preset video file.

In the embodiment of the invention, the ASR technology can be utilized to acquire the video segment text of each first video segment, and after the video segment text is divided into sentences, semantic segmentation is carried out on the first video segment by taking the sentence as a unit.

In another optional embodiment of the present invention, an acoustic model may be used to perform speech recognition based on speech information corresponding to the first video segment to obtain a video segment text corresponding to the first video, where the acoustic model performs speech recognition on speech information by modeling each word to establish a database including a plurality of words and standard speech corresponding to each word, and performs probability matching on the speech information to obtain the video segment text by collecting user speech in the first video segment at each moment in the speech information to obtain the speech of the user at each moment, where the speech is pre-constructed and includes a plurality of words and words in a database of standard speech corresponding to each word.

In an alternative embodiment of the present invention, sentence vector conversion may be performed on each of the clauses by the following method:

In the embodiment of the invention, the word segmentation processing can be performed on the clause by adopting a preset standard dictionary to obtain a plurality of segmented words, wherein the standard dictionary comprises a plurality of standard segmented words. The clause may also be segmented using a segmentation tool, such as jieba segmentation.

In the embodiment of the invention, word2vec model, NLP (Natural Language Processing ) model and other models with word vector conversion function can be adopted to carry out word vector conversion on each word.

In an alternative embodiment of the present invention, the word vector matrix may be pooled by using a k-max pooling method, where the k value may be predefined, for example, the k value may be 35%. Since the number of the words included in each clause is not equal, it is preferable to round up 35% of the words in each clause, and take the top K maximum values in each pooling block, for example, if clause 1 includes only 1 word, 1×35% rounds up to 1, and clause 2 includes 5 words, and 5×35% rounds up to 2.

In detail, referring to fig. 3, the calculating the adjacent window similarity and the skip window similarity between every two sentence vectors in the sentence vector set to obtain the corresponding vector similarity divides the clause corresponding to the vector similarity meeting the preset similarity threshold into one second video segment includes:

In the embodiment of the invention, the adjacent window similarity and the skip window similarity are the similarity between every two sentence vectors, namely the adjacent window similarity and the skip window similarity. For example, if the sentence vectors of a first video segment include S1, S2, S3, S4, and S5, respectively, then there is a neighboring window similarity between S1 and S2, a skip window similarity between S1 and S3, a neighboring window similarity between S2 and S3, a skip window similarity between S2 and S4, and a skip window similarity between S3 and S5.

In the embodiment of the invention, the adjacent window similarity or the jump window similarity between every two sentence vectors can be calculated by utilizing a pre-trained MLP (Multilayer Perceptron, multi-layer perceptron) model.

In the embodiment of the invention, different weights can be allocated to the adjacent window similarity and the jump window similarity in advance, and finally the vector similarity is obtained by carrying out weighted averaging operation on the adjacent window similarity and the jump window similarity.

In the embodiment of the invention, the preset similarity threshold can be set according to the actual service condition.

For example, assuming that a first video segment includes 4 clauses, the corresponding sentence vectors are S1, S2, S3, and S4, and the semantic segmentation is performed on the first video segment, the following division results may be obtained:

The first division result comprises three second video segments S1, S2+ S3 and S4, wherein the adjacent window similarity between S1 and S2 is smaller than the preset similarity threshold, so that the clause where S1 is positioned is divided into an independent second video segment, the adjacent window similarity between S2 and S3 is larger than the preset similarity threshold, and the vector similarity corresponding to the adjacent window similarity between S2 and S3 and the jump window similarity between S2 and S4 is smaller than the preset similarity threshold, so that S2 and S3 are divided into an independent second video segment, and S4 is divided into an independent video segment;

The second division result comprises three second video segments S1+S2, S3 and S4, wherein the adjacent window similarity between S1 and S2 is larger than the preset similarity threshold, the vector similarity corresponding to the adjacent window similarity between S2 and S3 and the jump window similarity between S1 and S3 is smaller than the preset similarity threshold, so that S1 and S2 are divided into independent second video segments, the adjacent window similarity between S3 and S4 is smaller than the preset similarity threshold, so that the clause where S3 is located is divided into independent second video segments, and the clause corresponding to S4 is divided into independent second video segments;

The third division result comprises that two video segments S1+S2+S3 and S4 are provided, the adjacent window similarity between S1 and S2 is larger than the preset similarity threshold, the vector similarity corresponding to the adjacent window similarity between S2 and S3 and the jump window similarity between S1 and S3 is larger than the preset similarity threshold, the vector similarity corresponding to the adjacent window similarity between S3 and S4 and the jump window similarity between S2 and S4 is smaller than the preset similarity threshold, S4 is an independent second video segment, and S1, S2 and S3 are independent second video segments.

It should be noted that the above is only an example, and there may be a plurality of division results for S1, S2, S3, S4.

S3, extracting video segment characteristics of each second video segment in the second video segment set in sequence by utilizing a pre-trained CLIP+LSTM model, and fusing all the video segment characteristics to obtain video characteristics of a corresponding video file;

in the embodiment of the invention, the Pre-trained CLIP+LSTM model comprises a CLIP part (Contrastive Language-Image Pre-training) and an LSTM part (Long Short-Term Memory network).

In detail, the extracting video segment features of each second video segment in the second video segment set sequentially by using a pre-trained clip+lstm model, and fusing all the video segment features to obtain video features of a corresponding video file includes:

In an optional embodiment of the present invention, 4 video frames are extracted from each second video segment according to the time sequence, so as to form a video frame set of the second video segment.

According to the embodiment of the invention, the video frames of each second video segment are sequentially extracted by taking the second video segment as a unit, the video segment characteristics of the corresponding second video segment are extracted based on the frame characteristic vector of each video frame by utilizing the pre-trained CLIP+LSTM model, and finally the video characteristics of the corresponding video file are obtained based on the video segment characteristics of all the second video segments, wherein the video characteristics embody the image and picture characteristics of the video.

And S4, receiving a text to be searched, extracting text characteristics of the text to be searched by using the pre-trained CLIP+LSTM model, sequentially calculating characteristic similarity between the text characteristics and video characteristics of each preset video file, and selecting a video file corresponding to the characteristic similarity meeting the preset similarity condition as a target video file.

In the embodiment of the invention, the text characteristics of the text to be searched can be extracted by using the same model, namely the pre-trained CLIP+LSTM model. The text characteristics of the text to be searched and the video characteristics of the video file can be mapped to the same characterization space by the operation, so that the text characteristics and the video characteristics of the text to be searched can be conveniently compared and calculated.

In detail, referring to fig. 4, the extracting text features of the text to be retrieved by using the pretrained clip+lstm model includes:

s41, word segmentation is carried out on the text to be searched to obtain one or more search word segments, and word vectors of each search word segment are obtained;

S42, splicing word vectors of each search word by utilizing the CLIP part in the pre-trained CLIP+LSTM model to obtain a text vector matrix;

s43, sequentially selecting a search word as a target word, and calculating a key value of the target word according to a word vector of the target word and the text vector matrix;

s44, selecting a preset number of search terms as characteristic terms according to the sequence of the key values from large to small;

S45, word vectors of the feature word segmentation are spliced to obtain text features of the text to be searched.

In detail, since the text to be searched contains a large number of search terms, but not every search term is a feature of the text to be searched, the search terms are required to be screened, one of the search terms is selected one by one from the search terms to be a target term, and a key value of the target term is calculated according to a term vector of the target term and the text vector matrix, so that feature terms which are representative of the text to be searched are screened according to the key value, and the text feature of the text to be searched is obtained.

Specifically, the calculating the key value of the target word according to the word vector of the target word and the text vector matrix includes:

In the embodiment of the invention, the preset number of search terms are selected from the plurality of search terms as characteristic terms according to the order of the key value of each search term from large to small.

For example, the plurality of search terms comprise search term A, search term B and search term C, wherein the key value of the search term A is 80, the key value of the search term B is 70, the key value of the search term C is 30, if the preset number is 2, the search term A and the search term B are selected as feature terms according to the sequence from the large key value to the small key value, and word vectors of the search term A and the search term B are spliced to obtain text features of the text to be searched.

In the embodiment of the invention, the feature similarity can be obtained by calculating the cosine similarity between the text feature and the video feature of each video file.

Further, the cosine similarity can be obtained by calculating the cosine similarity between the text feature and the video feature of each second video segment, and the second video segment closest to the text feature is matched, so that the accuracy of video retrieval is further improved.

Fig. 5 is a functional block diagram of a video search device according to an embodiment of the present invention.

The video search apparatus 100 of the present invention may be mounted in an electronic device. The video retrieval device 100 comprises a lens segmentation module 101, a semantic segmentation module 102, a video feature extraction module 103 and a text-to-video feature comparison module 104 according to the functions implemented. The module of the invention, which may also be referred to as a unit, refers to a series of computer program segments, which are stored in the memory of the electronic device, capable of being executed by the processor of the electronic device and of performing a fixed function.

In the present embodiment, the functions concerning the respective modules/units are as follows:

the shot segmentation module 101 is configured to perform a segmentation operation on each preset video file through a shot to obtain a first video segment set corresponding to each preset video file;

The semantic segmentation module 102 is configured to perform semantic segmentation on each first video segment in the first video segment set in sequence to obtain a second video segment set of a corresponding video file;

the video feature extraction module 103 is configured to sequentially extract video segment features of each second video segment in the second video segment set by using a pre-trained clip+lstm model, and fuse all the video segment features to obtain video features of a corresponding video file;

The text-to-video feature comparison module 104 is configured to receive a text to be retrieved, extract text features of the text to be retrieved by using the pre-trained clip+lstm model, sequentially calculate feature similarities between the text features and video features of each preset video file, and select a video file corresponding to the feature similarities that satisfies a preset similarity condition as a target video file.

In detail, each module of the video searching apparatus 100 in the embodiment of the present invention adopts the same technical means as the video searching method described in fig. 1 to 4, and can produce the same technical effects, which are not repeated here.

Fig. 6 is a schematic structural diagram of an electronic device for implementing a video retrieval method according to an embodiment of the present invention.

The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as a video retrieval program, stored in the memory 11 and executable on the processor 10.

The memory 11 includes at least one type of readable storage medium, including flash memory, a mobile hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. The memory 11 may in other embodiments also be an external storage device of the electronic device 1, such as a plug-in mobile hard disk, a smart memory card (SMART MEDIA CARD, SMC), a Secure Digital (SD) card, a flash memory card (FLASH CARD) or the like, which are provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only for storing application software installed in the electronic device 1 and various types of data, such as codes of video search programs, but also for temporarily storing data that has been output or is to be output.

The processor 10 may be comprised of integrated circuits in some embodiments, for example, a single packaged integrated circuit, or may be comprised of multiple integrated circuits packaged with the same or different functions, including one or more central processing units (Central Processing Unit, CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects respective parts of the entire electronic device using various interfaces and lines, and executes various functions of the electronic device 1 and processes data by running or executing programs or modules (e.g., video retrieval programs, etc.) stored in the memory 11, and calling data stored in the memory 11.

The bus may be a peripheral component interconnect standard (PERIPHERAL COMPONENT INTERCONNECT, PCI) bus, or an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, among others. The bus may be classified as an address bus, a data bus, a control bus, etc. The bus is arranged to enable a connection communication between the memory 11 and at least one processor 10 etc.

Fig. 6 shows only an electronic device with components, it being understood by a person skilled in the art that the structure shown in fig. 6 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or may combine certain components, or may be arranged in different components.

For example, although not shown, the electronic device 1 may further include a power source (such as a battery) for supplying power to each component, and preferably, the power source may be logically connected to the at least one processor 10 through a power management device, so that functions of charge management, discharge management, power consumption management, and the like are implemented through the power management device. The power supply may also include one or more of any of a direct current or alternating current power supply, recharging device, power failure detection circuit, power converter or inverter, power status indicator, etc. The electronic device 1 may further include various sensors, bluetooth modules, wi-Fi modules, etc., which will not be described herein.

Further, the electronic device 1 may also comprise a network interface, optionally the network interface may comprise a wired interface and/or a wireless interface (e.g. WI-FI interface, bluetooth interface, etc.), typically used for establishing a communication connection between the electronic device 1 and other electronic devices.

The electronic device 1 may optionally further comprise a user interface, which may be a Display, an input unit, such as a Keyboard (Keyboard), or a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch, or the like. The display may also be referred to as a display screen or display unit, as appropriate, for displaying information processed in the electronic device 1 and for displaying a visual user interface.

It should be understood that the embodiments described are for illustrative purposes only and are not limited to this configuration in the scope of the patent application.

The video retrieval program stored in the memory 11 in the electronic device 1 is a combination of instructions that, when executed in the processor 10, may implement:

Further, the modules/units integrated in the electronic device 1 may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as separate products. The computer readable storage medium may be volatile or nonvolatile. For example, the computer readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM).

The present invention also provides a computer readable storage medium storing a computer program which, when executed by a processor of an electronic device, can implement:

In addition, each functional module in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units can be realized in a form of hardware or a form of hardware and a form of software functional modules.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.

The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.

The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, encryption algorithm and the like. The blockchain (Blockchain), essentially a de-centralized database, is a string of data blocks that are generated in association using cryptographic methods, each of which contains information from a batch of network transactions for verifying the validity (anti-counterfeit) of its information and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.

The embodiment of the application can acquire and process the related data based on the holographic projection technology. Wherein artificial intelligence (ARTIFICIAL INTELLIGENCE, AI) is the theory, method, technique, and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results.

Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units or means recited in the system claims can also be implemented by means of software or hardware by means of one unit or means. The terms second, etc. are used to denote a name, but not any particular order.

Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims

1. A video retrieval method, characterized in that the method comprises:

Performing a segmentation operation on each preset video file through a lens to obtain a first video segment set corresponding to each preset video file;

Sequentially performing semantic segmentation on each first video segment in the first video segment set to obtain a second video segment set corresponding to the video file;

Using a pre-trained CLIP+LSTM model, sequentially extracting video segment features of each second video segment in the second video segment set, and fusing all the video segment features to obtain video features of the corresponding video file;

Receiving a text to be retrieved, extracting text features of the text to be retrieved using the pre-trained CLIP+LSTM model, sequentially calculating feature similarities between the text features and video features of each of the preset video files, and selecting a video file corresponding to the feature similarity that meets a preset similarity condition as a target video file;

Among them, the semantic segmentation is performed on each first video segment in the first video segment set in turn to obtain a second video segment set corresponding to the video file, including: identifying the text of each first video segment and dividing the text of each first video segment into sentences; performing sentence vector conversion on each sentence of each first video segment to obtain a sentence vector set corresponding to the first video segment; calculating the adjacent window similarity and the jump window similarity between every two sentence vectors in the sentence vector set to obtain the corresponding vector similarity, and dividing the sentences corresponding to the vector similarity that meets the preset similarity threshold into a second video segment; and collecting all the second video segments to obtain a second video segment set corresponding to the preset video file.

2. The video retrieval method according to claim 1, wherein the step of converting each sentence of each first video segment into a sentence vector to obtain a sentence vector set corresponding to the first video segment comprises:

Segment each of the sentences in turn, and convert each of the segmented words into a word vector;

Add the word vectors corresponding to each of the clauses to obtain a word vector matrix for each of the clauses;

A pooling operation is performed on each of the word vector matrices to obtain a sentence vector corresponding to each of the clauses.

3. The video retrieval method according to claim 1, wherein calculating the adjacent window similarity and the jump window similarity between each two sentence vectors in the sentence vector set to obtain corresponding vector similarities, and dividing sentences corresponding to vector similarities that meet a preset similarity threshold into a second video segment, comprises:

Step A: taking the first sentence vector in the sentence vector set as the starting point;

Step B: calculating the adjacent window similarity between the starting point and the sentence vectors adjacent to the starting point, and determining whether the adjacent window similarity is greater than a preset similarity threshold;

When the adjacent window similarity is greater than the preset semantic similarity threshold, step C is performed, and the starting point and the sentence vector adjacent to the starting point are used as a temporary video segment;

Step C1: removing the sentence vectors in the temporary video segment from the sentence vector set, and determining whether the sentence vector set after removing the vectors is empty;

When the sentence vector set after the vector is removed is empty, executing C11, dividing the temporary video segment into a second video segment, and jumping to step E1;

If the sentence vector set after the vector is removed is not empty, executing C12, taking the first sentence vector in the sentence vector set as a starting point, calculating the adjacent window similarity and the jump window similarity between the starting point and the vector in the temporary video segment, performing weighted averaging on the adjacent window similarity and the jump window similarity to obtain a vector similarity, and determining whether the vector similarity is greater than a preset similarity threshold;

When the vector similarity is greater than the preset similarity threshold, executing C121, adding the starting point to the temporary video segment, and returning to step C1;

When the vector similarity is not greater than the preset similarity threshold, executing C122, dividing the temporary video segment into a second video segment, removing the vector corresponding to the second video segment from the vector set, and jumping to step E;

When the adjacent window similarity is not greater than the preset similarity threshold, executing step D, dividing the starting point into a second video segment, and removing the starting point from the sentence vector set;

Step E: Determine whether the sentence vector set after removing the vector is empty;

When the sentence vector set after removing the vector is empty, executing step E1, collecting the second video segments to obtain the second video segment set;

When the sentence vector set after removing the vector is not empty, return to the above step A.

4. The video retrieval method according to claim 1, wherein the step of sequentially extracting the video segment features of each second video segment in the second video segment set using the pre-trained CLIP+LSTM model and fusing all the video segment features to obtain the video features of the corresponding video file comprises:

Extracting video frames from each of the second video segments in sequence according to time to obtain a video frame set for each of the second video segments;

Using the CLIP part of the pre-trained CLIP+LSTM model, sequentially extracting a frame feature vector of each video frame in the video frame set;

Using the LSTM part of the pre-trained CLIP+LSTM model, a convolution operation is performed on all frame feature vectors of each second video segment to obtain a video segment feature corresponding to the second video segment;

A pooling operation is performed on all video segment features corresponding to the preset video file to obtain video features of the preset video file.

5. The video retrieval method according to claim 1, wherein extracting text features of the text to be retrieved using the pre-trained CLIP+LSTM model comprises:

Segmenting the text to be searched to obtain one or more search segments, and obtaining a word vector for each search segment;

Using the CLIP part of the pre-trained CLIP+LSTM model, the word vectors of each search word are concatenated to obtain a text vector matrix;

Selecting a search word as a target word in turn, and calculating the key value of the target word according to the word vector of the target word and the text vector matrix;

Selecting a preset number of search terms as feature terms in descending order of the key values;

The word vectors of the feature segmentations are concatenated to obtain the text features of the text to be retrieved.

6. The video retrieval method according to claim 5, wherein the step of calculating the key value of the target word segmentation based on the word vector of the target word segmentation and the text vector matrix comprises:

The key value of the target word is calculated using the following key value algorithm:

in, is the key value, is the text vector matrix, is the matrix transpose symbol, To find the modulo symbol, The word vector of the target word.

7. A video retrieval device for implementing the video retrieval method according to any one of claims 1 to 6, characterized in that the device comprises:

a shot segmentation module, configured to perform a segmentation operation on each preset video file by shot, to obtain a first video segment set corresponding to each preset video file;

a semantic segmentation module, configured to sequentially perform semantic segmentation on each first video segment in the first video segment set to obtain a second video segment set corresponding to the video file;

a video feature extraction module, configured to sequentially extract video segment features of each second video segment in the second video segment set using a pre-trained CLIP+LSTM model, and fuse all the video segment features to obtain video features of the corresponding video file;

The text and video feature comparison module is used to receive the text to be retrieved, use the pre-trained CLIP+LSTM model to extract the text features of the text to be retrieved, calculate the feature similarity between the text features and the video features of each preset video file in turn, and select the video file corresponding to the feature similarity that meets the preset similarity conditions as the target video file.

8. An electronic device, characterized in that the electronic device comprises:

at least one processor; and,

a memory communicatively connected to the at least one processor; wherein,

The memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor to enable the at least one processor to perform the video retrieval method according to any one of claims 1 to 6.

9. A computer-readable storage medium storing a computer program, wherein when the computer program is executed by a processor, the video retrieval method according to any one of claims 1 to 6 is implemented.