
CN118861331B - Multi-system data management method based on multi-mode model - Google Patents


Info

Publication number
CN118861331B
CN118861331B (application CN202411338496.3A)
Authority
CN
China
Prior art keywords
video
segment
data
target
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411338496.3A
Other languages
Chinese (zh)
Other versions
CN118861331A (en)
Inventor
秦超楠
鹿艳利
申博宇
丁剑锋
姜畅
徐桂彬
丁钰
张晓奇
张忠奎
祁德
周倩倩
张旭哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei Central China Technology Development Of Electric Power Co ltd
Original Assignee
Hubei Central China Technology Development Of Electric Power Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei Central China Technology Development Of Electric Power Co ltd filed Critical Hubei Central China Technology Development Of Electric Power Co ltd
Priority to CN202411338496.3A priority Critical patent/CN118861331B/en
Publication of CN118861331A publication Critical patent/CN118861331A/en
Application granted granted Critical
Publication of CN118861331B publication Critical patent/CN118861331B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43Querying
    • G06F16/435Filtering based on additional data, e.g. user or group profiles
    • G06F16/436Filtering based on additional data, e.g. user or group profiles using biological or physiological data of a human being, e.g. blood pressure, facial expression, gestures
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43Querying
    • G06F16/432Query formulation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43Querying
    • G06F16/432Query formulation
    • G06F16/434Query formulation using image data, e.g. images, photos, pictures taken by a user
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Molecular Biology (AREA)
  • Physiology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a multi-system data management method based on a multi-mode model, belonging to the technical field of multi-mode data processing. The method comprises the following steps: step S1, acquiring target picture data containing a target person from the Internet according to text description data corresponding to the target person; step S2, acquiring target video segment data containing the target person from a multimedia database according to the target picture data, wherein the multimedia database at least comprises a plurality of video files; step S3, extracting target voice segment data of the target person from the target video segment data; and step S4, establishing an association relation between the text description data and the target voice segment data. According to this technical scheme, the text description data of the target person is sequentially associated with the target picture data, the target video segment data, and the target voice segment data, so that the text description data is finally associated with the target voice segment data, which facilitates multi-mode data management of the person.

Description

Multi-system data management method based on multi-mode model
Technical Field
The invention relates to the technical field of multi-mode data processing, in particular to a multi-system data management method based on a multi-mode model.
Background
Multimodal data comprises a series of different data representations of a subject, such as text, pictures, speech, and video. Traditionally, the different modal data of a given subject exist independently of each other and lack effective association, which makes effective management and efficient indexing of the related subject data inconvenient. Text descriptions, images, speech, and video of persons are the most widely used and most typical multi-mode data, and have the largest data volume. In the prior art, multi-mode data management of a person only realizes the association of text data with picture data; how to associate the text data, video data, and voice data of a person has long been a technical problem in this field. In particular, no scheme for associating the text data and voice data of a person has been found, and no related disclosure has been located.
Disclosure of Invention
The invention aims to solve at least one of the technical problems in the prior art, and provides a multi-system data management method based on a multi-mode model.
The multi-system data management method based on the multi-mode model comprises the following steps:
Step S1, acquiring target picture data containing a target person from the Internet according to text description data corresponding to the target person;
Step S2, acquiring target video segment data containing the target person from a multimedia database according to the target picture data, wherein the multimedia database at least comprises a plurality of video files;
Step S3, extracting target voice segment data of the target person from the target video segment data;
Step S4, establishing an association relation between the text description data and the target voice segment data.
Optionally, step S1 includes:
step S101, acquiring the name of the target person from the text description data, and carrying out picture searching in the Internet by taking the name as a keyword to obtain a picture searching result;
Step S102, selecting the pictures containing face front images from the picture search result, wherein all the selected pictures form a picture set to be processed;
Step S103, clustering the pictures in the picture set to be processed based on facial features of the face front images contained in the pictures, and calculating the picture quantity ratio corresponding to each face category according to the clustering result, wherein the picture quantity ratio is equal to the ratio of the number of pictures contained in the corresponding face category to the number of pictures contained in the picture set to be processed;
Step S104, judging whether the largest picture quantity ratio is greater than or equal to a first ratio threshold;
If the judgment result of the step S104 is yes, the step S105 is executed;
Step S105, taking all the pictures contained in the face category corresponding to the largest picture quantity ratio as the target picture data.
Optionally, in step S1, step S100 is further included before step S101;
Step S100, setting the value of i to 0;
If the judgment result in the step S104 is no, the step S106 is executed;
Step S106, adding 1 to the value of i;
Step S107, acquiring i feature words of the target person from the text description data, and carrying out picture searching in the Internet by taking the name and the i feature words as keywords at the same time, so as to obtain a picture searching result;
Step S102 is executed again after step S107 ends;
between step S102 and step S103, further comprising:
Step S103a, judging whether the number of the pictures in the picture set to be processed is larger than or equal to a first number threshold;
If not, step S103b is executed;
Step S103b, acquiring manually specified target picture data containing the target person.
Optionally, step S2 includes:
step S201, taking a picture in the target picture data as a reference, and detecting whether the video frame picture contains the target person for each video frame picture in each video file;
Step S202, dividing a plurality of continuous video frame pictures in the video file into continuous picture sets in all the video frame pictures including the target person, and extracting video segments corresponding to the continuous picture sets from the corresponding video file, wherein all the extracted video segments are used as the target video segment data.
Optionally, step S3 includes:
Step S301, selecting a video segment containing voice from all video segments included in the target video segment data as a video segment to be processed;
Step S302, extracting tone characteristics of voice in each video segment to be processed;
Step S303, clustering each video segment to be processed according to tone characteristics to obtain tone clustering results, wherein each tone clustering result comprises n tone categories and the video segments to be processed contained in each tone category;
Step S304, judging whether n is equal to 1;
If the judgment result of the step S304 is yes, the step S305 is executed;
Step S305, extracting corresponding speech segments from each of the video segments to be processed contained in only 1 tone color category in the tone color clustering result, where all the extracted speech segments are used as the target speech segment data.
Optionally, in step S3, if the result of the determination in step S304 is no, step S306 is executed;
Step S306, for each video segment to be processed in the tone color clustering result, recognizing each Chinese character voice in the video segment to be processed one by one according to pronunciation of Chinese characters, and representing the recognition result of each Chinese character voice in a pinyin form, wherein the recognition results of all Chinese character voices in the video segment to be processed are sequentially arranged to form a first pinyin recognition result;
Step S307, for each video segment to be processed in the tone color clustering result, detecting a recognition result similarity between the corresponding first pinyin recognition result and second pinyin recognition result;
Step S308, detecting whether the similarity of the recognition results corresponding to all the video segments to be processed in the tone color clustering results is simultaneously greater than or equal to a first similarity threshold or is simultaneously smaller than the first similarity threshold;
If the detection result in step S308 is that the similarity of the identification result corresponding to at least one video segment to be processed is greater than or equal to the first similarity threshold, and the similarity of the identification result corresponding to at least one video segment to be processed is less than the first similarity threshold, step S309 is executed;
Step S309, deleting the video segments to be processed with the recognition result similarity greater than or equal to a first similarity threshold in the tone color clustering result to update the tone color clustering result;
step S310, judging whether the number of tone color categories in the updated tone color clustering result is equal to 1;
if the determination result in step S310 is yes, step S305 is executed.
Optionally, in step S3, if the detection result in step S310 is no, step S311 is performed;
step S311, detecting a pronunciation characteristic stability index of a voice segment contained in the video segment to be processed according to each video segment to be processed in the tone color clustering result;
Step S312, detecting whether the pronunciation characteristic stability index corresponding to all the video segments to be processed in the tone color clustering result is simultaneously greater than or equal to a corresponding preset stability threshold or is simultaneously smaller than the corresponding preset stability threshold;
if the detection result in step S312 is that the pronunciation characteristic stability index corresponding to at least one video segment to be processed is greater than or equal to the preset stability threshold and that the pronunciation characteristic stability index corresponding to at least one video segment to be processed is less than the preset stability threshold, step S313 is executed;
Step S313, deleting the video segments to be processed with pronunciation characteristic stability indexes greater than or equal to a preset stability threshold value from the tone color clustering result to update the tone color clustering result;
step S314, judging whether the number of tone color categories in the updated tone color clustering result is equal to 1;
if the determination result in step S314 is yes, step S305 is executed.
Optionally, in step S3, if the detection result in step S314 is no, step S315 is executed;
Step S315, in the tone color clustering result, calculating the total duration of the video segments corresponding to each tone color category, where the total duration of the video segments corresponding to a tone color category is equal to the sum of the durations of all the video segments to be processed contained in that tone color category;
Step S316, respectively extracting corresponding voice segments from each video segment to be processed contained in the tone class corresponding to the maximum total duration of the video segments, and taking all the extracted voice segments as the target voice segment data.
Optionally, the step of establishing the association relationship between the text description data and the target voice segment data includes:
And constructing an index data structure of the text description data, the target picture data, the target video segment data and the target voice segment data, wherein the index data structure is a directed graph data structure, and nodes representing the text description data in the directed graph data structure point to nodes representing the target picture data, nodes representing the target video segment data and nodes representing the target voice segment data respectively.
Optionally, the multi-system data management method based on the multi-mode model further includes:
Step S5, storing the data of each video segment in the target video segment data;
in step S5, the step of storing the data of one target video segment of the target video segment data includes:
Step S501, calculating the picture similarity between any two adjacent video frame pictures in the target video segment, and dividing the target video segment into a plurality of video subsections between two adjacent video frame pictures with the picture similarity less than or equal to a second similarity threshold;
Step S502, segmenting each video frame picture in each video sub-segment by using a semantic segmentation model to determine a foreground part image and a background part image in each video frame picture, wherein the foreground part image is the image of the area surrounded by the outline edge of the target person in the video frame picture, the background part image is the rest of the video frame picture other than the foreground part image, and the background part image comprises a moving object background image and a fixed object background image;
Step S503, for each video sub-segment, extracting a face image and body posture information from the foreground part image in each video frame picture in the video sub-segment, and determining a shared body appearance image corresponding to the video sub-segment according to the foreground part images in at least part of the video frame pictures in the video sub-segment;
Step S504, for each video sub-segment, carrying out fixed background merging processing on the fixed background part images in all video frame pictures included in the video sub-segment in space to obtain a complete fixed background image corresponding to the video sub-segment, and recording the regional position information of the fixed background part images in the complete fixed background image in each video frame picture in the video sub-segment;
Step S505, determining sub-segment storage data corresponding to each video sub-segment, wherein the sub-segment storage data comprises one shared body appearance image corresponding to the video sub-segment, one complete fixed background image corresponding to the video sub-segment, the face image corresponding to each video frame picture included in the video sub-segment, the body posture information corresponding to each video frame picture included in the video sub-segment, and the region position information corresponding to each video frame picture included in the video sub-segment;
Step S506, storing the sub-segment storage data of all the video sub-segments included in the target video segment.
According to this technical scheme, the target picture data of the target person is obtained from the Internet according to the text description data of the target person, then the target video segment data containing the target person is obtained from the multimedia database according to the target picture data, and then the target voice segment data of the target person is extracted from the target video segment data. In other words, with the target picture data and the target video segment data serving as association bridges, the text description data of the target person, the target picture data, the target video segment data, and the target voice segment data are sequentially associated, so that the text description data is finally associated with the target voice segment data, which facilitates multi-mode data management of the person.
Drawings
Fig. 1 is a flowchart of a multi-system data management method based on a multi-mode model according to the present invention.
Fig. 2 is a flowchart of an alternative implementation method of step S1 in the present invention.
Fig. 3 is a flowchart of an alternative implementation method of step S2 in the present invention.
Fig. 4 is a flowchart of an alternative implementation method of step S3 in the present invention.
FIG. 5 is a graph of a relationship between multi-modal data in accordance with the present invention.
FIG. 6 is a diagram of a directed graph data structure corresponding to multi-modal data in accordance with the present invention.
Fig. 7 is a flowchart of another multi-system data management method based on a multi-mode model according to the present invention.
Fig. 8 is a flowchart of an alternative implementation method of step S5 in the present invention.
Detailed Description
Example embodiments will be described more fully hereinafter with reference to the accompanying drawings, but may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other steps, operations, elements, and/or components thereof.
It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and the present invention and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Referring to fig. 1, the multi-system data management method based on the multi-mode model includes:
step S1, acquiring target picture data containing the target person from the Internet according to text description data corresponding to the target person.
The text description data refers to data that describes the target person in written language. The text description data is a sentence or a combination of several sentences with a complete, systematic meaning (a message); a text can be a sentence, a paragraph, or a discourse. In the present invention, the text description data records at least the name of the target person. In addition, text description data typically contains some feature words (usually located within sentences) that are used to describe the target person.
The target picture data includes at least one picture including a target person.
And S2, acquiring target video segment data containing target characters from a multimedia database according to the target picture data.
The multimedia database at least comprises a plurality of video files, the video files generally comprise video data and audio data, and the target video segment data comprise at least one video segment containing target characters.
And S3, extracting target voice segment data of the target person from the target video segment data.
Wherein the target speech segment data includes at least one speech segment of speech uttered by the target person. The speech segment obtained in step S3 is extracted from the video segment in the target video segment data.
And S4, establishing an association relation between the text description data and the target voice segment data.
Wherein the text description data may be associated with the target speech segment data by means of defining a data structure.
According to the invention, the text description data of the target person is used for acquiring the target picture data of the target person from the Internet, then the target video segment data containing the target person is acquired from the multimedia database according to the target picture data, and then the target voice segment data of the target person is extracted from the target video segment data, namely, the text description data of the target person, the target picture data, the target video segment data and the target voice segment data are sequentially associated by taking the target picture data and the target video segment data as association bridges, so that the aim of associating the text description data and the target voice segment data is finally achieved, and the multi-mode data management of the person is facilitated.
In order to facilitate a better understanding of the technical solutions of the present invention by those skilled in the art, the technical solutions of the present invention will be exemplarily described below with reference to some embodiments.
Referring to fig. 2, optionally, step S1 includes the following steps S101 to S105.
Step S101, obtaining the name of the target person from the text description data, and carrying out picture searching in the Internet by taking the name as a keyword to obtain a picture searching result.
In general, when a picture search is performed in the Internet using only the name as a keyword, the resulting picture search result contains a large number of pictures. Therefore, when the number of pictures in the picture search result is large, the first m pictures (for example, m = 100) can be taken as the final picture search result for subsequent processing.
Step S102, selecting the pictures containing face front images from the picture search result, where all the selected pictures form the picture set to be processed.
Step S103, clustering the pictures in the picture set to be processed based on the facial features of the face front images contained in the pictures, and calculating the picture quantity ratio corresponding to each face category according to the clustering result.
The clustering processing of the pictures in the picture set to be processed according to the facial features can be performed by adopting a clustering algorithm based on feature similarity calculation, a K-Means clustering algorithm, a hierarchical clustering algorithm, a DBSCAN clustering algorithm and the like. The present invention is not limited to the clustering algorithm used in step S103.
The picture quantity ratio is equal to the ratio of the number of pictures contained in the corresponding face category to the number of pictures contained in the picture set to be processed. That is, step S103 determines how many categories the pictures in the picture set to be processed are ultimately divided into, and counts the number of pictures corresponding to each category.
The number of face categories in the clustering result indicates, to a certain extent, how many different people the pictures in the picture set to be processed come from (one face category corresponds to one person). When the number of face categories is greater than or equal to 2, it may indicate that the picture set to be processed contains other people who share the target person's name.
Step S104, judging whether the largest picture quantity ratio is greater than or equal to a first ratio threshold.
The first ratio threshold is a constant designed in advance according to practical experience. Typically, the first ratio threshold is greater than or equal to 85%; for example, it may take a value of 90%, 95%, and so on.
If the judgment result in step S104 is yes, that is, the largest picture quantity ratio is greater than or equal to the first ratio threshold, it indicates with high reliability that the person in the corresponding face category is the target person, and step S105 is then executed.
Step S105, taking all the pictures contained in the face category corresponding to the largest picture quantity ratio as the target picture data.
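As an illustrative aid only, the following minimal Python sketch shows one way steps S103 to S105 could be realized, under the assumption that a facial-feature vector has already been extracted for every picture in the picture set to be processed (the feature extractor itself is not shown, and the DBSCAN parameters are placeholders).

```python
# Sketch of steps S103-S105: cluster pictures by facial features and keep the
# largest face category if its picture quantity ratio reaches the threshold.
import numpy as np
from sklearn.cluster import DBSCAN

def select_target_pictures(embeddings, pictures, first_ratio_threshold=0.9):
    # embeddings: (N, D) array of facial features, one row per picture
    labels = DBSCAN(eps=0.5, min_samples=2, metric="cosine").fit_predict(embeddings)
    valid = labels[labels >= 0]                       # drop noise points (-1)
    if valid.size == 0:
        return None
    counts = np.bincount(valid)                       # pictures per face category
    best = int(np.argmax(counts))
    ratio = counts[best] / len(pictures)              # picture quantity ratio (S103)
    if ratio >= first_ratio_threshold:                # judgment of step S104
        return [p for p, lab in zip(pictures, labels) if lab == best]   # step S105
    return None                                       # fall through to steps S106/S107
```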
With continued reference to fig. 2, optionally, step S103a and step S103b are further included in step S1 and between step S102 and step S103.
Step S103a, judging whether the number of pictures in the to-be-processed picture set is greater than or equal to a first number threshold.
When the number of pictures in the picture set to be processed is greater than or equal to the first number threshold, it indicates that the number of pictures in the picture set to be processed is sufficient for the subsequent processing, and step S103 is performed thereafter. When the number of the pictures in the to-be-processed picture set is smaller than the first number threshold, the number of the pictures in the to-be-processed picture set is too small, so that the reliability represented by the subsequent processing result is low, and then step S103b is executed.
Step S103b, acquiring manually specified target picture data containing the target person.
Referring again to FIG. 2, optionally, step S100 is further included in step S1 and before step S101, and step S106 and step S107 are further included after step S105.
Step S100, let i take a value of 0.
If the determination result in step S104 is no, that is, it is difficult to determine the pictures of the target person from the current picture set to be processed (there are many pictures of other people who share the target person's name), then step S106 is performed.
Step S106, adding 1 to the value of i.
Step S107, i feature words of the target person are obtained from the text description data, and picture searching is carried out in the Internet by taking the name and the i feature words as keywords at the same time, so that a picture searching result is obtained. After the end of step S107, step S102 is executed again.
Step S107 increases the number of keywords so as to perform a more precise search in the Internet, thereby reducing as much as possible the number of pictures of other people who share the target person's name.
Referring to fig. 3, optionally, step S2 includes the following steps S201 and S202.
Step S201, taking the picture in the target picture data as a reference, and detecting whether the video frame picture contains the target person for each video frame picture in each video file.
As an example, when the number of pictures in the target picture data is small, the similarity between each video frame picture in the video file and each picture in the target picture data may be calculated, and the reliability that the video frame picture contains the target person is obtained from these similarities (for example, the maximum, minimum, average, or median of the similarities between the video frame picture and the pictures in the target picture data is taken as the reliability). The reliability is then compared with a preset reliability threshold: if the reliability is greater than the preset reliability threshold, the video frame picture is detected as containing the target person; otherwise, it is detected as not containing the target person.
As another example, when the number of pictures in the target picture data is large, a target person detection model that can be used for detecting a target person may be trained by using the pictures in the target picture data in a machine learning manner, and then each video frame picture in each video file is detected using the target person detection model.
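The similarity-based variant in the first example can be sketched as follows; this is only a sketch, assuming each video frame picture and each reference picture have already been reduced to L2-normalized feature vectors by some feature extractor (not shown).

```python
import numpy as np

def frame_contains_target(frame_feature, reference_features,
                          reliability_threshold=0.8, reduce="max"):
    # cosine similarity of the frame against every picture in the target picture data
    sims = reference_features @ frame_feature
    if reduce == "max":
        reliability = float(np.max(sims))
    elif reduce == "mean":
        reliability = float(np.mean(sims))
    else:
        reliability = float(np.median(sims))
    return reliability > reliability_threshold        # detection result of step S201
```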
Step S202, dividing a plurality of continuous video frame pictures in a video file into continuous picture sets in all video frame pictures containing a target person, and extracting video segments corresponding to the continuous picture sets from the corresponding video file, wherein all the extracted video segments are used as target video segment data.
In the invention, based on the picture detection mode, the continuity of the video frame pictures containing the target person in the video file is combined, and the corresponding video segment can be extracted from the video file, so that the association of the target object and the video segment is realized, namely, the association of the text description data of the target object and the target video segment data is also realized.
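A minimal sketch of step S202 is given below: frame indices detected as containing the target person are grouped into consecutive runs, and each run is converted into a (start, end) time range using the frame rate. The detector of the previous sketch is assumed.

```python
def frames_to_segments(hit_frame_indices, fps):
    # hit_frame_indices: indices of video frame pictures containing the target person
    segments, run_start, prev = [], None, None
    for idx in sorted(hit_frame_indices):
        if run_start is None:
            run_start = prev = idx
        elif idx == prev + 1:
            prev = idx
        else:
            segments.append((run_start / fps, (prev + 1) / fps))
            run_start = prev = idx
    if run_start is not None:
        segments.append((run_start / fps, (prev + 1) / fps))
    return segments                                    # one entry per extracted video segment
```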
Referring to fig. 4, optionally, step S3 includes the following steps S301 to S305.
Step S301, selecting a video segment containing voice from all video segments included in the target video segment data as a video segment to be processed.
Step S302, extracting tone characteristics of voice in each video segment to be processed.
Generally, tone color feature extraction is to extract frequency components and features in an audio signal by performing spectral analysis on the audio signal, thereby obtaining tone color features.
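The exact timbre feature is not specified by the method; as one common choice, the sketch below (assuming the librosa library) averages MFCCs, which summarize the spectral envelope, over the audio of a video segment to obtain a fixed-length timbre vector.

```python
import numpy as np
import librosa

def timbre_feature(audio_path, n_mfcc=20):
    y, sr = librosa.load(audio_path, sr=16000)         # audio track of one video segment
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # mean and standard deviation over time form the timbre feature used for clustering
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
```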
And step S303, clustering the video segments to be processed according to the tone characteristics to obtain a tone clustering result.
The tone color clustering result comprises n tone color categories and video segments to be processed contained in each tone color category. Generally, the timbres of different people are different, so n timbre categories represent voices of n different people.
Step S304, judge whether n equals 1.
If the determination result in step S304 is yes, that is, it indicates that the voice is from 1 person (i.e., the target object), step S305 is performed.
Step S305, respectively extracting corresponding voice segments from each video segment to be processed contained in only 1 tone type in the tone color clustering result, and taking all the extracted voice segments as target voice segment data.
With continued reference to fig. 4, in step S3 and after step S304, steps S306-S310 are optionally further included. If the determination result in step S304 is no, that is, if the voice is from more than 1 person, step S306 is executed.
Step S306, for each video segment to be processed in the tone color clustering result, recognizing each Chinese character voice in the video segment to be processed one by one according to the pronunciation of the Chinese characters, and representing the recognition result of each Chinese character voice in pinyin form, where the recognition results of all Chinese character voices in the video segment to be processed are arranged in order to form a first pinyin recognition result. In addition, voice recognition is carried out on each sentence in the video segment to be processed by speech recognition software in a complete-sentence recognition mode, and the recognition result of each sentence is represented in pinyin form, where the recognition results of all the sentences in the video segment to be processed are arranged in order to form a second pinyin recognition result.
In step S306, two completely different recognition techniques are adopted to respectively recognize the voice in each video segment to be processed, namely 1) Chinese character-by-Chinese character voice recognition and 2) sentence-by-sentence recognition by applying voice recognition software. Meanwhile, the recognition results are all expressed by pinyin.
Single-Chinese-character speech recognition recognizes the original pronunciation of each character, whereas the speech recognition software recognizes whole sentences and usually applies single-character error correction based on context semantics.
Step S307, for each video segment to be processed in the timbre clustering result, detects the similarity of the recognition results between the corresponding first pinyin recognition result and the second pinyin recognition result.
The similarity between two pinyin segments may be calculated by a similarity calculation algorithm (e.g., may be considered as a similarity calculation between two pinyin strings).
When the similarity between the first pinyin recognition result and the second pinyin recognition result is high, the Chinese pronunciation of the corresponding person is standard (the corresponding speech is then likely to be a voice-over or narration; a person speaking normally does not use perfectly standard pronunciation for every word). Otherwise, the Chinese pronunciation of the corresponding person is not standard.
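Only the comparison of the two pinyin recognition results is sketched below; the two recognizers themselves are not shown. As a simplification, both transcripts are assumed to be available as Chinese character strings and are converted to pinyin with the pypinyin library, whereas in the method the first result would come directly from character-level acoustic recognition.

```python
from difflib import SequenceMatcher
from pypinyin import lazy_pinyin

def recognition_result_similarity(char_by_char_text, sentence_level_text):
    first = " ".join(lazy_pinyin(char_by_char_text))    # first pinyin recognition result
    second = " ".join(lazy_pinyin(sentence_level_text)) # second pinyin recognition result
    return SequenceMatcher(None, first, second).ratio() # similarity used in step S307
```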
Step S308, detecting whether the similarity of the recognition results corresponding to all the video segments to be processed in the tone color clustering results is larger than or equal to a first similarity threshold or smaller than the first similarity threshold at the same time.
If the detection result in step S308 is that the recognition result similarity corresponding to at least one video segment to be processed is greater than or equal to the first similarity threshold and the recognition result similarity corresponding to at least one video segment to be processed is less than the first similarity threshold, that is, some of the speech is pronounced in an overly standard way while the rest is not, step S309 is executed.
Step S309, deleting the video segments to be processed with the similarity of the identification result greater than or equal to the first similarity threshold in the timbre clustering result, so as to update the timbre clustering result.
The first similarity threshold may be defined according to practical situations, and generally the first similarity threshold is greater than or equal to 90% and less than 1, for example, the first similarity threshold takes a value of 90%, 95%, or the like.
That is, step S309 deletes the video segments to be processed whose speech is likely to be voice-over or narration (pronunciation that is too standard).
Step S310, judging whether the number of tone color categories in the updated tone color clustering result is equal to 1.
If the determination result in step S310 is yes, step S305 is executed.
With continued reference to fig. 4, in step S3 and after step S310, steps S311 to S314 may be optionally further included.
Optionally, if the detection result in step S308 is that the similarity of the recognition results corresponding to all the video segments to be processed is greater than or equal to the first similarity threshold or less than the first similarity threshold at the same time, step S311 is performed.
If the detection result in step S310 is no, step S311 is executed.
Step S311, for each video segment to be processed in the timbre clustering result, detecting a pronunciation characteristic stability index of a voice segment included in the video segment to be processed.
In this disclosure, the pronunciation characteristic stability index is used to describe how stable the pronunciation characteristics in a speech segment are. The larger the pronunciation characteristic stability index, the better the stability of the pronunciation characteristics and the more likely the speech comes from a professional speaker such as a broadcaster; the smaller the index, the worse the stability and the more likely the speech comes from an ordinary speaker (there are usually noticeable differences between the pronunciation characteristics of a broadcaster, an interviewee, and an ordinary person).
Alternatively, the pronunciation characteristic stability index may be represented by the inverse of the variance or the inverse of the standard deviation of the plurality of pronunciation characteristic values acquired over time in a speech segment.
In some embodiments, the pronunciation characteristic stability index may include a volume stability index (which may be represented by the inverse of the variance of volume or the inverse of the standard deviation) or a speech rate stability index (which may be represented by the inverse of the variance of speech rate or the inverse of the standard deviation).
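Under the stated assumption that the index is the inverse of the variance of a characteristic sampled over time, the volume stability index could be sketched as follows (librosa is assumed; short-time RMS energy stands in for volume).

```python
import numpy as np
import librosa

def volume_stability_index(audio_path):
    y, sr = librosa.load(audio_path, sr=16000)
    rms = librosa.feature.rms(y=y)[0]                  # volume curve over time
    var = float(np.var(rms))
    return 1.0 / var if var > 0 else float("inf")      # larger value = steadier volume
```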
Step S312, detecting whether the pronunciation characteristic stability index corresponding to all the video segments to be processed in the tone color clustering result is simultaneously greater than or equal to the corresponding preset stability threshold or is simultaneously smaller than the corresponding preset stability threshold.
If the detection result in step S312 is that the pronunciation characteristic stability index corresponding to the at least one video segment to be processed is greater than or equal to the preset stability threshold and the pronunciation characteristic stability index corresponding to the at least one video segment to be processed is less than the preset stability threshold, step S313 is executed.
Step S313, deleting the video segments to be processed with the pronunciation characteristic stability index greater than or equal to a preset stability threshold in the tone color clustering result to update the tone color clustering result.
Step S314, judging whether the number of tone color categories in the updated tone color clustering result is equal to 1.
If the determination result in step S314 is yes, step S305 is performed.
With continued reference to fig. 4, optionally, step S315 and step S316 are also included in step S3 and after step S314.
If the detection result in step S312 is that the pronunciation characteristic stability index corresponding to all the video segments to be processed is simultaneously greater than or equal to the preset stability threshold or is simultaneously less than the preset stability threshold, step S315 is executed.
If the detection result in step S314 is no, that is, the speech segment of the target object cannot be effectively determined based on the foregoing "pronunciation criteria" and "pronunciation characteristics", step S315 is performed.
Step S315, in the tone color clustering result, calculating the total duration of the video segments corresponding to each tone color category, where the total duration of the video segments corresponding to a tone color category is equal to the sum of the durations of all the video segments to be processed contained in that tone color category.
Step S316, respectively extracting corresponding voice segments from each video segment to be processed contained in the tone class corresponding to the maximum total duration of the video segments, and taking all the extracted voice segments as target voice segment data.
In some embodiments, step S4 specifically includes constructing an index data structure of text description data, target picture data, target video segment data, target speech segment data.
Referring to fig. 5 and 6, with the target picture data and the target video segment data serving as association bridges, the text description data of the target person is sequentially associated with the target picture data, the target video segment data, and the target voice segment data, so that a corresponding relationship diagram of the four modal data can be established (shown in fig. 5). An index data structure can then be established from the relationship diagram according to the indexing requirements. The index data structure may be a directed graph data structure in which the node representing the text description data points to the node representing the target picture data, the node representing the target video segment data, and the node representing the target voice segment data, respectively. In the method, establishing the index data structure effectively builds the associations between data of different modes and can effectively improve indexing efficiency. Of course, in practical applications, additional associations between different data can be added to the index data structure shown in fig. 6 as needed; for example, if users tend to continue searching picture data after retrieving video data, an association from the video data to the picture data can be added.
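A minimal sketch of such a directed graph index is given below using a plain adjacency dictionary; the node labels are hypothetical, and any graph library could be used instead.

```python
from collections import defaultdict

class ModalIndex:
    def __init__(self):
        self.edges = defaultdict(set)                  # node -> nodes it points to

    def add_edge(self, src, dst):
        self.edges[src].add(dst)

    def targets(self, src):
        return sorted(self.edges[src])

index = ModalIndex()
index.add_edge("text:person_A", "pictures:person_A")
index.add_edge("text:person_A", "video_segments:person_A")
index.add_edge("text:person_A", "speech_segments:person_A")
# optional extra association, e.g. video data -> picture data, as described above
index.add_edge("video_segments:person_A", "pictures:person_A")
```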
Referring to fig. 7 and 8, in some embodiments, the multi-system data management method based on the multi-mode model further includes step S5.
Step S5, storing the data of each video segment in the target video segment data.
Among the multi-modal data, which mainly consist of text, pictures, voice, and video, the video data are usually the largest, occupy the most storage space, and are the most time-consuming to retrieve. To this end, the present disclosure provides a new method of storing the data of video segments.
In step S5, the step of storing the data of one target video segment of the target video segment data includes:
Step S501, calculating the picture similarity between any two adjacent video frame pictures in the target video segment, and dividing the target video segment into a plurality of video subsections between the two adjacent video frame pictures with the picture similarity less than or equal to the second similarity threshold.
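Step S501 can be sketched as a simple shot-boundary split. Frames are assumed to be decoded image arrays of identical shape, and the similarity measure used here (one minus the normalized mean absolute difference) is only one reasonable choice.

```python
import numpy as np

def frame_similarity(a, b):
    return 1.0 - float(np.mean(np.abs(a.astype(np.float32) - b.astype(np.float32)))) / 255.0

def split_into_subsegments(frames, second_similarity_threshold=0.8):
    # frames: non-empty list of uint8 arrays, the video frame pictures of the target video segment
    subsegments, current = [], [frames[0]]
    for prev, cur in zip(frames, frames[1:]):
        if frame_similarity(prev, cur) <= second_similarity_threshold:
            subsegments.append(current)                # boundary between prev and cur
            current = [cur]
        else:
            current.append(cur)
    subsegments.append(current)
    return subsegments
```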
Step S502, for each video sub-segment, a semantic segmentation model (such as the SAM model) is used to segment each video frame picture in the video sub-segment, so as to determine a foreground part image and a background part image in each video frame picture.
The foreground part image is an image of an area surrounded by the outline edge of the target person in the video frame picture, the background part image is other images except the foreground part image in the video frame picture, and the background part image comprises a moving object background image and a fixed object background image. The moving object background image refers to an image of an object capable of moving, such as an automobile or a pedestrian, in the background. A fixed object background image refers to an image of an object fixed in the background (unable to produce movement), such as an image of a road, a building, a tree.
Step S503, for each video sub-segment, extracting a face image and body posture information from the foreground part image in each video frame picture in the video sub-segment, and determining a shared body appearance image corresponding to the video sub-segment according to the foreground part images in at least part of the video frame pictures in the video sub-segment.
The shared body appearance image may be an image generated by some algorithm based on the parts of at least some of the foreground part images in the video frame pictures other than the face image. It may also simply be the image left after the face image is removed from the foreground part image of one particular video frame picture. The present disclosure does not limit the algorithm used to determine the shared body appearance image.
The shared body appearance image and the body posture information are configured to restore the actual body appearance image of the target person in the foreground part image of the corresponding video frame picture.
Optionally, a generative adversarial network is trained and the foreground is generated frame by frame from the face image and the body posture information, with the shared body appearance image used for supervised learning to restore the person. In other words, the foreground part image of the corresponding video frame picture can be obtained by generating the actual body appearance image of the target person from the body posture information and the shared body appearance image, and then stitching the actual body appearance image together with the face image.
Step S504, for each video sub-segment, carrying out fixed background merging processing on fixed background partial images in all video frame pictures included in the video sub-segment in space to obtain a complete fixed background image corresponding to the video sub-segment, and recording the regional position information of the fixed background partial images in the complete fixed background image in each video frame picture in the video sub-segment.
It should be noted that the merged complete fixed background image may be larger than the video shooting range (for example, when the shooting position is moving); the complete fixed background image is in fact stitched together from the fixed background part images of multiple video frame pictures.
The region position information is configured to be used for extracting a corresponding fixed background part image from the complete fixed background image. In some embodiments, the region position information comprises an upper left corner coordinate and a lower right corner coordinate of the fixed background part image in the complete fixed background image, and a rectangular region can be determined in the complete fixed background image through the upper left corner coordinate and the lower right corner coordinate to serve as a fixed object background image in the corresponding video frame picture.
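The use of the region position information can be sketched in one line: crop the complete fixed background image with the stored upper-left and lower-right coordinates to recover the fixed object background image of a frame (coordinate convention assumed: x is the column, y is the row).

```python
def crop_fixed_background(complete_background, top_left, bottom_right):
    (x1, y1), (x2, y2) = top_left, bottom_right
    return complete_background[y1:y2, x1:x2]           # fixed background part image of the frame
```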
Step S505, for each video sub-segment, determining sub-segment storage data corresponding to the video sub-segment.
The sub-segment storage data comprises a shared body appearance image corresponding to the video sub-segment, a complete fixed background image corresponding to the video sub-segment, a face image corresponding to each video frame picture included in the video sub-segment, body posture information corresponding to each video frame picture included in the video sub-segment, and region position information corresponding to each video frame picture included in the video sub-segment.
Compared with the video file of one video sub-segment, the storage space occupied by the sub-segment storage data corresponding to one video sub-segment is much smaller, and the sub-segment storage data is more convenient to search.
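As a sketch only, the sub-segment storage data of step S505 could be held in a structure such as the following; the concrete types of the images and posture records are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple
import numpy as np

@dataclass
class SubSegmentStorage:
    shared_body_appearance: np.ndarray                  # one image shared by the whole sub-segment
    complete_fixed_background: np.ndarray               # one stitched fixed background image
    face_images: List[np.ndarray] = field(default_factory=list)      # one per video frame picture
    body_postures: List[dict] = field(default_factory=list)          # one per video frame picture
    # ((x1, y1), (x2, y2)) of each frame's fixed background inside the complete background
    region_positions: List[Tuple[Tuple[int, int], Tuple[int, int]]] = field(default_factory=list)
```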
Step S506, storing the sub-segment storage data of all the video sub-segments included in the target video segment.
Of course, the management method provided by the invention can also comprise the step of storing text description data, target picture data and target voice segment data.
In addition, in practical applications, after the target voice segment data of the target person is obtained in step S3, feature extraction may be performed on at least some of the voice segments in the obtained target voice segment data to obtain the voice features of the target person. A person detection model based on the voice features is then obtained through machine training, and this model can be used directly to detect other voice segments (for example, pure audio files in the multimedia database) to determine whether they belong to the target person.
Based on the same inventive concept, the invention also provides a multi-system data management system based on the multi-mode model, which can be used for realizing the multi-system data management method based on the multi-mode model provided by the previous embodiment.
The multi-system data management system based on the multi-mode model comprises a picture acquisition module, a video segment acquisition module, a voice segment acquisition module and a correlation module.
The image acquisition module is configured to acquire target image data containing the target person from the Internet according to the text description data corresponding to the target person.
The video segment acquisition module is configured to acquire target video segment data containing the target person from a multimedia database according to the target picture data, wherein the multimedia database at least comprises a plurality of video files.
The voice segment acquisition module is configured to extract target voice segment data of a target person from the target video segment data.
The association module is configured to establish an association relationship between the text description data and the target speech segment data.
For the description of each functional module, reference may be made to the corresponding content in the foregoing embodiment, which is not repeated here.
According to an embodiment of the present invention, there is also provided a computer-readable medium. The computer readable medium has stored thereon a computer program, wherein the program when executed by a processor implements the steps of the multi-system data management method based on a multi-modal model as in any of the above embodiments.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, functional modules/units in the apparatus disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components, for example, one physical component may have a plurality of functions, or one function or step may be cooperatively performed by several physical components. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
Example embodiments have been disclosed herein, and although specific terms are employed, they are used and should be interpreted in a generic and descriptive sense only and not for purpose of limitation. In some instances, it will be apparent to one skilled in the art that features, characteristics, and/or elements described in connection with a particular embodiment may be used alone or in combination with other embodiments unless explicitly stated otherwise. It will therefore be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the scope of the present invention as set forth in the following claims.

Claims (8)

1. A multi-modal model-based multi-system data management method, comprising:
Step S1, acquiring target picture data containing a target person from the Internet according to text description data corresponding to the target person;
Step S2, acquiring target video segment data containing the target person from a multimedia database according to the target picture data, wherein the multimedia database at least comprises a plurality of video files;
Step S3, extracting target voice segment data of the target person from the target video segment data;
Step S4, establishing an association relation between the text description data and the target voice segment data;
Wherein, step S3 includes:
step S301, selecting a video segment containing voice from all video segments included in the target video segment data as a video segment to be processed;
Step S302, extracting tone characteristics of voice in each video segment to be processed;
Step S303, clustering each video segment to be processed according to tone characteristics to obtain tone clustering results, wherein each tone clustering result comprises n tone categories and the video segments to be processed contained in each tone category;
Step S304, judging whether n is equal to 1;
If the judgment result of step S304 is yes, step S305 is executed;
Step S305, extracting the corresponding voice segment from each video segment to be processed contained in the single tone color category in the tone color clustering result, wherein all the extracted voice segments are used as the target voice segment data;
In step S3, if the judgment result of step S304 is no, step S306 is executed;
Step S306, for each video segment to be processed in the tone color clustering result, recognizing each Chinese character voice in the video segment to be processed one by one according to the pronunciation of Chinese characters, and representing the recognition result of each Chinese character voice in pinyin form, wherein the recognition results of all the Chinese character voices in the video segment to be processed are arranged in sequence to form a first pinyin recognition result;
Step S307, for each video segment to be processed in the tone color clustering result, detecting the recognition result similarity between the corresponding first pinyin recognition result and second pinyin recognition result;
Step S308, detecting whether the recognition result similarities corresponding to all the video segments to be processed in the tone color clustering result are all greater than or equal to a first similarity threshold or are all smaller than the first similarity threshold;
If the detection result of step S308 is that the recognition result similarity corresponding to at least one video segment to be processed is greater than or equal to the first similarity threshold and the recognition result similarity corresponding to at least one video segment to be processed is smaller than the first similarity threshold, step S309 is executed;
Step S309, deleting, from the tone color clustering result, the video segments to be processed whose recognition result similarity is greater than or equal to the first similarity threshold, to update the tone color clustering result;
Step S310, judging whether the number of tone color categories in the updated tone color clustering result is equal to 1;
If the judgment result of step S310 is yes, step S305 is executed.
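By way of illustration only, the following Python sketch mirrors the flow of steps S301 to S310 above. The timbre feature extractor, the agglomerative clustering with a distance threshold, the use of difflib for the recognition result similarity, and all function and field names are assumptions made for the sketch; the claim does not prescribe any of them.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict, List

import numpy as np
from sklearn.cluster import AgglomerativeClustering


@dataclass
class PendingSegment:
    segment_id: str
    timbre: np.ndarray      # timbre feature vector of the speech in the segment (step S302)
    first_pinyin: str       # pinyin recognised from the audio (step S306)
    second_pinyin: str      # reference pinyin sequence compared against in step S307


def cluster_by_timbre(segments: List[PendingSegment],
                      distance_threshold: float = 1.0) -> Dict[int, List[PendingSegment]]:
    """Step S303: group segments whose timbre features lie close together."""
    if len(segments) < 2:
        return {0: list(segments)}
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=distance_threshold
    ).fit_predict(np.stack([s.timbre for s in segments]))
    clusters: Dict[int, List[PendingSegment]] = {}
    for seg, label in zip(segments, labels):
        clusters.setdefault(int(label), []).append(seg)
    return clusters


def recognition_similarity(seg: PendingSegment) -> float:
    """Step S307: similarity between the first and second pinyin recognition results."""
    return SequenceMatcher(None, seg.first_pinyin, seg.second_pinyin).ratio()


def select_target_segments(segments: List[PendingSegment],
                           sim_threshold: float = 0.8) -> List[PendingSegment]:
    """Steps S304-S310: narrow the timbre clusters until a single category remains."""
    clusters = cluster_by_timbre(segments)
    if len(clusters) == 1:                                   # step S304 -> step S305
        return next(iter(clusters.values()))
    sims = {s.segment_id: recognition_similarity(s) for s in segments}
    has_high = any(v >= sim_threshold for v in sims.values())
    has_low = any(v < sim_threshold for v in sims.values())
    if has_high and has_low:                                 # step S308 -> step S309
        clusters = {label: [s for s in segs if sims[s.segment_id] < sim_threshold]
                    for label, segs in clusters.items()}
        clusters = {label: segs for label, segs in clusters.items() if segs}
    if len(clusters) == 1:                                   # step S310 -> step S305
        return next(iter(clusters.values()))
    return []   # the further filtering of claims 5 and 6 would take over here
```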
2. The method according to claim 1, wherein step S1 comprises:
Step S101, acquiring the name of the target person from the text description data, and carrying out a picture search on the Internet with the name as a keyword to obtain a picture search result;
Step S102, screening out pictures containing frontal face images from the picture search result, wherein all the screened-out pictures form a picture set to be processed;
Step S103, clustering the pictures in the picture set to be processed based on the facial features of the frontal face images contained in the pictures, and calculating the picture quantity ratio corresponding to each face category according to the clustering result, wherein the picture quantity ratio is equal to the ratio of the number of pictures contained in the corresponding face category to the number of pictures contained in the picture set to be processed;
Step S104, judging whether the largest picture quantity ratio is greater than or equal to a first ratio threshold;
If the judgment result of step S104 is yes, step S105 is executed;
Step S105, taking all the pictures contained in the face category corresponding to the largest picture quantity ratio as the target picture data.
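A minimal sketch of the dominant-face selection in steps S103 to S105, assuming face embeddings have already been computed for each picture; the use of DBSCAN and the 0.5 default ratio threshold are placeholders, not part of the claim.

```python
from typing import List, Optional

import numpy as np
from sklearn.cluster import DBSCAN


def pick_target_pictures(pictures: List[str],
                         embeddings: List[np.ndarray],
                         ratio_threshold: float = 0.5,
                         eps: float = 0.6) -> Optional[List[str]]:
    """Steps S103-S105: cluster the frontal-face embeddings, compute each face
    category's picture quantity ratio, and keep the dominant category if its
    ratio reaches the threshold."""
    labels = DBSCAN(eps=eps, min_samples=2).fit_predict(np.stack(embeddings))
    counts = {}
    for label in labels:
        if label != -1:                         # -1 is DBSCAN noise, not a face category
            counts[label] = counts.get(label, 0) + 1
    if not counts:
        return None
    best = max(counts, key=counts.get)
    ratio = counts[best] / len(pictures)        # picture quantity ratio (step S103)
    if ratio >= ratio_threshold:                # step S104
        return [p for p, l in zip(pictures, labels) if l == best]   # step S105
    return None                                 # claim 3 widens the search keywords instead
```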
3. The method according to claim 2, further comprising, in step S1, step S100 before step S101:
Step S100, setting the value of i to 0;
If the judgment result of step S104 is no, step S106 is executed;
Step S106, adding 1 to the value of i;
Step S107, acquiring i feature words of the target person from the text description data, and carrying out a picture search on the Internet with the name and the i feature words together as keywords to obtain a picture search result;
wherein step S102 is executed again after step S107 ends;
and wherein, between step S102 and step S103, the method further comprises:
Step S103a, judging whether the number of pictures in the picture set to be processed is greater than or equal to a first number threshold;
If the judgment result of step S103a is no, step S103b is executed;
Step S103b, acquiring target picture data which is manually designated and contains the target person.
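The keyword-widening loop of claim 3 can be summarised as below; every callable passed in (the searcher, the frontal-face screen, the ratio check and the manual fallback) is a hypothetical placeholder, and the early exit once the feature words run out is likewise an added assumption.

```python
from typing import Callable, List, Sequence


def gather_candidate_pictures(name: str,
                              feature_words: Sequence[str],
                              search_pictures: Callable[[List[str]], list],
                              screen_front_faces: Callable[[list], list],
                              dominant_ratio_ok: Callable[[list], bool],
                              ask_operator: Callable[[], list],
                              min_pictures: int = 30) -> list:
    i = 0                                            # step S100
    keywords = [name]                                # step S101 starts with the name alone
    while True:
        results = search_pictures(keywords)          # steps S101 / S107
        candidates = screen_front_faces(results)     # step S102
        if len(candidates) < min_pictures:           # step S103a
            return ask_operator()                    # step S103b: manually designated pictures
        if dominant_ratio_ok(candidates):            # step S104 (see the claim-2 sketch)
            return candidates
        i += 1                                       # step S106
        if i > len(feature_words):                   # no feature words left to add
            return ask_operator()
        keywords = [name] + list(feature_words[:i])  # step S107: add one more feature word
```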
4. The method according to claim 1, wherein step S2 comprises:
Step S201, taking a picture in the target picture data as a reference, and detecting, for each video frame picture in each video file, whether the video frame picture contains the target person;
Step S202, among all the video frame pictures containing the target person, dividing a plurality of consecutive video frame pictures in the same video file into consecutive picture sets, and extracting the video segments corresponding to the consecutive picture sets from the corresponding video files, wherein all the extracted video segments are used as the target video segment data.
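Steps S201 and S202 reduce to grouping consecutive detections into runs; the sketch below assumes a per-frame detector has already produced a boolean flag for every frame, which is not something the claim specifies.

```python
from typing import List, Tuple


def consecutive_picture_sets(frame_flags: List[bool]) -> List[Tuple[int, int]]:
    """frame_flags[i] is True when frame i contains the target person (step S201).
    Returns (start, end) index pairs of maximal consecutive runs (step S202)."""
    runs, start = [], None
    for idx, flag in enumerate(frame_flags):
        if flag and start is None:
            start = idx
        elif not flag and start is not None:
            runs.append((start, idx - 1))
            start = None
    if start is not None:
        runs.append((start, len(frame_flags) - 1))
    return runs


# Example: frames 2-4 and 7 contain the person -> segments [(2, 4), (7, 7)]
print(consecutive_picture_sets([False, False, True, True, True, False, False, True]))
```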
5. The method according to claim 1, wherein, in step S3, if the judgment result of step S310 is no, step S311 is executed;
Step S311, for each video segment to be processed in the tone color clustering result, detecting a pronunciation characteristic stability index of the voice segment contained in the video segment to be processed;
Step S312, detecting whether the pronunciation characteristic stability indexes corresponding to all the video segments to be processed in the tone color clustering result are all greater than or equal to the corresponding preset stability threshold or are all smaller than the corresponding preset stability threshold;
If the detection result of step S312 is that the pronunciation characteristic stability index corresponding to at least one video segment to be processed is greater than or equal to the preset stability threshold and the pronunciation characteristic stability index corresponding to at least one video segment to be processed is smaller than the preset stability threshold, step S313 is executed;
Step S313, deleting, from the tone color clustering result, the video segments to be processed whose pronunciation characteristic stability index is greater than or equal to the preset stability threshold, to update the tone color clustering result;
Step S314, judging whether the number of tone color categories in the updated tone color clustering result is equal to 1;
If the judgment result of step S314 is yes, step S305 is executed.
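Claim 5 does not define how the pronunciation characteristic stability index is computed; purely for illustration, the sketch below uses the inverse of the frame-energy variance as a stand-in index and then applies the mixed-result filtering of steps S312 to S314.

```python
from typing import Callable, Dict, List, Sequence

import numpy as np


def stability_index(frame_energies: Sequence[float]) -> float:
    """Assumed stand-in for step S311: higher means a more stable pronunciation."""
    return 1.0 / (1.0 + float(np.var(np.asarray(frame_energies, dtype=float))))


def filter_by_stability(clusters: Dict[int, List[str]],
                        energies_of: Callable[[str], Sequence[float]],
                        threshold: float = 0.5) -> Dict[int, List[str]]:
    """Steps S312-S314: when some segments are at/above the threshold and some are
    below, drop the former and keep only non-empty tone color categories."""
    index = {seg: stability_index(energies_of(seg))
             for segs in clusters.values() for seg in segs}
    has_high = any(v >= threshold for v in index.values())
    has_low = any(v < threshold for v in index.values())
    if has_high and has_low:                                  # step S312 -> step S313
        clusters = {label: [s for s in segs if index[s] < threshold]
                    for label, segs in clusters.items()}
        clusters = {label: segs for label, segs in clusters.items() if segs}
    return clusters            # step S314 then checks whether one category remains
```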
6. The method according to claim 5, wherein, in step S3, if the judgment result of step S314 is no, step S315 is executed;
Step S315, calculating, in the tone color clustering result, the total video segment duration corresponding to each tone color category, wherein the total video segment duration corresponding to a tone color category is equal to the sum of the durations of all the video segments to be processed contained in the tone color category;
Step S316, extracting the corresponding voice segment from each video segment to be processed contained in the tone color category corresponding to the largest total video segment duration, and taking all the extracted voice segments as the target voice segment data.
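The fallback of claim 6 is a simple arg-max over total durations; `duration_of` below is a hypothetical accessor returning a segment's length in seconds and is not named by the claim.

```python
from typing import Callable, Dict, List


def longest_timbre_category(clusters: Dict[int, List[str]],
                            duration_of: Callable[[str], float]) -> List[str]:
    """Step S315: total duration per tone color category; step S316: keep the longest."""
    totals = {label: sum(duration_of(seg) for seg in segs)
              for label, segs in clusters.items()}
    best = max(totals, key=totals.get)
    return clusters[best]
```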
7. The method according to any one of claims 1 to 6, wherein step S4 comprises:
constructing an index data structure of the text description data, the target picture data, the target video segment data and the target voice segment data, wherein the index data structure is a directed graph data structure, and the node representing the text description data in the directed graph data structure points to the node representing the target picture data, the node representing the target video segment data and the node representing the target voice segment data, respectively.
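A sketch of the directed-graph index of claim 7 using networkx; the node naming scheme is an assumption made for illustration only.

```python
import networkx as nx


def build_person_index(person_id: str) -> nx.DiGraph:
    """One text node pointing at the picture, video-segment and voice-segment nodes."""
    g = nx.DiGraph()
    text_node = f"text:{person_id}"
    g.add_edge(text_node, f"pictures:{person_id}")        # text -> target picture data
    g.add_edge(text_node, f"video_segments:{person_id}")  # text -> target video segment data
    g.add_edge(text_node, f"voice_segments:{person_id}")  # text -> target voice segment data
    return g


if __name__ == "__main__":
    index = build_person_index("person_001")
    print(sorted(index.successors("text:person_001")))
```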
8. The method as recited in claim 1, further comprising:
Step S5, storing the data of each video segment in the target video segment data;
wherein, in step S5, storing the data of one target video segment of the target video segment data comprises:
Step S501, calculating the picture similarity between any two adjacent video frame pictures in the target video segment, and dividing the target video segment into a plurality of video sub-segments at the positions between two adjacent video frame pictures whose picture similarity is less than or equal to a second similarity threshold;
Step S502, segmenting each video frame picture in each video sub-segment by using a semantic segmentation model to determine a foreground part image and a background part image in each video frame picture, wherein the foreground part image is the image of the area enclosed by the contour edge of the target person in the video frame picture, the background part image is the remainder of the video frame picture other than the foreground part image, and the background part image comprises a moving object background image and a fixed object background image;
Step S503, for each video sub-segment, extracting a face image and body posture information from the foreground part image of each video frame picture in the video sub-segment, and determining one shared body appearance image corresponding to the video sub-segment according to the foreground part images of at least some of the video frame pictures in the video sub-segment;
Step S504, for each video sub-segment, spatially merging the fixed object background images in all the video frame pictures included in the video sub-segment to obtain a complete fixed background image corresponding to the video sub-segment, and recording, for each video frame picture in the video sub-segment, the region position information of its fixed object background image within the complete fixed background image;
Step S505, determining sub-segment storage data corresponding to each video sub-segment, wherein the sub-segment storage data comprises one shared body appearance image corresponding to the video sub-segment, one complete fixed background image corresponding to the video sub-segment, the face image corresponding to each video frame picture included in the video sub-segment, the body posture information corresponding to each video frame picture included in the video sub-segment, and the region position information corresponding to each video frame picture included in the video sub-segment;
Step S506, storing the sub-segment storage data of all the video sub-segments included in the target video segment.
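Finally, the per-sub-segment record of steps S501 to S506 can be pictured with the sketch below: the split on frame similarity is shown explicitly, while the segmentation, pose extraction and background merging are represented only by the fields they would fill. All names and the 0.7 threshold are assumptions, not part of the claim.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

import numpy as np


@dataclass
class SubSegmentRecord:                         # step S505
    shared_body_appearance: np.ndarray          # one image shared by the sub-segment (step S503)
    complete_fixed_background: np.ndarray       # spatially merged fixed background (step S504)
    face_images: List[np.ndarray]               # one face image per frame (step S503)
    body_poses: List[np.ndarray]                # body posture information per frame (step S503)
    region_positions: List[Tuple[int, int, int, int]]  # per-frame position within the fixed background (step S504)


def split_into_subsegments(frames: Sequence[np.ndarray],
                           similarity: Callable[[np.ndarray, np.ndarray], float],
                           threshold: float = 0.7) -> List[Sequence[np.ndarray]]:
    """Step S501: cut wherever adjacent frames are no more similar than the threshold."""
    cuts = [i + 1 for i in range(len(frames) - 1)
            if similarity(frames[i], frames[i + 1]) <= threshold]
    bounds = [0] + cuts + [len(frames)]
    return [frames[a:b] for a, b in zip(bounds, bounds[1:])]
```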