CN110399526B - Video title generation method and device and computer readable storage medium

Info

Publication number
CN110399526B
CN110399526B (application CN201910683487.0A)
Authority
CN
China
Prior art keywords
video
text information
information
video frames
titles
Prior art date
Legal status
Active
Application number
CN201910683487.0A
Other languages
Chinese (zh)
Other versions
CN110399526A (en)
Inventor
彭江军
周智昊
安明洋
熊欢
李时坦
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910683487.0A
Publication of CN110399526A
Application granted
Publication of CN110399526B

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval of video data
    • G06F16/74 Browsing; Visualisation therefor
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval using metadata automatically derived from the content
    • G06F16/7844 Retrieval using original textual content or text extracted from visual content or transcript of audio data
    • G06F16/7847 Retrieval using low-level visual features of the video content
    • G06F16/7867 Retrieval using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application relates to a video title generation method and apparatus, a computer-readable storage medium, and a computer device. The method comprises: extracting a plurality of video frames of a video as a plurality of target video frames; acquiring the image content of the target video frames, and obtaining text information of the video under each target video frame according to that image content; determining alternative titles of the video according to the text information of the video under each target video frame; and acquiring the comprehensive word frequency of each alternative title in the text information, and selecting, from the alternative titles, the alternative title with the largest comprehensive word frequency as the title of the video. The scheme provided by the application can automatically generate a video title from a plurality of target video frames of the video; because the global information of the video is considered together with the word-frequency information of the alternative titles, the video title is determined more accurately, improving the accuracy of video title determination.

Description

Video title generation method and device and computer readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for generating a video title, a computer-readable storage medium, and a computer device.
Background
With the rapid development of computer technology, video-viewing applications have proliferated, and more and more users watch videos through them. When selecting a video to watch, a user generally chooses according to the video title, so the title strongly influences a video's viewing rate.
In current video title determination methods, however, a single key video frame is generally extracted from the video, such as the frame with the largest face proportion or the richest motion, and the title of the video is then determined from the image content of that key frame. Because the title is determined from the image content of only one key video frame, it cannot reflect the global information of the video, so the accuracy of video title determination is low.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a video title generation method and apparatus, a computer-readable storage medium, and a computer device that address the technical problem of low accuracy in determining video titles.
A method of generating a video title, the method comprising:
extracting a plurality of video frames of a video as a plurality of target video frames;
acquiring image contents of the target video frames, and obtaining text information of the video under each target video frame according to the image contents of the target video frames;
determining alternative titles of the video according to the text information of the video under each target video frame;
and acquiring the comprehensive word frequency of each alternative title in the text information, and selecting, from the alternative titles, the alternative title with the largest comprehensive word frequency as the title of the video.
In one embodiment, the method further comprises:
and selecting, from all words contained in the alternative titles, the words whose word frequency exceeds a preset word frequency, to serve as labels of the video.
In one embodiment, the method further comprises:
receiving description information of the video uploaded by a user terminal;
extracting main body information and background information of the video from the description information of the video;
and updating the title of the video according to the main body information and the background information.
An apparatus for generating a video title, the apparatus comprising:
the video frame extraction module is used for extracting a plurality of video frames of the video as a plurality of target video frames;
the text information acquisition module is used for acquiring the image contents of the plurality of target video frames and acquiring text information of the video under each target video frame according to the image contents of the plurality of target video frames;
the alternative title determining module is used for determining alternative titles of the video according to the text information of the video under each target video frame;
and the alternative title screening module is used for acquiring the comprehensive word frequency of each alternative title in the text information, and selecting, from the alternative titles, the alternative title with the largest comprehensive word frequency as the title of the video.
A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
extracting a plurality of video frames of a video as a plurality of target video frames;
acquiring the image contents of the plurality of target video frames, and obtaining text information of the video under each target video frame according to the image contents of the plurality of target video frames;
determining alternative titles of the video according to the text information of the video under each target video frame;
and acquiring the comprehensive word frequency of each alternative title in the text information, and selecting, from the alternative titles, the alternative title with the largest comprehensive word frequency as the title of the video.
A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of:
extracting a plurality of video frames of a video as a plurality of target video frames;
acquiring the image contents of the plurality of target video frames, and obtaining text information of the video under each target video frame according to the image contents of the plurality of target video frames;
determining alternative titles of the video according to the text information of the video under each target video frame;
and acquiring the comprehensive word frequency of each alternative title in the text information, and selecting, from the alternative titles, the alternative title with the largest comprehensive word frequency as the title of the video.
According to the video title generation method and apparatus, the computer-readable storage medium, and the computer device, text information of the video under each target video frame is obtained from the image content of a plurality of extracted target video frames, alternative titles of the video are determined from that text information, and, based on the comprehensive word frequency of each alternative title in the text information, the alternative title with the largest comprehensive word frequency is selected as the title of the video. This achieves the purpose of automatically generating a video title from a plurality of target video frames of the video. Because the global information of the video is comprehensively considered, the video title is determined more accurately, improving the accuracy of title determination; and because the alternative titles are determined first, with the final title then chosen by comprehensive word frequency, the video title can be determined precisely, further improving accuracy.
Drawings
FIG. 1 is a diagram showing an example of an application environment of a method of generating a video title;
FIG. 2 is a block diagram of a computer device in one embodiment;
FIG. 3 is a flowchart illustrating a method for generating a video title according to an embodiment;
FIG. 4 is a flowchart illustrating the steps of extracting a plurality of target video frames of a video in one embodiment;
FIG. 5 is a flowchart illustrating the steps of obtaining a candidate set of key video frames for a video in one embodiment;
FIG. 6 is a flowchart illustrating the steps of filtering out valid key video frames in one embodiment;
FIG. 7 is a flowchart illustrating the steps of obtaining text information for a video in multiple target video frames according to one embodiment;
FIG. 8 is a flowchart illustrating the steps of determining alternative titles for a video in one embodiment;
FIG. 9 is a flowchart illustrating the step of obtaining the comprehensive word frequency of an alternative title in the text information according to an embodiment;
FIG. 10 is a flowchart illustrating the steps of updating the title of a video in one embodiment;
FIG. 11 is a flowchart illustrating a method of generating a video title according to another embodiment;
FIG. 12 is a diagram illustrating an interface for obtaining a video title in one embodiment;
FIG. 13 is a block diagram showing the configuration of a video title generation apparatus according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clearly understood, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Fig. 1 is an application environment diagram of a video title generation method in one embodiment. Referring to fig. 1, the application environment diagram includes a server 110. The server 110 extracts a plurality of video frames of the video as a plurality of target video frames, such as a target video frame 1 of the video, a target video frame 2 of the video, a target video frame 3 of the video, and the like, based on the acquired video; acquiring image contents of a plurality of target video frames, and acquiring text information of a video under each target video frame according to the image contents of the plurality of target video frames; determining alternative titles of the videos according to the text information of the videos in each target video frame; and acquiring the comprehensive word frequency of the alternative titles in the text information, and screening the alternative titles with the maximum comprehensive word frequency from the alternative titles to serve as the titles of the videos. In addition, the method for generating the video title may also be applied to a video uploading system, a video making system, a video recommending system, a video playing system, and the like, and the application is not limited specifically.
FIG. 2 is a diagram illustrating an internal structure of a computer device in one embodiment. The computer device may specifically be the server 110 in fig. 1. As shown in fig. 2, the computer apparatus includes a processor, a memory, a network interface, an input device, and a display screen connected through a system bus. The memory comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement a method of generating a video title. The internal memory may also have a computer program stored therein, which when executed by the processor, causes the processor to perform a method of generating a video title. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on a shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 2 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
As shown in fig. 3, in one embodiment, a method of generating a video title is provided. The embodiment is mainly illustrated by applying the method to the server 110 in fig. 1. Referring to fig. 3, the method for generating a video title specifically includes the following steps:
s302, a plurality of video frames of the video are extracted as a plurality of target video frames.
A video is composed of a plurality of static pictures, and these static pictures are called video frames; for example, a video that contains at least 24 video frames per second appears smooth, rather than stuttering, when viewed by the user. A target video frame refers to a key video frame in the video, and a key video frame is a key picture representing the main content of the video, such as a picture showing a person or object in motion, or a picture capturing a key action in a change.
It should be noted that one video may contain a plurality of target video frames; "a plurality" here means at least two, and the present application does not limit the exact number.
Specifically, the server collects videos whose titles have not yet been defined from the current network based on big data technology; preprocesses each video, for example by removing video noise and enhancing image definition; segments the preprocessed video to obtain a plurality of video frames; and analyzes the video frames by a frame difference method to determine which frames are key video frames, thereby obtaining the key video frames of the video and using them as the target video frames of the video.
Further, after the key video frames are obtained, the server may detect them to avoid treating background frames as key frames; for example, it determines whether each obtained key video frame is a valid key video frame and, if so, uses it as a target video frame of the video. Detecting the obtained key video frames in this way yields valid key video frames as target video frames, making the extracted target video frames more accurate and further improving the accuracy with which the target video frames of the video are determined.
In one embodiment, the server may further extract a video to be analyzed from the locally cached video; carrying out segmentation processing on a video to be analyzed to obtain a plurality of video frames; analyzing and processing the video frames by a frame difference method to determine which video frame is a key video frame and which video frame is not the key video frame, thereby obtaining the key video frame of the video; and detecting the key video frame, and if the key video frame is effective, taking the key video frame as a target video frame of the video so as to obtain a plurality of target video frames of the video.
S304, acquiring the image contents of the plurality of target video frames, and obtaining the text information of the video under each target video frame according to the image contents of the plurality of target video frames.
The image content of a target video frame represents the static picture content corresponding to that frame; the text information of a target video frame describes that static picture content and is obtained from the image content of the frame. For example, if the image content of a target video frame shows a gull taking off from the water surface, the corresponding text information is "a gull flying over the water surface".
Specifically, the server obtains text information of the video under each target video frame by adopting an image description method; for example, the server obtains the still picture contents of a plurality of target video frames as the image contents of the target video frames; and extracting image characteristic information in the image content of the target video frame, and decoding the image characteristic information to translate the image characteristic information into natural sentences so as to obtain text information of the target video frame. The image characteristic information is used for representing characteristic information in the image content of the target video frame. Therefore, the text information of the video under each target video frame is automatically generated according to the image content of the obtained target video frame, the defect that the process of manually determining the text information of the target video frame is complex is overcome, the labor cost is further reduced, and meanwhile, the method is beneficial to determining the alternative titles of the video according to the text information of the video under each target video frame in the follow-up process.
S306, determining the alternative titles of the video according to the text information of the video under each target video frame.
The alternative titles refer to video titles to be determined.
Specifically, the server combines the text information of the video under each target video frame to obtain a text information set; clusters the text information set to obtain a plurality of text information clusters; and takes the text information corresponding to the center of each cluster as an alternative title of the video. A text information cluster is a set of similar text information; the text information that, when taken as a center, yields the best clustering effect for its cluster is the center of that cluster.
In one embodiment, the server clusters the text information set into a plurality of text information clusters as follows: the server takes the text information under each target video frame as a single data object, counts the distances between the data objects, and groups data objects that are close together into a text information cluster, thereby obtaining a plurality of text information clusters.
S308, acquiring the comprehensive word frequency of each alternative title in the text information, and selecting, from the alternative titles, the alternative title with the largest comprehensive word frequency as the title of the video.
The word frequency refers to the frequency with which a word occurs in the text information of the target video frames; for example, if a word occurs in the text information of 5 out of 10 target video frames, the word frequency of that word is 50%. The comprehensive word frequency of an alternative title is the sum, over each word in the title, of that word's frequency of occurrence in the text information of the target video frames.
Specifically, the server acquires each word in an alternative title, counts the frequency with which each word occurs in the text information of each target video frame to obtain the word frequency of each word, and sums the word frequencies of all words in the alternative title as its comprehensive word frequency in the text information; the alternative title with the largest comprehensive word frequency is then selected from the alternative titles as the title of the video. Determining the alternative titles first and then choosing the title by comprehensive word frequency facilitates accurate determination of the video title and further improves its accuracy.
Further, the server may also determine, from the word frequency of each word in the alternative titles, the words whose word frequencies rank highest, and use them as video labels; for example, the two words with the highest word frequencies may be used as video labels identifying the key information of the video.
According to this video title generation method, text information of the video under each target video frame is obtained from the image content of a plurality of extracted target video frames, alternative titles of the video are determined from that text information, and, based on the comprehensive word frequency of each alternative title in the text information, the alternative title with the largest comprehensive word frequency is selected as the title of the video. This achieves the purpose of automatically generating a video title from a plurality of target video frames of the video; because the global information of the video is comprehensively considered, the video title is determined more accurately, improving the accuracy of title determination; and because the alternative titles are determined first, with the final title then chosen by comprehensive word frequency, the video title can be determined precisely, further improving accuracy.
As shown in fig. 4, in an embodiment, the extracting multiple video frames of the video in step S302 as multiple target video frames specifically includes the following steps:
step S402, a key video frame candidate set of the video is obtained.
The key video frame candidate set refers to a set composed of key video frames.
Specifically, the server extracts a video to be analyzed from a locally cached video; or receiving a video uploaded by a user as a video to be analyzed; carrying out segmentation processing on a video to be analyzed to obtain a plurality of video frames; analyzing and processing the obtained video frames by a frame difference method (such as a two-frame difference method and a three-frame difference method) to obtain key video frames of the video; and collecting the key video frames of the video to obtain a candidate set of the key video frames of the video.
Further, key video frames of a video may also be obtained by: the method comprises the steps that a server obtains a key video frame screening instruction, such as a two-frame difference instruction, a three-frame difference instruction and the like; screening the video frames of the video according to the key video frame screening instruction to obtain key video frames of the video; the key video frame screening instruction refers to an instruction for identifying whether a video frame is a key video frame.
Step S404, a plurality of valid key video frames are screened out from the key video frame candidate set as a plurality of target video frames of the video.
The valid key video frames refer to real key video frames, and specifically refer to key video frames that do not belong to background frames.
Specifically, the server detects a key video frame in the key video frame candidate set, and if the key video frame does not belong to a background frame, the key video frame is determined to be an effective key video frame; if the key video frame belongs to the background frame, confirming that the key video frame is an invalid key video frame; by the method, a plurality of effective key video frames can be screened out from the key video frame candidate set and serve as a plurality of target video frames of the video.
In the embodiment, the key video frame candidate set is obtained, so that the key video frames of the video can be screened out quickly, and some key video frames are prevented from being missed; and then screening a plurality of effective key video frames from the key video frame candidate set as a plurality of target video frames of the video, which is beneficial to eliminating background frames, so that the extracted target video frames are more accurate, and the determination accuracy of the target video frames of the video is further improved.
As shown in fig. 5, in an embodiment, the obtaining of the key video frame candidate set of the video in step S402 specifically includes the following steps:
step S502, sampling the video according to a preset video frame sampling frequency to obtain video frames of the video in different time periods.
The video frame sampling frequency refers to the number of video frames sampled in a video per second, for example, the video frame sampling frequency is 1 video frame sampled per second, and refers to one video frame collected from a plurality of video frames included in a video per second.
Specifically, the server acquires a preset video frame sampling frequency, and samples the video according to the preset video frame sampling frequency to obtain video frames of the video in different time periods; therefore, the method is beneficial to uniformly collecting the video frames and avoiding the interference of redundant video frames, thereby reducing the video frame range for screening the key video frames and further improving the determination accuracy of the key video frames.
For example, consider a 5-minute video with 24 video frames per second: the video contains 5 × 60 × 24 = 7,200 video frames in total. Since the image content of the video frames within any given second is substantially the same, one video frame can be sampled per second, giving 5 × 60 × 1 = 300 video frames, such as the video frame of the 1st second, the 2nd second, the 3rd second, ..., the 299th second, and the 300th second.
Step S504, respectively counting the difference absolute value between the video frames of the adjacent time periods, and taking the video frame of the next time period in the video frames of the adjacent time periods with the difference absolute value larger than the preset threshold value as the key video frame.
The video frames have corresponding image information, and the image information is stored in a computer in a digital matrix form, so that each video frame is a digital matrix; the difference absolute value between the video frames of the adjacent time periods refers to the absolute value of the difference between the digital matrixes corresponding to the video frames of the adjacent time periods; the preset threshold is used for judging whether a video frame in a later time period in video frames in adjacent time periods is a key video frame or not, and can be obtained according to historical data.
It should be noted that, the video frames in the adjacent time periods have strong similarity, and when the object in the video does not change, the absolute difference value between the video frames in the adjacent time periods is small; when the object in the video changes greatly, the absolute difference value between the video frames in the adjacent time periods is larger.
Specifically, the server respectively counts difference absolute values between video frames of adjacent time periods, and marks the video frame of a subsequent time period in the video frames of the adjacent time periods, of which the difference absolute value is greater than a preset threshold value; and screening the marked video frames from the video frames in different time periods to serve as key video frames of the video.
For example, if the absolute value of the difference between the 1st-second video frame and the 2nd-second video frame is greater than the preset threshold, the 2nd-second frame is used as a key video frame; if the absolute value of the difference between the 2nd-second and 3rd-second frames is less than or equal to the preset threshold, the 3rd-second frame is not retained; if the absolute value of the difference between the 3rd-second and 4th-second frames is greater than the preset threshold, the 4th-second frame is used as a key video frame; and so on, yielding the key video frames of the video, such as the frames of the 1st, 2nd, and 4th seconds. It should be noted that the starting video frame (e.g., the 1st-second video frame) is generally kept as a candidate key video frame.
Step S506, a key video frame candidate set is constructed according to the key video frames.
Specifically, the server gathers the starting video frame and the obtained key video frame to obtain a key video frame candidate set.
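For illustration, the sampling and differencing of steps S502-S506 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the patent's reference implementation: OpenCV is assumed for decoding, frames are compared as grayscale matrices, and DIFF_THRESHOLD stands in for the preset threshold that the patent derives from historical data.

```python
# Minimal sketch of steps S502-S506: sample one frame per second, then keep
# frames whose mean absolute difference from the previous sample exceeds a
# threshold. OpenCV usage and the threshold value are illustrative assumptions.
import cv2
import numpy as np

DIFF_THRESHOLD = 20.0  # assumed; in practice derived from historical data

def build_keyframe_candidate_set(video_path: str) -> list:
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 24
    candidates, prev, frame_idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % int(fps) == 0:  # one sample per second (S502)
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
            if prev is None:
                candidates.append(frame)  # starting frame is always kept
            elif np.abs(gray - prev).mean() > DIFF_THRESHOLD:
                # absolute difference between adjacent sampled frames (S504)
                candidates.append(frame)
            prev = gray
        frame_idx += 1
    cap.release()
    return candidates  # the key video frame candidate set (S506)
```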
In this embodiment, video frames of a video are uniformly acquired to obtain video frames of the video in different time periods, and a video frame in a later time period in video frames of an adjacent time period with a difference absolute value greater than a preset threshold is used as a key video frame to further obtain a key video frame candidate set; the method is beneficial to quickly screening out the key video frames of the video, avoids missing some key video frames, enables the subsequently extracted target video frames to be more accurate, and further improves the determination accuracy of the target video frames of the video.
Further, considering that some background frames may exist in the obtained candidate set of key video frames, the key video frames of the candidate set of key video frames need to be detected through a pre-trained video frame prediction model to eliminate some background frames that may exist.
As shown in fig. 6, in an embodiment, the screening out a plurality of valid key video frames from the key video frame candidate set in step S404 specifically includes the following steps:
step S602, respectively inputting the key video frames in the key video frame candidate set into a pre-trained video frame prediction model, to obtain the probability that each key video frame belongs to an effective key video frame.
The video frame prediction model is a neural network model capable of predicting key video frames and is obtained through historical data training.
Specifically, the server inputs the key video frames in the key video frame candidate set into a pre-trained video frame prediction model respectively, and performs prediction analysis on the key video frames through the video frame prediction model to obtain the probability that each key video frame belongs to an effective key video frame.
In one embodiment, the video frame prediction model may be trained by: obtaining a plurality of sample key video frames; training a video frame prediction model to be trained according to the sample key video frame to obtain a trained video frame prediction model; obtaining the probability that a sample key video frame output by a video frame prediction model belongs to an effective key video frame, and counting the error between the preset probability and the output probability if the output probability is smaller than the preset probability; when the error is larger than or equal to the preset error, repeatedly training the video frame prediction model according to the error until the error obtained according to the trained video frame prediction model is smaller than the preset error; the trained video frame prediction model is used as a video frame prediction model trained in advance in the application.
In step S604, the key video frames with the probability greater than the preset probability are used as valid key video frames.
The preset probability is used for measuring whether the key video frame is a valid key video frame.
Specifically, if the probability that a key video frame belongs to a valid key video frame is greater than the preset probability, the key video frame is taken as a valid key video frame; if the probability is less than or equal to the preset probability, the key video frame is treated as an invalid key video frame and is not retained. In this way, a plurality of valid key video frames can be screened out.
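The patent does not specify the prediction model's architecture, so the filtering of steps S602-S604 is sketched below with the model abstracted as a callable returning the probability that a frame is a valid (non-background) key frame; PRESET_PROBABILITY is likewise an assumed value.

```python
# Sketch of steps S602-S604. `predict_valid_prob` stands in for the
# pre-trained video frame prediction model, whose architecture the patent
# leaves unspecified; PRESET_PROBABILITY is an assumed cutoff.
PRESET_PROBABILITY = 0.5

def filter_valid_keyframes(candidates, predict_valid_prob):
    valid = []
    for frame in candidates:
        p = predict_valid_prob(frame)  # probability the frame is a valid key frame
        if p > PRESET_PROBABILITY:     # keep only frames above the preset probability
            valid.append(frame)
    return valid                       # the target video frames of the video
```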
In this embodiment, effective key video frames are screened out from the key video frame candidate set, which is beneficial to eliminating background frames, so that the extracted target video frames are more accurate, and the determination accuracy of the target video frames of the video is further improved.
As shown in fig. 7, in an embodiment, the obtaining text information of the video in the multiple target video frames according to the image contents of the multiple target video frames in step S304 specifically includes the following steps:
step S702, inputting the image contents of a plurality of target video frames into a pre-trained CNN model respectively to obtain the image characteristic information of each target video frame; the pre-trained CNN model is used for extracting image characteristic information in the image content of a plurality of target video frames.
The CNN (Convolutional Neural Network) model refers to a convolutional neural network model, specifically a model for extracting image feature information, such as the GoogLeNet model.
Specifically, the server inputs the image contents of the plurality of target video frames into a pre-trained CNN model respectively, and extracts the image feature information of the image contents based on convolution through the pre-trained CNN model, thereby obtaining the image feature information of each target video frame.
Further, the pre-trained CNN model is obtained as follows: the CNN model is first pre-trained on the ImageNet database, which yields good initial network parameters; the image content of preset sample video frames and the corresponding actual image feature information are acquired, and the pre-trained CNN model is retrained on the image content of the sample video frames to adjust its network parameters; the error between the image feature information output by the retrained CNN model and the corresponding actual image feature information is acquired; when the error is greater than or equal to a first preset threshold, the error is back-propagated to the CNN model so that its network parameters are adjusted accordingly, giving an adjusted CNN model; the adjusted CNN model is trained repeatedly until the error obtained from the trained CNN model is smaller than the first preset threshold; the trained CNN model is then used as the pre-trained CNN model of the present application. This achieves the purpose of accurately configuring the network parameters of the CNN model, further improving the accuracy of the image feature information it outputs.
Step S704, respectively inputting the image characteristic information of each target video frame into a pre-trained LSTM model to obtain the text information of the video under each target video frame; the pre-trained LSTM model is used for generating text information corresponding to the image characteristic information of each target video frame.
The LSTM (Long Short-Term Memory) model refers to a Long Short-Term Memory network model, and is used for outputting text information corresponding to image information.
Specifically, the server converts the image feature information of each target video frame into corresponding feature vectors, inputs the feature vectors corresponding to the image feature information of each target video frame into a pre-trained LSTM model, and analyzes and processes the feature vectors corresponding to the image feature information of each target video frame through the pre-trained LSTM model to obtain text information of the video in each target video frame.
Further, the pre-trained LSTM model is obtained as follows: a pre-trained LSTM model with good initial network parameters is acquired; feature vectors corresponding to the image feature information of preset sample video frames and the corresponding actual text information are acquired, and the pre-trained LSTM model is retrained on those feature vectors to adjust its network parameters; the error between the text information output by the retrained LSTM model and the corresponding actual text information is acquired; when the error is greater than or equal to a second preset threshold, the error is back-propagated to the LSTM model so that its network parameters are adjusted accordingly, giving an adjusted LSTM model; the adjusted LSTM model is trained repeatedly until the error obtained from the trained LSTM model is smaller than the second preset threshold; the trained LSTM model is then used as the pre-trained LSTM model of the present application. This achieves the purpose of accurately configuring the network parameters of the LSTM model, further improving the accuracy of the text information it outputs.
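A compressed sketch of the CNN-encoder / LSTM-decoder pairing of steps S702-S704 is given below, using PyTorch. GoogLeNet is the CNN named above as an example; the embedding and hidden sizes, the vocabulary handling, and the greedy decoding loop are illustrative assumptions rather than the patent's design. In practice the decoder would first be trained on (frame, caption) pairs as described above.

```python
# Illustrative encoder-decoder sketch for steps S702-S704 (PyTorch assumed).
# Vocabulary size, hidden sizes, and greedy decoding are demonstration choices.
import torch
import torch.nn as nn
from torchvision import models

class CaptionSketch(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        cnn = models.googlenet(weights="IMAGENET1K_V1")  # pre-trained CNN (S702)
        cnn.fc = nn.Identity()                           # expose 1024-d features
        self.encoder = cnn
        self.project = nn.Linear(1024, embed_dim)        # feature -> decoder input
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # decoder (S704)
        self.out = nn.Linear(hidden_dim, vocab_size)

    @torch.no_grad()
    def describe(self, image: torch.Tensor, max_len: int = 20) -> list:
        self.eval()
        feat = self.project(self.encoder(image))         # image feature information
        inp, state, words = feat.unsqueeze(1), None, []
        for _ in range(max_len):                         # greedy decoding
            hidden, state = self.lstm(inp, state)
            word_id = self.out(hidden[:, -1]).argmax(-1) # most likely next word
            words.append(int(word_id))
            inp = self.embed(word_id).unsqueeze(1)
        return words  # token ids, mapped back to text via the vocabulary
```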
In the embodiment, the text information of the video under each target video frame is automatically generated according to the image content of the obtained target video frame, so that the defect that the process of manually determining the text information of the target video frame is complicated is avoided, the labor cost is further reduced, and meanwhile, the method is beneficial to determining the alternative titles of the video according to the text information of the video under each target video frame in the follow-up process.
As shown in fig. 8, in an embodiment, the determining, in step S306, the alternative titles of the video according to the text information of the video in each target video frame specifically includes the following steps:
step S802, clustering the text information of the video under each target video frame to obtain a text information cluster.
Wherein each text information cluster comprises one or more text information.
Step S804, recognizing the text information corresponding to the center of each text information cluster as the target text information.
The target text information refers to text information corresponding to the center of the text information cluster.
Specifically, the server analyzes the text information in each text information cluster to obtain the center of each text information cluster, and takes the text information corresponding to the center of each text information cluster as the target text information.
Step S806, obtaining an alternative title of the video according to the target text information.
For example, suppose that text information of a video under each target video frame is clustered to obtain 4 text information clusters; and taking the text information corresponding to the centers of the 4 text information clusters as target text information to obtain 4 target text information, and taking the 4 target text information as 4 alternative titles to obtain 4 alternative titles of the video.
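The patent leaves the clustering algorithm open, so the following sketch makes one plausible choice: TF-IDF vectors clustered with k-means (scikit-learn assumed), with the caption closest to each cluster center standing in for that center. For Chinese text, the default tokenizer would need to be replaced with a word segmenter.

```python
# Sketch of steps S802-S806 with assumed TF-IDF + k-means clustering;
# the patent does not mandate a particular clustering algorithm.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances_argmin_min

def candidate_titles(frame_texts: list[str], n_clusters: int = 4) -> list[str]:
    vec = TfidfVectorizer()                       # Chinese would need a segmenter
    X = vec.fit_transform(frame_texts)            # one vector per frame text
    km = KMeans(n_clusters=n_clusters, n_init="auto").fit(X)
    # the text nearest each cluster center stands in for that center (S804)
    idx, _ = pairwise_distances_argmin_min(km.cluster_centers_, X)
    return [frame_texts[i] for i in idx]          # alternative titles (S806)
```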
In this embodiment, the candidate titles of the video can be effectively determined by clustering the text information of the video in each target video frame, so that the determination accuracy of the candidate titles of the video is improved, the subsequent determination of the video titles is more accurate, and the determination accuracy of the video titles is further improved.
As shown in fig. 9, in an embodiment, the acquiring a comprehensive word frequency of the candidate title in the text information in step S308 specifically includes the following steps:
and step S902, performing word segmentation processing on the alternative titles to obtain words contained in the alternative titles.
Specifically, the server performs word segmentation processing on the alternative titles, such as a word segmentation method based on string matching, a word segmentation method based on understanding, a word segmentation method based on statistics, and the like, so as to obtain words included in the alternative titles.
For example, if the alternative title is "revenge alliance", then performing word segmentation on the alternative title may obtain the following words included in the alternative title: "revenge", "league".
Step S904, counting, for each word contained in the alternative titles, its frequency of occurrence in the text information under each target video frame, to obtain the word frequency of each word contained in the alternative titles.
Step S906, adding the word frequencies of all words contained in an alternative title to obtain the comprehensive word frequency of that alternative title in the text information.
Specifically, the server counts, for each word contained in the alternative titles, the frequency with which it occurs in the text information under each target video frame, and uses that frequency as the word's word frequency; the word frequencies of all words contained in an alternative title are then added to obtain the comprehensive word frequency of the alternative title in the text information.
For example, assume that alternative title a contains 3 words in total: word a, word b, and word c. If the frequency of occurrence of word a in the text information of each target video frame is a1, that of word b is b1, and that of word c is c1, then the comprehensive word frequency of alternative title a in the text information is a1 + b1 + c1.
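The arithmetic of steps S902-S906 and the selection of step S308 can be made concrete as follows. The tokenize helper is a whitespace placeholder; a real system would substitute one of the word segmentation methods mentioned in step S902.

```python
# Sketch of steps S902-S906: word frequency = fraction of frame texts in which
# a word appears; a title's comprehensive word frequency is the sum over its
# words. `tokenize` is a whitespace placeholder for a real word segmenter.
def tokenize(text: str) -> list[str]:
    return text.split()

def word_frequency(word: str, frame_texts: list[str]) -> float:
    hits = sum(1 for t in frame_texts if word in tokenize(t))
    return hits / len(frame_texts)

def comprehensive_word_frequency(title: str, frame_texts: list[str]) -> float:
    return sum(word_frequency(w, frame_texts) for w in tokenize(title))

def pick_title(candidates: list[str], frame_texts: list[str]) -> str:
    # step S308: keep the candidate with the largest comprehensive word frequency
    return max(candidates, key=lambda c: comprehensive_word_frequency(c, frame_texts))
```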
In this embodiment, counting the comprehensive word frequency of each alternative title in the text information allows the title of the video to be determined from those comprehensive word frequencies, which facilitates accurate determination of the video title and further improves its accuracy.
Further, the video title generation method of the present application also includes: selecting, from all words contained in the alternative titles, the words whose word frequency exceeds a preset word frequency, to serve as labels of the video.
Wherein the label of the video is used for identifying the key information of the video.
For example, alternative title 1 contains the words: word a, word b, and word c, with word frequencies of 10%, 15%, and 35% respectively; alternative title 2 contains: word d, word e, and word f, with word frequencies of 45%, 5%, and 20%; alternative title 3 contains: word g, word h, and word i, with word frequencies of 10%, 40%, and 15%. Assuming the preset word frequency is 30%, the words exceeding it are word d, word h, and word c (listed in descending order of word frequency); word d, word h, and word c are then used as the labels of the video.
In the embodiment, the words with the word frequency higher than the preset word frequency in the alternative titles are used as the labels of the videos, so that the key information of the videos can be accurately identified.
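Continuing the sketch above (reusing the assumed tokenize and word_frequency helpers), the label step is a simple threshold filter:

```python
# Sketch of the label step: keep words whose frequency exceeds a preset value,
# ordered from highest to lowest frequency. PRESET_WORD_FREQUENCY is assumed.
PRESET_WORD_FREQUENCY = 0.30

def video_labels(candidates: list[str], frame_texts: list[str]) -> list[str]:
    words = {w for c in candidates for w in tokenize(c)}
    scored = [(w, word_frequency(w, frame_texts)) for w in words]
    return [w for w, f in sorted(scored, key=lambda x: -x[1])
            if f > PRESET_WORD_FREQUENCY]
```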
As shown in fig. 10, in an embodiment, after the alternative title with the largest comprehensive word frequency is selected from the alternative titles in step S308 and used as the title of the video, the method further includes a step of updating the title of the video, which specifically includes the following steps:
step S1002, receiving description information of a video uploaded by the user terminal.
The description information of the video refers to short information for describing the video, and specifically may be a short text description of the video uploaded by the user or a user comment.
For example, as shown in fig. 12, when a user uploads a video, description information of the video is uploaded, and then the server may receive the description information of the video uploaded by the user terminal.
In step S1004, the main body information and the background information of the video are extracted from the description information of the video.
The main body information of the video refers to the name of a main body in the video; the background information of the video refers to the occurrence background of a story in the video, such as a place, a title of a drama, and the like.
Specifically, the server identifies the main body information and the background information in the description information through semantic recognition technology, and extracts the main body information and the background information from the description information of the video, respectively, as the main body information and background information of the video.
For example, the server acquires an identifier of the main body information and an identifier of the background information, and extracts information corresponding to the identifier of the main body information from the description information of the video as the main body information of the video; and extracting information corresponding to the identifier of the background information from the description information of the video to be used as the background information of the video, thereby obtaining the main body information and the background information of the video.
In addition, the server can analyze the description information of the video through a pre-trained sentence structure analysis model to obtain its structural components, such as subject, predicate, object, attributive, adverbial, and complement; the subject of the description information is extracted as the main body information of the video, and the background information is derived from the adverbial and complement components; if there are no adverbials or complements, no background information is generated.
In step S1006, the title of the video is updated according to the main body information and the background information.
Specifically, the server recognizes the title of the video through the semantic recognition technology described above to obtain the initial main body information and initial background information of the video, and replaces them with the newly obtained main body information and background information, respectively, thereby updating the title of the video and obtaining the final video title.
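Since the patent describes the update only abstractly ("semantic recognition technology"), the sketch below shows just the substitution mechanics, with extract_subject_and_background left as an assumed black box that returns (main body, background) strings, or None for a missing component.

```python
# Sketch of steps S1002-S1006. `extract_subject_and_background` stands in for
# the semantic-recognition step the patent leaves unspecified; it would return
# (subject, background) strings extracted from a piece of text, or None.
def update_title(title: str, description: str, extract_subject_and_background) -> str:
    new_subject, new_background = extract_subject_and_background(description)
    old_subject, old_background = extract_subject_and_background(title)
    if new_subject and old_subject:
        title = title.replace(old_subject, new_subject)      # refine main body
    if new_background and old_background:
        title = title.replace(old_background, new_background)  # refine background
    return title  # the final video title
```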
In this embodiment, the subject information and the background information of the video title are updated through the description information of the video uploaded by the user terminal, so that the purpose of refining and completing the video title is achieved, the obtained video title is more accurate, and the accuracy rate of determining the video title is further improved.
In an embodiment, as shown in fig. 11, another method for generating a video title is provided, which specifically includes the following steps:
step S1102, receiving the video uploaded by the user terminal and the corresponding description information.
And step S1104, processing the video by using a frame difference method to obtain a key video frame candidate set of the video.
Specifically, the server uniformly samples the video according to a preset video frame sampling frequency to obtain video frames of the video in different time periods; respectively counting difference absolute values between the video frames of the adjacent time periods, and taking the video frame of the next time period in the video frames of the adjacent time periods with the difference absolute value larger than a preset threshold value as a key video frame; and constructing a key video frame candidate set according to the key video frames.
Step S1106, a plurality of effective key video frames are screened out from the key video frame candidate set through a pre-trained video frame prediction model, and are used as a plurality of target video frames of the video.
Specifically, the server inputs each key video frame in the key video frame candidate set into a pre-trained video frame prediction model respectively to obtain the probability that each key video frame belongs to an effective key video frame; and taking the key video frames with the probability greater than the preset probability as effective key video frames, and taking the effective key video frames as target video frames of the video, thereby obtaining a plurality of target video frames.
Step S1108, acquiring image contents of the multiple target video frames, and obtaining text information of the video in each target video frame according to the image contents of the multiple target video frames.
Specifically, the server acquires image contents of a plurality of target video frames, and respectively inputs the image contents of the plurality of target video frames into a pre-trained CNN model to obtain image characteristic information of each target video frame; and respectively inputting the image characteristic information of each target video frame into a pre-trained LSTM model to obtain the text information of the video under each target video frame.
Step S1110, clustering the text information of the video under each target video frame to obtain a text information cluster; recognizing text information corresponding to the center of each text information cluster as target text information; and taking the target text information as an alternative title of the video.
Step S1112, acquiring the comprehensive word frequency of each alternative title in the text information, and selecting, from the alternative titles, the alternative title with the largest comprehensive word frequency as the title of the video.
Specifically, the server performs word segmentation processing on the alternative titles to obtain the words contained in them; counts, for each word, its frequency of occurrence in the text information under each target video frame to obtain its word frequency; adds the word frequencies of all words contained in an alternative title to obtain the title's comprehensive word frequency in the text information; and uses the alternative title with the largest comprehensive word frequency as the title of the video.
Step S1114, extracting the main body information and the background information of the video from the description information of the video; and updating the title of the video according to the main body information and the background information.
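The replacement in step S1114 can be sketched as below, assuming a hypothetical `extract_subject_background` helper; the patent does not specify how the main body and background information are recognized (e.g., by named-entity or scene recognition), so that helper is purely an assumption.

```python
def update_title(title, description, extract_subject_background):
    """Replace the title's initial main body and background information with
    those extracted from the uploader's description information."""
    subject, background = extract_subject_background(description)
    init_subject, init_background = extract_subject_background(title)
    if init_subject and subject:
        title = title.replace(init_subject, subject)
    if init_background and background:
        title = title.replace(init_background, background)
    return title
```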
Further, the server takes the updated video title as a final video title, and adds the final video title to a video preview interface corresponding to the video uploaded by the user terminal.
Referring to fig. 12, a video playing application is taken as an example. The user terminal is installed with a video playing application program. Suppose a user uploads a video A and its corresponding description information B from a local file library through the user terminal interface, thereby triggering a video title generation request that the user terminal sends to the corresponding server. The server parses the received video title generation request to obtain the video A uploaded by the user terminal and the corresponding description information B; processes the video A and the description information B to obtain the video title of video A; and pushes that title to the video playing application program so that it is displayed on the application's upload page.
It should be noted that the video title generation method of the present application may also be applied to scenes other than video uploading; the present application is not specifically limited in this respect.
In this embodiment, the title of the video is generated automatically from a plurality of target video frames of the video. Because the global information of the video is comprehensively considered, the video title is determined more accurately, improving the accuracy of title determination. Determining the alternative titles of the video first and then selecting among them by their comprehensive word frequency in the text information further improves that accuracy. Meanwhile, updating the video title according to the description information uploaded by the user enriches the content of the title and improves the accuracy still further.
It should be understood that although the steps in the flowcharts of figs. 3-11 are shown sequentially, as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the execution order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in figs. 3-11 may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments, and these sub-steps or stages are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
As shown in fig. 13, in one embodiment, there is provided an apparatus 1300 for generating a video title, the apparatus 1300 including: a video frame extraction module 1302, a text information acquisition module 1304, an alternative title determination module 1306, and an alternative title screening module 1308, wherein:
the video frame extracting module 1302 is configured to extract a plurality of video frames of a video as target video frames.
The text information obtaining module 1304 is configured to obtain image contents of the multiple target video frames, and obtain text information of the video in each target video frame according to the image contents of the multiple target video frames.
And an alternative title determining module 1306, configured to determine an alternative title of the video according to text information of the video in each target video frame.
The alternative title screening module 1308 is configured to obtain the comprehensive word frequency of the alternative titles in the text information and to screen, from the alternative titles, the alternative title with the maximum comprehensive word frequency as the title of the video.
In one embodiment, the video frame extraction module 1302 is further configured to obtain a candidate set of key video frames for the video; and screening a plurality of effective key video frames from the key video frame candidate set to serve as a plurality of target video frames of the video.
In one embodiment, the video frame extraction module 1302 is further configured to sample the video at a preset video frame sampling frequency to obtain video frames of the video in different time periods; compute the absolute difference between the video frames of each pair of adjacent time periods; take, as a key video frame, the video frame of the later time period in any adjacent pair whose absolute difference exceeds a preset threshold; and construct the key video frame candidate set from these key video frames.
In one embodiment, the video frame extraction module 1302 is further configured to input the key video frames in the key video frame candidate set into a pre-trained video frame prediction model, so as to obtain a probability that each key video frame belongs to an effective key video frame; and taking the key video frame with the probability greater than the preset probability as an effective key video frame.
In one embodiment, the text information obtaining module 1304 is further configured to input image contents of a plurality of target video frames into a pre-trained CNN model, respectively, to obtain image feature information of each target video frame; the pre-trained CNN model is used for extracting image characteristic information in image contents of a plurality of target video frames; respectively inputting the image characteristic information of each target video frame into a pre-trained LSTM model to obtain text information of the video under each target video frame; the pre-trained LSTM model is used for generating text information corresponding to the image characteristic information of each target video frame.
In an embodiment, the alternative title determining module 1306 is further configured to cluster text information of the video in each target video frame to obtain a text information cluster; recognizing text information corresponding to the center of each text information cluster as target text information; and obtaining the alternative titles of the videos according to the target text information.
In one embodiment, the alternative title screening module 1308 performs word segmentation on the alternative titles to obtain the words they contain; counts, for each word contained in an alternative title, its number of occurrences in the text information under each target video frame, to obtain the word frequency of that word; and adds the word frequencies of all the words contained in an alternative title to obtain the comprehensive word frequency of that alternative title in the text information.
In an embodiment, the apparatus 1300 for generating a video title further includes a video tag acquisition module.
The video tag acquisition module is configured to screen out, from all the words in the alternative titles, the words whose word frequency is greater than a preset word frequency, as the tags of the video.
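A one-liner sketch of this tag screening, assuming a `word_freq` callable like the one in the title-selection sketch above; the frequency cutoff is an illustrative assumed value.

```python
def extract_tags(title_words, word_freq, min_freq=3):
    """Keep the title words whose frequency across the per-frame text
    information exceeds a preset word frequency."""
    return [w for w in title_words if word_freq(w) > min_freq]
```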
In an embodiment, the apparatus 1300 for generating a video title further includes a video title updating module.
The video title updating module is configured to receive the description information of the video uploaded by the user terminal; extract the main body information and the background information of the video from the description information; and update the title of the video according to the main body information and the background information.
In this embodiment, the text information of the video under each target video frame is obtained from the image contents of the extracted target video frames; the alternative titles of the video are then determined, and the alternative title with the maximum comprehensive word frequency in the text information is screened out as the title of the video. The title of the video is thus generated automatically from a plurality of target video frames. Because the global information of the video is comprehensively considered, the video title is determined more accurately, improving the accuracy of title determination; determining the alternative titles first and then selecting among them by comprehensive word frequency further improves that accuracy.
In one embodiment, the video title generation apparatus provided in the present application may be implemented in the form of a computer program, and the computer program may be run on a computer device as shown in fig. 2. The memory of the computer device may store the program modules constituting the video title generation apparatus, such as the video frame extraction module 1302, the text information acquisition module 1304, the alternative title determination module 1306, and the alternative title screening module 1308 shown in fig. 13. The computer program constituted by these program modules causes the processor to execute the steps of the video title generation method of the embodiments of the present application described in this specification.
For example, the computer device shown in fig. 2 may extract a plurality of video frames of a video as a plurality of target video frames through the video frame extraction module 1302 in the video title generation apparatus 1300 shown in fig. 13, acquire image contents of the plurality of target video frames through the text information acquisition module 1304, and obtain text information of the video under each target video frame according to the image contents of the plurality of target video frames. The computer device may determine, by the alternative title determining module 1306, an alternative title of the video according to the text information of the video in each target video frame, obtain, by the alternative title screening module 1308, the comprehensive word frequency of the alternative title in the text information, and screen, from the alternative title, an alternative title with the maximum comprehensive word frequency as the title of the video.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the above-described video title generation method. Here, the steps of the video title generation method may be steps in the video title generation methods of the above-described respective embodiments.
In one embodiment, a computer-readable storage medium is provided, which stores a computer program that, when executed by a processor, causes the processor to perform the steps of the above-described video title generation method. Here, the steps of the video title generation method may be steps in the video title generation methods of the above-described respective embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of the present disclosure.
The above-mentioned embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but should not be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and such variations and modifications fall within the protection scope of the present application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (12)

1. A method for generating a video title, the method comprising:
extracting a plurality of video frames of a video as a plurality of target video frames;
acquiring image contents of the target video frames, and obtaining text information of the video under each target video frame according to the image contents of the target video frames; the text information of the video under each target video frame is used for describing the image content of each target video frame, and is the text information corresponding to the image characteristic information in the image content of each target video frame;
determining alternative titles of the videos according to text information of the videos under the target video frames; the alternative titles of the videos are text information corresponding to the centers of text information clusters obtained by clustering the text information of the videos under the target video frames, and the text information clusters are sets of similar text information;
acquiring the comprehensive word frequency of the alternative titles in the text information, and screening the alternative titles with the maximum comprehensive word frequency from the alternative titles to serve as the titles of the videos; the comprehensive word frequency refers to the sum of the occurrence frequency of each word in the alternative titles in the text information of each target video frame;
receiving description information of the video uploaded by a user terminal; extracting main body information and background information of the video from the description information of the video; identifying the title of the video to obtain initial main body information and initial background information of the title; replacing the initial main body information and the initial background information of the title with the main body information and the background information respectively so as to update the title of the video;
the extracting a plurality of video frames of the video as a plurality of target video frames comprises:
sampling a video according to a preset video frame sampling frequency to obtain video frames of the video in different time periods;
respectively counting difference absolute values between the video frames of the adjacent time periods, and marking the video frame of the next time period in the video frames of the adjacent time periods with the difference absolute value larger than a preset threshold value;
screening marked video frames from the video frames of the video in different time periods to serve as key video frames of the video;
screening out, from the key video frames of the video, the key video frames whose probability of being effective key video frames is greater than a preset probability, as a plurality of target video frames; wherein the probability is obtained by performing predictive analysis on the key video frames of the video through a pre-trained video frame prediction model.
2. The method according to claim 1, wherein said obtaining text information of the video at each of the target video frames according to the image contents of the target video frames comprises:
inputting the image contents of the plurality of target video frames into a pre-trained CNN model respectively to obtain the image characteristic information of each target video frame; the pre-trained CNN model is used for extracting image characteristic information in the image contents of the plurality of target video frames;
respectively inputting the image characteristic information of each target video frame into a pre-trained LSTM model to obtain text information of the video under each target video frame; the pre-trained LSTM model is used for generating text information corresponding to the image feature information of each target video frame.
3. The method of claim 1, wherein determining the alternative titles of the videos according to the text information of the videos under the respective target video frames comprises:
clustering the text information of the video under each target video frame to obtain a text information cluster;
recognizing text information corresponding to the center of each text information cluster as target text information;
and obtaining the alternative titles of the videos according to the target text information.
4. The method of claim 1, wherein said obtaining the comprehensive word frequency of the alternative titles in the text information comprises:
performing word segmentation processing on the alternative titles to obtain words contained in the alternative titles;
counting, for each word contained in the alternative titles, its occurrence frequency in the text information under each target video frame, to obtain the word frequency of each word contained in the alternative titles;
and adding the word frequencies of all the words contained in the alternative titles to obtain the comprehensive word frequency of the alternative titles in the text information.
5. The method of claim 4, further comprising:
and screening out, from all the words in the alternative titles, the words whose word frequency is greater than a preset word frequency, as tags of the video.
6. An apparatus for generating a video title, the apparatus comprising:
the video frame extraction module is used for extracting a plurality of video frames of a video as a plurality of target video frames;
the text information acquisition module is used for acquiring the image contents of the plurality of target video frames and acquiring text information of the video under each target video frame according to the image contents of the plurality of target video frames; the text information of the video under each target video frame is respectively used for describing the image content of each target video frame, and is text information corresponding to image characteristic information in the image content of each target video frame;
the alternative title determining module is used for determining alternative titles of the videos according to the text information of the videos under the target video frames; the alternative titles of the videos are text information corresponding to the centers of text information clusters obtained by clustering the text information of the videos under the target video frames, and the text information clusters are sets of similar text information;
the alternative title screening module is used for acquiring the comprehensive word frequency of the alternative titles in the text information and screening the alternative titles with the maximum comprehensive word frequency from the alternative titles to serve as the titles of the videos; the comprehensive word frequency refers to the sum of the occurrence frequency of each word in the alternative titles in the text information of each target video frame;
the video title updating module is used for receiving the description information of the video uploaded by the user terminal; extracting main body information and background information of the video from the description information of the video; identifying the title of the video to obtain initial main body information and initial background information of the title; and replacing the initial main body information and the initial background information of the title with the main body information and the background information, respectively, so as to update the title of the video;
the video frame extraction module is further used for sampling a video according to a preset video frame sampling frequency to obtain video frames of the video in different time periods; respectively counting difference absolute values between the video frames of the adjacent time periods, and marking the video frame of the next time period in the video frames of the adjacent time periods with the difference absolute value larger than a preset threshold value; screening marked video frames from the video frames of the video in different time periods to serve as key video frames of the video; screening out key video frames with the probability of the effective key video frames being greater than the preset probability from the key video frames of the video to serve as target video frames; and the probability is obtained by performing predictive analysis on the key video frame of the video through a pre-trained video frame prediction model.
7. The apparatus according to claim 6, wherein the text information obtaining module is further configured to input image contents of the plurality of target video frames into a pre-trained CNN model, respectively, to obtain image feature information of each of the target video frames; the pre-trained CNN model is used for extracting image characteristic information in the image contents of the plurality of target video frames; respectively inputting the image characteristic information of each target video frame into a pre-trained LSTM model to obtain text information of the video under each target video frame; the pre-trained LSTM model is used for generating text information corresponding to the image feature information of each target video frame.
8. The apparatus according to claim 6, wherein the alternative title determining module is further configured to cluster text information of the video in each target video frame to obtain a text information cluster; recognizing text information corresponding to the center of each text information cluster as target text information; and obtaining the alternative titles of the videos according to the target text information.
9. The apparatus according to claim 6, wherein the alternative title screening module is further configured to perform word segmentation on the alternative titles to obtain words included in the alternative titles; respectively counting each word contained in the alternative titles, and obtaining the word frequency of each word contained in the alternative titles according to the occurrence frequency of the text information under each target video frame; and adding the word frequencies of all the words contained in the alternative titles to obtain the comprehensive word frequency of the alternative titles in the text information.
10. The apparatus of claim 9, further comprising: and the video tag acquisition module is used for screening out the words with the word frequency higher than the preset word frequency from all the words in the alternative titles to serve as the tags of the videos.
11. A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 5.
12. A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 5.
CN201910683487.0A 2019-07-26 2019-07-26 Video title generation method and device and computer readable storage medium Active CN110399526B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910683487.0A CN110399526B (en) 2019-07-26 2019-07-26 Video title generation method and device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN110399526A (en) 2019-11-01
CN110399526B (en) 2023-02-28

Family

ID=68325145

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910683487.0A Active CN110399526B (en) 2019-07-26 2019-07-26 Video title generation method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN110399526B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110837579B (en) * 2019-11-05 2024-07-23 腾讯科技(深圳)有限公司 Video classification method, apparatus, computer and readable storage medium
CN111241340B (en) * 2020-01-17 2023-09-08 Oppo广东移动通信有限公司 Video tag determination method, device, terminal and storage medium
CN111639250B (en) * 2020-06-05 2023-05-16 深圳市小满科技有限公司 Enterprise description information acquisition method and device, electronic equipment and storage medium
CN112541095B (en) * 2020-11-30 2023-09-05 北京奇艺世纪科技有限公司 Video title generation method and device, electronic equipment and storage medium
CN112818984B (en) * 2021-01-27 2023-10-24 北京奇艺世纪科技有限公司 Title generation method, device, electronic equipment and storage medium
CN112883234A (en) * 2021-02-18 2021-06-01 北京明略昭辉科技有限公司 Label data generation method and device, storage medium and electronic equipment
CN113987267A (en) * 2021-10-28 2022-01-28 上海数禾信息科技有限公司 Video file label generation method and device, computer equipment and storage medium
CN114357989B (en) * 2022-01-10 2023-09-26 北京百度网讯科技有限公司 Video title generation method and device, electronic equipment and storage medium
CN114363673B (en) 2022-01-10 2022-12-27 北京百度网讯科技有限公司 Video clipping method, model training method and device
CN116033207B (en) * 2022-12-09 2024-06-14 北京奇艺世纪科技有限公司 Video title generation method and device, electronic equipment and readable storage medium
CN116208824B (en) * 2023-02-07 2024-07-30 腾讯音乐娱乐科技(深圳)有限公司 Title generation method, computer device, storage medium, and computer program product
CN116567340A (en) * 2023-05-23 2023-08-08 上海哔哩哔哩科技有限公司 Video title extraction method and device
CN118784626B (en) * 2024-06-11 2025-01-28 北京积加科技有限公司 Security monitoring video transmission method, device, equipment and computer readable medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104317931B (en) * 2014-10-31 2018-04-17 北京奇虎科技有限公司 The definite method and apparatus of web page title

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106845390A (en) * 2017-01-18 2017-06-13 腾讯科技(深圳)有限公司 Video title generation method and device
CN107027072A (en) * 2017-05-04 2017-08-08 深圳市金立通信设备有限公司 A kind of video marker method, terminal and computer-readable recording medium
CN107194419A (en) * 2017-05-10 2017-09-22 百度在线网络技术(北京)有限公司 Video classification methods and device, computer equipment and computer-readable recording medium
KR101916874B1 (en) * 2017-10-19 2018-11-08 충남대학교산학협력단 Apparatus, method for auto generating a title of video contents, and computer readable recording medium
CN108495185A (en) * 2018-03-14 2018-09-04 北京奇艺世纪科技有限公司 A kind of video title generation method and device
CN108829881A (en) * 2018-06-27 2018-11-16 深圳市腾讯网络信息技术有限公司 video title generation method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Video title generation with dense frame-rate sampling; Tang Pengjie; Journal of Frontiers of Computer Science and Technology; 2017-09-20; Vol. 12, No. 6; pp. 5-7 *

Also Published As

Publication number Publication date
CN110399526A (en) 2019-11-01

Similar Documents

Publication Publication Date Title
CN110399526B (en) Video title generation method and device and computer readable storage medium
US11412023B2 (en) Video description generation method and apparatus, video playing method and apparatus, and storage medium
US11868738B2 (en) Method and apparatus for generating natural language description information
US10304458B1 (en) Systems and methods for transcribing videos using speaker identification
CN109862397B (en) Video analysis method, device, equipment and storage medium
CN112668559B (en) Multi-mode information fusion short video emotion judgment device and method
CN111858973B (en) Method, device, server and storage medium for detecting multimedia event information
CN112860943A (en) Teaching video auditing method, device, equipment and medium
CN112580523A (en) Behavior recognition method, behavior recognition device, behavior recognition equipment and storage medium
CN112052352B (en) Video ordering method, device, server and storage medium
CN114245232B (en) Video abstract generation method and device, storage medium and electronic equipment
CN113850162A (en) Video auditing method and device and electronic equipment
CN115396690B (en) Audio and text combination method, device, electronic device and storage medium
CN110162664B (en) Video recommendation method and device, computer equipment and storage medium
CN110489593B (en) Topic processing method and device for video, electronic equipment and storage medium
CN112925972B (en) Information pushing method, device, electronic equipment and storage medium
Wei et al. Sequence-to-segment networks for segment detection
CN119580738A (en) Video processing method, device, equipment and medium based on multimodal information fusion
CN111291666A (en) Game live video identification method and device, electronic equipment and storage medium
CN110796003A (en) Lane line detection method and device and electronic equipment
CN115294227B (en) A method, apparatus, device and medium for generating a multimedia interface
CN114218434B (en) Automatic labeling method, automatic labeling device and computer readable storage medium
CN114780757A (en) Short media label extraction method and device, computer equipment and storage medium
CN117061815A (en) Video processing method, video processing device, computer readable medium and electronic equipment
KR102818491B1 (en) High speed split device and method for video section

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant