CN113518160A

CN113518160A - Video generation method, device, equipment and storage medium

Info

Publication number: CN113518160A
Application number: CN202110035480.5A
Authority: CN
Inventors: 黄晓彤; 张涛; 陈晨; 高鹏飞; 黄亚雄; 肖荣涛; 邢小京; 王玥
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2021-01-12
Filing date: 2021-01-12
Publication date: 2021-10-19
Anticipated expiration: 2041-01-12
Also published as: CN113518160B

Abstract

The present application discloses a video generation method, device, equipment and medium, which are applied in the field of artificial intelligence. The method includes: in response to a first recording operation, recording a first voice content; displaying a first text material and a first multimedia material corresponding to the first voice content; video, the first video segment is generated based on the first text material and the first multimedia material. This method enables the user to generate a video with a relatively simple operation.

Description

Video generation method, device, equipment and storage medium

Technical Field

The embodiment of the application relates to the field of artificial intelligence, in particular to a video generation method, a device, equipment and a storage medium.

Background

Video is a better quality information transmission carrier in internet information. Many online information websites, video websites, and short video APPs (applications) all use video as a main presentation method.

In the related art, professional video production software is provided, and a user prepares material files such as text materials, picture materials, audio materials, video materials, transition animations and the like in advance. Various materials are then edited in the video production software to generate a video file.

Because the human-computer interaction operation of the video production software is complex, for many lightweight video production scenes, the learning cost of the user using the video production software is too high, and the efficiency is low.

Disclosure of Invention

The application provides a video generation method, a device, equipment and a storage medium, which can generate tutorials and training videos with simpler recording operation and lower learning cost. The technical scheme is as follows:

according to an aspect of the present application, there is provided a video generation method, the method including:

responding to the first recording operation, and recording first voice content;

displaying a first text material and a first multimedia material corresponding to the first voice content;

in response to a video generation operation, displaying a video having a first video segment generated based on the first text material and the first multimedia material.

According to another aspect of the present application, there is provided a video generation method, which is applied to a server, the method including:

responding to a first sound recording operation in a terminal, and obtaining a first text material and a first multimedia material, wherein the first text material is obtained by performing voice recognition on the first voice content, and the first multimedia material is obtained by searching based on at least one of the first voice content and the first text material;

in response to a video generation operation in the terminal, sending a video having a first video clip to the terminal, the first video clip being generated based on the first text material and the first multimedia material.

According to another aspect of the present application, there is provided a video generating apparatus, the apparatus including:

the recording module is used for responding to the first recording operation and recording the first voice content;

the display module is used for displaying a first text material and a first multimedia material corresponding to the first voice content;

the display module is further configured to display a video having a first video clip in response to a video generation operation, the first video clip being generated based on the first text material and the first multimedia material.

In an optional design of the present application, the display module is further configured to display the first text material obtained by performing speech recognition on the first speech content in a recording process.

In an optional design of the present application, the display module is further configured to display the first multimedia material corresponding to the first voice content after the recording is finished.

In an optional design of the present application, the display module is further configured to display the first multimedia material corresponding to the first speech recognition content after recognizing the first speech recognition content in a recording process; after the second voice recognition content is recognized, displaying the updated first multimedia material; the updated first multimedia material corresponds to the second speech recognition content, or the updated first multimedia material corresponds to the first speech recognition content and the second speech recognition content.

In an optional design of the present application, the display module is further configured to display the first multimedia material corresponding to the first speech recognition content after recognizing the first speech recognition content in a recording process; and after the second voice recognition content is recognized, displaying the first multimedia material corresponding to the second voice recognition content.

In an optional design of the present application, the display module is further configured to display a video production interface, and the video production interface includes a recording control.

The recording module is further configured to record the first voice content in response to a first recording operation on the recording control.

In an alternative design of the present application, the first multimedia material includes: picture materials; video frames of the first video clip are generated based on the picture material, and audio frames of the first video clip are generated based on the voice content; or, the first multimedia material comprises: video materials; video frames of the first video segment are generated based on the video material, and audio frames of the first video segment are generated based on the speech content; or, the first multimedia material comprises: audio materials; audio frames of the first video segment are generated based on the speech content and the audio material.

In an optional design of the present application, the display module is further configured to display the edited first text material in response to an editing operation on the first text material; wherein the editing operation comprises: at least one of a text adding operation, a text deleting operation, a text searching operation, a text modifying operation, a text replacing operation, a text moving operation and a format changing operation.

In an optional design of the present application, the display module is further configured to display a replacement control corresponding to the first multimedia material.

The display module is further configured to display, in response to a replacement operation on the replacement control, an alternative multimedia material as the first multimedia material, where the alternative multimedia material is obtained by searching based on at least one of the voice content and the first text material.

In an optional design of the present application, the display module is further configured to display a deletion control corresponding to the first multimedia material; responding to the deleting operation on the deleting control, deleting the first multimedia material and displaying an importing control; and responding to the import operation on the import control, and displaying the imported multimedia material as the first multimedia material.

In an optional design of the present application, the recording module is further configured to record the second voice content in response to a second recording operation.

The display module is further configured to display a second text material and a second multimedia material corresponding to the second voice content, where the second text material is obtained by performing voice recognition on the second voice content, and the second multimedia material is obtained by performing search based on at least one of the voice content and the second text material.

The display module is further configured to display a video having a first video segment and a second video segment in response to a video generation operation, the second video segment being generated based on the second text material and the second multimedia material.

In an alternative design of the present application, a transition animation is further provided between the first video clip and the second video clip.

In an alternative design of the present application, the first text material is displayed in the first video segment in the form of subtitles.

In an optional design of the present application, the display module is further configured to display a sharing control corresponding to the video having the first video clip.

The device further comprises:

the communication module is used for responding to the sharing operation on the sharing control and sending the video with the first video clip to other terminals; or, in response to a sharing operation on the sharing control, sending the video with the first video clip onto a web space.

In an optional design of the present application, the communication module is further configured to send the first voice content to a server.

The communication module is further used for receiving the first text material and the first multimedia material replied by the server.

In an alternative design of the present application, the apparatus further includes:

and the recognition module is used for carrying out voice recognition on the first voice content to obtain the first text material.

The communication module is further used for sending the first text material to a server.

The communication module is further used for receiving the first multimedia material replied by the server based on the first text material.

the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a first text material and a first multimedia material, the first text material is obtained by performing voice recognition on first voice content, and the first multimedia material is obtained by searching based on at least one of the first voice content and the first text material; the first voice content is obtained by recording a first recording operation on the terminal;

a synthesizing module, configured to generate a video with a first video clip according to the first text material and a first multimedia material, where the first video clip is generated based on the first text material and the first multimedia material;

a sending module, configured to send the video with the first video clip to the terminal.

In an optional design of the present application, the obtaining module is further configured to receive the first voice content sent by the terminal.

The synthesis module is further configured to perform speech recognition on the first speech content to obtain the first text material; extracting keywords from the first text material; and searching and obtaining the first multimedia material based on the keyword.

In an optional design of the present application, the obtaining module is further configured to receive the first text material sent by the terminal, where the first text material is obtained by performing voice recognition on the first voice content by the terminal.

The synthesis module is further used for extracting the key words from the first text material; and searching and obtaining the first multimedia material based on the keyword.

According to an aspect of the present application, there is provided a computer device including: a processor and a memory, the memory storing a computer program that is loaded and executed by the processor to implement the video generation method as described above.

According to another aspect of the present application, there is provided a computer-readable storage medium storing a computer program which is loaded and executed by a processor to implement the video generation method as described above.

According to another aspect of the application, a computer program product or a computer program is provided, comprising computer instructions, which are stored in a computer readable storage medium. The controller reads the computer instructions from the computer-readable storage medium, and the controller executes the computer instructions to enable the display device to execute the video generation method.

The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:

the video can be generated quickly and simply only by recording the first voice content and performing simple selection operation by the user, the user does not need to spend a large amount of time to search for multimedia materials, and the user does not need to have professional video editing knowledge. The method and the device can reduce the learning cost of the user for using the video making software, also avoid the process of searching materials by the user, and improve the efficiency of video making and the human-computer interaction efficiency.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

FIG. 1 is a schematic diagram of an exemplary interface of a video generation method provided by an exemplary embodiment of the present application;

FIG. 2 is a block diagram of a computer system provided in an exemplary embodiment of the present application;

FIG. 3 is a schematic flow chart diagram of a video generation method provided by an exemplary embodiment of the present application;

FIG. 4 is a schematic diagram of an exemplary interface of a video generation method provided by an exemplary embodiment of the present application;

FIG. 5 is a schematic flow chart diagram of a video generation method provided by an exemplary embodiment of the present application;

FIG. 6 is a schematic diagram of an exemplary interface of a video generation method provided by an exemplary embodiment of the present application;

FIG. 7 is a schematic diagram of an exemplary interface of a video generation method provided by an exemplary embodiment of the present application;

FIG. 8 is a schematic diagram of an exemplary interface of a video generation method provided by an exemplary embodiment of the present application;

FIG. 9 is a schematic flow chart diagram of a video generation method provided by an exemplary embodiment of the present application;

FIG. 10 is a schematic flow chart diagram of a video generation method provided by an exemplary embodiment of the present application;

FIG. 11 is a schematic flow chart diagram of a video generation method provided by an exemplary embodiment of the present application;

FIG. 12 is an exemplary service architecture provided by an exemplary embodiment of the present application;

FIG. 13 is an exemplary architecture for converting speech content into text material as provided by an exemplary embodiment of the present application;

FIG. 14 is an exemplary block diagram of a backend system structure provided by an exemplary embodiment of the present application;

FIG. 15 is a diagram illustrating an exemplary architecture of a Windows server provided by an exemplary embodiment of the present application;

fig. 16 is a schematic flow chart of a video composition method according to an exemplary embodiment of the present application;

fig. 17 is a block diagram of a video compositing apparatus provided by an exemplary embodiment of the present application;

fig. 18 is a block diagram of a video compositing apparatus provided by an exemplary embodiment of the present application;

fig. 19 is a block diagram of a server provided in an exemplary embodiment of the present application.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.

It is to be understood that reference herein to "a number" means one or more and "a plurality" means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.

For convenience of understanding, terms referred to in the embodiments of the present application will be described below.

Artificial Intelligence (AI): artificial intelligence is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.

Machine Learning (ML): machine learning is a multi-field cross discipline, and relates to a plurality of disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. The special research on how a computer simulates or realizes the learning behavior of human beings so as to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve the performance of the computer. Machine learning is the core of artificial intelligence, is the fundamental approach for computers to have intelligence, and is applied to all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching learning.

Natural Language Processing (NLP): is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable efficient communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Therefore, the research in this field will involve natural language, i.e. the language that people use everyday, so it is closely related to the research of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robotic question and answer, knowledge mapping, and the like.

The scheme provided by the embodiment of the application relates to technologies such as artificial intelligence natural language processing, image processing, machine learning and the like, and is specifically explained by the following embodiments:

for example, as shown in fig. 1, a video generation scheme provided by the present application is described by taking a doctor as an example to create a video. First, the user interface 11 is a video production interface, the interface content "creation content" is displayed on the top of the user interface 11, the video title "symptoms and treatment of eczema" is displayed under the text, the text "press the button below to start recording" is displayed in the middle position of the user interface 11, and the doctor can press the recording control 101 on the user interface 11 to prepare to start recording the voice content.

After the recording control 101 is clicked, the user interface 12 is displayed, the aforementioned symptoms and treatments of the interface title "authored content" and the video title "eczema" are still displayed on the user interface 12, and the text reminder and the graphic reminder of "recording" are displayed below the video title for reminding the doctor that the voice content is being recorded. The working state of 'voice to character' is displayed under the 'recording' reminder, and the voice content which is input and finished is displayed under the working state in a character form for the doctor to check at any time. The doctor may click on a pause control 102 centrally located at the bottom of user interface 12 to temporarily stop the recording of speech. Similarly, the doctor can click on the completion control 103 located at the right side of the pause control to complete the input of the speech content.

After the physician clicks the completion control 103, the user interface 13 is displayed, and likewise, the display interface title "authored content" and the video title "symptoms and treatment of eczema" remain on the user interface 13. In the middle position of the user interface 13, a play control 104, a first text material 105 and a first multimedia material 106 are displayed from top to bottom. The playing control 104 is used for playing the inputted voice content, the first text material 105 is a text converted from the inputted voice content, and the first multimedia material 106 is a picture related to the voice content and/or the first text material 105, for example, a picture of a eczema patient. A delete control 115 is also displayed in the upper right-hand corner of the user interface 13, and if the physician is not satisfied with the input speech content, the delete control 115 can be clicked to delete all or a portion of the play control 104, the first text material 105 and the first multimedia material 106 on the user interface 13. A change control 107 is also displayed on the first multimedia material, and if the generated first multimedia material 107 is not satisfactory, the doctor can click the change control 107 to change the first multimedia material 106 to another multimedia material. Similarly, a recording control 101 and a completion control 103 are also displayed at the bottom of the user interface 13, and when the recording control 101 is clicked, the doctor can continue to record a next voice content; and clicking the completion control 103 to complete the recording of the voice content, and entering the next step.

Assuming that the doctor has recorded two segments of voice content, the user interface 14 is displayed, and the second text material 108 and the second multimedia material 109 are displayed on the user interface 14, the doctor can also replace the second multimedia material 109 or play the voice content, which is not described herein again.

After the doctor has finished inputting all the voice contents, the user interface 15 is displayed by clicking the completion control 103 on the completion user interface 14, and the user can perform an editing operation on the first text material 105, the first multimedia material 106, the second text material 108 and the second multimedia material 109 displayed on the user interface 15, for example, modifying the text contents of the first text material 105, or clicking the deletion control 110 on the upper right of the first multimedia material 106 to delete the first multimedia material 106.

After the doctor has finished editing, the doctor clicks on the video generation control 111 on the user interface 15, the user interface 16 is displayed, the user interface 16 displays the video 112, and text material 113 is displayed below the video 112, and the text material 113 is obtained from the first text material 105 and the second text material 108. The doctor can click on the sharing control 114 to share the generated video 112 and the text material 113 with other users or upload the video and the text material to a preset web space.

Fig. 2 shows a schematic structural diagram of a computer system provided in an exemplary embodiment of the present application. The computer system 200 includes: a terminal 220 and a server 240.

The terminal 220 has a client installed thereon in connection with video generation. The client may be an applet in the APP, or a special application program, or a web client. The user performs operations related to video generation on the terminal 220. The terminal 220 is at least one of a smartphone, a tablet, an e-book reader, an MP3 player, an MP4 player, a laptop portable computer, and a desktop computer.

The terminal 220 is connected to the server 240 through a wireless network or a wired network.

The server 240 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform. The server 240 is used to provide background services for applications that support video generation. Alternatively, server 240 undertakes primary computational tasks and terminal 220 undertakes secondary computational tasks; alternatively, server 240 undertakes the secondary computing work and terminal 220 undertakes the primary computing work; alternatively, both the server 240 and the terminal 220 employ a distributed computing architecture for collaborative computing.

Fig. 3 is a flowchart illustrating a video generation method according to an exemplary embodiment of the present application. The method may be performed by the terminal 220 shown in fig. 2, the method comprising the steps of:

step 302: and responding to the first recording operation, and recording the first voice content.

The first recording operation refers to an operation of recording voice by a user. The first recording operation is to record voice by pressing one or more preset physical keys, and the user can also perform the first recording operation by a signal generated by releasing, long-pressing, clicking, double-clicking and/or sliding on the touch screen.

The first voice content refers to voice recorded by the user in real time. Optionally, the first voice content is obtained by downloading through a network, or the first voice content is obtained by querying locally stored audio data, or the first voice content is sent by other terminals. The present embodiment takes the first voice content as an example of real-time recording for the user.

Illustratively, as shown in fig. 4, the user interface 41 is a video production interface, a recording control 401 is displayed on the user interface 41, the user can click the recording control 401 to start recording the first voice content, and a video title and/or an interface title may be displayed on the user interface 41, where the video title is used to represent main content of a video to be generated, and the interface title represents main content of the user interface. After the recording control 401 is clicked, the user interface 42 is displayed, the user interface 42 is a recording interface, the video title and/or the interface title on the user interface 41 are/is kept displayed on the user interface, and a reminding icon is displayed, the reminding icon is used for reminding the user terminal that the user terminal is receiving the first voice content of the user, and the reminding icon is composed of characters and/or graphics. A text material 403 may also be displayed on the interface 42, and a part of the input first voice content is converted into words, so as to obtain the text material 403, that is, the text material 403 is a part of which the voice conversion words have been completed. Also displayed on the interface 42 is a pause control 404, which the user can click on to temporarily stop the input of the first speech content. A completion control 405 is also displayed on the interface 42, and the user can click the completion control 405 to complete the input of the first voice content.

Step 304: and displaying a first text material and a first multimedia material corresponding to the first voice content.

The first text material refers to a text obtained by recognizing the first voice content. Illustratively, the first speech content is "how to treat the cold", and the first text material is correspondingly "how to treat the cold". The first text material is a text representation of the first voice content, and the semantics of the first text material and the semantics of the first voice content are the same.

Speech recognition may convert speech content input by a user into text. There are many ways to implement speech recognition, for example, establishing a database corresponding to speech and text, and searching for corresponding text in the database after inputting a section of speech, or using a trained speech recognition neural network that can output the input speech as text.

The first text material may be obtained by performing a segmented translation on the first voice content, or may be obtained by performing an overall translation on the first voice content. Illustratively, the first speech content is "rained, remembered to take an umbrella", when the segmented translation is adopted, the speech content is translated after the input speech content "rained", the speech content is translated after the input speech content "remembered to take an umbrella", and the two translation results are spliced to obtain a first text material "rained, remembered to take an umbrella". When the integral translation is adopted, after the integral voice content 'raining and remembering to take an umbrella' is input, the integral voice content is translated to obtain a first text material 'raining and remembering to take an umbrella'.

The first multimedia material is obtained based on at least one of a first voice content and a first text material, and a direct or indirect relationship exists between the first multimedia material and the first voice content and the first text material. The first multimedia material includes: at least one of pictures, video, audio. Illustratively, the first text material is "a cold," and the first multimedia material is a video of how to treat the cold, or alternatively, a picture of a man sneezing.

Illustratively, as shown in fig. 4, the user interface 42 is an intermediate user interface for voice recording, and after the user finishes voice recording, the user clicks the finish control 405 on the user interface 42 to display the user interface 43, the video title and/or the interface title still remain to be displayed on the user interface 43, a play control 406 is further displayed on the user interface 43, and when the play control 406 is clicked, the first voice content recorded by the user is played, and a progress bar is further displayed at a position around the play control 406, and the progress bar is used for indicating a playing progress of the first voice content. The user can also slide the progress bar to adjust the played content. Also displayed on the user interface 43 is a first text material 407, the first text material 407 being obtained by speech recognition of the first speech content. Also displayed on the user interface 43 is a first multimedia material 408. A recording control 401 is also displayed on the user interface for the user to record the next voice content. Also displayed on user interface 43 is a completion control 414, completion control 414 functioning differently than completion control 405 in user interface 42, where completion control 414 functions to complete all voice content input and prepare to generate a video. Optionally, a deletion control 413 may be further displayed on the user interface 43, and clicking the deletion control 413 may delete display elements such as the play control 406, the first text material 407, and the first multimedia material 408 displayed on the user interface, and redisplay the user interface 41.

Step 306: in response to a video generation operation, a video having a first video segment is displayed, the first video segment being generated based on a first text material and a first multimedia material.

The video generation operation is for generating a video having a first video segment based on the first text material and the first multimedia material. The video generation operation is to perform video generation by pressing one or more preset physical keys, and the user may also perform the video generation operation by a signal generated by performing a release, a long press, a click, a double click and/or a slide on the touch screen.

Optionally, the first text material is displayed in the first video segment in the form of subtitles. The subtitles may be displayed in a horizontal arrangement on the lower side or the upper side of the first video segment, or may be displayed in a vertical arrangement on the left side or the right side of the first video segment. The specific display position of the subtitle is not limited in the present application.

The video is formed by combining a plurality of video frames and a plurality of audio frames, and the situation of generating the first video segment according to the difference of the first multimedia material includes but is not limited to the following situations:

1. the first multimedia material includes: and (5) picture materials.

All or part of the video frames of the first video segment are generated based on the picture material, and all or part of the audio frames of the first video segment are generated based on the voice content.

2. The first multimedia material includes: video material.

All or a portion of the video frames of the first video segment are generated based on the video material, and all or a portion of the audio frames of the first video segment are generated based on the speech content.

3. The first multimedia material includes: audio material.

All or a portion of the audio frames of the first video segment are generated based on the audio material and the speech content.

In the present application, the video frame and the audio frame of the first video segment are generated in at least one or a combination of the above manners.

The first video segment refers to a video generated based on a first text material and a first multimedia material.

Optionally, in response to a video generation operation, a video having a first video segment and a first text material are displayed.

Optionally, the video may be transcoded into videos with different resolutions and bit rates to adapt to different kinds of terminals.

Illustratively, as shown in fig. 4, the user clicks the completion control 414 on the user interface 43, the user interface 44 is displayed, and a video 410 is displayed on the user interface 44, the video 410 being generated from the first multimedia material 408 and the first text material 407. Also displayed on the user interface 44 is text material 411, the text material 411 comprising all or part of the content of the first text material 407, the first text material 407 further comprising text in addition to the first text material. Optionally, a sharing control 412 is displayed on the user interface 44, and when the user clicks the sharing control 412, the generated video 410 and the text material 411 can be sent to other users or sent to a designated network space.

In summary, in this embodiment, the video can be quickly obtained only by recording the voice content and performing simple operations, and the user does not need to spend a lot of time to search for the material and the user does not need to have professional editing knowledge. The method and the device can reduce the learning cost of the user for using the video making software, also avoid the process of searching materials by the user, and improve the efficiency of video making and the human-computer interaction efficiency.

Fig. 5 is a flowchart illustrating a video generation method according to an exemplary embodiment of the present application. The method may be performed by the terminal 220 shown in fig. 2, the method comprising the steps of:

step 501: and displaying a video production interface, wherein the video production interface comprises a recording control.

The video production interface is an initial interface for video production, and a user starts to produce videos through the interface.

The recording control is used to start recording the user's voice.

Optionally, a video title and/or a video summary are also displayed on the video production interface.

Illustratively, as shown in fig. 1, the user interface 11 is a video production interface, and a recording control 101 is displayed on the interface 11. An interface title or video title may also be displayed.

Step 502: and responding to a first recording operation on the recording control, and recording the first voice content.

In this embodiment, the first voice content refers to a voice recorded by the user in real time.

Illustratively, as shown in fig. 1, a user clicks a recording control 101 on an interface 11 to start recording voice content, and during the recording process of the voice content, the interface of the terminal is displayed as a user interface 12, and a text reminder and a graphic reminder are displayed on the interface to remind the user that the voice content is being recorded. A pause control 102 is also displayed on the user interface 12, and a completion control 103 is also displayed on the user interface 12 for completing the input of the first voice content.

Step 503: and in the recording process, displaying a first text material obtained by performing voice recognition on the first voice content.

The first text material is obtained by performing voice recognition on the first voice content.

Optionally, after the recording is completed, a first text material obtained by performing speech recognition on the first speech content is displayed.

Illustratively, as shown in fig. 1, on the user interface 12, a user is inputting a first voice content, and a first text material 105 obtained by performing voice recognition on the first voice content that has been input is displayed on the user interface 12.

Step 504: and displaying the first multimedia material corresponding to the first voice content after the recording is finished.

The first multimedia material is searched based on at least one of the first speech content and the first text material.

Optionally, at least one of the first multimedia material and the first text material is displayed after the recording is finished.

Illustratively, as shown in fig. 1, the user interface 13 is displayed after the recording is finished, and a play control 104, a first text material 105 and a first multimedia material 106 can be further displayed on the user interface 13.

Optionally, after the recording is finished, at least one of a picture material, a video material and an audio material corresponding to the first voice content is displayed.

Illustratively, as shown in fig. 7, after the recording is completed, a user interface 71 is displayed, and in the user interface 71, a plurality of picture materials 701 are displayed.

Illustratively, as shown in fig. 8, after the recording is completed, the user interface 81 is displayed, and in the user interface 81, a plurality of picture materials 801 and audio materials 802 are simultaneously displayed.

Optionally, the first text material includes first speech recognition content and second speech recognition content, and the first speech recognition content and the second speech recognition content are obtained by performing speech recognition on different portions of the first speech content.

Optionally, after the first voice recognition content is recognized in the recording process, displaying a first multimedia material corresponding to the first voice recognition content; after the second voice recognition content is recognized, displaying the updated first multimedia material; and the updated first multimedia material corresponds to the second voice recognition content, or the updated first multimedia material corresponds to the first voice recognition content and the second voice recognition content. Illustratively, the user inputs the first voice content as "i go to the playground to run on a sunny day", after the user inputs the first voice content, the terminal acquires the first voice recognition content "i go to the playground to run", the terminal displays a picture related to "sunny day", for example, a picture of the sun, on the user interface according to the first voice recognition content, and after the terminal acquires the second voice recognition content "i go to the playground to run", the terminal updates the first multimedia material, cancels the display of the original first multimedia material, and displays the updated first multimedia material, for example, a picture of a person who runs.

Optionally, after the first voice recognition content is recognized in the recording process, displaying a first multimedia material corresponding to the first voice recognition content; and after the second voice recognition content is recognized, displaying the first multimedia material corresponding to the second voice recognition content. Illustratively, the user inputs a first voice content "i go to the playground to run" on a sunny day ", after the terminal obtains the first voice recognition content" on a sunny day ", a first multimedia material corresponding to the first voice recognition content, for example, a picture related to" sunny day ", is displayed on the user interface, after the user inputs a second voice recognition content, the first multimedia material is kept to be displayed, and a picture of a first multimedia material corresponding to the second voice recognition content, for example, a running person, is displayed in other areas of the user interface.

Step 505: and displaying the replacement control corresponding to the first multimedia material.

Illustratively, as shown in fig. 1, a replacement control 107 is displayed at the bottom of the first multimedia 106 in the user interface 13, and the replacement control 107 is displayed as a text label of "one sheet after another".

Step 506: and displaying the alternative multimedia material as the first multimedia material in response to the replacement operation on the replacement control.

The replacement operation is for the user to replace the first multimedia material with the alternative multimedia material. The replacing operation is to replace the first multimedia material by pressing one or more preset physical keys, and the user can also replace the first multimedia material by a signal generated by releasing, long pressing, clicking, double clicking and/or sliding on the touch screen.

The alternative multimedia material is searched based on at least one of the first speech content and the first text material.

Optionally, at least one multimedia material is arranged according to the search result, the first arranged multimedia material is set as a first multimedia material, and the remaining multimedia materials are set as alternative multimedia materials.

Optionally, the first multimedia material and the alternative multimedia material are arranged according to the degree of association with the first text material, or according to the degree of association with the first speech content.

Optionally, the user may repeat step 506.

Illustratively, the user may click on the change control 107 in the user interface 13 to change the first multimedia material 106 to an alternative multimedia material.

Step 507: and responding to the second recording operation, and recording the second voice content.

The second recording operation refers to an operation of recording voice by the user. The second recording operation refers to an operation of recording voice by the user. The second recording operation is to record voice by pressing one or more preset physical keys, and the user can also perform the second recording operation by a signal generated by releasing, long-pressing, clicking, double-clicking and/or sliding on the touch screen. Optionally, the second recording operation is the same as or different from the first recording operation.

The second voice content refers to voice recorded by the user in real time. Optionally, the second voice content is obtained by downloading through a network, or the second voice content is obtained by querying locally stored audio data, or the second voice content is sent by other terminals. The present embodiment takes the second voice content as an example of being recorded by the user in real time.

For example, as shown in fig. 1, clicking the recording control 101 on the user interface 13 to record the second voice content, please refer to the user interface 12 in the recording process, which is not described herein again.

Step 508: and displaying a second text material and a second multimedia material corresponding to the second voice content.

The second text material is obtained by performing voice recognition on the second voice content.

The second multimedia material is searched based on at least one of the second speech content and the second text material. Optionally, the second multimedia material comprises: at least one of pictures, video, audio.

Illustratively, as shown in fig. 1, after the recording of the second voice content is completed, the user interface 14 is displayed, and a play control 104, a second text material 108 and a second multimedia material 109 are displayed on the interface 14. Also displayed on the user interface 14 are a recording control 101 and a completion control 103. The user may click on the record control 101 to record more voice content. After the user completes the input of the voice content, the user clicks the completion control 103 to end the input of the voice content.

Step 509: and responding to the editing operation on the first text material, and displaying the edited first text material.

The editing operation is used to modify the first text material. Wherein the editing operation comprises: at least one of a text adding operation, a text deleting operation, a text searching operation, a text modifying operation, a text replacing operation, a text moving operation and a format changing operation.

Editing operations on the first text material and the first multimedia material may be performed prior to step 505. The present application does not limit the specific timing of the operations.

Optionally, the edited second text material is displayed in response to an editing operation on the second text material.

The user may repeat step 509.

Illustratively, as shown in FIG. 1, after all speech content has been entered, the user interface 15 is displayed by clicking on the completion control 103 on the user interface 14. On the user interface 15, a first text material 105, a first multimedia material 106, a second text material 108 and a second multimedia material 109 are displayed on the user interface. The user can directly click on the first text material 105 or the second text material 108 at the user interface 15 to edit the text content therein. Optionally, more text material or multimedia material may be displayed on the user interface 15, or less text material or multimedia material may be displayed, which is not limited in this application.

Step 510: and displaying a deletion control corresponding to the first multimedia material.

The delete control is used to cancel the display of the first multimedia material.

Optionally, a deletion control corresponding to the second multimedia material is displayed.

Illustratively, as shown in FIG. 6, a delete control 110 is displayed on the user interface 15. A delete control 110, the delete control 110 being displayed superimposed over the first multimedia material 106.

Step 511: and responding to the deletion operation on the deletion control, deleting the first multimedia material and displaying the import control.

The deleting operation is to delete the first multimedia material by pressing one or more preset physical keys, and the user can also delete the first multimedia material by a signal generated by releasing, long pressing, clicking, double clicking and/or sliding on the touch screen.

Optionally, in response to a deletion operation on the deletion control corresponding to the second multimedia material, the second multimedia material is deleted and the import control is displayed.

Optionally, an import control corresponding to the second multimedia material is displayed, and the second multimedia material is imported in response to an import operation on the pour control corresponding to the second multimedia material. At this time, the first multimedia material and the second multimedia material are simultaneously displayed on the user interface. The user may also choose to display more multimedia material.

For example, as shown in fig. 6, the delete control 110 on the user interface 15 is clicked, the user interface 61 is displayed, the first multimedia material 106 is deleted and cancelled, the import control 601 is displayed at the position of the original first multimedia material 106, and the import control 601 may be displayed as a segment of text, for example, the text "import other pictures" displayed on the user interface 61 may also be displayed in the form of a button, which is not limited in this application.

Step 512: and responding to the import operation on the import control, and displaying the imported multimedia material as the first multimedia material.

The import operation is for a user to import desired multimedia material. The importing operation is to perform the importing of the multimedia material by pressing one or more preset physical keys, and the user can also perform the importing operation by pressing signals generated by releasing, long pressing, clicking, double clicking and/or sliding on the touch screen.

Optionally, in response to a delete operation on the delete control, the first multimedia material is deleted and the imported multimedia material is displayed. In this case, the terminal can directly import and display the multimedia material on the default path without the need of the user to perform an import operation.

Illustratively, as shown in fig. 6, clicking on the import control 601 displays the user interface 62, wherein the multimedia material is changed from the first multimedia material 106 to the first multimedia material 602 in the user interface 62.

Optionally, in steps 510 to 512, a deletion control corresponding to the second multimedia material may be displayed instead; in response to the deleting operation on the deleting control, deleting the second multimedia material and displaying the importing control; and responding to the import operation on the import control, and displaying the imported multimedia material as second multimedia material. The steps 510 to 512 after replacement can be performed together with the original steps 510 to 512.

The user may repeat steps 510 through 512.

Step 509 is not sequential to step 510 to step 512 in timing.

Step 513: in response to a video generation operation, a video having a first video segment and a second video segment is displayed.

The video generation operation is an operation for generating a video. Optionally, the video generation operation is clicking, double-clicking or pressing a video generation control or a video completion control, or clicking, double-clicking or pressing a key of a physical keyboard to generate the video.

Transition animations are also provided between the first video clip and the second video clip. The cut-scene animation is used for connecting the first video clip and the second video clip, so that the picture playing of the video is smoother.

The second video segment is generated based on the second text material and the second multimedia material.

Optionally, a sharing control corresponding to the video with the first video clip is further displayed on the user interface, and the video with the first video clip is sent to other terminals in response to the sharing operation on the sharing control; or, in response to the sharing operation on the sharing control, sending the video with the first video clip to the network space.

Illustratively, as shown in fig. 1, clicking on the generate video control 111 on the user interface 15 displays the user interface 16, wherein the user interface 16 displays a video 112, the video 112 has a first video segment and a second video segment, and a text material 113 is displayed below the video 112, and the text material 113 is obtained from the first text material 105 and the second text material 108. A sharing control 114 is also displayed on the user interface 16, and the user can share the video 112 and the text material 113 with other users through the sharing control 114 or upload the video and the text material to a preset web space.

In conclusion, the embodiment can reduce the learning cost of the user for using the video making software, also saves the process of searching materials by the user, and improves the video making efficiency and the human-computer interaction efficiency.

Moreover, a user can splice a plurality of sections of videos into one section of video, the duration of the video is prolonged, the content of the video is enriched, and the video production efficiency and the human-computer interaction efficiency are further improved.

Moreover, the user can edit the first multimedia material and the first text material, so that the quality of the video is improved, and the content of the video is closer to reality.

As explained above, the present application relates to: a voice recognition process, a keyword extraction process, a multimedia material search process, and a video production process.

Then, each of the four processes can be implemented by a client or a server, and then at least the following possible implementations are possible:

1. the video production process is realized by a client, and the voice recognition process, the keyword extraction process and the multimedia material searching process are realized by a server.

2. The voice recognition process and the video production process are realized by a client, and the keyword extraction process and the multimedia material search process are realized by a server.

3. The voice recognition process, the keyword extraction process, the multimedia material search process and the video production process are all realized by the client.

4. The voice recognition process is realized by the client, and the keyword extraction process, the multimedia material search process and the video production process are realized by the client.

5. The voice recognition process and the keyword extraction process are realized by the client, and the multimedia material searching process and the video production process are realized by the client.

6. The voice recognition process, the keyword extraction process and the video production process are realized by the client, and the multimedia material searching process is realized by the client.

The above-described embodiment 1 will be described with reference to an embodiment shown in fig. 9. The method comprises the following steps:

step 901: and responding to the first recording operation, and recording the first voice content by the terminal.

Step 902: the terminal sends the first voice content to the server.

Step 903: the server receives first voice content.

And the terminal receives the first text material and the first multimedia material replied by the server.

Step 904: and the server performs voice recognition on the first voice content to obtain a first text material.

Step 905: the server extracts keywords from the first text material.

Extracting keywords refers to extracting words in the text material that can express the core idea. A segment of text material contains at least one keyword. Illustratively, the textual material is "how to treat the cold? ", the keywords in this text material are" cold "and" treatment ". There are various methods for extracting keywords, for example, inputting text materials into a keyword extraction neural network, which is used for extracting keywords in the text materials and outputting the keywords, or building a corresponding database according to the relationship between a large number of text materials and keywords, and retrieving keywords from the database.

Step 906: and the server searches and obtains the first multimedia material based on the keyword.

The terminal sends the first text material to the server.

Optionally, the server searches for the first voice content and the first text material by using a search engine to obtain the first multimedia material, or the server queries a local memory to obtain the first multimedia material, where a corresponding relationship between the first voice material and the first multimedia material or a corresponding relationship between the first text material and the first multimedia material is stored in the local memory, which is not limited in this application.

Step 907: the server sends the first text material and the first multimedia material to the terminal.

The terminal receives a first multimedia material replied by the server based on the first text material.

Step 908: the terminal receives a first text material and a first multimedia material which are sent by the server.

The voice recognition means that the first voice content is converted into the first text material, and the two are expressed with the same meaning. Illustratively, the first speech content "how to treat the cold" is identified as the first text material "how to treat the cold".

Step 909: based on the first text material and the first multimedia material, the terminal generates a video.

The video has a first video segment generated from a first text material and a first multimedia material.

Optionally, the recording operation is divided into a plurality of times, and a plurality of video clips are generated, and transition animations between adjacent video clips are automatically generated or set by a user.

Step 910: the terminal displays the video.

The terminal sends the keyword to the server.

Illustratively, as shown in FIG. 1, a video 112 is displayed on the user interface 16.

In summary, the present embodiment provides a method for generating a video, and because the processing capabilities of various terminals are different, the present embodiment places the video production process at the terminal for processing, and when the performance of the terminal is poor, the method provided by the present embodiment can still be implemented, thereby reducing the processing pressure of the terminal.

The above-described embodiment 2 will be described with reference to the embodiment shown in fig. 10. The method comprises the following steps:

the following steps may refer to steps 901 to 910, and the specific steps may be different in implementation subject, but do not affect the specific implementation process.

Step 1001: and responding to the first recording operation, and recording the first voice content by the terminal.

Step 1002: and the terminal performs voice recognition on the first voice content to obtain a first text material.

Step 1003: the terminal sends the first text material to the server.

Step 1004: the server receives the first text material.

Step 1005: the server extracts keywords from the first text material.

Step 1006: based on the keywords, the server searches for a first multimedia material.

Step 1007: the server sends the first multimedia material to the terminal.

Step 1008: the terminal receives a first multimedia material sent by the server.

Step 1009: based on the first text material and the first multimedia material, the terminal generates a video.

Step 1010: the terminal displays the video.

In summary, in the embodiment, the voice recognition process and the video production process are implemented by the client, and the keyword extraction process and the multimedia material search process are implemented by the server. The terminal bears part of the calculation task, so that the pressure of the server can be effectively reduced.

The above two embodiments are easily conceived according to the above 2 embodiments, and are not described in detail herein.

Fig. 11 shows a flowchart of a video generation method according to an exemplary embodiment of the present application. The method is applied to a server, and the method can be executed by the server 240 shown in fig. 2, and the method includes the following steps:

step 1101: the method comprises the steps of obtaining a first text material and a first multimedia material, wherein the first text material is obtained by performing voice recognition on first voice content, the first multimedia material is obtained by searching based on at least one of the first voice content and the first text material, and the first voice content is obtained by first recording operation on a terminal.

The server obtains a first text material and a first multimedia material.

Optionally, the first text material and the first multimedia material are sent to the server through the terminal, or the first text material and the first multimedia material are pre-stored in the server.

Step 1102: a video having a first video segment is generated from the first text material and the first multimedia material, the first video segment being generated based on the first text material and the first multimedia material.

Step 1103: and sending the video with the first video clip to the terminal.

In summary, in this embodiment, the video can be quickly obtained only by recording the voice content and performing simple operations, and the user does not need to spend a lot of time to search for the material and the user does not need to have professional editing knowledge. The method and the device can reduce the learning cost of the user for using the video making software, also avoid the process of searching materials by the user, and improve the efficiency of video making and the human-computer interaction efficiency. The processing process is put into the server to be implemented, so that the computing pressure of the terminal can be reduced, and meanwhile, the processing capacity of the server is generally stronger than that of the terminal, so that the efficiency can be improved.

Illustratively, the service architecture of an exemplary server of the present application is listed, as shown in fig. 12:

the service architecture can be divided into four layers, namely a data access layer 1201, a business logic layer 1202, a data access layer 1203 and a persistence layer 1204. The application accesses the server through a restful (a design style and development mode of the web application) interface in the data access layer 1201.

The data Access layer 1001 includes a restful interface and an Access Server (Access Server). The restful interface is used for accessing an application program, and receiving and responding to the request after the application program is accessed; the access server is used for receiving an http (simple request-response protocol) request, converting the http request into a Grpc (Google remote procedure call, a remote procedure call method developed by Google), and calling a business logic layer.

Business logic layer 1202 is implemented using Golang (a compiled language) to expose Grpc services to upper layers while providing rpc (remote procedure call) calls to internal layers. The business logic layer 1002 includes at least one of voice synthesis, voice conversion, picture search, text synthesis, text summarization, and video generation. The voice synthesis refers to the generation of corresponding voice content from a text material, and is the reverse process of voice conversion; converting the voice into a corresponding text material obtained by the voice content; the picture searching means that corresponding pictures are searched based on voice content and text materials; the image-text synthesis means that a corresponding article is generated according to the image and the text material, wherein the article comprises the image and the text material; the image-text abstract is to extract key words in the text material and form a corresponding abstract; video generation refers to generating video from pictures and voice content. The business logic layer 1202 also includes at least one of service alerts, content alerts, user management, article management, and information configuration. The service alarm is used for detecting whether the service provided by the server has risks; the content alarm is used for judging whether the content provided by the server meets a preset condition or not; the user management provides the user with the authority to manage part or all of the content of the server; the article management is used for managing text materials stored in the server; the information is configured to configure various types of information for the server content. The business logic layer also includes log records. The log records record the history of the server, and are convenient for users and technicians to look up. The service logic layer may also include other contents, which are not specifically limited in this application.

Data access layer 1203 is used to provide methods for accessing data for servers. The data access layer 1203 includes at least one of an Aggregation Pipeline (API call), an Application Programming Interface call (API call), a Mysql connection Pool (Mysql is a relational database management system of open source code, Pool represents a connection Pool for storing various types of connections), and a Redis connection Pool (Redis is an open source database, and Pool is the same as above, connection Pool). The aggregation pipeline is a data aggregation framework based on data processing pipeline concept modeling, and can convert input documents into aggregated results; the application program interface calls an application program corresponding to the calling interface; the Mysql connection pool is used to provide connections to Mysql; the Redis connection pool is used to provide connections to Redis. Optionally, the data access layer 1003 further includes at least one of database interaction, cache interaction, and application program interface interaction.

The persistent layer 1204 is mainly used for storing various types of data. The persistent layer includes at least one of a cluster and a Tencent Cloud distributed file System (Tencent Cloud Operating System, Tencent Cloud COS). Wherein a cluster is a group of mutually independent computers interconnected through a high-speed network, which constitute a group and are managed in a single system mode. When a client interacts with a cluster, the cluster is like an independent server, and the cluster comprises a main node and a slave node; the Tencent cloud distributed file system may provide distributed storage services.

At least one of a penetration testing tool (e.g., Sparta Nginx), a remote procedure call (e.g., Golong grpc, which is a grpc-based programming language), a read instruction (e.g., Crontab instruction), and an open source log component (e.g., Logback) are included in a message queue of the service framework. Wherein, the Nginx penetration testing tool is used for carrying out port scanning; the reading instruction can read the standard instruction of the input equipment and store the standard instruction in a specified file for later reading and execution; the open source log component may store logs.

Illustratively, an exemplary framework for converting speech content into text material is given, as shown in fig. 13:

the architecture includes at least one of a service access layer 1301, a capability combination 1302, a speech recognition base 1303, a corpus 1304, and an external capability 1305.

The service access layer 1301 is used for accessing other external applications, and the service access layer 1301 includes: web Services (a stand-alone application), the restful interface of HyperText transfer protocol, and Software Development Kit (SDK). The Web Services can provide a platform for data interaction or integration; the restful interface of the hypertext transfer protocol can simultaneously support the hypertext transfer protocol interface and the restful interface, so that data interaction is facilitated; the software development kit is used to provide an application program interface for an application program.

The capability combination 1302 combines services provided by the servers, packages the services for a specific purpose, and provides a standard service interface to the outside. The capability combination 1302 includes at least one of a different capability combination call application interface, a speech recognition development application interface, and a semantic understanding development application interface. The different capability combinations call the application program interface to provide the interface which can realize various different services; the speech recognition development application program interface is used for providing an interface for speech recognition; the semantic understanding development application program interface is used for providing an interface for semantic understanding for the outside. The capability combination 1302 may also provide other standard service interfaces, which are not described herein, and this is not specifically limited in this application.

The speech recognition base 1303 refers to various basic services that the server can provide. The speech recognition base 1303 includes at least one of text transcription, voiceprint recognition, recording segmentation, pinyin annotation, and silence detection. The text transcription can transcribe the text material into texts in other formats; the voiceprint recognition is used for recognizing the voiceprint corresponding to the voice content; the recording division is used for dividing the voice content, so that the voice recognition is facilitated; the pinyin label is used for labeling corresponding pinyin on the text material; silence detection is used to determine whether the voice content is in a silent state and determine whether voice recognition is required. The speech recognition base 1303 also includes services for word segmentation, synonyms, labels, parsing, stop words, pinyin retrieval. The word segmentation service is used for judging whether the word segmentation exists in the voice content; the synonym service is used for replacing synonyms for part or all of texts in the text materials; the marking service is used for marking pinyin or annotation in the text materials; the grammar analysis service is used for analyzing whether the grammar of the text material is correct or not and correcting the grammar; the service of pinyin retrieval is for the pinyin for text in the text material.

The corpus part 1304 is used for storing various corpora and services, and the corpus part 1304 includes corpus resources and service resources. The corpus part comprises at least one of a general corpus database, a professional corpus database and a special corpus database. The universal corpus database stores daily used language materials; the professional corpus database stores language materials used in the professional field; the special corpus database stores language materials corresponding to a plurality of special vocabularies. The service resource comprises at least one of a general service database, an account service database and a calling service database. The general service database stores data required by common services, for example, data corresponding to display elements on a page; the account service database stores the account information of the user; the calling service database stores calling interfaces corresponding to various services.

The external capabilities 1305 are used to provide services to other applications outside. The external capabilities 1305 include at least one of service call interface management, user management, access service management, and statistical analysis management. The service calling interface management is used for providing various services which can be realized by the framework, such as individual or combination of services such as text transcription, pinyin annotation and the like, for the outside; user management is used for providing a method for managing users for outside; the access service management is used for accessing the architecture by other external application programs and calling all or part of functions of the architecture; the statistical analysis management is used for counting various types of data of the framework, such as the number of visitors and the number of use times, and analyzing the data to obtain corresponding analysis results.

In this embodiment, the core of the architecture is to modularly distinguish various basic capabilities of the voice big data to be processed, and define various modularized external service interfaces, so that the processing of the voice big data is more oriented to business requirements of an application software system and an analysis system, and values contained in the big data can be fully mined. It should be noted that the semantic understanding technology is also a core technology in big data mining, and in fact, if a simple speech recognition technology is not fully fused with the semantic understanding technology, the effect of the speech big data mining and application will be greatly reduced.

Illustratively, fig. 14 shows an exemplary structure diagram of a backend system structure provided by an exemplary embodiment of the present application.

The background system comprises a front end 1401, a server 1402 and a data end 1403. The background system integrally adopts an LNMP framework (L refers to Linux, a set of common operating systems, N refers to Nginx, a high-performance HTTP and reverse proxy web server, M refers to Mysql database, P refers to PHP, Personal Home Page, a powerful server-side script language for creating dynamic interactive sites).

The front end 1401 is composed of a HyperText Markup Language (HTML), a standard Markup Language for creating a web page, a Cascading Style Sheet (CSS), a computer Language for representing HTML file styles, and a very fast (JQUERY, a fast, small, and functionally rich JavaScript library that can traverse and manipulate HTML documents).

The Server 1402 is composed of a hypertext preprocessor (PHP), NGINX, FFMPEG (a set of open source computer programs that can record, convert digital audio and video, and convert them into streams), LINUX, graphics video processing software (e.g., Adobe After Effects, a piece of graphics video processing software from Adobe corporation), and Windows Server (Windows Server).

The data terminal 1403 is composed of a database (e.g., MYSQL database).

Illustratively, fig. 15 shows a Windows Server architecture provided by an exemplary embodiment of the present application.

In the architecture, a web client (front end) 1501 is connected to a Gateway (Gateway)1505 through an API interface, and is connected to a web client (back end) 1504 through a programming language instruction (page), an application platform 1502 and a mobile end application 1503 are both connected to the web client (back end) 1504 through the API interface, the Gateway 1505 is connected to a conditional random field algorithm manager extensible node 1506(conditional random field algorithm-capable nodes, crf-manager scalable-nodes) through an HTTP API, and the conditional random field algorithm manager extensible node 1506 includes a resource-based network router (crf-manager scalable-nodes), a message push space (socket push, entity), an authentication center/security center (authentication/security node), a storage service, a database, and a file cache (storage, cache) At least one of a broker mq publish, a message queue list (task-queue scheduler). Conditional random field algorithm manager extensible node 1506 is connected to conditional random field algorithm renderer scalable node 1507(conditional random field algorithm scalable-nodes, crf-renderer scalable-nodes) through AMQC remote procedure call (where AMQP refers to Advanced Message Queuing Protocol, an application layer standard high-level Message Queuing Protocol providing unified Message service), conditional random field algorithm renderer node 1507 includes a local task management template, at least one of job/sub-job (local task management templates, jobs/sub-jobs), local configuration topic(s), channel (local configuration topic, channel), event subscription, notification (event management subsystem), renderer adapter layer multi-engine support (rendering adapter layer subsystem). The window server architecture further includes a KV buffer 1508 (KV refers to a design of computer Cache) and a Message Queue 1509(Message Queue). The windows server architecture further includes an underlying device deployment 1510(deployment architecture), wherein the underlying device deployment 1510 includes at least one of persistent Integration compilation, test, creation (content Integration line, test, build, CI line, test, build), temporary version control and creation (version control configuration), tools for defining and running applications (e.g., docker composition), a Cloud server (qcloud service cvm), and a load balancing monitor (load balance monitor).

Illustratively, fig. 16 shows a flowchart of a video composition method according to an exemplary embodiment of the present application. The method comprises the following steps:

step 1601: the first multimedia material, the first speech content and the first text material are standardized.

The first multimedia material, the first speech content and the first text material are standardized by means of FFMPEG. The processing content includes at least one of a size and an encoding format.

Step 1602: and sending the first multimedia material, the first voice content and the first text material to a window server.

The first multimedia material, the first speech content and the first text material are sent to a video composition service on a windows server in JSON format.

Step 1603: the video with the first video clip is returned by the window server.

The video synthesis service on the window server analyzes the JSON content, extracts the first multimedia material, the first voice content, the first text material and the video configuration content, and synthesizes the first multimedia material, the first voice content and the first text material into a video of the first video segment through an After Effect interface.

Alternatively, a video with a first video segment is generated by a command execution program aeender (a command line execution program of the After effect, a video production tool of Adobe corporation) of the After effect (a professional software for generating a video), and calling a template xml (extensible markup language) to add a corresponding special effect to the video.

Illustratively, the start command is "aeronder-project test. aepx-comp" test "-RStemplate" test _1 "-Omtemplate" test _2 "-output test. mov.

The specific parameters are explained as follows:

project represents that the current engineering template file is test.aepx;

the parameter comp indicates that the synthesizer name used for this synthesis is test;

the parameter RStemplate indicates that the name of the rendering template is test _ 1;

the parameter Omtemplate represents that the name of the video output template is test _ 2;

the parameter output indicates the output video name is test.

Optionally, a multi-segment effect can be added in the After effect template, and the generated video is subjected to repeated overlapping effect processing by calling aeender in a chain loop mode.

Step 1604: a video having a first video segment is transcoded.

And transcoding the video with the first video segment through FFMPEG to generate a format which is suitable for each terminal.

Optionally, the video with the first video segment is processed by FFMPEG encoding, audio integration, and subtitle integration.

Optionally, after the aerender processing, the effect content is supplemented by FFMPEG processing again, for example, adding slice headers, adding slice trailers, adding sound, video coding, etc.

Step 1605: and sending the transcoded video to a terminal.

In summary, the present embodiment provides an exemplary way to implement video composition, which provides a technical possibility. The video can be quickly obtained only by recording voice content and performing simple operation by the user, the user does not need to spend a large amount of time to search materials, and the user does not need to have professional editing knowledge. The method and the device can reduce the learning cost of the user for using the video making software, also avoid the process of searching materials by the user, and improve the efficiency of video making and the human-computer interaction efficiency.

Fig. 17 shows a block diagram of a video compositing apparatus provided in an exemplary embodiment of the present application, the apparatus 1700 includes:

a recording module 1701 for recording a first voice content in response to a first recording operation;

a display module 1702, configured to display a first text material and a first multimedia material corresponding to the first voice content, where the first text material is obtained by performing voice recognition on the first voice content, and the first multimedia material is obtained by performing search based on at least one of the first voice content and the first text material;

in an optional design of the present application, the display module 1702 is further configured to display the first multimedia material corresponding to the first speech recognition content after recognizing the first speech recognition content in a recording process; after the second voice recognition content is recognized, displaying the updated first multimedia material; the updated first multimedia material corresponds to the second speech recognition content, or the updated first multimedia material corresponds to the first speech recognition content and the second speech recognition content.

In an optional design of the present application, the display module 1702 is further configured to display the first multimedia material corresponding to the first speech recognition content after recognizing the first speech recognition content in a recording process; and after the second voice recognition content is recognized, displaying the first multimedia material corresponding to the second voice recognition content.

The display module 1702 is further configured to display a video having a first video segment in response to the video generation operation, the first video segment being generated based on the first text material and the first multimedia material.

In an optional design of the present application, the display module 1702 is further configured to display the first text material obtained by performing speech recognition on the first speech content in a recording process.

In an optional design of the present application, the display module 1702 is further configured to display the first multimedia material corresponding to the first voice content after the recording is finished.

In an optional design of the present application, the display module 1702 is further configured to display a video production interface, where the video production interface includes a recording control.

The recording module 1701 is further configured to record the first voice content in response to a first recording operation on the recording control.

In an optional design of the present application, the display module 1702 is further configured to display the edited first text material in response to an editing operation on the first text material; wherein the editing operation comprises: at least one of a text adding operation, a text deleting operation, a text searching operation, a text modifying operation, a text replacing operation, a text moving operation and a format changing operation.

In an optional design of the present application, the display module 1702 is further configured to display a replacement control corresponding to the first multimedia material.

The display module 1702 is further configured to, in response to the replacing operation on the replacing control, display alternative multimedia material as the first multimedia material, where the alternative multimedia material is obtained by searching based on at least one of the voice content and the first text material.

In an optional design of the present application, the display module 1702 is further configured to display a deletion control corresponding to the first multimedia material; responding to the deleting operation on the deleting control, deleting the first multimedia material and displaying an importing control; and responding to the import operation on the import control, and displaying the imported multimedia material as the first multimedia material.

In an alternative design of the present application, the recording module 1701 is further configured to record a second voice content in response to a second recording operation.

The display module 1702 is further configured to display a second text material and a second multimedia material corresponding to the second voice content, where the second text material is obtained by performing voice recognition on the second voice content, and the second multimedia material is obtained by performing search based on at least one of the voice content and the second text material.

The display module 1702 is further configured to display a video having a first video segment and a second video segment in response to the video generating operation, the second video segment being generated based on the second text material and the second multimedia material.

In an optional design of the present application, the display module 1702 is further configured to display a sharing control corresponding to the video with the first video segment.

The apparatus 1700 further comprises:

a communication module 1703, configured to send the video with the first video clip to another terminal in response to the sharing operation on the sharing control; or, in response to a sharing operation on the sharing control, sending the video with the first video clip onto a web space.

In an optional design of the present application, the communication module 1703 is further configured to send the first voice content to a server.

The communication module 1703 is further configured to receive the first text material and the first multimedia material replied by the server.

In an alternative design of the present application, the apparatus 1700 further includes:

a recognition module 1704, configured to perform speech recognition on the first speech content to obtain the first text material.

The communication module 1703 is further configured to send the first text material to a server.

The communication module 1703 is further configured to receive the first multimedia material replied by the server based on the first text material.

In an optional design of the present application, the recognition module 1704 is further configured to perform speech recognition on the first speech content to obtain the first text material; extracting keywords from the first text material.

The communication module 1703 is further configured to send the keyword to a server; and receiving the first multimedia material replied by the server based on the keyword.

Fig. 18 shows a block diagram of a video compositing apparatus provided by an exemplary embodiment of the present application, the apparatus 1800 comprising:

an obtaining module 1801, configured to obtain a first text material and a first multimedia material, where the first text material is obtained by performing voice recognition on a first voice content, and the first multimedia material is obtained by performing search based on at least one of the first voice content and the first text material; the first voice content is obtained by recording a first recording operation on the terminal;

a synthesis module 1802 for generating a video having a first video segment from the first text material and a first multimedia material, the first video segment being generated based on the first text material and the first multimedia material;

a sending module 1803, configured to send the video with the first video segment to the terminal.

In an optional design of the present application, the obtaining module 1801 is further configured to receive the first voice content sent by the terminal.

The synthesis module 1802 is further configured to perform speech recognition on the first speech content to obtain the first text material; extracting keywords from the first text material; and searching and obtaining the first multimedia material based on the keyword.

In an optional design of the present application, the obtaining module 1801 is further configured to receive the first text material sent by the terminal, where the first text material is obtained by performing voice recognition on the first voice content by the terminal.

The synthesis module 1802 is further configured to extract the keyword from the first text material; and searching and obtaining the first multimedia material based on the keyword.

Fig. 19 is a schematic diagram illustrating a configuration of a server according to an example embodiment. The server 1900 includes a Central Processing Unit (CPU) 1901, a system Memory 1904 including a Random Access Memory (RAM) 1902 and a Read-Only Memory (ROM) 1903, and a system bus 1905 connecting the system Memory 1904 and the CPU 1901. The server 1900 also includes a basic Input/Output system (I/O system) 1906 for facilitating information transfer between devices within the computer device, and a mass storage device 1907 for storing an operating system 1913, application programs 1914, and other program modules 1915.

The basic input/output system 1906 includes a display 1908 for displaying information and an input device 1909, such as a mouse, keyboard, etc., for user input of information. Wherein the display 1908 and input device 1909 are coupled to the central processing unit 1901 through an input-output controller 1910 coupled to the system bus 1905. The basic input/output system 1906 may also include an input/output controller 1910 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, input-output controller 1910 also provides output to a display screen, a printer, or other type of output device.

The mass storage device 1907 is connected to the central processing unit 1901 through a mass storage controller (not shown) connected to the system bus 1905. The mass storage device 1907 and its associated computer-device readable media provide non-volatile storage for the server 1900. That is, the mass storage device 1907 may include a computer device-readable medium (not shown) such as a hard disk or Compact Disc-Only Memory (CD-ROM) drive.

Without loss of generality, the computer device readable media may comprise computer device storage media and communication media. Computer device storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer device readable instructions, data structures, program modules or other data. Computer device storage media includes RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), CD-ROM, Digital Video Disk (DVD), or other optical, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that the computer device storage media is not limited to the foregoing. The system memory 1904 and mass storage device 1907 described above may be collectively referred to as memory.

The server 1900 may also operate with remote computer devices connected to a network through a network, such as the internet, according to various embodiments of the present disclosure. That is, the server 1900 may be connected to the network 1911 through the network interface unit 1912 connected to the system bus 1905, or may be connected to other types of networks or remote computer device systems (not shown) using the network interface unit 1912.

The memory further includes one or more programs, the one or more programs are stored in the memory, and the central processor 1901 implements all or part of the steps of the video synthesis method by executing the one or more programs.

In an exemplary embodiment, a computer readable storage medium is further provided, in which at least one instruction, at least one program, a set of codes, or a set of instructions is stored, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by a processor to implement the video composition method provided by the above-mentioned various method embodiments.

The present application further provides a computer-readable storage medium, in which at least one instruction, at least one program, a code set, or a set of instructions is stored, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by a processor to implement the video synthesis method provided by the above-mentioned method embodiment.

Optionally, the present application also provides a computer program product or a computer program comprising computer instructions stored in a computer readable storage medium. The controller reads the computer instructions from the computer-readable storage medium, and the controller executes the computer instructions to enable the display device to execute the video generation method.

The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims

1. A method of video generation, the method comprising:

responding to the first recording operation, and recording first voice content;

2. The method of claim 1, wherein displaying the first text material corresponding to the first speech content comprises:

and in the recording process, displaying the first text material obtained by performing voice recognition on the first voice content.

3. The method of claim 1, wherein said displaying a first multimedia material corresponding to said first speech content comprises:

and displaying the first multimedia material corresponding to the first voice content after the recording is finished.

4. The method of claim 1, wherein the first text material includes first speech recognized content and second speech recognized content, the first speech recognized content being recognized before the second speech recognized content;

the displaying of the first multimedia material corresponding to the first voice content includes:

after the first voice recognition content is recognized in the recording process, displaying the first multimedia material corresponding to the first voice recognition content;

after the second voice recognition content is recognized, displaying the updated first multimedia material;

the updated first multimedia material corresponds to the second speech recognition content, or the updated first multimedia material corresponds to the first speech recognition content and the second speech recognition content.

5. The method of claim 1, wherein the first text material includes first speech recognized content and second speech recognized content, the first speech recognized content being recognized before the second speech recognized content;

and after the second voice recognition content is recognized, displaying the first multimedia material corresponding to the second voice recognition content.

6. The method according to any one of claims 1 to 5,

the first multimedia material comprises: picture materials; video frames of the first video clip are generated based on the picture material, and audio frames of the first video clip are generated based on the voice content;

or the like, or, alternatively,

the first multimedia material comprises: video materials; video frames of the first video segment are generated based on the video material, and audio frames of the first video segment are generated based on the speech content;

or the like, or, alternatively,

the first multimedia material comprises: audio materials; audio frames of the first video segment are generated based on the speech content and the audio material.

7. The method of any of claims 1 to 5, further comprising:

responding to the editing operation on the first text material, and displaying the edited first text material;

wherein the editing operation comprises: at least one of a text adding operation, a text deleting operation, a text searching operation, a text modifying operation, a text replacing operation, a text moving operation and a format changing operation.

8. The method of any of claims 1 to 5, further comprising:

displaying a replacement control corresponding to the first multimedia material;

and responding to the replacement operation on the replacement control, and displaying alternative multimedia materials as the first multimedia materials, wherein the alternative multimedia materials are obtained by searching based on at least one of the voice content and the first text material.

9. The method of any of claims 1 to 5, further comprising:

displaying a deletion control corresponding to the first multimedia material;

responding to the deleting operation on the deleting control, deleting the first multimedia material and displaying an importing control;

and responding to the import operation on the import control, and displaying the imported multimedia material as the first multimedia material.

10. The method of any of claims 1 to 5, further comprising:

responding to a second recording operation, and recording second voice content;

displaying a second text material and a second multimedia material corresponding to the second voice content, wherein the second text material is obtained by performing voice recognition on the second voice content, and the second multimedia material is obtained by searching based on at least one of the voice content and the second text material;

the displaying a video having a first video segment in response to a video generation operation includes:

in response to a video generation operation, displaying a video having a first video segment and a second video segment, the second video segment generated based on the second text material and the second multimedia material.

11. A video generation method is applied to a server, and the method comprises the following steps:

acquiring a first text material and a first multimedia material, wherein the first text material is obtained by performing voice recognition on first voice content, and the first multimedia material is obtained by searching based on at least one of the first voice content and the first text material; the first voice content is obtained by recording a first recording operation on the terminal;

generating a video having a first video clip from the first text material and a first multimedia material, the first video clip being generated based on the first text material and the first multimedia material;

transmitting the video with the first video clip to the terminal.

12. A video generation apparatus, characterized in that the apparatus comprises:

13. A video generation apparatus, characterized in that the apparatus comprises:

14. A computer device, characterized in that the computer device comprises: a processor and a memory, the memory storing a computer program that is loaded and executed by the processor to implement the video generation method of any of claims 1 to 11.

15. A computer-readable storage medium, characterized in that it stores a computer program which is loaded and executed by a processor to implement the video generation method according to any one of claims 1 to 11.