
CN119626203A - Character dubbing method, device, electronic device and storage medium - Google Patents


Info

Publication number
CN119626203A
Authority
CN
China
Prior art keywords
dubbing
target
dubbed
text information
tone
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311190066.7A
Other languages
Chinese (zh)
Inventor
薛鹤洋
张晴
王平
胡志伟
毕梦霄
朱鹏程
郭帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN202311190066.7A
Publication of CN119626203A
Legal status: Pending


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Stored Programmes (AREA)

Abstract


The present application provides a character dubbing method, device, electronic device and storage medium, relating to the field of speech processing. The method determines a character and the dubbing person corresponding to that character, so that a target dubbing timbre matching the text information to be dubbed can be determined based on the identification of the dubbing person and the emotion type of the character's text information to be dubbed. Speech synthesis is then performed on the text information to be dubbed using the target dubbing timbre, automatically realizing text-to-speech conversion and producing the dubbing audio corresponding to the text information to be dubbed, which effectively improves dubbing efficiency at low cost. Moreover, once the dubbing person is determined, that person's candidate timbres can be obtained, and according to the emotion type of the text information to be dubbed, the target dubbing timbre that best matches it can be selected from the candidates, so that the final dubbing audio sounds better and matches the character more closely.

Description

Role dubbing method and device, electronic equipment and storage medium
Technical Field
The application relates to the technical field of voice processing, in particular to a role dubbing method, a device, electronic equipment and a storage medium.
Background
Game character dubbing is the dubbing of characters in a game. Whether a character is a main character or a character in the plot, the game must be advanced through dialogue. The purpose of game character dubbing is therefore to make the characters more vivid, so that players can become immersed in the game quickly and the development of the game plot feels more natural.
At present, game characters are usually dubbed by real voice actors, with each line recorded one by one, which is enormously expensive in both money and time.
Disclosure of Invention
The application aims to provide a role dubbing method, a role dubbing device, electronic equipment and a storage medium, so as to improve dubbing efficiency and reduce dubbing cost.
In order to achieve the above purpose, the technical scheme adopted by the embodiment of the application is as follows:
In a first aspect, an embodiment of the present application provides a role dubbing method, including:
determining a target role and a target dubbing person;
determining text information to be dubbed corresponding to the target role;
determining a target dubbing tone according to the target dubbing person and the emotion type corresponding to the text information to be dubbed;
and converting the text information to be dubbed according to the target dubbing tone to generate initial dubbing audio.
In a second aspect, the embodiment of the application also provides a role dubbing device, which comprises a determining module and a processing module;
the determining module is used for determining a target role and a target dubbing person;
the determining module is further used for determining text information to be dubbed corresponding to the target role;
the determining module is further used for determining a target dubbing tone according to the target dubbing person and the emotion type corresponding to the text information to be dubbed;
the processing module is used for converting the text information to be dubbed according to the target dubbing tone to generate initial dubbing audio.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a storage medium, and a bus, the storage medium storing machine-readable instructions executable by the processor, the processor communicating with the storage medium over the bus, the processor executing the machine-readable instructions to perform a role dubbing method as provided in the first aspect when the electronic device is running.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs a character dubbing method as provided in the first aspect.
The beneficial effects of the application are as follows:
The application provides a character dubbing method, device, electronic device and storage medium. By determining a character and the dubbing person corresponding to that character, a target dubbing timbre matching the text information to be dubbed can be determined based on the identification of the dubbing person and the emotion type of the character's text information to be dubbed. Speech synthesis is then performed on the text information to be dubbed according to the target dubbing timbre, automatically realizing text-to-speech conversion and obtaining the dubbing audio corresponding to the text information to be dubbed, which effectively improves dubbing efficiency at low cost. Moreover, once the dubbing person is determined, that person's candidate timbres can be obtained, and according to the emotion type of the text information to be dubbed, the target dubbing timbre that best matches it can be selected from the candidates, so that the final dubbing audio sounds better, matches the character more closely, and makes the character more lifelike.
In addition, through the character dubbing editing interface provided by the embodiment, re-optimization adjustment of the generated dubbing audio can be realized, and dubbing modification operation with multiple dimensions is provided, so that the generated dubbing audio has controllability, and the accuracy of the generated dubbing audio can be improved through dubbing optimization.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a role dubbing method provided by an embodiment of the present application;
fig. 2 is a flow chart of another role dubbing method provided by the embodiment of the application;
Fig. 3 is a flow chart of another role dubbing method according to an embodiment of the present application;
Fig. 4 is a schematic diagram of a role dubbing editing interface provided by an embodiment of the present application;
Fig. 5 is a flow chart of another role dubbing method provided by the embodiment of the application;
fig. 6 is a schematic diagram of another character dubbing editing interface according to an embodiment of the present application;
Fig. 7 is a flow chart of another role dubbing method according to an embodiment of the present application;
Fig. 8 is a schematic diagram of a role dubbing apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described with reference to the accompanying drawings in the embodiments of the present application, and it should be understood that the drawings in the present application are for the purpose of illustration and description only and are not intended to limit the scope of the present application. In addition, it should be understood that the schematic drawings are not drawn to scale. A flowchart, as used in this disclosure, illustrates operations implemented according to some embodiments of the present application. It should be understood that the operations of the flow diagrams may be implemented out of order and that steps without logical context may be performed in reverse order or concurrently. Moreover, one or more other operations may be added to or removed from the flow diagrams by those skilled in the art under the direction of the present disclosure.
In addition, the described embodiments are only some, but not all, embodiments of the application. The components of the embodiments of the present application generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the application, as presented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present application.
It should be noted that the term "comprising" will be used in embodiments of the application to indicate the presence of the features stated hereafter, but not to exclude the addition of other features.
Character dubbing appears in both film and television works and game scenes, and serves to make characters more vivid. This is especially true of game character dubbing: whether a character is a main character in the game or a character in the plot, the game must be advanced through dialogue, so the purpose of game character dubbing is to make the characters more lifelike, allow players to become immersed in the game quickly, and make the development of the game plot more natural.
In real game development, hiring people to dub a game is expensive. Real-person dubbing is also inefficient: only a small amount of dubbing can be recorded in one day, so when a plot framework and game flow must be built quickly in the early stage of development, real-person dubbing is a difficult choice in both time and money. Moreover, many games contain a large number of unimportant NPC (non-player character) roles, and the cost rises further if professional dubbing actors are engaged to record them.
With the development of the voice synthesis technology, the current voice synthesis technology can achieve high expressive force and high natural synthesis expression, and the game voice dubbing editing tool combining the voice synthesis technology and game dubbing can greatly reduce the time and money cost in the game development process.
Based on the above, the present application provides a character dubbing method. Based on the identification of the dubbing person that the user matches to a character, and according to the emotion type corresponding to the character's text information to be dubbed, a target dubbing timbre matching the text information to be dubbed can be determined, so that target dubbing audio corresponding to the text information to be dubbed is generated with the target dubbing timbre using speech synthesis technology. This character dubbing method greatly improves dubbing efficiency at low cost.
Fig. 1 is a flow chart of a role dubbing method provided by an embodiment of the present application, where an execution subject of the method may be a terminal device or a server interacting with the terminal device. A character dubbing editing interface may be displayed on the terminal device, for providing a dubbing editing entry for the user. As shown in fig. 1, the method may include:
S101, determining a target role and a target dubbing person.
Optionally, the user may upload a script to be dubbed through the script import entry of the character dubbing editing interface, or type it in directly; the script to be dubbed may contain a large number of lines. In some embodiments, each line in the script is already matched with a corresponding character, and after the script is imported, each line and its character information can be displayed in the character dubbing editing interface. In other embodiments, the lines have no pre-matched characters; after the script is imported, the user can create corresponding characters for the different lines, and each line and its character information are then displayed in the character dubbing editing interface.
Optionally, the user may determine a target character from the characters, determine a target dubbing person for it from the candidate dubbing persons, and establish a binding relationship between the two, so that the lines associated with the target character are dubbed with the target dubbing person's voice.
In one implementation, the user may click the identification of a character among the characters matched with the displayed lines, such as the character's avatar or name, to determine the target character. A dubbing-person list may then be displayed on the character dubbing editing interface, and the user can locate the target dubbing person in the list and click that person's identification to determine the target dubbing person.
In another implementation, a search entry may be provided in the character dubbing editing interface, so that the user can determine the target character from all characters by entering the target character's identification, and likewise determine the target dubbing person from all candidate dubbing persons by entering that person's identification. Alternatively, a character-feature tag matching the desired dubbing effect of the target character may be entered, in order to search for a target dubbing person able to produce that effect.
Of course, the determination is not limited to the two methods provided above.
S102, determining text information to be dubbed corresponding to the target role.
As described above, whether a character was associated with each line in the script in advance or created for the lines later, every line in the script has an associated character; that is, there is a correspondence between characters and lines. After the target character is determined, the lines corresponding to it can therefore be determined according to this correspondence, and a line here is the text information to be dubbed. One character may correspond to multiple pieces of text information to be dubbed.
S103, determining the target dubbing tone according to the emotion types corresponding to the target dubbing personnel and the text information to be dubbed.
This embodiment is described taking any one piece of the character's text information to be dubbed as an example; in practice, every piece of text information to be dubbed can be processed in the same way.
Optionally, the timbre of each dubbing person differs under different emotion types; that is, one dubbing person can correspond to multiple timbres, and different emotion types can correspond to specific timbres. A character may also speak each line with a different emotion. Therefore, to improve the authenticity of the character dubbing and better fit the context, the target dubbing timbre can be determined according to the dubbing person and the emotion type corresponding to the text information to be dubbed.
According to the determined target dubbing personnel, each candidate tone of the target dubbing personnel can be determined, and according to the emotion type corresponding to the text information to be dubbed, the target dubbing tone matched with the text information to be dubbed can be determined from the candidate tone.
S104, converting the text information to be dubbed according to the target dubbing tone color, and generating initial dubbing audio.
Optionally, based on the determined target dubbing tone, a speech synthesis technology may be used to perform text-to-speech conversion processing on the text information to be dubbed, and convert the text information to be dubbed into initial dubbing audio after dubbing with the target dubbing tone.
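The flow of steps S101 to S104 can be sketched as follows. This is a minimal illustration only: the timbre table, the function names and the stubbed synthesis call are hypothetical, not the patent's actual implementation.

```python
# Pre-built table mapping (dubbing person, emotion type) -> timbre id.
# All identifiers here are illustrative assumptions.
TIMBRES = {
    ("actor_a", "happy"): "actor_a_happy",
    ("actor_a", "neutral"): "actor_a_neutral",
}

def select_timbre(person: str, emotion: str) -> str:
    """S103: match the line's emotion type to one of the person's timbres."""
    # Fall back to the person's neutral timbre if no emotion-specific one exists.
    return TIMBRES.get((person, emotion), TIMBRES[(person, "neutral")])

def synthesize(text: str, timbre: str) -> dict:
    """S104: stand-in for a real TTS engine call; returns a stub audio record."""
    return {"content": text, "timbre": timbre}

def dub(person: str, line: str, emotion: str) -> dict:
    """S101/S102 (character, person and line already resolved) + S103 + S104."""
    return synthesize(line, select_timbre(person, emotion))

audio = dub("actor_a", "The weather today is really nice!", "happy")
```

In this sketch the timbre lookup is a plain dictionary; the patent's method allows any pre-established mapping between dubbing persons, emotion types and timbres.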
In summary, in the character dubbing method provided by this embodiment, by determining a character and the dubbing person corresponding to it, the target dubbing timbre matching the text information to be dubbed can be determined based on the dubbing person and the emotion type of the character's text information to be dubbed. Speech synthesis is then performed on the text information to be dubbed according to the target dubbing timbre, automatically realizing text-to-speech conversion and obtaining the dubbing audio corresponding to the text information to be dubbed, which effectively improves dubbing efficiency at low cost. Moreover, once the dubbing person is determined, that person's candidate timbres can be obtained, and according to the emotion type of the text information to be dubbed, the target dubbing timbre that best matches it can be selected from the candidates, so that the final dubbing audio sounds better, matches the character more closely, and makes the character more lifelike.
Fig. 2 is a flow chart of another character dubbing method according to an embodiment of the present application, optionally, in step S103, determining a target dubbing timbre according to emotion types corresponding to target dubbing personnel and text information to be dubbed may include:
S201, determining at least one tone corresponding to the target dubbing person according to the target dubbing person and the pre-established mapping relation between each candidate dubbing person and the at least one tone corresponding to each candidate dubbing person, wherein each tone corresponds to one emotion type.
In general, the correspondence among dubbing persons, emotion types and timbres can be constructed in advance and stored. For example, each candidate dubbing person reads (records) a predetermined text in each of several given emotion types, producing audio of that person under the different emotion types; the timbre information is contained in the audio.
Based on the above, the mapping relation between each candidate dubbing person and each tone corresponding to each candidate dubbing person can be constructed, and the mapping relation between each tone and emotion type of each candidate dubbing person can be constructed.
Then, based on the determined target dubbing person, each timbre corresponding to the target dubbing person can be determined.
Timbre refers to the distinctive character a sound always shows in its waveform; different vibrating objects each produce their own characteristics. The emotion types here may include various natural emotions such as happiness, sadness and anger.
S202, determining the dubbing tone of the target dubbing person under the emotion type according to the emotion type corresponding to the text information to be dubbed and the corresponding relation between each tone corresponding to each candidate dubbing person and the emotion type constructed in advance.
The context a character faces may differ from line to line, and so may the emotion type. To improve the dubbing effect, for each individual piece of text information to be dubbed, the timbre corresponding to its emotion type can be determined from among the target dubbing person's timbres according to the emotion type of that piece of text.
S203, determining the dubbing tone of the target dubbing person under the emotion type as the target dubbing tone.
Optionally, the determined timbre of the target dubbing person under the emotion type corresponding to the text information to be dubbed can be determined as the target dubbing timbre.
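Steps S201 to S203 rely on two pre-constructed mappings: dubbing person to timbres, and timbre to emotion type. A toy sketch of the lookup, with all identifiers assumed for illustration:

```python
# Two pre-constructed mappings per S201/S202; identifiers are illustrative.
PERSON_TO_TIMBRES = {"actor_a": ["t1", "t2", "t3"]}
TIMBRE_TO_EMOTION = {"t1": "happy", "t2": "sad", "t3": "neutral"}

def target_dubbing_timbre(person: str, line_emotion: str) -> str:
    # S201: look up the timbres of the target dubbing person.
    timbres = PERSON_TO_TIMBRES[person]
    # S202/S203: select the timbre whose emotion type matches the line's.
    for t in timbres:
        if TIMBRE_TO_EMOTION[t] == line_emotion:
            return t
    raise KeyError(f"{person} has no timbre for emotion {line_emotion!r}")
```

Keeping the timbre-to-emotion correspondence separate from the person-to-timbre mapping mirrors the two relations the text describes, and lets several persons share the same emotion labels.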
When the target dubbing tone is adopted to dub the text information to be dubbed, the obtained dubbing audio is more true in emotion, the dubbing effect is more natural and smooth, and the character is more accordant.
Fig. 3 is a flow chart of another character dubbing method according to an embodiment of the present application, optionally, in step S104, converting text information to be dubbed according to a target dubbing timbre, generating initial dubbing audio may include:
S301, converting text information to be dubbed into a phoneme sequence.
In some embodiments, when converting text information into audio, the input text information to be dubbed may first be converted into a phoneme sequence. Phonemes are the minimum phonetic units divided according to the natural properties of speech; analyzed by the pronunciation actions within a syllable, one action constitutes one phoneme. Phonemes fall into two major classes, vowels and consonants. For example, the Mandarin syllable "啊" (ā) has only one phoneme, "爱" (ài) has two phonemes, and "代" (dài) has three phonemes. Converting the text information to be dubbed into a phoneme sequence facilitates audio processing by a computer, allows each pronunciation in the text to be analyzed accurately, and improves the text-to-speech conversion effect.
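As a toy illustration of this grapheme-to-phoneme step, using the three example syllables above (the hand-rolled table below is an assumption for illustration; a production system would use a full G2P model or lexicon):

```python
# Toy grapheme-to-phoneme table covering only the examples in the text.
# A real system would use a trained G2P model; this table is illustrative.
G2P = {
    "啊": ["a"],            # ā: one phoneme
    "爱": ["a", "i"],       # ài: two phonemes
    "代": ["d", "a", "i"],  # dài: three phonemes
}

def text_to_phonemes(text: str) -> list:
    """Flatten each character's phonemes into one sequence for synthesis."""
    seq = []
    for ch in text:
        seq.extend(G2P.get(ch, []))  # skip characters outside the toy table
    return seq
```

The resulting flat phoneme sequence is what the synthesizer consumes in step S302.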
S302, performing text-to-speech conversion processing on the phoneme sequence according to the target dubbing timbre, and generating initial dubbing audio corresponding to the text information to be dubbed, wherein the initial dubbing audio has the target dubbing timbre.
Text-to-speech conversion is performed with a conventional speech synthesis technique; during synthesis, the target dubbing timbre is used, so that the text information to be dubbed is converted into initial dubbing audio carrying the target dubbing timbre. That is, the audio content of the initial dubbing audio is the text information to be dubbed, and the initial dubbing audio has the target dubbing timbre.
Optionally, after converting the text information to be dubbed according to the target dubbing timbre and generating the initial dubbing audio in step S104, the method further includes: in response to an input dubbing modification instruction for a target character in the text information to be dubbed, performing dubbing modification on the target character in the initial dubbing audio to generate optimized dubbing audio corresponding to the text information to be dubbed. The dubbing modification instruction includes at least one of a pronunciation modification instruction, an interjection insertion instruction, a pitch modification instruction, and a duration modification instruction.
In this embodiment, some dubbing modification methods are further provided, where the initial dubbing audio generated by the above method may not reach the user's expectation, and the user may further perform the audio modification operation through the character dubbing editing interface, so as to obtain the ideal dubbing audio.
Fig. 4 is a schematic diagram of a character dubbing editing interface provided by an embodiment of the present application. As shown in fig. 4, the interface takes a piece of text information to be dubbed as the unit, and each piece is displayed independently. First, text regularization may be performed on the text information to be dubbed, dividing it into individual characters; each character is then an independent unit that can receive an independent dubbing modification.
The purpose of text regularization is to normalize the non-pronounced symbols in the text information to be dubbed: symbols such as ellipses and title marks have no independent pronunciation and can be regularized away.
As shown in fig. 4, at least one dubbing modification operation option may be displayed under each piece of text information to be dubbed, including but not limited to pronunciation modification, insert interjection on the left, insert interjection on the right, pitch modification, duration modification, and so on.
In addition, each piece of text information to be dubbed can correspondingly display a play control, and initial dubbing audio corresponding to the text information to be dubbed can be played by clicking the play control. After play, the user may select a character from which the pronunciation is unsatisfactory for individual modification.
For example, suppose the text information to be dubbed in fig. 4 is "The weather today is really nice!". If the user is not satisfied with the dubbing audio for "nice", they can select "nice" and choose one of the dubbing modification operation options displayed below it, thereby inputting a dubbing modification instruction for the target character "nice".
When the dubbing modification operation option selected by the user is pronunciation modification, the correspondingly input dubbing modification instruction is a pronunciation modification instruction.
Note that the interjection insertion instructions include inserting on the left side and inserting on the right side; they are used to insert interjection words, such as "ya", "wa" and "ai", to the left or right of the character.
Pronunciation modification corrects the pronunciation of polyphonic characters, since some polyphones may be pronounced incorrectly when the initial dubbing audio is generated.
Pitch modification changes the pitch of a character's pronunciation, and duration modification changes its length.
The dubbing modification instructions described above may all be input and modified based on the various dubbing modification operation options presented below each piece of text information to be dubbed in fig. 4.
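The per-character modification instructions could be represented as simple records and applied to the line before re-synthesis. A hedged sketch; the record fields and instruction names are assumptions, not the patent's data model:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DubbingModification:
    """Hypothetical record for one dubbing modification instruction."""
    char_index: int              # which character in the line is targeted
    kind: str                    # "pronunciation", "insert_left", "insert_right",
                                 # "pitch", or "duration"
    value: Optional[str] = None  # e.g. corrected pinyin or inserted interjection

def apply_insertions(text: str, mods: list) -> str:
    """Apply only the interjection-insertion instructions to the line text."""
    chars = list(text)
    # Apply right-to-left so earlier indices stay valid after each insertion.
    for m in sorted(mods, key=lambda m: m.char_index, reverse=True):
        if m.kind == "insert_left":
            chars.insert(m.char_index, m.value)
        elif m.kind == "insert_right":
            chars.insert(m.char_index + 1, m.value)
    return "".join(chars)
```

Pronunciation, pitch and duration instructions would analogously be consumed by the synthesizer rather than by text editing; they are listed here only to show one possible uniform representation.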
Of course, in practical applications, the available dubbing modification options are not limited to the ones listed in the figures or the text.
Optionally, after converting the text information to be dubbed according to the target dubbing timbre and generating the initial dubbing audio in step S104, the method further includes: in response to input adjustment operations on the emotion intensity and speech rate of the initial dubbing audio, generating optimized dubbing audio corresponding to the text information to be dubbed.
In other embodiments, in addition to the above-mentioned dubbing modification, overall modification of dubbing audio is provided, and as shown in fig. 4, emotion controls, emotion intensity controls, speech rate controls, and the like may be displayed on the right side of the interface.
Through the emotion control, the user can input an emotion adjustment operation on the generated initial dubbing audio; emotion here refers to the emotion type.
Similarly, based on the emotion intensity control, the emotion intensity of the initial dubbing audio can be adjusted by sliding the control in the horizontal direction. Based on the speech speed control, the control is slid in the horizontal direction, so that the pronunciation speech speed of the initial dubbing audio can be adjusted.
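The whole-line controls could reduce to a small set of prosody parameters handed to the synthesizer. A hedged sketch; the parameter names and value ranges are assumptions, not specified by the patent:

```python
from dataclasses import dataclass

@dataclass
class ProsodyControls:
    """Hypothetical parameters behind the interface's slider controls."""
    emotion: str = "neutral"  # emotion type chosen via the emotion control
    intensity: float = 0.5    # emotion-intensity slider, assumed range [0, 1]
    speech_rate: float = 1.0  # 1.0 = normal speed; 2.0 = twice as fast

def adjusted_duration(base_seconds: float, controls: ProsodyControls) -> float:
    """Playback duration shrinks as the speech-rate slider increases."""
    if controls.speech_rate <= 0:
        raise ValueError("speech rate must be positive")
    return base_seconds / controls.speech_rate
```

Sliding the speech-rate control then simply rescales the time axis of the synthesized audio, while emotion and intensity would steer the synthesis model itself.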
Of course, the emotion, emotion-intensity and speech-rate controls can also be used to adjust single characters independently; this can be set flexibly and is not specially limited here.
Optionally, as shown in fig. 4, the character dubbing editing interface may further include an export audio control, and after the user adjusts the audio according to the above items, if satisfactory dubbing audio is obtained, the produced audio may be exported through the export audio control.
Fig. 5 is a flow chart of another character dubbing method according to an embodiment of the present application, optionally, after generating optimized dubbing audio corresponding to text information to be dubbed in the above steps, the method may further include:
S501, if the optimized dubbing audio does not meet the dubbing requirement, generating voice information corresponding to the text information to be dubbed according to the target tone requirement.
Fig. 6 is a schematic diagram of another character dubbing editing interface provided by an embodiment of the present application, where, as shown in fig. 6, the interface may include a recording function and a sound changing function.
In some embodiments, if the optimized dubbing audio obtained after the single character pronunciation modification and the overall audio modification does not meet the requirements of the user, the method of the embodiment may further perform optimization adjustment again.
Optionally, by triggering the start recording control, the user records the text information to be dubbed in the required tone to generate the voice information. Alternatively, a professional dubbing person may record the text information to be dubbed to generate the voice information. Here, the text information to be dubbed refers to the text information corresponding to the optimized dubbing audio that the user is not satisfied with, that is, the original script corresponding to the optimized dubbing audio.
S502, processing the voice information according to the emotion type corresponding to the target dubbing personnel and the text information to be dubbed, and generating target dubbing audio corresponding to the text information to be dubbed.
Optionally, the generated voice information can be uploaded through the uploading control, and voice processing can then be triggered through the generate-voice-change control, so that the voice information is converted into the timbre of the dubbing person corresponding to the character, thereby obtaining the target dubbing audio. In the target dubbing audio, the voice content and intonation of the voice information are kept unchanged, but the timbre is adjusted to one matching the text information to be dubbed corresponding to the voice information. That is, the voice information is converted into target dubbing audio with the target dubbing timbre, where the target dubbing timbre is determined by the target dubbing person matched with the target character and the emotion type corresponding to the text information to be dubbed.
Generally, based on the foregoing character modification and emotion modification, the obtained optimized dubbing audio can basically meet the user's expectations, and only a few individual sentences may have an unsatisfactory effect. The voice processing method of this embodiment can be applied to just those individual unsatisfactory dubbings, so the amount of data to be processed is small and does not burden the system.
Fig. 7 is a flowchart of another character dubbing method according to an embodiment of the present application, optionally, in step S502, processing the voice information according to the emotion type corresponding to the target dubbing person and the text information to be dubbed, and generating target dubbing audio corresponding to the text information to be dubbed may include:
S701, determining target dubbing timbre according to the target dubbing personnel, emotion types corresponding to text information to be dubbed, a mapping relation between a pre-built mark of each candidate dubbing personnel and at least one timbre corresponding to each candidate dubbing personnel and a corresponding relation between each timbre corresponding to each candidate dubbing personnel and emotion types.
In the implementation of this step, referring to steps S201 to S203, the target dubbing timbre may be determined first. The dubbing person matched with the character is used here, so as to ensure that the same character is always dubbed by the same dubbing person and to ensure dubbing accuracy.
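The lookup in S701 can be sketched as two pre-built tables: a mapping from each candidate dubbing person to their timbres, and a correspondence from each timbre to an emotion type. All identifiers, timbre names, and emotion labels below are hypothetical placeholders, not data from the embodiment:

```python
# Hypothetical pre-built mapping: dubbing-person identifier -> their timbres.
PERSON_TO_TIMBRES = {
    "dubber_01": ["timbre_a", "timbre_b"],
    "dubber_02": ["timbre_c"],
}

# Hypothetical pre-built correspondence: each timbre -> one emotion type.
TIMBRE_TO_EMOTION = {
    "timbre_a": "happy",
    "timbre_b": "sad",
    "timbre_c": "neutral",
}

def determine_target_timbre(person_id: str, emotion: str) -> str:
    """Among the timbres of the matched dubbing person, pick the one whose
    pre-built emotion correspondence equals the emotion type of the text."""
    for timbre in PERSON_TO_TIMBRES.get(person_id, []):
        if TIMBRE_TO_EMOTION.get(timbre) == emotion:
            return timbre
    raise LookupError(f"no timbre of {person_id!r} matches emotion {emotion!r}")
```

Keying the lookup on the person first guarantees that the selected timbre always belongs to the dubbing person bound to the character, which is the consistency property this step relies on.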
S702, processing the voice information according to the target dubbing tone, and adjusting the original tone of the voice information to the target dubbing tone to generate target dubbing audio.
Alternatively, the original timbre of the voice information may be adjusted according to the target dubbing timbre so as to convert the voice information into speech with the target dubbing timbre, thereby obtaining the target dubbing audio. Before and after conversion, the voice content and intonation remain unchanged, the intonation being the one the user is satisfied with.
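At the interface level, this conversion can be sketched as swapping exactly one attribute of the recorded utterance while the others pass through. The three-attribute decomposition below is a simplifying assumption for illustration; a real voice-conversion model operates on audio features, not strings:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Utterance:
    content: str      # spoken text, preserved by conversion
    intonation: str   # prosody the user is satisfied with, preserved
    timbre: str       # original recording timbre, to be replaced

def convert_timbre(voice: Utterance, target_timbre: str) -> Utterance:
    """Adjust only the timbre of the recorded voice to the target dubbing
    timbre; content and intonation pass through unchanged."""
    return replace(voice, timbre=target_timbre)
```

The frozen dataclass makes the invariant explicit: the only field that differs between input and output is the timbre.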
Optionally, in step S101, determining the target role and the target dubbing staff may include determining the target role from the roles to be dubbed, determining the target dubbing staff matched with the target role according to tag information of each candidate dubbing staff, where the tag information includes character, age, gender and tone.
In connection with the above related embodiment of step S102, the determination of the target role may be performed by identifying each target role displayed on the interface, selecting the target role by clicking the confirmation, or determining the target role by inputting a search keyword of the target role, etc.
In some embodiments, when selecting the matched dubbing person for each character, the selection may be based on label information of the dubbing person.
The target dubbing person may be determined from among the candidate dubbing persons according to the dubbing effect or voice characteristics desired for the target character.
The required voice quality of the dubbing can be determined from the desired dubbing effect; this quality is usually related to the character's gender, personality, voice type, and so on. Therefore, a target dubbing person capable of producing a voice quality matching the desired dubbing effect of the target character can be determined from the candidate dubbing persons according to their tag information.
By way of example, the dubbing person's tag may include, but is not limited to, personality, age, gender, timbre, e.g., male, personality sun, voice warmth, etc.
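The tag-based matching can be sketched as filtering candidate records on required tag fields. The candidate data and field values below are invented for illustration; only the field names (personality, age, gender, timbre) mirror the label information mentioned in the embodiment:

```python
# Hypothetical candidate records with the tag fields named in the embodiment.
CANDIDATES = [
    {"id": "dubber_01", "gender": "male", "personality": "sunny", "age": 28, "timbre": "warm"},
    {"id": "dubber_02", "gender": "female", "personality": "calm", "age": 35, "timbre": "bright"},
]

def match_dubbing_person(required_tags: dict) -> list:
    """Return the ids of candidates whose tags satisfy every required tag."""
    return [
        c["id"]
        for c in CANDIDATES
        if all(c.get(k) == v for k, v in required_tags.items())
    ]
```

A production system would likely rank partial matches rather than require exact equality on every tag; exact filtering is used here only to keep the sketch minimal.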
In summary, according to the character dubbing method provided by this embodiment, by determining the character and the dubbing person corresponding to the character, the target dubbing timbre matching the text information to be dubbed can be determined based on the identifier of the dubbing person and the emotion type corresponding to the character's text information to be dubbed. Speech synthesis is then performed on the text information to be dubbed according to the target dubbing timbre, so that text-to-speech conversion is carried out automatically and the dubbing audio corresponding to the text information to be dubbed is obtained, which effectively improves dubbing efficiency at a lower cost. Moreover, each candidate timbre of the dubbing person can be obtained through the determined dubbing person, and the target dubbing timbre that best matches the text information to be dubbed can be selected from the candidate timbres according to the corresponding emotion type, so that the resulting dubbing audio sounds better, matches the character more closely, and makes the character more lifelike.
In addition, through the character dubbing editing interface provided by the embodiment, re-optimization adjustment of the generated dubbing audio can be realized, and dubbing modification operation with multiple dimensions is provided, so that the generated dubbing audio has controllability, and the accuracy of the generated dubbing audio can be improved through dubbing optimization.
The following describes a device, equipment, storage medium, etc. for executing the role dubbing method provided by the present application, and specific implementation processes and technical effects thereof are referred to above, and are not described in detail below.
Fig. 8 is a schematic diagram of a role dubbing apparatus according to an embodiment of the present application, where functions implemented by the role dubbing apparatus correspond to steps executed by the method described above. The device can be understood as the above-mentioned terminal equipment or server, or the processor of the server, or can be understood as a component which is independent from the above-mentioned server or processor and is controlled by the server to implement the functions of the present application, as shown in fig. 8, the device can include a determining module 810 and a processing module 820;
A determining module 810, configured to determine a target character and a target dubbing person;
The determining module 810 is further configured to determine text information to be dubbed corresponding to the target character;
The determining module 810 is configured to determine a target dubbing timbre according to the target dubbing personnel and emotion types corresponding to the text information to be dubbed;
The processing module 820 is configured to convert the text information to be dubbed according to the target dubbing timbre, and generate initial dubbing audio.
Optionally, the determining module 810 is specifically configured to determine at least one tone corresponding to the target dubbing person according to the mapping relationship between the target dubbing person and at least one tone corresponding to each candidate dubbing person, where each tone corresponds to one emotion type;
Determining the dubbing tone of the target dubbing person under the emotion type according to the emotion type corresponding to the text information to be dubbed and the corresponding relation between each tone corresponding to each candidate dubbing person and the emotion type, which is constructed in advance;
and determining the dubbing timbre of the target dubbing personnel under the emotion type as the target dubbing timbre.
Optionally, the processing module 820 is specifically configured to convert text information to be dubbed into a phoneme sequence;
And according to the target dubbing tone, performing text-to-speech conversion processing on the phoneme sequence, and generating initial dubbing audio corresponding to the text information to be dubbed, wherein the initial dubbing audio has the target dubbing tone.
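The two-stage conversion (text to phoneme sequence, then phoneme sequence to audio with the target dubbing timbre) can be sketched as follows. The toy pronunciation table is an assumption; real systems use a pronunciation lexicon or a trained grapheme-to-phoneme model, and the synthesis backend is represented here only by the request it would receive:

```python
# Toy grapheme-to-phoneme table; entries are illustrative assumptions.
LEXICON = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}

def text_to_phonemes(text: str) -> list:
    """Convert text to a flat phoneme sequence via dictionary lookup."""
    phonemes = []
    for word in text.lower().split():
        if word not in LEXICON:
            raise KeyError(f"no pronunciation for {word!r}")
        phonemes.extend(LEXICON[word])
    return phonemes

def synthesize(text: str, target_timbre: str) -> dict:
    """Two-stage pipeline: text -> phoneme sequence -> synthesis request
    carrying the target dubbing timbre (the acoustic model is out of scope)."""
    return {"phonemes": text_to_phonemes(text), "timbre": target_timbre}
```

Separating the phoneme stage from timbre selection mirrors the structure of the embodiment: the same phoneme sequence can be rendered with any of the dubbing person's timbres.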
Optionally, the system also comprises an optimization module;
The optimizing module is used for responding to a dubbing modification instruction of a target character in input text information to be dubbed, carrying out dubbing modification on the target character in initial dubbing audio, and generating optimized dubbing audio corresponding to the text information to be dubbed, wherein the dubbing modification instruction comprises at least one of a pronunciation modification instruction, a tone insertion instruction, a tone modification instruction and a tone length modification instruction.
Optionally, the optimizing module is further configured to generate optimized dubbing audio corresponding to the text information to be dubbed in response to the input emotion intensity of the initial dubbing audio and the adjustment operation of the audio speech speed.
Optionally, the optimizing module is further configured to generate, if the optimized dubbing audio frequency does not meet the dubbing requirement, speech information corresponding to the text information to be dubbed according to the target tone requirement;
and processing the voice information according to the emotion type corresponding to the target dubbing personnel and the text information to be dubbed, and generating target dubbing audio corresponding to the text information to be dubbed.
Optionally, the optimization module is specifically configured to determine the target dubbing timbre according to the target dubbing personnel, the emotion type corresponding to the text information to be dubbed, the mapping relationship between each candidate dubbing personnel and at least one timbre corresponding to each candidate dubbing personnel, and the corresponding relationship between each timbre corresponding to each candidate dubbing personnel and the emotion type;
And processing the voice information according to the target dubbing tone, and adjusting the original tone of the voice information to the target dubbing tone to generate target dubbing audio.
Optionally, the determining module 810 is specifically configured to determine a target role from the roles to be dubbed;
And determining target dubbing personnel matched with the target roles according to the label information of each candidate dubbing personnel, wherein the label information comprises characters, ages, sexes and tone.
The modules above may be one or more integrated circuits configured to implement the methods above, such as one or more Application Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs), or the like. For another example, when a module above is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor that can invoke the program code. For another example, the modules may be integrated together and implemented in the form of a system-on-a-chip (SoC).
The modules may be connected or communicate with each other via wired or wireless connections. The wired connection may include a metal cable, optical cable, hybrid cable, or the like, or any combination thereof. The wireless connection may include a connection through a LAN, WAN, bluetooth, zigBee, or NFC, or any combination thereof. Two or more modules may be combined into a single module, and any one module may be divided into two or more units. It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described system and apparatus may refer to corresponding procedures in the method embodiments, and are not repeated in the present disclosure.
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application, including a processor 801, a storage medium 802, and a bus 803, where the storage medium 802 stores machine-readable instructions executable by the processor 801, and when the electronic device performs a role dubbing method as in the embodiment, the processor 801 communicates with the storage medium 802 through the bus 803, and the processor 801 executes the machine-readable instructions to perform the following steps:
determining a target role and a target dubbing person;
Determining text information to be dubbed corresponding to a target role;
Determining a target dubbing tone according to the target dubbing personnel and emotion types corresponding to the text information to be dubbed;
And converting the text information to be dubbed according to the target dubbing tone color to generate initial dubbing audio.
In a possible embodiment, the processor 801 is specifically configured to, when executing the determining the target dubbing timbre according to the emotion type corresponding to the target dubbing person and the text information to be dubbed:
determining at least one tone corresponding to the target dubbing person according to the target dubbing person and the pre-established mapping relation between each candidate dubbing person and at least one tone corresponding to each candidate dubbing person;
Determining the dubbing tone of the target dubbing person under the emotion type according to the emotion type corresponding to the text information to be dubbed and the corresponding relation between each tone corresponding to each candidate dubbing person and the emotion type, which is constructed in advance;
and determining the dubbing timbre of the target dubbing personnel under the emotion type as the target dubbing timbre.
In one possible embodiment, the processor 801, when performing the conversion of the text information to be dubbed according to the target dubbing timbre to generate the initial dubbing audio, is specifically configured to convert the text information to be dubbed into a phoneme sequence;
And according to the target dubbing tone, performing text-to-speech conversion processing on the phoneme sequence, and generating initial dubbing audio corresponding to the text information to be dubbed, wherein the initial dubbing audio has the target dubbing tone.
In one possible implementation, the processor 801 is further configured to, after performing conversion of the text information to be dubbed according to the target dubbing tone color to generate an initial dubbing audio, perform dubbing modification on the target character in the initial dubbing audio in response to a dubbing modification instruction of the target character in the input text information to be dubbed, and generate an optimized dubbing audio corresponding to the text information to be dubbed, where the dubbing modification instruction includes at least one of a pronunciation modification instruction, a tone insertion instruction, a tone modification instruction, and a duration modification instruction.
In one possible embodiment, the processor 801 is further configured to, after performing the conversion of the text information to be dubbed according to the target dubbing timbre, generate initial dubbing audio, generate optimized dubbing audio corresponding to the text information to be dubbed in response to the input emotion intensity of the initial dubbing audio and the adjustment operation of the audio speech rate.
In one possible embodiment, the processor 801 is further configured to, after executing the generation of the optimized dubbing audio corresponding to the text information to be dubbed, generate speech information corresponding to the text information to be dubbed according to the target pitch requirement if the optimized dubbing audio does not meet the dubbing requirement;
and processing the voice information according to the emotion type corresponding to the target dubbing personnel and the text information to be dubbed, and generating target dubbing audio corresponding to the text information to be dubbed.
In a possible implementation manner, when executing the processing of the voice information according to the target dubbing person and the emotion type corresponding to the text information to be dubbed and the generation of the target dubbing audio corresponding to the text information to be dubbed, the processor 801 is specifically configured to determine a target dubbing timbre according to the target dubbing person, the emotion type corresponding to the text information to be dubbed, a pre-built mapping relationship between the identifier of each candidate dubbing person and at least one timbre corresponding to each candidate dubbing person, and a correspondence between each timbre corresponding to each candidate dubbing person and an emotion type;
And processing the voice information according to the target dubbing tone, and adjusting the original tone of the voice information to the target dubbing tone to generate target dubbing audio.
In one possible embodiment, the processor 801, when executing the determining the target character and the target dubbing person, is specifically configured to determine the target character from the characters to be dubbed, determine the target dubbing person matching the target character according to the tag information of each candidate dubbing person, where the tag information includes character, age, gender, and tone.
By the method, when the graphical user interface in the terminal equipment is the character dubbing editing interface, the server can determine the target character and the target dubbing personnel corresponding to the target character based on the operation of the user, so that the target dubbing tone matched with the text information to be dubbed can be determined based on the emotion types corresponding to the target dubbing personnel and the text information to be dubbed of the character, the voice synthesis processing is performed on the text information to be dubbed according to the target dubbing tone, the conversion processing from text to voice is automatically realized, the dubbing audio corresponding to the text information to be dubbed is obtained, the dubbing efficiency is effectively improved, and the cost is lower. And each candidate tone of the target dubbing person can be obtained through the target dubbing person, and the target dubbing tone with better matching effect with the text information to be dubbed can be determined from each candidate tone according to the emotion type corresponding to the text information to be dubbed, so that the finally obtained dubbing audio effect is better, the target dubbing tone is matched with the character, and the character is more true.
In which a storage medium 802 stores program code that, when executed by the processor 801, causes the processor 801 to perform various steps in the character dubbing method according to various exemplary embodiments of the application described in the "exemplary method" section of the specification.
The processor 801 may be a general-purpose processor such as a Central Processing Unit (CPU), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the application. The general-purpose processor may be a microprocessor or any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present application may be embodied directly in a hardware processor for execution, or in a combination of hardware and software modules in the processor for execution.
The storage medium 802 is a non-volatile computer-readable storage medium that can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory may include at least one type of storage medium, for example flash memory, hard disk, multimedia card, card memory, Random Access Memory (RAM), Static Random Access Memory (SRAM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), magnetic memory, magnetic disk, optical disk, and the like. The memory may be, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The storage medium 802 of the present application may also be circuitry or any other device capable of implementing a storage function, for storing program instructions and/or data.
Optionally, an embodiment of the present application further provides a computer readable storage medium having stored thereon a computer program, which when executed by a processor, performs the steps of:
determining a target role and a target dubbing person;
Determining text information to be dubbed corresponding to a target role;
Determining a target dubbing tone according to the target dubbing personnel and emotion types corresponding to the text information to be dubbed;
And converting the text information to be dubbed according to the target dubbing tone color to generate initial dubbing audio.
In one possible embodiment, the processor is specifically configured to, when executing the determining the target dubbing timbre according to the emotion type corresponding to the target dubbing person and the text information to be dubbed:
determining at least one tone corresponding to the target dubbing person according to the target dubbing person and the pre-established mapping relation between each candidate dubbing person and at least one tone corresponding to each candidate dubbing person;
Determining the dubbing tone of the target dubbing person under the emotion type according to the emotion type corresponding to the text information to be dubbed and the corresponding relation between each tone corresponding to each candidate dubbing person and the emotion type, which is constructed in advance;
and determining the dubbing timbre of the target dubbing personnel under the emotion type as the target dubbing timbre.
In one possible implementation, the processor is specifically configured to convert the text information to be dubbed into a phoneme sequence when performing conversion of the text information to be dubbed according to the target dubbing timbre to generate initial dubbing audio;
And according to the target dubbing tone, performing text-to-speech conversion processing on the phoneme sequence, and generating initial dubbing audio corresponding to the text information to be dubbed, wherein the initial dubbing audio has the target dubbing tone.
In one possible implementation, after the processor performs conversion on the text information to be dubbed according to the target dubbing tone color to generate initial dubbing audio, the processor is further configured to perform dubbing modification on target characters in the initial dubbing audio in response to a dubbing modification instruction of target characters in the input text information to be dubbed to generate optimized dubbing audio corresponding to the text information to be dubbed, wherein the dubbing modification instruction comprises at least one of a pronunciation modification instruction, a tone insertion instruction, a tone modification instruction and a tone length modification instruction.
In one possible implementation, the processor is further configured to, after performing the conversion of the text information to be dubbed according to the target dubbing timbre and generating the initial dubbing audio, generate optimized dubbing audio corresponding to the text information to be dubbed in response to the input emotion intensity of the initial dubbing audio and the adjustment operation of the audio speech rate.
In one possible embodiment, the processor is further configured to, after executing the generation of the optimized dubbing audio corresponding to the text information to be dubbed, generate speech information corresponding to the text information to be dubbed according to the target pitch requirement if the optimized dubbing audio does not meet the dubbing requirement;
and processing the voice information according to the emotion type corresponding to the target dubbing personnel and the text information to be dubbed, and generating target dubbing audio corresponding to the text information to be dubbed.
In one possible implementation, when executing the processing of the voice information according to the target dubbing person and the emotion type corresponding to the text information to be dubbed and the generation of the target dubbing audio corresponding to the text information to be dubbed, the processor is specifically configured to determine a target dubbing timbre according to the target dubbing person, the emotion type corresponding to the text information to be dubbed, a pre-built mapping relationship between the identifier of each candidate dubbing person and at least one timbre corresponding to each candidate dubbing person, and a correspondence between each timbre corresponding to each candidate dubbing person and an emotion type;
And processing the voice information according to the target dubbing tone, and adjusting the original tone of the voice information to the target dubbing tone to generate target dubbing audio.
In one possible implementation, the processor is specifically configured to determine a target character from characters to be dubbed when executing determining the target character and target dubbing staff, determine the target dubbing staff matching the target character according to tag information of each candidate dubbing staff, wherein the tag information comprises characters, ages, sexes and timbres.
By the method, when the graphical user interface in the terminal equipment is the character dubbing editing interface, the server can determine the target character and the target dubbing personnel corresponding to the target character based on the operation of the user, so that the target dubbing tone matched with the text information to be dubbed can be determined based on the emotion types corresponding to the target dubbing personnel and the text information to be dubbed of the character, the voice synthesis processing is performed on the text information to be dubbed according to the target dubbing tone, the conversion processing from text to voice is automatically realized, the dubbing audio corresponding to the text information to be dubbed is obtained, the dubbing efficiency is effectively improved, and the cost is lower. And each candidate tone of the target dubbing person can be obtained through the target dubbing person, and the target dubbing tone with better matching effect with the text information to be dubbed can be determined from each candidate tone according to the emotion type corresponding to the text information to be dubbed, so that the finally obtained dubbing audio effect is better, the target dubbing tone is matched with the character, and the character is more true.
In the embodiments of the present application, the computer program may also execute other machine readable instructions when executed by a processor to perform other methods as in the embodiments, and the specific implementation of the method steps and principles are referred to in the description of the embodiments and are not described in detail herein.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of elements is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform some of the steps of the methods according to the embodiments of the application. The storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

Claims (11)

1. A character dubbing method, comprising:
determining a target character and a target dubbing person;
determining text information to be dubbed corresponding to the target character;
determining a target dubbing timbre according to the target dubbing person and the emotion type corresponding to the text information to be dubbed;
and converting the text information to be dubbed according to the target dubbing timbre to generate initial dubbing audio.
2. The method of claim 1, wherein the determining the target dubbing timbre according to the target dubbing person and the emotion type corresponding to the text information to be dubbed comprises:
determining at least one timbre corresponding to the target dubbing person according to the target dubbing person and a pre-established mapping relation between each candidate dubbing person and the at least one timbre corresponding to that candidate dubbing person;
determining the dubbing timbre of the target dubbing person under the emotion type according to the emotion type corresponding to the text information to be dubbed and a pre-established correspondence between each timbre of each candidate dubbing person and an emotion type;
and determining the dubbing timbre of the target dubbing person under the emotion type as the target dubbing timbre.
3. The method of claim 1, wherein the converting the text information to be dubbed according to the target dubbing timbre to generate the initial dubbing audio comprises:
converting the text information to be dubbed into a phoneme sequence;
and performing text-to-speech conversion processing on the phoneme sequence according to the target dubbing timbre, to generate initial dubbing audio corresponding to the text information to be dubbed, wherein the initial dubbing audio has the target dubbing timbre.
4. The method of claim 1, wherein after the converting the text information to be dubbed according to the target dubbing timbre to generate the initial dubbing audio, the method further comprises:
in response to an input dubbing modification instruction for a target character in the text information to be dubbed, performing dubbing modification on the target character in the initial dubbing audio to generate optimized dubbing audio corresponding to the text information to be dubbed, wherein the dubbing modification instruction comprises at least one of a pronunciation modification instruction, a pause insertion instruction, a pitch modification instruction, and a duration modification instruction.
5. The method of claim 1, wherein after the converting the text information to be dubbed according to the target dubbing timbre to generate the initial dubbing audio, the method further comprises:
in response to an input operation of adjusting the emotion intensity and the speed of the initial dubbing audio, generating optimized dubbing audio corresponding to the text information to be dubbed.
6. The method according to claim 4 or 5, wherein after the generating the optimized dubbing audio corresponding to the text information to be dubbed, the method further comprises:
if the optimized dubbing audio does not meet the dubbing requirement, generating voice information corresponding to the text information to be dubbed according to the target timbre requirement;
and processing the voice information according to the target dubbing person and the emotion type corresponding to the text information to be dubbed, to generate target dubbing audio corresponding to the text information to be dubbed.
7. The method of claim 6, wherein the processing the voice information according to the target dubbing person and the emotion type corresponding to the text information to be dubbed, to generate the target dubbing audio corresponding to the text information to be dubbed, comprises:
determining a target dubbing timbre according to the target dubbing person, the emotion type corresponding to the text information to be dubbed, the mapping relation between each candidate dubbing person and the at least one timbre corresponding to that candidate dubbing person, and the correspondence between each timbre of each candidate dubbing person and an emotion type;
and processing the voice information according to the target dubbing timbre, adjusting the original timbre of the voice information to the target dubbing timbre, to generate the target dubbing audio.
8. The method of claim 1, wherein the determining the target character and the target dubbing person comprises:
determining the target character from characters to be dubbed;
and determining, according to label information of each candidate dubbing person, a target dubbing person matching the target character, wherein the label information comprises personality, age, gender, and timbre.
9. A character dubbing device, comprising a determining module and a processing module;
the determining module is used for determining a target character and a target dubbing person;
the determining module is further used for determining text information to be dubbed corresponding to the target character;
the determining module is further used for determining a target dubbing timbre according to the target dubbing person and the emotion type corresponding to the text information to be dubbed;
and the processing module is used for converting the text information to be dubbed according to the target dubbing timbre to generate initial dubbing audio.
10. An electronic device, comprising a processor, a storage medium, and a bus, wherein the storage medium stores program instructions executable by the processor; when the electronic device is in operation, the processor communicates with the storage medium over the bus, and the processor executes the program instructions to perform the character dubbing method according to any one of claims 1 to 8.
11. A computer readable storage medium, characterized in that the storage medium has stored thereon a computer program which, when executed by a processor, performs the character dubbing method according to any one of claims 1 to 8.
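Read together, claims 1 and 3 describe a pipeline: text to be dubbed is first converted into a phoneme sequence, which is then synthesized with the selected timbre. A minimal, purely illustrative sketch follows; the toy grapheme-to-phoneme table, the greedy longest-match strategy, and all function names are assumptions for the sketch, not part of the claims.

```python
# Illustrative sketch of the claim 1 / claim 3 pipeline:
# text -> phoneme sequence -> synthesis with the selected timbre.
# The toy G2P table and every name here are assumptions.

G2P = {"hel": ["HH", "EH", "L"], "lo": ["L", "OW"]}

def text_to_phonemes(text: str) -> list:
    """Greedy longest-match grapheme-to-phoneme lookup over the toy table."""
    phonemes, i, t = [], 0, text.lower()
    while i < len(t):
        for n in (3, 2):  # try the longest grapheme chunk first
            if t[i:i + n] in G2P:
                phonemes += G2P[t[i:i + n]]
                i += n
                break
        else:
            i += 1  # skip characters the toy table does not cover
    return phonemes

def synthesize_phonemes(phonemes: list, timbre: str) -> dict:
    """Stand-in for the acoustic model: a record of what would be rendered."""
    return {"phonemes": phonemes, "timbre": timbre}

audio = synthesize_phonemes(text_to_phonemes("Hello"), "a_neutral_v1")
```

A production system would replace the table with a pronunciation lexicon or a trained G2P model, and `synthesize_phonemes` with an acoustic model plus vocoder conditioned on the chosen timbre; the sketch only shows how the two claimed steps chain together.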
CN202311190066.7A 2023-09-14 2023-09-14 Character dubbing method, device, electronic device and storage medium Pending CN119626203A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311190066.7A CN119626203A (en) 2023-09-14 2023-09-14 Character dubbing method, device, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN119626203A true CN119626203A (en) 2025-03-14

Family

ID=94906875

Country Status (1)

Country Link
CN (1) CN119626203A (en)

Similar Documents

Publication Publication Date Title
EP4172984B1 (en) Two-level speech prosody transfer
US10891928B2 (en) Automatic song generation
US11322133B2 (en) Expressive text-to-speech utilizing contextual word-level style tokens
US8498866B2 (en) Systems and methods for multiple language document narration
KR20240081458A (en) Method and system for generating synthesis voice for text via user interface
US8370151B2 (en) Systems and methods for multiple voice document narration
CN108806655B (en) Automatic generation of songs
CN111627420B (en) Method and device for synthesizing emotion voice of specific speaker under extremely low resource
CN110264991A (en) Training method, phoneme synthesizing method, device, equipment and the storage medium of speech synthesis model
CN114242033A (en) Speech synthesis method, apparatus, equipment, storage medium and program product
CN112382269B (en) Audio synthesis method, device, equipment and storage medium
CN119580695B (en) An adaptive emotion-driven timbre cloning text-to-speech method and device
CN114242032A (en) Speech synthesis method, apparatus, device, storage medium and program product
JP2013164609A (en) Singing synthesizing database generation device, and pitch curve generation device
CN114765022A (en) Speech synthesis method, device, computer equipment and storage medium
CN119377438A (en) Lyrics generation method, device, intelligent terminal and storage medium
CN119626203A (en) Character dubbing method, device, electronic device and storage medium
CN114822492A (en) Speech synthesis method and device, electronic equipment and computer readable storage medium
Bhushan et al. Advancing Text-to-Speech Systems for Low-Resource Languages: Challenges, Innovations, and Future Directions
CN118053416B (en) Sound customization method, device, equipment and storage medium
JP2003280680A (en) Speech synthesis apparatus and method, program therefor, and storage medium
WO2024234970A1 (en) Audio generation method and apparatus, device, and storage medium
Solanki et al. Bridging Text and Audio in Generative AI
AGARWAL Advancing Text-to-Speech Systems for Low-Resource Languages: Challenges, Innovations, and Future Directions
CN117765922A (en) Text-to-speech method, model training method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination