
CN119785769A - A speech synthesis method based on syntax and prosody and related equipment - Google Patents


Info

Publication number
CN119785769A
Authority
CN
China
Prior art keywords
phonological
data
recursive
speech synthesis
phrase
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510032215.XA
Other languages
Chinese (zh)
Inventor
孙奥兰
王健宗
程宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202510032215.XA priority Critical patent/CN119785769A/en
Publication of CN119785769A publication Critical patent/CN119785769A/en
Pending legal-status Critical Current


Landscapes

  • Machine Translation (AREA)

Abstract


The embodiment of the present application belongs to the field of speech synthesis technology, and relates to a speech synthesis method based on syntax-prosody and related equipment, the method comprising: receiving a speech synthesis request carrying a speech synthesis text sent by a user terminal; constructing a recursive phonological structure according to the speech synthesis text; calling a trained recursive phonological model, and inputting the recursive phonological structure into the trained recursive phonological model for speech synthesis operation to obtain target audio data. The present application introduces a recursive phonological model, which can accurately capture the initial falling tone in speech and successfully reproduce syntactic differences, and can effectively construct a hierarchical structure of speech to generate more natural speech.

Description

Speech synthesis method based on syntax-prosody and related equipment
Technical Field
The present application relates to the technical field of speech synthesis and is applicable to the financial and medical fields; in particular, it relates to a speech synthesis method based on syntax-prosody and related equipment.
Background
End-to-end text-to-speech (TTS) systems have significantly improved speech synthesis quality for alphabet-based languages such as English. Unlike English, which contains only 26 letters, Chinese presents unique challenges due to its large character set and inconsistencies between characters and pronunciation.
To cope with this complexity, Chinese end-to-end speech synthesis approaches rely on phoneme sequences for synthesizing speech.
However, the applicant found that conventional speech synthesis methods still suffer from poor quality, insufficient naturalness, and unsatisfactory synthesis results.
Disclosure of Invention
The embodiment of the present application aims to provide a speech synthesis method based on syntax-prosody and related equipment, so as to solve the problems that conventional speech synthesis methods still produce synthesized speech of poor quality, insufficient naturalness, and unsatisfactory effect.
In order to solve the above technical problems, an embodiment of the present application provides a speech synthesis method based on syntax-prosody, which adopts the following technical scheme:
receiving a speech synthesis request carrying a speech synthesis text sent by a user terminal;
constructing a recursive phonological structure according to the speech synthesis text, wherein the recursive phonological structure comprises a phoneme sequence, accent annotation data, phonological phrase annotation data, and phonological clause annotation data, and the recursive phonological structure PPhrase_n is expressed as:
PPhrase_n = PPhrase_{n-1} + PClause_n
wherein PPhrase_n denotes the nth phonological phrase annotation data, and PClause_n denotes the nth phonological clause annotation data;
and calling the trained recursive phonological model, and inputting the recursive phonological structure into the trained recursive phonological model to perform a speech synthesis operation, so as to obtain target audio data.
Further, the step of constructing a recursive phonological structure according to the speech synthesis text, wherein the recursive phonological structure comprises a phoneme sequence, accent annotation data, phonological phrase annotation data, and phonological clause annotation data, specifically comprises the following steps:
performing a syntactic analysis operation on the speech synthesis text with a syntactic parser to obtain syntactic structure data;
converting the syntactic structure data into the phoneme sequence and syntactic structure data carrying accent annotation data;
and converting the syntactic structure data carrying accent annotation data into the phonological phrase annotation data and the phonological clause annotation data according to the syntax-prosody mapping relationship, so as to obtain the recursive phonological structure.
Further, the step of converting the syntactic structure data carrying accent annotation data into the phonological phrase annotation data and the phonological clause annotation data according to the syntax-prosody mapping relationship to obtain the recursive phonological structure specifically includes the following steps:
converting the auxiliary word phrase that governs the noun phrase and the verb phrase in the syntactic structure data into the phonological phrase annotation data, wherein the phonological phrase annotation data PPhrase is expressed as:
PPhrase = PP + NP + VP
wherein PP represents the auxiliary word phrase, NP represents the noun phrase, and VP represents the verb phrase;
marking the inflected phrase or the composite structure in the syntactic structure data as the phonological clause annotation data, wherein the phonological clause annotation data PClause is expressed as:
PClause = IP or PClause = CP
wherein IP represents the inflected phrase and CP represents the composite structure.
Further, after the step of converting the syntactic structure data carrying accent annotation data into the phonological phrase annotation data and the phonological clause annotation data according to the syntax-prosody mapping relationship to obtain the recursive phonological structure, the method further comprises the following step:
performing a phonological boundary insertion operation on the recursive phonological structure according to the accent peaking principle and the anti-interval constraint principle.
Further, after the step of performing the phonological boundary insertion operation on the recursive phonological structure according to the accent peaking principle and the anti-interval constraint principle, the method further comprises the following step:
performing a redundant structure elimination operation on the recursive phonological structure after the phonological boundary insertion operation.
Further, before the step of calling the trained recursive phonological model and inputting the recursive phonological structure into the trained recursive phonological model to perform a speech synthesis operation to obtain the target audio data, the method further comprises the following steps:
reading a system database, and acquiring model training data in the system database, wherein the model training data comprises training text data and training speech data corresponding to the training text data;
converting the training text data into a training recursive phonological structure;
and calling an original text-to-speech model, and performing a model training operation on the original text-to-speech model by taking the training recursive phonological structure as input data and the training speech data as output data, so as to obtain the trained recursive phonological model.
In order to solve the above technical problems, an embodiment of the present application further provides a speech synthesis apparatus based on syntax-prosody, which adopts the following technical scheme:
a request receiving module, configured to receive a speech synthesis request carrying a speech synthesis text sent by a user terminal;
a recursive phonological structure construction module, configured to construct a recursive phonological structure according to the speech synthesis text, wherein the recursive phonological structure comprises a phoneme sequence, accent annotation data, phonological phrase annotation data, and phonological clause annotation data, and the recursive phonological structure PPhrase_n is expressed as:
PPhrase_n = PPhrase_{n-1} + PClause_n
wherein PPhrase_n denotes the nth phonological phrase annotation data, and PClause_n denotes the nth phonological clause annotation data;
and a speech synthesis module, configured to call the trained recursive phonological model and input the recursive phonological structure into the trained recursive phonological model to perform a speech synthesis operation, so as to obtain target audio data.
Further, the recursive phonological structure construction module includes:
a syntactic analysis sub-module, configured to perform a syntactic analysis operation on the speech synthesis text with a syntactic parser to obtain syntactic structure data;
a first syntactic structure conversion sub-module, configured to convert the syntactic structure data into the phoneme sequence and syntactic structure data carrying accent annotation data;
and a second syntactic structure conversion sub-module, configured to convert the syntactic structure data carrying accent annotation data into the phonological phrase annotation data and the phonological clause annotation data according to the syntax-prosody mapping relationship, so as to obtain the recursive phonological structure.
In order to solve the above technical problems, an embodiment of the present application further provides a computer device, which adopts the following technical scheme:
the computer device comprises a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, implement the steps of the syntax-prosody based speech synthesis method described above.
In order to solve the above technical problems, an embodiment of the present application further provides a computer readable storage medium, which adopts the following technical scheme:
the computer readable storage medium has stored thereon computer readable instructions which, when executed by a processor, implement the steps of the syntax-prosody based speech synthesis method described above.
The present application provides a speech synthesis method based on syntax-prosody, which comprises: receiving a speech synthesis request carrying a speech synthesis text sent by a user terminal; constructing a recursive phonological structure according to the speech synthesis text, wherein the recursive phonological structure comprises a phoneme sequence, accent annotation data, phonological phrase annotation data, and phonological clause annotation data, the recursive phonological structure is expressed as PPhrase_n = PPhrase_{n-1} + PClause_n, PPhrase_n denotes the nth phonological phrase annotation data, and PClause_n denotes the nth phonological clause annotation data; and calling a trained recursive phonological model and inputting the recursive phonological structure into the trained recursive phonological model to perform a speech synthesis operation, so as to obtain target audio data. Compared with the prior art, the present application introduces a recursive phonological model, which can accurately capture the initial falling tone in speech, successfully reproduce syntactic differences, effectively construct the hierarchical structure of speech, and generate more natural speech.
Drawings
In order to more clearly illustrate the solution of the present application, a brief description will be given below of the drawings required for the description of the embodiments of the present application, it being apparent that the drawings in the following description are some embodiments of the present application, and that other drawings may be obtained from these drawings without the exercise of inventive effort for a person of ordinary skill in the art.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flowchart of an implementation of a method for syntactic-prosody-based speech synthesis provided by an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a speech synthesis apparatus based on syntax-prosody according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of one embodiment of a computer device in accordance with the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs, the terms used in the description herein are used for the purpose of describing particular embodiments only and are not intended to limit the application, and the terms "comprising" and "having" and any variations thereof in the description of the application and the claims and the above description of the drawings are intended to cover non-exclusive inclusions. The terms first, second and the like in the description and in the claims or in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
In order to make the person skilled in the art better understand the solution of the present application, the technical solution of the embodiment of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, the system architecture 100 may include a terminal device 101, a network 102, and a server 103, where the terminal device 101 may be a notebook 1011, a tablet 1012, or a cell phone 1013. Network 102 is the medium used to provide communication links between terminal device 101 and server 103. Network 102 may include various connection types such as wired, wireless communication links, or fiber optic cables, among others.
A user may interact with the server 103 via the network 102 using the terminal device 101 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, and the like, may be installed on the terminal device 101.
The terminal device 101 may be any of various electronic devices having a display screen and supporting web browsing; in addition to the notebook 1011, the tablet 1012, or the mobile phone 1013, the terminal device 101 may be an electronic book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop portable computer, a desktop computer, and the like.
The server 103 may be a server providing various services, such as a background server providing support for pages displayed on the terminal device 101.
It should be noted that, the method for synthesizing speech based on the syntax-prosody provided by the embodiment of the present application is generally performed by a server/terminal device, and accordingly, the apparatus for synthesizing speech based on the syntax-prosody is generally disposed in the server/terminal device.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flowchart of one embodiment of a syntax-prosody based speech synthesis method according to the present application is shown. The syntax-prosody based speech synthesis method includes steps S201 to S203.
In step S201, a speech synthesis request carrying speech synthesis text sent by a user terminal is received.
In the embodiment of the present application, a user terminal refers to a terminal device for performing the syntax-prosody based speech synthesis method provided by the present application. The user terminal may be a mobile terminal such as a mobile phone, a smart phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet personal computer), a PMP (portable multimedia player), or a navigation device, or a fixed terminal such as a digital TV or a desktop computer. It should be understood that the examples of the user terminal herein are merely for convenience of understanding and are not intended to limit the present application.
In the embodiment of the present application, the user inputs a speech synthesis request through a terminal device (such as a mobile phone or a computer), and the input is received by the system. The speech synthesis text is the specific content that the user wishes the system to synthesize into speech. As an example, the speech synthesis text may be transaction data, payment data, business data, or purchase data related to a financial institution (e.g., a bank), or medical data related to a medical scenario, such as a personal health record, a prescription, or an examination report. It should be understood that these examples of the speech synthesis text are merely for convenience of understanding and are not intended to limit the present application.
In step S202, a recursive phonological structure is constructed according to the speech synthesis text, wherein the recursive phonological structure includes a phoneme sequence, accent annotation data, phonological phrase annotation data, and phonological clause annotation data, and the recursive phonological structure PPhrase_n is expressed as:
PPhrase_n = PPhrase_{n-1} + PClause_n
wherein PPhrase_n denotes the nth phonological phrase annotation data, and PClause_n denotes the nth phonological clause annotation data.
In the embodiment of the present application, the recursive phonological structure may be constructed from the speech synthesis text as follows: perform a syntactic analysis operation on the speech synthesis text with a syntactic parser to obtain syntactic structure data; convert the syntactic structure data into a phoneme sequence and syntactic structure data carrying accent annotation data; and convert the syntactic structure data carrying accent annotation data into phonological phrase annotation data and phonological clause annotation data according to the syntax-prosody mapping relationship, so as to obtain the recursive phonological structure.
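The recursion PPhrase_n = PPhrase_{n-1} + PClause_n can be sketched in a few lines of Python. This is an illustrative sketch only, not the patent's implementation; the clause strings are hypothetical placeholders, and "[ ]" / "{ }" follow the patent's phrase/clause notation.

```python
def build_recursive_structure(pclauses):
    """Fold phonological clauses into nested phonological phrases.

    Each step wraps the previous phrase together with the next clause,
    so PPhrase_n = PPhrase_{n-1} + PClause_n.
    """
    if not pclauses:
        return ""
    pphrase = f"[{pclauses[0]}]"            # PPhrase_1 contains the first clause
    for pclause in pclauses[1:]:
        pphrase = f"[{pphrase} {pclause}]"  # PPhrase_n = PPhrase_(n-1) + PClause_n
    return pphrase

# Hypothetical clause placeholders; nesting grows leftward with each clause.
print(build_recursive_structure(["{c1}", "{c2}", "{c3}"]))
# [[[{c1}] {c2}] {c3}]
```

The left-nested bracketing makes the hierarchy explicit: every outer PPhrase contains the PPhrase built so far plus one new PClause.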
In step S203, a trained recursive phonological model is invoked, and the recursive phonological structure is input into the trained recursive phonological model to perform a speech synthesis operation, so as to obtain target audio data.
In the embodiment of the present application, the sentence-ending mark "." denotes a declarative sentence.
In an embodiment of the present application, a novel speech synthesis method named the recursive phonological model is provided. The model integrates insights from various phonological theories: it is mainly based on the proposed phonological structure (the syntax-prosody mapping hypothesis) and combines boundary-driven theory with prosodic well-formedness constraints. The model is unique in that, besides phoneme symbols, it introduces specific symbols to represent accents, phonological phrases, and phonological clauses, denoted by "\", "[ ]", and "{ }", respectively. Since this method emphasizes the recursion of PPhrases, it is referred to as the recursive phonological model.
In the embodiment of the present application, with this method, the model can effectively capture complex prosodic features in speech and generate more natural speech.
The embodiment of the present application provides a syntax-prosody based speech synthesis method, which comprises: receiving a speech synthesis request carrying a speech synthesis text sent by a user terminal; constructing a recursive phonological structure according to the speech synthesis text, wherein the recursive phonological structure comprises a phoneme sequence, accent annotation data, phonological phrase annotation data, and phonological clause annotation data, the recursive phonological structure is expressed as PPhrase_n = PPhrase_{n-1} + PClause_n, PPhrase_n denotes the nth phonological phrase annotation data, and PClause_n denotes the nth phonological clause annotation data; and calling a trained recursive phonological model and inputting the recursive phonological structure into the trained recursive phonological model to perform a speech synthesis operation, so as to obtain target audio data. Compared with the prior art, the present application introduces a recursive phonological model, which can accurately capture the initial falling tone in speech, successfully reproduce syntactic differences, effectively construct the hierarchical structure of speech, and generate more natural speech.
In some optional implementations of the embodiments of the present application, the step of constructing a recursive phonological structure according to the speech synthesis text, wherein the recursive phonological structure includes a phoneme sequence, accent annotation data, phonological phrase annotation data, and phonological clause annotation data, specifically includes the following steps:
performing a syntactic analysis operation on the speech synthesis text with a syntactic parser to obtain syntactic structure data;
converting the syntactic structure data into the phoneme sequence and syntactic structure data carrying accent annotation data;
and converting the syntactic structure data carrying accent annotation data into the phonological phrase annotation data and the phonological clause annotation data according to the syntax-prosody mapping relationship, so as to obtain the recursive phonological structure.
In an embodiment of the present application, the syntactic structure is first extracted automatically using a syntactic parser. The phoneme sequence and accents are then generated by converting the text, wherein the output of the syntactic parser is converted into representations of phonemes and accents.
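The conversion from parser output to an accent-marked phoneme sequence can be sketched as follows. This is an illustrative sketch only: the LEXICON entries and accent positions are invented, and a real system would consult a syntactic parser and a pronunciation dictionary instead.

```python
# Hypothetical word -> (phoneme list, is_accented) lexicon; not real data.
LEXICON = {
    "ni3": (["n", "i3"], True),     # assumed accented word
    "hao3": (["h", "ao3"], False),  # assumed unaccented word
}

def to_phonemes_with_accents(words):
    """Convert a word sequence into phonemes, prefixing accented words
    with "\\", the accent symbol used in the patent's notation."""
    out = []
    for w in words:
        phones, accented = LEXICON[w]
        if accented:
            out.append("\\")  # accent mark precedes the accented word's phonemes
        out.extend(phones)
    return out

print(to_phonemes_with_accents(["ni3", "hao3"]))
# ['\\', 'n', 'i3', 'h', 'ao3']
```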
In some optional implementations of the embodiments of the present application, the step of converting the syntactic structure data carrying accent annotation data into the phonological phrase annotation data and the phonological clause annotation data according to the syntax-prosody mapping relationship to obtain the recursive phonological structure specifically includes the following steps:
converting the auxiliary word phrase that governs the noun phrase and the verb phrase in the syntactic structure data into the phonological phrase annotation data, wherein the phonological phrase annotation data PPhrase is expressed as:
PPhrase = PP + NP + VP
wherein PP represents the auxiliary word phrase, NP represents the noun phrase, and VP represents the verb phrase;
marking the inflected phrase or the composite structure in the syntactic structure data as the phonological clause annotation data, wherein the phonological clause annotation data PClause is expressed as:
PClause = IP or PClause = CP
wherein IP represents an inflected phrase and CP represents a composite structure.
In an embodiment of the present application, according to the syntax-prosody mapping hypothesis, an auxiliary word phrase (PP) that governs a noun phrase (NP) and a verb phrase (VP) in the syntactic structure is converted into a phonological phrase (PPhrase), denoted by "[ ]". To obtain phonological clauses (PClauses), the present application marks the IP output by the syntactic parser as a PClause, denoted by "{ }". Furthermore, if a CP dominates the IP, only the CP is marked as a PClause, its "{ }" replacing that of the IP. Specifically:
the phonological phrase (PPhrase) is constructed as:
PPhrase = PP + NP + VP
wherein PP represents the auxiliary word phrase, NP represents the noun phrase, and VP represents the verb phrase;
the phonological clause (PClause) is constructed as:
PClause = IP or PClause = CP
wherein IP represents an inflected phrase and CP represents a composite structure.
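A minimal sketch of this syntax-prosody mapping, under simplifying assumptions: trees are (label, children) tuples with string leaves, PP nodes become phonological phrases "[ ]", IP nodes become phonological clauses "{ }", and a CP dominating an IP is marked instead of the IP. The example tree and leaf strings are hypothetical.

```python
def map_node(node):
    """Map a simplified syntax tree to the patent's bracket notation."""
    label, children = node
    if isinstance(children, str):  # leaf: the word's phoneme string
        return children
    body = " ".join(map_node(c) for c in children)
    if label == "CP":
        # Only the CP is marked as a PClause; drop braces an inner IP added.
        return "{" + body.replace("{", "").replace("}", "") + "}"
    if label == "IP":
        return "{" + body + "}"
    if label == "PP":
        return "[" + body + "]"  # PPhrase = PP + NP + VP
    return body                  # other nodes pass their content through

tree = ("IP", [("PP", [("NP", "zhangsan"), ("VP", "pao")])])
print(map_node(tree))
# {[zhangsan pao]}
```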
In some optional implementations of the embodiments of the present application, after the step of converting the syntactic structure data carrying accent annotation data into the phonological phrase annotation data and the phonological clause annotation data according to the syntax-prosody mapping relationship to obtain the recursive phonological structure, the method further includes the following step:
performing a phonological boundary insertion operation on the recursive phonological structure according to the accent peaking principle and the anti-interval constraint principle.
In the embodiment of the present application, according to the accent peaking and anti-interval constraint principles, a phonological boundary is required after each accent. This means that AA (accent + accent) sequences are separated due to accent peaking, while AU (accent + non-accent) sequences are separated due to the anti-interval constraint. Conversely, UA and UU sequences may each form a single PPhrase, whereas AA and AU sequences must be split into [A][A] and [A][U], respectively. For example, an AU sequence is converted by inserting a PPhrase boundary after the accent, while a UA sequence violates neither accent peaking nor the anti-interval constraint and therefore requires no boundary insertion, being converted into a single PPhrase. Specifically:
AA sequence:
AA → [A][A]
AU sequence:
AU → [A][U]
UA and UU sequences:
UA → [UA], UU → [UU]
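The four rules above reduce to a single operation: insert a PPhrase boundary after every accent. The sketch below illustrates this on accent/non-accent strings ("A" = accented unit, "U" = unaccented unit); it is a simplified illustration, not the patent's implementation.

```python
def insert_boundaries(seq):
    """Group an accent string like 'AU' into PPhrase bracketings,
    inserting a boundary after every accent (accent peaking /
    anti-interval constraint)."""
    phrases, current = [], ""
    for unit in seq:
        current += unit
        if unit == "A":              # a boundary must follow each accent
            phrases.append(current)
            current = ""
    if current:                      # trailing unaccented units
        phrases.append(current)
    return "".join(f"[{p}]" for p in phrases)

for s in ("AA", "AU", "UA", "UU"):
    print(s, "->", insert_boundaries(s))
# AA -> [A][A]
# AU -> [A][U]
# UA -> [UA]
# UU -> [UU]
```

The outputs match the four rules: AA and AU split, UA and UU stay whole.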
In some optional implementations of the embodiments of the present application, after the step of performing the phonological boundary insertion operation on the recursive phonological structure according to the accent peaking principle and the anti-interval constraint principle, the method further includes the following step:
performing a redundant structure elimination operation on the recursive phonological structure after the phonological boundary insertion operation.
In an embodiment of the present application, the redundant structure elimination operation means that, under the constraints, an inner PPhrase is acceptable only when there are multiple PPhrases within the outer PPhrase. If there is only one PPhrase within the outer PPhrase, the outer PPhrase is removed.
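Redundant structure elimination can be sketched with phrases modeled as nested lists: an outer phrase whose only content is a single inner phrase is dropped in favor of the inner one. This is an illustrative sketch under that simplified representation, not the patent's implementation.

```python
def eliminate_redundant(phrase):
    """Remove outer PPhrases that contain exactly one PPhrase and
    nothing else, recursing bottom-up so nested singletons collapse."""
    children = [eliminate_redundant(c) if isinstance(c, list) else c
                for c in phrase]
    if len(children) == 1 and isinstance(children[0], list):
        return children[0]  # outer PPhrase holds only one PPhrase: drop it
    return children

print(eliminate_redundant([["a", "b"]]))    # singleton wrapper removed
print(eliminate_redundant([["a"], ["b"]]))  # two inner PPhrases: kept as-is
```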
In some optional implementations of the embodiments of the present application, before the step of calling the trained recursive phonological model and inputting the recursive phonological structure into the trained recursive phonological model to perform the speech synthesis operation, the method further includes the following steps:
reading a system database, and acquiring model training data in the system database, wherein the model training data includes training text data and training speech data corresponding to the training text data;
converting the training text data into a training recursive phonological structure;
and calling an original text-to-speech model, and performing a model training operation on the original text-to-speech model by taking the training recursive phonological structure as input data and the training speech data as output data, so as to obtain the trained recursive phonological model.
In the embodiment of the present application, the training process of the model can be summarized as the following steps:
1. Input text preprocessing: convert the input text into a phoneme sequence and annotate accents, phonological phrases, and phonological clauses.
2. Recursive phonological structure construction: form a complete phonological structure by recursively constructing phonological phrases.
3. Model training: train the TTS model using the recursive phonological structures and the corresponding speech data.
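The three training steps above can be sketched as a data-preparation pipeline. This is a hedged illustration: annotate() is a hypothetical stand-in for the parser-based steps 1-2, and the sample text and file name are invented.

```python
def annotate(text):
    """Placeholder for steps 1-2: a real system would parse the text and
    mark accents "\\", phonological phrases "[ ]", and clauses "{ }"."""
    return "{[\\" + text + "]}"

def build_training_pairs(samples):
    """samples: (text, wav_path) pairs -> (recursive structure, wav_path).
    Step 3 would feed each structure in as model input with the
    corresponding audio as the training target."""
    return [(annotate(text), wav) for text, wav in samples]

pairs = build_training_pairs([("ni hao", "0001.wav")])
print(pairs)
# [('{[\\ni hao]}', '0001.wav')]
```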
The embodiments of the present application can acquire and process the related data based on artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technique, and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by computer readable instructions stored in a computer readable storage medium that, when executed, may comprise the steps of the embodiments of the methods described above. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a random access Memory (Random Access Memory, RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.
With further reference to fig. 3, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a syntactic-prosody-based speech synthesis apparatus, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 3, the syntax-prosody-based speech synthesis apparatus 200 of an embodiment of the present application includes:
a request receiving module 210, configured to receive a speech synthesis request carrying speech synthesis text sent by a user terminal;
a recursive phonological structure construction module 220, configured to construct a recursive phonological structure from the speech synthesis text, where the recursive phonological structure includes a phoneme sequence, accent annotation data, phonological phrase annotation data, and phonological clause annotation data, and the recursive phonological structure PPhrase_n is represented as:
PPhrase_n = PPhrase_{n-1} + PClause_n
where PPhrase_n denotes the nth phonological phrase annotation data and PClause_n denotes the nth phonological clause annotation data;
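In code, the recursion above amounts to left-folding a clause sequence into a nested structure. The following Python sketch is illustrative only — the class names and the flat list-of-clauses input are assumptions, not part of the disclosed method:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PClause:
    """Phonological clause annotation data (one IP or CP span)."""
    text: str

@dataclass
class PPhrase:
    """Recursive phonological phrase: PPhrase_n = PPhrase_{n-1} + PClause_n."""
    left: Optional["PPhrase"]  # PPhrase_{n-1}; None at the base of the recursion
    clause: PClause            # PClause_n

def build_recursive_structure(clauses: List[PClause]) -> Optional[PPhrase]:
    """Left-fold a clause sequence into the nested phrase structure."""
    phrase: Optional[PPhrase] = None
    for clause in clauses:
        phrase = PPhrase(left=phrase, clause=clause)
    return phrase

def depth(phrase: Optional[PPhrase]) -> int:
    """Nesting depth equals the number of clauses folded in."""
    return 0 if phrase is None else 1 + depth(phrase.left)
```

Folding three clauses yields a structure of depth 3 whose outermost clause is the last one, mirroring how each PPhrase_n wraps the previous PPhrase_{n-1}.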
the speech synthesis module 230, configured to invoke the trained recursive phonological model and input the recursive phonological structure into the trained recursive phonological model to perform a speech synthesis operation, obtaining the target audio data.
In an embodiment of the present application, a syntax-prosody-based speech synthesis apparatus 200 is provided, including: a request receiving module 210 configured to receive a speech synthesis request carrying speech synthesis text sent by a user terminal; a recursive phonological structure construction module 220 configured to construct a recursive phonological structure from the speech synthesis text, where the recursive phonological structure includes a phoneme sequence, accent annotation data, phonological phrase annotation data, and phonological clause annotation data, and is represented as PPhrase_n = PPhrase_{n-1} + PClause_n, where PPhrase_n denotes the nth phonological phrase annotation data and PClause_n denotes the nth phonological clause annotation data; and a speech synthesis module 230 configured to invoke the trained recursive phonological model and input the recursive phonological structure into the trained recursive phonological model to perform a speech synthesis operation, obtaining target audio data. Compared with the prior art, the present application introduces a recursive phonological model that can accurately capture the initial falling tone in speech, successfully reproduce syntactic differences, effectively construct the hierarchical structure of speech, and generate more natural speech.
In some optional implementations of the embodiments of the present application, the recursive phonological structure construction module includes:
a syntactic parsing sub-module, configured to perform a syntactic parsing operation on the speech synthesis text using a syntactic parser to obtain syntactic structure data;
a first syntactic structure conversion sub-module, configured to convert the syntactic structure data into the phoneme sequence and into syntactic structure data carrying accent annotation data; and
a second syntactic structure conversion sub-module, configured to convert the syntactic structure data carrying accent annotation data into the phonological phrase annotation data and the phonological clause annotation data according to a syntax-prosody mapping relationship, obtaining the recursive phonological structure.
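The mapping performed by the second conversion sub-module can be sketched as a tree walk. In the Python sketch below, the parser output is assumed to be nested (label, children) tuples, and the example sentence is hypothetical; only the PP + NP + VP → PPhrase and IP/CP → PClause mappings come from the disclosure:

```python
# Disclosed mapping: a constituent whose children are a particle phrase (PP),
# a noun phrase (NP), and a verb phrase (VP) forms one phonological phrase
# (PPhrase = PP + NP + VP); inflectional phrases (IP) and compound
# structures (CP) are marked as phonological clauses (PClause = IP or CP).

PHRASE_PATTERN = ("PP", "NP", "VP")   # child labels forming one PPhrase
CLAUSE_LABELS = {"IP", "CP"}          # labels marked as PClause

def map_syntax_to_prosody(node, annotations=None):
    """node is (label, [children]) for non-terminals, (label, word) for leaves."""
    if annotations is None:
        annotations = []
    label, children = node
    if label in CLAUSE_LABELS:
        annotations.append(("PClause", label))
    if isinstance(children, list):
        if tuple(child[0] for child in children) == PHRASE_PATTERN:
            annotations.append(("PPhrase", PHRASE_PATTERN))
        for child in children:
            map_syntax_to_prosody(child, annotations)
    return annotations
```

For a hypothetical parse ("IP", [("S", [("PP", "wa"), ("NP", "inu"), ("VP", "hashiru")])]), the walk marks one PClause at the IP node and one PPhrase for the PP + NP + VP siblings.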
To solve the above technical problems, an embodiment of the present application further provides a computer device. Referring specifically to fig. 4, fig. 4 is a basic structural block diagram of a computer device according to an embodiment of the present application.
The computer device 300 includes a memory 310, a processor 320, and a network interface 330 communicatively coupled to each other via a system bus. It should be noted that only a computer device 300 having components 310-330 is shown in the figure, but it should be understood that not all of the illustrated components need be implemented; more or fewer components may be implemented instead. Those skilled in the art will appreciate that the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device may be a desktop computer, a notebook computer, a palmtop computer, a cloud server, or other computing device. The computer device may perform human-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad, a voice control device, or the like.
The memory 310 includes at least one type of readable storage medium, including flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), Random Access Memory (RAM), Static Random Access Memory (SRAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Programmable Read-Only Memory (PROM), magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 310 may be an internal storage unit of the computer device 300, such as a hard disk or memory of the computer device 300. In other embodiments, the memory 310 may also be an external storage device of the computer device 300, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the computer device 300. Of course, the memory 310 may also include both an internal storage unit and an external storage device of the computer device 300. In an embodiment of the present application, the memory 310 is generally used to store the operating system and various application software installed on the computer device 300, such as computer readable instructions of the syntax-prosody-based speech synthesis method. In addition, the memory 310 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 320 may in some embodiments be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 320 is generally used to control the overall operation of the computer device 300. In an embodiment of the present application, the processor 320 is configured to execute computer readable instructions stored in the memory 310 or to process data, for example to execute the computer readable instructions of the syntax-prosody-based speech synthesis method.
The network interface 330 may include a wireless network interface or a wired network interface, and is typically used to establish a communication connection between the computer device 300 and other electronic devices.
The computer device provided by the present application introduces a recursive phonological model that can accurately capture the initial falling tone in speech, successfully reproduce syntactic differences, effectively construct the hierarchical structure of speech, and generate more natural speech.
The present application also provides another embodiment, namely a computer readable storage medium storing computer readable instructions executable by at least one processor to cause the at least one processor to perform the steps of the syntax-prosody-based speech synthesis method described above.
The computer readable storage medium provided by the present application introduces a recursive phonological model that can accurately capture the initial falling tone in speech, successfully reproduce syntactic differences, effectively construct the hierarchical structure of speech, and generate more natural speech.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, though in many cases the former is preferred. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods of the embodiments of the present application.
It is apparent that the above-described embodiments are only some, not all, of the embodiments of the present application; the preferred embodiments are shown in the drawings, which do not limit the scope of the claims. The present application may be embodied in many different forms; these embodiments are provided so that the disclosure will be thorough and complete. Although the application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their elements. Any equivalent structure made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of the present application.

Claims (10)

1. A syntax-prosody-based speech synthesis method, comprising the following steps:
receiving a speech synthesis request carrying speech synthesis text sent by a user terminal;
constructing a recursive phonological structure from the speech synthesis text, wherein the recursive phonological structure includes a phoneme sequence, accent annotation data, phonological phrase annotation data, and phonological clause annotation data, and the recursive phonological structure PPhrase_n is expressed as:
PPhrase_n = PPhrase_{n-1} + PClause_n
where PPhrase_n denotes the nth phonological phrase annotation data and PClause_n denotes the nth phonological clause annotation data; and
invoking a trained recursive phonological model and inputting the recursive phonological structure into the trained recursive phonological model to perform a speech synthesis operation, obtaining target audio data.

2. The syntax-prosody-based speech synthesis method according to claim 1, wherein the step of constructing a recursive phonological structure from the speech synthesis text comprises the following steps:
performing a syntactic parsing operation on the speech synthesis text using a syntactic parser to obtain syntactic structure data;
converting the syntactic structure data into the phoneme sequence and into syntactic structure data carrying accent annotation data; and
converting the syntactic structure data carrying accent annotation data into the phonological phrase annotation data and the phonological clause annotation data according to a syntax-prosody mapping relationship, obtaining the recursive phonological structure.

3. The syntax-prosody-based speech synthesis method according to claim 2, wherein the step of converting the syntactic structure data carrying accent annotation data into the phonological phrase annotation data and the phonological clause annotation data according to the syntax-prosody mapping relationship comprises the following steps:
converting a particle phrase governing a noun phrase and a verb phrase in the syntactic structure data into the phonological phrase annotation data, wherein the phonological phrase annotation data PPhrase is expressed as:
PPhrase = PP + NP + VP
where PP denotes the particle phrase, NP denotes the noun phrase, and VP denotes the verb phrase; and
marking an inflectional phrase or a compound structure in the syntactic structure data as the phonological clause annotation data, wherein the clause annotation data PClause is expressed as:
PClause = IP or PClause = CP
where IP denotes the inflectional phrase and CP denotes the compound structure.

4. The syntax-prosody-based speech synthesis method according to claim 2, further comprising, after the step of converting the syntactic structure data carrying accent annotation data into the phonological phrase annotation data and the phonological clause annotation data according to the syntax-prosody mapping relationship to obtain the recursive phonological structure, the following step:
performing a phonological boundary insertion operation on the recursive phonological structure according to the stress culminativity principle and the anti-lapse constraint principle.

5. The syntax-prosody-based speech synthesis method according to claim 4, further comprising, after the step of performing the phonological boundary insertion operation on the recursive phonological structure according to the stress culminativity principle and the anti-lapse constraint principle, the following step:
performing a redundant structure elimination operation on the recursive phonological structure after the phonological boundary insertion operation.

6. The syntax-prosody-based speech synthesis method according to claim 1, further comprising, before the step of invoking the trained recursive phonological model and inputting the recursive phonological structure into the trained recursive phonological model to perform the speech synthesis operation, the following steps:
reading a system database and obtaining model training data from the system database, wherein the model training data includes training text data and training speech data corresponding to the training text data;
converting the training text data into a training recursive phonological structure; and
invoking an original text-to-speech model and performing a model training operation on the original text-to-speech model with the training recursive phonological structure as input data and the training speech data as output data, obtaining the trained recursive phonological model.

7. A syntax-prosody-based speech synthesis apparatus, comprising:
a request receiving module, configured to receive a speech synthesis request carrying speech synthesis text sent by a user terminal;
a recursive phonological structure construction module, configured to construct a recursive phonological structure from the speech synthesis text, wherein the recursive phonological structure includes a phoneme sequence, accent annotation data, phonological phrase annotation data, and phonological clause annotation data, and the recursive phonological structure PPhrase_n is expressed as:
PPhrase_n = PPhrase_{n-1} + PClause_n
where PPhrase_n denotes the nth phonological phrase annotation data and PClause_n denotes the nth phonological clause annotation data; and
a speech synthesis module, configured to invoke a trained recursive phonological model and input the recursive phonological structure into the trained recursive phonological model to perform a speech synthesis operation, obtaining target audio data.

8. The syntax-prosody-based speech synthesis apparatus according to claim 7, wherein the recursive phonological structure construction module comprises:
a syntactic parsing sub-module, configured to perform a syntactic parsing operation on the speech synthesis text using a syntactic parser to obtain syntactic structure data;
a first syntactic structure conversion sub-module, configured to convert the syntactic structure data into the phoneme sequence and into syntactic structure data carrying accent annotation data; and
a second syntactic structure conversion sub-module, configured to convert the syntactic structure data carrying accent annotation data into the phonological phrase annotation data and the phonological clause annotation data according to a syntax-prosody mapping relationship, obtaining the recursive phonological structure.

9. A computer device, comprising a memory and a processor, wherein the memory stores computer readable instructions, and the processor, when executing the computer readable instructions, implements the steps of the syntax-prosody-based speech synthesis method according to any one of claims 1 to 6.

10. A computer readable storage medium, wherein computer readable instructions are stored on the computer readable storage medium, and the computer readable instructions, when executed by a processor, implement the steps of the syntax-prosody-based speech synthesis method according to any one of claims 1 to 6.
CN202510032215.XA 2025-01-08 2025-01-08 A speech synthesis method based on syntax and prosody and related equipment Pending CN119785769A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510032215.XA CN119785769A (en) 2025-01-08 2025-01-08 A speech synthesis method based on syntax and prosody and related equipment

Publications (1)

Publication Number Publication Date
CN119785769A true CN119785769A (en) 2025-04-08

Family

ID=95232273



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination