
CN119785769A - A speech synthesis method based on syntax and prosody and related equipment - Google Patents


Info

Publication number
CN119785769A
Authority
CN
China
Prior art keywords
phonological
data
recursive
speech synthesis
phrase
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510032215.XA
Other languages
Chinese (zh)
Inventor
孙奥兰
王健宗
程宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202510032215.XA priority Critical patent/CN119785769A/en
Publication of CN119785769A publication Critical patent/CN119785769A/en
Pending legal-status Critical Current


Landscapes

  • Machine Translation (AREA)

Abstract


The embodiment of the present application belongs to the field of speech synthesis technology, and relates to a speech synthesis method based on syntax-prosody and related equipment, the method comprising: receiving a speech synthesis request carrying a speech synthesis text sent by a user terminal; constructing a recursive phonological structure according to the speech synthesis text; calling a trained recursive phonological model, and inputting the recursive phonological structure into the trained recursive phonological model for speech synthesis operation to obtain target audio data. The present application introduces a recursive phonological model, which can accurately capture the initial falling tone in speech and successfully reproduce syntactic differences, and can effectively construct a hierarchical structure of speech to generate more natural speech.

Description

Speech synthesis method based on syntax-prosody and related equipment
Technical Field
The present application relates to the technical field of speech synthesis and is applicable to the financial and medical fields; in particular, it relates to a speech synthesis method based on syntax-prosody and related equipment.
Background
End-to-end text-to-speech (TTS) systems have significantly improved speech synthesis quality for alphabet-based languages such as English. Unlike English, which contains only 26 letters, Chinese presents unique challenges due to its large character set and inconsistencies between characters and pronunciation.
To cope with this complexity, Chinese end-to-end speech synthesis approaches rely on phoneme sequences for synthesizing speech.
However, the applicant found that conventional speech synthesis methods still suffer from poor quality, insufficient naturalness, and unsatisfactory synthesis results.
Disclosure of Invention
The embodiment of the present application aims to provide a speech synthesis method based on syntax-prosody and related equipment, so as to solve the problems that conventional speech synthesis methods still produce synthesized speech of poor quality, insufficient naturalness, and unsatisfactory effect.
In order to solve the above technical problems, an embodiment of the present application provides a speech synthesis method based on syntax-prosody, which adopts the following technical scheme:
receiving a speech synthesis request carrying a speech synthesis text sent by a user terminal;
constructing a recursive phonological structure according to the speech synthesis text, wherein the recursive phonological structure comprises a phoneme sequence, accent annotation data, phonological phrase annotation data, and phonological clause annotation data, and the recursive phonological structure PPhrase_n is expressed as:
PPhrase_n = PPhrase_{n-1} + PClause_n
wherein PPhrase_n denotes the nth phonological phrase annotation data, and PClause_n denotes the nth phonological clause annotation data;
and calling the trained recursive phonological model, and inputting the recursive phonological structure into the trained recursive phonological model to perform a speech synthesis operation, so as to obtain target audio data.
Further, the step of constructing a recursive phonological structure according to the speech synthesis text, wherein the recursive phonological structure comprises a phoneme sequence, accent annotation data, phonological phrase annotation data, and phonological clause annotation data, specifically comprises the following steps:
performing a syntactic analysis operation on the speech synthesis text with a syntactic parser to obtain syntactic structure data;
converting the syntactic structure data into the phoneme sequence and syntactic structure data carrying accent annotation data;
and converting the syntactic structure data carrying accent annotation data into the phonological phrase annotation data and the phonological clause annotation data according to the syntax-prosody mapping relationship, so as to obtain the recursive phonological structure.
Further, the step of converting the syntactic structure data carrying accent annotation data into the phonological phrase annotation data and the phonological clause annotation data according to the syntax-prosody mapping relationship to obtain the recursive phonological structure specifically includes the following steps:
converting the auxiliary word phrase that governs the noun phrase and the verb phrase in the syntactic structure data into the phonological phrase annotation data, wherein the phonological phrase annotation data PPhrase is expressed as:
PPhrase = PP + NP + VP
wherein PP represents the auxiliary word phrase, NP represents the noun phrase, and VP represents the verb phrase;
marking the inflected phrase or the composite structure in the syntactic structure data as the phonological clause annotation data, wherein the phonological clause annotation data PClause is expressed as:
PClause = IP or PClause = CP
wherein IP represents the inflected phrase and CP represents the composite structure.
Further, after the step of converting the syntactic structure data carrying accent annotation data into the phonological phrase annotation data and the phonological clause annotation data according to the syntax-prosody mapping relationship to obtain the recursive phonological structure, the method further comprises the following step:
performing a phonological boundary insertion operation on the recursive phonological structure according to the accent peaking principle and the anti-interval constraint principle.
Further, after the step of performing the phonological boundary insertion operation on the recursive phonological structure according to the accent peaking principle and the anti-interval constraint principle, the method further comprises the following step:
performing a redundant structure elimination operation on the recursive phonological structure after the phonological boundary insertion operation.
Further, before the step of calling the trained recursive phonological model and inputting the recursive phonological structure into the trained recursive phonological model to perform a speech synthesis operation to obtain the target audio data, the method further comprises the following steps:
reading a system database, and acquiring model training data in the system database, wherein the model training data comprises training text data and training speech data corresponding to the training text data;
converting the training text data into a training recursive phonological structure;
and calling an original text-to-speech model, and performing a model training operation on the original text-to-speech model by taking the training recursive phonological structure as input data and the training speech data as output data, so as to obtain the trained recursive phonological model.
In order to solve the above technical problems, an embodiment of the present application further provides a speech synthesis apparatus based on syntax-prosody, which adopts the following technical scheme:
a request receiving module, configured to receive a speech synthesis request carrying a speech synthesis text sent by a user terminal;
a recursive phonological structure construction module, configured to construct a recursive phonological structure according to the speech synthesis text, wherein the recursive phonological structure comprises a phoneme sequence, accent annotation data, phonological phrase annotation data, and phonological clause annotation data, and the recursive phonological structure PPhrase_n is expressed as:
PPhrase_n = PPhrase_{n-1} + PClause_n
wherein PPhrase_n denotes the nth phonological phrase annotation data, and PClause_n denotes the nth phonological clause annotation data;
and a speech synthesis module, configured to call the trained recursive phonological model and input the recursive phonological structure into the trained recursive phonological model to perform a speech synthesis operation, so as to obtain target audio data.
Further, the recursive phonological structure construction module includes:
a syntactic analysis sub-module, configured to perform a syntactic analysis operation on the speech synthesis text with a syntactic parser to obtain syntactic structure data;
a first syntactic structure conversion sub-module, configured to convert the syntactic structure data into the phoneme sequence and syntactic structure data carrying accent annotation data;
and a second syntactic structure conversion sub-module, configured to convert the syntactic structure data carrying accent annotation data into the phonological phrase annotation data and the phonological clause annotation data according to the syntax-prosody mapping relationship, so as to obtain the recursive phonological structure.
In order to solve the above technical problems, an embodiment of the present application further provides a computer device, which adopts the following technical scheme:
the computer device comprises a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, implement the steps of the syntax-prosody based speech synthesis method described above.
In order to solve the above technical problems, an embodiment of the present application further provides a computer readable storage medium, which adopts the following technical scheme:
the computer readable storage medium has stored thereon computer readable instructions which, when executed by a processor, implement the steps of the syntax-prosody based speech synthesis method described above.
The present application provides a speech synthesis method based on syntax-prosody, which comprises: receiving a speech synthesis request carrying a speech synthesis text sent by a user terminal; constructing a recursive phonological structure according to the speech synthesis text, wherein the recursive phonological structure comprises a phoneme sequence, accent annotation data, phonological phrase annotation data, and phonological clause annotation data, the recursive phonological structure is expressed as PPhrase_n = PPhrase_{n-1} + PClause_n, PPhrase_n denotes the nth phonological phrase annotation data, and PClause_n denotes the nth phonological clause annotation data; and calling a trained recursive phonological model and inputting the recursive phonological structure into the trained recursive phonological model to perform a speech synthesis operation, so as to obtain target audio data. Compared with the prior art, the present application introduces a recursive phonological model, which can accurately capture the initial falling tone in speech, successfully reproduce syntactic differences, effectively construct the hierarchical structure of speech, and generate more natural speech.
Drawings
In order to more clearly illustrate the solution of the present application, a brief description will be given below of the drawings required for the description of the embodiments of the present application, it being apparent that the drawings in the following description are some embodiments of the present application, and that other drawings may be obtained from these drawings without the exercise of inventive effort for a person of ordinary skill in the art.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flowchart of an implementation of a method for syntactic-prosody-based speech synthesis provided by an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a speech synthesis apparatus based on syntax-prosody according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of one embodiment of a computer device in accordance with the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs, the terms used in the description herein are used for the purpose of describing particular embodiments only and are not intended to limit the application, and the terms "comprising" and "having" and any variations thereof in the description of the application and the claims and the above description of the drawings are intended to cover non-exclusive inclusions. The terms first, second and the like in the description and in the claims or in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
In order to make the person skilled in the art better understand the solution of the present application, the technical solution of the embodiment of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, the system architecture 100 may include a terminal device 101, a network 102, and a server 103, where the terminal device 101 may be a notebook 1011, a tablet 1012, or a cell phone 1013. Network 102 is the medium used to provide communication links between terminal device 101 and server 103. Network 102 may include various connection types such as wired, wireless communication links, or fiber optic cables, among others.
A user may interact with the server 103 via the network 102 using the terminal device 101 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, and the like, may be installed on the terminal device 101.
The terminal device 101 may be any of various electronic devices having a display screen and supporting web browsing; in addition to the notebook 1011, the tablet 1012, or the mobile phone 1013, the terminal device 101 may be an electronic book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop portable computer, a desktop computer, and the like.
The server 103 may be a server providing various services, such as a background server providing support for pages displayed on the terminal device 101.
It should be noted that, the method for synthesizing speech based on the syntax-prosody provided by the embodiment of the present application is generally performed by a server/terminal device, and accordingly, the apparatus for synthesizing speech based on the syntax-prosody is generally disposed in the server/terminal device.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flowchart of one embodiment of a syntax-prosody based speech synthesis method according to the present application is shown. The syntax-prosody based speech synthesis method includes steps S201 to S203.
In step S201, a speech synthesis request carrying speech synthesis text sent by a user terminal is received.
In the embodiment of the present application, a user terminal refers to a terminal device for performing the syntax-prosody based speech synthesis method provided by the present application. The user terminal may be a mobile terminal such as a mobile phone, a smart phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet personal computer), a PMP (portable multimedia player), or a navigation device, or a fixed terminal such as a digital TV or a desktop computer. It should be understood that the examples of the user terminal herein are merely for convenience of understanding and are not intended to limit the present application.
In the embodiment of the present application, the user inputs a speech synthesis request through a terminal device (such as a mobile phone or a computer), and the input is received by the system. The speech synthesis text is the specific content that the user wishes the system to synthesize into speech. As an example, the speech synthesis text may be transaction data, payment data, business data, or purchase data related to a financial institution (e.g., a bank), or medical data related to a medical scenario, such as a personal health record, a prescription, or an examination report. It should be understood that these examples of the speech synthesis text are merely for convenience of understanding and are not intended to limit the present application.
In step S202, a recursive phonological structure is constructed according to the speech synthesis text, wherein the recursive phonological structure includes a phoneme sequence, accent annotation data, phonological phrase annotation data, and phonological clause annotation data, and the recursive phonological structure PPhrase_n is expressed as:
PPhrase_n = PPhrase_{n-1} + PClause_n
wherein PPhrase_n denotes the nth phonological phrase annotation data, and PClause_n denotes the nth phonological clause annotation data.
In the embodiment of the present application, the recursive phonological structure may be constructed from the speech synthesis text as follows: perform a syntactic analysis operation on the speech synthesis text with a syntactic parser to obtain syntactic structure data; convert the syntactic structure data into a phoneme sequence and syntactic structure data carrying accent annotation data; and convert the syntactic structure data carrying accent annotation data into phonological phrase annotation data and phonological clause annotation data according to the syntax-prosody mapping relationship, so as to obtain the recursive phonological structure.
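The recursion PPhrase_n = PPhrase_{n-1} + PClause_n can be sketched in a few lines of Python. This is an illustrative sketch only, not the patent's implementation; the clause strings are hypothetical placeholders, and "[ ]" / "{ }" follow the patent's phrase/clause notation.

```python
def build_recursive_structure(pclauses):
    """Fold phonological clauses into nested phonological phrases.

    Each step wraps the previous phrase together with the next clause,
    so PPhrase_n = PPhrase_{n-1} + PClause_n.
    """
    if not pclauses:
        return ""
    pphrase = f"[{pclauses[0]}]"            # PPhrase_1 contains the first clause
    for pclause in pclauses[1:]:
        pphrase = f"[{pphrase} {pclause}]"  # PPhrase_n = PPhrase_(n-1) + PClause_n
    return pphrase

# Hypothetical clause placeholders; nesting grows leftward with each clause.
print(build_recursive_structure(["{c1}", "{c2}", "{c3}"]))
# [[[{c1}] {c2}] {c3}]
```

The left-nested bracketing makes the hierarchy explicit: every outer PPhrase contains the PPhrase built so far plus one new PClause.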
In step S203, a trained recursive phonological model is invoked, and the recursive phonological structure is input into the trained recursive phonological model to perform a speech synthesis operation, so as to obtain target audio data.
In the embodiment of the present application, the sentence-ending mark "." denotes a declarative sentence.
In an embodiment of the present application, a novel speech synthesis method named the recursive phonological model is provided. The model integrates insights from various phonological theories: it is mainly based on the proposed phonological structure (the syntax-prosody mapping hypothesis) and combines boundary-driven theory with prosodic well-formedness constraints. The model is unique in that, besides phoneme symbols, it introduces specific symbols to represent accents, phonological phrases, and phonological clauses, denoted by "\", "[ ]", and "{ }", respectively. Since this method emphasizes the recursion of PPhrases, it is referred to as the recursive phonological model.
In the embodiment of the present application, with this method, the model can effectively capture complex prosodic features in speech and generate more natural speech.
The embodiment of the present application provides a syntax-prosody based speech synthesis method, which comprises: receiving a speech synthesis request carrying a speech synthesis text sent by a user terminal; constructing a recursive phonological structure according to the speech synthesis text, wherein the recursive phonological structure comprises a phoneme sequence, accent annotation data, phonological phrase annotation data, and phonological clause annotation data, the recursive phonological structure is expressed as PPhrase_n = PPhrase_{n-1} + PClause_n, PPhrase_n denotes the nth phonological phrase annotation data, and PClause_n denotes the nth phonological clause annotation data; and calling a trained recursive phonological model and inputting the recursive phonological structure into the trained recursive phonological model to perform a speech synthesis operation, so as to obtain target audio data. Compared with the prior art, the present application introduces a recursive phonological model, which can accurately capture the initial falling tone in speech, successfully reproduce syntactic differences, effectively construct the hierarchical structure of speech, and generate more natural speech.
In some optional implementations of the embodiments of the present application, the step of constructing a recursive phonological structure according to the speech synthesis text, wherein the recursive phonological structure includes a phoneme sequence, accent annotation data, phonological phrase annotation data, and phonological clause annotation data, specifically includes the following steps:
performing a syntactic analysis operation on the speech synthesis text with a syntactic parser to obtain syntactic structure data;
converting the syntactic structure data into the phoneme sequence and syntactic structure data carrying accent annotation data;
and converting the syntactic structure data carrying accent annotation data into the phonological phrase annotation data and the phonological clause annotation data according to the syntax-prosody mapping relationship, so as to obtain the recursive phonological structure.
In an embodiment of the present application, the syntactic structure is first extracted automatically using a syntactic parser. The phoneme sequence and accents are then generated by converting the text, wherein the output of the syntactic parser is converted into representations of phonemes and accents.
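The conversion from parser output to an accent-marked phoneme sequence can be sketched as follows. This is an illustrative sketch only: the LEXICON entries and accent positions are invented, and a real system would consult a syntactic parser and a pronunciation dictionary instead.

```python
# Hypothetical word -> (phoneme list, is_accented) lexicon; not real data.
LEXICON = {
    "ni3": (["n", "i3"], True),     # assumed accented word
    "hao3": (["h", "ao3"], False),  # assumed unaccented word
}

def to_phonemes_with_accents(words):
    """Convert a word sequence into phonemes, prefixing accented words
    with "\\", the accent symbol used in the patent's notation."""
    out = []
    for w in words:
        phones, accented = LEXICON[w]
        if accented:
            out.append("\\")  # accent mark precedes the accented word's phonemes
        out.extend(phones)
    return out

print(to_phonemes_with_accents(["ni3", "hao3"]))
# ['\\', 'n', 'i3', 'h', 'ao3']
```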
In some optional implementations of the embodiments of the present application, the step of converting the syntactic structure data carrying accent annotation data into the phonological phrase annotation data and the phonological clause annotation data according to the syntax-prosody mapping relationship to obtain the recursive phonological structure specifically includes the following steps:
converting the auxiliary word phrase that governs the noun phrase and the verb phrase in the syntactic structure data into the phonological phrase annotation data, wherein the phonological phrase annotation data PPhrase is expressed as:
PPhrase = PP + NP + VP
wherein PP represents the auxiliary word phrase, NP represents the noun phrase, and VP represents the verb phrase;
marking the inflected phrase or the composite structure in the syntactic structure data as the phonological clause annotation data, wherein the phonological clause annotation data PClause is expressed as:
PClause = IP or PClause = CP
wherein IP represents an inflected phrase and CP represents a composite structure.
In an embodiment of the present application, according to the syntax-prosody mapping hypothesis, an auxiliary word phrase (PP) that governs a noun phrase (NP) and a verb phrase (VP) in the syntactic structure is converted into a phonological phrase (PPhrase), denoted by "[ ]". To obtain phonological clauses (PClauses), the present application marks the IP output by the syntactic parser as a PClause, denoted by "{ }". Furthermore, if a CP dominates the IP, only the CP is marked as a PClause, its "{ }" replacing that of the IP. Specifically:
the phonological phrase (PPhrase) is constructed as:
PPhrase = PP + NP + VP
wherein PP represents the auxiliary word phrase, NP represents the noun phrase, and VP represents the verb phrase;
the phonological clause (PClause) is constructed as:
PClause = IP or PClause = CP
wherein IP represents an inflected phrase and CP represents a composite structure.
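A minimal sketch of this syntax-prosody mapping, under simplifying assumptions: trees are (label, children) tuples with string leaves, PP nodes become phonological phrases "[ ]", IP nodes become phonological clauses "{ }", and a CP dominating an IP is marked instead of the IP. The example tree and leaf strings are hypothetical.

```python
def map_node(node):
    """Map a simplified syntax tree to the patent's bracket notation."""
    label, children = node
    if isinstance(children, str):  # leaf: the word's phoneme string
        return children
    body = " ".join(map_node(c) for c in children)
    if label == "CP":
        # Only the CP is marked as a PClause; drop braces an inner IP added.
        return "{" + body.replace("{", "").replace("}", "") + "}"
    if label == "IP":
        return "{" + body + "}"
    if label == "PP":
        return "[" + body + "]"  # PPhrase = PP + NP + VP
    return body                  # other nodes pass their content through

tree = ("IP", [("PP", [("NP", "zhangsan"), ("VP", "pao")])])
print(map_node(tree))
# {[zhangsan pao]}
```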
In some optional implementations of the embodiments of the present application, after the step of converting the syntactic structure data carrying accent annotation data into the phonological phrase annotation data and the phonological clause annotation data according to the syntax-prosody mapping relationship to obtain the recursive phonological structure, the method further includes the following step:
performing a phonological boundary insertion operation on the recursive phonological structure according to the accent peaking principle and the anti-interval constraint principle.
In the embodiment of the present application, according to the accent peaking and anti-interval constraint principles, a phonological boundary is required after each accent. This means that AA (accent + accent) sequences are separated due to accent peaking, while AU (accent + non-accent) sequences are separated due to the anti-interval constraint. Conversely, UA and UU sequences may each form a single PPhrase, whereas AA and AU sequences must be split into [A][A] and [A][U], respectively. For example, an AU sequence is converted by inserting a PPhrase boundary after the accent, while a UA sequence violates neither accent peaking nor the anti-interval constraint and therefore requires no boundary insertion, being converted into a single PPhrase. Specifically:
AA sequence:
AA → [A][A]
AU sequence:
AU → [A][U]
UA and UU sequences:
UA → [UA], UU → [UU]
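The four rules above reduce to a single operation: insert a PPhrase boundary after every accent. The sketch below illustrates this on accent/non-accent strings ("A" = accented unit, "U" = unaccented unit); it is a simplified illustration, not the patent's implementation.

```python
def insert_boundaries(seq):
    """Group an accent string like 'AU' into PPhrase bracketings,
    inserting a boundary after every accent (accent peaking /
    anti-interval constraint)."""
    phrases, current = [], ""
    for unit in seq:
        current += unit
        if unit == "A":              # a boundary must follow each accent
            phrases.append(current)
            current = ""
    if current:                      # trailing unaccented units
        phrases.append(current)
    return "".join(f"[{p}]" for p in phrases)

for s in ("AA", "AU", "UA", "UU"):
    print(s, "->", insert_boundaries(s))
# AA -> [A][A]
# AU -> [A][U]
# UA -> [UA]
# UU -> [UU]
```

The outputs match the four rules: AA and AU split, UA and UU stay whole.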
In some optional implementations of the embodiments of the present application, after the step of performing the phonological boundary insertion operation on the recursive phonological structure according to the accent peaking principle and the anti-interval constraint principle, the method further includes the following step:
performing a redundant structure elimination operation on the recursive phonological structure after the phonological boundary insertion operation.
In an embodiment of the present application, the redundant structure elimination operation means that, under the constraints, an inner PPhrase is acceptable only when there are multiple PPhrases within the outer PPhrase. If there is only one PPhrase within the outer PPhrase, the outer PPhrase is removed.
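Redundant structure elimination can be sketched with phrases modeled as nested lists: an outer phrase whose only content is a single inner phrase is dropped in favor of the inner one. This is an illustrative sketch under that simplified representation, not the patent's implementation.

```python
def eliminate_redundant(phrase):
    """Remove outer PPhrases that contain exactly one PPhrase and
    nothing else, recursing bottom-up so nested singletons collapse."""
    children = [eliminate_redundant(c) if isinstance(c, list) else c
                for c in phrase]
    if len(children) == 1 and isinstance(children[0], list):
        return children[0]  # outer PPhrase holds only one PPhrase: drop it
    return children

print(eliminate_redundant([["a", "b"]]))    # singleton wrapper removed
print(eliminate_redundant([["a"], ["b"]]))  # two inner PPhrases: kept as-is
```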
In some optional implementations of the embodiments of the present application, before the step of calling the trained recursive phonological model and inputting the recursive phonological structure into the trained recursive phonological model to perform the speech synthesis operation, the method further includes the following steps:
reading a system database, and acquiring model training data in the system database, wherein the model training data includes training text data and training speech data corresponding to the training text data;
converting the training text data into a training recursive phonological structure;
and calling an original text-to-speech model, and performing a model training operation on the original text-to-speech model by taking the training recursive phonological structure as input data and the training speech data as output data, so as to obtain the trained recursive phonological model.
In the embodiment of the present application, the training process of the model can be summarized as the following steps:
1. Input text preprocessing: convert the input text into a phoneme sequence and annotate accents, phonological phrases, and phonological clauses.
2. Recursive phonological structure construction: form a complete phonological structure by recursively constructing phonological phrases.
3. Model training: train the TTS model using the recursive phonological structures and the corresponding speech data.
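The three training steps above can be sketched as a data-preparation pipeline. This is a hedged illustration: annotate() is a hypothetical stand-in for the parser-based steps 1-2, and the sample text and file name are invented.

```python
def annotate(text):
    """Placeholder for steps 1-2: a real system would parse the text and
    mark accents "\\", phonological phrases "[ ]", and clauses "{ }"."""
    return "{[\\" + text + "]}"

def build_training_pairs(samples):
    """samples: (text, wav_path) pairs -> (recursive structure, wav_path).
    Step 3 would feed each structure in as model input with the
    corresponding audio as the training target."""
    return [(annotate(text), wav) for text, wav in samples]

pairs = build_training_pairs([("ni hao", "0001.wav")])
print(pairs)
# [('{[\\ni hao]}', '0001.wav')]
```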
The embodiments of the present application can acquire and process the related data based on artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technique, and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by computer readable instructions stored in a computer readable storage medium that, when executed, may comprise the steps of the embodiments of the methods described above. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a random access Memory (Random Access Memory, RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.
With further reference to fig. 3, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a syntactic-prosody-based speech synthesis apparatus, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 3, the syntax-prosody-based speech synthesis apparatus 200 of an embodiment of the present application includes:
a request receiving module 210, configured to receive a speech synthesis request carrying speech synthesis text sent by a user terminal;
a recursive phonological structure construction module 220, configured to construct a recursive phonological structure from the speech synthesis text, where the recursive phonological structure includes a phoneme sequence, accent annotation data, phonological phrase annotation data, and phonological clause annotation data, and the recursive phonological structure PPhrase_n is represented as:
PPhrase_n = PPhrase_{n-1} + PClause_n
where PPhrase_n denotes the nth phonological phrase annotation data and PClause_n denotes the nth phonological clause annotation data;
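In code, the recursion above amounts to left-folding a clause sequence into a nested structure. The following Python sketch is illustrative only — the class names and the flat list-of-clauses input are assumptions, not part of the disclosed method:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PClause:
    """Phonological clause annotation data (one IP or CP span)."""
    text: str

@dataclass
class PPhrase:
    """Recursive phonological phrase: PPhrase_n = PPhrase_{n-1} + PClause_n."""
    left: Optional["PPhrase"]  # PPhrase_{n-1}; None at the base of the recursion
    clause: PClause            # PClause_n

def build_recursive_structure(clauses: List[PClause]) -> Optional[PPhrase]:
    """Left-fold a clause sequence into the nested phrase structure."""
    phrase: Optional[PPhrase] = None
    for clause in clauses:
        phrase = PPhrase(left=phrase, clause=clause)
    return phrase

def depth(phrase: Optional[PPhrase]) -> int:
    """Nesting depth equals the number of clauses folded in."""
    return 0 if phrase is None else 1 + depth(phrase.left)
```

Folding three clauses yields a structure of depth 3 whose outermost clause is the last one, mirroring how each PPhrase_n wraps the previous PPhrase_{n-1}.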
the speech synthesis module 230, configured to invoke the trained recursive phonological model and input the recursive phonological structure into the trained recursive phonological model to perform a speech synthesis operation, obtaining the target audio data.
In an embodiment of the present application, a syntax-prosody-based speech synthesis apparatus 200 is provided, including: a request receiving module 210 configured to receive a speech synthesis request carrying speech synthesis text sent by a user terminal; a recursive phonological structure construction module 220 configured to construct a recursive phonological structure from the speech synthesis text, where the recursive phonological structure includes a phoneme sequence, accent annotation data, phonological phrase annotation data, and phonological clause annotation data, and is represented as PPhrase_n = PPhrase_{n-1} + PClause_n, where PPhrase_n denotes the nth phonological phrase annotation data and PClause_n denotes the nth phonological clause annotation data; and a speech synthesis module 230 configured to invoke the trained recursive phonological model and input the recursive phonological structure into the trained recursive phonological model to perform a speech synthesis operation, obtaining target audio data. Compared with the prior art, the present application introduces a recursive phonological model that can accurately capture the initial falling tone in speech, successfully reproduce syntactic differences, effectively construct the hierarchical structure of speech, and generate more natural speech.
In some optional implementations of the embodiments of the present application, the recursive phonological structure construction module includes:
a syntactic parsing sub-module, configured to perform a syntactic parsing operation on the speech synthesis text using a syntactic parser to obtain syntactic structure data;
a first syntactic structure conversion sub-module, configured to convert the syntactic structure data into the phoneme sequence and into syntactic structure data carrying accent annotation data; and
a second syntactic structure conversion sub-module, configured to convert the syntactic structure data carrying accent annotation data into the phonological phrase annotation data and the phonological clause annotation data according to a syntax-prosody mapping relationship, obtaining the recursive phonological structure.
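The mapping performed by the second conversion sub-module can be sketched as a tree walk. In the Python sketch below, the parser output is assumed to be nested (label, children) tuples, and the example sentence is hypothetical; only the PP + NP + VP → PPhrase and IP/CP → PClause mappings come from the disclosure:

```python
# Disclosed mapping: a constituent whose children are a particle phrase (PP),
# a noun phrase (NP), and a verb phrase (VP) forms one phonological phrase
# (PPhrase = PP + NP + VP); inflectional phrases (IP) and compound
# structures (CP) are marked as phonological clauses (PClause = IP or CP).

PHRASE_PATTERN = ("PP", "NP", "VP")   # child labels forming one PPhrase
CLAUSE_LABELS = {"IP", "CP"}          # labels marked as PClause

def map_syntax_to_prosody(node, annotations=None):
    """node is (label, [children]) for non-terminals, (label, word) for leaves."""
    if annotations is None:
        annotations = []
    label, children = node
    if label in CLAUSE_LABELS:
        annotations.append(("PClause", label))
    if isinstance(children, list):
        if tuple(child[0] for child in children) == PHRASE_PATTERN:
            annotations.append(("PPhrase", PHRASE_PATTERN))
        for child in children:
            map_syntax_to_prosody(child, annotations)
    return annotations
```

For a hypothetical parse ("IP", [("S", [("PP", "wa"), ("NP", "inu"), ("VP", "hashiru")])]), the walk marks one PClause at the IP node and one PPhrase for the PP + NP + VP siblings.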
To solve the above technical problems, an embodiment of the present application further provides a computer device. Referring specifically to fig. 4, fig. 4 is a basic structural block diagram of a computer device according to an embodiment of the present application.
The computer device 300 includes a memory 310, a processor 320, and a network interface 330 communicatively coupled to each other via a system bus. It should be noted that only a computer device 300 having components 310-330 is shown in the figure, but it should be understood that not all of the illustrated components need be implemented; more or fewer components may be implemented instead. Those skilled in the art will appreciate that the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device may be a desktop computer, a notebook computer, a palmtop computer, a cloud server, or other computing device. The computer device may perform human-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad, a voice control device, or the like.
The memory 310 includes at least one type of readable storage medium, including flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), Random Access Memory (RAM), Static Random Access Memory (SRAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Programmable Read-Only Memory (PROM), magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 310 may be an internal storage unit of the computer device 300, such as a hard disk or memory of the computer device 300. In other embodiments, the memory 310 may also be an external storage device of the computer device 300, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the computer device 300. Of course, the memory 310 may also include both an internal storage unit and an external storage device of the computer device 300. In an embodiment of the present application, the memory 310 is generally used to store the operating system and various application software installed on the computer device 300, such as computer readable instructions of the syntax-prosody-based speech synthesis method. In addition, the memory 310 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 320 may in some embodiments be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 320 is generally used to control the overall operation of the computer device 300. In an embodiment of the present application, the processor 320 is configured to execute computer readable instructions stored in the memory 310 or to process data, for example to execute the computer readable instructions of the syntax-prosody-based speech synthesis method.
The network interface 330 may include a wireless network interface or a wired network interface, and is typically used to establish a communication connection between the computer device 300 and other electronic devices.
The computer device provided by the present application introduces a recursive phonological model that can accurately capture the initial falling tone in speech, successfully reproduce syntactic differences, effectively construct the hierarchical structure of speech, and generate more natural speech.
The present application also provides another embodiment, namely a computer readable storage medium storing computer readable instructions executable by at least one processor to cause the at least one processor to perform the steps of the syntax-prosody-based speech synthesis method described above.
The computer readable storage medium provided by the present application introduces a recursive phonological model that can accurately capture the initial falling tone in speech, successfully reproduce syntactic differences, effectively construct the hierarchical structure of speech, and generate more natural speech.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, though in many cases the former is preferred. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods of the embodiments of the present application.
It is apparent that the above-described embodiments are only some, not all, of the embodiments of the present application; the preferred embodiments are shown in the drawings, which do not limit the scope of the claims. The present application may be embodied in many different forms; these embodiments are provided so that the disclosure will be thorough and complete. Although the application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their elements. Any equivalent structure made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of the present application.

Claims (10)

1. A syntax-prosody-based speech synthesis method, comprising the following steps:
receiving a speech synthesis request carrying speech synthesis text sent by a user terminal;
constructing a recursive phonological structure from the speech synthesis text, wherein the recursive phonological structure includes a phoneme sequence, accent annotation data, phonological phrase annotation data, and phonological clause annotation data, and the recursive phonological structure PPhrase_n is expressed as:
PPhrase_n = PPhrase_{n-1} + PClause_n
where PPhrase_n denotes the nth phonological phrase annotation data and PClause_n denotes the nth phonological clause annotation data; and
invoking a trained recursive phonological model and inputting the recursive phonological structure into the trained recursive phonological model to perform a speech synthesis operation, obtaining target audio data.

2. The syntax-prosody-based speech synthesis method according to claim 1, wherein the step of constructing a recursive phonological structure from the speech synthesis text comprises the following steps:
performing a syntactic parsing operation on the speech synthesis text using a syntactic parser to obtain syntactic structure data;
converting the syntactic structure data into the phoneme sequence and into syntactic structure data carrying accent annotation data; and
converting the syntactic structure data carrying accent annotation data into the phonological phrase annotation data and the phonological clause annotation data according to a syntax-prosody mapping relationship, obtaining the recursive phonological structure.

3. The syntax-prosody-based speech synthesis method according to claim 2, wherein the step of converting the syntactic structure data carrying accent annotation data into the phonological phrase annotation data and the phonological clause annotation data according to the syntax-prosody mapping relationship comprises the following steps:
converting a particle phrase governing a noun phrase and a verb phrase in the syntactic structure data into the phonological phrase annotation data, wherein the phonological phrase annotation data PPhrase is expressed as:
PPhrase = PP + NP + VP
where PP denotes the particle phrase, NP denotes the noun phrase, and VP denotes the verb phrase; and
marking an inflectional phrase or a compound structure in the syntactic structure data as the phonological clause annotation data, wherein the clause annotation data PClause is expressed as:
PClause = IP or PClause = CP
where IP denotes the inflectional phrase and CP denotes the compound structure.

4. The syntax-prosody-based speech synthesis method according to claim 2, further comprising, after the step of converting the syntactic structure data carrying accent annotation data into the phonological phrase annotation data and the phonological clause annotation data according to the syntax-prosody mapping relationship to obtain the recursive phonological structure, the following step:
performing a phonological boundary insertion operation on the recursive phonological structure according to the stress culminativity principle and the anti-lapse constraint principle.

5. The syntax-prosody-based speech synthesis method according to claim 4, further comprising, after the step of performing the phonological boundary insertion operation on the recursive phonological structure according to the stress culminativity principle and the anti-lapse constraint principle, the following step:
performing a redundant structure elimination operation on the recursive phonological structure after the phonological boundary insertion operation.

6. The syntax-prosody-based speech synthesis method according to claim 1, further comprising, before the step of invoking the trained recursive phonological model and inputting the recursive phonological structure into the trained recursive phonological model to perform the speech synthesis operation, the following steps:
reading a system database and obtaining model training data from the system database, wherein the model training data includes training text data and training speech data corresponding to the training text data;
converting the training text data into a training recursive phonological structure; and
invoking an original text-to-speech model and performing a model training operation on the original text-to-speech model with the training recursive phonological structure as input data and the training speech data as output data, obtaining the trained recursive phonological model.

7. A syntax-prosody-based speech synthesis apparatus, comprising:
a request receiving module, configured to receive a speech synthesis request carrying speech synthesis text sent by a user terminal;
a recursive phonological structure construction module, configured to construct a recursive phonological structure from the speech synthesis text, wherein the recursive phonological structure includes a phoneme sequence, accent annotation data, phonological phrase annotation data, and phonological clause annotation data, and the recursive phonological structure PPhrase_n is expressed as:
PPhrase_n = PPhrase_{n-1} + PClause_n
where PPhrase_n denotes the nth phonological phrase annotation data and PClause_n denotes the nth phonological clause annotation data; and
a speech synthesis module, configured to invoke a trained recursive phonological model and input the recursive phonological structure into the trained recursive phonological model to perform a speech synthesis operation, obtaining target audio data.

8. The syntax-prosody-based speech synthesis apparatus according to claim 7, wherein the recursive phonological structure construction module comprises:
a syntactic parsing sub-module, configured to perform a syntactic parsing operation on the speech synthesis text using a syntactic parser to obtain syntactic structure data;
a first syntactic structure conversion sub-module, configured to convert the syntactic structure data into the phoneme sequence and into syntactic structure data carrying accent annotation data; and
a second syntactic structure conversion sub-module, configured to convert the syntactic structure data carrying accent annotation data into the phonological phrase annotation data and the phonological clause annotation data according to a syntax-prosody mapping relationship, obtaining the recursive phonological structure.

9. A computer device, comprising a memory and a processor, wherein the memory stores computer readable instructions, and the processor, when executing the computer readable instructions, implements the steps of the syntax-prosody-based speech synthesis method according to any one of claims 1 to 6.

10. A computer readable storage medium, wherein computer readable instructions are stored on the computer readable storage medium, and the computer readable instructions, when executed by a processor, implement the steps of the syntax-prosody-based speech synthesis method according to any one of claims 1 to 6.
CN202510032215.XA 2025-01-08 2025-01-08 A speech synthesis method based on syntax and prosody and related equipment Pending CN119785769A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510032215.XA CN119785769A (en) 2025-01-08 2025-01-08 A speech synthesis method based on syntax and prosody and related equipment

Publications (1)

Publication Number Publication Date
CN119785769A true CN119785769A (en) 2025-04-08

Family

ID=95232273



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination