CN1308908C - Transformation from characters to sound for synthesizing text paragraph pronunciation

Info

Publication number: CN1308908C
Application number: CNB031327095A
Authority: CN
Inventors: 陈桂林; 黄建成
Original assignee: Motorola Inc
Current assignee: Nuance Communications Inc
Priority date: 2003-09-29
Filing date: 2003-09-29
Publication date: 2007-04-04
Anticipated expiration: 2023-09-29
Also published as: EP1668629A1; RU2006114705A; DE602004019949D1; CN1604184A; KR20060056404A; KR100769032B1; EP1668629B1; RU2320026C2; WO2005034083A1; EP1668629A4

Abstract

The present invention discloses a method (200) for letter-to-sound conversion in speech synthesis. The method (200) comprises: in step (220), receiving a text string and selecting at least one word from it; in step (240), decomposing the word into sub-words that form a sub-word sequence, at least one of the sub-words comprising at least two letters; in step (250), identifying phonemes for the sub-words; in step (260), concatenating the phonemes into a phoneme sequence; and in step (280), performing speech synthesis on the phoneme sequence.

Description

Method for letter-to-sound conversion in speech synthesis

Technical field

The present invention relates generally to Text-To-Speech (TTS) synthesis. The invention is particularly useful for letter-to-sound conversion for the synthesized pronunciation of a text segment.

Background technology

Text-To-Speech (TTS) conversion generally refers to concatenated text-to-speech synthesis, which allows an electronic device to receive an input text string and convert it into a representation in the form of synthesized speech. However, a device that must synthesize speech from an indeterminate number of received text strings faces a difficulty, namely providing intelligible, high-quality synthesized speech. One difficulty in letter-to-sound conversion is that the same letter or letter combination may have different sounds and different vowel stress/emphasis, depending on the adjacent letters and on its position in the text segment to be synthesized.

In this specification, including the claims, the terms "comprises", "comprising" or similar terms are intended to mean a non-exclusive inclusion, such that a method or apparatus that comprises a list of elements does not include those elements solely, but may also include other elements not listed.

Summary of the invention

According to one aspect of the present invention, there is provided a method for letter-to-sound speech synthesis, the method comprising:

receiving a text string and selecting at least one word from it;

decomposing the word into sub-words that form a sub-word sequence, at least one of the sub-words comprising at least two letters;

identifying phonemes for the sub-words;

concatenating the phonemes into a phoneme sequence; and

performing speech synthesis on the phoneme sequence,

wherein the sub-word sequence is determined by analyzing the possible sub-words that could make up the word.

Preferably, each possible sub-word has an associated predetermined weight.

Suitably, the sub-words that can form the selected word with the maximum combined weight are selected to produce the sub-word sequence. A suitable sub-word sequence is determined from the result of analyzing a directed acyclic graph (DAG).

Suitably, the phonemes are identified using a phoneme identifier table, which contains phonemes corresponding to at least one of the sub-words.

Preferably, the identifier table also contains position relevance indicators that indicate the relevance of a sub-word's position within the word.

Phoneme weights may also be associated with the corresponding position relevance indicators.

Description of drawings

In order that the invention may be readily understood and put into practical effect, reference will now be made to a preferred embodiment as illustrated in the accompanying drawings, in which:

Fig. 1 is a schematic block diagram of an electronic device in accordance with the present invention;

Fig. 2 is a flow diagram illustrating a method for text-to-speech synthesis;

Fig. 3 illustrates a directed acyclic graph (DAG);

Fig. 4 is part of a mapping table that maps symbols to phonemes;

Fig. 5 is part of a phoneme identifier table; and

Fig. 6 is part of a vowel pair table.

Embodiment

Referring to Fig. 1, there is illustrated an electronic device 100 in the form of a radio-telephone. It comprises a device processor 102 operatively coupled by a bus 103 to a user interface 104, which is typically a touch screen or, alternatively, a display screen and keypad. The electronic device 100 also has an utterance corpus 106, a speech synthesizer 110, a non-volatile memory 120, a read-only memory 118 and a radio communications module 116, all operatively coupled to the processor 102 by the bus 103. The speech synthesizer 110 has an output coupled to drive a speaker 112. The utterance corpus 106 contains representations of words or phonemes together with the associated sampled, digitized and processed utterance waveforms (PUWs). As described below, the non-volatile memory 120 (memory module) is used for Text-To-Speech (TTS) synthesis (the text may be received by module 116 or by other means). The utterance corpus also contains sampled and digitized utterance waveforms in the form of phonemes and of the stress/emphasis of prosodic features.

As will be apparent to a person skilled in the art, the radio frequency communications unit 116 is typically a combined receiver and transmitter having a common antenna. The radio frequency communications unit 116 has a transceiver coupled to the antenna via a radio frequency amplifier. The transceiver is also coupled to a combined modulator/demodulator that couples the communications unit 116 to the processor 102. In this embodiment, the non-volatile memory 120 (memory module) also stores a user-programmable phonebook database Db, and the read-only memory 118 stores operating code (OC) for the device processor 102.

Referring to Fig. 2, there is illustrated a method 200 for text-to-speech synthesis. After a start step 210, a step 220 of receiving a text string TS from the memory 120 is performed. The text string TS may originate from a text message received by module 116 or by other means. A step 230 then selects at least one word from the text string TS, and a decomposing step 240 decomposes the word into sub-words that form a sub-word sequence, at least one of the sub-words comprising at least two letters. An identifying step 250 identifies phonemes for the sub-words, and a concatenating step 260 concatenates the phonemes into a phoneme sequence.

The sub-word sequence is determined by analyzing the possible sub-words that could make up the word. For example, referring briefly to the directed acyclic graph (DAG) of Fig. 3, if the selected word is "mention", the DAG is constructed from all the possible sub-words that could make up the selected word "mention". Each sub-word has a predetermined weight; for example, the sub-words "ment", "men" and "tion" shown have weights of 88, 86 and 204 respectively. The decomposing step 240 therefore traverses the DAG and selects the sub-words that form the selected word with the maximum combined (summed) weight. For the word "mention", the sub-words "men" and "tion" are selected.
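The maximum-weight selection over the DAG can be sketched as a small dynamic program. This is only an illustration under stated assumptions, not the patented implementation: the weights for "ment", "men" and "tion" are taken from the description, while the entry for "ion" is a made-up value added so that a competing path exists, and the names `SUBWORD_WEIGHTS` and `decompose` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# "ment", "men" and "tion" use the weights given in the description.
# The weight for "ion" is hypothetical, added only to create a second path.
SUBWORD_WEIGHTS: Dict[str, int] = {"ment": 88, "men": 86, "tion": 204, "ion": 100}


def decompose(word: str, weights: Dict[str, int]) -> Optional[Tuple[List[str], int]]:
    """Return the sub-word sequence with the largest summed weight, or None."""
    # best[i] holds (total weight, sub-words) for the best cover of word[:i].
    best: List[Optional[Tuple[int, List[str]]]] = [None] * (len(word) + 1)
    best[0] = (0, [])
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            prev = best[start]
            # Skip edges that start from an unreachable node or are not sub-words.
            if prev is None or piece not in weights:
                continue
            total = prev[0] + weights[piece]
            current = best[end]
            if current is None or total > current[0]:
                best[end] = (total, prev[1] + [piece])
    result = best[len(word)]
    if result is None:
        return None
    return result[1], result[0]
```

With these weights, the path "men" + "tion" sums to 290 and beats "ment" + "ion" at 188, which matches the selection described for "mention". Each character boundary plays the role of a DAG node and each known sub-word the role of an edge.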

The step 250 of identifying phonemes uses two tables stored in the memory 120. One table, shown in part in Fig. 4, is a mapping table MT that maps symbols to phonemes. As illustrated, the phoneme ae is denoted by the symbol @ and the phoneme th is denoted by the symbol D. The other table is a phoneme identifier table (PIT), part of which is illustrated in Fig. 5. The phoneme identifier table (PIT) comprises a sub-word field, a phoneme weight field, one or more position relevance fields or indicators, and one or more phoneme identifier fields. For example, the first row in Fig. 5 is aa, 120 A_C, where aa is the sub-word, 120 is the phoneme weight, the letter A is the position relevance, and C is the phoneme identifier corresponding to the sub-word aa. Position relevance is marked as follows: A, relevant to all positions; I, relevant to a sub-word at the start of a word; M, relevant to a sub-word in the middle of a word; F, relevant to a sub-word at the end of a word. The step 250 of identifying phonemes therefore operates using the phoneme identifier table (PIT) and takes into account the position of the sub-word in the word.
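A position-aware lookup in the phoneme identifier table might look like the following sketch. Only the first row (aa, 120, A, C) comes from the description of Fig. 5; the two "en" rows, the tie-breaking rule (highest weight among matching rows) and the names `PitRow`, `position_of` and `identify_phoneme` are assumptions made for illustration.

```python
from typing import List, NamedTuple, Optional


class PitRow(NamedTuple):
    subword: str
    weight: int
    position: str  # "A" all, "I" initial, "M" middle, "F" final
    phoneme: str


# The first row follows the description of Fig. 5; the "en" rows are hypothetical.
PIT: List[PitRow] = [
    PitRow("aa", 120, "A", "C"),
    PitRow("en", 90, "I", "ehn"),
    PitRow("en", 110, "F", "axn"),
]


def position_of(index: int, count: int) -> str:
    """Classify a sub-word by its index within a sequence of `count` sub-words."""
    if index == 0:
        return "I"  # assumption: a lone sub-word is treated as initial
    if index == count - 1:
        return "F"
    return "M"


def identify_phoneme(subword: str, position: str, table: List[PitRow]) -> Optional[str]:
    """Pick the highest-weight row whose position tag matches or is 'A'."""
    rows = [r for r in table if r.subword == subword and r.position in (position, "A")]
    if not rows:
        return None
    return max(rows, key=lambda r: r.weight).phoneme
```

For example, `identify_phoneme("en", position_of(1, 2), PIT)` selects the word-final row, while the "aa" row matches at any position because it is tagged A.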

The phoneme weights and the predetermined DAG weights WT are the same weights, obtained from Fig. 5. In determining these weights, if frequency of occurrence alone were used as the weight, a substring would have a greater weight than the string that contains it. Consequently, when the segmentation with the maximum weight is selected, short morpheme-like strings would usually be preferred. For example, the word seeing would be decomposed as s|ee|in|g rather than s|ee|ing. In general, however, the relationship between a longer string and its phoneme sequence is more reliable. To ensure that longer morpheme-like strings have higher priority, the following aspects are considered:

- Affix. If a short string is a prefix or suffix of a longer string, its frequency (number of occurrences) is added to that of the longer string; other substrings are not considered.

- Ambiguity. In some cases a morpheme-like string can correspond to several phoneme strings; for example, en can be pronounced ehn or axn. To reduce this uncertainty, the position of the string is used, for example prefix, word-medial or suffix. Even then, a morpheme-like string may still correspond to more than one phoneme string. To address this, the phoneme string with the maximum frequency is selected and a ratio r is calculated with the following formula:

r = \frac{\max N_{uk}}{\sum N_{uk}}

where u is the string index and k is the position index. If r < a (a is a threshold, a = 0.7), the morpheme-like string is excluded. For example, the suffix en can be pronounced ehn or axn; if the total number of occurrences is 1000 and the number corresponding to axn is 800 (which is, of course, the maximum), then r = 0.8. The suffix en can therefore be added to the list.

- Minimum frequency. A minimum frequency min (min = 9) can also be set as a threshold. Strings whose frequency is below this threshold are discarded.

Under these constraints, each string can be assigned a weight W_s as follows: W_s = 10 \ln N_s, where N_s is the adjusted frequency.
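The ratio test, the minimum-frequency threshold and the weight formula can be combined in a short sketch. The thresholds 0.7 and 9 and the formula 10 ln N come from the description; treating the total count over all candidate pronunciations as the adjusted frequency N_s is an assumption, since the description does not spell out how the adjustment is applied, and the function names are hypothetical.

```python
import math
from typing import Dict, Optional

RATIO_THRESHOLD = 0.7  # threshold a from the description
MIN_FREQUENCY = 9      # minimum frequency min from the description


def dominant_ratio(phoneme_counts: Dict[str, int]) -> float:
    """r = largest count / total count over the candidate phoneme strings."""
    return max(phoneme_counts.values()) / sum(phoneme_counts.values())


def string_weight(phoneme_counts: Dict[str, int]) -> Optional[float]:
    """Return 10 * ln(N) for an accepted string, or None if it is rejected."""
    # Assumption: the adjusted frequency N_s is the total count for the string.
    total = sum(phoneme_counts.values())
    if total < MIN_FREQUENCY:
        return None  # too rare to keep
    if dominant_ratio(phoneme_counts) < RATIO_THRESHOLD:
        return None  # no single pronunciation dominates
    return 10 * math.log(total)
```

For the suffix en with counts of 800 for axn and 200 for ehn, `dominant_ratio` gives 0.8, so the string passes the threshold of 0.7, in line with the worked example in the text.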

Thereafter, the method 200 performs a step 265 of assigning stress or emphasis to the phonemes that represent vowels. Step 265 identifies the vowels among the phonemes identified in the preceding step 250. In essence, step 265 searches a vowel pair table stored in the memory 120 for relative stronger/weaker stress. Part of the vowel pair table is illustrated in Fig. 6. For example, consider a word with three vowels identified as phonemes, the vowels being represented by the symbols (obtained from the mapping table MT) 'ax, aa and ae. The vowel pair table is then analyzed: when 'ax occurs before aa, the indicated stress weight is 368, and when aa occurs before 'ax, the stress weight is 354. Thus, analyzing the vowel pair table for 'ax, aa and ae yields the following result: the vowel indicated by the symbol ae has primary (strongest) stress; the vowel indicated by the symbol 'ax has secondary stress; and the vowel indicated by the symbol aa has no stress.

The stress weights are determined using a training dictionary. Each entry of the dictionary comprises a word and its corresponding pronunciation, including stress, syllable boundaries and the alignment of letters to phonemes. Stress can be determined from this dictionary by statistical analysis. In this regard, stress reflects the strong/weak relationship between vowels. To produce the required data, all entries of the dictionary are statistically analyzed in advance. Specifically, within a word, if vowel v_i is stressed and vowel v_j is not, the pair (v_i, v_j) is given one point and the pair (v_j, v_i) is given zero points. If neither vowel is stressed, both pairs are given zero points.
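The point-counting rule for the training dictionary can be illustrated as follows. This sketch covers only the counting step stated in the text; how the points become the stress weights of Fig. 6 is not described and is not modeled here. The input format (a list of `(vowel, is_stressed)` tuples per dictionary entry) and the name `count_stress_pairs` are assumptions.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Vowel = Tuple[str, bool]  # (vowel symbol, is_stressed); hypothetical input format


def count_stress_pairs(entries: Iterable[List[Vowel]]) -> Counter:
    """Give pair (v_i, v_j) one point whenever v_i is stressed and v_j is not."""
    points: Counter = Counter()
    for vowels in entries:
        for i, (vi, stressed_i) in enumerate(vowels):
            for j, (vj, stressed_j) in enumerate(vowels):
                # The reverse pair (v_j, v_i) gets zero points, so it is not touched.
                if i != j and stressed_i and not stressed_j:
                    points[(vi, vj)] += 1
    return points
```

Because a `Counter` returns zero for missing keys, pairs in which neither vowel is stressed simply never appear, which corresponds to both pairs receiving zero points.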

A test step 270 is then performed to determine whether any more words in the text string TS need to be processed. If so, the method 200 returns to step 230; otherwise, step 280 performs speech synthesis on the phoneme sequence. The speech synthesis is performed by the synthesizer 110 on the phoneme sequence of each word. The method 200 then terminates at an end step 290.

During the speech synthesis step 280, the vowel stress (primary, secondary or no stress, as appropriate) is also used to provide appropriate stress emphasis, thereby improving the quality of the synthesized speech.
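The control flow of steps 230 to 280 can be summarized in a short driver. This is a schematic sketch only: splitting the text on whitespace stands in for word selection, and `to_phonemes` and `synthesize` are hypothetical callables standing in for steps 240 to 265 and for the synthesizer 110 respectively.

```python
from typing import Callable, List


def run_method(
    text: str,
    to_phonemes: Callable[[str], List[str]],
    synthesize: Callable[[List[List[str]]], None],
) -> List[List[str]]:
    """Process each word in turn, then pass all phoneme sequences to synthesis."""
    sequences: List[List[str]] = []
    # Assumption: whitespace splitting stands in for word selection (step 230).
    for word in text.split():
        # Steps 240 to 265: decompose, identify phonemes, concatenate, assign stress.
        sequences.append(to_phonemes(word))
    # The loop ending corresponds to test step 270 finding no more words.
    synthesize(sequences)  # step 280
    return sequences
```

Passing the per-word processing in as a callable keeps the loop independent of how decomposition and phoneme identification are actually implemented.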

Advantageously, the present invention improves, or at least alleviates, the handling of sounds and vowel stress/emphasis that depend on adjacent letters and on position in the text segment to be synthesized.

The detailed description provides a preferred embodiment only and is not intended to limit the scope, applicability or configuration of the invention. Rather, the detailed description of the preferred embodiment provides those skilled in the art with an enabling description for implementing the preferred embodiment of the invention. It should be understood that various changes may be made in the function and arrangement of elements without departing from the invention as set forth in the appended claims.

Claims

1. A method for letter-to-sound speech synthesis, the method comprising:

receiving a text string and selecting at least one word from it;

decomposing the word into sub-words that form a sub-word sequence, at least one of the sub-words comprising at least two letters;

identifying phonemes for the sub-words;

concatenating the phonemes into a phoneme sequence; and

performing speech synthesis on the phoneme sequence,

wherein the sub-word sequence is determined by analyzing the possible sub-words that could make up the word, each possible sub-word has an associated predetermined weight, and, according to the result of analyzing a directed acyclic graph, the sub-words that can form the selected word with the maximum combined weight are selected to produce the sub-word sequence.

2. The method for letter-to-sound speech synthesis according to claim 1, wherein the step of identifying phonemes uses a phoneme identifier table, the phoneme identifier table containing phonemes corresponding to at least one of the sub-words.

3. The method for letter-to-sound speech synthesis according to claim 2, wherein the identifier table also contains position relevance indicators for indicating the relevance of a sub-word's position within the word.

4. The method for letter-to-sound speech synthesis according to claim 3, wherein the identifier table also contains phoneme weights associated with the position relevance indicators.