CN103165126A - Method for voice playing of mobile phone text short messages - Google Patents
Method for voice playing of mobile phone text short messages
- Publication number
- CN103165126A (application CN2011104243757A)
- Authority
- CN
- China
- Prior art keywords
- word
- text
- rhythm
- mobile phone
- voice
- Prior art date
- 2011-12-15
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a method for voice playback of mobile phone text messages. After the mobile phone receives a message in text form, the text string of the message is passed through text analysis to obtain the corresponding speech waveform, from which speech is synthesized and played. The method achieves real-time speech synthesis and instant text-to-speech conversion; it saves time, safeguards users who are driving, and is convenient for elderly users with poor eyesight.
Description
Technical field
The present invention relates to the field of mobile communications, and in particular to a method for playing short messages aloud in real time.
Background technology
With the spread of PDA-class mobile phones, people enjoy many conveniences in life and work and can communicate with family and friends promptly. However, many users receive text messages while driving and cannot read them in time, and reading them anyway easily causes traffic accidents. Moreover, elderly users often have difficulty because the screen font size does not match their eyesight. A speech synthesis technique that promptly reads received text messages aloud is therefore a very useful mobile phone feature.
Summary of the invention
To solve the above technical problems, the present invention provides a method for playing short messages aloud in real time: after the mobile phone receives a message in text form, the text string of the message is passed through text analysis to obtain the corresponding speech waveform, from which synthetic speech is formed and played.
The method comprises a text normalization and symbol conversion step, which converts the special symbols, abbreviations, English words and measurement units in the text string of the received message into recognizable pronunciation-unit identifiers.
It further comprises a word segmentation step, which divides the input text into words according to preset segmentation rules, thereby determining the prosodic structure of the sentence and the pronunciation of polyphonic characters.
It also comprises a prosody prediction step, a coarticulation step and a unit selection step, wherein the prosody prediction step determines the pronunciation of each word, the coarticulation step determines how adjacent words are joined, and the unit selection step selects the optimal pronunciation from the dictionary according to the prosodic requirements and the word's pronunciation.
When selecting acoustic units to construct the sound bank, a loss function is used to describe the synthesis capability of sound banks of the same size. The loss function can be expressed as:

ζ(f, d, c) = c·f / d

where f is the word frequency of the current acoustic unit, d is the predicted duration of the unit, and c is the degree of coarticulation between the phonemes contained in the unit. Without considering prosodic conditions, when constructing the sound bank composed of acoustic units, the goal is to minimize the value of the loss function over the bank.
A fundamental frequency parameter model is adopted to control prosody generation.
In the method for voice playback of mobile phone text messages provided by the invention, a text-to-speech conversion system is installed on the mobile phone, and speech synthesis and playback through the loudspeaker are completed automatically. The invention offers real-time speech synthesis and instant text-to-speech conversion; it saves time, safeguards users who are driving, and is convenient for elderly users with poor eyesight. Through user settings and sound-bank customization, the voice can be switched between male, female or cartoon-character voices, and the phone can also automatically invoke the sound bank of the corresponding gender according to the sender's preset gender.
Description of drawings
Fig. 1 is a flowchart of the method in an embodiment of the present invention;
Fig. 2 is a flowchart of a specific embodiment of the present invention.
Embodiment
With reference to Fig. 1, the method of the present invention for playing short messages aloud in real time comprises the following steps:
Step 1: message reception
The mobile phone receives one or more short messages in text form from the base station through its radio-frequency module and temporarily stores them in its memory.
Step 2: text-to-speech conversion
A text-to-speech module (Text-To-Speech model, TTS Model) is adopted, i.e. a speech synthesis module whose input is a text string. Its input is an ordinary text string: the text analyzer in the module first decomposes the input string, according to a pronouncing dictionary, into words with attribute tags and their pronunciation symbols; then, according to semantic and phonetic rules, it determines the stress level, syntactic structure, intonation and pauses for each word and each syllable, so that the text string is converted into a symbol code string. Based on the results of this analysis, the prosodic features of the target speech are generated, and the speech units are then combined to synthesize the output speech.
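The dictionary-driven decomposition of the input string into words can be illustrated with a minimal sketch of forward maximum matching, one common form of preset segmentation rule; the toy pronouncing dictionary below is an illustrative assumption, not the patent's actual lexicon.

```python
# Sketch of dictionary-based word division: forward maximum matching.
# The tiny pronouncing dictionary is an illustrative stand-in.
PRONOUNCING_DICT = {
    "今天": "jin1 tian1",   # "today"
    "天气": "tian1 qi4",    # "weather"
    "很": "hen3",           # "very"
    "好": "hao3",           # "good"
}

def segment(text, lexicon, max_word_len=4):
    """At each position take the longest dictionary word;
    fall back to a single character."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon or length == 1:
                words.append(candidate)
                i += length
                break
    return words

print(segment("今天天气很好", PRONOUNCING_DICT))
# ['今天', '天气', '很', '好'] -> pronunciations then looked up per word
```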
The present invention reads the message data on the phone aloud immediately in the form of speech, so the user only needs to listen passively. This places the following requirements on the speech synthesis system: fast response, low computational and storage complexity, good extensibility, and clear, highly intelligible synthetic speech suitable for everyday communication and certain professional domains.
When selecting acoustic units to construct the sound bank, the present invention uses a loss function to describe the synthesis capability of sound banks of the same size. The loss function can be expressed as:

ζ(f, d, c) = c·f / d

where f is the word frequency of the current acoustic unit, d is the predicted duration of the unit, and c is the degree of coarticulation between the phonemes contained in the unit. Without considering prosodic conditions, when constructing the sound bank composed of acoustic units, the goal is to minimize the value of the loss function over the bank.
The present invention adopts the Fujisaki model, a widely used fundamental frequency parameter model, to control prosody generation; by simulating the human speech production mechanism, it predicts the variation of the fundamental frequency and thereby controls the rhythm, intonation and emotion of the synthetic speech.
The present invention suits any situation in which the user cannot conveniently read a message with his or her own eyes.
Through user settings and sound-bank customization, the voice can be switched between male, female or cartoon-character voices.
The mobile phone can also automatically invoke the sound bank of the corresponding gender according to the sender's preset gender.
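A minimal sketch of this voice-bank dispatch, assuming hypothetical bank file names; the real mapping would come from the user's settings and the customized sound banks.

```python
# Hypothetical voice-bank dispatch by the sender's preset gender,
# falling back to the user's configured default voice.
VOICE_BANKS = {
    "male": "bank_male.dat",
    "female": "bank_female.dat",
    "cartoon": "bank_cartoon.dat",
}

def pick_voice_bank(sender_gender, user_default="female"):
    return VOICE_BANKS.get(sender_gender, VOICE_BANKS[user_default])

print(pick_voice_bank("male"))  # bank_male.dat
print(pick_voice_bank(None))    # falls back to bank_female.dat
```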
With reference to Fig. 2, the input text first passes through normalization and symbol conversion, in which special symbols, abbreviations, English words and measurement units are converted into recognizable pronunciation-unit identifiers. The word segmentation model then divides the input text into words according to preset segmentation rules; this segmentation essentially determines the prosodic structure of the sentence and the pronunciation of polyphonic characters. Prosody prediction determines the pronunciation of each word, and coarticulation determines how adjacent words are joined. The unit selection module chooses the optimal pronunciation from the dictionary according to the prosodic requirements and the word's pronunciation, and the waveform is recovered through speech reconstruction. Finally, the concatenation module joins the speech waveforms of the individual words, under the control of splicing parameters, into the final utterance.
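The normalization and symbol-conversion step can be sketched as a small rule table applied before segmentation; the rules below (units, symbols, abbreviations) are illustrative assumptions rather than the patent's actual conversion table.

```python
import re

# Sketch of text normalization: expand symbols, measurement units and
# abbreviations into readable words before word segmentation.
SYMBOL_RULES = [
    (re.compile(r"(\d+)\s*km"), r"\1 kilometers"),
    (re.compile(r"(\d+)\s*%"), r"\1 percent"),
    (re.compile(r"\s*&\s*"), " and "),
    (re.compile(r"\bDr\."), "Doctor"),
]

def normalize(text):
    for pattern, replacement in SYMBOL_RULES:
        text = pattern.sub(replacement, text)
    return text

print(normalize("Dr. Lee ran 5km & lost 3%"))
# Doctor Lee ran 5 kilometers and lost 3 percent
```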
1. Acoustic unit selection and generation
To give the synthetic speech high clarity, intelligibility and naturalness, waveform-based speech synthesis is usually adopted. The synthesis units in waveform-concatenation synthesis are cut from original natural speech and therefore retain some prosodic features of natural speech. According to the phonetic and prosodic rules of the natural language, suitable speech primitives are stored so that the units achieve maximum phonetic and prosodic coverage within a given storage capacity. At synthesis time the output speech is produced through acoustic unit selection, waveform concatenation, smoothing and related steps. With a well-designed corpus, and by choosing optimal acoustic units from the sound bank according to phonetic and prosodic rules, the system outputs high-quality speech.
Common candidate speech units include phrases, syllables, phonemes and diphones. When constructing the corpus needed for waveform concatenation, the strengths and weaknesses of different unit types can be combined. For example, for phoneme and syllable combinations with strong coarticulation that occur frequently in natural speech, splicing between heavily coarticulated combinations should be avoided as far as possible when forming the target speech; even slightly improper unit selection makes the result hard to accept acoustically. The type and length of the acoustic units adopted in a practical synthesis system are therefore not fixed.
When selecting acoustic units to construct a sound bank, a loss function is usually used to describe the synthesis capability of sound banks of the same size. A typical loss function can be expressed as:

ζ(f, d, c) = c·f / d    (1)

where f is the word frequency of the current acoustic unit, d is the predicted duration of the unit, and c is the degree of coarticulation between the phonemes contained in the unit. Without considering prosodic conditions, when constructing the sound bank composed of acoustic units, the goal is to minimize the value of the loss function (1) over the bank.
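One plausible reading of this criterion, sketched below under assumed data: with a fixed storage budget, the bank keeps the units whose omission would cost the most, so that the loss summed over the omitted units is as small as possible. The candidate units and their figures are invented for illustration.

```python
def zeta(f, d, c):
    """Loss of an acoustic unit with word frequency f, predicted
    duration d (seconds) and coarticulation degree c."""
    return c * f / d

def build_sound_bank(candidates, duration_budget):
    """candidates: list of (name, f, d, c) tuples. Greedily keep the
    units with the highest loss per second of storage until the
    duration budget is spent."""
    ranked = sorted(candidates,
                    key=lambda u: zeta(u[1], u[2], u[3]) / u[2],
                    reverse=True)
    bank, used = [], 0.0
    for name, f, d, c in ranked:
        if used + d <= duration_budget:
            bank.append(name)
            used += d
    return bank

units = [("ma1", 900, 0.20, 0.8), ("shi4", 1200, 0.25, 0.9),
         ("de0", 2500, 0.15, 0.6), ("zhuang4", 60, 0.30, 0.7)]
print(build_sound_bank(units, duration_budget=0.6))
# ['de0', 'ma1', 'shi4'] -- the rare unit 'zhuang4' is left out
```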
The acoustic units used for splicing are usually obtained by cutting continuous speech. Word frequency information for common everyday expressions can be obtained by statistics, and sentences are selected under the guidance of this information so that the selected sentences give good coverage of high-frequency words; these selected sentences become the script to be recorded later.
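The frequency-guided sentence selection just described can be sketched as greedy coverage: repeatedly pick the sentence whose not-yet-covered words carry the largest total frequency. The sentences and frequency table are illustrative assumptions.

```python
def select_script(sentences, word_freq, n_sentences):
    """Greedy coverage: sentences are pre-segmented word lists."""
    covered, script = set(), []
    remaining = list(sentences)
    for _ in range(min(n_sentences, len(remaining))):
        gain = lambda s: sum(word_freq.get(w, 0) for w in set(s) - covered)
        best = max(remaining, key=gain)
        if gain(best) == 0:       # nothing new left to cover
            break
        script.append(best)
        covered |= set(best)
        remaining.remove(best)
    return script

word_freq = {"的": 100, "了": 80, "天气": 40, "火车": 5}
sentences = [["今天", "的", "天气", "好"],
             ["他", "走", "了"],
             ["火车", "晚点", "了"]]
print(select_script(sentences, word_freq, 2))
# picks the two sentences that best cover high-frequency words
```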
A suitable announcer is selected to read the script aloud in a controlled manner while being recorded. The recorded speech waveform data are cut according to the script and the acoustic unit division: for Chinese the units are usually words and characters (CV structure), while English usually requires cutting down to words plus a small number of phonemes or diphones, thus forming the pronunciation unit bank. Each cut unit is annotated with its position in the original sentence (initial, medial or final) and the words adjacent to it before and after; these annotations provide the basis for the decisions of the unit selection module.
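The annotations attached to each cut unit might be captured in a record like the following sketch; the field names and file path are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class AcousticUnit:
    """One cut unit from the recordings, with the labels the unit
    selection module consults."""
    text: str           # the word/character this waveform realizes
    wave_file: str      # path to the cut waveform segment
    position: str       # "initial", "medial" or "final" in the sentence
    left_context: str   # word preceding it in the recording
    right_context: str  # word following it in the recording

unit = AcousticUnit("天气", "units/tianqi_0042.wav",
                    "medial", "今天", "很")
```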
2. Prosody generation
Prosodic parameters are important for controlling the rhythm, intonation and emotion of synthetic speech; for Mandarin Chinese, the fundamental frequency is the physical parameter most directly related to tone. The structure of Chinese can be summarized as follows: phonemes form initials and finals; a final carries a tone to become a toned final; a syllable is formed either by a toned final alone or by an initial spliced with a toned final. Mandarin has five tones (high-level, rising, dipping, falling and neutral) and more than 1,200 toned syllables. A syllable is the sound of one character, i.e. a syllable-character. Characters form words, and words finally form sentences.
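The initial/final/tone structure can be shown with a small sketch that splits a toned pinyin syllable; the initial inventory is abbreviated and the function is an illustration, not a full pinyin parser.

```python
# Two-character initials are listed first so the longest match wins.
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def split_syllable(pinyin):
    """'zhong1' -> ('zh', 'ong', 1); tone 0 marks the neutral tone."""
    tone = int(pinyin[-1]) if pinyin[-1].isdigit() else 0
    body = pinyin.rstrip("012345")
    for ini in INITIALS:
        if body.startswith(ini):
            return ini, body[len(ini):], tone
    return "", body, tone         # zero-initial syllable, e.g. 'ai4'

print(split_syllable("zhong1"))   # ('zh', 'ong', 1)
print(split_syllable("ai4"))      # ('', 'ai', 4)
```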
Prosody generation based on machine learning. Although many rules about prosody have been obtained, they still fall far short of forming prosody that approaches natural speech. To capture the hidden, hard-to-describe prosodic regularities, machine learning methods are usually used to generate prosody. Commonly used models include the hidden Markov model (HMM), artificial neural networks (ANN), the support vector machine (SVM) and decision trees, as sketched below.
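As one concrete instance of this learning-based approach, here is a toy sketch using a decision tree (one of the models named above) to predict prosodic break levels; it relies on scikit-learn, and the features and training rows are invented for illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# Features per word: [word length, relative position in sentence (0-1),
# words remaining until the next punctuation mark]
X = [[1, 0.0, 5], [2, 0.2, 4], [2, 0.5, 2], [1, 0.8, 1],
     [3, 0.3, 3], [2, 0.9, 0], [1, 0.5, 3], [2, 0.7, 1]]
# Label: prosodic break after the word (0 none, 1 minor, 2 major)
y = [0, 0, 1, 1, 0, 2, 0, 1]

clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(clf.predict([[2, 0.95, 0]]))  # a sentence-final word: major break
```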
Prosody generation based on parameterized models. Prosody models based on machine learning extract fine-grained rules that cannot be analyzed manually and greatly reduce the workload of manual analysis, but the approach also has the following problems. First, general learning algorithms require large amounts of data, especially when there are many attribute features. Second, if the available data are unevenly distributed, training will be biased overall, which affects the analysis results. Third, expert knowledge is not well incorporated, which wastes information. Fourth, the trained model is not linked to linguistic features or human perception and therefore cannot be transferred or adjusted. Fundamental frequency and duration are the acoustic parameters that directly affect the perceived prosody, and both vary over time and with the environment. A parameterized model makes use of prior knowledge: it first analyzes the relation between fundamental frequency, duration, linguistic features and human hearing, builds a model of that relation, and extracts the parameters directly related to perception. Such a model makes effective use of expert knowledge, can be trained with little data to relate textual linguistic features to model parameters, and allows the perceived prosodic features to be changed simply by adjusting those parameters.
The Fujisaki model is a widely used fundamental frequency parameter model; it predicts the variation of the fundamental frequency mainly by simulating the human speech production mechanism. Fujisaki holds that changes in fundamental frequency have two main causes: the influence of prosodic phrase boundaries (Phrase) and the influence of syllable tones (Accent). The fundamental frequency contour is generated according to the mechanics of vocal fold vibration, with the Phrase and Accent commands as the input of the prediction system and the F0 contour as its output; the Phrase component is produced from a pulse signal and the Accent component from a step function. Under this model the fundamental frequency contour can be expressed as:

ln F0(t) = ln Fmin + Σ_{i=1..I} Api · Gpi(t − T0i) + Σ_{j=1..J} Aaj · [Gaj(t − T1j) − Gaj(t − T2j)]

where the phrase and accent control functions are

Gpi(t) = αi² · t · e^(−αi·t) for t ≥ 0,  Gaj(t) = min[1 − (1 + βj·t) · e^(−βj·t), θ] for t ≥ 0,

and both are 0 for t < 0. The parameters in the formula are as follows: Fmin, the minimum fundamental frequency; αi, the control coefficient of the i-th Phrase command; I, the number of Phrase commands; βj, the control coefficient of the j-th Accent command; J, the number of Accent commands; θ, the ceiling parameter of the Accent commands; T0i, the time of the i-th Phrase command; Api, the amplitude of the i-th Phrase command; T1j, the start time of the j-th Accent command; Aaj, the amplitude of the j-th Accent command; T2j, the end time of the j-th Accent command.
The mechanism of the Fujisaki model is very simple. Each phrase command passes a pulse signal through the phrase filter, so the corresponding fundamental frequency rises to a maximum point and then decays gradually; under successive phrase commands, the fundamental frequency contour fluctuates continuously. Each accent command is initiated by a step function; because the accent filter constant β is much larger than the phrase filter constant α, the accent component reaches its maximum very quickly and then decays rapidly.
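A minimal NumPy sketch of the mechanism just described: impulse-driven phrase components that rise and decay slowly, and step-driven accent components that saturate and decay quickly. The command times, amplitudes and filter constants below are illustrative assumptions, not fitted values.

```python
import numpy as np

def Gp(t, alpha=2.0):
    """Phrase control: alpha^2 * t * exp(-alpha*t) for t >= 0, else 0."""
    tc = np.clip(t, 0.0, None)
    return alpha ** 2 * tc * np.exp(-alpha * tc)

def Ga(t, beta=20.0, theta=0.9):
    """Accent control: min[1 - (1 + beta*t)*exp(-beta*t), theta], t >= 0."""
    tc = np.clip(t, 0.0, None)
    return np.minimum(1.0 - (1.0 + beta * tc) * np.exp(-beta * tc), theta)

def f0_contour(t, fmin, phrases, accents):
    """ln F0(t) = ln Fmin + sum Ap*Gp(t-T0) + sum Aa*[Ga(t-T1)-Ga(t-T2)]."""
    ln_f0 = np.full_like(t, np.log(fmin))
    for T0, Ap in phrases:
        ln_f0 += Ap * Gp(t - T0)
    for T1, T2, Aa in accents:
        ln_f0 += Aa * (Ga(t - T1) - Ga(t - T2))
    return np.exp(ln_f0)

t = np.linspace(0.0, 2.0, 200)
f0 = f0_contour(t, fmin=80.0,
                phrases=[(0.0, 0.5)],       # (T0, Ap)
                accents=[(0.3, 0.6, 0.4),   # (T1, T2, Aa)
                         (1.0, 1.4, 0.3)])
print(round(f0.max(), 1), round(f0.min(), 1))  # rises after commands, decays
```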
Beneficial effects of the present invention:
The invention provides real-time speech synthesis and instant text-to-speech conversion, with high system efficiency and stable operation;
The proposed text-to-speech conversion module has a clear structure, a well-defined division of labor among its parts, and strong independence;
The invention is convenient for elderly users, freeing them entirely from reading glasses;
When used while driving, the invention protects the user's traffic safety.
Those skilled in the art will understand that embodiments of the invention may be provided as a method, a system or a computer program product. The invention may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the invention may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage and optical storage) containing computer-usable program code.

The invention is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, such that the instructions stored in this computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to encompass them as well.
Claims (6)
1. A method for voice playback of mobile phone text messages, wherein after the mobile phone receives a message in text form, the text string of the message is passed through text analysis to obtain the corresponding speech waveform, from which synthetic speech is formed and played.
2. The method of claim 1, characterized by comprising a text normalization and symbol conversion step for converting the special symbols, abbreviations, English words and measurement units in the text string of the received message into recognizable pronunciation-unit identifiers.
3. The method of claim 2, characterized by comprising a word segmentation step for dividing the input text into words according to preset segmentation rules, thereby determining the prosodic structure of the sentence and the pronunciation of polyphonic characters.
4. The method of claim 3, characterized by further comprising a prosody prediction step, a coarticulation step and a unit selection step, wherein the prosody prediction step determines the pronunciation of each word, the coarticulation step determines how adjacent words are joined, and the unit selection step selects the optimal pronunciation from the dictionary according to the prosodic requirements and the word's pronunciation.
5. The method of claim 1, characterized in that when acoustic units are selected to construct the sound bank, a loss function is used to describe the synthesis capability of sound banks of the same size, expressed as:
ζ(f, d, c) = c·f / d
where f is the word frequency of the current acoustic unit, d is the predicted duration of the unit, and c is the degree of coarticulation between the phonemes contained in the unit; without considering prosodic conditions, the sound bank composed of acoustic units is constructed so that the value of the loss function over the bank is minimized.
6. The method of claim 1, characterized in that a fundamental frequency parameter model is adopted to control prosody generation.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN2011104243757A CN103165126A (en) | 2011-12-15 | 2011-12-15 | Method for voice playing of mobile phone text short messages |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN103165126A true CN103165126A (en) | 2013-06-19 |
Family
ID=48588150
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN2011104243757A (published as CN103165126A, pending) | Method for voice playing of mobile phone text short messages | 2011-12-15 | 2011-12-15 |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN103165126A (en) |
Patent Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN1271216A (en) * | 1999-04-16 | 2000-10-25 | 松下电器产业株式会社 | Speech voice communication system |
| DE20102259U1 (en) * | 2001-02-09 | 2002-02-21 | Materna Gmbh Information & Com | SMS short message system |
| GB2378875A (en) * | 2001-05-04 | 2003-02-19 | Andrew James Marsh | Annunciator for converting text messages to speech |
| CN1731509A (en) * | 2005-09-02 | 2006-02-08 | 清华大学 | Mobile speech synthesis method |
| CN1972478A (en) * | 2005-11-24 | 2007-05-30 | 展讯通信(上海)有限公司 | A novel method for mobile phone reading short message |
| CN101605307A (en) * | 2008-06-12 | 2009-12-16 | 深圳富泰宏精密工业有限公司 | Test short message service (SMS) voice play system and method |
Cited By (23)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105357397A (en) * | 2014-03-20 | 2016-02-24 | 联想(北京)有限公司 | Output method and communication devices |
| CN103888611A (en) * | 2014-03-20 | 2014-06-25 | 联想(北京)有限公司 | A kind of output method and communication equipment |
| CN103888611B (en) * | 2014-03-20 | 2016-01-27 | 联想(北京)有限公司 | A kind of output method and communication equipment |
| CN105095180A (en) * | 2014-05-14 | 2015-11-25 | 中兴通讯股份有限公司 | Chinese name broadcasting method and device |
| CN104200803A (en) * | 2014-09-16 | 2014-12-10 | 北京开元智信通软件有限公司 | Voice broadcasting method, device and system |
| CN104485100B (en) * | 2014-12-18 | 2018-06-15 | 天津讯飞信息科技有限公司 | Phonetic synthesis speaker adaptive approach and system |
| CN104485100A (en) * | 2014-12-18 | 2015-04-01 | 天津讯飞信息科技有限公司 | Text-to-speech pronunciation person self-adaptive method and system |
| CN106294310B (en) * | 2015-06-12 | 2019-05-03 | 讯飞智元信息科技有限公司 | A kind of Tibetan language tone prediction technique and system |
| CN106294310A (en) * | 2015-06-12 | 2017-01-04 | 讯飞智元信息科技有限公司 | A kind of Tibetan language tone Forecasting Methodology and system |
| CN106507321A (en) * | 2016-11-22 | 2017-03-15 | 新疆农业大学 | A Uygur and Chinese Bilingual GSM Short Message Speech Conversion Broadcasting System |
| CN106652995A (en) * | 2016-12-31 | 2017-05-10 | 深圳市优必选科技有限公司 | Text voice broadcast method and system |
| WO2018121757A1 (en) * | 2016-12-31 | 2018-07-05 | 深圳市优必选科技有限公司 | Method and system for speech broadcast of text |
| CN107909993A (en) * | 2017-11-27 | 2018-04-13 | 安徽经邦软件技术有限公司 | A kind of intelligent sound report preparing system |
| CN109031474A (en) * | 2018-08-31 | 2018-12-18 | 成都润联科技开发有限公司 | A kind of weather information hiding Chinese phonetic broadcasting terminals and its working method based on Beidou satellite communication |
| CN111261139B (en) * | 2018-11-30 | 2023-12-26 | 上海擎感智能科技有限公司 | Literal personification broadcasting method and system |
| CN111261139A (en) * | 2018-11-30 | 2020-06-09 | 上海擎感智能科技有限公司 | Character personification broadcasting method and system |
| CN111128116A (en) * | 2019-12-20 | 2020-05-08 | 珠海格力电器股份有限公司 | Voice processing method and device, computing equipment and storage medium |
| CN113382123A (en) * | 2020-03-10 | 2021-09-10 | 精工爱普生株式会社 | Scanning system, storage medium, and scanning data generation method for scanning system |
| CN113903324A (en) * | 2020-06-18 | 2022-01-07 | 新加坡依图有限责任公司(私有) | Method, device, equipment and machine readable medium for text-to-speech |
| CN113936638A (en) * | 2020-06-29 | 2022-01-14 | 华为技术有限公司 | Text audio playing method and device and terminal equipment |
| CN112966476B (en) * | 2021-04-19 | 2022-03-25 | 马上消费金融股份有限公司 | Text processing method and device, electronic equipment and storage medium |
| CN112966476A (en) * | 2021-04-19 | 2021-06-15 | 马上消费金融股份有限公司 | Text processing method and device, electronic equipment and storage medium |
| CN114360494A (en) * | 2021-12-29 | 2022-04-15 | 广州酷狗计算机科技有限公司 | Rhythm labeling method and device, computer equipment and storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | C06 | Publication | |
| | PB01 | Publication | |
| | C10 | Entry into substantive examination | |
| | SE01 | Entry into force of request for substantive examination | |
| | C05 | Deemed withdrawal (patent law before 1993) | |
| | WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20130619 |