SE519273C2

SE519273C2 - Improvements to, or with respect to, speech-to-speech conversion

Info

Publication number: SE519273C2
Application number: SE9601812A
Authority: SE
Inventors: Bertil Lyberg
Original assignee: Telia Ab
Priority date: 1996-05-13
Filing date: 1996-05-13
Publication date: 2003-02-11
Also published as: NO985178D0; NO985178L; WO1997043707A1; SE9601812L; EP0976026A1; NO318112B1; SE9601812D0

Abstract

A system and method for speech-to-speech conversion for providing spoken responses to speech inputs in at least two natural languages wherein speech inputs are recognised and interpreted in said at least two languages. The recognised speech inputs are evaluated to determine the language of the speech inputs, and a dialogue is undertaken with a database containing speech information data, in said at least two natural languages, to obtain data for the formulation of spoken responses to the speech inputs. The speech information data, obtained from the database, is then converted into spoken responses which exhibit the language characteristics of the respective speech inputs.

Description

... placed. In addition, the stressing of sentences, or parts thereof, determines which sections are to be emphasized in the language, and this can be of importance in determining the exact meaning of the spoken language.

The need for artificially produced speech to be as natural as possible and to have correct accentuation is of particular importance in voice response communication equipment and/or systems that produce speech in various contexts. With known voice response arrangements, the reproduced speech is sometimes difficult to understand and interpret. There is therefore a need for a speech-to-speech conversion system in which the artificial output speech is natural, has the correct accentuation, and is easy to understand.

In languages that have well-developed sentence accent stress and/or pitch on individual words, identifying the natural meaning of words and sentences is very difficult. The fact that stresses can be placed incorrectly increases the risk of misinterpretation, or of the meaning being completely lost to the listening party.

Thus, in order to overcome these difficulties, it would be necessary for a speech-to-speech conversion system to be capable of interpreting the received speech information, irrespective of language and/or dialect, and of matching the language and/or dialect of the output speech with that of the respective input speech. Likewise, in order to determine the meaning of individual words, or phrases, in an unambiguous manner in a spoken sequence, it would be necessary for the speech-to-speech conversion system to be capable of determining, and taking account of, sentence accent and sentence stress in the spoken sequence.

An object of the present invention is to provide a system and a method for speech-to-speech conversion adapted to recognize, interpret and process speech inputs in at least two natural languages and to provide speech outputs, i.e. spoken responses, in the same language as the respective inputs.

Another object of the present invention is to provide a system and a method for speech-to-speech conversion adapted to recognize, interpret and process speech inputs in at least two natural languages and to provide speech outputs, i.e. spoken responses, in the same language and dialect as the respective inputs, where the dialects are matched using prosody information and, more precisely, the fundamental tone curve of the input speech.

A further object of the present invention is to provide a voice response communication system comprising a speech-to-speech conversion system operating according to a speech-to-speech conversion method.

The invention provides, in a voice response communication system, a method for providing a spoken response to a speech input, said method including the steps of recognizing and interpreting the speech input, and utilizing the interpretation to obtain speech information data from a database for use in formulating the spoken response, characterized in that the database contains speech information data for at least two natural languages, in that said method is adapted to recognize and interpret speech inputs in said at least two languages and to provide spoken responses to speech inputs in said languages, and in that said method includes the further steps of evaluating a recognized speech input to determine the language of the input, effecting a dialogue with the database to obtain speech information data for formulating a spoken response in the language of the speech input, and converting the speech information data obtained from the database into said spoken response.

In a preferred method, separate databases may be used for each of said at least two languages, and a dialogue may be effected with only that one of said databases which contains speech information data in the language of the speech input. However, in the event that at least part of the speech information data required for a spoken response is stored in another of said databases, the method may include the further steps of effecting a dialogue with said other database to obtain the required speech information data, translating the information data into the language of said one of the databases, and converting the combined speech information data into a spoken response in the language of the speech input.

The speech recognition and interpretation of a speech input can be effected in at least two natural languages. In this case, the recognized parts, or sequences, of the speech input that result from the speech recognition and interpretation in the at least two natural languages are evaluated to determine the language of the speech input. The outcome of this evaluation process can be used to determine the database with which said dialogue is conducted to obtain the speech information data for a spoken response to the speech input.
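By way of illustration only, the evaluation step described above might be sketched as follows. The sketch assumes that each language-specific recognizer returns a confidence score for its hypothesis; the score, the `RecognitionResult` type and the function names are hypothetical and are not specified in this document.

```python
from dataclasses import dataclass


@dataclass
class RecognitionResult:
    language: str   # label of the recognizer's language, e.g. "A" or "B"
    words: list     # recognized word sequence
    score: float    # hypothetical confidence score (assumption, not from the source)


def evaluate_language(results):
    """Return the language whose recognizer produced the highest score."""
    if not results:
        raise ValueError("no recognition results to evaluate")
    best = max(results, key=lambda r: r.score)
    return best.language


def select_database(language, databases):
    """Pick the database holding speech information data in the given language."""
    return databases[language]
```

Choosing the highest-scoring hypothesis is only one possible evaluation rule; the document leaves the evaluation criterion open.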

The dialogue with a database, and/or between databases, can be effected using a database communication language such as SQL (Structured Query Language).
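A dialogue of this kind might, for example, take the form of a parameterized SQL query. The following minimal sketch uses Python's standard `sqlite3` module with an in-memory database; the table name `speech_info` and its columns are assumptions made for illustration and do not appear in this document.

```python
import sqlite3


def build_database(rows):
    """Create an in-memory database with a hypothetical speech_info table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE speech_info (topic TEXT PRIMARY KEY, response_text TEXT)"
    )
    conn.executemany("INSERT INTO speech_info VALUES (?, ?)", rows)
    conn.commit()
    return conn


def query_response(conn, topic):
    """Return the stored response text for a topic, or None if absent."""
    row = conn.execute(
        "SELECT response_text FROM speech_info WHERE topic = ?", (topic,)
    ).fetchone()
    return row[0] if row else None
```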

In a preferred method according to the present invention, the speech recognition and interpretation include the steps of extracting prosody information, i.e. the fundamental tone curve, from a speech input, and obtaining dialect information from said prosody information, said dialect information being used in the conversion of said speech information data obtained from said database into a spoken response, so that the spoken response is in the same language and dialect as the speech input. This preferred method includes the further steps of determining the intonation pattern of the fundamental tone of the speech input, and thereby the maximum and minimum values of the fundamental tone curve and their respective positions; determining the intonation pattern of the fundamental tone curve of a speech model, and thereby the maximum and minimum values of the fundamental tone curve and their respective positions; and comparing the intonation pattern of the speech input with the intonation pattern of the speech model to identify a time difference between the occurrence of the maximum and minimum values of the fundamental tone curve of the speech input and the maximum and minimum values of the fundamental tone curve of the speech model, the identified time difference being indicative of the dialectal characteristics of the speech input. The time difference can be determined in relation to a reference point in the intonation pattern, for example the point at which a consonant/vowel boundary occurs.
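The comparison of intonation patterns described above can be sketched as follows. This is a simplified illustration under stated assumptions: each fundamental tone curve is given as sampled (time, frequency) values, only the global maximum and minimum are considered, and the reference point (for example a consonant/vowel boundary) is supplied by the caller. The function names are hypothetical.

```python
def find_extrema(times, f0):
    """Return (t_max, t_min): the times of the highest and lowest f0 sample."""
    if len(times) != len(f0) or not f0:
        raise ValueError("times and f0 must be non-empty and of equal length")
    i_max = max(range(len(f0)), key=lambda i: f0[i])
    i_min = min(range(len(f0)), key=lambda i: f0[i])
    return times[i_max], times[i_min]


def timing_difference(input_contour, model_contour, input_ref, model_ref):
    """Compare extremum timing of the input and the speech model.

    Each contour is a (times, f0) pair. Positions are measured relative to
    a reference point, e.g. a consonant/vowel boundary. Returns the time
    differences (d_max, d_min) between input and model.
    """
    in_max, in_min = find_extrema(*input_contour)
    mo_max, mo_min = find_extrema(*model_contour)
    d_max = (in_max - input_ref) - (mo_max - model_ref)
    d_min = (in_min - input_ref) - (mo_min - model_ref)
    return d_max, d_min
```

How the resulting time differences are mapped to a particular dialect is not specified here; a lookup against dialect-specific timing ranges would be one possibility.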

The method according to the present invention may include the step of obtaining information on sentence accents from said prosody information.

The words in the speech model can be checked lexically, and the phrases in the speech model can be checked syntactically. Words and phrases that are not linguistically possible are excluded from the speech model. In addition, the orthography and phonetic transcription of the words in the speech model can be checked, the transcription information including lexically abstracted accent information, of the stressed-syllable type, and information relating to the location of secondary accent. The accent information can relate to tonal word accent I and accent II.
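The lexical and syntactic checks might be sketched, in a deliberately simplified form, as a filter over candidate word sequences. The sketch assumes a lexicon mapping each word to a single part-of-speech tag and a set of permitted tag sequences standing in for the syntax; both structures are illustrative assumptions rather than the representation used by the invention.

```python
def filter_hypotheses(hypotheses, lexicon, allowed_patterns):
    """Keep only word sequences that pass a lexical and a syntactic check.

    lexicon: dict mapping word -> part-of-speech tag (hypothetical format).
    allowed_patterns: set of tag tuples regarded as syntactically possible.
    """
    kept = []
    for words in hypotheses:
        # Lexical check: every word must exist in the lexicon.
        if not all(word in lexicon for word in words):
            continue
        # Syntactic check: the tag sequence must be a permitted pattern.
        pattern = tuple(lexicon[word] for word in words)
        if pattern in allowed_patterns:
            kept.append(words)
    return kept
```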

In addition, the method according to the present invention can use sentence accent information in the interpretation of the speech input.

The invention also provides a speech-to-speech conversion system which provides, at its output, spoken responses to speech inputs in at least two natural languages, including speech recognition means for the speech inputs, interpretation means for interpreting the content of the recognized speech inputs, and a database containing speech information data for use in the formulation of said spoken responses, characterized in that the speech information data stored in the database is in said at least two natural languages, in that the speech recognition and interpretation means are adapted to recognize and interpret speech inputs in said at least two natural languages, and in that the system further includes evaluation means for evaluating the recognized speech inputs and determining the language of the inputs; dialogue management means for effecting a dialogue with the database to obtain said speech information data in the language of the speech input; and conversion means for converting the speech information data obtained from the database into a spoken response.

The speech-to-speech conversion system according to the present invention, which is adapted to receive speech inputs in two or more natural languages and to provide, at its output, spoken responses in the language of the respective speech input, preferably includes, for each of the natural languages, speech recognition means, the inputs of each of the speech recognition means being connected to a common input of the system; speech evaluation means for determining, in dependence on the output of each of the speech recognition means, the language of a speech input; a database containing speech information data for use in the formulation of spoken responses in the language of the database; dialogue management means for connection to a respective speech recognition means, in dependence on the language of the speech input, said management means being adapted to interpret the content of the recognized speech and, on the basis of the interpretation, to access and obtain speech information data from at least a respective one of the databases; and text-to-speech conversion means for converting the speech information data obtained by said management means into spoken responses to the respective speech inputs.

The speech-to-speech conversion system may include separate databases for each of said at least two languages, and separate dialogue management means for each of the databases, each dialogue management means being adapted to effect a dialogue with at least a respective one of the databases. Likewise, each dialogue management means may be adapted to effect a dialogue with each of the databases. In this case, the system includes translation means for translating the speech information data output from the respective database into the languages of the other databases.

In the event that at least part of the speech information data required for a spoken response is stored in a database in a language other than that required for the spoken response, the speech information data may be obtained from said database and translated by said translation means into the language required for the spoken response. The translated speech information is then used, either alone or in combination with other speech information, by the dialogue management means to provide an output for application to the text-to-speech conversion means.
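The fallback described above, in which data held in another language's database is fetched and translated, might be sketched as follows. The databases are modelled as plain dictionaries and `translate` is a caller-supplied function; both are assumptions made for illustration, since the document does not specify the translation means.

```python
def fetch_response_data(topic, language, databases, translate):
    """Return speech information data for `topic` in `language`.

    databases: dict mapping language label -> dict of topic -> text.
    translate: hypothetical callable (text, source_lang, target_lang) -> text.
    """
    own = databases[language].get(topic)
    if own is not None:
        return own  # data is already in the language of the speech input
    for other_language, database in databases.items():
        if other_language == language:
            continue
        found = database.get(topic)
        if found is not None:
            # Data exists only in another language: translate before use.
            return translate(found, other_language, language)
    return None
```

In a fuller sketch the translated data could also be combined with data from the input-language database before being passed on for text-to-speech conversion.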

The speech-to-speech conversion system is preferably adapted to receive speech inputs in two languages, in which case the system includes, for each of the two languages, a database, dialogue management means and translation means, each of the dialogue management means being adapted to communicate with each of the databases, and the data output of each of the databases being connected directly to one of the dialogue management means and to the other of the dialogue management means via a translation means.

The speech-to-speech conversion system preferably includes speech recognition and translation means for each of said at least two natural languages, the inputs of the speech recognition and interpretation means being connected to a common input. The recognized parts, or sequences, of the speech input resulting from said speech recognition and interpretation in said at least two natural languages are evaluated by the evaluation means to determine the language of the speech input. In a preferred system, the evaluation means can be used to select the database from which said speech information data will be obtained by said dialogue management means for the formulation of the spoken response to the speech input.

The speech recognition and interpretation means may include extraction means for extracting prosody information from the speech input, and means for obtaining dialect information from said prosody information, said dialect information being used by said text-to-speech conversion means in the conversion of said speech information data into the spoken response, the dialect of the spoken response being matched to that of the speech input.

The prosody information extracted from the speech input is the fundamental tone curve of the speech input.

The means for obtaining dialect information from said prosody information may include first analysis means for determining the intonation pattern of the fundamental tone of the speech input, and thereby the maximum and minimum values of the fundamental tone curve and their respective positions; second analysis means for determining the intonation pattern of the fundamental tone curve of the speech model, and thereby the maximum and minimum values of the fundamental tone curve and their respective positions; and comparison means for comparing the intonation pattern of the speech input with the intonation pattern of the speech model to identify a time difference between the occurrence of the maximum and minimum values of the fundamental tone curve of the speech input and the maximum and minimum values of the fundamental tone curve of the speech model, the identified time difference being indicative of the dialectal characteristics of the speech input. The time difference may be determined in relation to a reference point in the intonation pattern, i.e. the point at which a consonant/vowel boundary occurs.

The speech-to-speech conversion system may also include means for obtaining sentence accent information from said prosody information.

The speech recognition means may include checking means for lexically checking the words in the speech model and for syntactically checking the phrases in the speech model, the words and phrases that are not linguistically possible being excluded from the speech model. The checking means may be adapted to check the orthography and phonetic transcription of the words in the speech model, in which case the transcription information includes lexically abstracted accent information, of the stressed-syllable type, and information relating to the location of secondary accent. The accent information may relate to tonal word accent I and accent II.

The sentence accent information can be used in the interpretation of the content of the recognized speech input.

The sentence stresses can be determined and used in the interpretation of the content of the recognized speech input.

The invention further provides a voice response communication system that includes a speech-to-speech conversion system as outlined in the preceding paragraphs, or that utilizes a method as outlined in the preceding paragraphs to provide a spoken response to a speech input to the system.

The foregoing and other features of the present invention will be better understood from the following description, with reference to the single figure of the accompanying drawings, which illustrates, in the form of a block diagram, a speech-to-speech conversion system according to the present invention.

The speech-to-speech conversion system according to the present invention is adapted to provide, at its output, spoken responses to speech inputs in at least two natural languages. The language characteristics of the spoken responses, for example dialect, sentence accent and sentence stress, are matched by the present invention to those of the speech input, in order to provide natural output speech that is easily understood and has correct accentuation, giving rise to a user-friendly system. It will be apparent from the following description that the matching of the language characteristics is achieved by extracting prosody information from the speech input, i.e. the fundamental tone curve of the speech input, and using the prosody information to determine dialect, sentence accent and sentence stress information for use in the formulation of the spoken responses.

The speech-to-speech conversion system can therefore be used in many applications, for example in voice response communication systems, to effect a dialogue between a user of the system and a database which forms part of the system's speech recognition unit and which contains speech information data for the formulation of spoken responses to spoken questions or requests from users of the system. Such voice response communication systems can be used in telecommunications, banking, security systems and so on, to provide an easily understood, user-friendly system.

The speech-to-speech conversion system illustrated in the single figure of the accompanying drawings is adapted to provide, at its output, spoken responses to speech inputs in two natural languages, i.e. languages A and B, which may be any natural languages, e.g. Swedish and English.

Key to Fig. 1:

A = Speech recognition, Language A.

B = Speech recognition, Language B.

C = Language A, Lexicon + Syntax.

D = Language B, Lexicon + Syntax.

E = Text-to-speech, Language A.

F = Text-to-speech, Language B.

G = Evaluation, Language A or Language B.

H = Dialogue management + Database access, Language A.

I = Database, Language A.

J = Dialogue management + Database access, Language B.

K = Database, Language B.

L = Transl. Language

M = Language B

N = Transl. Language

O = Language B

P = SQL

Q = Language A

R = Language B

As shown in the accompanying figure, the system includes speech recognition and interpretation units 1 and 2 for languages A and B respectively. The inputs of units 1 and 2 are connected to a common input of the system. The speech recognition and interpretation units 1 and 2 are used to recognize and interpret the content of the speech input in a manner outlined later.

An output of each of units 1 and 2 is connected to a separate input of an evaluation unit 3, which is adapted to evaluate the recognized speech inputs and determine the language of the inputs, i.e. language A or language B.

The system of the present invention also includes two switching units 4 and 5, the respective inputs of which are connected to an output of the speech recognition and interpretation units 1 and 2. The operation of the switching units 4 and 5 is controlled, in a manner outlined later, by the evaluation unit 3, i.e. the control inputs of units 4 and 5 are connected to separate outputs of the evaluation unit 3.

The outputs of the switching units 4 and 5 are respectively connected to an input of the dialogue manager units 6 and 7.

It will be apparent later in the description that the dialogue manager units 6 and 7 are used to effect a dialogue with the database units 8 and 9 to obtain speech information data in the language of the input speech, for use in formulating the spoken responses.

A lexicon and syntax unit 10 for language A is connected to another output of the speech recognition and interpretation unit 1, to the dialogue manager unit 6 and to an input of a text-to-speech converter unit 12.

A lexicon and syntax unit 11 for language B is connected to another output of the speech recognition and interpretation unit 2, to the dialogue manager unit 7 and to an input of a text-to-speech converter unit 13.

The text-to-speech converter units 12 and 13 are also respectively connected, at another input thereof, to an output of the dialogue manager units 6 and 7.

The outputs of the text-to-speech converter units 12 and 13 are connected to a common speech output of the system.

As shown in the accompanying figure, there is two-way communication between the dialogue manager unit 6 and the database unit 8, and between the dialogue manager unit 7 and the database unit 9. These communication paths are used to effect, in a manner outlined later, a dialogue between the respective manager and database units to obtain speech information data for use in formulating the spoken responses. The two-way communication paths are interconnected so that a dialogue can also be carried out between manager unit 6 and database unit 9 and/or between manager unit 7 and database unit 8. In practice, the dialogue with a database unit, and/or between database units, is effected using a database communication language such as SQL (Structured Query Language).

A translation unit 14 is provided for translating language A into language B and vice versa. It can be seen from the accompanying figure that a section 14a of the translation unit 14 has an input for language B, which is connected to an output of the database unit 9, and an output for language A, which is connected to an input of the dialogue manager unit 6. Another section 14b of the translation unit 14 has an input for language A, which is connected to an output of the database unit 8, and an output for language B, which is connected to an input of the dialogue manager unit 7.
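As an illustration of a dialogue manager querying a database unit in SQL, the following minimal sketch uses Python's standard `sqlite3` module with an in-memory database. The table name `speech_info`, its columns, the sample row and the helper names are all hypothetical; the patent does not specify a schema.

```python
import sqlite3


def make_demo_db():
    """Build a small in-memory stand-in for a database unit (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE speech_info (topic TEXT PRIMARY KEY, answer TEXT)")
    conn.execute(
        "INSERT INTO speech_info VALUES (?, ?)",
        ("opening_hours", "We open at nine."),
    )
    return conn


def lookup(conn, topic):
    """Return the stored answer for a topic, or None if the database has no entry."""
    row = conn.execute(
        "SELECT answer FROM speech_info WHERE topic = ?", (topic,)
    ).fetchone()
    return row[0] if row else None
```

A `None` result corresponds to the case in which the required data is not held in the queried database unit.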

The manner in which the speech-to-speech conversion system is adapted to receive speech inputs in natural languages A and B, and to provide, at its output, spoken responses in the language of the respective speech input, is outlined in the following paragraphs.

A speech input to the speech-to-speech conversion system, which may be in either language A or language B, is recognized and interpreted by each of the speech recognition and interpretation units 1 and 2, in association with the respective lexicon and syntax units 10 and 11, i.e. using statistically based speech recognition and language modelling techniques, and ensuring that the recognized words and/or word combinations used to form a model of the input speech are acceptable both lexically and syntactically. The purpose of the lexicon/syntax check is to identify and exclude from the speech model any word that does not exist in the language concerned, and/or any phrase whose syntax does not conform to the language concerned.
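The lexicon/syntax check described above can be pictured as a filter over candidate word sequences. The sketch below is only an illustration: the function name, the set-based lexicon and the caller-supplied syntax predicate are assumptions, not details given in the patent.

```python
def filter_hypotheses(hypotheses, lexicon, is_syntactically_valid):
    """Keep only word sequences that pass both the lexical and the syntactic check.

    hypotheses: list of candidate word sequences (lists of strings)
    lexicon: set of words that exist in the language concerned
    is_syntactically_valid: callable taking a word sequence, returning True or False
    """
    accepted = []
    for words in hypotheses:
        in_lexicon = all(word in lexicon for word in words)
        if in_lexicon and is_syntactically_valid(words):
            accepted.append(words)
    return accepted
```

In a real system the syntax predicate would be backed by a grammar for the language; here it is left to the caller so that the sketch stays self-contained.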

The respective language models created by units 1 and 10, and units 2 and 11, are applied to, and evaluated by, the evaluation unit 3, which determines which of languages A and B is the more likely for the input speech. This evaluation is effected on the basis of probability, i.e. the probability that the speech input is in one or the other of languages A and B, the differences between the language models, and whether the language modelling for one or the other of the languages has been completed successfully. The greater the difference between the language characteristics of languages A and B, the easier the task of the evaluation unit 3 will be.
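One way to picture this evaluation is as a comparison of per-language scores, restricted to the languages whose modelling completed. The sketch below is an assumption-laden illustration: the `RecognitionResult` structure, the use of a log-likelihood score and the function name are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class RecognitionResult:
    """Output of one recognition and interpretation unit (hypothetical structure)."""
    language: str          # "A" or "B"
    log_likelihood: float  # score of the language model for the input speech
    completed: bool        # whether language modelling finished successfully


def evaluate_language(result_a, result_b):
    """Return the more likely language, or None if neither model was completed."""
    candidates = [r for r in (result_a, result_b) if r.completed]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.log_likelihood).language
```

The returned language would then decide which of the two switching units is activated.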

Depending on the outcome of the evaluation by unit 3, i.e. the language selected for the input speech, one of the switching units 4 and 5 will be activated to connect the speech recognition and interpretation unit for the selected language to the corresponding dialogue manager unit.

If it is assumed, for the purposes of description, that language A has been selected as the most likely language of the input speech, then switching unit 4 will be activated and the output of the speech recognition and interpretation unit 1 will be connected to an input of the dialogue manager unit 6. Switching unit 5 will thus remain in its inactive state, and no connection will therefore be made between the dialogue manager unit 7 and the speech recognition and interpretation unit 2.

In the next stage of the speech-to-speech conversion process, the dialogue manager unit 6 enters into a linguistic dialogue with the database unit 8, based on the speech model of the input speech, to obtain speech information data for the formulation of a spoken response to the speech input. The speech information data selected as a result of this dialogue is transferred via the manager unit 6 to an input of the text-to-speech converter unit 12 for the formulation of a spoken response. It will be apparent from the later description that the language characteristics of the spoken response are matched, as far as possible, to the language characteristics of the input speech.

In the event that at least part of the speech information data required for a spoken response is not stored in database unit 8, but may be stored in database unit 9, the dialogue manager unit 6 enters into a dialogue with database unit 9 to obtain the required speech information data. If the required speech information data is stored in database unit 9, it is accessed and transferred to the dialogue manager unit 6 via section 14a of the translation unit 14, i.e. it is translated from language B into language A. The translated speech information data is then used, either alone or in combination with speech information data obtained from database unit 8, to formulate a spoken response, i.e. it is converted by the text-to-speech converter unit 12 into the spoken response.
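The fallback to the other language's database can be sketched as follows. Plain dictionaries stand in for database units 8 and 9, and the `translate` callable stands in for section 14a of the translation unit; these stand-ins and all names are assumptions made for illustration only.

```python
def fetch_response_data(keys, own_db, other_db, translate):
    """Collect response data for `keys`, preferring the database of the input language.

    own_db: mapping for the selected language (e.g. database unit 8 for language A)
    other_db: mapping for the other language (e.g. database unit 9 for language B)
    translate: callable converting other-language data into the selected language
    """
    data = {}
    for key in keys:
        if key in own_db:
            data[key] = own_db[key]
        elif key in other_db:
            # Data from the other database passes through the translation unit.
            data[key] = translate(other_db[key])
        # Keys found in neither database are simply left out of the result.
    return data
```

The combined result corresponds to using translated data "either alone or in combination with" data from the language's own database.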

It is evident that if language B, rather than language A, is selected by the evaluation unit 3 as the language of the input speech, then units 7, 9 and 13 will be used, in the same way as outlined above for units 6, 8 and 12, for the formulation of the spoken response. Any information that may be required from database unit 8 will be accessed by and transferred to the dialogue manager unit 7, and translation of the transferred information data is effected by section 14b of the translation unit 14.

The recognition and interpretation of speech can give rise to technical problems and, if these problems are not overcome, difficulties will be experienced in obtaining a correct and meaningful interpretation of the speech inputs. In particular, if the recognition and interpretation of the speech inputs are incorrect, it will be extremely difficult for the evaluation unit 3 to determine the language of the speech inputs, and it will therefore not be possible to provide correct responses to them.

These problems are overcome, in accordance with the present invention, by extracting prosody information from the speech inputs and using this information to determine, in a manner outlined later, dialect, sentence accent and sentence stress information for use in the recognition and interpretation process and in the formulation of the spoken responses.

The extraction of the prosody information, i.e. the fundamental tone curve, from the input speech is effected by prosody extraction means (not shown) which form part of the speech recognition and interpretation units 1 and 2. These units also include means (not shown) for obtaining dialect information from the prosody information.

Thus, with the present invention, the speech recognition and interpretation units 1 and 2 are adapted to operate, in a manner well known to persons skilled in the art, to recognize and interpret speech inputs to the system. The speech recognition and interpretation units 1 and 2 may, for example, operate by using a Hidden Markov model, or an equivalent model. Essentially, the function of units 1 and 2 is to convert speech inputs to the system into a form which is a faithful representation of the content of the speech inputs, and which is suitable for evaluation by the evaluation unit 3 and for use by the dialogue manager units 6 and 7. In other words, the content of the text information data at the output of each of the speech recognition and interpretation units 1 and 2 must be:

- an exact representation of the input speech; and

- usable by the dialogue manager units 6 and 7 to access and extract speech information data from the database units 8 and 9 respectively, for use in the formulation of a synthesized spoken response, i.e. by the respective one of the text-to-speech converter units 12 and 13.

In practice, the recognition and interpretation process is essentially effected by identifying a number of phonemes from a segment of the input speech, which are combined into allophone strings, the phonemes being interpreted as possible words, or word combinations, to establish a model of the speech.

The established speech model will have word and sentence accents according to a standardized pattern for the language of the input speech.

The information concerning the recognized words and word combinations generated by the speech recognition and interpretation units 1 and 2 is checked, in the manner outlined above, both lexically and syntactically. In practice, this is effected using a lexicon with orthography and transcription.

Thus, according to the present invention, the speech recognition and interpretation units 1 and 2 ensure that only those words and word combinations that are found to be acceptable both lexically and syntactically are used to create a model of the input speech. In practice, the intonation pattern of the speech model is a standardized intonation pattern for the language concerned, or an intonation pattern that has been established by training, or simply by knowledge, using a number of dialects of the language concerned.

As mentioned above, the prosody information, i.e. the fundamental tone curve, extracted from the input speech by the prosody extraction means, can be used to obtain dialect, sentence accent and sentence stress information for use by the speech-to-speech conversion system and method of the present invention. In particular, the dialect information can be used by the speech-to-speech conversion system and method to match the dialect of the output speech to that of the input speech, and the sentence accent and stress information can be used in the recognition and interpretation of the input speech.

According to the present invention, the means for obtaining dialect information from the prosody information include:

- first analysis means for determining the intonation pattern of the fundamental tone of the input speech, and thereby the maximum and minimum values of the fundamental tone curve and their respective positions;

- second analysis means for determining the intonation pattern of the fundamental tone curve of the speech model, and thereby the maximum and minimum values of the fundamental tone curve and their respective positions; and

- comparison means for comparing the intonation pattern of the input speech with the intonation pattern of the speech model to identify a time difference between the occurrence of the maximum and minimum values of the fundamental tone curve of the input speech and the maximum and minimum values of the fundamental tone curve of the speech model, the identified time difference being indicative of the dialect characteristics of the input speech.
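The three means listed above can be sketched in a few lines. This is a minimal illustration, assuming that each fundamental tone curve is available as equally indexed lists of sample times (in milliseconds) and F0 values, and that extrema of the same kind can be paired in order of occurrence; the function names and the pairing rule are assumptions, not the patent's method.

```python
def extrema(times, f0):
    """Return (time, value, kind) for each local maximum or minimum of an F0 curve."""
    points = []
    for i in range(1, len(f0) - 1):
        if f0[i] > f0[i - 1] and f0[i] > f0[i + 1]:
            points.append((times[i], f0[i], "max"))
        elif f0[i] < f0[i - 1] and f0[i] < f0[i + 1]:
            points.append((times[i], f0[i], "min"))
    return points


def timing_differences(input_extrema, model_extrema):
    """Pair extrema of the same kind in order; return input-minus-model time offsets."""
    diffs = []
    for kind in ("max", "min"):
        input_times = [t for t, _, k in input_extrema if k == kind]
        model_times = [t for t, _, k in model_extrema if k == kind]
        diffs.extend(ti - tm for ti, tm in zip(input_times, model_times))
    return diffs
```

A consistently positive or negative offset would then be read as the dialect indicator; a real F0 curve would first need smoothing, which this sketch omits.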

The time difference referred to above can be determined relative to a reference point in the intonation pattern.

For the Swedish language, the difference, in terms of intonation pattern, between different dialects can be described by different points in time for word and sentence accents, i.e. the time difference can be determined relative to a point in the intonation pattern, for example the point at which a consonant/vowel boundary occurs.

Thus, in a preferred arrangement of the present invention, the reference against which the time difference is measured is the point at which the consonant/vowel boundary, i.e. the C/V boundary, occurs.
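Measuring against the C/V boundary amounts to comparing two offsets. In the sketch below, all times are in milliseconds and the function and parameter names are hypothetical; it only illustrates the arithmetic implied by the description.

```python
def offset_from_boundary(extremum_ms, cv_boundary_ms):
    """Time of an F0 extremum measured from the consonant/vowel boundary."""
    return extremum_ms - cv_boundary_ms


def dialect_shift(input_peak_ms, input_cv_ms, model_peak_ms, model_cv_ms):
    """Difference between input and model offsets, each relative to its own C/V boundary."""
    input_offset = offset_from_boundary(input_peak_ms, input_cv_ms)
    model_offset = offset_from_boundary(model_peak_ms, model_cv_ms)
    return input_offset - model_offset
```

Using a boundary-relative offset makes the comparison independent of where the word happens to start in each recording.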

The identified time difference which, as mentioned above, is indicative of the dialect of the input speech, i.e. the spoken language, is applied to the text-to-speech converter units 12 and 13 to enable the intonation pattern, and thereby the dialect, of the speech output of the system to be corrected so that it corresponds to the intonation pattern of the corresponding words and/or phrases of the input speech.

This correction process thus enables the dialect information of the input speech to be incorporated into the output speech.
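A very simple reading of this correction is a uniform shift of the pitch targets used for synthesis. The sketch below assumes the output intonation is represented as a list of (time in ms, F0 in Hz) targets, which is a hypothetical representation; the patent does not say how the converter units apply the time difference.

```python
def shift_targets(targets, shift_ms):
    """Move each (time_ms, f0_hz) pitch target by the measured shift, clamping at zero."""
    return [(max(0, time_ms + shift_ms), f0_hz) for time_ms, f0_hz in targets]
```

Clamping at zero is a design choice of this sketch so that a large negative shift cannot place a target before the start of the utterance.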

As mentioned above, the fundamental tone curve of the speech model is based on information resulting from the lexical (orthography and transcription) and syntactic checks. In addition, the transcription information includes lexically abstracted accent information of the stressed-syllable type, i.e. tonal word accents I and II, and information on the location of secondary accents, i.e. information of the kind given in dictionaries.

This information can be used to adjust the recognition pattern of the speech recognition model, for example the Hidden Markov model, to take the transcription information into account. A more accurate model of the input speech is therefore obtained during the interpretation process.

A further consequence of this speech model correction process is that, over time, the speech model will acquire an intonation pattern that has been established through a learning process.

Likewise, with the system and method of the present invention, the speech model is compared with a spoken input sequence, and any difference between them can be determined and used to bring the speech model into agreement with the spoken sequence and/or to determine the stresses in the spoken sequence.

Furthermore, identifying the stresses in a spoken sequence makes it possible to determine the exact meaning of the spoken sequence unambiguously. In particular, relative sentence stresses can be determined by classifying the ratio between the variation and the declination of the fundamental tone curve, whereby stressed sections, or individual words, can be determined. Furthermore, the pitch of the speech can be determined from the declination of the fundamental tone curve.

Thus, in order to take sentence stresses into account in the recognition and interpretation of the speech inputs to the speech-to-speech conversion system of the present invention, the prosody extraction means and the associated speech recognition and interpretation unit, for each of the languages A and B, are adapted to determine:
- a first ratio between the variation and the declination of the fundamental tone curve of the input speech;
- a second ratio between the variation and the declination of the fundamental tone curve of the speech model; and
- a comparison between the first and second ratios, where each identified difference is used to determine sentence accent placements.
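The three determinations listed above can be sketched in Python as follows. The patent does not define how variation and declination are computed, so everything here is an assumption for illustration: declination is taken as a least-squares line through the whole contour, variation as the largest deviation from that line within a word segment, and a segment is flagged when the input ratio exceeds the model ratio by a hypothetical threshold.

```python
from typing import List, Sequence, Tuple


def fit_line(ts: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares line (slope, intercept), used here as the declination."""
    n = len(ts)
    mt, my = sum(ts) / n, sum(ys) / n
    sxx = sum((t - mt) ** 2 for t in ts)
    slope = sum((t - mt) * (y - my) for t, y in zip(ts, ys)) / sxx
    return slope, my - slope * mt


def segment_ratios(ts: Sequence[float], f0: Sequence[float],
                   segments: Sequence[Tuple[float, float]]) -> List[float]:
    """Ratio of variation to declination for each [start, end) segment.
    Every segment is assumed to contain at least one sample."""
    slope, icpt = fit_line(ts, f0)
    ratios = []
    for start, end in segments:
        idx = [i for i, t in enumerate(ts) if start <= t < end]
        variation = max(abs(f0[i] - (slope * ts[i] + icpt)) for i in idx)
        declination = sum(slope * ts[i] + icpt for i in idx) / len(idx)
        ratios.append(variation / declination)
    return ratios


def accent_placements(input_ratios: Sequence[float],
                      model_ratios: Sequence[float],
                      threshold: float = 0.1) -> List[int]:
    """Indices of segments whose input ratio exceeds the model ratio by
    more than the (assumed) threshold."""
    return [i for i, (a, b) in enumerate(zip(input_ratios, model_ratios))
            if a - b > threshold]
```

With a model contour that falls steadily and an input contour that rises sharply in one word, only the segment containing that word is returned as a sentence accent placement.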

Furthermore, classifying the ratio between the variation and the declination of the fundamental tone curve makes it possible to identify or determine relative sentence stresses, and stressed sections or words. The ratio between the variation and the declination of the fundamental tone curve can also be used to determine the dynamics of the fundamental tone curve.

The information obtained from the fundamental tone curve regarding dialect, sentence accent and stress can be used by units 1 and 2 for the interpretation of the input speech, i.e. the information can be used, in the manner outlined above, to obtain a better understanding of the content of the input speech and to bring the intonation pattern of the speech model into agreement with the input speech.

Since the corrected speech model exhibits the language characteristics (including dialect information, sentence accent and stress) of the input speech, the model can be used to give a better understanding of the input speech and to increase the probability that the evaluation unit 3 will select the correct language for the speech inputs. The corrected speech model can also be used by the database management units 6 and 7 to obtain the required speech information data from the database units 8 and 9 for the formulation of a response to a voice input to the speech-to-speech conversion system.
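As a rough illustration of how the evaluation unit and the database management units might cooperate, the following Python sketch selects the language whose speech model scored best and then queries that language's database with SQL, the database communication language named in the claims. The scores, table layout, stored phrases and function names are all hypothetical and are not taken from the patent.

```python
import sqlite3
from typing import Dict, Optional


def select_language(model_scores: Dict[str, float]) -> str:
    """Pick the language whose speech model scored highest
    (assumed here to mean 'most successful')."""
    return max(model_scores, key=model_scores.get)


def build_databases() -> Dict[str, sqlite3.Connection]:
    """Create one in-memory database per language with made-up content."""
    sample = {"A": [("greeting", "Hej")], "B": [("greeting", "Hello")]}
    dbs = {}
    for lang, rows in sample.items():
        conn = sqlite3.connect(":memory:")
        conn.execute(
            "CREATE TABLE speech_info (topic TEXT PRIMARY KEY, content TEXT)")
        conn.executemany("INSERT INTO speech_info VALUES (?, ?)", rows)
        dbs[lang] = conn
    return dbs


def fetch_response_text(dbs: Dict[str, sqlite3.Connection],
                        lang: str, topic: str) -> Optional[str]:
    """Query the database of the selected language for response data."""
    row = dbs[lang].execute(
        "SELECT content FROM speech_info WHERE topic = ?", (topic,)
    ).fetchone()
    return row[0] if row else None
```

In this sketch the text returned by `fetch_response_text` would then be handed to the text-to-speech converter for the selected language.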

The ability to readily interpret different dialects of a language using information from the fundamental tone curve is of some importance, since such interpretations can be effected without having to train the speech recognition system. As a result, the size, and thereby the cost, of a speech recognition system implemented in accordance with the present invention can be much smaller than would be possible with known systems. Such systems therefore have clear advantages over known speech recognition systems.

The system is therefore adapted to recognize and accurately interpret the content of speech input in two or more natural languages and to match the language characteristics, e.g. dialect, of the voice responses with those of the voice inputs. This process provides a user-friendly system because the language of the human/machine dialogue is in agreement with the dialect of the user concerned.

The present invention is not limited to the embodiments outlined above, but may be modified within the scope of the appended claims and the inventive concept.

Claims

A method, in a voice response communication system, for providing a spoken response to a speech input, said method including the steps of recognizing and interpreting the input speech, and utilizing the interpretation to obtain speech information data from a database for use in the formulation of the spoken response, characterized in that the database contains speech information data in at least two natural languages, in that said method is adapted to recognize and interpret speech inputs in said at least two languages, using statistics-based speech recognition and language modeling techniques to form a lexically and syntactically acceptable speech model for the language in question, and to provide spoken responses to speech inputs in said languages, and in that said method includes the further steps of evaluating a recognized speech input to determine the language of the input by comparing the speech models for the languages in question and selecting the language whose speech model is most successful, effecting a dialogue with the database to obtain speech information data for the formulation of a spoken response in the language of the input speech, and converting the speech information data obtained from the database into said spoken response.

A method according to claim 1, characterized in that separate databases are used for each of said at least two languages.

A method according to claim 2, characterized in that said dialogue is effected only with the one of said databases that contains speech information data in the language of the input speech.

A method according to claim 2, characterized in that said dialogue is effected with the one of said databases that contains speech information data in the language of the input speech, and in that, in the event that at least a part of the required speech information data for a spoken response is stored in another of said databases, said method includes the further steps of effecting a dialogue with said other database to obtain the required speech information data, translating that speech information data into the language of said one of the databases, combining the speech information data from the databases, and converting the combined speech information data into a spoken response in the language of the input speech.

A method according to any one of the preceding claims, characterized in that the outcome of the evaluation process is used to determine the database with which said dialogue is conducted to obtain speech information data for a spoken response to the input speech.

A method according to any one of the preceding claims, characterized in that the dialogue with a database, and/or between databases, is effected using a database communication language, such as e.g. SQL (Structured Query Language).

A method according to any one of the preceding claims, characterized in that said speech recognition and interpretation includes the steps of extracting prosody information from a speech input, and obtaining dialect information from said prosody information, said dialect information being used in the conversion of said speech information data, obtained from said database, into a spoken response, the spoken response being in the same language and dialect as the input speech.

A method according to claim 7, characterized in that the prosody information extracted from the speech input is the fundamental tone curve of the input speech.

A method according to claim 8, characterized by the steps of determining the intonation pattern of the fundamental tone curve of the input speech, and thereby the maximum and minimum values of the fundamental tone curve and their respective positions; determining the intonation pattern of the fundamental tone curve of a speech model, and thereby the maximum and minimum values of the fundamental tone curve and their respective positions; and comparing the intonation pattern of the input speech with the intonation pattern of the speech model to identify a time difference between the occurrence of the maximum and minimum values of the fundamental tone curve of the input speech and the maximum and minimum values of the fundamental tone curve of the speech model, the identified time difference being indicative of dialect characteristics of the input speech.

A method according to claim 9, characterized in that the time difference is determined in relation to a reference point in the intonation pattern.

A method according to claim 10, characterized in that the reference point in the intonation pattern, against which the time difference is measured, is the point at which a consonant/vowel boundary occurs.

A method according to any one of claims 7-11, characterized by the step of obtaining information about sentence accents from said prosody information.

A method according to claim 12, characterized in that the words in the speech model are checked lexically, in that the phrases in the speech model are checked syntactically, in that the words and phrases that are not linguistically possible are excluded from the speech model, in that the orthography and phonetic transcription of the words in the speech model are checked, and in that the transcription information includes lexically abstracted accent information, of the stressed-syllable type, and information regarding the placement of secondary accents.

A method according to claim 13, characterized in that the accent information refers to tonal word accent I and accent II.

A method according to any one of claims 12-14, characterized by the step of using said sentence accent information in the interpretation of the input speech.

A voice response communication system utilizing a method according to any preceding claim to provide a voice response to a voice input in the system.

A speech-to-speech conversion system for providing, at the output thereof, spoken responses to speech inputs in at least two natural languages, including speech recognition means for recognizing the speech inputs; interpretation means for interpreting the content of the recognized speech inputs; and a database containing speech information data for use in the formulation of said spoken responses, characterized in that the speech information data stored in the database is in said at least two natural languages, in that the speech recognition and interpretation means are adapted to recognize and interpret speech inputs in the at least two natural languages using statistics-based speech recognition and language modeling techniques to form a lexically and syntactically acceptable speech model for the language in question, and in that the system further includes evaluation means for evaluating the recognized speech inputs and determining the language of the inputs by comparing the speech models for the languages in question and selecting the language whose speech model is most successful, dialogue management means for effecting a dialogue with the database to obtain said speech information data in the language of the input speech, and text-to-speech conversion means for converting the speech information data obtained from the database into a spoken response.

A speech-to-speech conversion system according to claim 17, characterized in that the system is adapted to receive speech inputs in two, or more, natural languages and to provide, at the output thereof, spoken responses to the respective speech inputs in the language of the respective input, and in that the system includes, for each of the natural languages, speech recognition means, where the inputs of each of the speech recognition means are connected to a common input for the system; speech evaluation means for determining, depending on the output of each of the speech recognition means, the language of a speech input; a database containing speech information data to be used in formulating spoken responses in the language of the database; dialogue management means for connection to a respective speech recognition means, depending on the language of the input speech, said dialogue management means being adapted to interpret the content of the recognized speech and, on the basis of the interpretation, to access and obtain speech information data from at least one of the databases; and text-to-speech conversion means for converting speech information data obtained by said dialogue management means into spoken responses to the respective speech inputs.

A speech-to-speech conversion system according to claim 17, characterized in that the system includes separate databases for each of said at least two languages.

A speech-to-speech conversion system according to claim 19, characterized in that the system includes separate dialogue management means for each of the databases, each dialogue management means being adapted to effect a dialogue with at least one of the databases, respectively.

A speech-to-speech conversion system according to claim 20, characterized in that each dialogue management means is adapted to effect a dialogue with each of the databases.

A speech-to-speech conversion system according to claim 21, characterized in that the system includes translation means for translating the speech information data output from each of the databases into the language or languages of the other databases.

A speech-to-speech conversion system according to claim 22, characterized in that, in the event that at least a portion of the required speech information data for a spoken response is stored in a database in a language other than that required for the spoken response, said information is obtained from said database and translated by said translation means into the required language of the spoken response, and in that the translated speech information is used, either alone or in combination with other speech information, by the dialogue management means to provide an output for application to the text-to-speech conversion means.

A speech-to-speech conversion system according to claim 23, for speech inputs in two languages, characterized in that the system includes, for each of the two languages, a database, dialogue management means and translation means, in that each of the dialogue management means is adapted to communicate with each of the databases, and in that the data output of each of the databases is connected directly to one of the dialogue management means and to the other of the dialogue management means via a translation means.

A speech-to-speech conversion system according to any one of claims 17-24, characterized in that the system includes speech recognition and interpretation means for each of said at least two natural languages, wherein the inputs to the speech recognition and interpretation means are connected to a common system input.

A speech-to-speech conversion system according to any one of claims 17-25, characterized in that the output from the evaluation means is used to select the database from which said speech information data will be obtained by the dialogue management means for the formulation of the spoken response to a speech input.

A speech-to-speech conversion system according to any one of claims 17-26, characterized in that the dialogue with a database, and/or between databases, is effected using a database communication language, such as e.g. SQL (Structured Query Language).

A speech-to-speech conversion system according to any one of claims 17-27, characterized in that said speech recognition and interpretation means include extraction means for extracting prosody information from the input speech, and means for obtaining dialect information from said prosody information, wherein said dialect information is used by said text-to-speech conversion means in the conversion of said speech information data into the spoken response, the dialect of the spoken response being matched with that of the input speech.

A speech-to-speech conversion system according to claim 28, characterized in that the prosody information extracted from the input speech is the fundamental tone curve of the input speech.

A speech-to-speech conversion system according to claim 29, characterized in that the means for obtaining dialect information from said prosody information include first analysis means for determining the intonation pattern of the fundamental tone curve of the input speech, and thereby the maximum and minimum values of the fundamental tone curve and their respective positions; second analysis means for determining the intonation pattern of the fundamental tone curve of the speech model, and thereby the maximum and minimum values of the fundamental tone curve and their respective positions; and comparison means for comparing the intonation pattern of the input speech with the intonation pattern of the speech model to identify a time difference between the occurrence of the maximum and minimum values of the fundamental tone curve of the input speech and the maximum and minimum values of the fundamental tone curve of the speech model, the identified time difference being indicative of dialect characteristics of the input speech.

31. A speech-to-speech conversion system according to claim 30, characterized in that the time difference is determined in relation to a reference point in the intonation pattern.

A speech-to-speech conversion system according to claim 31, characterized in that the reference point in the intonation pattern, against which the time difference is measured, is the point at which a consonant/vowel boundary occurs.

A speech-to-speech conversion system according to any one of claims 28-32, characterized in that the system further includes means for obtaining information about sentence accents from said prosody information.

A speech-to-speech conversion system according to claim 33, characterized in that the speech recognition means include checking means for lexically checking the words in the speech model and for syntactically checking the phrases in the speech model, the words and phrases that are not linguistically possible being excluded from the speech model, in that the checking means are adapted to check the orthography and phonetic transcription of the words in the speech model, and in that the transcription information includes lexically abstracted accent information, of the stressed-syllable type, and information regarding the placement of secondary accents.

A speech-to-speech conversion system according to claim 34, characterized in that the accent information refers to tonal word accent I and accent II.

A speech-to-speech conversion system according to any one of claims 33-35, characterized in that said sentence accent information is used in the interpretation of the content of the recognized input speech.

A speech-to-speech conversion system according to any one of claims 28-36, characterized in that sentence stresses are determined and used in the interpretation of the content of the recognized input speech.

A voice response communication system including a speech-to-speech conversion system according to any one of claims 17-37.