ES2311351A1

ES2311351A1 - Method for dynamically adapting speech recognition acoustic models to the user (Machine-translation by Google Translate, not legally binding)

Info

Publication number: ES2311351A1
Application number: ES200601453A
Authority: ES
Inventors: Tomas Brezmes Llecha; Alvaro Maso Besga
Original assignee: France Telecom Espana SA
Current assignee: Orange Espana SA
Priority date: 2006-05-31
Filing date: 2006-05-31
Publication date: 2009-02-01
Anticipated expiration: 2026-05-31
Also published as: ES2311351B1

Abstract

Method for dynamically adapting speech recognition acoustic models to the user. It improves the efficiency of the speech recognition systems used by teleoperators (or by any service that requires speech recognition over telecommunications networks). It is based on sending the user's speech profile data, obtained in a single prior training session and valid for all subsequent sessions on any speech recognition system compatible with this method, so that the acoustic models are adapted to the speaker. This improves the success rate of such systems and spares the user the need to train multiple platforms.

Description

Method for dynamically adapting speech recognition acoustic models to the user.

Object of the invention

The described method facilitates speech recognition by allowing the user to use speaker-dependent systems after a single training phase, whose result is used in all subsequent sessions on any platform compatible with this method. The speaker trains only once, using any web application and/or remote voice application accessed through the user terminal, in order to create a speech profile containing the basic parameters of the speaker's voice, which are stored in a digital file on that terminal. This file is sent to any compatible speech recognition platform, which uses it to customize its acoustic models dynamically, allowing that speaker's voice to be recognized more reliably.

Background of the invention

Most current speech recognition systems operate by means of a statistical model that determines the conditional probability that a given word produces the observed acoustic sequence. By comparing these probabilities, it is possible to determine which word the user most likely said. This statistical model consists of a series of states and transition probabilities between those states. While the possible states are usually predetermined by the model used, the transition probabilities are usually treated as parameters of the model, and different values of these parameters allow the behavior of the system to be adjusted to different conditions (speaker, noise conditions, etc.). These parameters can be optimized by various methods, the most common being those based on training.
Depending on whether or not specific training is required beforehand, speech recognition systems can be divided into two large groups:

a) Systems that require a specific training phase. These systems require the end user to train the system before using it. They are usually speaker-dependent and have an extensive recognition domain (they recognize a wide variety of words and phrases). To train the system, the user must repeat a series of words and/or phrases so that the system can adjust its parameters. The main drawback of these systems is that the user has to repeat the dedicated training for each speech recognition system used.

b) Systems that do not require a specific training phase. These systems are characterized by being speaker-independent and by having a reduced recognition domain, normally limited to a few hundred words.
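The statistical model described in the background above (states, transition probabilities, and a comparison of per-word conditional probabilities) can be illustrated with a small sketch. This is not taken from the patent: it assumes a discrete hidden Markov model scored with the standard forward algorithm, and the emission probabilities, the two toy word models, and the function names `forward_likelihood` and `recognize` are all hypothetical.

```python
# Toy illustration of word recognition with one statistical model per word.
# ASSUMPTION: discrete hidden Markov models; all numbers are invented.

def forward_likelihood(obs, start, trans, emit):
    """Return P(obs | model) using the forward algorithm.

    obs   : non-empty list of observation symbol indices
    start : start[s] = probability of starting in state s
    trans : trans[p][s] = probability of moving from state p to state s
    emit  : emit[s][o] = probability that state s produces symbol o
    """
    n = len(start)
    # Probability of each state after the first observation.
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    # Propagate through the remaining observations.
    for o in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][o]
            for s in range(n)
        ]
    return sum(alpha)


def recognize(obs, word_models):
    """Pick the word whose model assigns the observation the highest probability."""
    return max(word_models, key=lambda w: forward_likelihood(obs, *word_models[w]))


# Two hypothetical two-state word models over two acoustic symbols (0 and 1).
WORD_MODELS = {
    "yes": ([1.0, 0.0], [[0.5, 0.5], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]]),
    "no": ([1.0, 0.0], [[0.5, 0.5], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]]),
}

print(recognize([0, 0, 1], WORD_MODELS))  # prints "yes"
```

In this sketch, retraining or adapting a model amounts to changing the numbers in `trans` (and, under the assumption made here, `emit`), which is the kind of parameter adjustment the text refers to.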


Description of the invention

The following definitions are used for the various entities that make up the solution of the invention:

a) User terminal: terminal with basic functionality for making voice calls and with the specific technologies needed to store and send the file containing the user's speech profile.

b) Speech recognition platform: platform adapted to this method, which dynamically modifies the acoustic models used for speech recognition based on the information in the received speaker profile, allowing the basic parameters to be customized to increase the reliability of the system.

The present invention is based on modifying the speech recognition platform. It starts from a speaker-independent system that does not require a specific training phase (and is able to recognize only a limited number of words) and evolves it into a speaker-dependent system that only requires the information contained in a file with the user's speech profile, sent over GPRS, UMTS, ADSL or whatever data transmission technology is available at the time, greatly increasing the number of words that can be recognized with a high success rate.

This file will have been obtained beforehand in a single training session performed by the speaker using any web application and/or remote voice application accessed through the user terminal.

When a user accesses the recognition platform through the terminal, the platform receives and stores the user's speech profile file and dynamically adapts the acoustic models it uses to the user's voice profile, reducing the error rate and therefore improving efficiency.

Description of the figures

To complement this description and to aid understanding of the characteristics of the invention, a set of drawings accompanies this specification, in which the following is represented by way of illustration and not limitation:

Figure 1 shows a flowchart of the single training phase of this method.

Figure 2 shows the speech recognition phase on any platform.

Preferred Embodiment of the Invention

As shown in the diagram of Figure 1, the user performs a single training session through a specific application, which produces a digital file containing the parameters of the user's voice model. This file is stored in the user terminal.
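The training phase can be pictured with a minimal sketch. The patent does not specify the file format or which voice parameters the profile contains, so everything here is an assumption made for illustration: the profile is a JSON file, the only stored parameter is the offset between the speaker's average feature vector and a speaker-independent reference average, and the names `build_profile`, `save_profile`, `load_profile` and `mean_offset` are hypothetical.

```python
import json
import os
import tempfile
from statistics import fmean


def build_profile(user_id, training_features, reference_mean):
    """Summarize one training session as a small speech profile.

    training_features : feature vectors (lists of floats) extracted from the
                        phrases the user repeated during training
    reference_mean    : average feature vector of the speaker-independent model
    ASSUMPTION: the profile holds only a per-dimension mean offset.
    """
    dims = len(reference_mean)
    speaker_mean = [fmean(vec[d] for vec in training_features) for d in range(dims)]
    offset = [s - r for s, r in zip(speaker_mean, reference_mean)]
    return {"user_id": user_id, "version": 1, "mean_offset": offset}


def save_profile(profile, path):
    """Store the profile as a digital file (here: JSON) on the user terminal."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(profile, fh)


def load_profile(path):
    """Read a previously stored profile back from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Example: two invented 2-dimensional feature vectors from one training session.
profile = build_profile("user-1", [[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0])
profile_path = os.path.join(tempfile.mkdtemp(), "speech_profile.json")
save_profile(profile, profile_path)
print(load_profile(profile_path)["mean_offset"])
```

Because the file is self-describing and small, the same stored profile could be handed to any platform that understands the format, which is the property the method relies on.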

Subsequently, as shown in Figure 2, when any call is established to a teleoperator (with a platform compatible with this method), the file with the user's profile is sent from the user terminal, manually or automatically, at the request of the teleoperator's speech recognition platform.

The speech recognition platform stores the profile file sent by the user and adapts, dynamically and in real time, the acoustic models used by the platform based on the basic parameters of the speaker's voice model described in the file. From that moment on, the user's speech is recognized in a personalized way.

Thus, the method of the present invention requires a speech recognition platform capable of:

a) Requesting that the user send the file with their voice profile.

b) Receiving and storing the file from the user terminal and using its basic parameters to adapt the acoustic model in real time, improving on speaker adaptation techniques.
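The two platform capabilities listed above can be sketched as a simple request, store and adapt sequence. The patent does not describe the adaptation algorithm, so this sketch swaps in a deliberately simple stand-in: each acoustic unit is represented by a mean feature vector, and adaptation shifts every mean by an offset carried in the profile. The classes `UserTerminal` and `RecognitionPlatform`, their methods, and the `mean_offset` field are hypothetical.

```python
class UserTerminal:
    """Holds the profile file produced by the single training session."""

    def __init__(self, profile):
        self.profile = profile

    def on_profile_request(self):
        # Automatic sending; a real terminal might ask the user first.
        return self.profile


class RecognitionPlatform:
    """Speaker-independent platform that adapts its models per caller."""

    def __init__(self, base_means):
        self.base_means = base_means  # acoustic unit -> mean feature vector
        self.stored_profiles = {}     # user id -> received profile
        self.session_means = {}       # user id -> adapted models

    def start_call(self, terminal):
        # Capability a): request the profile from the user terminal.
        profile = terminal.on_profile_request()
        if profile is None:
            # No profile available: keep the speaker-independent models.
            return self.base_means
        # Capability b): store the file, then adapt the models for this caller.
        user_id = profile["user_id"]
        self.stored_profiles[user_id] = profile
        offset = profile["mean_offset"]
        # ASSUMPTION: adaptation is a plain shift of every mean vector.
        adapted = {
            unit: [m + o for m, o in zip(mean, offset)]
            for unit, mean in self.base_means.items()
        }
        self.session_means[user_id] = adapted
        return adapted


platform = RecognitionPlatform({"a": [0.0, 1.0], "o": [2.0, 2.0]})
terminal = UserTerminal({"user_id": "user-1", "mean_offset": [0.5, -0.5]})
print(platform.start_call(terminal))
```

The base models are never modified in place, so each call starts from the same speaker-independent system and a caller without a profile is still served.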

Having sufficiently described the nature of the invention, it is stated for all relevant purposes that the materials, shape, size and arrangement of the elements described may be modified, provided that this does not alter the essential characteristics of the invention claimed below.

Claims

1. Method for dynamically adapting speech recognition acoustic models to the user, based on modifying a speech recognition platform that does not require a specific training phase, and is therefore able to recognize only a limited number of words, so that it evolves into a speaker-dependent system that only requires the information contained in a file with the user's speech profile, increasing the number of words that can be recognized with a high success rate, characterized in that it comprises:

a) a training phase in which the user performs a single training session through a specific application, producing a digital file with the parameters of the user's voice model, which is stored in the user terminal;

b) a speech recognition phase on any compatible platform, in which, each time a call to a teleoperator is established through the user terminal, the file with the user's profile is sent, manually or automatically, at the request of the teleoperator's speech recognition platform, which adapts dynamically and in real time the acoustic models it uses, based on the basic parameters of the speaker's voice model described in the file, so that from that moment on the user's speech is recognized in a personalized way.