FR2842923A1

FR2842923A1 - GENERATOR OF A DOMAIN-SPECIFIC CORPUS

Info

Publication number: FR2842923A1
Application number: FR0209531A
Authority: FR
Inventors: Camal Tazine; Celestin Sedogbo
Original assignee: Thales SA
Current assignee: Thales SA
Priority date: 2002-07-26
Filing date: 2002-07-26
Publication date: 2004-01-30
Anticipated expiration: 2022-07-26
Also published as: EP1540512A1; WO2004013766A1; FR2842923B1; AU2003262524A1

Abstract

The product/program and the method according to the invention make it possible to generate a corpus of texts specific to an application domain from a general corpus and a grammar of the domain. A specific corpus can thus be created automatically, without user interaction.

Description

Generator of a domain-specific corpus.

The present invention belongs to the field of automatic natural language processing. More specifically, it addresses the problem of generating a set of texts specific to an application domain, or specific corpus.

Specific corpora are needed, notably in computer speech recognition systems, to reach a recognition rate acceptable to the user. This is particularly necessary for large-vocabulary systems (typically 20,000 words). In the state of the art, generating a specific corpus still requires long, progressive learning operations based on selecting texts from a corpus of a given domain using sentences entered by the user.

A selection method based on a probabilistic model has been disclosed, notably in US Patent 5,444,617.

This method has the drawback of requiring long and costly interactions with the user.

The present invention overcomes this drawback by allowing the user to specify the specific corpus in the form of a grammar, or set of syntax rules, proper to the application domain; the domain-specific corpus is then generated automatically. The invention thus significantly reduces the time needed to collect the specific corpus.

To this end, the invention proposes a product/program and a method for collecting a set of texts specific to an application domain from a non-specific set of texts, characterized in that it comprises a module controlled by a grammar of the application domain.

The invention will be better understood on examining the following figures, whose content is explained in the body of the description:

Figure 1: Diagram showing the modules of the product/program according to the invention.

Figure 2: Diagram showing an example of generation of the n-grams of a grammar specific to an application domain.

Figure 3: Diagram explaining an algorithm for computing the distances between words according to the invention.

Figure 4: Example of computation of the distances between words according to the invention.

Figure 5: Example of semantic computation.

Figure 6: Graph showing the distribution of the n-grams as a function of the distance to the words of the grammar vocabulary.

Figure 1 shows the sequence of modules and processing steps according to the invention. The definitions used in the figure are as follows. The general corpus 10 is a commercially available set of texts, not specific to any domain, which may comprise several million texts. n-gram[VCORPUS] 13 is a set of ordered word sequences (n-tuples) extracted from the general corpus, said words being present in the vocabulary. The way these n-grams are built is described below. The vocabulary of this corpus, VCORPUS 11, is the set of words most frequently encountered in the corpus, or set of unigrams. The vocabulary is generally limited to 20,000 words. The AEF generator 20 (AEF standing for "automate à états finis", finite-state automaton) is a module that generates the n-grams of a grammar of domain A from said grammar, in a manner also explained later in the description. A set n-gram[VCFG(A)] 33 is generated from the grammar CFG(A) 30 in a manner explained later in the description. The specific corpus of A, CORPUS(A) 40, is initialized with the n-grams n-gram[VCFG(A)] 33. The n-grams of n-gram[VCORPUS] 13 that satisfy the following condition are then added recursively to CORPUS(A) 40:

$$\exists\ \text{n-gram}[V_{CFG(A)}] \;:\; D\big(\text{n-gram}[V_{CORPUS}],\ \text{n-gram}[V_{CFG(A)}]\big) < \delta$$

Several distance functions D can be used, as explained later in the description. δ is the distance threshold, which must be tuned so as to optimize the constitution of CORPUS(A) 40 for recognition applications specific to domain A. Typically, the n-grams of n-gram[VCORPUS] 13 will be bigrams or trigrams. A bigram is a set of two words belonging to the vocabulary VCORPUS 11, with which their probabilities of occurrence in the general corpus 10 are associated. Trigrams are sets of three words, in the order in which they appear in the general corpus 10, with which their probabilities of occurrence in the general corpus 10 are associated.
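The selection rule can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_specific_corpus`, the parameter `delta` and the toy distance `toy_dist` are assumptions introduced here, and the sketch makes a single pass over the general-corpus n-grams.

```python
from typing import Callable, Iterable, List, Tuple

NGram = Tuple[str, ...]


def build_specific_corpus(
    grammar_ngrams: Iterable[NGram],
    corpus_ngrams: Iterable[NGram],
    dist: Callable[[NGram, NGram], float],
    delta: float,
) -> List[NGram]:
    """Start from the grammar n-grams, then add every general-corpus n-gram
    lying strictly closer than delta to at least one grammar n-gram."""
    grammar = list(grammar_ngrams)
    selected = list(grammar)
    for candidate in corpus_ngrams:
        if any(dist(candidate, g) < delta for g in grammar):
            selected.append(candidate)
    return selected


def toy_dist(x: NGram, y: NGram) -> int:
    """Toy distance (assumption): differing positions plus length difference."""
    return sum(a != b for a, b in zip(x, y)) + abs(len(x) - len(y))


grammar = [("unité", "alpha"), ("unité", "bravo")]
corpus = [("unité", "charlie"), ("bonjour", "monde")]
result = build_specific_corpus(grammar, corpus, toy_dist, delta=2)
```

Here only ("unité", "charlie") is added, since it differs from a grammar bigram in one position, while ("bonjour", "monde") differs in two. In practice `dist` would be the n-gram distance built from the word distance D described later in the text.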

To generate n-gram[VCORPUS], commercially available tools can be used, generally known under the generic name of statistical language model generation tools. One example is the toolkit developed at Carnegie Mellon University, described by Philippe Clarkson and Ronald Rosenfeld in a university publication: [Rosenfeld 95] Rosenfeld R., The CMU Statistical Language Modeling Toolkit and its use, ARPA Spoken Language Technology Workshop, Austin, Texas (USA), pp. 45-50, 1995. This article is incorporated by reference into the present description. Most statistical language models, and in particular the one described in the cited article, correct the lowest occurrence probabilities so as to remove the bias that is typical of this kind of statistical analysis. The least observed n-grams indeed have an occurrence probability biased downward, and the most observed ones an occurrence probability biased upward.
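To make the content of n-gram[VCORPUS] concrete, here is a small sketch in plain Python. It does not use the CMU toolkit and applies no correction of low probabilities; the function name `corpus_ngrams` and its parameters are hypothetical, and relative frequencies stand in for the occurrence probabilities.

```python
from collections import Counter
from typing import Dict, List, Tuple


def corpus_ngrams(
    sentences: List[List[str]], n: int, vocab_size: int
) -> Dict[Tuple[str, ...], float]:
    """Relative frequencies of the n-grams whose words all belong to the
    vocab_size most frequent words of the corpus (no smoothing applied)."""
    unigram_counts = Counter(w for s in sentences for w in s)
    vocab = {w for w, _ in unigram_counts.most_common(vocab_size)}
    counts = Counter(
        tuple(s[i:i + n])
        for s in sentences
        for i in range(len(s) - n + 1)
        if all(w in vocab for w in s[i:i + n])
    )
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


sentences = [["a", "b", "a", "b"], ["a", "b", "c"]]
bigrams = corpus_ngrams(sentences, n=2, vocab_size=2)
```

With a vocabulary limited to the two most frequent words, the bigram ("b", "c") is discarded because "c" is out of vocabulary.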

The grammar CFG(A) 30 is a context-free grammar, meaning that variations in context do not modify the grammar itself. In the state of the art, this grammar is built manually. The n-gram[VCFG(A)] 33 will typically be trigrams or quadrigrams. They are created by the AEF generator 20, an example of which is described in Figure 2. The generation of the n-grams of CFG(A) 30 proceeds as follows:
- creation of the corresponding deterministic automaton (probabilities are not taken into account), where vertex = state and transition = terminal symbol;
- selection of all sequences of n consecutive arcs belonging to this automaton.

Consider the following CFG, whose terminal symbols are French words (unité: "unit", rejoignez: "join", allez vers: "go to", l'unité: "the unit"):

GRAMMAR = unité (alpha OR bravo) (rejoignez OR (allez vers)) l'unité (alpha OR bravo)

The vocabulary is therefore VCFG = {unité, alpha, bravo, rejoignez, allez vers, l'unité}.

Note that |VCFG| = 6.

This yields the deterministic finite automaton shown in Figure 2.

The unigrams are: unité, alpha, bravo, rejoignez, allez vers, l'unité (that is, VCFG again).

The bigrams are: unité alpha, unité bravo, alpha rejoignez, alpha allez vers, bravo rejoignez, bravo allez vers, rejoignez l'unité, allez vers l'unité, l'unité alpha, l'unité bravo. There are 10 of them, far fewer than |VCFG|^2 = 36.

The trigrams are: unité alpha rejoignez, unité alpha allez vers, unité bravo rejoignez, unité bravo allez vers, alpha rejoignez l'unité, alpha allez vers l'unité, bravo rejoignez l'unité, bravo allez vers l'unité, rejoignez l'unité alpha, rejoignez l'unité bravo, allez vers l'unité alpha, allez vers l'unité bravo. There are 12 of them, far fewer than |VCFG|^3 = 216.

Although in theory the number of n-grams can reach |VCFG|^n, in practice it is much smaller (a few thousand n-grams for a grammar whose vocabulary reaches 200 words).
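The enumeration of n consecutive arcs can be sketched as follows. Figure 2 is not reproduced here, so the arc list `ARCS` and the state numbering are an assumed reading of the example grammar, and `automaton_ngrams` is a hypothetical name.

```python
from typing import Dict, List, Set, Tuple

Arc = Tuple[int, str, int]  # (source state, terminal symbol, target state)

# Assumed automaton for: unité (alpha|bravo) (rejoignez|allez vers) l'unité (alpha|bravo)
ARCS: List[Arc] = [
    (0, "unité", 1),
    (1, "alpha", 2), (1, "bravo", 2),
    (2, "rejoignez", 3), (2, "allez vers", 3),
    (3, "l'unité", 4),
    (4, "alpha", 5), (4, "bravo", 5),
]


def automaton_ngrams(arcs: List[Arc], n: int) -> Set[Tuple[str, ...]]:
    """Collect the label sequences of all paths made of n consecutive arcs."""
    outgoing: Dict[int, List[Tuple[str, int]]] = {}
    for src, label, dst in arcs:
        outgoing.setdefault(src, []).append((label, dst))

    found: Set[Tuple[str, ...]] = set()

    def walk(state: int, labels: Tuple[str, ...]) -> None:
        if len(labels) == n:
            found.add(labels)
            return
        for label, dst in outgoing.get(state, []):
            walk(dst, labels + (label,))

    for start in outgoing:  # a path may start at any state
        walk(start, ())
    return found


counts = [len(automaton_ngrams(ARCS, n)) for n in (1, 2, 3)]
```

Under this reading, `counts` gives 6 unigrams, 10 bigrams and 12 trigrams, matching the figures quoted in the text.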

The vocabulary VCFG(A) 31 is the set of unigrams.

Figure 3 illustrates the operation of the algorithm for computing the distance between two words of a dictionary. The application uses the three dictionaries 10, 12 and 32 of Figure 1.

The dictionaries dico-VCORPUS 12 and dico-VCFG(A) 32 are extracted from a general dictionary 10a, which is a commercially available component.

This general dictionary associates information with the inflected forms of words, such as the pronunciation and the root of the word. Semantic information can also be added, represented as a graph or as conceptual vectors.

This algorithm comprises three steps:
- the letter-to-letter distance computation, which uses the edit distance algorithm and the ins-del-sub parameters;
- the distance computation between any two words, which weights the result according to the length of the transformed word;
- the distance computation between two words of the dictionary, which takes into account the type and the meaning of the words.

The notations are as follows:
- |a|: number of letters of a;
- ε: the empty word;
- div: the integer division operator.

One method for computing the distance between two words is described in works accessible to those skilled in the art under the name of edit distance (also called Levenshtein or Wagner-Fischer distance): [Wagner & Fischer, 1974] Wagner, R. A. & Fischer, M. J. (1974). The string-to-string correction problem. Journal of the Association for Computing Machinery, 21(1), 168-173. [Amengual & Vidal, 1998] Amengual, J.-C. & Vidal, E. Efficient error-correcting Viterbi parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(10), 1109-1116, 1998. These articles are incorporated by reference into the present application.

Given two words a and b, the edit distance returns the minimum number of edit operations needed to transform word a into word b. These edit operations are generally the insertion of a letter, the deletion of a letter and the substitution of a letter. A weight can be assigned to each of these operations. In that case, the edit distance returns the minimum total weight that transforms word a into word b.

Let DL(a, b) be the function that returns the edit distance (Levenshtein) required to transform a into b. DL takes as parameters the integers Pins, Pdel and Psub (the unit distances for each insertion, deletion and substitution). The choice of these parameters is arbitrary at first and can be refined according to the results; for example, one may take Pdel = 2, Pins = 3 and Psub = 4.

For our application, the distance DQ between any two words must reflect how much the original word has been altered: the extent of the transformation can be measured relative to the size of the original word. Given a word a of length n, if k edit operations have been performed, the alteration of the original word can be estimated as k/n. In the particular case where a is empty (a = ε), the computation is carried out as if the length of the empty word were 1. Since the vocabularies VCFG and VCORPUS are finite, the length K of the longest word is well defined, and DQ is given by the formula:

$$DQ(a,b) = \begin{cases} (K \cdot DL(a,b)) \ \mathrm{div}\ |a| & \text{if } a \neq \varepsilon \\ DL(\varepsilon, b) & \text{if } a = \varepsilon \end{cases}$$

where $K = \max(\{|a| : a \in V_{CFG}\}, \{|b| : b \in V_{CORPUS}\})$.

To compute the distance D between two words of the dictionary, it is now desirable to correct the edit distance by a factor that takes their etymological and semantic distance into account. So as not to degrade execution speed, a simplified indicator of this proximity is advantageously chosen, built as described below.

Let DQmax be the maximum distance between any two words. Let a ∈ VCFG(A) and b ∈ VCORPUS be the two words whose distance is to be measured.
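A possible sketch of DL and DQ is given below, using the classic dynamic-programming formulation of the weighted edit distance. The function names `edit_distance` and `dq` and the keyword names `p_ins`, `p_del`, `p_sub` are assumptions made for illustration. For an empty a, the sketch returns the unscaled DL(ε, b); this is an assumption about how the special case is meant to be handled.

```python
def edit_distance(a: str, b: str, p_ins: int = 1, p_del: int = 1, p_sub: int = 1) -> int:
    """Weighted edit distance DL(a, b), computed row by row."""
    # Row for the empty prefix of a: only insertions are possible.
    prev = [j * p_ins for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * p_del]  # transforming a[:i] into the empty word
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + p_del,                           # delete ca
                curr[j - 1] + p_ins,                       # insert cb
                prev[j - 1] + (0 if ca == cb else p_sub),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def dq(a: str, b: str, k: int, **weights: int) -> int:
    """DQ(a, b): DL scaled by K, then integer-divided by the length of a.
    For an empty a, the unscaled DL is returned (assumption, see above)."""
    dl = edit_distance(a, b, **weights)
    return (k * dl) // len(a) if a else dl


example = dq("unité", "montrer", k=8)  # (8 * 6) div 5
```

With the example weights Pdel = 2, Pins = 3 and Psub = 4, a substitution (4) is cheaper than a deletion followed by an insertion (5), so `edit_distance("a", "b", p_ins=3, p_del=2, p_sub=4)` returns 4.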

$$D(a,b) = D_0 + DQ(a,b)$$

- If a and b have the same root, D0 = 0.
- If a and b have the same meaning, D0 = DQmax.
- If a and b both belong to the vocabulary of the CFG, D0 = 2 · DQmax.
- Otherwise, D0 = 3 · DQmax.

Any function computing a distance between a and b can be used. It is, however, preferable that the function D be piecewise continuous and increasing with DQ.

An example run of the word-distance algorithm is given below.

Take Pins = Pdel = Psub = 1.

Let VCFG be defined by VCFG = {ε, "unité", "afficher"} and VCORPUS by VCORPUS = {ε, "montrer", "cheval"}.

K = 8, the length of the longest word, "afficher".

The expression of the distance between two words of the dictionary therefore becomes:

$$DQ(a,b) = \begin{cases} (8 \cdot DL(a,b)) \ \mathrm{div}\ |a| & \text{if } a \neq \varepsilon \\ 8 \cdot DL(\varepsilon, b) & \text{otherwise} \end{cases}$$

DQmax = 8 · 1 = 8

- D(ε, ε) = 0
- D(ε, "montrer"): D0 = 24 and DQ(ε, "montrer") = 7, so D(ε, "montrer") = 24 + 7 = 31
- D(ε, "cheval"): D0 = 24 and DQ(ε, "cheval") = 6, so D(ε, "cheval") = 24 + 6 = 30
- D("unité", ε): D0 = 24 and DQ("unité", ε) = (5 · 8) div 5 = 8, so D("unité", ε) = 24 + 8 = 32
- D("unité", "montrer"): D0 = 3 · DQmax = 24 and DQ("unité", "montrer") = (6 · 8) div 5 = 9, because "unité" → "munité" → "monité" → "monté" → "montré" → "montre" → "montrer" takes 6 operations, so D("unité", "montrer") = 24 + 9 = 33
- D("unité", "cheval"): D0 = 3 · DQmax = 24 and DQ("unité", "cheval") = (6 · 8) div 5 = 9, because "unité" → "cnité" → "chité" → "cheté" → "chevé" → "cheva" → "cheval" takes 6 operations, so D("unité", "cheval") = 24 + 9 = 33
- D("afficher", ε): D0 = 24 and DQ("afficher", ε) = (8 · 8) div 8 = 8, so D("afficher", ε) = 24 + 8 = 32
- D("afficher", "montrer"): D0 = DQmax = 8 and DQ("afficher", "montrer") = (6 · 8) div 8 = 6, because "afficher" → "mfficher" → "moficher" → "monicher" → "montcher" → "montrher" → "montrer" takes 6 operations, so D("afficher", "montrer") = 8 + 6 = 14
- D("afficher", "cheval"): D0 = 3 · DQmax = 24 and DQ("afficher", "cheval") = (7 · 8) div 8 = 7, because "afficher" → "fficher" → "ficher" → "icher" → "cher" → "chev" → "cheva" → "cheval" takes 7 operations, so D("afficher", "cheval") = 24 + 7 = 31

The table of unit distances resulting from these calculations is given in Figure 4.
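The values of this worked example can be re-derived with the short sketch below. Several points are assumptions made only to reproduce the quoted numbers: the lookup tables `ROOT` and `SAME_MEANING` are hypothetical stand-ins for the dictionaries, the empty word is left out of the CFG membership test, and for an empty a the unscaled DL(ε, b) is used, since that is what the values 7 and 6 correspond to.

```python
from typing import Dict, FrozenSet, Set


def dl(a: str, b: str) -> int:
    """Unit-weight edit distance DL (Pins = Pdel = Psub = 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


K = 8        # length of the longest word, "afficher"
DQ_MAX = 8   # value stated in the example
CFG_WORDS: Set[str] = {"unité", "afficher"}  # assumption: ε excluded from this test
SAME_MEANING: Set[FrozenSet[str]] = {frozenset({"afficher", "montrer"})}
ROOT: Dict[str, str] = {}  # hypothetical root lookup; empty, so each word is its own root


def dq(a: str, b: str) -> int:
    d = dl(a, b)
    return (K * d) // len(a) if a else d  # unscaled for empty a (assumption)


def d0(a: str, b: str) -> int:
    if ROOT.get(a, a) == ROOT.get(b, b):
        return 0
    if frozenset({a, b}) in SAME_MEANING:
        return DQ_MAX
    if a in CFG_WORDS and b in CFG_WORDS:
        return 2 * DQ_MAX
    return 3 * DQ_MAX


def distance(a: str, b: str) -> int:
    return d0(a, b) + dq(a, b)


table = {
    (a, b): distance(a, b)
    for a in ("", "unité", "afficher")
    for b in ("", "montrer", "cheval")
}
```

The pair ("afficher", "montrer") gets the smallest non-trivial distance because it is the only pair flagged as having the same meaning, which lowers D0 from 24 to 8.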

Note that the smallest distance (apart from the empty words) is that of the pair (afficher, montrer). It is indeed easier to insert than to delete: deletion leads to a loss of information, whereas insertion adds noise to the information.

In a variant embodiment, a finer breakdown of meaning can be obtained by considering several different semantic levels (by means of a classification). For example, color, red and green are neighbors, but red and green are closer to each other than to color.

The semantic computation is performed using semantic dictionaries. Several forms of semantic dictionaries exist, two in particular: those based on graphs and those based on vectors.

Taking the example of colors, if the semantic dictionary is a graph, the diagram of Figure 5 can be obtained. To compute the distance, the following convention can be used:
- two siblings have a distance of 1;
- a parent and a child have a distance of 2.

For example, the distance between color and red is 2, and the distance between red and green is 1.

By contrast, the distance between red and dog is 5, as illustrated in Figure 5b.
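The convention can be sketched on a small tree. Figure 5 is not reproduced here, so the taxonomy `PARENT` below (including the node names "concept" and "animal") is a hypothetical one, chosen only because it is consistent with the three distances quoted in the text.

```python
from typing import Dict, List, Optional

# Hypothetical taxonomy: each concept maps to its parent (None for the top node).
PARENT: Dict[str, Optional[str]] = {
    "concept": None,
    "color": "concept", "animal": "concept",
    "red": "color", "green": "color",
    "dog": "animal",
}


def ancestors(node: str) -> List[str]:
    """The node itself, followed by its successive parents up to the top."""
    chain: List[str] = []
    current: Optional[str] = node
    while current is not None:
        chain.append(current)
        current = PARENT[current]
    return chain


def semantic_distance(a: str, b: str) -> int:
    """Parent-child steps cost 2; a step between two siblings costs 1."""
    up_a, up_b = ancestors(a), ancestors(b)
    common = next(n for n in up_a if n in up_b)  # lowest common ancestor
    steps_a, steps_b = up_a.index(common), up_b.index(common)
    if steps_a == 0 or steps_b == 0:
        # One word is an ancestor of the other: only parent-child steps.
        return 2 * (steps_a + steps_b)
    # Climb on each side to the children of the common ancestor, then cross
    # between those two siblings with a single step of cost 1.
    return 2 * (steps_a - 1) + 2 * (steps_b - 1) + 1
```

On this tree, red to dog costs 2 (red to color) + 1 (color to animal, siblings) + 2 (animal to dog) = 5.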

With conceptual vectors, the distance reduces to a scalar product. What matters here are the very short distances between words (beyond a certain distance, the value is no longer of interest). Consequently, in the invention, the semantic distance is treated as binary (two words are either semantically close or they are not).
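A binary closeness test on conceptual vectors might look like the sketch below. The normalization of the scalar product (cosine), the threshold value 0.8 and the function names are all assumptions; the text only states that the distance reduces to a scalar product and that the outcome is binary.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Scalar product of u and v, normalized by their lengths."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def semantically_close(u: Sequence[float], v: Sequence[float], threshold: float = 0.8) -> bool:
    """Binary decision: close if the normalized scalar product reaches the threshold."""
    return cosine(u, v) >= threshold
```

Two identical directions are reported as close, and two orthogonal vectors as not close.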

The distances are integer values, which makes it easier to build analysis tables from which the threshold can be chosen.

Using this algorithm for computing distances between words (or unit distances), D(a, b) can be computed for any word a and any word b belonging respectively to VCORPUS and to VCFG(A).

The next step is to derive from this first matrix a second matrix of the distances between the n-grams $(x_i)_1^n$ of VCORPUS and the m-grams $(y_j)_1^m$ of VCFG(A). An algorithm known to those skilled in the art is advantageously used for this purpose, described in [Chodorowski, 2001] Chodorowski, J., Inférence grammaticale pour l'apprentissage de la syntaxe en reconnaissance de la parole et dialogue oral, thesis, Université de Rennes 1, 2001, page 50, which is incorporated by reference into the present description. The aim is to compute the element M(n, m) of this second matrix, which is the distance between the n-gram $(x_i)_1^n$ and the m-gram $(y_j)_1^m$.

The algorithm described in the above reference computes M(n, m) by a dynamic programming recurrence defined as follows:

M (0,0)= 0M (0,0) = 0

M (i,0) = Y,'= D (xk, c) pour 1< i < m, xk étant le kème mot du i-gram M (i,0) = D, D (E, Yk) pour 1< j < n M (i - 1, j) + D (xi, e) M (i, j) = min M (ij, j- 1) + D (e, y1) M (i - 1, j) + D (xi, yj) La distance entre deux n-grammes utilise la distance de Levenstein entre deux séquences, mais cette fois-ci au niveau phrase. Le travail est exactement le même: la distance entre deux mots quelconques étant connue, on applique cette mesure de distance comme si les mots M (i, 0) = Y, '= D (xk, c) for 1 <i <m, where xk is the kth word of i-gram M (i, 0) = D, D (E, Yk) for 1 <j <n M (i - 1, j) + D (xi, e) M (i, j) = min M (ij, j - 1) + D (e, y1) M (i - 1, j) + D (xi, yj) The distance between two n-grams uses the Levenstein distance between two sequences, but this time at the sentence level. The work is exactly the same: the distance between any two words being known, we apply this measure of distance as if the words

étaient des simples symboles.were mere symbols.

Example:

Distance between "unité alpha allez vers" ("unit alpha go towards") and "unité alpha avancez vers" ("unit alpha advance towards").

M1 = unité, M2 = alpha, M3 = allez, M4 = vers, M5 = avancez

The distance between these two sentences is equal to the distance between the sequences M1 M2 M3 M4 and M1 M2 M5 M4, given the matrix of unit distances D(Mi, Mj) computed previously.
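The recurrence and the example above can be sketched in Python as follows. This is a minimal sketch, not the implementation of the cited reference: the function names and the costs in `toy_word_distance` are hypothetical values chosen only for illustration, and the empty word ε is represented by the empty string.

```python
from typing import Callable, Sequence

EMPTY = ""  # stands for the empty word (epsilon) in the recurrence


def ngram_distance(
    x: Sequence[str], y: Sequence[str], d: Callable[[str, str], int]
) -> int:
    """Distance M(n, m) between word sequences x and y, given a word distance d."""
    n, m = len(x), len(y)
    M = [[0] * (m + 1) for _ in range(n + 1)]  # M[0][0] = 0
    for i in range(1, n + 1):  # first column: delete the first i words of x
        M[i][0] = M[i - 1][0] + d(x[i - 1], EMPTY)
    for j in range(1, m + 1):  # first row: insert the first j words of y
        M[0][j] = M[0][j - 1] + d(EMPTY, y[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = min(
                M[i - 1][j] + d(x[i - 1], EMPTY),         # drop word x_i
                M[i][j - 1] + d(EMPTY, y[j - 1]),         # add word y_j
                M[i - 1][j - 1] + d(x[i - 1], y[j - 1]),  # align x_i with y_j
            )
    return M[n][m]


def toy_word_distance(a: str, b: str) -> int:
    """Hypothetical unit distances; all numeric costs here are assumptions."""
    if a == b:
        return 0
    if EMPTY in (a, b):
        return 3  # assumed cost of inserting or deleting a whole word
    if {a, b} == {"allez", "avancez"}:
        return 1  # assumed small distance between two near-synonyms
    return 4      # assumed cost for any other pair of distinct words


sentence_1 = ["unité", "alpha", "allez", "vers"]
sentence_2 = ["unité", "alpha", "avancez", "vers"]
example_distance = ngram_distance(sentence_1, sentence_2, toy_word_distance)
```

With these assumed costs, the two sentences differ only by the pair allez/avancez, so `example_distance` equals the unit distance assigned to that pair.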

The number of computations for each element M(n, m) is therefore of the order of n × m, that is 12 in the preferred embodiment where n = 3 and m = 4. Note that the lower-rank computations in the recurrence also fill other cells of the matrix of distances between n-grams and m-grams.

Of course, other algorithms for computing the distances between n-grams and m-grams may be chosen.

The threshold δ is the distance below which the n-grams of the general corpus, measured by their distance to m-grams of VCFG(A), are added to CORPUS(A), the latter being initialized with VCFG(A).
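The selection step can be sketched as follows. This is a sketch under stated assumptions: the names `build_specific_corpus` and `mismatch_count` are hypothetical, the comparison is taken to be strict (distance < δ), an n-gram is kept as soon as its distance to at least one grammar m-gram is below δ, and `mismatch_count` is only a toy stand-in for the n-gram distance M(n, m).

```python
from typing import Callable, Iterable, List, Tuple

NGram = Tuple[str, ...]


def mismatch_count(x: NGram, y: NGram) -> int:
    """Toy stand-in for M(n, m): differing positions plus length difference."""
    return sum(a != b for a, b in zip(x, y)) + abs(len(x) - len(y))


def build_specific_corpus(
    general_ngrams: Iterable[NGram],
    grammar_mgrams: Iterable[NGram],
    distance: Callable[[NGram, NGram], int],
    delta: int,
) -> List[NGram]:
    """Return CORPUS(A): grammar m-grams plus general n-grams closer than delta."""
    grammar = list(grammar_mgrams)
    if not grammar:
        return []  # nothing to compare against
    corpus_a = list(grammar)  # CORPUS(A) is initialized with VCFG(A)
    seen = set(corpus_a)
    for x in general_ngrams:
        if x in seen:
            continue  # avoid adding the same n-gram twice
        # Assumption: closeness to any single grammar m-gram is sufficient.
        if min(distance(x, y) for y in grammar) < delta:
            corpus_a.append(x)
            seen.add(x)
    return corpus_a
```

Passing the distance as a parameter keeps the selection independent of the particular n-gram distance that is chosen.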

Numerical analysis of the matrix of the M(n, m) makes it possible to plot a graph of the frequencies of the n-grams as a function of their distance to VCFG(A) (figure 6).

A few iterations will be useful for tuning δ.
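Because the distances are integers, the frequency analysis lends itself to a simple table. The sketch below assumes that the minimum distance of each general-corpus n-gram to VCFG(A) has already been computed; the function names and the sample values are hypothetical and do not reproduce figure 6.

```python
from collections import Counter
from typing import Dict, Iterable


def distance_histogram(min_distances: Iterable[int]) -> Dict[int, int]:
    """Count how many n-grams fall at each integer distance, sorted by distance."""
    counts = Counter(min_distances)
    return dict(sorted(counts.items()))


def selected_count(histogram: Dict[int, int], delta: int) -> int:
    """Number of n-grams that a given threshold delta would select (distance < delta)."""
    return sum(count for dist, count in histogram.items() if dist < delta)


# Hypothetical minimum distances for seven n-grams of the general corpus.
sample_distances = [0, 1, 1, 3, 3, 3, 7]
histogram = distance_histogram(sample_distances)
```

Evaluating `selected_count` for a few candidate values of δ gives one possible way to carry out the iterations mentioned above.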

The invention can be implemented on any type of commercially available computer provided with conventional data input and output interfaces (keyboard, mouse, screen, printer). Integration with a speech recognition system is possible on a shared configuration. In that case, the computer system also has a microphone, loudspeakers, a specialized signal-processing board and specialized speech recognition software.

Claims

1. Product / program for collecting a set of texts (40) specific to an application domain (A) from a set of non-specific texts (10), characterized in that it comprises a control module driven by a grammar (30) of the application domain (A).

2. Product / program according to claim 1, characterized in that the control module comprises a module for measuring a distance D between sentences of the set of non-specific texts and sentences of the grammar of the application domain.

3. Product / program according to claim 2, characterized in that the control module comprises a threshold δ, of adjustable value, on the distance between sentences.

4. Product / program according to one of claims 2 or 3, characterized in that the distance measuring module calculates the distance D between a sentence of n words and a sentence of m words by recurrence from a measure of the distance between words.

5. Product / program according to claim 4, characterized in that, for i and j respectively varying from 1 to n and from 1 to m, an element of rank (i, j) in the recurrence is the minimum of: the sum of the element of rank (i - 1, j) and of the distance between the word of rank i of the first sentence and the word "space"; the sum of the element of rank (i, j - 1) and of the distance between the word "space" and the word of rank j of the second sentence; and the sum of the element of rank (i - 1, j - 1) and of the distance between the word of rank i of the first sentence and the word of rank j of the second sentence.

6. Product / program according to claim 4, characterized in that, for i and j respectively varying from 1 to n and from 1 to m, elements of rank (i, 0) or (0, j) in the recurrence are each the sum, for k varying from 1 to i, of the distances between the word of rank k and the word "space".

7. Product / program according to claim 4, characterized in that the element of rank (0, 0) in the recurrence is equal to 0.

8. Product / program according to one of claims 4 to 7, characterized in that the distance between words is a decreasing function of their etymological and semantic proximity and an increasing function of a measure of the letter-by-letter transformation of one of the two words into the other.

9. Product / program according to one of claims 1 to 8, characterized in that the n-grams of the grammar specific to the application domain (33) are generated by a module (20) in which the user parameterizes a deterministic finite-state automaton.

10. Speech recognition system including a product / program according to one of claims 1 to 9.

11. Information processing system comprising a module according to one of claims 1 to 10.

12. Method for collecting a set of texts specific to an application domain A from a set of non-specific texts (10), characterized in that the collection is controlled by a grammar (30) of the application domain A.

13. Method according to claim 12, characterized in that the control comprises a measure of a distance D between sentences of the set of non-specific texts and sentences of the grammar of the application domain.

14. The method according to claim 13, characterized in that the control comprises an adjustable value of the distance threshold between sentences.

15. Method according to one of claims 13 or 14, characterized in that the measure of distance between a sentence of n words and a sentence of m words is calculated by recurrence from a measure of the distance between words.

16. Method according to claim 15, characterized in that, for i and j respectively varying from 1 to n and from 1 to m, an element of rank (i, j) in the recurrence is the minimum of: the sum of the element of rank (i - 1, j) and of the distance between the word of rank i of the first sentence and the word "space"; the sum of the element of rank (i, j - 1) and of the distance between the word "space" and the word of rank j of the second sentence; and the sum of the element of rank (i - 1, j - 1) and of the distance between the word of rank i of the first sentence and the word of rank j of the second sentence.

17. Method according to claim 15, characterized in that, for i and j respectively varying from 1 to n and from 1 to m, elements of rank (i, 0) or (0, j) in the recurrence are each the sum, for k varying from 1 to i, of the distances between the word of rank k and the word "space".

18. Method according to claim 15, characterized in that the element of rank (0, 0) in the recurrence is equal to 0.

19. Method according to one of claims 15 to 18, characterized in that the distance between words is a decreasing function of their etymological and semantic proximity and an increasing function of a measure of the cost of the letter-by-letter transformation of one of the two words into the other.

20. Method according to one of claims 12 to 19, characterized in that the n-grams of the grammar specific to the application domain (33) are generated by a module (20) in which the user parameterizes a deterministic finite-state automaton.

21. Speech recognition method implementing a method according to one of claims 12 to 20.

22. Information processing method implementing a method according to one of claims 12 to 21.