CN1223984C

CN1223984C - Client-server based distributed speech recognition system

Info

Publication number: CN1223984C
Application number: CN01823555.7A
Authority: CN
Inventors: 赵庆伟; 张向东; 杨永红; 袁宝胜
Original assignee: Intel China Ltd; Intel Corp
Current assignee: Intel China Ltd; Intel Corp
Priority date: 2001-06-19
Filing date: 2001-06-19
Publication date: 2005-10-19
Anticipated expiration: 2021-06-19
Also published as: CN1545694A; WO2002103675A1; WO2002103675A8

Abstract

Generally, new client-server based distributed speech recognition (DSR) systems provide an efficient method for recognizing human speech on a client device and transmit it over a network to a remote server. This system distributes the speech recognition processing between the client and server, allowing a speaker-related language model to be utilized, resulting in higher accuracy compared to traditional DSR systems. Accordingly, the client device is configured to generate a phonetic glossary by performing speech recognition using a voice model trained by the same end-user whose speech is to be recognized. The obtained phonetic glossary is sent to the server, which performs the speech processing and generates the recognized word string. Compared to designs using traditional DSR, this new DSR method and system achieves a 2-3 times lower word error rate, resulting in a more accurate recognition system.

Description

Distributed Speech Recognition System Based on Client-Server

技术领域technical field

本发明涉及分布式语音识别(DSR)系统和构架。更加具体来说，本发明涉及一种新的DSR系统和方法，其在客户机设备执行语音识别的声音处理部分和在服务器设备的语言处理部分。The present invention relates to distributed speech recognition (DSR) systems and architectures. More specifically, the present invention relates to a new DSR system and method that performs the sound processing portion of speech recognition at a client device and the language processing portion at a server device.

背景技术Background technique

自从有了现代计算机的思想开始，工程师和语言学家已经共同工作，以使得通过一个机器完美地实现人的语音识别。自动语音识别的一个目标是使得一个系统接收输入的人的语音、把其转换为可识别的形式，并且用所识别的语音执行有用的功能。Since the idea of modern computers, engineers and linguists have worked together to make human speech recognition perfectly realized by a machine. One goal of automatic speech recognition is to enable a system to take input human speech, convert it to a recognizable form, and perform useful functions with the recognized speech.

目前，存在各种用于语音识别技术的商业应用程序。听写机例如可以“听”人口述，并且“实时地”把“所听到”的文本传送到监视器上。另一个应用程序涉及能够接收和执行由人的语音而不是通过鼠标或键盘所发出的控制命令的机器。例如，一个人可以对计算机说“读我的电子邮件”。该应用程序可以使用语音识别技术来识别由说话人所发出的字串。执行所需任务的一系列命令然后可以被发出，导致计算机读取该人的电子邮件。Currently, there are various commercial applications for speech recognition technology. A dictation machine, for example, can "listen" to a person's dictation and transmit the "heard" text to a monitor in "real time". Another application involves machines capable of receiving and executing control commands issued by a human voice rather than via a mouse or keyboard. For example, a person can say "read my email" to a computer. The application can use speech recognition technology to recognize word strings uttered by the speaker. A series of commands to perform the desired task can then be issued, causing the computer to read the person's email.

另一种应用程序已经被开发用于基于客户机-服务器的语音系统和构架。通常，语音识别的任务被分布在客户机和服务器之间。例如，移动电话或个人数字助理(PDA)可以被用作为客户机，其捕获语音，获得语音特征，并且把该特征发送到位于一个中央位置的服务器。该通信可以在例如互联网这样的网络上发生。一旦该语音特征被该服务器所接收，则它们被处理，用于声音识别和用于所用的给定人的语言的语言处理。Another application has been developed for client-server based speech systems and frameworks. Typically, the task of speech recognition is distributed between the client and the server. For example, a mobile phone or personal digital assistant (PDA) can be used as a client that captures speech, obtains speech characteristics, and sends the characteristics to a server at a central location. This communication can take place over a network such as the Internet. Once the speech features are received by the server, they are processed for voice recognition and language processing for the given person's language used.

更加具体来说，人的语音被例如麦克风这样的设备在客户机方所捕获。该语音信号被转换为数字形式，以便于被数字计算机所分析。该数字信号被通过一个特征提取模块，其将提取该语音信号的声音特征，例如在周期采样点处的能量集中。所提取的特征然后被通过例如Mel频率对数倒频谱系数(Mel Frequency Cepstral Coefficients)这样的数学模型而量化。该被量化的特征被组织为一个数据包，用于发送到一个服务器。More specifically, human speech is captured at the client side by a device such as a microphone. The speech signal is converted to digital form so that it can be analyzed by a digital computer. The digital signal is passed through a feature extraction module, which will extract the acoustic features of the speech signal, such as energy concentration at periodic sampling points. The extracted features are then quantized by mathematical models such as Mel Frequency Cepstral Coefficients. The quantized features are organized into a data packet for sending to a server.

然后，该服务器接收包含量化特征的数据包，并且执行声音和语言处理，以提供一个字串。由于该服务器服务于多个客户机，因此该声音处理被一个与说话人无关(SI)的模型所模拟。The server then receives the data packets containing the quantified features and performs sound and language processing to provide a word string. Since the server serves multiple clients, the sound processing is simulated by a speaker-independent (SI) model.

在传统的DSR方法的缺点中的一个缺点是它不能够利用与说话人相关(SD)的模型所提供的改进的字错误率(WER)的优点。在两种模型之间的差别在于一个SD模型已经被特定人的语音所训练，结果用于该特定人的WER较低。这是因为来自不同语言背景的人对于相同的字词发出显著不同的声音信号。来自不同区域的人可能具有不同的口音和发音。Among the shortcomings of the traditional DSR approach is its inability to take advantage of the improved word error rate (WER) provided by speaker-dependent (SD) models. The difference between the two models is that one SD model has been trained on a specific person's speech, and as a result the WER for that specific person is lower. This is because people from different linguistic backgrounds produce significantly different vocal signals for the same word. People from different regions may have different accents and pronunciations.

相反，当该系统由各种说话人所使用时，例如一个自动出纳机(ATM)，使用一个SI模型并且其被指定为处理任何说话人，而与说话人的语言特征无关，例如说话人的发音，由于性别和年龄以及说话人的声音的强度所导致的语音变化，SD模型具有比SI模型低2-3倍的WER。由于传统的DSR方法处理在该服务器而不是在客户机处处理的声音，因此使得系统构架采用具有较低的WER的SD声音识别模型来提高整体识别精确度是不现实和没有效率的。In contrast, when the system is used by various speakers, such as an automated teller machine (ATM), an SI model is used and it is specified to handle any speaker regardless of the speaker's linguistic characteristics, such as the speaker's Pronunciation, the variation in speech due to gender and age and the intensity of the speaker's voice, the SD model has 2-3 times lower WER than the SI model. Since the traditional DSR method processes the voice at the server instead of the client, it is unrealistic and inefficient for the system architecture to adopt the SD voice recognition model with lower WER to improve the overall recognition accuracy.

附图说明Description of drawings

图1示出采用本发明的一个实施例的新的DSR方法的示意通信网络的方框图。Fig. 1 shows a block diagram of a schematic communication network employing the new DSR method of one embodiment of the present invention.

图2示出采用本发明一个实施例的新的DSR方法的示意基于客户机-服务器的DSR系统的方框图。FIG. 2 shows a block diagram of a schematic client-server based DSR system employing the new DSR method of one embodiment of the present invention.

图3示出采用本发明一个实施例的新的DSR方法的示意图。Fig. 3 shows a schematic diagram of a new DSR method employing one embodiment of the present invention.

图4示出在一个基于客户机-服务器系统的客户机节点处采用本发明的一个实施例的新的DSR方法的示意图。FIG. 4 shows a schematic diagram of the new DSR method using an embodiment of the present invention at a client node of a client-server based system.

图5示出采用根据本发明一个实施例的方法的在一个客户机节点处产生的示意音标字图(phonetic word graph)。FIG. 5 shows a schematic phonetic word graph generated at a client node using a method according to an embodiment of the present invention.

图6A示出根据本发明一个实施例的一个音标字图的发送处理的示意流程图。Fig. 6A shows a schematic flow chart of sending processing of a phonetic symbol graph according to an embodiment of the present invention.

图6B示出用于发送根据本发明一个实施例的方法的示意音标字图的示意数据报(datagram)。FIG. 6B shows a schematic datagram for sending a schematic phonetic graph of a method according to an embodiment of the present invention.

图7A示出在一个基于客户机-服务器系统的服务器节点处根据本发明一个实施例的方法的新的DSR方法的示意图。FIG. 7A shows a schematic diagram of a new DSR method of a method according to an embodiment of the present invention at a server node based on a client-server system.

图7B示出根据本发明一个实施例的方法的在一个客户机-服务器网络系统的服务器节点处产生的示意音标字图。FIG. 7B shows a schematic phonetic graph generated at a server node of a client-server network system according to a method according to an embodiment of the present invention.

图7C示出从图5中所示的音标字图扩展的一个示意音标字图。FIG. 7C shows a schematic phonetic word map extended from the phonetic word map shown in FIG. 5 .

图8示出采用根据本发明一个实施例的新的DSR方法的示意系统的方框图。FIG. 8 shows a block diagram of a schematic system employing the new DSR method according to one embodiment of the present invention.

具体实施方式Detailed ways

在下文对本发明的详细描述中，给出各种具体细节，以提供对本发明的彻底理解。但是，本领域的普通技术人员显然将认识到可以在没有这些具体细节的情况下实现根据本发明的方法。在其他方面，没有描述公知的方法、处理、部件和电路，以避免对本发明的各个方面造成混淆。In the following detailed description of the invention, various specific details are given in order to provide a thorough understanding of the invention. It will be apparent, however, to one of ordinary skill in the art that methods in accordance with the present invention may be practiced without these specific details. In other instances, well-known methods, processes, components and circuits have not been described in order not to obscure aspects of the invention.

根据本发明的方法包括将在下文中描述的各种步骤。这些步骤可以用硬件部件来实现，或者可以体现机器可执行的指令中，其可以被用于使用该指令编程的通用处理器来执行该步骤。另外，该步骤可以通过硬件和软件的组合来执行。The method according to the invention comprises various steps which will be described hereinafter. The steps may be implemented by hardware components, or may be embodied in machine-executable instructions, which can be used in a general purpose processor programmed with the instructions to perform the steps. Also, this step can be performed by a combination of hardware and software.

本发明揭示了一种新的DSR方法，其不同于传统的DSR方法，并且获得改进的识别精度。该新的DSR方法利用与SD声音识别模型相关的较低WER的优点。这通过把该语音识别处理分割和分布到在客户机设备的声音识别和在服务器设备的语言处理来实现。因此在一个客户机设备捕获语音之后，其根据一种SD个性化的声音模型来进行声音处理。该处理获得关于最可能说出的内容的一个N个最佳假设(N-besthypothesis)。接着，形成一个字图包，并且通过网络发送到一个服务器设备。最后，该服务器设备接收该字图包，对其进行解码，并且进行语言处理，获得一个所识别的字串。The present invention discloses a new DSR method which differs from conventional DSR methods and achieves improved recognition accuracy. This new DSR approach takes advantage of the lower WER associated with SD sound recognition models. This is accomplished by splitting and distributing the speech recognition processing to speech recognition at the client device and language processing at the server device. So after a client device captures speech, it performs sound processing according to an SD personalized sound model. This process obtains an N-best hypothesis (N-best hypothesis) about the most likely spoken content. Next, a word map packet is formed and sent to a server device through the network. Finally, the server device receives the word map packet, decodes it, and performs language processing to obtain a recognized word string.

一旦人的语音被在一个客户机设备处捕获，则它将被进行声音分析。声音识别涉及提取所捕获的语音信号的特征，以及在一个声音模型中搜索在所捕获语音的所提取特征和在该声音模型中存储的已知以前记录的语音特征之间的一个或多个可能的匹配。在一个优选实施例中，一个音标字图可以被用于表示该语音。该声音模型可以是一个个性化的SD模型，其由用户，例如移动电话或PDA的所有者，进行个人训练。从而，该字图被包装并且通过一个网络发送到一个中央服务器。该服务器然后可以利用所选择的语言模型来对该音标字图进行语言处理，并且产生一个所识别的字串。Once the human speech is captured at a client device, it is subjected to acoustic analysis. Voice recognition involves extracting features of a captured speech signal and searching in an acoustic model for one or more possibilities between the extracted features of the captured speech and known previously recorded speech features stored in the acoustic model. match. In a preferred embodiment, a phonetic graph may be used to represent the speech. The sound model can be a personalized SD model which is personally trained by the user, for example the owner of a mobile phone or PDA. Thereby, the word graph is packaged and sent to a central server through a network. The server can then perform linguistic processing on the phonetic graph using the selected language model and generate a recognized word string.

现在参见图1，其中示出一个示意的DSR系统。客户机1、客户机2和C1-C4是采用根据本发明一个实施例的新的DSR方法的不同客户机设备的例子。客户机1是一个个人数字助理110。客户机2是一个移动电话120，以及C1-C4是作为在一个示意的LAN 138中的节点的计算机终端。在每种情况中，该客户机设备被配置为捕获人的语音。例如，对于PDA 110的人的语音112以及对于移动电话120的。所捕获的信号然后被进行声音处理，并且在该客户机设备产生所获得的包含一系列音标字图的所获得数据包。Referring now to FIG. 1, a schematic DSR system is shown. Client 1, Client 2, and C1-C4 are examples of different client devices employing the new DSR method according to one embodiment of the present invention. Client 1 is a personal digital assistant 110 . Client 2 is a mobile phone 120, and C1-C4 are computer terminals as nodes in an illustrative LAN 138. In each case, the client device is configured to capture human speech. For example, human voice 112 for PDA 110 and for mobile phone 120. The captured signal is then acoustically processed and an obtained data packet containing a series of phonetic word maps is generated at the client device.

然后，该数据包被通过网络100发送，例如通过互联网发送。对于在PDA 110所发出的语音，处理服务器150接收该数据包(未示出)，然后对包含在该数据包中的数据执行声音处理，产生所识别的字串152。类似地，主机180和主机160从移动电话120(客户机2)和C1130接收数据包，并且分别产生所识别的字串182和162。客户1可以通过最初读出一个字串和训练PDA来对PDA 110进行训练。类似地，客户2可以通过读出一个字串并且对该移动电话提供所他所说的文本来训练移动电话120。一旦客户机设备已经被训练，其可以形成一个个性化的声音模型，其可以被客户机设备用作为检索和比较所说的内容的基础。The data packet is then sent over the network 100, for example over the Internet. For the voice sent at the PDA 110, the processing server 150 receives the data packet (not shown), and then performs sound processing on the data contained in the data packet to generate the recognized word string 152. Similarly, host 180 and host 160 receive data packets from mobile phone 120 (client 2) and C1130 and generate identified strings 182 and 162, respectively. Client 1 can train the PDA 110 by initially reading a word string and training the PDA. Similarly, customer 2 can train mobile phone 120 by reading a word string and providing the mobile phone with the text of what he said. Once a client device has been trained, it can develop a personalized voice model that can be used by the client device as a basis for retrieving and comparing spoken content.

现在参见图2，其中示出一个示意的DSR系统的方框图。该图示出在客户机设备220所说出的人的语音信号200如何在服务器设备252被转换为所识别字串260。该信号200被在一个基于客户机-服务器系统的客户机220节点处捕获，并且在该客户机设备220处执行声音处理和识别230。如图2中所示，客户机220可以是一个计算机终端210。但是，客户机220还可以是一个移动电话、PDA等等。实际上，该客户机设备是具有接收人的语音信号200，对其进行声音处理并且识别其音标构成，以及准备所获得的音标数据用于通过例如通信网络240这样的网络242发送的功能的任何设备。、Referring now to FIG. 2, a schematic block diagram of a DSR system is shown. The diagram shows how a human speech signal 200 spoken at a client device 220 is converted into a recognized word string 260 at a server device 252 . The signal 200 is captured at a client 220 node of a client-server based system, and sound processing and recognition 230 is performed at the client device 220 . As shown in FIG. 2 , the client machine 220 may be a computer terminal 210 . However, the client machine 220 could also be a mobile phone, PDA or the like. In fact, the client device is any device having the functions of receiving a person's speech signal 200, performing sound processing on it and recognizing its phonetic composition, and preparing the obtained phonetic data for transmission over a network 242 such as a communication network 240. equipment. ,

仍然参见图2，该处理服务器254处理该新的DSR系统的语言处理250阶段。该处理服务器254可以是一个服务器计算机系统252，其能够接收音标数据并且对其进行语言分析，以获得字串260。一旦该服务器252已经完成语言处理250，则该服务器252产生一个所识别的字串260，然后其可以被通过网络242发送回客户机220。Still referring to FIG. 2, the processing server 254 handles the language processing 250 stage of the new DSR system. The processing server 254 may be a server computer system 252 capable of receiving phonetic data and performing linguistic analysis on it to obtain word strings 260 . Once the server 252 has completed language processing 250 , the server 252 generates a recognized word string 260 which can then be sent back to the client 220 over the network 242 .

现在参见图3，其中示出根据本发明一个实施例的方法而设计的新的DSR系统的示意图。该图表示当采用新的DSR方法时人的语音信号300所经受的一个示意操作序列。Referring now to FIG. 3, there is shown a schematic diagram of a new DSR system designed according to the method of one embodiment of the present invention. The figure shows a schematic sequence of operations to which the human speech signal 300 is subjected when the new DSR method is employed.

在功能块310处，在该客户机设备340接收人的语音信号300。例如，一个人可以对麦克风说话，其将捕获人的语音信号300。在一个示意实施例中，人可能被限于可以被用作为控制命令的单词命令，即，一个自动语音识别(ASR)系统。在另一个实施例中，该系统可以包括大词汇连续语音识别(LVCSR)。根据本实施例的方法采用ASR和LVCSR。At functional block 310 , a human voice signal 300 is received at the client device 340 . For example, a person can speak into a microphone, which will capture the person's voice signal 300 . In an exemplary embodiment, a human may be limited to word commands that can be used as control commands, ie, an automatic speech recognition (ASR) system. In another embodiment, the system may include large vocabulary continuous speech recognition (LVCSR). The method according to this embodiment employs ASR and LVCSR.

在功能块310处，所捕获的人的语音信号300被一个模数转换器转换为数字信号。在功能块320中，所获得数字化信号的特征例如被一个特征参数提取模块820所提取(参见图8)。功能块320可以被进一步细分为：如功能块322所示的结束点检测、如功能块324中所示的预先强调滤波(pre-emphasizing filtration)，以及在功能块326所示的特征计算。在结束点检测过程中，对于一个语音特征的开始和结束进行检测。换句话说，对一个特征何时结束以及另一个特征何时开始进行判断。在预先强调滤波过程中，该语音信号被滤波，以放大该语音信号的重要特征。最后，在该特征计算过程中，该语音信号的特征被计算，以形成一系列可能的候选项。At functional block 310, the captured human speech signal 300 is converted to a digital signal by an analog-to-digital converter. In functional block 320, the features of the obtained digitized signal are extracted, for example, by a feature parameter extraction module 820 (see FIG. 8 ). Functional block 320 can be further subdivided into: end point detection as shown in functional block 322 , pre-emphasizing filtration as shown in functional block 324 , and feature computation as shown in functional block 326 . During the end point detection process, the start and end of a speech feature are detected. In other words, make a judgment about when one feature ends and another begins. During pre-emphasis filtering, the speech signal is filtered to amplify important features of the speech signal. Finally, during the feature computation, features of the speech signal are computed to form a series of possible candidates.

相应地，在已经提取语音特征之后，在功能块342处对所捕获的人的语音信号进行声音处理。该声音处理是提供在功能块320识别的语音特征与已知的音标单元(phonetic unit)的匹配。因此，声音处理包括接收一个人的语音信号，并且使用一个声音模型来重新产生最接近于表示该输入语音的一系列声音。该声音模型可以由例如音标级、半音节或音节单元这样的子字单元来组织。但是，也可以应用使用其他音标单元的声音模型。Accordingly, the captured human speech signal is acoustically processed at functional block 342 after the speech features have been extracted. The sound processing is to provide the matching of the speech features identified at block 320 with known phonetic units. Thus, sound processing involves taking a human speech signal and using a sound model to regenerate a series of sounds that most closely represent the input speech. The sound model may be organized by sub-word units such as phonetic level, demisyllable or syllable units. However, sound models using other phonetic units can also be applied.

执行声音处理的一个方法是通过利用隐藏马尔可夫模型(HMM)。本领域所公知的HMM是由声音状态的一个马尔可夫链所构成的随机有限状态自动控制。这些状态模拟语音的瞬时结构，即，该状态如何随时间而变化。用于每个这些状态的概率函数、模拟发射和声音矢量的观察由HMM所表示。One way to perform sound processing is by utilizing Hidden Markov Models (HMMs). The HMM known in the art is a stochastic finite state automatic control composed of a Markov chain of sound states. These states model the temporal structure of speech, ie, how this state changes over time. The probability functions for each of these states, the observations of simulated emissions and sound vectors, are represented by HMMs.

一旦一个HMM被用于表示该语音特征，则一个搜索空间被确定，并且可以在一个声音模型内对以前形成的HMM进行搜索。该HMM可以在一个客户机设备的训练阶段过程中形成，该训练阶段可能在一个人第一次使用该客户机设备340时出现。例如，当一个人购买一个移动电话时，该电话可以具有一个这样的按键，当该按键被按下时可以把该电话置于训练模型。在该模型的过程中，该人可能被要求说出字、音素或者其他出现在屏幕上的音标单元。然后该移动电话可以捕获由该用户所产生的声音，并且通过图3的功能块322-326来运行它，以提取与该声音相关的特征并且形成一个HMM。在该训练阶段过程中，由于客户机设备340确切地已知该声音所表达的字，因此它可以存储两块信息(所读出的字及其提取的特征)，并且创建对该移动电话的用户个性化的声音模型。Once an HMM is used to represent the speech features, a search space is determined and a search of previously formed HMMs can be performed within an acoustic model. The HMM may be formed during a client device's training phase, which may occur when a person uses the client device 340 for the first time. For example, when a person buys a mobile phone, the phone may have a key that, when pressed, puts the phone into a training model. During the modeling process, the person may be asked to speak words, phonemes, or other phonetic units that appear on the screen. The mobile phone can then capture the sound produced by the user and run it through the functional blocks 322-326 of FIG. 3 to extract features related to the sound and form an HMM. During this training phase, since the client device 340 knows exactly what word the voice represents, it can store two pieces of information (the spoken word and its extracted features) and create a User personalized sound models.

通过创建一个个性化的声音音标模型，该移动电话可以利用一个SD声音模型，其具有比SI声音模型好2-3倍的WER。By creating a personalized sound phonetic model, the mobile phone can utilize an SD sound model with a WER 2-3 times better than the SI sound model.

在功能块334中，配置一个优化处理。可以使用任何知识来源来对所说的字进行判断。例如，由客户机设备的用户所训练的单个音素的声音音标模型可以被单独使用或者与其他知识来源相结合使用，该知识来源例如为发音词典。但是，如果该用户不是实际使用该客户机设备的人，则实际使用该设备的人应当训练该设备，因为这是该人的语音特征，这会导致更加精确的识别处理。In functional block 334, an optimization process is configured. Any source of knowledge may be used to make judgments about the spoken word. For example, phonetic phonetic models of individual phonemes trained by users of client devices may be used alone or in combination with other sources of knowledge, such as pronunciation dictionaries. However, if the user is not the person who actually uses the client device, the person who actually uses the device should train the device, since this is the person's speech characteristics, which will result in a more accurate recognition process.

在功能块336处，在完成声音模型的搜索之后确定一个N个最佳假设。但是，除了N个最佳假设之外，可以利用一个单独最佳假设策略(single-best hypothesis strategy)。在功能块338中，产生一个音标字图(Pword图)。该pword图的主要思想是在关于实际说出的音素的不确定性较高的的语音信号的区域中提出音标的替换选项。所期望获得的优点是声音识别处理与复杂语言模型的应用相分离。该语言模型可以被随后根据本发明的实施例的方法应用于在服务器计算机处执行的后处理中。字替换选项的数目是可以根据用户所需的不确定性级别或精度而变化的设计参数。At functional block 336, an N-best hypothesis is determined after completing the search for the acoustic model. However, instead of the N-best hypotheses, a single-best hypothesis strategy can be utilized. In function block 338, a phonetic word graph (Pword graph) is generated. The main idea of the pword map is to propose phonetic alternatives in regions of the speech signal where the uncertainty about the actually spoken phoneme is high. A desirable advantage is the separation of the voice recognition processing from the application of complex language models. This language model can then be applied by methods according to embodiments of the present invention in post-processing performed at the server computer. The number of word substitution options is a design parameter that can vary depending on the level of uncertainty or precision desired by the user.

一旦一个Pword图已经在功能块338处产生，则Pword图可以被打包并且发送到该服务器设备。任何发送介质以及任何打包方案可以被用于把该Pword图发送到该服务器。例如，一个网际协议数据报可以通过把该Pword图打包为数据报而在所示的功能块354处产生。该数据报然后可以通过网络350发送，如功能块352所示。在本发明的一个优选实施例中，网络350可以是互联网，但是可以使用例如局域网这样的任何其他类型的网络。Once a Pword graph has been generated at functional block 338, the Pword graph can be packaged and sent to the server device. Any delivery medium and any packaging scheme can be used to send the Pword graph to the server. For example, an Internet Protocol datagram can be generated at functional block 354 as shown by packaging the Pword graph into a datagram. The datagram may then be sent over network 350 as indicated by functional block 352 . In a preferred embodiment of the present invention, network 350 may be the Internet, but any other type of network such as a local area network may be used.

在功能块356中，包含Pword图的数据报被一个服务器所接收，并且该Pword图被从该数据报上除去。在功能块382处，可以在Pword图上执行语言处理。该语言处理涉及把该系列声音组织在一个Pword图中，并且把其转换为实际的字。所接收的Pword图被一个节点接着一个节点地分析。对于每个节点，对可用和由用户所选择的特定语言模型检查该字典和语法规则。在一个实施例中，该客户机设备可以具有一个语言选择按键，使得用户用英语或汉语或者该系统可以支持的任何其他语言来说话。在功能块390处，根据由该客户机设备所发送的Pword图形成一个实际Pword图(参见图5)。最后，在功能块386处，采用一种搜索算法来通过一个字典和语法词典来确定所识别的字串。In functional block 356, the datagram containing the Pword map is received by a server, and the Pword map is removed from the datagram. At functional block 382, linguistic processing may be performed on the Pword graph. The language processing involves organizing the series of sounds in a Pword graph and converting it into actual words. The received Pword graph is analyzed node by node. For each node, the dictionary and grammar rules are checked against the specific language model available and selected by the user. In one embodiment, the client device may have a language selection button that allows the user to speak in English or Chinese or any other language that the system may support. At functional block 390, an actual Pword map is formed from the Pword map sent by the client device (see FIG. 5). Finally, at functional block 386, a search algorithm is used to determine the recognized word string through a dictionary and a grammar dictionary.

现在参见图4，其中示出一个客户机设备的方框图。该客户机设备可以是多个便携式设备，例如移动电话、PDA、便携式计算机或者可以由与位于不同地理位置的另一个设备通信的用户所使用的任何其他设备。Referring now to FIG. 4, a block diagram of a client device is shown. The client device may be a number of portable devices such as a mobile phone, PDA, portable computer, or any other device that may be used by a user communicating with another device located in a different geographic location.

一旦一个人决定与一个远程服务器进行通信，则该人将具有与例如麦克风这样的客户机设备的接收器模块说话的选项。但是，该客户机设备对所捕获的人的语音信号400执行一系列操作。这些操作在图4中的功能块420和450中示出。在人的语音信号400上执行的操作通常可以被分为两种。在功能块422、424、426和428处表示的第一系列功能过程中，人的语音信号被经过一个处理，其中该人的语音信号被根据本领域所公知的方法转换为数字信号。然后，在功能块412处，该数字化的信号被显示给一个特征提取模块，其提取在该人的语音信号中存在的特征。这些特征可以表示在被定期测量的语音信号中的集中的能量，并且可以被表示为声音矢量的总和，例如在功能块428中所示为x1、x2、…、xT。但是，还可以提取本领域所公知的该声音信号的其他特征。Once a person decides to communicate with a remote server, that person will have the option of speaking into the receiver module of the client device, such as a microphone. However, the client device performs a series of operations on the captured human voice signal 400 . These operations are shown in functional blocks 420 and 450 in FIG. 4 . The operations performed on the human speech signal 400 can generally be divided into two types. In a first series of functional processes represented at functional blocks 422, 424, 426 and 428, a human speech signal is subjected to a process in which the human speech signal is converted to a digital signal according to methods known in the art. Then, at functional block 412, the digitized signal is presented to a feature extraction module, which extracts the features present in the person's speech signal. These features may represent the concentrated energy in the speech signal being measured periodically, and may be represented as a sum of sound vectors, eg shown in function block 428 as x1, x2, . . . , xT. However, other features of the sound signal known in the art may also be extracted.

在功能块450处，声音矢量x1、x2、…、xT被提供给一个声音处理器，其可以识别产生该x1、x2、…、xT声音矢量的语音。为了实现该任务，该声音处理器可以参考一个声音模型，其包含用于由使用该客户机设备的人以前发出的各种语音的声音矢量。该模型可以容易地由将最初使用该客户机设备的人所训练。例如，当第一次购买的人可以编程或训练该客户机设备。该设备可以具有一个“训练我(train me)”的开关，当该开关被激活时，将在其屏幕上闪现文字，提示用户对该文字发音。该设备例如可以根据特定的设计参数闪现文字、音素、音节、半音节或者任何字的其他单元。音标单元的选择对基于本发明的实施例的方法没有影响。At functional block 450, the sound vectors x1, x2, . To accomplish this task, the sound processor may refer to a sound model containing sound vectors for various speech sounds previously uttered by a person using the client device. The model can be easily trained by the person who will initially use the client device. For example, a person may program or train the client device when purchasing it for the first time. The device could have a "train me" switch that, when activated, would flash text on its screen prompting the user to pronounce the text. The device can, for example, flash words, phonemes, syllables, demisyllables or other units of any word according to certain design parameters. The selection of the phonetic symbol unit has no influence on the method based on the embodiment of the present invention.

因此，例如该设备闪现单词“apple”，并且用户说出“apple”。该设备将例如通过麦克风捕获由该用户所说出的语音产生的语音信号。本领域的普通技术人员知道该信号是一个模拟信号，当在一个示波器上观看时，该信号可能类似于语音信号400。在捕获该信号之后，该声音处理器可以使用在功能块412、422、424、426和428的功能，以提取由说出单词“apple”的用户所产生的信号的特征，导致产生一组声音矢量。然后，该表达被与该单词“apple”的表示一同存储在一个数据库中。该处理可以一个单词接着一个单词地连续进行。显示给该设备的单词越多，则用于该用户或设备拥有者的声音模型越完整。一旦该模型被完成，则该设备被准备用于在功能块450出现的声音识别。So, for example, the device flashes the word "apple" and the user speaks "apple". The device will capture the voice signal produced by the user's spoken voice, for example through a microphone. Those of ordinary skill in the art know that this signal is an analog signal, which may resemble speech signal 400 when viewed on an oscilloscope. After capturing the signal, the sound processor may use the functions at functional blocks 412, 422, 424, 426, and 428 to extract features of the signal produced by the user speaking the word "apple," resulting in the generation of a set of sounds vector. This expression is then stored in a database along with the representation of the word "apple". This processing can be performed continuously word by word. The more words that are displayed to the device, the more complete the sound model for that user or device owner. Once the model is complete, the device is ready for voice recognition which occurs at functional block 450 .

该声音处理器现在负责识别所说出的语音的任务。它通过对包含被训练的声音模型的数据库进行搜索而完成该任务。在功能块446处，进行搜索，以发现用于该语音的一个或多个匹配。对所说出的单词的判断可以通过一个优化处理来实现。几种搜索处理方法已经被开发并且是本领域所公知的。例如，可以使用具有修改选项的一个定向搜索策略。另外，可以应用一个树词典或一次完成的算法。特定搜索策略的选择不影响或改变根据本发明的实施例的方法。The sound processor is now responsible for the task of recognizing spoken speech. It does this by searching a database containing trained sound models. At functional block 446, a search is performed to find one or more matches for the speech. The judgment of the spoken word can be realized through an optimization process. Several search processing methods have been developed and are known in the art. For example, a directed search strategy with modification options can be used. Alternatively, a dictionary of trees or a one-shot algorithm can be applied. The choice of a particular search strategy does not affect or alter the method according to embodiments of the present invention.

在功能块442处，包含声音模型的数据库被收集。本实施例的语音识别系统的训练阶段发生在该功能块处。在功能块444处，一个语言模型被考虑，以连接到在功能块446处使用的搜索策略。但是，在客户机方添加一个语言模型可以是一种设计选择。不必包含一个语言模型来实现根据本实施例的方法。At functional block 442, a database containing sound models is collected. The training phase of the speech recognition system of this embodiment occurs at this functional block. At functional block 444 , a language model is considered to link to the search strategy used at functional block 446 . However, adding a language model on the client side can be a design choice. It is not necessary to include a language model to implement the method according to this embodiment.

该搜索结果在功能块448处产生。在此，产生一个N个最佳假设。尽管，还可以在一个优选实施例中使用单个最佳假设，但是N个最佳假设产生更高的精度，因为它不但对所说出的内容提供单一的猜测，而是多个猜测。在功能块452处，从该信息可以产生一个字图。一个字图的主要思想是字的替换。字图必须被证明在需要高精度的情况下是有效的。实际上，在图5中所示的一个字图显示具有类似声音、或特征、或声音矢量的字词。这种相似性可能造成混淆。例如，在汉语中的字“duo”和“dao”和“yao”在频谱分析器上看起来几乎相同。类似地，参见图5，字“dai”、“nai”和“mai”除了一个字母或音素之外相类似。这些在大多数语言中普遍的相似性可以通过使用在下文将参照图7A讨论的语言模型中给出的语法词典作进一步的分析。The search results are generated at functional block 448 . Here, an N-best hypothesis is generated. Although, a single best hypothesis could also be used in a preferred embodiment, the N-best hypothesis yields higher accuracy because it provides not just a single guess at what was said, but multiple guesses. At functional block 452, a word map may be generated from this information. The main idea of a word graph is word replacement. Word graphs must be proven efficient in situations where high precision is required. In fact, a word graph shown in Figure 5 shows words with similar sounds, or features, or sound vectors. This similarity can be confusing. For example, the words "duo" and "dao" and "yao" in Chinese look almost the same on a spectrum analyzer. Similarly, referring to FIG. 5, the words "dai", "nai" and "mai" are similar except for one letter or phoneme. These general similarities in most languages can be further analyzed by using a grammar lexicon given in the language model discussed below with reference to Figure 7A.

参见图4，一旦产生表示所说出的字的替换选项的字图，该设备可以把该信息作为一个二进制文件发送到一个远程服务器。该字图可以被表示为如图6B中所示的一个数据报中。但是，可以采用该数据的任何其他形式的打包。Referring to Figure 4, once a word map representing alternative options for a spoken word is generated, the device can send this information as a binary file to a remote server. The word map can be represented in a datagram as shown in Figure 6B. However, any other form of packing of the data may be employed.

现在参见图5，其中示出具有两级替换选项容量的字图的一个例子。在本例中，该实际读出的字词的汉语为：“wo yao mai zhong ke jian”，其含义“我要买中科健”(中科健是在中国股票市场上的一种股票的名称)。该字图是如图4中所示的声音处理器的输出，在功能块452处。该声音处理器把该设备所捕获的声音矢量与该声音模型相比较，并且对该声音处理器提供为每个字词提供三个替换选项。字512、511和510表示“yao”及其替换选项。字514、515和516表示“mai”及其替换选项，相应地，在图5中所示的字图可以与一个语言模型相结合而使用，其包括一个字典和语法词典，以确定由该字图所表示的单个最佳句子。把一个语言模型应用到该字图可以在一个服务器节点处进行，因为该语言处理是相当复杂的处理并且与声音识别处理无关。因此，本实施例的方法通过产生具有两级字替换选项的字图而利用SD声音模型的优点。该字图将被传送到一个服务器，其然后完成该识别处理，并且确定单个最佳句子。Referring now to FIG. 5, an example of a word graph having two levels of substitution option capacity is shown. In this example, the Chinese of the actually read word is: "wo yao mai zhong ke jian", which means "I want to buy Zhongkejian" (Zhongkejian is the name of a stock on the Chinese stock market ). This word map is the output of the sound processor as shown in FIG. 4 at functional block 452 . The sound processor compares the sound vectors captured by the device with the sound model and provides the sound processor with three alternatives for each word. Words 512, 511 and 510 represent "yao" and its replacement options. Words 514, 515, and 516 represent "mai" and its replacement options. Accordingly, the word graph shown in FIG. The single best sentence represented by the graph. Applying a language model to the word graph can be done at a server node, since the language processing is a rather complex process and has nothing to do with the voice recognition process. Therefore, the method of this embodiment takes advantage of the SD sound model by generating word graphs with two levels of word replacement options. The word graph will be transmitted to a server which then completes the recognition process and determines the single best sentence.

现在参见图6A，其中示出根据本发明一个实施例的发送处理。在功能块602处，由客户机设备产生一个音标字图。在功能块604处，该字图被转换为一个二进制文件，并且被打包用于通过网络发送。例如，在功能块604，一个TCP/IP数据报被用作为用于发送的载体。但是，可以使用对该字图打包以便于发送的任何其他方法，并且该特定的选择对于根据本发明的实施例的方法没有影响。在功能块606处，该数据报被发送到该服务器，并且在功能块608处，该数据报被在该服务器处接收。Referring now to FIG. 6A, there is shown a transmission process according to one embodiment of the present invention. At functional block 602, a phonetic word map is generated by the client device. At functional block 604, the word graph is converted to a binary file and packaged for sending over the network. For example, at functional block 604, a TCP/IP datagram is used as a vehicle for transmission. However, any other method of packaging the wordmap for transmission may be used, and this particular choice has no effect on methods according to embodiments of the present invention. At functional block 606, the datagram is sent to the server, and at functional block 608, the datagram is received at the server.

现在参见图6B，其中示出一个示意的网际协议数据报。在该数据报600的报头612部分中，包含本领域所公知的客户机设备的逻辑地址和服务器设备的逻辑地址以及任何其他控制信息。该数据区域610可以包括由该客户机设备所产生的音标字图的二进制表示。Referring now to FIG. 6B, an exemplary Internet Protocol datagram is shown. In the header 612 portion of the datagram 600, the logical address of the client device and the logical address of the server device and any other control information known in the art are contained. The data area 610 may include a binary representation of the phonetic graph generated by the client device.

现在参见图7a，其中示出服务器节点700的示意方框图。在功能块710处，如图6B中所示的TCP/IP数据报由服务器700所接收。在功能块712处，从其二进制形式对该字图解码，并且形成在相应的客户机节点(未示出)说出的内容的实际字图表示。在这一点处，该服务器具有该语音的N个假设表示的等价物。如本领域所公知那样，该服务器可以使用在该功能块720的一个语言模型以及如功能块718所示的字典，以进行搜索并且判断最可能的语音。Referring now to Figure 7a, a schematic block diagram of a server node 700 is shown. At functional block 710, a TCP/IP datagram is received by server 700 as shown in FIG. 6B. At functional block 712, the word graph is decoded from its binary form and an actual word graph representation of what was spoken at the corresponding client node (not shown) is formed. At this point, the server has the equivalent of N hypothetical representations of the speech. As known in the art, the server can use a language model at block 720 and a dictionary as shown at block 718 to search and determine the most likely speech.

在该处理过程中，对于每个音标字图节点(参见图7b和7c)，该服务器700通过检查该字典和语法词典而查找所选择的音标字。但是，本发明不限于该字典和语法模型。任何其他语言模型，例如还可以使用基于高速缓存的语言模型、基于触发器的语言模型以及长范围的三元语言模型(编入词典的无上下文的语法)。无论所使用的特定语言模型，在功能块720的结果是可以被存储的一个被识别的字串，或者可以被用作为该服务器700的命令。During this process, for each phonetic word graph node (see Figures 7b and 7c), the server 700 looks up the selected phonetic word by checking the dictionary and the grammar dictionary. However, the present invention is not limited to this dictionary and grammar model. Any other language models such as cache-based language models, trigger-based language models, and long-range trigram language models (lexicographically-encoded context-free grammars) can also be used. Regardless of the particular language model used, the result at function block 720 is a recognized string that can be stored, or used as a command to the server 700 .

现在参见图7b，其中示出一个示意的真实音标字图。该音标字图表示所说出的字串“我要买中科健(wo yao mai zhong ke jian)”。从该字图中，可以产生如图7c中所示的一个相应字图。在该处理中，该服务器对于每个音标字图(例如“yao”)搜索声音类似于“yao”的字。作为另一个例子，对于该音标字“zhong”，实际的字可能是“中”或者“重”或“种”(这是发音类似于“zhong”的字)。这些字的发音在英语上不类似，但是在汉语中它们是类似的。根据本实施例的方法不限于英语或汉语。可以使用能够构造一个语言模型的任何语言。Referring now to Figure 7b, a schematic graph of real phonetic characters is shown. The phonetic character graph represents the spoken word string "I want to buy Zhong Ke Jian (wo yao mai zhong ke jian)". From this word map, a corresponding word map as shown in Fig. 7c can be generated. In this process, the server searches for words that sound similar to "yao" for each phonetic word map (for example, "yao"). As another example, for the phonetic character "zhong", the actual character might be "zhong" or "zhong" or "kind" (which is a character that sounds like "zhong"). The pronunciation of these characters is not similar in English, but they are similar in Chinese. The method according to the present embodiment is not limited to English or Chinese. Any language capable of constructing a language model can be used.

再参见图7b，一旦对于每个音标字节点获得字替换选项，该服务器可以根据被查找到的这些字产生多个实际字的节点。然后在该音标字图中复制该拓扑关系，以获得如图7c中所示的扩展的音标字图，该图示出从图7b中所示的字图获得的音标字图。Referring again to FIG. 7b, once the word replacement options are obtained for each phonetic symbol node, the server can generate a plurality of actual word nodes according to the found words. This topological relationship is then replicated in the phonetic graph to obtain the extended phonetic graph as shown in Figure 7c, which shows the phonetic graph obtained from the graph shown in Figure 7b.

现在参见图7c，该扩展的音标字图被示出。在此，该服务器将根据一个语言模型考虑所读出序列的不同可能。例如，在该字“I”(主语)之后，该语言模型可能检测一个动词，例如“want”，而不是一个名词。相应地，可以采用一个修正策略，其中在“I”之后的名词不被进一步考虑，例如名词“medicine”可能不会跟随在作为主语的字“I”之后。按照这种方式，结果的搜索空间可能被大大地减小。类似地，一个词典可以被用于消除其他类似读音的字。在此，可以使用基于二元语言模型或三元语言模型的语言模型。该二元或三元语言模型的选择不影响根据实施例的方法。Referring now to Figure 7c, the extended phonetic graph is shown. In this case, the server takes into account the different possibilities of the read sequence on the basis of a language model. For example, after the word "I" (the subject), the language model might detect a verb such as "want" instead of a noun. Accordingly, a revision strategy can be employed wherein nouns following "I" are not considered further, eg the noun "medicine" may not follow the word "I" as subject. In this way, the resulting search space can be greatly reduced. Similarly, a dictionary can be used to eliminate other similar-sounding words. Here, a language model based on a bigram language model or a trigram language model can be used. The choice of the bigram or trigram language model does not affect the method according to the embodiments.

现在参见图8，其中示出包括客户机设备、服务器设备和一个通信网络的语音识别系统的示意框图。语音输入800可以是一个用户的名字，例如John。该语音输入800可以被连接到属于John的一个客户机设备的麦克风所捕获，例如John的移动电话或PDA。John可以使用其设备的训练模型来训练他的设备识别他的语音。位于客户机设备810的声音模型824被用于该训练模型中。当John被提示说出不同的字、短语或句子时，该语言模型收集对应于每个语言的数据。当John准备通过一个通信网络840与一个远程服务器850进行通信时，他可以切断该训练模式，并且开始说话，就好像他与另一个人进行普通对话那样。该客户机设备810将捕获John的语音并且使其通过特征提取模块822，以按照如本领域普通技术人员所公知那样对该模拟人的语音信号800执行一系列前端处理。而在现有技术的模型中，根据本发明一个实施例，所提取的特征被发送到该服务器，用于语言处理，一个附加功能出现在该服务器设备处，即，导致产生一个音标字图的声音处理。由此，一个实施例利用SD声音模型的优点，因为John能够个性化地训练该设备，因此导致获得一个SD个性化的声音模型。现有技术不能够利用与SD模型相关的较低WER的优点，在该客户机收集的特征被直接发送到该服务器，并且该服务器执行声音识别和分析。通过该现有技术使用SD模型是不实际的，因为该服务器服务于许多用户而不知道他们的身份。因此，该现有技术被限于SI模型，这容易导致较高的错误率。Referring now to FIG. 8, there is shown a schematic block diagram of a speech recognition system including client devices, server devices and a communication network. Voice input 800 may be a user's name, such as John. The speech input 800 may be captured by a microphone connected to a client device belonging to John, such as John's mobile phone or PDA. John can use his device's trained model to train his device to recognize his speech. The acoustic model 824 located at the client device 810 is used in the training model. When John is prompted to speak different words, phrases or sentences, the language model collects data corresponding to each language. When John is ready to communicate with a remote server 850 via a communication network 840, he can switch off the training mode and start talking as if he were having a normal conversation with another person. The client device 810 will capture John's speech and pass it through the feature extraction module 822 to perform a series of front-end processing on the simulated human speech signal 800 as known to those of ordinary skill in the art. Whereas in prior art models, according to one embodiment of the invention, the extracted features are sent to the server for language processing, an additional function occurs at the server device, i.e., resulting in the generation of a phonetic graph sound processing. Thus, an embodiment takes advantage of the SD sound model, since John is able to train the device individually, thus resulting in an SD personalized sound model. Prior art cannot take advantage of the lower WER associated with SD models, where features collected at the client are sent directly to the server, and the server performs sound recognition and analysis. Using the SD model with this prior art is impractical because the server serves many users without knowing their identities. Therefore, this prior art is limited to the SI model, which easily leads to a higher error rate.

一旦该声音处理器接收所提取的特征，它搜索由John的声音所训练的与说话者相关的声音模型。所获得匹配是可以用数据报发送到该服务器850的已知音标单元830。该数据报被在服务器850处所接收，并且提供到一个与读音词典857相结合的一个语言处理器855，并且语言模型859确定所识别的字串。Once the voice processor receives the extracted features, it searches for a speaker-dependent voice model trained by John's voice. The obtained matches are known phonetic units 830 that can be sent to the server 850 in a datagram. The datagram is received at server 850 and provided to a language processor 855 in conjunction with a pronunciation dictionary 857 and language model 859 to determine the recognized word string.

Claims

1. a distributed speech recognition method comprises:

Voice signal a client node recipient;

Discern the feature of described people's voice signal;

Identification is corresponding to the known phonetic symbol unit of the described feature that is identified;

Formation comprises the packet of at least one described known phonetic symbol unit; And

Described packet is sent to a server node.

2. distributed speech recognition method according to claim 1, wherein said client node is selected from mobile phone, personal digital assistant and portable computer system.

3. distributed speech recognition method according to claim 1, the feature of wherein discerning described people's voice signal comprises:

Voice signal to described people is carried out end point detection;

Filtering is emphasized in voice signal execution to described people in advance; And

Quantize described people's voice signal.

4. distributed speech recognition method according to claim 1 is wherein discerned corresponding to the known phonetic symbol unit pack of the described feature that is identified and is drawn together sound model of search, comprising:

The sub-word cell of simulating by the hidden Markov model of the Markov chain that comprises a sound status.

5. distributed speech recognition method according to claim 1 is wherein discerned corresponding to the known phonetic symbol unit pack of the described feature that is identified and is drawn together sound model relevant with the speaker of use.

6. distributed speech recognition method according to claim 1, wherein said known phonetic symbol unit forms a phonetic symbol word figure.

7. distributed speech recognition method according to claim 1, wherein said packet comprise the binary representation of a source address, destination address and described known phonetic symbol unit.

8. distributed speech recognition method according to claim 1, wherein said packet is sent by the internet.

9. a distributed speech recognition system comprises:

Client node, comprising:

The characteristic extracting module of the feature of identification people's voice signal,

Be connected to the acoustic processing module of described characteristic extracting module, this acoustic processing module is from the known phonetic symbol of the described feature identification that is identified unit, and

Be connected to the transmitter module of described acoustic processing module, this transmitter module forms and comprises the packet of at least one described phonetic symbol unit and described packet is sent to a server; And

Server, comprising:

Receiver module is used for receiving described packet and removes described at least one described known phonetic symbol unit from described packet; And

Language processing module is used to discern and described at least one relevant word in described known phonetic symbol unit.

10. distributed speech recognition system according to claim 9, wherein said client node is selected from mobile phone, personal digital assistant and portable computer system.

11. distributed speech recognition system according to claim 9, wherein said characteristic extracting module also are configured to described people's voice signal is carried out end point detection, emphasized filtering and quantification in advance.

12. distributed speech recognition system according to claim 9, wherein said acoustic processing module comprises a sound model relevant with the speaker.

13. distributed speech recognition system according to claim 9, wherein said acoustic processing module forms a phonetic symbol word figure according to described known phonetic symbol unit.

14. distributed speech recognition system according to claim 9, wherein said transmitter module forms the binary representation of described word figure, and before sending described word figure, described binary representation and source address and destination address are together placed a datagram.

15. a client devices, comprising:

Receiver module is used for recipient's voice signal;

Characteristic extracting module, it is connected to described receiver module, is used to discern the feature of described people's voice signal;

The acoustic processing module, it is connected to described characteristic extracting module, is used for discerning known phonetic symbol unit from the described feature that is identified, and forms the packet that comprises at least one described phonetic symbol unit; And

Transmitter module, it is connected to described acoustic processing module, is used for described packet is sent to a server node.

16. client devices according to claim 15, wherein said characteristic extracting module also are configured to described people's voice signal is carried out end point detection, emphasized filtering and quantification in advance.

17. client devices according to claim 15, wherein said acoustic processing module comprises a sound model relevant with the speaker.

18. client devices according to claim 15, wherein said acoustic processing module forms a word figure from the described known phonetic symbol unit that is identified.

19. client devices according to claim 18, wherein said word figure are phonetic symbol word figure.

20. a server, comprising:

Receiver module is used for comprising the packet of at least one known phonetic symbol unit from a client node reception, and removes described at least one known phonetic symbol unit from described packet; And

Language processing module, it is connected to described receiver module, is used for determining and described at least one relevant word in known phonetic symbol unit.

21. server according to claim 20 wherein receives described packet by the internet.

22. server according to claim 20, wherein said packet are datagrams, wherein comprise:

Header portion with address and described known phonetic symbol unit of the address of described client node, described server.