CN103117058B

CN103117058B - Multi-speech-engine switching system and method based on a smart TV platform

Info

Publication number: CN103117058B
Application number: CN201210558320.XA
Authority: CN
Inventors: 陈冠霖; 赵波; 刘贤洪; 杨金峰; 毕端
Original assignee: Sichuan Changhong Electric Co Ltd
Current assignee: Sichuan Changhong Electric Co Ltd
Priority date: 2012-12-20
Filing date: 2012-12-20
Publication date: 2015-12-09
Anticipated expiration: 2032-12-20
Also published as: CN103117058A

Abstract

The invention relates to smart TV software platforms and discloses a multi-speech-engine switching method based on a smart TV platform. The method automatically finds the speech engine with the highest current recognition efficiency and switches to it, improving the user's speech interaction experience. In summary: when the user runs a speech application that uses the speech recognition function, the speech engine selection module obtains the collected speech data through the speech application interface and sends it to every speech engine module; it then records and compares the response time each speech engine module takes to return its recognition result, and switches to the speech engine module with the shortest response time. The invention also discloses a corresponding switching system. It is suited to implementing fast speech recognition on smart TVs.

Description

Multi-speech-engine switching system and method based on a smart TV platform

Technical field

The invention relates to smart TV software platforms, and in particular to a multi-speech-engine switching system and method based on a smart TV platform.

Background

As TV terminals have become more intelligent and networked, the content available to smart TVs has grown enormously and their functions have diversified, so operating a TV has become more frequent and more complicated. Applying speech recognition to smart TVs greatly simplifies user operation and improves the user experience. Because speech recognition consumes substantial system resources, smart TVs currently tend to implement it by connecting to a cloud server over the network.

The speech recognition engine that implements speech recognition on the server consists of a speech detection module, a feature extraction module, and a recognition search module.

The speech detection module detects and preprocesses the speech signal. The TV sends the raw speech data it has collected to this module, where the data is converted into a standard format (for example, 8 kHz sampling rate, 16-bit samples); an efficient signal detection algorithm then determines the start and end points of the speech.

The feature extraction module receives the detected speech data stream and extracts from it a stream of feature vectors. Speech features are the information that best reflects the essential properties of the speech signal, extracted with digital signal processing techniques. In this module the signal undergoes pre-emphasis, framing, windowing, frequency-domain transformation, cepstral transformation, differencing, and similar processing, finally yielding feature vectors of a few tens of dimensions.

The recognition search module matches the features of the unknown speech signal against the engine's acoustic model library, lexicon/dictionary, and recognition grammar to find the word sequence that best fits those features. The process can be described briefly as follows. By looking up the lexicon/dictionary, a sentence can be decomposed from a word sequence into a phoneme sequence. Combining the phoneme sequence with the acoustic model gives the corresponding sequence of acoustic model units. The feature vectors of the original speech are then matched against the acoustic model unit sequences of all possible sentence candidates, a matching probability is computed for each, and the acoustic model unit sequence with the maximum posterior probability is selected. The word sequence corresponding to that unit sequence is the text the engine outputs to the TV.
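The first three feature-extraction steps named above (pre-emphasis, framing, windowing) can be illustrated with a minimal sketch. The patent does not give parameters or formulas for these steps; the pre-emphasis coefficient 0.97, the 25 ms frame length, the 10 ms hop, the Hamming window, and all function names below are common textbook choices assumed here purely for illustration.

```python
import math

def pre_emphasize(samples, alpha=0.97):
    """Apply the first-order filter y[n] = x[n] - alpha * x[n-1] (alpha is an assumed value)."""
    if not samples:
        return []
    return [samples[0]] + [samples[n] - alpha * samples[n - 1]
                           for n in range(1, len(samples))]

def split_frames(samples, frame_len, hop):
    """Cut the signal into overlapping frames; a trailing partial frame is dropped."""
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]

def hamming_window(frame):
    """Taper a frame with a Hamming window before any frequency-domain transform."""
    n = len(frame)
    if n == 1:
        return list(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i, x in enumerate(frame)]

# 0.1 s of a 440 Hz tone sampled at 8 kHz (the example format mentioned in the text).
signal = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(800)]

# Assumed framing: 25 ms frames (200 samples) with a 10 ms hop (80 samples).
frames = [hamming_window(f) for f in split_frames(pre_emphasize(signal), 200, 80)]
```

The later steps (frequency-domain and cepstral transforms, differencing) would operate on `frames`; they are omitted to keep the sketch short.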

However, a server hosts multiple speech recognition engines. Using a single fixed engine for all recognition does not help improve the recognition efficiency of the smart TV and leads to a poor speech interaction experience. How to find the currently most efficient engine among several speech recognition engines and switch to it is therefore an urgent problem in speech interaction applications.

Summary of the invention

The technical problem to be solved by the invention is to provide a multi-speech-engine switching system and method based on a smart TV platform that automatically finds the speech engine with the highest current recognition efficiency and switches to it, improving the user's speech interaction experience.

The invention solves this problem with a multi-speech-engine switching system based on a smart TV platform, comprising a speech engine selection module and at least two speech engine modules. All speech engine modules are encapsulated behind a unified speech engine interface and are connected to the speech engine selection module through that interface. The speech engine selection module is connected to the speech application through a speech application interface.

Further, each speech engine module obtains, through the speech engine interface, the speech data sent by the speech engine selection module, recognizes the speech data, and returns the recognition result to the speech engine selection module. When the speech application uses the speech recognition function, the speech engine selection module obtains the collected speech data through the speech application interface, sends it through the speech engine interface to every speech engine module, receives the recognition results returned by all speech engine modules, records and compares the response time each module takes to return its result, and switches to the speech engine module with the shortest response time, so that the speech application calls the speech engine module with the highest recognition efficiency.

Further, switching to the speech engine module with the shortest response time means that the speech engine selection module connects, through the speech engine interface, to the speech engine module with the shortest response time and at the same time disconnects from the other speech engine modules.
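The structure described above, with several engines behind one unified interface and a selection module that keeps only the chosen connection, might be sketched as follows. The patent does not specify any programming interface, so `SpeechEngine`, `EchoEngine`, `EngineSelector`, and `switch_to` are hypothetical names, and the "connection" is modeled simply as a dictionary entry.

```python
from typing import Dict, Protocol

class SpeechEngine(Protocol):
    """Unified speech engine interface (hypothetical name and signature)."""
    def recognize(self, audio: bytes) -> str: ...

class EchoEngine:
    """Stand-in engine for illustration; a real engine would call a recognition service."""
    def __init__(self, label: str) -> None:
        self.label = label

    def recognize(self, audio: bytes) -> str:
        return f"{self.label}:{len(audio)} bytes"

class EngineSelector:
    """Speech engine selection module: tracks which engines are currently connected."""
    def __init__(self, engines: Dict[str, SpeechEngine]) -> None:
        self._all = dict(engines)
        self.connected = dict(engines)  # initially connected to every engine

    def switch_to(self, name: str) -> None:
        """Stay connected to the named engine and disconnect from all others."""
        self.connected = {name: self._all[name]}

selector = EngineSelector({"A": EchoEngine("A"), "B": EchoEngine("B")})
selector.switch_to("B")
```

Because every engine satisfies the same `recognize` signature, the selector never needs engine-specific code, which is the point of the unified interface.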

In addition, the invention provides a corresponding multi-speech-engine switching method based on a smart TV platform, comprising:

a. when the user runs a speech application that uses the speech recognition function, the speech engine selection module obtains the collected speech data through the speech application interface;

b. the speech engine selection module sends the speech data through the speech engine interface to every speech engine module;

c. each speech engine module recognizes the speech data and then returns the recognition result to the speech engine selection module;

d. the speech engine selection module records and compares the response time each speech engine module takes to return its recognition result, and switches to the speech engine module with the shortest response time.

Further, in step d, switching to the speech engine module with the shortest response time means that the speech engine selection module connects, through the speech engine interface, to the speech engine module with the shortest response time and at the same time disconnects from the other speech engine modules.
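Steps a to d can be sketched in a few lines. This is an illustrative simplification, not the patented implementation: engines are modeled as plain callables, they are queried one after another rather than simultaneously, and `select_fastest` and `make_engine` are hypothetical helpers whose artificial delays stand in for real recognition latency.

```python
import time
from typing import Callable, Dict, Tuple

Engine = Callable[[bytes], str]  # assumed shape of an engine behind the unified interface

def select_fastest(engines: Dict[str, Engine], audio: bytes) -> Tuple[str, Dict[str, float]]:
    """Send the same audio to every engine, time each response, return the fastest name."""
    response_times: Dict[str, float] = {}
    for name, recognize in engines.items():      # step b: send data to each engine
        start = time.perf_counter()
        recognize(audio)                         # step c: engine returns its result
        response_times[name] = time.perf_counter() - start
    fastest = min(response_times, key=response_times.get)  # step d: compare times
    return fastest, response_times

def make_engine(delay_s: float, text: str) -> Engine:
    """Build a fake engine that 'recognizes' after a fixed delay."""
    def recognize(audio: bytes) -> str:
        time.sleep(delay_s)
        return text
    return recognize

engines = {"slow": make_engine(0.05, "hello"), "fast": make_engine(0.01, "hello")}
audio = b"\x00\x01" * 80                          # step a: collected speech data (dummy bytes)
fastest, times = select_fastest(engines, audio)
```

After this call, the selection module would keep its connection to `fastest` and drop the others.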

The beneficial effects of the invention are as follows. By comparing the response time (that is, the recognition speed) with which each speech engine module returns its recognition result and switching to the module with the shortest response time, the speech application calls the speech engine module with the highest recognition efficiency, which improves the overall efficiency of speech recognition. Moreover, because the connection between the speech application and the speech engine selection module (the speech application interface) stays the same, the speech application does not need to know which speech engine module is involved when a switch occurs, which ensures the stability and continuity of speech recognition.
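The stability argument can be made concrete with a small sketch: the application only ever calls one fixed entry point, and an engine switch happens behind it. The class name `VoiceApplicationInterface`, its methods, and the two toy engines are assumptions made for illustration; the patent does not define this API.

```python
class VoiceApplicationInterface:
    """Stable entry point used by the speech application (hypothetical name)."""
    def __init__(self, engine):
        self._engine = engine

    def switch_engine(self, engine):
        """Called by the selection module; invisible to the application."""
        self._engine = engine

    def recognize(self, audio):
        """The only call the application makes, before and after any switch."""
        return self._engine(audio)

def engine_a(audio):
    return "volume up (engine A)"

def engine_b(audio):
    return "volume up (engine B)"

api = VoiceApplicationInterface(engine_a)
first = api.recognize(b"\x00\x01")   # served by engine A
api.switch_engine(engine_b)          # the switch happens behind the interface
second = api.recognize(b"\x00\x01")  # same call, now served by engine B
```

The application code on the `first` and `second` lines is identical, which is what lets recognition continue uninterrupted across a switch.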

Brief description of the drawings

Fig. 1 is an architecture diagram of the multi-speech-engine switching system based on a smart TV platform according to the invention;

Fig. 2 is a flow chart of the multi-speech-engine switching method based on a smart TV platform according to the invention.

Detailed description

The principle of the invention is as follows. Because the speech engine modules in the system differ in performance, some process speech data faster than others. A speech engine selection module can therefore record and compare the response time of each speech engine module when processing speech data, find the module with the shortest processing time and fastest response, and then switch the connection to that module. Because the application interface between the speech engine selection module and the speech application never changes, introducing the selection module also addresses system stability.

Referring to Fig. 1, the multi-speech-engine switching system based on a smart TV platform comprises a speech engine selection module and a plurality of speech engine modules. All speech engine modules are encapsulated behind a unified speech engine interface and are connected to the speech engine selection module through that interface. The speech engine selection module is connected to the speech application through a speech application interface.

Each speech engine module obtains, through the speech engine interface, the speech data sent by the speech engine selection module, recognizes the speech data, and returns the recognition result to the speech engine selection module. When the speech application uses the speech recognition function, the speech engine selection module obtains the collected speech data through the speech application interface, sends it through the speech engine interface to every speech engine module, receives the recognition results returned by all speech engine modules, records and compares the response time each module takes to return its result, and switches to the speech engine module with the shortest response time, so that the speech application calls the speech engine module with the highest recognition efficiency.

Fig. 2 shows the corresponding flow of the switching method, which comprises the following steps:

a. When the user runs a speech application that uses the speech recognition function, the speech engine selection module obtains the collected speech data through the speech application interface; the speech data originates from the audio signal captured by the smart TV's speech capture device;

b. The speech engine selection module sends the speech data through the speech engine interface to every speech engine module; because the modules are encapsulated behind a unified speech engine interface, every speech engine module receives the same speech data at the same time;

c. Each speech engine module recognizes the speech data and then returns the recognition result to the speech engine selection module;

d. The speech engine selection module records and compares the response time each speech engine module takes to return its recognition result, and switches to the speech engine module with the shortest response time: the speech engine selection module connects, through the speech engine interface, to the speech engine module with the shortest response time and at the same time disconnects from the other speech engine modules. From then on, the speech application obtains fast speech recognition by calling the speech engine module with the shortest response time, which improves the user's speech interaction experience.
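Since step b has every engine receive the same data at the same time, one way to approximate the flow is to dispatch the requests concurrently and time each response from a common start. The sketch below uses a thread pool for this; the patent does not prescribe threads or any particular concurrency mechanism, and `race_engines`, `timed_call`, `make_engine`, and the engine names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def timed_call(name, recognize, audio, started_at):
    """Run one engine and report how long its result took from the common start."""
    result = recognize(audio)
    return name, result, time.perf_counter() - started_at

def race_engines(engines, audio):
    """Send the same audio to all engines at once and wait for every result."""
    started_at = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = [pool.submit(timed_call, name, fn, audio, started_at)
                   for name, fn in engines.items()]
        outcomes = [f.result() for f in futures]  # results from all engines
    times = {name: elapsed for name, _, elapsed in outcomes}
    results = {name: result for name, result, _ in outcomes}
    fastest = min(times, key=times.get)           # shortest response time wins
    return fastest, results[fastest], times

def make_engine(delay_s, text):
    """Fake engine whose sleep stands in for network and recognition latency."""
    def recognize(audio):
        time.sleep(delay_s)
        return text
    return recognize

engines = {"cloud_a": make_engine(0.08, "next channel"),
           "cloud_b": make_engine(0.02, "next channel")}
fastest, text, times = race_engines(engines, b"\x00\x01" * 80)
```

The sketch waits for all engines before comparing, matching the description that the selection module receives the results returned by all speech engine modules.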

Claims

1. A multi-speech-engine switching system based on a smart TV platform, characterized by comprising: a speech engine selection module and at least two speech engine modules; all speech engine modules are encapsulated by a unified speech engine interface and are connected to the speech engine selection module through the speech engine interface; the speech engine selection module is connected to the speech application through a speech application interface;

the speech engine module is used to obtain, from the speech engine interface, the speech data sent by the speech engine selection module, to recognize the speech data, and then to return the recognition result to the speech engine selection module; the speech engine selection module is used, when the speech application uses the speech recognition function, to obtain the collected speech data through the speech application interface, to send the speech data through the speech engine interface to every speech engine module, to receive the recognition results returned by all speech engine modules, to record and compare the response time each speech engine module takes to return its recognition result, and to switch to the speech engine module with the shortest response time, so that the speech application can call the speech engine module with the highest recognition efficiency;

switching to the speech engine module with the shortest response time means that the speech engine selection module connects, through the speech engine interface, to the speech engine module with the shortest response time and at the same time disconnects from the other speech engine modules.

2. A multi-speech-engine switching method based on a smart TV platform, applied in the system of claim 1, characterized by comprising:

a. when the user runs a speech application that uses the speech recognition function, the speech engine selection module obtains the collected speech data through the speech application interface;

b. the speech engine selection module sends the speech data through the speech engine interface to every speech engine module;

c. each speech engine module recognizes the speech data and then returns the recognition result to the speech engine selection module;

d. the speech engine selection module records and compares the response time each speech engine module takes to return its recognition result, and switches to the speech engine module with the shortest response time;

in step d, switching to the speech engine module with the shortest response time means that the speech engine selection module connects, through the speech engine interface, to the speech engine module with the shortest response time and at the same time disconnects from the other speech engine modules.