
CN112236816A - Information processing device, information processing system, and video device - Google Patents

Information processing device, information processing system, and video device

Info

Publication number
CN112236816A
CN112236816A
Authority
CN
China
Prior art keywords
server
voice
program
unit
information
Prior art date
Legal status
Granted
Application number
CN201980026560.5A
Other languages
Chinese (zh)
Other versions
CN112236816B (en)
Inventor
尾崎哲
Current Assignee
Hisense Visual Technology Co Ltd
Toshiba Visual Solutions Corp
Original Assignee
Hisense Visual Technology Co Ltd
Toshiba Visual Solutions Corp
Priority date
Filing date
Publication date
Application filed by Hisense Visual Technology Co Ltd, Toshiba Visual Solutions Corp filed Critical Hisense Visual Technology Co Ltd
Publication of CN112236816A
Application granted
Publication of CN112236816B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 Sound input; Sound output
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/28 Constructional details of speech recognition systems
    • G10L 15/32 Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention relates to an information processing device, an information processing system, and a video device. The information processing device includes an acquisition unit, a determination unit, and a replacement unit. The acquisition unit acquires first speech recognition data, obtained by a first speech recognition device performing speech recognition on an uttered voice, together with a grammar analysis result for the first speech recognition data. The determination unit determines, based on the grammar analysis result, whether the first speech recognition data includes first program information related to a program. When the determination unit determines that the first speech recognition data includes the first program information, the replacement unit acquires second program information included in second speech recognition data produced by a second speech recognition device having a dictionary in which program-related information is registered, and replaces the first program information in the first speech recognition data with the second program information. This improves the accuracy with which program-related information is recognized in results produced by general-purpose speech recognition.

Description

Information processing device, information processing system, and video device
The present application claims priority to Japanese patent application No. 2018-175656, entitled "Information processing apparatus, information processing system, and video apparatus", filed with the Japan Patent Office on September 20, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present invention relate to an information processing apparatus, an information processing system, and a video apparatus.
Background
A technique for converting uttered speech into text data by speech recognition is known.
In addition, voice assistance services are known that operate AV (audio-visual) equipment and the like based on the results of such speech recognition.
Prior art documents
Patent document
Patent document 1: Japanese Laid-Open Patent Application No. 2004-029317
Disclosure of Invention
However, when a general-purpose voice assistance service is used, it may be difficult to recognize program-related information by speech with high accuracy.
An information processing device according to an embodiment of the present application includes an acquisition unit, a determination unit, and a replacement unit. The acquisition unit acquires first speech recognition data, obtained by a first speech recognition device performing speech recognition on an uttered voice, together with a grammar analysis result for the first speech recognition data. The determination unit determines, based on the grammar analysis result, whether the first speech recognition data includes first program information related to a program. When the determination unit determines that the first speech recognition data includes the first program information, the replacement unit acquires second program information included in second speech recognition data produced by a second speech recognition device that has a dictionary in which program-related information is registered, and replaces the first program information in the first speech recognition data with the second program information.
Drawings
Fig. 1 is a diagram of the overall configuration of an exemplary information processing system according to an embodiment of the present application;
Fig. 2 is a diagram of the hardware configuration of an exemplary television apparatus according to an embodiment of the present application;
Fig. 3 is a diagram of the functions of a television apparatus according to an embodiment of the present application;
Fig. 4 is a diagram of voice information transmitted by a television apparatus according to an embodiment of the present application;
Fig. 5 is a diagram of the functions of a program information recognition server according to an embodiment of the present application;
Fig. 6 is a diagram of a dictionary according to an embodiment of the present application;
Fig. 7 is a diagram of a speech recognition result according to an embodiment of the present application;
Fig. 8 is a diagram of the functions of a storage server according to an embodiment of the present application;
Fig. 9 is a diagram of the functions of an intention determination server according to an embodiment of the present application;
Fig. 10 is a sequence diagram of the flow of a speech recognition process according to an embodiment of the present application;
Fig. 11 is a diagram of a dictionary according to modification 1 of the embodiment of the present application;
Fig. 12 is a diagram of the overall configuration of an information processing system according to modification 2 of the embodiment of the present application.
Detailed Description
Fig. 1 is a diagram of the overall configuration of an information processing system S1 according to an embodiment of the present application. As shown in fig. 1, the information processing system S1 includes a television device 10, a program information recognition server 20, an intention determination server 30, and a storage server 40.
Each device included in the information processing system S1 is connected via a network such as the internet. Furthermore, the television apparatus 10 and the intention judging server 30 are connected to the voice assistance server 50 via a network. It should be noted that the information processing system S1 may include the voice assistance server 50.
The television apparatus 10 includes a voice input device such as a microphone, and inputs a voice uttered by a user. The television apparatus 10 transmits the input voice as a voice signal to the voice assistance server 50 and the program information recognition server 20. The television apparatus 10 receives the speech recognition result transmitted from the program information recognition server 20 described later, and transmits the received speech recognition result to the storage server 40. The television device 10 operates in accordance with an instruction signal received from an intention determination server 30 described later. The television device 10 is an example of the video device of the present embodiment.
The voice assistance server 50 is a device that provides a general-purpose voice assistance service. For example, the voice assistance server 50 performs speech recognition on a voice signal received from the television apparatus 10 and, based on the recognition result, performs Internet searches, control of various home appliances, and the like. The voice assistance server 50 converts the voice signal into text data by speech recognition and performs grammar analysis on the text data.
In the present embodiment, the text data obtained when the voice assistance server 50 converts a voice signal into text by speech recognition is referred to as first speech recognition data. Information that identifies the object and predicate contained in the first speech recognition data is referred to as the grammar analysis result. Details of the first speech recognition data and the grammar analysis result will be described later.
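As a concrete illustration of these two terms, the following sketch shows one possible shape for the first speech recognition data and its grammar analysis result. The field names, the example phrase, and the character-range representation are assumptions made here for illustration; the embodiment does not prescribe a data format.

```python
# Hypothetical shape of the first speech recognition data together with
# its grammar analysis result (character ranges of object and predicate).
first_recognition = {
    "text": "playback happy talk",  # first speech recognition data
    "parse": {
        "predicate": (0, 8),   # characters 0-8: "playback"
        "object": (9, 19),     # characters 9-19: "happy talk"
    },
}

def span(result, role):
    """Return the substring of the recognized text for a parsed role."""
    start, end = result["parse"][role]
    return result["text"][start:end]
```

With this shape, `span(first_recognition, "object")` yields the portion of the utterance that the later determination step inspects.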
The voice assistance server 50 transmits the first speech recognition data and the grammar analysis result for that data to the intention determination server 30. The voice assistance server 50 is an example of the first speech recognition device and of another speech recognition device in the present embodiment.
The program information recognition server 20 is a device that stores a dictionary in which information related to program content (hereinafter referred to as programs) is registered (this information is hereinafter referred to as program information), and that performs speech recognition on voice signals received from the television apparatus 10 based on the dictionary. The program information recognition server 20 transmits the second speech recognition data and the program information determination result to the television apparatus 10. The program information recognition server 20 is an example of the second speech recognition device and of the speech recognition device in the present embodiment.
The program information is information related to a program, and includes at least one of a program title, a program genre, and the name of a performer in the program. For example, in the present embodiment, the program information is a program title.
The second speech recognition data is the speech recognition result produced by the program information recognition server 20; more specifically, it is the text data obtained when the program information recognition server 20 converts the voice signal into text based on the dictionary in which the program information is registered. The program information determination result is information that identifies the portion of the second speech recognition data corresponding to the program information. The program information included in the second speech recognition data is referred to here as second program information.
The storage server 40 acquires, via the television apparatus 10, the second speech recognition data recognized by the program information recognition server 20 together with the program information determination result, and stores this information. The storage server 40 is an example of the storage device and of the external device in the present embodiment.
The intention judgment server 30 judges whether or not the uttered voice is a voice command intended to operate the television apparatus 10. Specifically, the intention judging server 30 acquires the first speech recognition data and the grammar analysis result from the speech assistance server 50, and judges whether or not the first speech recognition data includes program information. When determining that the first voice recognition data includes program information, the intention determination server 30 determines that the uttered voice is a voice command intended to operate the television apparatus 10. Here, the program information included in the first speech recognition data is referred to as first program information.
When determining that the first speech recognition data includes the program information, the intention determining server 30 acquires the second program information from the storage server 40 and replaces the first program information included in the first speech recognition data with the second program information. The speech recognition data after the substitution processing is referred to as third speech recognition data.
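The replacement that produces the third speech recognition data can be sketched as follows. This is a minimal illustration assuming the object span identified by the grammar analysis marks the first program information; the example strings are hypothetical.

```python
def replace_program_info(first_text, object_span, second_program_info):
    """Build the third speech recognition data by substituting the title
    recognized with the dedicated dictionary (second program information)
    for the object span of the first speech recognition data."""
    start, end = object_span
    return first_text[:start] + second_program_info + first_text[end:]
```

For example, if general-purpose recognition produced "playback happi tok" with object span (9, 18), substituting the dictionary-recognized title "Happy Talk" yields "playback Happy Talk".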
The intention judging server 30 transmits an instruction signal to the television apparatus 10 based on the voice recognition data (third voice recognition data) after the replacement processing, and controls the television apparatus 10. The instruction signal is a signal for instructing the television apparatus 10 to operate, and includes, for example, a command for specifying a recorded program or a reproduced program. The intention determination server 30 is an example of the information processing apparatus in the present embodiment.
The program information recognition server 20, the intention determination server 30, the storage server 40, and the voice assistance server 50 of the present embodiment each have the hardware configuration of a general-purpose computer, including a control device such as a CPU, storage devices such as ROM (Read Only Memory) and RAM, and external storage devices such as an HDD or a CD drive. They may also be configured, for example, as a cloud environment on a network.
In addition, the information processing system S1 may include a plurality of television apparatuses 10. In this case, the program information recognition server 20, the intention determination server 30, the storage server 40, and the voice assistance server 50 are connected to the plurality of television apparatuses 10 and exchange information with each of them.
Next, the details of each device included in the information processing system S1 according to the present embodiment will be described.
Fig. 2 is a diagram of the hardware configuration of the television apparatus 10 according to the embodiment of the present application. As shown in fig. 2, the television apparatus 10 includes an antenna 101, an input terminal 102a, a tuner 103, a demodulator 104, a demultiplexer 105, input terminals 102b and 102c, an A/D (analog/digital) converter 106, a selector 107, a signal processing unit 108, a speaker 109, a display panel 110, an operation unit 111, a light receiving unit 112, an IP communication unit 113, a CPU 114, a memory 115, a storage 116, a microphone 117, and an audio I/F (interface) 118.
The antenna 101 receives a broadcast signal of a digital broadcast, and supplies the received broadcast signal to the tuner 103 via the input terminal 102 a. The tuner 103 selects a frequency of a broadcast signal of a desired channel from broadcast signals supplied from the antenna 101, and supplies the selected broadcast signal to the demodulator 104.
The demodulator 104 demodulates the broadcast signal supplied from the tuner 103, and supplies the demodulated broadcast signal to the demultiplexer 105. The demultiplexer 105 demultiplexes the broadcast signal supplied from the demodulator 104 to generate a video signal and an audio signal, and supplies the generated video signal and audio signal to a selector 107 described later.
The input terminal 102b receives analog signals (video and audio signals) input from the outside. The input terminal 102c receives digital signals (video and audio signals) input from the outside. For example, the input terminal 102c can accept a digital signal from a device such as a BD recorder equipped with a drive that records to and reproduces from a recording medium such as a Blu-ray disc. The A/D converter 106 converts the analog signal supplied from the input terminal 102b into a digital signal and supplies it to the selector 107.
The operation unit 111 receives operation inputs from the user. The light receiving unit 112 receives infrared rays from the remote control 119. The IP communication unit 113 is a communication interface for performing IP (Internet Protocol) communication via the network 300, and can communicate with the program information recognition server 20, the intention determination server 30, the storage server 40, and the voice assistance server 50 via the network 300.
The CPU 114 is a control unit that controls the entire television apparatus 10. The memory 115 includes a ROM that stores the various computer programs executed by the CPU 114, a RAM that provides a work area for the CPU 114, and the like. The storage 116 is an HDD (hard disk drive), an SSD (solid state drive), or the like; it records, for example, the signal selected by the selector 107 as video data.
The microphone 117 picks up the voice uttered by the user and sends the voice to the audio I/F118. The audio I/F118 performs analog/digital conversion of the voice acquired by the microphone 117, and outputs the converted voice as a voice signal to the CPU 114.
Next, the functions of the television apparatus 10 according to the present embodiment will be described.
Fig. 3 is a diagram of functions that the television apparatus 10 according to the embodiment of the present application has. As shown in fig. 3, the television device 10 includes a voice input unit 11, a wakeup word determination unit 12, a first transmission unit 13, a first reception unit 14, a second transmission unit 15, a second reception unit 16, a reproduction unit 17, and a recording unit 18.
The voice input unit 11 inputs (acquires) the voice uttered by the user as a voice signal. More specifically, the voice input unit 11 receives, from the audio I/F 118, a voice signal obtained by digitally converting the voice uttered by the user. The voice input unit 11 sends the acquired voice signal to the wakeup word determination unit 12.
The wakeup word determination unit 12 determines whether the voice signal acquired by the voice input unit 11 contains a predetermined wake-up word. The wake-up word is a prescribed voice command that triggers the start of the voice assistance function, and is also called an invocation word. The wake-up word is determined in advance, and a well-known speech recognition technique may be used to determine whether the voice signal contains it. When the wakeup word determination unit 12 determines that the voice signal acquired by the voice input unit 11 contains the predetermined wake-up word, it sends the portion of the acquired voice signal that follows the wake-up word to the first transmission unit 13.
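The gating logic of the wakeup word determination unit can be sketched as below. For simplicity the sketch operates on already-recognized text rather than a raw voice signal, and the wake-up word "hey tv" is a hypothetical placeholder; the embodiment only says the word is predetermined.

```python
WAKE_WORD = "hey tv"  # hypothetical wake-up word; the real one is predetermined

def extract_command(recognized_utterance):
    """Return the portion of the utterance that follows the wake-up word,
    or None when the utterance does not begin with the wake-up word.
    Unit 12 performs the equivalent check on the voice signal itself."""
    text = recognized_utterance.strip().lower()
    if text.startswith(WAKE_WORD):
        return text[len(WAKE_WORD):].strip()
    return None
```

Only when a command portion is returned would it be forwarded to the first transmission unit.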
The first transmission unit 13 transmits the voice signal that follows the predetermined wake-up word to the program information recognition server 20 and the voice assistance server 50, associating the voice signal with identification information that identifies the television apparatus 10 and identification information that identifies the voice signal.
Fig. 4 is a diagram of voice information 81 transmitted from the television apparatus 10 according to the embodiment of the present application. As shown in fig. 4, the speech information 81 is information in which identification information (television apparatus ID) capable of identifying the television apparatus 10, identification information (speech ID) capable of identifying a speech signal, and a speech signal are associated with each other.
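The three associated fields of the voice information 81 can be modeled as a simple record; the field and class names below are illustrative choices, not part of the embodiment.

```python
from dataclasses import dataclass

@dataclass
class VoiceInfo:
    """Mirrors the three associated fields of the voice information 81."""
    tv_id: str     # identification information for the television apparatus 10
    voice_id: str  # identification information for this voice signal
    signal: bytes  # the digitized voice signal itself
```

Keeping the television apparatus ID and voice ID attached to the signal is what later lets the storage server and the intention determination server match results belonging to the same utterance.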
Returning to fig. 3, the first receiving unit 14 receives the television apparatus ID, the voice ID, the second voice recognition data, and the determination result of the program information from the program information recognition server 20. The first receiving unit 14 transmits the received information to the second transmitting unit 15.
The second transmission unit 15 transmits the television apparatus ID, the voice ID, the second voice recognition data, and the determination result of the program information received by the first reception unit 14 to the storage server 40. Alternatively, the second transmission unit 15 may determine the second program information included in the second voice recognition data based on the determination result of the program information, and transmit the television apparatus ID, the voice ID, and the second program information to the storage server 40.
The second reception unit 16 receives, from the intention determination server 30, an instruction signal instructing an operation related to the program information recognized by the program information recognition server 20, and forwards the instruction signal to the reproduction unit 17 and the recording unit 18.
The reproduction unit 17 reproduces recorded data of a program stored in the storage 116 or an external storage device based on the instruction signal received by the second reception unit 16. For example, the reproduction unit 17 retrieves the recorded data of the program title designated by the instruction signal from the storage 116 or the external storage device and plays it back. When the instruction signal instructs display of a program currently being broadcast rather than recorded data, the reproduction unit 17 may control the tuner 103 to select the channel on which the designated program is being broadcast and display that program on the display panel 110.
The recording unit 18 controls the selector 107 based on the instruction signal received by the second reception unit 16 to select the program to be recorded, and stores (records) the program in the storage 116 or an external storage device.
Next, the functions of the program information recognition server 20 according to the present embodiment will be described.
Fig. 5 is a diagram of the functions of the program information recognition server 20 according to the embodiment of the present application. As shown in fig. 5, the program information recognition server 20 includes a receiving unit 21, a specifying unit 22, an output unit 23, and a storage unit 25.
A dictionary in which program information is registered is stored in advance in the storage unit 25. The storage unit 25 is a storage device such as an HDD.
Fig. 6 is a diagram of a dictionary 80 according to an embodiment of the present application. As shown in fig. 6, text data of program titles is registered in the dictionary 80 in association with information (phonetic kana) indicating the pronunciation of each title. Because a program title may contain symbols or coined words, its pronunciation may differ from the ordinary reading; the correct pronunciation of each program title is therefore registered in the dictionary 80 in advance. Note that in Japanese a program title may be written in characters similar to Chinese characters, and its pronunciation may be expressed in a form similar to Chinese pinyin. For the example shown in fig. 6, the program title might be the Chinese-character phrase meaning "happy talk", and its pronunciation might be written as "yukuaidettanhua" in the pinyin corresponding to those characters. Instead of the text data of a program title, identification information such as an ID that identifies the program may be registered in the dictionary 80. Various metadata related to the program, such as its broadcast time, may also be registered in the dictionary 80.
Returning to fig. 5, the receiving unit 21 receives a voice signal from the television apparatus 10. More specifically, the receiving unit 21 receives, from the television apparatus 10, the voice information 81 in which the voice signal, the identification information (television apparatus ID) capable of identifying the television apparatus 10, and the identification information (voice ID) capable of identifying the voice signal are associated with each other. The receiving unit 21 sends the received speech information 81 to the specifying unit 22.
The specifying unit 22 generates second speech recognition data by speech recognition using the dictionary 80. Specifically, the specifying unit 22 converts the voice signal included in the voice information 81 received by the receiving unit 21 into text data by speech recognition, and specifies the program title included in that text data based on the dictionary 80 stored in the storage unit 25. For example, when a portion of the second speech recognition data matches the pronunciation of a program title registered in the dictionary 80, the specifying unit 22 specifies that portion as the program title.
Fig. 7 is a diagram of a speech recognition result according to an embodiment of the present application. As shown in fig. 7, when the user 9 inputs the voice 90 to the television apparatus 10, the television apparatus 10 transmits the voice 90 as a voice signal to the voice assistance server 50 and the program information recognition server 20. When the specifying unit 22 of the program information recognition server 20 generates the second voice recognition data 92, it converts a part specified as a program title in the character data into character data of the program title registered in the dictionary 80. As described above, since the dictionary 80 registers the character data of the program title in association with the information indicating the pronunciation of the program title, the specifying unit 22 can specify the program title included in the second speech recognition data 92 with high accuracy even when the reading of the program title is different from the general reading. The method of specifying the speech recognition and the program information by the specifying unit 22 is not limited to this, and other known methods may be employed.
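The dictionary lookup and title conversion described above can be sketched as follows. The dictionary entry is hypothetical: the pronunciation string follows the pinyin example in the description, and "Happy Talk" stands in for the registered title text.

```python
# Hypothetical dictionary 80 entries: pronunciation -> registered title text.
DICTIONARY_80 = {
    "yukuaidettanhua": "Happy Talk",  # illustrative reading and title
}

def specify_titles(recognized_text):
    """Replace registered pronunciations found in the recognized text with
    the registered title text, and report which titles were specified."""
    found = []
    for reading, title in DICTIONARY_80.items():
        if reading in recognized_text:
            recognized_text = recognized_text.replace(reading, title)
            found.append(title)
    return recognized_text, found
```

Because the match is made on the registered pronunciation rather than on ordinary readings, a title whose reading differs from the usual one can still be converted to its registered text (here "bofang" is an invented stand-in for a surrounding command word).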
The specifying unit 22 associates the second speech recognition data 92 with the information specifying the portion of the second speech recognition data 92 that corresponds to the program information (the program information determination result), and sends them to the output unit 23.
Returning to fig. 5, the output unit 23 outputs the second program information specified by the specifying unit 22. More specifically, the output unit 23 associates the television apparatus ID, the voice ID, the second speech recognition data 92, and the program information determination result with each other, and outputs them to the television apparatus 10.
Next, the function of the storage server 40 according to the present embodiment will be described.
Fig. 8 is a diagram of functions that the storage server 40 has according to the embodiment of the present application. As shown in fig. 8, the storage server 40 includes a storage processing unit 41, a search unit 42, and a storage unit 45.
The storage processing unit 41 receives the television apparatus ID, the voice ID, the second voice recognition data 92, and the determination result of the program information output from the television apparatus 10, and stores the received data in the storage unit 45.
The storage unit 45 stores the television apparatus ID, the voice ID, the second voice recognition data 92, and the result of specifying the program information, which are output from the television apparatus 10, in association with each other. The storage unit 45 is a storage device such as an HDD.
When it receives from the intention determination server 30 a request to transmit the second program information included in the second speech recognition data 92, the search unit 42 searches the storage unit 45 for the second speech recognition data 92 associated with the television apparatus ID and voice ID sent by the intention determination server 30, and transmits that data to the intention determination server 30. Note that the search unit 42 may transmit only the portion of the second speech recognition data 92 corresponding to the second program information, rather than the entire second speech recognition data 92.
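The store-and-retrieve behavior of units 41 and 42 can be sketched with a keyed in-memory map; the class name, the span representation, and the example values are assumptions for illustration.

```python
class StorageSketch:
    """Minimal sketch of the storage server: unit 41 stores results keyed
    by (TV ID, voice ID); unit 42 returns only the span corresponding to
    the second program information."""

    def __init__(self):
        self._store = {}

    def put(self, tv_id, voice_id, second_data, title_span):
        # storage processing unit 41: keep data and determination result
        self._store[(tv_id, voice_id)] = (second_data, title_span)

    def get_program_info(self, tv_id, voice_id):
        # search unit 42: return just the program-information portion
        second_data, (start, end) = self._store[(tv_id, voice_id)]
        return second_data[start:end]
```

Keying on the pair (television apparatus ID, voice ID) is what lets the intention determination server fetch the result for exactly the utterance it is currently analyzing.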
Next, the function of the intention judging server 30 according to the present embodiment will be described.
Fig. 9 is a diagram of functions that the intention judging server 30 has according to the embodiment of the present application. As shown in fig. 9, the intention judging server 30 includes an acquisition unit 31, a judgment unit 32, a replacement unit 33, a video device control unit 34, and a storage unit 35.
Predetermined commands used to instruct the television apparatus 10 are stored in advance in the storage unit 35. A predetermined command specifies an operation of the television apparatus 10; examples include "playback", "recording", and "open", but the commands are not limited to these. The storage unit 35 is a storage device such as an HDD.
The acquisition unit 31 acquires the first speech recognition data, the grammar analysis result, the television apparatus ID, and the speech ID output from the speech assistance server 50.
Here, the first speech recognition data will be explained.
As shown in fig. 7, the television apparatus 10 transmits the voice 90 uttered by the user 9 to the voice assistance server 50 as a voice signal. The voice assistance server 50 converts the received voice signal into text data by voice recognition, thereby generating first voice recognition data 91. Then, the voice assistance server 50 performs syntax analysis of the generated first voice recognition data 91.
The grammar parsing result is information for specifying an object or a predicate included in the first speech recognition data 91. For example, the voice assistance server 50 specifies, by grammar analysis, the range of characters corresponding to the object and the range of characters corresponding to the predicate (including a verb and the like) in a phrase included in the first speech recognition data 91. The syntax parsing method may be a known method. The voice assistance server 50 associates the television apparatus ID, the voice ID, the generated first voice recognition data 91, and the grammar analysis result with each other, and transmits them to the intention determination server 30.
Returning to fig. 9, the determination unit 32 determines whether or not the first speech recognition data 91 includes program information based on the result of the grammar analysis. In the present embodiment, the determination unit 32 determines whether or not the first speech recognition data 91 includes a program title. For example, when the part of the first speech recognition data 91 identified as the "predicate" by the syntax analysis includes any of the predetermined commands stored in the storage unit 35, the determination unit 32 determines that the part of the first speech recognition data 91 identified as the "object" is the program title. The method of determining whether or not the first speech recognition data 91 includes the program information is not limited to this, and other known analysis methods may be employed.
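The check performed by the determination unit 32 can be sketched as follows. The shape of the grammar-analysis result (a mapping with "predicate" and "object" entries) and the concrete command list are assumptions for illustration; the patent only states that the predicate portion is matched against predetermined commands held in the storage unit 35.

```python
# Illustrative sketch of the determination unit 32's check. The command
# set and the parse-result structure are hypothetical.

PREDETERMINED_COMMANDS = {"playback", "recording", "open"}  # held in storage unit 35

def contains_program_title(parse_result):
    """Return True when the part identified as the predicate contains any
    predetermined command; in that case the part identified as the object
    is treated as a program title."""
    predicate = parse_result.get("predicate", "")
    return any(cmd in predicate for cmd in PREDETERMINED_COMMANDS)
```

When this returns True, the object range of the first speech recognition data 91 is taken as the first program information.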
When the determination unit 32 determines that the first speech recognition data 91 includes program information (first program information), the replacement unit 33 acquires the second program information included in the second speech recognition data 92 from the storage server 40, and replaces the first program information included in the first speech recognition data 91 with the second program information. For example, in the example shown in fig. 7, the replacement unit 33 replaces "happy conversation" in "play back happy conversation" of the first voice recognition data 91 with "happy conversation!!" of the second voice recognition data 92, which contains the title exactly as registered. The replacement unit 33 sends the third speech recognition data resulting from the replacement process to the video device control unit 34.
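The replacement step can be sketched as a simple substring substitution over the recognized text; this is an illustrative simplification (the patent describes replacing the first program information range with the second program information, but does not prescribe a string operation), and the function name is hypothetical.

```python
# Illustrative sketch of the replacement unit 33.

def replace_program_info(first_recognition_text, first_program_info, second_program_info):
    """Replace the first program information inside the generic recognition
    result with the dictionary-based second program information, yielding
    the third speech recognition data."""
    # Replace only the first occurrence, i.e. the range the grammar
    # analysis identified as the object.
    return first_recognition_text.replace(first_program_info, second_program_info, 1)
```

The resulting third speech recognition data is what the video device control unit 34 converts into an instruction signal.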
The video device control unit 34 transmits an instruction signal to the television device 10 based on the third speech recognition data, and controls the operation of the television device 10. For example, the video device control unit 34 converts a program title and a command included in the third voice recognition data into a signal, and transmits the signal to the television device 10 as an instruction signal.
Next, a flow of the voice recognition processing of the present embodiment will be described.
Fig. 10 is a sequence diagram of the flow of the speech recognition process according to an embodiment of the present application. The voice input unit 11 of the television apparatus 10 receives the input of the voice 90 uttered by the user 9 (S1). The voice input unit 11 sends the input voice 90 to the wake word determination unit 12.
Next, the wake word determination unit 12 determines whether the input voice 90 includes a predetermined wake word (S2). When determining that the voice 90 acquired through the voice input unit 11 includes the predetermined wake word, the wake word determination unit 12 transmits, to the first transmission unit 13, the portion of the acquired voice 90 that immediately follows the predetermined wake word. When determining that the voice 90 does not include the predetermined wake word, the wake word determination unit 12 does not transmit the voice to the first transmission unit 13.
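The wake word check can be sketched as follows, operating on text for simplicity (the actual unit operates on a voice signal). The wake word itself is a hypothetical example; the patent does not specify one.

```python
# Illustrative sketch of the wake word determination unit 12.
# "hello tv" is a hypothetical wake word for illustration only.

WAKE_WORD = "hello tv"

def extract_command_after_wake_word(utterance):
    """Return the speech following the predetermined wake word, or None
    when the utterance does not begin with it (in which case nothing is
    forwarded to the first transmission unit)."""
    if utterance.lower().startswith(WAKE_WORD):
        return utterance[len(WAKE_WORD):].strip()
    return None
```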
Next, the first transmission unit 13 transmits the voice following the predetermined wake word in the input voice 90 to the program information recognition server 20 and the voice assistance server 50 as a voice signal. More specifically, the first transmission unit 13 transmits the voice information 81, in which the voice signal, the television apparatus ID, and the voice ID are associated with each other, to the program information recognition server 20 and the voice assistance server 50 (S3).
Then, the receiving unit 21 of the program information recognition server 20 receives the voice information 81 from the television apparatus 10 and sends it to the determination unit 22. The determination unit 22 of the program information recognition server 20 converts the voice signal included in the voice information 81 into character data by voice recognition, and determines the program title included in the character data based on the dictionary 80 (S4).
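The dictionary-based title determination can be sketched as matching the recognized character data against registered pronunciations. The dictionary entries below are hypothetical examples; the patent only states that the dictionary 80 associates program titles with information indicating their pronunciation.

```python
# Illustrative sketch of the determination unit 22's lookup against the
# dictionary 80. Entries are hypothetical examples.

DICTIONARY_80 = {
    # pronunciation (reading) -> title exactly as registered
    "happy conversation": "happy conversation!!",
    "world news nine": "World News 9",
}

def determine_program_title(recognized_text):
    """Return the registered title whose pronunciation appears in the
    recognized character data, or None when no entry matches."""
    for reading, title in DICTIONARY_80.items():
        if reading in recognized_text:
            return title
    return None
```

This is why the second voice recognition data can contain the exact registered title, including punctuation, even when the generic recognizer would not produce it.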
Next, the output unit 23 of the program information recognition server 20 associates the television apparatus ID, the voice ID, the second voice recognition data 92, and the result of specifying the program information with each other, and outputs the result of voice recognition to the television apparatus 10 (S5).
Then, the first receiving unit 14 of the television apparatus 10 receives the television apparatus ID, the voice ID, the second voice recognition data 92, and the determination result of the program information as the voice recognition result of the voice signal. Next, the second transmitting unit 15 of the television apparatus 10 associates the voice recognition result received by the first receiving unit 14, that is, the television apparatus ID, the voice ID, the second voice recognition data 92, and the determination result of the program information, and transmits the association to the storage server 40 (S6).
Then, the storage processing unit 41 of the storage server 40 stores the speech recognition result produced by the program information recognition server 20, received from the television apparatus 10, in the storage unit 45 (S7).
The voice assistance server 50 performs voice recognition on the voice signal included in the voice information 81 transmitted from the television apparatus 10 in the process of S3, converts the voice signal into character data, and performs syntax analysis of the character data (S8). The voice assistance server 50 associates the television apparatus ID, the voice ID, the first voice recognition data 91, and the grammar analysis result with each other, and transmits the result to the intention judging server 30 as a voice recognition result (S9).
Then, the acquisition unit 31 of the intention determination server 30 acquires the television apparatus ID, the voice ID, the first voice recognition data 91, and the grammar analysis result output from the voice assistance server 50. Then, the determination unit 32 of the intention determination server 30 determines whether or not the first speech recognition data 91 includes a program title based on the obtained grammar analysis result (S10).
If the determination unit 32 determines in the process of S10 that the first speech recognition data 91 includes program information (first program information), the process branches to S100. Specifically, when the determination unit 32 determines that the first speech recognition data 91 includes program information (first program information), the replacement unit 33 of the intention determination server 30 requests the storage server 40 to transmit the second program information included in the second speech recognition data 92 (S11). More specifically, the replacing unit 33 transmits the television apparatus ID and the voice ID associated with the first voice recognition data 91 determined by the determining unit 32 to include the first program information to the storage server 40.
Then, the search unit 42 of the storage server 40 searches the storage unit 45 for the second speech recognition data 92 associated with the television apparatus ID and the speech ID transmitted from the intention judging server 30, and transmits the second speech recognition data to the intention judging server 30 (S12). The search process of the second speech recognition data 92 may be executed by the replacement unit 33 of the intention judging server 30. The replacing unit 33 may acquire the entire second speech recognition data 92 including the second program information from the storage server 40, or may acquire only the second program information in the second speech recognition data 92.
The replacement unit 33 of the intention judging server 30 replaces the first program information included in the first speech recognition data 91 with the second program information acquired from the storage server 40 (S13).
Then, the video apparatus control unit 34 of the intention judgment server 30 transmits an instruction signal instructing an operation to the television apparatus 10 based on the third speech recognition data subjected to the replacement processing by the replacement unit 33 (S14).
Then, the second receiving unit 16 of the television apparatus 10 transmits the instruction signal received from the intention judging server 30 to the reproducing unit 17 or the recording unit 18. Then, the reproducing unit 17 or the recording unit 18 executes processing in accordance with the instruction signal transmitted from the intention judging server 30 (S15). For example, the reproducing unit 17 or the recording unit 18 reproduces the recorded data of the program designated by the instruction signal, records the program designated by the instruction signal, or the like.
When there are a plurality of candidates for a program to be reproduced or recorded, the reproducing unit 17 or the recording unit 18 may display the candidates on the display panel 110 so as to be selectable. For example, when a plurality of recorded broadcasts exist for the title designated by the instruction signal, the playback unit 17 may display a selection screen on the display panel 110 that enables selection of a playback target from among the plurality of recordings. In this case, the recording to be played back may be selected by the user operating the remote control 119, or may be selected by voice.
In S10, when the determination unit 32 of the intention determination server 30 determines that the first speech recognition data 91 does not include the first program information, the process branches to S200. Specifically, the determination unit 32 of the intention determination server 30 transmits the determination result that the first speech recognition data 91 does not include the first program information to the speech assistance server 50 (S15).
In this case, since the speech 90 uttered by the user 9 is not a speech related to the program, the speech assistance server 50 starts another speech assistance process based on the first speech recognition data 91 (S16). The other voice support processing is processing other than the operation of the television apparatus 10. Since the voice assistance server 50 executes a general-purpose voice assistance service, the contents of other voice assistance processing are not particularly limited. Here, the processing shown in the sequence diagram of fig. 10 ends.
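The branch of the sequence described in S10 through S16 can be summarized in one sketch: when the generic recognition result contains program information, it is replaced with the dictionary-based result and an instruction is issued; otherwise, another voice assistance process handles the utterance. All names and the parse-result shape are illustrative assumptions.

```python
# End-to-end sketch of the branch in the Fig. 10 sequence (S10-S16).
# The command set and parse-result structure are hypothetical.

def handle_recognition(first_text, parse_result, second_program_info):
    commands = {"playback", "recording", "open"}  # predetermined commands
    if any(cmd in parse_result.get("predicate", "") for cmd in commands):
        # S13: replace the first program information (the object range)
        # with the second program information from the storage server.
        third_text = first_text.replace(parse_result["object"], second_program_info, 1)
        return ("instruct_television", third_text)   # S14: instruction signal
    # Not program-related: hand back to the voice assistance server.
    return ("other_voice_assistance", first_text)    # S16
```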
As described above, in the intention judging server 30 of the present embodiment, when the first speech recognition data 91, obtained by the voice assistance server 50 performing speech recognition on the speech 90, is judged to contain the first program information, the first program information is replaced with the second program information contained in the second speech recognition data 92, obtained by the program information recognition server 20 performing speech recognition on the same speech 90. Therefore, according to the intention judging server 30 of the present embodiment, the result of general-purpose voice recognition can be acquired while, for information on programs, the voice recognition result using the dedicated dictionary 80 is adopted, thereby improving the accuracy of voice recognition for information on programs.
For example, when only a general-purpose voice assistance service is used, it may be difficult to accurately voice-recognize information related to a program such as a program title. In addition, when a voice support service dedicated to information related to a program such as a program title is used, it may be difficult to perform voice recognition with high accuracy for other than a voice command related to the program. In contrast, in the intention judging server 30 of the present embodiment, the results of the voice recognition performed by the voice assisting server 50 and the program information recognizing server 20 are acquired, and the recognition result of the program information recognizing server 20 is used for the information related to the program. Thus, according to the intention judging server 30 of the present embodiment, it is possible to realize both general speech recognition and high-precision speech recognition of information relating to a program.
The intention judging server 30 of the present embodiment controls the operation of the television apparatus 10 based on the third speech recognition data obtained by replacing the first program information with the second program information. Thus, according to intention judging server 30 of the present embodiment, the operation of television apparatus 10 can be controlled based on the highly accurate speech recognition result of the information on the program.
The first program information and the second program information of the present embodiment include any one of a title of a program, a genre of the program, and a name of a performer of the program. More specifically, the first program information and the second program information of the present embodiment are program titles. Then, the program information recognition server 20 specifies the title of the program included in the speech 90 as the second program information based on the dictionary 80 in which the pronunciation of the program title is registered. Therefore, according to the intention judging server 30 of the present embodiment, even when the reading of the program title is different from the general reading, the program title which has been voice-recognized with high accuracy by the program information recognizing server 20 can be acquired.
The information processing system S1 of the present embodiment includes the television device 10, the program information recognition server 20, the intention judgment server 30, and the storage server 40. The television apparatus 10 of the present embodiment transmits the input speech 90 as a speech signal to the program information recognition server 20 and the speech assistance server 50. The program information recognition server 20 of the present embodiment specifies the second program information included in the speech signal based on the information registered in the dictionary 80, and outputs the second speech recognition data 92 including the specified second program information. The storage server 40 of the present embodiment stores the second program information identified by the program information identification server 20. When determining that the first speech recognition data 91 acquired from the speech assistance server 50 contains the first program information, the intention determination server 30 of the present embodiment replaces the first program information contained in the first speech recognition data 91 with the second program information. Then, the intention judging server 30 of the present embodiment controls the operation of the television apparatus 10 based on the third speech recognition data after the substitution processing. In this way, in the information processing system S1 of the present embodiment, both the common speech recognition and the speech recognition using the dictionary 80 are performed on the speech 90 uttered by the user 9. Therefore, according to the information processing system S1 of the present embodiment, it is possible to provide the user 9 with a voice assistance service based on general-purpose voice recognition, and to control the television apparatus 10 based on the result of performing voice recognition on information relating to a program with high accuracy. 
That is, according to the information processing system S1 of the present embodiment, it is possible to provide the user 9 with the voice assistance service and use the result of performing the voice recognition on the information related to the program with high accuracy for the voice assistance service.
For example, when a dedicated voice assistance service for program-related information is used to control a television apparatus separately from a general-purpose voice assistance service, the user has to switch between two voice assistance services according to the purpose of use, which complicates operation. In contrast, according to the present embodiment, since the television apparatus 10 transmits the input speech 90 as a speech signal to both the program information recognition server 20 and the voice assistance server 50, highly accurate speech recognition of information relating to a program can be utilized without the user 9 consciously switching between voice assistance services.
In the present embodiment, the television apparatus 10 is an example of a video apparatus, but the video apparatus may be a BD recorder, a DVD recorder, or the like, or may be a voice input apparatus connected to the television apparatus 10.
In the present embodiment, the television apparatus 10 records or reproduces video on the television apparatus 10 itself in response to the instruction signal from the intention judging server 30.
In the present embodiment, the microphone 117 is an example of the voice input unit. Alternatively, the combination of the voice input unit 11, the microphone 117, and the audio I/F 118 may serve as the voice input unit.
In the present embodiment, the speech assistance server 50 performs grammar analysis, but the intention determination server 30 may perform grammar analysis. In this case, voice assistance server 50 associates the television apparatus ID, voice ID, and first voice recognition data 91 and transmits them to intention determination server 30.
(modification 1)
In the above-described embodiment, the case where the program information is the program title has been described, but the program information is not limited thereto, and may be the genre of the program or the name of a performer of the program. Accordingly, the program information recognition server 20 may store a dictionary in which the genre of the program or the performer names of the program are registered.
Fig. 11 is a diagram of a dictionary 1080 according to a modification of the embodiment of the present application. As shown in fig. 11, the dictionary 1080 registered in the storage unit 25 of the program information recognition server 20 may store, in association with each other, the character data of the program title, information (phonetic kana) indicating the pronunciation of the program title, the genre of the program, the character data of the names of the program's performers, the pronunciations of those performer names, the names of groups to which the performers belong, the pronunciations of those group names, and the like. These pieces of information may also be registered in a plurality of databases in a distributed manner.
For example, when the speech 90 uttered by the user 9 includes the name of a performer, the specifying unit 22 of the program information recognition server 20 may specify a program in which that performer appears as the second program information. In addition, when the speech 90 uttered by the user 9 includes a group name, a program in which a member of that group is registered as a performer may be determined as the second program information. Further, a performer's stage name may change over time, or a performer may be known by a plurality of names. In the dictionary 1080, a plurality of readings, such as former and current stage names and nicknames, can be registered as the pronunciations of a performer.
By storing various information related to the program in the dictionary 1080 in this way, the program information recognition server 20 can specify the information related to the program included in the speech 90 uttered by the user 9 with higher accuracy.
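A lookup against such an extended dictionary can be sketched as follows. All entries, field names, and aliases are hypothetical examples constructed to illustrate the behavior described in this modification (matching by performer name, alias, or group name).

```python
# Illustrative sketch of lookups against the dictionary 1080.
# Every entry and name below is a hypothetical example.

DICTIONARY_1080 = [
    {
        "title": "Morning Variety Show",
        "genre": "variety",
        # Old/new stage names and nicknames registered as pronunciations.
        "performer_aliases": {"Taro Yamada": ["taro yamada", "yamataro"]},
        # Group names, so that naming a group finds programs featuring its members.
        "groups": {"The Comets": ["the comets"]},
    },
]

def find_program_by_utterance(utterance):
    """Return the title of a program whose performer name, alias, or
    group name appears in the utterance, or None when nothing matches."""
    text = utterance.lower()
    for entry in DICTIONARY_1080:
        for aliases in entry["performer_aliases"].values():
            if any(alias in text for alias in aliases):
                return entry["title"]
        for readings in entry["groups"].values():
            if any(reading in text for reading in readings):
                return entry["title"]
    return None
```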
(modification 2)
In the above-described embodiment, the voice recognition result by the program information recognition server 20 is transmitted to the television apparatus 10 and then transmitted to the storage server 40 via the television apparatus 10, but the transmission path of the voice recognition result is not limited to this.
Fig. 12 is a diagram of the overall configuration of an information processing system S2 according to a modification of the embodiment of the present application. As shown in fig. 12, the information processing system S2 includes a television device 1010, a program information identification server 1020, an intention determination server 30, and a storage server 1040. The television device 1010 and the intention judging server 30 are connected to the voice assistance server 50 via a network.
The program information recognition server 1020 of the present modification has the functions of the above-described embodiment, and transmits the voice recognition result to the storage server 1040. More specifically, the output unit 23 of the program information recognition server 1020 associates the television apparatus ID, the voice ID, the second voice recognition data 92, and the result of specifying the program information with each other, and outputs the result to the storage server 1040.
The storage server 1040 according to the present modification has the functions of the above-described embodiment, and also stores the television apparatus ID, the voice ID, the second voice recognition data 92, and the result of specifying the program information, which are transmitted from the program information recognition server 1020.
Further, the intention determination server 30, the television device 1010, and the voice assistance server 50 according to the present modification have the functions of the above-described embodiment.
By directly transmitting the voice recognition result from the program information recognition server 1020 to the storage server 1040 as in the present modification, the voice recognition result can be stored in the storage server 1040 even without the television apparatus 1010 acting as an intermediary between the program information recognition server 1020 and the storage server 1040.
(modification 3)
In the above-described embodiment, the case where the program information recognition server 20, the intention determination server 30, the storage server 40, and the speech assistance server 50 are each configured as a different server has been described, but the functions of a plurality of servers may be realized by one server. For example, voice assistance server 50 and intent determination server 30 may be combined into one server. Also, the program information recognition server 20 and the storage server 40 may be combined into one server. The intention judging server 30 and the storage server 40 may be combined into one server. Moreover, the function of one server can be realized by a plurality of computers by a technique such as virtualization.
As described above, according to the above-described embodiment, the accuracy of speech recognition for program-related information can be improved by adopting, for program-related information, the speech recognition result obtained with the dedicated dictionaries 80 and 1080 in place of the corresponding portion of the general-purpose speech recognition result.
The programs executed by the television devices 10 and 1010, the program information recognition servers 20 and 1020, the intention determination server 30, the storage servers 40 and 1040, and the voice assistance server 50 according to the embodiments are provided as files in an installable or executable format recorded on a computer-readable recording medium such as a CD-ROM, a flexible disk (FD), a CD-R, or a DVD (Digital Versatile Disc).
The programs executed by the television devices 10 and 1010, the program information recognition servers 20 and 1020, the intention determination server 30, the storage servers 40 and 1040, and the voice support server 50 according to the embodiments may be stored in a computer connected to a network such as the internet, and may be provided by downloading the programs via the network. The programs executed by the television apparatus 10, 1010, the program information recognition server 20, 1020, the intention determination server 30, the storage server 40, 1040, and the voice assistance server 50 according to the embodiment may be provided or distributed via a network such as the internet. The programs executed by the television apparatus 10, 1010, the program information recognition server 20, 1020, the intention determination server 30, the storage server 40, 1040, and the voice assistance server 50 according to the embodiment may be provided by being loaded in advance in a ROM or the like.
The program executed by the television apparatus 10 or 1010 according to the embodiment has a module configuration including the above-described respective units (the voice input unit, the wakeup word determination unit, the first transmission unit, the first reception unit, the second transmission unit, the second reception unit, the reproduction unit, and the recording unit), and a CPU (processor) reads out the program from the storage medium and executes the program to load the respective units on the main storage device as actual hardware, thereby generating the voice input unit, the wakeup word determination unit, the first transmission unit, the first reception unit, the second transmission unit, the second reception unit, the reproduction unit, and the recording unit in the main storage device.
The program executed by the program information identification server 20 or 1020 according to the embodiment has a module configuration including the above-described respective units (receiving unit, specifying unit, and output unit), and the CPU reads out the program from the storage medium and executes the program to load the respective units into the main storage device as actual hardware, thereby generating the receiving unit, the specifying unit, and the output unit in the main storage device.
The program executed by the intention judging server 30 according to the embodiment has a module configuration including the above-described respective units (the acquiring unit, the judging unit, the replacing unit, and the video device control unit), and the CPU reads out the program from the storage medium and executes the program to load the respective units into the main storage device as actual hardware, thereby generating the acquiring unit, the judging unit, the replacing unit, and the video device control unit in the main storage device.
The program executed by the storage servers 40 and 1040 according to the embodiments has a module configuration including the above-described respective units (the storage processing unit and the search unit), and the CPU reads the program from the storage medium and executes the program to load the respective units into the main storage device as actual hardware, thereby generating the storage processing unit and the search unit in the main storage device.
Although the embodiments of the present invention have been described, these embodiments are presented as examples and do not limit the scope of the invention. These new embodiments can be implemented in other various ways, and various omissions, substitutions, and changes can be made without departing from the spirit of the invention. The above-described embodiments and modifications thereof are included in the scope and gist of the invention, and are included in the invention described in the claims and the equivalent scope thereof.
Description of the reference numerals
10, 1010 television device
11 voice input part
12 Wake-up word judging section
13 first transmission part
14 first receiving part
15 second transmitting part
16 second receiving part
17 reproducing unit
18 video recording part
20, 1020 program information identification server
21 receiving part
22 determination unit
23 output part
25 storage unit
30 intention judging server
31 acquisition part
32 determination part
33 replacement part
34 image device control part
35 storage part
40, 1040 storage server
41 storage processing part
42 search unit
45 storage unit
50 voice auxiliary server
80, 1080 dictionary
81 speech information
90 speech
91 first speech recognition data
92 second speech recognition data
117 microphone
S1 and S2 information processing system

Claims (10)

PCT domestic application; the claims have been published.
CN201980026560.5A 2018-09-20 2019-09-16 Information processing apparatus, information processing system, and image apparatus Active CN112236816B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
JP2018175656A JP7009338B2 (en) 2018-09-20 2018-09-20 Information processing equipment, information processing systems, and video equipment
JP2018-175656 2018-09-20
PCT/CN2019/106005 WO2020057467A1 (en) 2018-09-20 2019-09-16 Information processing apparatus, information processing system and video apparatus

Publications (2)

Publication Number Publication Date
CN112236816A true CN112236816A (en) 2021-01-15
CN112236816B CN112236816B (en) 2023-04-28

Family

ID=69888348

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980026560.5A Active CN112236816B (en) 2018-09-20 2019-09-16 Information processing apparatus, information processing system, and image apparatus

Country Status (3)

Country Link
JP (1) JP7009338B2 (en)
CN (1) CN112236816B (en)
WO (1) WO2020057467A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114667566A (en) * 2021-01-21 2022-06-24 海信视像科技股份有限公司 Voice instruction processing circuit, receiving apparatus, server, voice instruction accumulation system, and voice instruction accumulation method

Families Citing this family (1)

Publication number Priority date Publication date Assignee Title
JP7536667B2 (en) * 2021-01-21 2024-08-20 Tvs Regza株式会社 Voice command processing circuit, receiving device, remote control and system

Citations (7)

Publication number Priority date Publication date Assignee Title
JP2001309256A (en) * 2000-04-26 2001-11-02 Sanyo Electric Co Ltd Receiver of digital tv broadcasting
CN1530926A (en) * 2003-03-13 2004-09-22 ���µ�����ҵ��ʽ���� Speech recognition dictionary making device and information retrieval device
CN1766877A (en) * 2004-09-10 2006-05-03 索尼株式会社 User identification method, user identification device and corresponding electronic system and apparatus
WO2007069512A1 (en) * 2005-12-15 2007-06-21 Sharp Kabushiki Kaisha Information processing device, and program
US20070150273A1 (en) * 2005-12-28 2007-06-28 Hiroki Yamamoto Information retrieval apparatus and method
JP2013250379A (en) * 2012-05-31 2013-12-12 Alpine Electronics Inc Voice recognition device, voice recognition method and program
CN106170986A (en) * 2014-09-26 2016-11-30 株式会社阿斯台姆 Program output device, program server, auxiliary information management server, program and the output intent of auxiliary information and storage medium

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007171809A (en) * 2005-12-26 2007-07-05 Canon Inc Information processing apparatus and information processing method
CN101647021B (en) * 2007-04-13 2013-03-27 麻省理工学院 Speech data retrieval device, speech data retrieval method, speech data retrieval program, and computer usable medium including speech data retrieval program
JP2011232619A (en) * 2010-04-28 2011-11-17 Ntt Docomo Inc Voice recognition device and voice recognition method
US8793136B2 (en) * 2012-02-17 2014-07-29 Lg Electronics Inc. Method and apparatus for smart voice recognition
WO2013183078A1 (en) * 2012-06-04 2013-12-12 三菱電機株式会社 Automatic recording device
KR20140058127A (en) * 2012-11-06 2014-05-14 삼성전자주식회사 Voice recognition apparatus and voice recogniton method
EP3040985B1 (en) * 2013-08-26 2023-08-23 Samsung Electronics Co., Ltd. Electronic device and method for voice recognition
CN106782546A (en) * 2015-11-17 2017-05-31 深圳市北科瑞声科技有限公司 Speech recognition method and device
CN105609103A (en) * 2015-12-18 2016-05-25 合肥寰景信息技术有限公司 Speech instant recognition system
JP6597527B2 (en) * 2016-09-06 2019-10-30 トヨタ自動車株式会社 Speech recognition apparatus and speech recognition method
US9959861B2 (en) * 2016-09-30 2018-05-01 Robert Bosch Gmbh System and method for speech recognition

Also Published As

Publication number Publication date
WO2020057467A1 (en) 2020-03-26
JP7009338B2 (en) 2022-01-25
CN112236816B (en) 2023-04-28
JP2020046564A (en) 2020-03-26

Similar Documents

Publication Publication Date Title
JP4481972B2 (en) Speech translation device, speech translation method, and speech translation program
KR101683943B1 (en) Speech translation system, first terminal device, speech recognition server device, translation server device, and speech synthesis server device
US6775651B1 (en) Method of transcribing text from computer voice mail
KR102304052B1 (en) Display device and operating method thereof
JP5178109B2 (en) Search device, method and program
US12148430B2 (en) Method, system, and computer-readable recording medium for managing text transcript and memo for audio file
JP2007507746A (en) Speech tagging, speech annotation, and speech recognition for portable devices with optional post-processing
CN104301771A (en) Method and device for adjusting playing progress of video file
KR20080043358A (en) Method and system for controlling the operation of a playback device
JP2000148182A (en) Editing system and method used for transcription of telephone message
US20170060531A1 (en) Devices and related methods for simplified proofreading of text entries from voice-to-text dictation
JP3799280B2 (en) Dialog system and control method thereof
JP2013152365A (en) Transcription supporting system and transcription support method
WO2020124754A1 (en) Multimedia file translation method and apparatus, and translation playback device
JPWO2019155717A1 (en) Information processing equipment, information processing systems, information processing methods, and programs
KR20220156786A (en) The system and an appratus for providig contents based on a user utterance
US20190155843A1 (en) A secure searchable media object
CN112236816B (en) Information processing apparatus, information processing system, and image apparatus
WO2019155716A1 (en) Information processing device, information processing system, information processing method, and program
KR101429138B1 (en) Speech recognition method at an apparatus for a plurality of users
CN109977239B (en) Information processing method and electronic equipment
JP2016014897A (en) Voice interaction support device, method and program
JP5242856B1 (en) Music playback program and music playback system
JP4895759B2 (en) Voice message output device
JP6387044B2 (en) Text processing apparatus, text processing method, and text processing program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant