JP2010181461A

JP2010181461A - Digital photograph frame, information processing system, program, and information storage medium

Info

Publication number: JP2010181461A
Application number: JP2009022628A
Authority: JP
Inventors: Ryohei Sugihara (杉原良平); Seiji Tatsuta (龍田成示); Yoichi Iba (井場陽一); Miho Kameyama (亀山未帆); Isato Fujigaki (藤垣勇人)
Original assignee: Olympus Corp
Current assignee: Olympus Corp
Priority date: 2009-02-03
Filing date: 2009-02-03
Publication date: 2010-08-19

Abstract

PROBLEM TO BE SOLVED: To provide a digital photo frame, an information processing system, a program, and an information storage medium that achieve display control reflecting sound information acquired by a sound sensor.

SOLUTION: The digital photo frame 300 includes a display unit 340 that displays an image, a display control unit 318 that controls the display unit 340, and a sound information acquisition unit 304 that acquires sound information detected by a sound sensor that detects ambient sound. The display control unit 318 changes the display mode of the image displayed on the display unit 340 in accordance with the acquired sound information.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a digital photo frame, an information processing system, a program, an information storage medium, and the like.

In recent years, digital photo frames have attracted attention as devices that make it easy to play back images taken with a digital camera such as a digital still camera. A digital photo frame is a device in which the part of a photo stand that holds the photograph is replaced with a liquid crystal display. It plays back digital image data (electronic photographs) read via a memory card or a communication device.

One conventional digital photo frame technique is disclosed in Patent Document 1. In this prior art, a digital photo stand (a digital photo frame) is equipped with a telephone line connection device, which forms a transmission path between the photo stand and a wired or wireless telephone line.

However, conventional digital photo frames can only play back images taken with a digital camera or the like. They do not perform display control that reflects ambient sounds, user utterances, and the like. As a result, playback is monotonous, and a wide variety of content cannot be presented to the user.

Patent Document 1: JP 2000-324473 A

Some aspects of the present invention provide a digital photo frame, an information processing system, a program, an information storage medium, and the like that enable display control reflecting sound information acquired by a sound sensor.

One aspect of the present invention relates to a digital photo frame that includes a display unit that displays an image, a display control unit that controls the display unit, and a sound information acquisition unit that acquires sound information detected by a sound sensor that detects ambient sound. The display control unit performs display control that changes the display mode of the image displayed on the display unit in accordance with the acquired sound information.

In this aspect, sound information detected by the sound sensor is acquired. The display unit of the digital photo frame is then controlled so that the display mode of the displayed image changes according to that sound information. This enables display control that reflects the sound information acquired by the sound sensor and allows a wide variety of image presentations.

In one aspect of the present invention, the digital photo frame may include an utterance detection unit that detects a user's utterance based on the acquired sound information. The display control unit may then change the display mode of the image displayed on the display unit based on the result of detecting the user's utterance.

With this configuration, the displayed image changes in response to the user's utterances, enabling display control that reflects those utterances.
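
The source does not say how an utterance is detected from the sound information. The following Python code is a minimal sketch of one possible approach, an energy-based detector. It assumes that audio is split into frames of samples normalized to [-1.0, 1.0] and reports an utterance when several consecutive frames exceed an RMS threshold. The function names and threshold values are hypothetical. A practical device would likely need a more robust voice activity detector.

```python
import math

# Hypothetical sketch of energy-based utterance detection (not from the source).


def frame_rms(samples):
    """Root-mean-square amplitude of one frame of samples in [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def detect_utterance(frames, rms_threshold=0.05, min_active_frames=3):
    """Report an utterance when at least `min_active_frames` consecutive
    frames exceed `rms_threshold` (both values are hypothetical)."""
    run = 0
    for frame in frames:
        if frame_rms(frame) > rms_threshold:
            run += 1
            if run >= min_active_frames:
                return True
        else:
            run = 0  # a quiet frame breaks the run of loud frames
    return False


quiet = [[0.001] * 160] * 10      # background noise only
speech = [[0.2, -0.2] * 80] * 5   # five loud frames in a row
print(detect_utterance(quiet), detect_utterance(speech))  # prints: False True
```

Requiring a run of loud frames, rather than a single loud frame, is a design choice that keeps short clicks or bangs from being treated as speech.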

In one aspect of the present invention, the display control unit may display detailed information or related information about the displayed content when a user's utterance is detected.

With this configuration, when the user speaks out of interest in the displayed content, for example, detailed or related information about that content is displayed in response. This provides a new kind of interface environment for digital photo frames.

In one aspect of the present invention, the digital photo frame may include a visual recognition state determination unit that determines the user's visual recognition state, for example whether the user is gazing at the display unit. The display control unit may then change the display mode of the image displayed on the display unit based on both the result of detecting the user's utterance and the result of determining the user's visual recognition state.

With this configuration, display control can reflect the user's visual recognition state as well as the utterance detection result.

In one aspect of the present invention, the display control unit may control the timing at which content is displayed based on the result of detecting the user's utterance.

With this configuration, the user can change the display timing of content simply by speaking, which provides a new type of interface environment.

In one aspect of the present invention, the utterance detection unit may detect a non-utterance state of the user. When that state is detected, the display control unit may switch to the next content in a slide display of content.

With this configuration, the current content stays on screen while the user is talking about it with interest. When a non-utterance state is detected, the display switches to the next content. This provides an interface environment well suited to slide display.
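
The following Python code is a minimal sketch of this slide-switching behaviour. It assumes that the utterance detection result is polled periodically together with a timestamp in seconds. The class name SilenceAdvancingSlideshow and the 5-second silence timeout are hypothetical. The source states only that the next content is shown when a non-utterance state is detected.

```python
# Hypothetical sketch: advance the slide display after a period of silence.


class SilenceAdvancingSlideshow:
    def __init__(self, items, silence_timeout=5.0, start_time=0.0):
        if not items:
            raise ValueError("slideshow needs at least one item")
        self.items = list(items)
        self.index = 0
        self.silence_timeout = silence_timeout
        # Time of the last detected utterance or the last slide change.
        self._last_event = start_time

    @property
    def current(self):
        return self.items[self.index]

    def update(self, now, utterance_detected):
        """Call periodically; returns the item that should be displayed."""
        if utterance_detected:
            self._last_event = now  # user is talking: keep the current item
        elif now - self._last_event >= self.silence_timeout:
            # Non-utterance state: move to the next item (wrapping around).
            self.index = (self.index + 1) % len(self.items)
            self._last_event = now
        return self.current


show = SilenceAdvancingSlideshow(["beach.jpg", "party.jpg", "garden.jpg"])
show.update(1.0, utterance_detected=True)          # talking: stay on beach.jpg
show.update(4.0, utterance_detected=False)         # 3 s of silence: still beach.jpg
print(show.update(6.0, utterance_detected=False))  # prints party.jpg
```

Here the "non-utterance state" is modelled as a fixed period with no detected utterance. The source does not specify how long that period should be.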

In one aspect of the present invention, the digital photo frame may include a speech recognition unit that performs speech recognition on the user's utterance based on the acquired sound information. The display control unit may then change the display mode of the image displayed on the display unit based on the speech recognition result.

With this configuration, the display mode of the image on the digital photo frame changes according to the speech recognition result for the user's utterance, enabling display control that reflects that result.

In one aspect of the present invention, the speech recognition unit may extract the user's utterance keywords through speech recognition. The display control unit may then display content selected based on the extracted utterance keywords.

With this configuration, when the user speaks, content selected by the utterance keywords contained in the utterance is displayed. The display is thus controlled by the user's utterance keywords.
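
The source does not describe how utterance keywords are extracted from the recognition result. One simple possibility is to spot keywords from a registered list in a text transcript, and the following Python code is a sketch of that approach (compare the registered keyword storage unit 326 described later). The function name is hypothetical. Whole-word matching assumes a space-delimited transcript such as English. Japanese text, which has no spaces between words, would instead need morphological analysis.

```python
import re

# Hypothetical sketch: spot registered keywords in a recognition transcript.


def extract_utterance_keywords(transcript, registered_keywords):
    """Return the registered keywords found in `transcript`, ordered by
    first occurrence. Matching is case-insensitive and whole-word."""
    text = transcript.lower()
    hits = []
    for keyword in registered_keywords:
        match = re.search(r"\b" + re.escape(keyword.lower()) + r"\b", text)
        if match:
            hits.append((match.start(), keyword))
    return [keyword for _, keyword in sorted(hits)]


print(extract_utterance_keywords("Look, that is Mt. Fuji in winter",
                                 ["winter", "Fuji", "beach"]))
# prints ['Fuji', 'winter']
```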

In one aspect of the present invention, when a tag keyword associated with the displayed content matches the utterance keyword, the display control unit may display content corresponding to that tag keyword.

With this configuration, a user who is interested in the displayed content may speak an utterance keyword related to it. If that keyword matches a tag keyword of the displayed content, content corresponding to the matched tag keyword is displayed. The user's intentions can thus be reflected in the content shown next, which makes possible a new type of digital photo frame.

In one aspect of the present invention, when the tag keyword associated with the displayed content does not match the utterance keyword, the display control unit may display content selected by an AND search of the tag keyword and the utterance keyword.

With this configuration, the user may speak an utterance keyword about the displayed content that does not match the content's tag keyword. In that case, content selected by an AND search of the utterance keyword and the tag keyword is displayed. The next content can therefore reflect the user's interests and intentions to some extent, which produces a varied content display.

In one aspect of the present invention, the display control unit may display content corresponding to the tag keyword when the tag keyword matches the utterance keyword. When they do not match, it may display content selected by an AND search of the tag keyword and the utterance keyword.

With this configuration, content corresponding to the tag keyword is displayed when the tag keyword and the utterance keyword match. Content selected by the AND search is displayed when they do not. The resulting content display is varied and reflects the user's interests and intentions to some extent.
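
The following Python code is a sketch of the match/AND-search rule described in the preceding paragraphs. It assumes that each content item carries a set of tag keywords. It also interprets the "AND search" as selecting items tagged with both the utterance keyword and at least one tag keyword of the displayed content. That interpretation, and the names Content and select_next, are assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the tag-match / AND-search selection rule.


@dataclass
class Content:
    name: str
    tags: set = field(default_factory=set)  # tag keywords of this content


def select_next(library, displayed, utterance_keyword):
    """Return candidate contents for the next display.

    - Spoken keyword matches a tag of the displayed content:
      return other content carrying that tag.
    - No match: AND search, i.e. content tagged with the spoken keyword
      AND with at least one tag of the displayed content.
    """
    others = [c for c in library if c is not displayed]
    if utterance_keyword in displayed.tags:
        return [c for c in others if utterance_keyword in c.tags]
    return [c for c in others
            if utterance_keyword in c.tags and c.tags & displayed.tags]


beach = Content("beach", {"sea", "summer"})
library = [beach,
           Content("fireworks", {"summer", "night"}),
           Content("harbor_at_night", {"sea", "night"}),
           Content("ski_trip", {"snow", "winter"})]
print([c.name for c in select_next(library, beach, "summer")])  # ['fireworks']
print([c.name for c in select_next(library, beach, "night")])
# prints ['fireworks', 'harbor_at_night']
```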

In one aspect of the present invention, when several different utterance keywords are extracted from the user's utterance, the display control unit may display content selected by an AND search of those utterance keywords.

With this configuration, the content the user wants can be inferred from the multiple utterance keywords in the user's utterance and displayed as the next content.
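
With several utterance keywords, the AND search reduces to a subset test on tag sets. The following Python code is a minimal sketch of that test. It assumes that the content library is a dictionary mapping content names to sets of tag keywords, which is a hypothetical representation.

```python
# Hypothetical sketch: AND search over several utterance keywords.


def and_search(library, keywords):
    """Return the names of items whose tag sets contain every keyword."""
    wanted = set(keywords)
    return [name for name, tags in library.items() if wanted <= tags]


library = {
    "p1": {"dog", "park", "spring"},
    "p2": {"dog", "beach"},
    "p3": {"park", "spring"},
}
print(and_search(library, ["dog", "park"]))  # prints ['p1']
```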

In one aspect of the present invention, the digital photo frame may include a visual recognition state determination unit that determines the user's visual recognition state. When a user's utterance is detected, the display control unit may then change the display mode of the image displayed on the display unit based on both the speech recognition result and the result of determining the user's visual recognition state.

With this configuration, when a user's utterance is detected, display control can reflect both the speech recognition result for that utterance and the user's visual recognition state, which makes the content display more varied.

In one aspect of the present invention, when a user's utterance is detected and the user is determined to be gazing at the display unit, the display control unit may display content selected by a tag keyword associated with the displayed content.

With this configuration, when the user speaks while gazing at the displayed content, the user is assumed to be interested in that content. Content selected by the displayed content's tag keyword can then be displayed.

In one aspect of the present invention, when the user is determined to be gazing at the display unit and the tag keyword matches the utterance keyword, the display control unit may display content selected by the tag keyword.

With this configuration, a user who is gazing at the displayed content may speak an utterance keyword that matches the content's tag keyword. Content selected by that tag keyword can then be displayed next.

In one aspect of the present invention, when a user's utterance is detected but the user is determined not to be gazing at the display unit, the display control unit may display content corresponding to the utterance keyword.

With this configuration, when the user speaks without looking at the displayed content, the user's utterance keyword takes priority, and content corresponding to that keyword can be displayed next.
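
The gaze-dependent rules of the last few paragraphs can be combined into one decision function, sketched below in Python. The sketch assumes that gaze determination and keyword extraction have already been done, and the function name decide_selection is hypothetical. One case is not covered by the source: the user is gazing, but the spoken keyword matches no tag. For that case the sketch simply picks the alphabetically first tag of the displayed content, which is an arbitrary choice for illustration only.

```python
# Hypothetical sketch combining utterance detection, gaze and keywords.


def decide_selection(utterance_detected, gazing, displayed_tags,
                     utterance_keyword=None):
    """Return ("tag", kw) to select by a tag of the displayed content,
    ("utterance", kw) to select by the spoken keyword, or None to leave
    the display unchanged."""
    if not utterance_detected:
        return None
    if gazing:
        if utterance_keyword in displayed_tags:
            return ("tag", utterance_keyword)  # spoken word matches a tag
        if displayed_tags:
            # Arbitrary illustrative choice: alphabetically first tag.
            return ("tag", sorted(displayed_tags)[0])
        return None
    if utterance_keyword is not None:
        return ("utterance", utterance_keyword)  # not looking: follow speech
    return None


print(decide_selection(True, False, {"sea"}, "dog"))  # ('utterance', 'dog')
```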

In one aspect of the present invention, the display control unit may count how many times each extracted utterance keyword appears and display content corresponding to an utterance keyword whose count exceeds a predetermined number.

With this configuration, the displayed content does not change every time the user speaks, which prevents overly frequent switching.
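
The following Python code is a minimal sketch of this counting rule. It assumes that keywords arrive one at a time from speech recognition. The class name KeywordTrigger, the threshold of 2, and the reset of a keyword's count after it fires are hypothetical design choices. The source says only that content is shown for a keyword whose count exceeds a predetermined number.

```python
from collections import Counter

# Hypothetical sketch: fire a keyword only after repeated mentions.


class KeywordTrigger:
    def __init__(self, threshold=2):
        self.threshold = threshold
        self.counts = Counter()

    def observe(self, keyword):
        """Count one occurrence; return the keyword once its count exceeds
        the threshold, otherwise None."""
        self.counts[keyword] += 1
        if self.counts[keyword] > self.threshold:
            self.counts[keyword] = 0  # next switch needs fresh mentions
            return keyword
        return None


trigger = KeywordTrigger(threshold=2)
print([trigger.observe(k) for k in ["cat", "dog", "cat", "cat", "cat"]])
# prints [None, None, None, 'cat', None]
```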

In one aspect of the present invention, a content image may be made up of a plurality of display objects, each with associated tag keywords. When one of those tag keywords matches the utterance keyword, the display control unit may enlarge the display object associated with the matched tag keyword, or display detailed or related information about that display object.

With this configuration, a display object the user is interested in can be enlarged, or its detailed or related information can be displayed.

In one aspect of the present invention, when no tag keyword matches the utterance keyword, the display control unit may enlarge the plurality of display objects one after another.

With this configuration, the display objects are enlarged in sequence even when no tag keyword matches the utterance keyword, which enables a new type of content display.
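
The choice between enlarging the matching display object and enlarging all objects in turn can be sketched in Python as follows. The sketch assumes that each display object is given as an identifier paired with its set of tag keywords. This representation and the function name are hypothetical.

```python
# Hypothetical sketch: which display objects of a content image to enlarge.


def objects_to_enlarge(objects, utterance_keyword):
    """`objects` is a list of (object_id, tag_keyword_set) in display order.
    Return the matching objects if any tag matches the spoken keyword;
    otherwise return every object, to be enlarged one after another."""
    matched = [oid for oid, tags in objects if utterance_keyword in tags]
    return matched if matched else [oid for oid, _ in objects]


photo = [("person", {"grandma"}),
         ("cake", {"birthday", "cake"}),
         ("tree", {"garden"})]
print(objects_to_enlarge(photo, "cake"))  # prints ['cake']
```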

In one aspect of the present invention, the speech recognition unit may perform speaker recognition to identify the user from the user's voice. The display control unit may then display content corresponding to the recognized user.

With this configuration, the user is identified from his or her voice, and content reflecting that user's hobbies, preferences, and the like can be displayed to the user.

In one aspect of the present invention, the speech recognition unit may recognize the user's emotional state from the user's voice. The display control unit may then display content corresponding to the recognized emotional state.

With this configuration, the sound information acquired by the sound sensor is used to display content that reflects the user's emotional state.

In one aspect of the present invention, the digital photo frame may include a positional relationship determination unit that determines the positional relationship between the user and the display unit. When the distance between the user and the display unit is within a predetermined distance, the display control unit may change the display mode of the image displayed on the display unit according to the acquired sound information.

With this configuration, display control based on sound information is executed only after the user has come close enough to the digital photo frame for the user's voice to be detected.
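
The source leaves the distance measurement to later figures. Purely as an illustration, the following Python sketch assumes an ultrasonic distance sensor, one of the user detection sensors listed for the sensor 350, that reports the round-trip time of an echo. The one-way distance is then half the round-trip time multiplied by the speed of sound. The 1.5 m threshold and the function names are hypothetical.

```python
# Hypothetical sketch: gate sound-driven control on an ultrasonic distance.
SPEED_OF_SOUND_M_PER_S = 343.0  # approximate value in air at about 20 °C


def echo_distance_m(round_trip_s):
    """One-way distance to the user: the pulse travels out and back."""
    return SPEED_OF_SOUND_M_PER_S * round_trip_s / 2.0


def sound_control_enabled(round_trip_s, max_distance_m=1.5):
    """Allow sound-driven display control only within `max_distance_m`."""
    return echo_distance_m(round_trip_s) <= max_distance_m


# A 5 ms echo corresponds to roughly 0.86 m, which is close enough here.
print(sound_control_enabled(0.005))  # prints True
```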

In one aspect of the present invention, the display control unit may turn on the display of the display unit based on the acquired sound information.

With this configuration, the sound information acquired by the sound sensor can be used to turn on the display of the digital photo frame.

In one aspect of the present invention, the display control unit may identify a user's operation instruction based on the acquired sound information and perform display control according to the identified operation instruction.

With this configuration, the sound information acquired by the sound sensor can be used to give operation instructions to the digital photo frame.
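
The following Python code is a hypothetical sketch of both uses of the sound information. The display is turned on when the sound level reaches a wake threshold. Once it is on, a recognized word is mapped to an operation instruction. The 50 dB threshold, the command words, and the instruction names are assumptions for illustration only and are not taken from the source.

```python
# Hypothetical sketch: sound-based display wake-up and spoken commands.
COMMANDS = {  # illustrative spoken words -> operation instructions
    "next": "NEXT_IMAGE",
    "back": "PREVIOUS_IMAGE",
    "stop": "PAUSE_SLIDESHOW",
}


class SoundDrivenFrame:
    def __init__(self, wake_level_db=50.0):
        self.wake_level_db = wake_level_db
        self.display_on = False

    def on_sound(self, level_db, recognized_word=None):
        """Turn the display on when the sound level reaches the wake
        threshold; while it is on, return the operation instruction for a
        recognized word (None if the word is not a command)."""
        if level_db >= self.wake_level_db:
            self.display_on = True
        if not self.display_on:
            return None
        return COMMANDS.get(recognized_word)


frame = SoundDrivenFrame()
frame.on_sound(62.0)                  # loud sound turns the display on
print(frame.on_sound(45.0, "next"))   # prints NEXT_IMAGE
```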

Another aspect of the present invention relates to an information processing system that includes a content selection unit that selects content, and a display instruction unit that issues instructions for displaying an image on the display unit of a digital photo frame based on the selected content. The display instruction unit issues a display instruction that changes the display mode of the image displayed on the display unit according to sound information detected by a sound sensor that detects ambient sound.

In this aspect, display control of the display unit of the digital photo frame is instructed so that the display mode of the displayed image changes according to the sound information detected by the sound sensor. This enables display control that reflects the sound information acquired by the sound sensor and allows a wide variety of image presentations.

Another aspect of the present invention relates to a program that causes a computer to function as each of the units described above, or to a computer-readable information storage medium that stores the program.

FIG. 1A and FIG. 1B are diagrams illustrating examples of digital photo frames.
FIG. 2 shows a configuration example of the digital photo frame according to the present embodiment.
FIG. 3A and FIG. 3B are explanatory diagrams of a display control method based on utterance detection.
FIG. 4A and FIG. 4B are explanatory diagrams of a display control method based on the utterance detection result and the visual recognition state determination result.
FIG. 5 is a flowchart for explaining the display control method based on utterance detection.
FIG. 6 is a flowchart for explaining gaze state determination processing.
FIG. 7A and FIG. 7B are an explanatory diagram and a flowchart of another example of the display control method based on utterance detection.
FIGS. 8 to 11 are explanatory diagrams of display control methods based on speech recognition.
FIG. 12 is a flowchart for explaining a first display control method based on speech recognition.
FIG. 13 is a flowchart for explaining a second display control method based on speech recognition.
FIG. 14 is a flowchart for explaining a method of selecting display content by counting the number of appearances of utterance keywords.
FIG. 15A and FIG. 15B are explanatory diagrams of a display control method used when a content image is composed of a plurality of display objects.
FIG. 16 is a flowchart for explaining the display control method used when a content image is composed of a plurality of display objects.
FIG. 17 is an explanatory diagram of a display control method using user registration information.
FIG. 18A and FIG. 18B are explanatory diagrams of a display control method using user registration information.
FIG. 19A and FIG. 19B are explanatory diagrams of a display control method using the positional relationship with the user.
FIG. 20A to FIG. 20C are explanatory diagrams of methods for detecting the positional relationship with the user.
FIG. 21 is a flowchart for explaining a method of turning on the display and identifying an operation instruction based on sound information.
FIG. 22 shows a modification of the system configuration of the present embodiment.

The present embodiment is described below. The embodiment described below does not unduly limit the content of the present invention set forth in the claims, and not every configuration described in the embodiment is necessarily an essential element of the present invention.

1. Configuration
FIG. 1A shows an example of a digital photo frame 300 (digital photo player, image playback device) according to the present embodiment. FIG. 1A illustrates a so-called photo-stand-type digital photo frame. The user places the digital photo frame 300 at any location, such as in a house. It plays back content information such as digital image data and sound data (image playback and sound playback). The digital photo frame 300 can play back content information (media information) such as images automatically, for example without an explicit playback instruction from the user. For example, it can run a photo slideshow or play back video automatically.

Although FIG. 1A shows a photo-stand-type digital photo frame, a wall-mounted type may also be used, as shown in FIG. 1B. A wall-mounted digital photo frame of this kind can be implemented with, for example, electronic paper based on an electrophoretic display. The digital photo frame 300 may also have a button for instructing playback of content information, or playback may be instructed with a remote controller.

The digital photo frame 300 can include an interface for a memory card such as an SD card. Alternatively, it can include a wireless communication interface such as wireless LAN or Bluetooth, or a wired communication interface such as USB. For example, when the user stores content information on a memory card and inserts it into the memory card interface of the digital photo frame 300, the digital photo frame 300 automatically plays back the stored content information (as a slideshow or the like). Alternatively, when the digital photo frame 300 receives content information from outside through wireless or wired communication, it plays back that content information automatically. For example, if the user owns a portable electronic device such as a digital camera or a mobile phone with a wireless function such as Bluetooth, content information can be transferred from that device to the digital photo frame 300 using the wireless function. The digital photo frame 300 then plays back the transferred content information.

FIG. 2 shows a configuration example of the digital photo frame 300 (an image display device in a broad sense) according to the present embodiment. The digital photo frame 300 includes a processing unit 302, a storage unit 320, a communication unit 338, a display unit 340, a sensor 350, and an operation unit 360. Various modifications are possible, such as omitting some of these components (for example, the communication unit, the operation unit, or the sensor) or adding other components (for example, a speaker).

The processing unit 302 performs various control and arithmetic processes. For example, it controls the units of the digital photo frame 300 described above, such as the storage unit 320 and the display unit 340, and also performs overall control. The functions of the processing unit 302 can be implemented by hardware such as various processors (a CPU, etc.) or an ASIC (a gate array, etc.), by a program stored in the information storage medium 330 connected to the processing unit 302, and the like.

The storage unit 320 serves as a work area for the processing unit 302, the communication unit 338, and the like. Its functions can be implemented by memory such as RAM, an HDD (hard disk drive), or the like. The storage unit 320 includes the following:
- a content information storage unit 322 that stores content information such as images and sounds;
- a sound information storage unit 324 that stores acquired sound information;
- a detection information storage unit 325 that stores acquired detection information;
- a registered keyword storage unit 326 that stores registered keywords (general registered keywords and user registered keywords);
- a user state storage unit 327 that stores the identified user state;
- a user information storage unit 328 that stores user information such as user registration information and sensitivity model information.

The information storage medium 330 (a computer-readable medium) stores programs, data, and the like. Its functions can be implemented by a memory card, an optical disk, or the like. The processing unit 302 performs the various processes of the present embodiment based on the program (data) stored in the information storage medium 330. That is, the information storage medium 330 stores a program that causes a computer (a device including an operation unit, a processing unit, a storage unit, and an output unit) to function as each unit of the present embodiment, in other words a program that causes the computer to execute the processing of each unit.

The communication unit 338 (communication interface) exchanges information with external devices (for example, a server or a portable electronic device) by wireless or wired communication. Its function can be realized by hardware such as a communication ASIC or a communication processor, or by communication firmware.

The display unit 340 displays images serving as content information. It can be realized by, for example, a liquid crystal display, a display using light-emitting elements such as organic EL elements, or an electrophoretic display. The display unit 340 may also be configured as a touch panel display (touch screen).

The sensor 350 (a sound sensor, a user detection sensor, etc.) is a device that outputs detection information based on its detection results. A sound sensor, for example, can be used as the sensor 350. A sound sensor converts sound into an electrical signal or the like; a typical example is a microphone, which measures sound pressure, the physical quantity of sound. Usable microphones include moving-coil and ribbon dynamic microphones, condenser microphones that detect changes in capacitance caused by the vibration of the sound signal, piezoelectric microphones that use the piezoelectric effect, and carbon microphones.

A user detection sensor can also be used as the sensor 350. Usable user detection sensors include human presence sensors such as pyroelectric sensors, imaging sensors such as CCD and CMOS sensors, distance sensors such as ultrasonic sensors, and motion sensors that detect the user's movements (movements of the hands and body).

A pyroelectric sensor receives infrared radiation emitted by a person or the like, converts it into heat, and converts that heat into electric charge through the pyroelectric effect of the element. A pyroelectric sensor makes it possible to detect whether a user (person) is present in the detection range (detection area), how a user in the detection range is moving, how many users are in the detection range, and so on. An imaging sensor (image sensor) is an optical sensor that converts one- or two-dimensional optical information into a time-series electrical signal. An imaging sensor likewise makes it possible to detect whether a user is present in the detection range, the movements of users in the detection range, and the number of users in the detection range. Face image recognition using the imaging sensor also enables person authentication of the user. Further, face detection using the imaging sensor makes it possible to detect positional relationships such as the distance between the user and the display unit 340 and the angle of the user's line of sight relative to the display unit 340. It is also possible to detect the user's viewing state, such as whether the display unit 340 is within the user's field of view and whether the user is gazing at the display unit 340, as well as whether the user is approaching.

The sensor 350 may be the sensor device itself, or it may be sensor equipment that includes a control unit, a communication unit, and the like in addition to the sensor device. The detection information may be primary information obtained directly from the sensor, or secondary information obtained by processing the primary information.

The sensor 350 may be attached directly to the digital photo frame 300, or a home sensor or the like may be used as the sensor 350. When the sensor 350 is attached to the digital photo frame 300, it can be mounted on, for example, the frame portion of the digital photo frame 300, as shown in FIG. 1(A). Alternatively, the sensor 350 and the digital photo frame 300 may be connected by a wired cable or the like.

The operation unit 360 allows the user to input various kinds of information and can be realized by, for example, operation buttons or a remote control device. Using the operation unit 360, the user can register as a user and register the playback content (favorite images) that the user wants; for example, the user can enter user registration information. When the display unit 340 is configured as a touch panel display, the display unit 340 also serves as the operation unit 360.

The processing unit 302 includes a sound information acquisition unit 304, a detection information acquisition unit 305, an utterance detection unit 306, a voice recognition unit 307, a user state determination unit 310, a registration processing unit 314, a content selection unit 316, and a display control unit 318. Various modifications are possible, such as omitting some of these components (for example, the detection information acquisition unit, the user state determination unit, the registration processing unit, and the content selection unit) or adding other components.

The sound information acquisition unit 304 acquires the sound information detected by the sound sensor, which is one of the sensors 350. For example, when the sound sensor detects sound such as voices or music (ambient sound) and outputs sound information as the detection result, the sound information acquisition unit 304 captures that sound information, which is then stored in the sound information storage unit 324 of the storage unit 320. Similarly, when the user detection sensor, which is another of the sensors 350, detects the user state or the like and outputs detection information (imaging information or the like) as the detection result, the detection information acquisition unit 305 captures that detection information, which is then stored in the detection information storage unit 325.

When an external sensor such as a home sensor is used as the sensor 350, the communication unit 338 receives the sound information and the detection information, and the sound information acquisition unit 304 and the detection information acquisition unit 305 acquire the received sound information and detection information.

The utterance detection unit 306 detects the user's utterance (conversation) based on the sound information acquired by the sound sensor serving as the sensor 350. For example, it extracts sounds having the frequency components and amplitude (power) characteristic of human speech, thereby distinguishing the user's utterance from noise. Specifically, the user's utterance is detected by, for example, applying filter processing that passes a specific frequency band to sounds at or above a certain amplitude level. The detection target based on the sound information is not limited to the user's utterance; it may also be incidental sounds, noise, everyday household sounds, or sounds made by the user such as coughs, sneezes, or applause.
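The two-stage idea above (an amplitude gate followed by a frequency-band check) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the embodiment's actual filter: it assumes one frame of mono float samples, replaces a real band-pass filter with a naive DFT energy ratio, and the function names, the 300 to 3400 Hz band, and both thresholds are hypothetical choices.

```python
import math


def band_energy_ratio(frame, sample_rate, low_hz=300.0, high_hz=3400.0):
    """Fraction of the frame's spectral energy that lies in [low_hz, high_hz].

    Uses a naive DFT over the positive-frequency bins (DC skipped); fine for
    short frames, far too slow for production use.
    """
    n = len(frame)
    total = 0.0
    in_band = 0.0
    for k in range(1, n // 2):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        power = re * re + im * im
        total += power
        if low_hz <= k * sample_rate / n <= high_hz:
            in_band += power
    return in_band / total if total > 0 else 0.0


def is_utterance(frame, sample_rate, level_threshold=0.02, ratio_threshold=0.6):
    """Hypothetical detector: loud enough AND energy concentrated in the voice band."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    if rms < level_threshold:
        return False  # below the amplitude gate: treat as silence
    return band_energy_ratio(frame, sample_rate) >= ratio_threshold
```

A pure tone is of course not speech; a practical detector would look at many consecutive frames and richer features, but the order of the checks (level gate first, then band check) mirrors the description.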

The voice recognition unit 307 performs voice recognition of the user's utterance based on the sound information acquired by the sound sensor, and extracts the user's utterance keywords (words) through that recognition. Alternatively, it performs speaker recognition on the user's uttered voice or recognizes the user's emotional state.

Here, voice recognition is processing in which a computer analyzes the spoken language of a human (performer or user) and extracts it as text data. Voice recognition is realized by acoustic analysis, a recognition decoder, an acoustic model, a dictionary, and a language model. Acoustic analysis converts human speech into feature information through signal processing such as Fourier analysis. The recognition decoder outputs text data based on the feature information; specifically, it converts speech into text by jointly evaluating acoustic information and linguistic information. The decision processing in the recognition decoder is realized by statistical methods such as hidden Markov models or dynamic time warping. The dictionary is the set of words to be recognized, in data form, and associates phoneme sequences with words. The language model is data on the probabilities of the words in the dictionary; specifically, it encodes the occurrence probability and connection probability of each word. Voice recognition using this acoustic analysis, recognition decoder, acoustic model, dictionary, and language model makes it possible to properly extract the user's utterance keywords from the sound information acquired by the sound sensor. The utterance detection by the utterance detection unit 306 can also be realized using the voice recognition of the voice recognition unit 307; that is, the user's utterance is detected by making use of the functions of the voice recognition engine.
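Of the decoder techniques named above, dynamic time warping is the easiest to show in isolation. The sketch below assumes each utterance has already been reduced to a sequence of scalar features (a real system would use feature vectors such as MFCCs) and that recognition is a simple nearest-template search; `dtw_distance`, `recognize_word`, and the template values are hypothetical and not taken from the source.

```python
def dtw_distance(seq_a, seq_b):
    """Classic dynamic time warping distance between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = best cumulative cost aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(seq_a[i - 1] - seq_b[j - 1])
            # extend the cheapest of: insertion, deletion, or match
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize_word(features, templates):
    """Return the template word whose DTW distance to `features` is smallest."""
    return min(templates, key=lambda word: dtw_distance(features, templates[word]))


templates = {"yes": [1, 5, 1], "no": [3, 3, 3]}
print(recognize_word([1, 1, 5, 5, 1], templates))  # slower "yes" still matches
```

Because DTW lets one template sample align with several input samples, a word spoken more slowly still matches its template; an HMM-based decoder, the other method named above, handles the same variability statistically.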

The user state determination unit 310 determines the user state based on the detection information acquired by the detection information acquisition unit 305. For example, based on that detection information, it determines the user's viewing state with respect to the display unit 340, the positional relationship between the user (person) and the display unit 340, and so on. Information on the user state representing the viewing state and positional relationship is stored in the user state storage unit 327.

Here, the viewing state refers to the state of the user's field of view, the gaze state, and the like; specifically, whether the display unit 340 is within the user's field of view (view volume) and whether the user is gazing at the display unit 340. The viewing state determination unit 311 determines the user's viewing state; for example, it determines whether the user is gazing at the display unit 340.

The positional relationship refers to the distance between the user and the display unit 340, the direction of the user's line of sight relative to the display unit 340, and the like, and is determined by the positional relationship determination unit 312. For example, the distance (distance information or a distance parameter) between the user and the display unit 340 is determined as the positional relationship.

Suppose, for example, that an imaging sensor that captures images of the user is provided as the sensor 350. In this case, the user state determination unit 310 (positional relationship determination unit) detects the user's face area (a rectangular frame area) based on the imaging information from the imaging sensor, and determines (estimates) the distance between the user and the display unit 340 from the size of the detected face area. The user state determination unit 310 also sets a measurement area that contains the detected face area and is larger than it, that is, a measurement area that overlaps the face area. It then measures how long the face area stays within the measurement area and determines, based on the measured time, whether the user is gazing at the display unit 340. For example, if the face area stays within the measurement area for a predetermined time or longer, the user is determined to have been gazing.
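The face-size distance estimate and the measurement-area dwell timer described above can be sketched as follows. This is a minimal illustration under assumptions not stated in the source: a pinhole camera model with a hypothetical focal length of 600 px and an average face width of 0.16 m, a measurement area that extends the face rectangle by half its size on each side, and a 2-second gaze threshold. `Rect`, `GazeTimer`, and the other names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    x: float
    y: float
    w: float
    h: float

    def contains(self, other: "Rect") -> bool:
        """True if `other` lies entirely inside this rectangle."""
        return (self.x <= other.x and self.y <= other.y
                and other.x + other.w <= self.x + self.w
                and other.y + other.h <= self.y + self.h)


def estimate_distance_m(face_width_px, focal_length_px=600.0, real_face_width_m=0.16):
    # Pinhole-camera assumption: apparent face width shrinks in proportion to distance.
    return focal_length_px * real_face_width_m / face_width_px


def measurement_area(face: Rect, margin: float = 0.5) -> Rect:
    # Enlarge the detected face rectangle by `margin` of its size on every side.
    dx, dy = face.w * margin, face.h * margin
    return Rect(face.x - dx, face.y - dy, face.w + 2 * dx, face.h + 2 * dy)


class GazeTimer:
    def __init__(self, gaze_seconds=2.0):
        self.gaze_seconds = gaze_seconds
        self.area = None
        self.dwell = 0.0

    def update(self, face, dt):
        """Feed one face detection (or None) every dt seconds; True while gazing."""
        if face is None:
            self.area, self.dwell = None, 0.0
        elif self.area is not None and self.area.contains(face):
            self.dwell += dt
        else:
            # Face appeared or left the area: restart the measurement around it.
            self.area, self.dwell = measurement_area(face), 0.0
        return self.dwell >= self.gaze_seconds
```

Making the measurement area larger than the face tolerates small head movements, so a viewer who shifts slightly while looking at the frame is still counted as gazing.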

Alternatively, the user state determination unit 310 may determine the distance between the user and the display unit 340 by automatic focusing on the user (an autofocus function). For example, with an active method, a device that emits infrared rays or ultrasonic waves is provided on the digital photo frame 300 or the like, and an infrared or ultrasonic receiving sensor is provided as the sensor 350. The distance to the user can then be detected by receiving the waves reflected from the user with the receiving sensor. With a passive method, an imaging sensor is provided as the sensor 350, and the distance to the user can be detected by applying phase-difference detection or contrast detection image processing to the captured images.
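For the passive contrast-detection variant, the underlying idea is that the image is sharpest at the focus setting that matches the subject's distance. A rough sketch, assuming grayscale frames given as nested lists, each captured at a known focus distance, and using the sum of squared horizontal neighbour differences as the contrast measure; real autofocus systems sweep the lens and use more robust focus measures, and all names here are hypothetical.

```python
def sharpness(image):
    """Contrast measure: sum of squared differences between horizontal neighbours."""
    return sum((row[i + 1] - row[i]) ** 2 for row in image for i in range(len(row) - 1))


def contrast_autofocus(frames_by_focus_distance):
    """Given {focus_distance_m: grayscale image}, return the distance with the sharpest image."""
    return max(frames_by_focus_distance,
               key=lambda d: sharpness(frames_by_focus_distance[d]))
```

The returned focus distance then serves as the estimate of the distance between the user and the frame.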

The registration processing unit 314 performs user registration processing, such as setting user registration information. Specifically, when the user enters user registration information using the operation unit 360 or the like on a user registration screen or the like, the entered information is associated with the user and stored in the user information storage unit 328. The user registration information can include, for example, the user's ID and password and customization information for the images displayed on the display unit 340. The display unit 340 then displays images that reflect the user registration information.

During user registration, the user's voice may also be registered as user registration information; for example, feature information of the user's voice is stored in the user information storage unit 328 as user registration information. When the voice recognition unit 307 performs speaker recognition, it compares the sound information acquired by the sound sensor with the registered voice feature information to determine whether the speaking user is a registered user. Alternatively, an imaging sensor may be provided as the sensor 350 to capture images of the user, and feature information of the user's face image may be registered as user registration information. In this case, person authentication is performed by comparing the image information acquired by the imaging sensor with the face image feature information to determine whether the imaged user is a registered user.

The content selection unit 316 selects the content to be presented to the user. For example, based on the results of utterance detection and voice recognition performed on the acquired sound information, it reads the corresponding content information from the content information storage unit 322 and selects the content to present to the user. Alternatively, it selects content by receiving content information from an external server such as a home server via the communication unit 338.

The display control unit 318 controls the display of the display unit 340. For example, when the content selection unit 316 selects content by reading content information from the content information storage unit 322 or by receiving content information through the communication unit 338, the display control unit 318 performs control to display an image of the selected content on the display unit 340.

In the present embodiment, when the sound information acquisition unit 304 acquires the sound information detected by the sound sensor, the display control unit 318 performs display control that changes the display mode of the image displayed on the display unit 340 according to the acquired sound information. Changing the display mode means changing the content of the image itself, or changing the timing, duration, method, or the like with which the image is displayed. Examples of such display control by the display control unit 318 include changing the displayed image (content image) itself; changing the display timing, display time, or switching speed; changing the display method (a slide transition such as fade, wipe, or scroll); changing the contrast, color tone, or image effect (for example, a retro look) of the image; changing the enlargement or reduction ratio of the image; and changing the number of screen divisions. The display mode of the image may also be changed according to the speed, tone, power, or the like of the utterance (conversation), or speaker recognition may be performed on the utterance and the image displayed in a mode suited to the recognized user. Display control may also be performed according to keywords extracted by voice recognition of the utterance, or, when an emotion such as joy, anger, tension, or calm is recognized from the uttered voice, according to that emotion. Further, the screen may be zoomed according to the utterance (conversation), slide switching may be sped up when a topic related to the content is detected, and when the conversation is fast, display switching may be slowed or the display time lengthened.

When the utterance detection unit 306 detects the user's utterance based on the sound information acquired by the sound sensor, the display control unit 318 may also change the display mode of the image displayed on the display unit 340 based on the utterance detection result (for example, whether an utterance has been detected). For example, when the user's utterance is detected, detailed information or related information about the display content, that is, the content being displayed at the time of detection, is displayed, such as detailed information about the theme set for the display content or information related to that theme.

When the viewing state determination unit 311 determines the user's viewing state, the display control unit 318 may change the display mode of the image displayed on the display unit 340 based on both the utterance detection result and the viewing state determination result. For example, the display mode of the displayed image is changed according to whether the user is speaking and whether the user is gazing at the display unit 340.

The display control unit 318 may also control the content display timing based on the utterance detection result. For example, when the utterance detection unit 306 (voice recognition unit) detects that the user is not speaking, the slide display of the content switches to the next content (slide). Specifically, while the user is determined to be speaking, the current content (current slide) remains displayed, and when the period without speech exceeds a predetermined time, the display switches to the next content (next slide).
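The silence-driven slide switching above amounts to a small timer state machine. A minimal sketch, assuming the caller supplies a per-tick `speaking` flag (for example from an utterance detector) and the elapsed time; the class name and the 5-second limit are hypothetical.

```python
class SlideShow:
    """Advance to the next slide only after a period without speech."""

    def __init__(self, slides, silence_limit=5.0):
        self.slides = slides
        self.index = 0
        self.silence_limit = silence_limit  # seconds of silence before advancing
        self.silence = 0.0

    def tick(self, speaking: bool, dt: float) -> str:
        if speaking:
            self.silence = 0.0  # keep the current slide while the user talks
        else:
            self.silence += dt
            if self.silence > self.silence_limit:
                self.index = (self.index + 1) % len(self.slides)
                self.silence = 0.0  # restart the silence timer for the new slide
        return self.slides[self.index]
```

Resetting the timer on every spoken tick means that a lively conversation about a photo keeps it on screen indefinitely, which is the behaviour the paragraph describes.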

When the voice recognition unit 307 performs voice recognition of the user's utterance, the display control unit 318 may change the display mode of the image displayed on the display unit 340 based on the recognition result. For example, when an utterance keyword, that is, a keyword that can be extracted from the user's conversation, is obtained by voice recognition, content selected based on the extracted utterance keyword is displayed; specifically, content whose theme is the utterance keyword, or content retrieved by searching with the utterance keyword.

Alternatively, when a tag keyword, which is a keyword associated with the display content, matches an utterance keyword, the display control unit 318 displays content corresponding to the tag keyword, for example content whose theme is that keyword or content retrieved by searching with the tag keyword. When the tag keyword and the utterance keyword do not match, content selected by an AND search of the tag keyword and the utterance keyword is displayed. For example, tag keywords are stored in the content information storage unit 322 in association with the display content. When one or more utterance keywords are extracted from the user's utterance, they are checked to see whether any of them matches the tag keyword. If one does, the content associated with that tag keyword is displayed. If none does, content is selected by an AND search of the tag keyword and the utterance keywords and displayed, for example content whose themes include both the tag keyword and the utterance keywords. When several different utterance keywords are extracted from the user's utterance, content selected by an AND search of those keywords may also be displayed, for example content whose theme is the combination of those keywords.
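The match-or-AND-search rule above can be sketched as follows. The sketch assumes a hypothetical in-memory `library` mapping content IDs to sets of theme keywords, a single tag keyword for the displayed content, and treats "AND search" as requiring every keyword to be among a content item's themes; none of these data structures come from the source.

```python
def select_contents(library, tag_keyword, utterance_keywords):
    """library: {content_id: set of theme keywords}. Returns matching ids, sorted."""
    if tag_keyword in utterance_keywords:
        # A spoken word equals the displayed content's tag: search by that tag alone.
        wanted = {tag_keyword}
    else:
        # No match: AND search on the tag keyword plus every spoken keyword.
        wanted = {tag_keyword} | set(utterance_keywords)
    return sorted(cid for cid, themes in library.items() if wanted <= themes)


library = {
    "p1": {"beach", "summer"},
    "p2": {"beach"},
    "p3": {"summer", "festival"},
}
print(select_contents(library, "beach", ["beach", "hot"]))  # tag matched
print(select_contents(library, "summer", ["festival"]))     # AND search
```

The AND search narrows the result to content that connects what is on screen with what is being talked about, rather than jumping to an unrelated topic.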

When the utterance detection unit 306 detects the user's utterance, the display control unit 318 may also change the display mode of the image displayed on the display unit 340 based on the voice recognition result from the voice recognition unit 307 and the viewing state determination result from the viewing state determination unit 311. For example, when the user's utterance is detected and the user is determined to be gazing at the display unit 340, content selected by the tag keyword associated with the content displayed at the time of detection is displayed. Specifically, when the user is gazing and the tag keyword matches an utterance keyword, content corresponding to the tag keyword (content retrieved by the tag keyword or content related to the tag keyword) is displayed. On the other hand, when the user's utterance is detected but the user is not gazing at the display unit 340, content corresponding to the utterance keyword is displayed, for example content retrieved by the utterance keyword or by an AND search of the utterance keyword and the tag keyword.
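The decision above, combining utterance detection with the gaze judgment, can be reduced to choosing which keyword set to search with. A minimal sketch under hypothetical names, using the same assumed `{content_id: set of themes}` library as a plain AND search; it shows only the plain utterance-keyword branch for a non-gazing user, not the alternative AND-with-tag variant.

```python
def query_keywords(utterance_detected, gazing, tag_keyword, utterance_keywords):
    """Return the keyword set to search content with, or None to leave the display as is."""
    if not utterance_detected:
        return None
    if gazing:
        # The viewer is looking at the frame: the talk is likely about the shown content.
        return {tag_keyword}
    # The viewer is talking but not looking: follow the conversation instead.
    return set(utterance_keywords)


def search(library, keywords):
    """AND search: ids whose themes contain every keyword."""
    return sorted(cid for cid, themes in library.items() if keywords <= themes)
```

Using gaze as the tie-breaker reflects the assumption in the text that speech while looking at the screen most likely refers to what is shown.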

Suppose further that a content image contains a plurality of display objects and that a tag keyword is associated with each display object in the image. When the tag keyword associated with a display object matches an utterance keyword, the display control unit 318 enlarges the display object associated with the matching tag keyword, or displays detailed information or related information about that display object. For example, if first to Nth tag keywords are associated with first to Nth display objects (N being an integer of 2 or more) and the Kth tag keyword (1 ≤ K ≤ N) matches an utterance keyword, the Kth display object is enlarged or its detailed information or related information is displayed. If none of the tag keywords of the first to Nth display objects matches an utterance keyword, the first to Nth display objects are enlarged one after another in order.
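The per-object rule above (enlarge the Kth object on a match, otherwise enlarge all N in turn) can be sketched as a function that returns the zoom order. It assumes display objects are given as hypothetical `(object_id, tag_keyword)` pairs in display order; the actual enlargement and the detail display are left out.

```python
def plan_zoom(objects, utterance_keywords):
    """objects: list of (object_id, tag_keyword) in display order.

    Returns the object ids to enlarge, in the order they should be enlarged.
    """
    spoken = set(utterance_keywords)
    hits = [oid for oid, tag in objects if tag in spoken]
    # No tag matched any spoken keyword: zoom through every object in order.
    return hits if hits else [oid for oid, _ in objects]
```

Returning an ordered list keeps the matched and unmatched cases in one code path: the caller simply enlarges the listed objects one after another.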

The display control unit 318 may also count the number of times each extracted utterance keyword appears and display content corresponding to an utterance keyword whose count exceeds a predetermined number. For example, a histogram of the utterance keywords extracted by voice recognition is built, and utterance keywords whose appearance count (frequency) exceeds the predetermined number are detected. Content corresponding to a detected utterance keyword (content retrieved by that keyword) is then selected and displayed. When several utterance keywords exceed the predetermined number, content corresponding to the keywords with higher counts may, for example, be displayed with priority.
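The keyword histogram above maps directly onto Python's `collections.Counter`. A minimal sketch, assuming the recognized keywords arrive as a list of strings; the function name and the threshold of 3 are hypothetical.

```python
from collections import Counter


def frequent_keywords(keywords, threshold=3):
    """Keywords appearing more than `threshold` times, most frequent first."""
    counts = Counter(keywords)
    return [kw for kw, n in counts.most_common() if n > threshold]


spoken = ["sea"] * 5 + ["cake"] * 4 + ["dog"] * 2
print(frequent_keywords(spoken))  # "sea" and "cake" exceed the threshold
```

Because `most_common()` sorts by count, the first element of the result is the keyword whose content would be shown with priority.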

Suppose that the voice recognition unit 307 performs speaker recognition on the user's uttered voice; specifically, speaker recognition (user recognition) is performed based on the voice feature information of the users registered in the user information storage unit 328 and the acquired voice information. The display control unit 318 then displays content corresponding to the recognized speaker. For example, if the user's hobbies and preferences are registered as user registration information, content matching that registration information is selected and displayed. Alternatively, when the voice recognition unit 307 recognizes the user's emotional state from the uttered voice, the display control unit 318 may display content corresponding to the recognized emotional state. For example, content that comforts the user is displayed when the user is sad, and content that relaxes the user is displayed when the user is under stress. When the user is in high spirits or laughter is detected, content that livens up the mood even further is displayed.

Suppose also that the positional relationship determination unit 312 determines the positional relationship between the user and the display unit 340, specifically the distance between them. The display control unit 318 then changes the display mode of the displayed image according to the acquired sound information only when the distance between the user and the display unit 340 is within a predetermined distance. For example, when the user is farther away than the predetermined distance, any detected sound is treated as noise and display control based on sound information is not performed; once the user comes closer than the predetermined distance, the sound-based display control described above is performed. The distance between the user and the display unit 340 need not be the distance itself; it may be any parameter equivalent to the distance.
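A minimal sketch of the distance gate follows, assuming either a measured distance in metres or, as a distance-equivalent parameter, the width of the detected face in pixels. The thresholds and function names are hypothetical and not taken from the specification.

```python
def user_within_range(distance_m=None, face_width_px=None,
                      max_distance_m=1.5, min_face_width_px=80):
    """True if the user is close enough for sound-based display control."""
    if distance_m is not None:
        return distance_m <= max_distance_m
    if face_width_px is not None:
        # Distance-equivalent parameter: a wider face in the image means a closer user.
        return face_width_px >= min_face_width_px
    return False  # position unknown: do not react to sound


def handle_sound(sound_info, apply_display_change, **position):
    """Apply sound-based display control only when the user is within range."""
    if not user_within_range(**position):
        return None  # user too far away: treat the sound as noise
    return apply_display_change(sound_info)
```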

The display control unit 318 may also turn on the display of the display unit 340 on the basis of the acquired sound information. For example, when the detected sound is at or above a certain level, the display of the display unit 340 is turned on and a content image is displayed. Then, for example, when a silent period lasts for a predetermined time or longer, the display of the display unit 340 is turned off. A user's operation instruction may also be identified from the acquired sound information, and display control performed according to the identified instruction. For example, user instructions such as "forward", "back", and "stop" for a content slide show are identified from the sound information, and the slide show is controlled accordingly.
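The power and slide-show control described above can be sketched as a small state machine. This is an illustration under stated assumptions: sound levels are normalised to 0..1, the wake level and 60-second silence timeout are arbitrary, and the spoken commands are assumed to have already been recognised by some upstream component.

```python
class SoundDrivenFrame:
    """Display power and slide-show state driven by sound events (illustrative only)."""

    def __init__(self, n_slides, wake_level=0.1, silence_timeout_s=60.0):
        self.n_slides = n_slides
        self.wake_level = wake_level            # hypothetical "certain level"
        self.silence_timeout_s = silence_timeout_s
        self.on = False
        self.playing = True
        self.slide = 0
        self.last_sound_t = None

    def on_sound(self, t, level, command=None):
        """Handle one sound event; `command` is assumed to come from a recognizer."""
        if level >= self.wake_level:
            self.on = True                      # loud enough: display on
            self.last_sound_t = t
        if not self.on or command is None:
            return
        if command == "forward":
            self.slide = (self.slide + 1) % self.n_slides
        elif command == "back":
            self.slide = (self.slide - 1) % self.n_slides
        elif command == "stop":
            self.playing = False

    def tick(self, t):
        """Turn the display off after a long enough silent period."""
        if (self.on and self.last_sound_t is not None
                and t - self.last_sound_t >= self.silence_timeout_s):
            self.on = False
```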

2. Display Control by Utterance Detection
In this embodiment, various kinds of display control can be performed on the basis of the result of detecting the user's utterances. FIG. 3(A) schematically shows an example of an utterance detection result. When frequency analysis of the detected sound shows that its frequency lies within a predetermined frequency band and its amplitude is at or above a predetermined level, the detected sound is judged to be the user's speech rather than noise such as incidental sounds or background noise. Then, as shown in FIG. 3(B), detailed information or related information about the content being displayed when the utterance was detected is displayed; that is, detailed information about the theme (tag keyword) of the displayed content, or information associated with that theme. For example, if the displayed content is a photograph of Hokkaido (the content's theme is Hokkaido), detailed information about Hokkaido (tourist information, information on local specialties, and so on) or related information (information associated in advance with the Hokkaido theme) is displayed.
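The utterance test of FIG. 3(A) can be illustrated as follows. This sketch uses a naive discrete Fourier transform over a short frame to find the dominant frequency, then checks that frequency against a band and the frame's peak amplitude against a level. The 80 to 1000 Hz band and the 0.05 amplitude threshold are hypothetical placeholders for the "predetermined" values; a real device would use an FFT and a more robust voice-activity detector.

```python
import math


def dominant_frequency(samples, sample_rate):
    """Frequency (Hz) of the strongest non-DC bin of a naive DFT."""
    n = len(samples)
    best_k, best_mag = 0, -1.0
    for k in range(1, n // 2):  # skip bin 0 (DC); stop below Nyquist
        re = sum(s * math.cos(2 * math.pi * k * i / n) for i, s in enumerate(samples))
        im = sum(s * math.sin(2 * math.pi * k * i / n) for i, s in enumerate(samples))
        mag = math.hypot(re, im)
        if mag > best_mag:
            best_k, best_mag = k, mag
    return best_k * sample_rate / n


def is_utterance(samples, sample_rate, band=(80.0, 1000.0), min_amplitude=0.05):
    """Speech if loud enough and the dominant frequency lies inside `band`."""
    if not samples or max(abs(s) for s in samples) < min_amplitude:
        return False  # too quiet: treated as no utterance
    return band[0] <= dominant_frequency(samples, sample_rate) <= band[1]
```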

When the user's viewing state, such as a gazing state, has been determined, display control may reflect that determination in addition to the utterance detection result. For example, in FIG. 4(A), the sound sensor serving as the sensor 350 detects the user's utterance, and it is also determined whether the user is gazing at the display unit 340. When, for example, both an utterance and the user's gaze are detected within a predetermined time, detailed or related information about the content displayed at that time is displayed. In this way, detailed and related information can be conveyed to a user who speaks out of interest in the displayed content or who is gazing at it, realizing a novel kind of digital photo frame.

The user's gazing state can be detected, for example, with an image sensor serving as the sensor 350. As shown in FIG. 4(B), the user's face area FAR is detected from the imaging information of the image sensor. Next, a measurement area SAR corresponding to the detected rectangular face area FAR is set. The measurement area SAR contains the face area FAR and is larger than it; it can be set, for example, by oversizing the face area FAR. The time during which the face area FAR remains within the measurement area SAR is then measured, and whether the user is gazing at the display unit 340 is determined from the measured time. For example, when the face area FAR has stayed within the measurement area SAR for a certain time or longer, the user is judged to be gazing at the display unit 340.

Next, the display control method based on utterance detection is described with reference to the flowchart of FIG. 5. First, content is displayed (step S1), for example content selected at random or on the basis of user information (personal information of a registered user). It is then determined whether the sound sensor has detected an utterance by the user; if so, it is determined whether the user is gazing at the display unit 340 (the digital photo frame or the displayed content) (steps S2 and S3). If the user is gazing, content giving detailed or related information about the currently displayed content is selected and displayed (step S4).
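The combination of utterance detection and gaze in FIGS. 4(A) and 5 can be sketched as follows, assuming detection events arrive as timestamps in seconds. The 5-second window standing in for the "predetermined time" and the `detail_of` table mapping a content item to its detail or related content are hypothetical.

```python
def utterance_and_gaze_coincide(utterance_times, gaze_times, window_s=5.0):
    """True if some utterance and some gaze detection lie within `window_s` of each other."""
    return any(abs(u - g) <= window_s for u in utterance_times for g in gaze_times)


def next_content(current, utterance_times, gaze_times, detail_of, window_s=5.0):
    """Steps S2-S4: switch to detail/related content only on utterance plus gaze."""
    if utterance_and_gaze_coincide(utterance_times, gaze_times, window_s):
        return detail_of.get(current, current)  # hypothetical content -> detail table
    return current  # keep showing the content selected in step S1
```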

Next, the gazing-state detection process is described with reference to the flowchart of FIG. 6. First, a face area (frame area) is detected by face detection using the image sensor (camera) (step S21). Next, a measurement area that contains the detected face area and is larger than it is set (step S22); that is, as shown in FIG. 4(B), a measurement area oversized from the face area is set. A timer then measures the time during which the face area remains within the measurement area (step S23): timing starts once the measurement area is set and continues while the face area stays inside it. Finally, it is determined whether the predetermined time or longer has elapsed (step S24); if it has, the user is judged to be in the gazing state.
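Steps S21 to S24 can be sketched as below, assuming face detection itself is done elsewhere and yields one axis-aligned rectangle FAR per frame. The 1.5x oversizing factor, the 2-second dwell time, and the choice to re-centre the measurement area when the face leaves it are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x: float
    y: float
    w: float
    h: float

    def contains(self, other):
        return (self.x <= other.x and self.y <= other.y
                and other.x + other.w <= self.x + self.w
                and other.y + other.h <= self.y + self.h)

    def oversized(self, scale):
        """Same centre, each side multiplied by `scale`."""
        cx, cy = self.x + self.w / 2, self.y + self.h / 2
        w, h = self.w * scale, self.h * scale
        return Rect(cx - w / 2, cy - h / 2, w, h)


class GazeDetector:
    def __init__(self, scale=1.5, dwell_s=2.0):
        self.scale = scale        # hypothetical oversizing factor for SAR
        self.dwell_s = dwell_s    # hypothetical "predetermined time"
        self.sar = None
        self.start_t = None

    def update(self, t, far):
        """Feed one face-detection result FAR (or None); return True when gazing."""
        if far is None:                               # no face detected: reset
            self.sar, self.start_t = None, None
            return False
        if self.sar is None or not self.sar.contains(far):
            self.sar = far.oversized(self.scale)      # S22: set measurement area
            self.start_t = t                          # S23: (re)start the timer
        return t - self.start_t >= self.dwell_s       # S24: dwell time reached?
```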

The gazing-state detection method is not limited to that of FIG. 6. For example, the gazing state may be detected by detecting the user's red eyes. Alternatively, the pupil positions may be detected from the brightness of the image region around the eyes in the user's face images captured by two cameras (a stereo camera), the user's gaze direction may be detected from the detected pupil centers and the eyeball centers, and whether the user is gazing may be determined from that direction.

Likewise, the display control method based on utterance detection is not limited to that of FIG. 5. For example, the content display timing may be controlled on the basis of the utterance detection result. Specifically, when it is detected that the user is not speaking, the slide show switches to the next content.

In FIG. 7(A), for example, an utterance is detected while content CT1 (slide 1) is being displayed. While utterances are being detected, the display does not switch to the next content CT2 (slide 2), and the current content CT1 remains displayed. Specifically, as long as the non-utterance period does not exceed a predetermined time, content CT1 stays on screen; once the non-utterance period exceeds the predetermined time, the display switches to the next content CT2. The non-utterance period is likewise counted while CT2 is displayed: CT2 remains displayed as long as that period does not exceed the predetermined time, and the display switches to the next content CT3 once it does.

In this way, while the user is speaking out of interest in the displayed content, that content remains displayed, and when the user stops talking, the display moves on to the next content. This realizes a novel form of slide-show control for a digital photo frame.

FIG. 7(B) is a flowchart of the display control method of FIG. 7(A). First, slide-show content CTi is displayed (step S31). The sound sensor then detects the user's utterances, and it is determined whether the user's non-utterance period (a period in which no utterance is detected) has exceeded a predetermined time (steps S32 and S33). If it has, i is incremented by 1 and the display switches to the next content (step S34).
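The FIG. 7(B) loop can be simulated offline from a list of utterance timestamps. This sketch assumes the first slide appears at t = 0 and that the non-utterance period restarts both at each utterance and whenever a new slide appears; the function name and time units are hypothetical.

```python
def slide_switch_times(utterance_times, silence_s, end_time):
    """Times at which the slide show advances (i += 1), per Fig. 7(B)."""
    if silence_s <= 0:
        raise ValueError("silence_s must be positive")
    switches = []
    last_activity = 0.0              # slide CT1 shown at t = 0 (step S31)
    pending = sorted(utterance_times)
    idx = 0
    while True:
        deadline = last_activity + silence_s
        if idx < len(pending) and pending[idx] < deadline:
            # Utterance before the deadline: the non-utterance period restarts (S33 "no").
            last_activity = max(last_activity, pending[idx])
            idx += 1
            continue
        if deadline > end_time:
            break
        switches.append(deadline)    # S33 "yes" -> S34: show the next slide
        last_activity = deadline     # the new slide restarts the silence counter
    return switches
```

For instance, with a 10-second limit and utterances at 3, 8, and 15 s, the first slide stays up until 25 s, and the next switch follows at 35 s.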

3. Display Control by Voice Recognition
In this embodiment, various kinds of display control can also be performed on the basis of the voice recognition result of the user's utterances. For example, the user's utterance keywords are extracted by voice recognition, and content selected on the basis of the extracted keywords is displayed. In FIG. 8, the user's utterance keywords "Hokkaido" and "E Zoo" have been extracted by voice recognition. In this case, detailed and related information about "E Zoo" in "Hokkaido" is displayed as the content selected on the basis of the utterance keywords.

Specifically, the sound sensor detects ambient sound to acquire sound information. The acquired sound information is converted into text by voice recognition, and words are extracted from the resulting text. Each extracted word is collated with the registered keywords stored in the registered keyword storage unit 326 (keyword database) to determine whether it is a registered keyword; if it is, the word is judged to be an utterance keyword. This prevents meaningless or unpleasant information from being presented to the user.

Content corresponding to (related to) the obtained utterance keyword is then searched for in the content information storage unit 322 of the digital photo frame or on a server (a home server or an external server), and an image of the retrieved content is displayed.
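The extraction, collation, and search steps above can be sketched as follows. The whitespace tokenisation stands in for real word segmentation (Japanese text would need a morphological analyser), and the in-memory `content_store` of tagged items is a hypothetical substitute for the content information storage unit 322 or a server.

```python
import string


def extract_utterance_keywords(text, registered):
    """Keep only words found in the registered-keyword set (collation step)."""
    # Whitespace split stands in for real word segmentation (an assumption).
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return [w for w in words if w in registered]


def search_content(keywords, content_store):
    """Ids of stored items tagged with any of the keywords."""
    return [item["id"] for item in content_store
            if any(k in item["tags"] for k in keywords)]
```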

In this way, when a user who is interested in the displayed content utters an utterance keyword related to it, content corresponding to that keyword is displayed, realizing a new type of digital photo frame.

The registered keywords include general registered keywords and personal registered keywords. General registered keywords are commonly understood nouns, such as the names of objects, animals, plants, people, and places; they are registered in a general dictionary database as dictionary registration information. If necessary, the user may be allowed to choose which genres to include. Personal registered keywords include, for example, the names of family members, relatives, and friends, the user's hobbies and preferences, the user's favorite places and shops, and tags attached to personally owned images; they are registered in a user dictionary database as user registration information. For example, when an utterance containing the general registered keyword "Hokkaido", such as "Oh, Hokkaido, how lovely", is recognized, information and images about Hokkaido are displayed. When an utterance containing a personal registered keyword, such as "Aunt A says she's coming soon", is recognized, personal photographs tagged with Aunt A are displayed.
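One way to route a recognised keyword to the two dictionaries is sketched below. The dictionary contents, the genre filter, and the decision to check the user dictionary before the general dictionary are assumptions for illustration, not details from the specification.

```python
def route_keyword(keyword, general_dict, user_dict, enabled_genres=None):
    """Return (source, query) saying where to look for content, or None.

    general_dict: keyword -> genre (general dictionary database, hypothetical form)
    user_dict:    set of personal keywords (user dictionary database)
    """
    if keyword in user_dict:
        return ("personal_photos", keyword)   # e.g. images tagged "aunt a"
    genre = general_dict.get(keyword)
    if genre is not None and (enabled_genres is None or genre in enabled_genres):
        return ("general_info", keyword)      # e.g. information about Hokkaido
    return None                               # not a registered keyword
```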

In this embodiment, when a tag keyword associated with the displayed content matches an utterance keyword, content corresponding to that tag keyword may be displayed.

In FIG. 9, for example, the tag keywords "Hokkaido", "seasonal ingredients", "salmon", "crab", and "sea urchin" are associated with the displayed content. The utterance keyword "crab" has been extracted from the user's utterance by voice recognition and matches one of the tag keywords of the displayed content. In this case, the user is judged to be interested in "crab", and detailed and related information about crab is displayed as content corresponding to "crab". In this way, detailed or related information can be displayed about whichever part of the displayed content's subject matter (theme) the user is interested in, providing an interface well suited to the user.

In this embodiment, when the tag keyword associated with the displayed content does not match the utterance keyword, content selected by an AND search on the tag keyword and the utterance keyword may be displayed.

In FIG. 10, for example, the tag keyword "crab" is associated with the displayed content. The utterance keyword "cooking" (a general registered keyword) has been extracted from the user's utterance by voice recognition, and it does not match the tag keyword "crab". In this case, the user is inferred to be interested in crab cooking, and content selected by an AND search on "crab" and "cooking" is displayed, specifically detailed and related information about crab dishes. In this way, content that reflects both the displayed content and what the user said can be displayed next, enabling a wide variety of content displays.

In this embodiment, when multiple different utterance keywords are extracted from the user's utterances, content selected by an AND search on those keywords may be displayed.

In FIG. 11, for example, the different utterance keywords "crab" and "cooking" are extracted from the utterances of a user viewing the displayed content. Here too, the user is inferred to be interested in crab cooking, and content selected by an AND search on "crab" and "cooking" is displayed, specifically detailed and related information about crab dishes. In this way, suitable content can be selected from what a user viewing the displayed content says and shown to that user.

FIG. 12 is a flowchart of a first display control method of this embodiment based on voice recognition. First, content is selected at random or on the basis of user information (personal information) (step S41), and the selected content is displayed (step S42).

Next, it is determined from the sound information of the sound sensor whether an utterance by the user has been detected (step S43). If so, it is determined whether an utterance keyword has been extracted by voice recognition (step S44). If no utterance is detected or no utterance keyword is extracted, the process returns to step S41.

If an utterance keyword is extracted in step S44, it is determined whether an utterance keyword matching a tag keyword of the displayed content has been extracted (step S45). If so, content corresponding to the tag keyword is selected (step S46); that is, as described with reference to FIG. 9, related and detailed information for the matching tag keyword is displayed. If no utterance keyword matching a tag keyword has been extracted, an AND search on the tag keyword and the utterance keyword is performed, and the content it returns is selected (steps S47 and S48); that is, as described with reference to FIG. 10, content selected by the AND search on the tag keyword and the utterance keyword is displayed.

Thus, in the first display control method of FIG. 12, when an utterance keyword identical to a tag keyword is detected, detailed or related information for that tag keyword (hereinafter "detail/related information" where appropriate) is displayed, and when an utterance keyword different from the tag keywords is detected, an AND search is performed and the resulting content is displayed. This enables a wide variety of content displays.
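The branch structure of FIG. 12 (steps S44 to S48) can be sketched as a function that returns the keywords to AND together, plus a simple AND search over tagged items. The content store is hypothetical, and where the text does not say which keyword to use, the sketch simply takes the first.

```python
def first_method_query(tag_keywords, utterance_keywords):
    """Fig. 12 (S44-S48): keywords to AND together, or None (back to S41).

    Assumes the displayed content has at least one tag keyword.
    """
    if not utterance_keywords:
        return None                                    # S44 "no": reselect content
    for kw in utterance_keywords:
        if kw in tag_keywords:
            return (kw,)                               # S45 "yes" -> S46
    return (tag_keywords[0], utterance_keywords[0])    # S47-S48: AND search


def and_search(query, content_store):
    """Ids of items whose tags include every keyword in `query`."""
    return [item["id"] for item in content_store if set(query) <= item["tags"]]
```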

FIG. 13 is a flowchart of a second display control method of this embodiment based on voice recognition. In this second method, content display is controlled using not only the results of utterance detection and voice recognition but also the determination of the user's gazing state, which is one kind of viewing state.

First, content is selected at random or on the basis of user information (step S51), and the selected content is displayed (step S52). It is then determined whether an utterance by the user has been detected; if not, it is determined whether the user is gazing at the display unit (the displayed content) (steps S53 and S54). If the user is gazing, content corresponding to one of the tag keywords of the displayed content is selected (step S55). For example, when the tag keywords "Hokkaido", "seasonal ingredients", "salmon", "crab", and "sea urchin" are associated with the displayed content as in FIG. 9, detail/related information for one of these tag keywords is displayed.

If an utterance is detected in step S53, an utterance keyword is extracted by voice recognition (step S56). It is then determined whether the user is gazing at the display unit; if not, content corresponding to the extracted utterance keyword is selected (steps S57 and S58). For example, when the user's utterance keyword is a registered keyword (such as a general registered keyword) in the dictionary database, content related to that registered keyword is selected and displayed.

If it is determined in step S57 that the user is gazing, it is determined whether an utterance keyword matching a tag keyword of the displayed content has been extracted (step S59). If so, it is determined whether another registered keyword has also been extracted as an utterance keyword (step S60). If another registered keyword has been extracted, content is selected by an AND search on the matching tag keyword and that registered keyword (step S61). If not, content corresponding to the matching tag keyword is selected (step S62).

Suppose, for example, that "crab" and "cooking" are extracted as utterance keywords and the tag keyword of the displayed content is "crab". The utterance keyword "crab" matches the tag keyword "crab", and another registered keyword, "cooking", has also been extracted as an utterance keyword, so content is selected by an AND search on "crab" and "cooking". If only "crab" is extracted and no other registered keyword is extracted as an utterance keyword, detail/related information for the tag keyword "crab" is selected and displayed.

If no utterance keyword matching a tag keyword is extracted in step S59, it is determined whether another registered keyword has been extracted as an utterance keyword (step S63). If one has, content is selected by an AND search on one of the displayed content's tag keywords and the extracted registered keyword (step S64); if not, content corresponding to one of the displayed content's tag keywords is selected (step S65).

Suppose, for example, that the tag keywords of the displayed content are "crab", "salmon", and "sea urchin", and "cooking" is extracted as an utterance keyword. The utterance keyword does not match any tag keyword, but another registered keyword, "cooking", has been extracted, so content is selected by an AND search on "cooking" and one of "crab", "salmon", or "sea urchin"; for example, content on crab dishes, salmon dishes, or sea urchin dishes is selected. If instead the tag keywords are "crab", "salmon", and "sea urchin" and no other registered keyword is extracted as an utterance keyword, content corresponding to one of "crab", "salmon", or "sea urchin" is selected and displayed.

As described above, in the second display control method of FIG. 13, when an utterance by the user is detected, the display mode of the displayed image is changed on the basis of both the voice recognition result and the determination of the user's viewing state.

For example, when an utterance is detected (step S53) but the user is not gazing (step S57), content corresponding to the utterance keyword is displayed (step S58). That is, the user's utterance is recognized, and content is selected using the utterance keyword (registered keyword) extracted by voice recognition, regardless of the content currently displayed.

On the other hand, when an utterance is detected (step S53) and the user is gazing (step S57), content selected using the tag keywords of the displayed content is displayed. Specifically, when the user is gazing and a tag keyword matches an utterance keyword, content selected using the matching tag keyword is displayed. For example, when a tag keyword matches an utterance keyword and another registered keyword (general registered keyword) is also extracted, content is selected by an AND search on the tag keyword and the registered keyword (step S61). When a tag keyword matches an utterance keyword but no other registered keyword is extracted, content corresponding to the tag keyword is selected (step S62).

When no tag keyword matches an utterance keyword but another registered keyword is extracted, content is selected by an AND search on one of the displayed content's tag keywords and the registered keyword (step S64). When no registered keyword is extracted, detail/related information for one of the displayed content's tag keywords is selected (step S65).
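The full decision tree of FIG. 13 can be condensed into one function. This is a hedged sketch: it returns the step reached and a list of candidate queries (tuples of keywords to AND together), leaving the choice of "one of" the tag keywords to the caller. It assumes `keywords` holds only registered keywords, and it assumes no content change when the user is neither speaking nor gazing, a case the text does not describe.

```python
def second_method(tags, utterance_detected, keywords, gazing):
    """Fig. 13: return (step, candidate queries); each query is a tuple to AND."""
    if not utterance_detected:                         # S53 "no"
        if gazing:                                     # S54 "yes" -> S55
            return "S55", [(t,) for t in tags]
        return "none", []                              # assumption: keep current content
    if not gazing:                                     # S57 "no" -> S58
        return "S58", [(k,) for k in keywords]
    matched = [k for k in keywords if k in tags]
    others = [k for k in keywords if k not in tags]
    if matched:                                        # S59 "yes"
        if others:                                     # S60 "yes" -> S61
            return "S61", [(matched[0], o) for o in others]
        return "S62", [(m,) for m in matched]          # S60 "no" -> S62
    if others:                                         # S63 "yes" -> S64
        return "S64", [(t, o) for t in tags for o in others]
    return "S65", [(t,) for t in tags]                 # S63 "no" -> S65
```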

In this way, intelligent display control that reflects both the user's gaze and the user's utterance keywords can be achieved, enabling the digital photo frame to display a wide variety of content.

In this embodiment, the number of occurrences of each extracted utterance keyword may also be counted, and content corresponding to an utterance keyword whose occurrence count exceeds a predetermined number may be displayed. FIG. 14 is a flowchart of this processing.

First, it is determined whether an utterance by the user has been detected; if so, utterance keywords are extracted by voice recognition (steps S91 and S92). The number of occurrences of each utterance keyword is then counted (step S93), for example by building a histogram of the occurrence count of each keyword. It is then determined whether an utterance keyword's occurrence count has exceeded the predetermined number; if it has, content corresponding to that keyword is selected as the content to display (steps S94 and S95).
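Steps S93 to S95 can be sketched with a running counter. Resetting a keyword's count after it triggers a content switch is an assumption added here so that a further switch requires fresh mentions; the text does not say what happens to the count. The `content_index` table is hypothetical.

```python
from collections import Counter


class KeywordTrigger:
    """Fig. 14 (S93-S95): select content once a keyword's count exceeds a threshold."""

    def __init__(self, threshold, content_index):
        self.threshold = threshold          # the "predetermined number"
        self.content_index = content_index  # hypothetical keyword -> content table
        self.counts = Counter()             # running histogram (S93)

    def on_keywords(self, keywords):
        """Feed the keywords extracted from one utterance (S92)."""
        self.counts.update(keywords)
        for kw in keywords:
            if self.counts[kw] > self.threshold and kw in self.content_index:  # S94
                del self.counts[kw]  # assumption: restart counting after a switch
                return self.content_index[kw]                                # S95
        return None
```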

This prevents the content image from changing too often in response to every user utterance, so that content images remain easy for the user to view.

The occurrence counts and history of utterance keywords may also be accumulated and recorded, and the display frequency of content corresponding to frequently used utterance keywords may be increased.

Alternatively, the frame may access an external database server that holds the utterance keyword histories of many other users, identify other users who frequently use keywords similar to those the user frequently uses, and select content based on the utterance keyword histories of the identified users. Such collaborative filtering enables content selection by utterance keyword that matches the user's preferences.
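
One common way to realize the user-based collaborative filtering described above is to compare keyword histograms by cosine similarity. The sketch below assumes that technique; the text does not name a similarity measure. The neighbour count `k`, result count `n`, and function names are hypothetical, and the server-side data are assumed to be per-user keyword counts.

```python
import math
from collections import Counter
from typing import Dict, List


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two keyword-count histograms."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def recommend_keywords(me: Counter, others: Dict[str, Counter],
                       k: int = 2, n: int = 3) -> List[str]:
    """Rank others by similarity, then score keywords the user has not used yet."""
    neighbours = sorted(others, key=lambda u: cosine(me, others[u]), reverse=True)[:k]
    scores: Counter = Counter()
    for user in neighbours:
        sim = cosine(me, others[user])
        for kw, count in others[user].items():
            if kw not in me:
                scores[kw] += sim * count  # weight neighbours' usage by similarity
    return [kw for kw, _ in scores.most_common(n)]
```

The recommended keywords could then drive content selection in the same way as the user's own utterance keywords.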

4. Display Control of a Plurality of Display Objects
In the present embodiment, when a content image consists of a plurality of display objects, it may be determined whether the tag keyword of each display object matches an utterance keyword, and a display object whose tag keyword matches may be displayed enlarged.

For example, in FIG. 15A, a plurality of display objects representing "crab", "salmon", and "sea urchin" are displayed as the content image, and the tag keywords "crab", "salmon", and "sea urchin" are set for these display objects.

When "crab" is detected as the user's utterance keyword, the display is enlarged around the "crab" display object, or detailed or related information about crabs is displayed. Likewise, when "salmon" is detected, the "salmon" display object is enlarged or detailed or related information about salmon is displayed, and when "sea urchin" is detected, the "sea urchin" display object is enlarged or detailed or related information about sea urchins is displayed.

If, on the other hand, no utterance keyword matching the tag keywords "crab", "salmon", or "sea urchin" is extracted, the "crab", "salmon", and "sea urchin" display objects are enlarged one after another, for example in that order.

In this way, the display object that interests the user can be enlarged, or its detailed or related information displayed, so that content in the style of, for example, an illustrated encyclopedia of living things can be presented on a digital photo frame.

Furthermore, if a user utterance is detected while a display object is shown enlarged, detailed or related information about the enlarged display object may be displayed. In this case, the frame may respond only to demonstrative words (the Japanese "ko-so-a-do" words such as "this" and "that"), as in "Yes, yes, this one, this one," and display detailed or related information about the enlarged display object.

A content image composed of a plurality of display objects can take various forms. For example, as shown in FIG. 15B, when the content image is divided into sub-screens such as "weather information", "traffic information", "stock price information", and "time information", each sub-screen serves as a display object. When "weather" is detected as the user's utterance keyword, the "weather information" sub-screen is enlarged. Similarly, when "traffic", "stock price", or "time" is detected, the "traffic information", "stock price information", or "time information" sub-screen, respectively, is enlarged.

FIG. 16 is a flowchart of a display control method for a plurality of display objects. First, content is selected either randomly or based on user information, and the selected content is displayed (steps S71 and S72).

Next, it is determined whether the content has a plurality of display objects. If it has only a single display object, it is determined whether a user utterance has been detected (steps S73 and S74). If an utterance has been detected, content presenting detailed or related information about that display object is selected (step S75). If no utterance has been detected, content is again selected randomly or based on user information (step S71).

If there are a plurality of display objects, it is determined whether a user utterance has been detected; if not, it is determined whether the user is gazing at the display (steps S76 and S77). If the user is gazing, the display objects are enlarged one after another (step S78).

If a user utterance is detected in step S76, utterance keywords are extracted by speech recognition (step S79). It is then determined whether the user is gazing; if not, content corresponding to the utterance keyword is selected (steps S80 and S81). For example, when the user's utterance keyword is a registered keyword in the dictionary database, content related to that registered keyword is selected and displayed.

If it is determined in step S80 that the user is gazing, it is determined whether an utterance keyword matching the tag keyword associated with any of the display objects has been extracted (step S82). If so, the display object associated with the matching tag keyword is enlarged, or detailed or related information about that display object is selected (step S83). If not, the display objects are enlarged one after another (step S78).
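
The branching of FIG. 16 (steps S73 to S83) can be condensed into one decision function, sketched below. The action names are hypothetical labels, speech recognition and gaze detection are assumed to run upstream, and the case "several objects, no utterance, user not gazing" is not specified in the text, so returning "keep_current" there is an assumption.

```python
from typing import List, Optional, Tuple


def fig16_decide(
    object_tags: List[str],
    utterance_detected: bool,
    keywords: List[str],
    gazing: bool,
) -> Tuple[str, Optional[str]]:
    """One pass of the FIG. 16 decision flow (hypothetical action names)."""
    if len(object_tags) < 2:                        # S73: single display object
        if utterance_detected:                      # S74
            return ("show_details", object_tags[0] if object_tags else None)  # S75
        return ("reselect_content", None)           # back to S71
    if not utterance_detected:                      # S76
        if gazing:                                  # S77
            return ("enlarge_in_turn", None)        # S78
        return ("keep_current", None)               # not specified; assumption
    # S79: `keywords` were extracted by speech recognition upstream.
    if not gazing:                                  # S80
        return ("select_by_keyword", keywords[0] if keywords else None)  # S81
    for kw in keywords:                             # S82: tag keyword match?
        if kw in object_tags:
            return ("enlarge_object", kw)           # S83: enlarge or show details
    return ("enlarge_in_turn", None)                # S78
```

With the FIG. 15A example, saying "salmon" while looking at the frame enlarges the salmon object, while an unrelated word makes the frame cycle through the objects.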

As described above, in FIG. 16, when a tag keyword associated with one of the display objects in the content image matches an utterance keyword, the display object associated with the matching tag keyword is enlarged, or its detailed or related information is displayed (step S83). When no tag keyword matches the utterance keyword, the display objects are enlarged one after another (step S78). This enables diverse content display that reflects the user's utterances, a form of display control not previously available in digital photo frames.

5. Modifications of the Display Control Method
Various modifications of the display control method of the present embodiment are described next. For example, the user may be identified by speaker recognition from the user's speech, and content corresponding to the recognized user may be displayed.

For example, in FIG. 17, the user's ID, password, hobbies and preferences, and the like are entered as user registration information; specifically, the user's hobby is entered as going to concerts. The entered user registration information is stored in the user information storage unit 328 as user information. Besides hobbies and preferences, the user's favorites, screen mode settings, slideshow method, and the like may also be entered as user registration information.

After the user registration information has been entered, the user's voice is registered. Specifically, the user speaks his or her name into a microphone serving as the sound sensor. Feature information is extracted from the input voice and registered in association with the user registration information (hobby and preference information, etc.) entered in FIG. 17. This completes the user registration process.

Suppose that, with the user registration information registered in the digital photo frame 300 in this way, the sound sensor detects the user's speech and the user is identified by speaker recognition. In this case, the display content of the digital photo frame 300 is selected based on the recognition result and the user registration information. In FIG. 17, the user's hobby was entered as going to concerts. Therefore, when the user is identified by speaker recognition from the sound information of the sound sensor, content introducing concert information is selected from a plurality of candidate contents and displayed, as shown in FIG. 18A. If instead cooking had been entered as the user's hobby on the registration screen of FIG. 17, content introducing recipes would be selected and displayed when that user is identified, as shown in FIG. 18B.
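
A simplified sketch of this speaker-recognition-then-select step follows. It assumes each registered user's voice features are stored as a fixed-length vector and compares them by cosine similarity against a hypothetical threshold `min_sim`; practical speaker recognition uses far richer acoustic models, so this only illustrates the matching-and-lookup structure. The candidate format (content ID mapped to a category string) is also an assumption.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass
class RegisteredUser:
    voiceprint: List[float]  # feature vector stored at voice registration (assumed format)
    hobby: str               # hobby/preference entered on the FIG. 17 screen


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def identify_speaker(features: Sequence[float], users: Dict[str, RegisteredUser],
                     min_sim: float = 0.9) -> Optional[str]:
    """Return the most similar registered user, or None if none reaches min_sim."""
    best_id, best_sim = None, min_sim
    for user_id, user in users.items():
        sim = _cosine(features, user.voiceprint)
        if sim >= best_sim:
            best_id, best_sim = user_id, sim
    return best_id


def select_for_speaker(features: Sequence[float], users: Dict[str, RegisteredUser],
                       candidates: Dict[str, str]) -> List[str]:
    """candidates maps content ID -> category; keep items matching the speaker's hobby."""
    user_id = identify_speaker(features, users)
    if user_id is None:
        return []
    hobby = users[user_id].hobby
    return sorted(cid for cid, category in candidates.items() if category == hobby)
```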

In this way, content reflecting the user's hobbies and preferences is automatically selected and displayed on the digital photo frame, improving convenience for the user.

One way for users to obtain such information is to search on a PC (personal computer) or similar device for concert information about artists they listen to, or for recipes. With this approach, however, the user must perform various operations, which is cumbersome.

By contrast, with the method shown in FIGS. 17 to 18B, once the user registration information has been entered in advance, content matching the user's hobbies and preferences is automatically selected from candidate content, such as music information and program information, and displayed on the digital photo frame 300 without any deliberate action by the user. This makes it possible to provide a digital photo frame 300 that is more convenient and displays content reflecting the user's hobbies and preferences.

In the present embodiment, the user's emotional state may also be recognized from the user's speech, and content corresponding to the recognized emotional state may be displayed. For example, when laughter is detected in the user's utterances (conversation) and the user is judged to be happy, a lively effect is shown on the display unit 340 of the digital photo frame 300. When the user is judged to be tense, content that helps the user relax is selected and displayed.

Various methods are conceivable for recognizing the user's emotional (affective) state. For example, the user's emotional state is classified into a plurality of states such as normal, excited, relaxed, happy, and sad, and which of these states the user is in is determined from the result of recognizing the user's voice. The level of the emotional state within that state is also determined; for example, whether the user is "very happy" or only "a little happy". The emotional state can be detected quantitatively by analyzing feature information of the user's speech, such as its rhythm and tone, which makes the determinations above possible.
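
The two-stage determination (which state, then what level) can be sketched as follows. This assumes an upstream analysis of rhythm and tone has already produced a score between 0 and 1 for each state; how those scores are computed is not described in the text, and the threshold `strong` separating "very" from "a little" is a hypothetical parameter.

```python
from typing import Dict, Tuple

STATES = ("normal", "excited", "relaxed", "happy", "sad")


def classify_emotion(scores: Dict[str, float], strong: float = 0.7) -> Tuple[str, str]:
    """Pick the highest-scoring state, then grade its level against a threshold.

    `scores` is assumed to come from an upstream analysis of rhythm and tone;
    missing states count as 0.0, and ties go to the state listed first in STATES.
    """
    state = max(STATES, key=lambda s: scores.get(s, 0.0))
    level = "very" if scores.get(state, 0.0) >= strong else "a little"
    return state, level
```

A result such as ("happy", "very") could then trigger the lively display effect described above.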

In the present embodiment, the positional relationship with the user may also be determined, and the display mode of the displayed image may be changed according to the acquired sound information only when the distance between the user and the display unit 340 is within a predetermined distance, that is, when the user is close.

For example, as shown in FIG. 19A, when the user is far away, that is, at or beyond the predetermined distance, display control based on sound information is not performed; as shown in FIG. 19B, when the user is within the predetermined distance, display control based on sound information is performed. Sound-based display control is thus executed only after the user has approached the digital photo frame 300 and the user's voice can be detected reliably. This prevents the display mode from changing frequently because of noise and the like, so that high-quality images that are easy for the user to view can be displayed.

Various methods are conceivable for detecting the positional relationship with the user. In FIG. 20A, for example, an image sensor (camera) such as a CCD or CMOS sensor is used as the sensor 350. Based on imaging information from the image sensor, the user's face area FAR, a rectangular frame region, is detected. Image recognition is also performed on the user's image within the face area FAR to extract feature point data of the user's face. This feature point data is registered, for example as a face recognition result, in association with the user registration information.

To detect the positional relationship between the user and the display unit 340, the size of the face area FAR is determined from the imaging information of the image sensor, and the distance between the user and the display unit 340 is judged from that size.

In FIG. 20B, for example, the face area FAR is small (at or below a predetermined size), so the user is judged to be far away. In this case, display control based on sound information is not performed, as shown in FIG. 19A.

In FIG. 20C, on the other hand, the face area FAR is large (larger than the predetermined size), so the user is judged to be close. When the user is thus judged to have approached the digital photo frame 300, display control based on sound information is performed as shown in FIG. 19B; that is, the display mode of the displayed image is changed based on the sound information acquired by the sound sensor.
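
The gating rule of FIGS. 19A to 20C can be sketched as a size threshold on the detected face box. The threshold `min_area_px` is hypothetical. The optional distance estimate uses the standard pinhole-camera relation (distance equals focal length in pixels times real face width divided by face width in pixels); the focal length and the 0.16 m face width are assumed values, not figures from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FaceBox:
    """Face area FAR reported by a face detector, in pixels."""
    x: int
    y: int
    w: int
    h: int


def sound_control_enabled(face: Optional[FaceBox], min_area_px: int = 80 * 80) -> bool:
    """Enable sound-based display control only for a detected, sufficiently large face.

    No face (None) covers both "nobody there" and "not looking at the display".
    """
    return face is not None and face.w * face.h > min_area_px


def estimate_distance_m(face_width_px: float, focal_px: float = 600.0,
                        face_width_m: float = 0.16) -> float:
    """Pinhole-camera estimate: distance = focal length * real width / image width."""
    return focal_px * face_width_m / face_width_px
```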

Various methods are conceivable for detecting the face area. To detect a face, the face must be distinguished from other objects in the image captured by the image sensor and the face area cut out. A face consists of eyes, a nose, a mouth, and so on, whose shapes and positional relationships share broadly common characteristics despite individual differences. These common characteristics are used to distinguish the face from other objects and cut it out of the image. Cues for this include skin color, facial motion, shape, and size. When skin color is used, the RGB data are converted into HSV data consisting of hue, saturation, and value (brightness), and regions with the hue of human skin are extracted.
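
A minimal skin-color mask using Python's standard `colorsys.rgb_to_hsv` is sketched below. The hue window (0 to 50 degrees) and the minimum saturation and value are illustrative assumptions; real skin detectors tune these per camera and lighting, and a face detector would then look for a face-shaped cluster of `True` pixels.

```python
import colorsys
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # 8-bit (R, G, B)


def skin_mask(pixels: List[List[Pixel]],
              hue_range: Tuple[float, float] = (0.0, 50.0),
              min_s: float = 0.15, min_v: float = 0.2) -> List[List[bool]]:
    """Mark pixels whose HSV hue falls in an assumed skin-tone window."""
    mask = []
    for row in pixels:
        out = []
        for r, g, b in row:
            # colorsys works on floats in [0, 1] and returns hue in [0, 1).
            h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            hue_deg = h * 360
            out.append(hue_range[0] <= hue_deg <= hue_range[1]
                       and s >= min_s and v >= min_v)
        mask.append(out)
    return mask
```

The saturation floor excludes grays, whose hue is undefined and reported as 0 by `colorsys`.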

Alternatively, an average face pattern generated from the face patterns of many people may be prepared as a face template. This face template is scanned across the captured image, its correlation with the image is computed at each position, and the region with the highest correlation value is detected as the face area.
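
The template scan can be sketched in pure Python as follows. The text says only "correlation"; this sketch assumes zero-mean normalized cross-correlation, a common choice because it is insensitive to uniform brightness changes. Images are small grayscale lists for illustration, and a flat patch (zero variance) is given a score of 0.

```python
import math
from typing import List, Tuple

Image = List[List[float]]  # grayscale, row-major


def ncc(patch: Image, tmpl: Image) -> float:
    """Zero-mean normalized cross-correlation of two equal-sized images."""
    a = [v for row in patch for v in row]
    b = [v for row in tmpl for v in row]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da = [x - ma for x in a]
    db = [y - mb for y in b]
    den = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return sum(x * y for x, y in zip(da, db)) / den if den else 0.0


def match_template(img: Image, tmpl: Image) -> Tuple[int, int, float]:
    """Slide the template over the image; return (row, col, score) of the best match."""
    th, tw = len(tmpl), len(tmpl[0])
    best = (0, 0, -2.0)  # NCC lies in [-1, 1], so any real score beats -2
    for y in range(len(img) - th + 1):
        for x in range(len(img[0]) - tw + 1):
            patch = [row[x:x + tw] for row in img[y:y + th]]
            score = ncc(patch, tmpl)
            if score > best[2]:
                best = (y, x, score)
    return best
```

Thresholding the best score also gives the "no face found" outcome used later to infer that the user is not facing the display.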

To improve detection accuracy, a plurality of face templates may be prepared as dictionary data and used to detect the face area. Alternatively, the face area may be detected by also taking into account information such as the features of the eyes, nose, and mouth, their positional relationships, and the contrast within the face. The face area can also be detected by statistical pattern recognition using a neural network model.

The detection method of FIGS. 20A to 20C has the advantage that it detects not only the distance between the user and the display unit 340, from the size of the face area FAR, but also, at the same time, whether the user is looking at the display unit 340. When the user's line of sight is not directed toward the display unit 340, the correlation with the face template is low, so the face area FAR is not detected. Detection of the face area FAR is therefore equivalent to the user's line of sight being directed toward the display unit 340, with the display unit 340 within the user's field of view. By detecting the size of the face area FAR in this state, performing display control based on sound information, and displaying content, content images can be displayed appropriately to a user who is looking at the display unit 340. This makes a new type of digital photo frame 300 possible.

In the present embodiment, display control based on the acquired sound information may also include turning on the display of the display unit 340 of the digital photo frame 300 or identifying a user's operation instruction. For example, the display is turned on when everyday sounds such as conversation, music, or TV audio are detected. When an utterance keyword related to an operation instruction is detected, display control corresponding to that keyword is performed. For example, during a slideshow, detecting the utterance keyword "forward" displays the next content, and detecting "back" returns to the previous content.

FIG. 21 is a flowchart of this display control. First, it is determined whether the sound sensor has detected sound information; if so, the display of the display unit 340 of the digital photo frame 300 is turned on (steps S101 and S102). It is then determined whether an operation instruction can be identified from the sound information (step S103). Specifically, speech recognition is performed on the user's utterance, and it is determined whether any extracted utterance keyword matches a keyword corresponding to an operation instruction. When a user operation instruction is identified, display control is performed according to that instruction (step S104).
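
The FIG. 21 flow can be sketched as a small controller. The class name, the keyword-to-step table, and the wrap-around at either end of the slideshow are assumptions; speech recognition is assumed to have produced the keyword list upstream (an empty list for non-speech sounds such as music or TV audio).

```python
from typing import List


class SlideshowFrame:
    """Hypothetical controller for the FIG. 21 flow."""

    COMMANDS = {"forward": +1, "back": -1}  # operation-instruction keywords (assumed)

    def __init__(self, items: List[str]):
        self.items = items
        self.index = 0
        self.display_on = False

    def on_sound(self, keywords: List[str]) -> str:
        """Handle one detected sound event and return the item now displayed."""
        self.display_on = True                    # S101-S102: any sound turns the display on
        for kw in keywords:                       # S103: look for an operation instruction
            step = self.COMMANDS.get(kw)
            if step is not None:                  # S104: act on the first one found
                self.index = (self.index + step) % len(self.items)
                break
        return self.items[self.index]
```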

6. Modification of the System Configuration
FIG. 22 shows a modification of the system configuration of the present embodiment. The system of this modification includes a server 200 (an information processing system in a broad sense; a home server in a narrow sense). The server 200 includes a processing unit 202, a storage unit 220, and a communication unit 238. Various modifications are possible, such as omitting some of these components or adding others. Components similar to those in FIG. 2 are denoted by the same reference numerals, and their description is omitted.

The processing unit 202 performs various processes such as server management and can be implemented by a processor such as a CPU, by an ASIC, or the like. The storage unit 220 serves as a work area for the processing unit 202 and the communication unit 238 and can be implemented by, for example, RAM or an HDD. The communication unit 238 performs wired or wireless communication with the digital photo frame 300 and the external server 600 and can be implemented by a communication ASIC, a communication processor, or the like. The digital photo frame 300 and the server 200 are connected for communication via a network such as a wireless LAN.

The processing unit 202 includes a registration processing unit 214, a content selection unit 216, and a display instruction unit 218. Various modifications are possible, such as omitting some of these components or adding others.

The registration processing unit 214 performs the user registration process described with reference to FIG. 17. The user registration information is stored in, and thereby registered with, the user information storage unit 228 of the storage unit 220.

The content selection unit 216 selects the content to be presented to the user, for example by reading content information from the content information storage unit 222 of the storage unit 220, or by accessing the external server 600 and receiving content information from it.

The display instruction unit 218 issues instructions for displaying images on the display unit 340 of the digital photo frame 300 based on the content selected by the content selection unit 216; specifically, it instructs that the image of the selected content be displayed on the display unit 340. For example, it issues a display instruction that changes the display mode of the image on the display unit 340 according to sound information detected by the sound sensor that detects ambient sound. The display control unit 318 of the digital photo frame 300 then controls the display unit 340 in accordance with the instructions from the display instruction unit 218 of the server 200. As a result, the display mode of the image on the display unit 340 changes according to the acquired sound information.

According to the modification of FIG. 22, the server 200 performs content selection, user registration, and similar processing, which reduces the processing load on the digital photo frame 300. The processing of the present embodiment can therefore be implemented even when the processing unit 302 (CPU) of the digital photo frame 300 has low processing capability. These processes may also be implemented by distributed processing between the home server 200 and the digital photo frame 300.

Although the present embodiment has been described in detail above, those skilled in the art will readily understand that many modifications are possible without materially departing from the novel matters and effects of the present invention. All such modifications are therefore included within the scope of the present invention. For example, a term that appears at least once in the specification or drawings together with a different term having a broader or equivalent meaning may be replaced by that different term anywhere in the specification or drawings. The configurations and operations of the digital photo frame and the information processing system, the display control method, the speech recognition method, the visual recognition state determination method, and the like are likewise not limited to those described in the present embodiment, and various modifications are possible.

200 server, 202 processing unit, 214 registration processing unit, 216 content selection unit,
218 display instruction unit, 220 storage unit, 222 content information storage unit,
228 user information storage unit, 238 communication unit,
300 digital photo frame, 302 processing unit, 304 sound information acquisition unit,
305 detection information acquisition unit, 306 utterance detection unit, 307 speech recognition unit,
310 user state determination unit, 311 visual recognition state determination unit, 312 positional relationship determination unit,
314 registration processing unit, 316 content selection unit, 318 display control unit,
320 storage unit, 322 content information storage unit, 324 sound information storage unit,
325 detection information storage unit, 326 registered keyword storage unit, 327 user state storage unit, 328 user information storage unit, 330 information storage medium, 338 communication unit, 340 display unit,
350 sensor, 360 operation unit, 600 external server

Claims

A display unit that displays an image;
A display control unit that performs display control of the display unit;
A sound information acquisition unit that acquires sound information detected by a sound sensor that detects ambient sound,
The display control unit
A digital photo frame that performs display control to change a display mode of an image displayed on the display unit in accordance with the acquired sound information.

In claim 1,
Including an utterance detection unit that detects a user's utterance based on the acquired sound information,
The display control unit
A digital photo frame that performs display control for changing a display mode of an image displayed on the display unit based on a detection result of a user's utterance.

In claim 2,
The display control unit
A digital photo frame that performs control to display detailed information on, or information related to, the displayed content when a user's utterance is detected.

In claim 2 or 3,
Including a visual recognition state determination unit for determining the visual state of the user,
The display control unit
A digital photo frame that performs display control for changing a display mode of an image displayed on the display unit based on a detection result of a user's utterance and a determination result of a user's visual state.

In any of claims 2 to 4,
The display control unit
A digital photo frame characterized in that the display timing of content is controlled based on a detection result of a user's utterance.

In claim 5,
The utterance detection unit
Detects the user's utterance state,
The display control unit
A digital photo frame characterized in that, when a non-speaking state of the user is detected during a slideshow display of content, the display switches to the next content.

In claim 1,
Including a voice recognition unit that performs voice recognition on a user's utterance based on the acquired sound information,
The display control unit
A digital photo frame characterized by performing display control for changing a display mode of an image displayed on the display unit based on a voice recognition result.

In claim 7,
The voice recognition unit
Extracts an utterance keyword of the user by voice recognition,
The display control unit
A digital photo frame characterized by performing control to display content selected based on the extracted utterance keyword.

In claim 8,
The display control unit
A digital photo frame, wherein when a tag keyword associated with display content matches the utterance keyword, control is performed to display content corresponding to the tag keyword.

In claim 8,
The display control unit
A digital photo frame wherein, when the tag keyword associated with the displayed content does not match the utterance keyword, control is performed to display content selected by an AND search of the tag keyword and the utterance keyword.

In claim 8,
The display control unit
A digital photo frame characterized by performing control to display content corresponding to the tag keyword when the tag keyword matches the utterance keyword, and to display content selected by an AND search of the tag keyword and the utterance keyword when the tag keyword does not match the utterance keyword.

In claim 8,
The display control unit
A digital photo frame wherein, when a plurality of different utterance keywords are extracted from a user's utterance, control is performed to display content selected by an AND search of the plurality of different utterance keywords.

In claim 8,
Including a visual recognition state determination unit for determining the visual state of the user,
The display control unit
A digital photo frame that, when a user's utterance is detected, performs display control to change a display mode of an image displayed on the display unit based on the voice recognition result and the determination result of the user's visual state.

In claim 13,
The display control unit
A digital photo frame wherein, when the user's utterance is detected and it is determined that the user is gazing at the display unit, control is performed to display content selected by the tag keyword associated with the displayed content.

In claim 14,
The display control unit
A digital photo frame wherein, when it is determined that the user is gazing at the display unit and the tag keyword matches the utterance keyword, control is performed to display content selected by the tag keyword.

In claim 14 or 15,
The display control unit
A digital photo frame characterized by performing control to display content corresponding to the utterance keyword when the user's utterance is detected but it is determined that the user is not gazing at the display unit.

In any of claims 8 to 16,
The display control unit
A digital photo frame that counts the number of appearances of each extracted utterance keyword and performs control to display content corresponding to an utterance keyword whose appearance count exceeds a predetermined number.

In claim 8,
The display control unit
A digital photo frame characterized in that, when a tag keyword associated with one of a plurality of display objects constituting a content image matches the utterance keyword, control is performed to enlarge the display object associated with the matched tag keyword, or to display detailed information on, or information related to, that display object.

In claim 18,
The display control unit
A digital photo frame, wherein when the tag keyword does not match the utterance keyword, control is performed to sequentially enlarge and display the plurality of display objects.

In claim 7,
The voice recognition unit
Identifies the speaker from the user's utterance,
The display control unit
A digital photo frame characterized by performing control to display content corresponding to the user identified as the speaker.

In claim 7,
The voice recognition unit
Recognizes the user's emotional state from the user's utterance,
The display control unit
A digital photo frame characterized by performing control to display content corresponding to the recognized emotional state.

In any one of claims 1 to 21,
Including a positional relationship determination unit that determines a positional relationship between the user and the display unit,
The display control unit
A digital photo frame that, when the distance between the user and the display unit is within a predetermined distance, performs display control to change a display mode of an image displayed on the display unit according to the acquired sound information.

In any one of claims 1 to 22,
The display control unit
A digital photo frame characterized by performing control to turn on display of the display unit based on the acquired sound information.

In any one of claims 1 to 23,
The display control unit
A digital photo frame characterized by specifying a user operation instruction based on the acquired sound information and performing display control according to the specified operation instruction.

A content selection unit for selecting content;
A display instruction unit for instructing display of an image displayed on the display unit of the digital photo frame based on the selected content,
The display instruction unit
An information processing system that performs a display instruction to change a display mode of an image displayed on the display unit in accordance with sound information detected by a sound sensor that detects ambient sound.

A program that causes a computer to function as:
a display control unit that performs display control of a display unit that displays an image; and
a sound information acquisition unit that acquires sound information detected by a sound sensor that detects ambient sound,
wherein the display control unit
performs display control for changing a display mode of an image displayed on the display unit in accordance with the acquired sound information.

A computer-readable information storage medium storing the program according to claim 26.
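The following Python sketch shows one possible reading of the keyword-driven content selection recited in claims 8 to 12, 14 to 16 and 17. It assumes that speech recognition and gaze determination have already produced an utterance keyword list and a gazing flag. It also assumes that content is held in a hypothetical in-memory list whose items each carry a set of tag keywords. The names `Content`, `and_search`, `select_content` and `KeywordCounter` are illustrative, and so is the threshold of 2. None of them appears in the patent, and the sketch is not the applicant's implementation.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Content:
    """Hypothetical content item: an ID plus its tag keywords."""
    content_id: str
    tags: frozenset = field(default_factory=frozenset)


def and_search(library, keywords):
    """AND search: keep items whose tags include every given keyword."""
    wanted = set(keywords)
    return [item for item in library if wanted <= item.tags]


def select_content(library, tag_keyword, utterance_keywords, gazing=True):
    """One possible reading of the selection rules in claims 9-12 and 14-16.

    tag_keyword: tag keyword of the content currently displayed.
    utterance_keywords: keywords extracted from the user's utterance.
    gazing: whether the user is judged to be gazing at the display unit.
    """
    spoken = list(dict.fromkeys(utterance_keywords))  # drop repeats, keep order
    if not gazing:
        # Claim 16: utterance detected but no gaze, so select by the spoken
        # keywords alone (ANDed together when there are several, as in claim 12).
        return and_search(library, spoken)
    if tag_keyword in spoken:
        # Claims 9 and 15: the tag keyword matches an utterance keyword.
        return and_search(library, [tag_keyword])
    # Claim 10: no match, so AND-search the tag keyword with the spoken keywords.
    return and_search(library, [tag_keyword, *spoken])


class KeywordCounter:
    """Claim 17 sketch: report keywords whose appearance count exceeds a limit."""

    def __init__(self, threshold=2):  # threshold value is an assumption
        self.threshold = threshold
        self.counts = Counter()

    def add(self, keywords):
        """Count newly extracted keywords; return those now over the threshold."""
        self.counts.update(keywords)
        return [k for k in dict.fromkeys(keywords) if self.counts[k] > self.threshold]


def ids(items):
    """Return just the content IDs, for printing."""
    return [item.content_id for item in items]


if __name__ == "__main__":
    library = [
        Content("img1", frozenset({"beach", "summer"})),
        Content("img2", frozenset({"beach", "dog"})),
        Content("img3", frozenset({"dog", "park"})),
    ]
    print(ids(select_content(library, "beach", ["beach"])))  # ['img1', 'img2']
    print(ids(select_content(library, "beach", ["dog"])))  # ['img2']
    print(ids(select_content(library, "beach", ["dog", "park"], gazing=False)))  # ['img3']
```

Because tags are stored as sets, the AND search reduces to a subset test. A larger photo collection might instead query an indexed content store. The claims leave that choice open.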