JP2002133390A

JP2002133390A - Learning device and recording medium

Info

Publication number: JP2002133390A
Application number: JP2000318627A
Authority: JP
Inventors: Koji Morikawa; 幸治森川; Natsuki Oka; 夏樹岡
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2000-10-18
Filing date: 2000-10-18
Publication date: 2002-05-10

Abstract

(57) [Abstract]

[Problem] To provide a learning device that can improve learning efficiency by effectively using both a reward and a teacher signal when both are given from the environment.

[Solution] A learning device 260 that generates an output value 17 according to an input value 15 includes: a first learning unit 12 that executes first learning based on a reward signal and thereby generates a first output value 212 according to the input value 15 based on at least one first parameter; a second learning unit 13 that executes second learning based on a teacher signal and thereby generates a second output value 213 according to the input value 15 based on at least one second parameter; an output unit 14 that selectively outputs one of the first output value 212 and the second output value 213 as the output value 17; and a learning designating unit 11 that determines whether to designate the first learning by the first learning unit 12 and whether to designate the second learning by the second learning unit 13.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a learning device that is given two kinds of information from outside, a teacher signal and a reward, and performs learning based on both, and to a recording medium.

[0002]

[Prior Art] In recent years, there has been strong demand for systems used in industry to be customizable in response to the requirements of individual users. Those requirements are becoming increasingly complex. In addition, information networks, the environment in which such systems are used, continue to develop rapidly, and a system must also be able to cope with such changes in its environment. For these reasons, it is becoming difficult to design all the functions a system needs in advance, at development time. Research is therefore being conducted on learning techniques that allow the system itself to adapt to its users and its environment. In this specification, a system that can adapt to a user or an environment by using a learning technique is called a learning device.

[0003] Classified by the information that a learning device uses when it learns, conventional learning techniques fall into learning based on a teacher signal and learning based on a reward.

[0004] Known methods of learning based on a teacher signal include methods using neural networks (reference: D. E. Rumelhart et al., Parallel Distributed Processing: Explorations in the Microstructure of Cognition, The MIT Press, 1986) and methods using decision trees (reference: J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993).

[0005] Reinforcement learning (R. S. Sutton and A. Barto: Reinforcement Learning: An Introduction, A Bradford Book, The MIT Press, 1998) is known as learning based on a reward.

[0006] Conventional techniques that apply the above learning methods to a learning device include the technique disclosed in Japanese Patent No. 2856259 and the technique disclosed in Japanese Unexamined Patent Application Publication No. 2000-35956.

[0007] In the conventional technique disclosed in Japanese Patent No. 2856259, supervised learning is used when an action-determining neural network learns how a robot should act. As teacher data for that supervised learning, the data that were evaluated well are selected from among sets of data generated in a suitable manner by a noise adder. To select this teacher data, an evaluation-value-learning neural network stores whether the rewards obtained so far were good or bad, and only data for which the reward was good are used as teacher data for training the action-determining neural network.

[0008] In the conventional technique disclosed in Japanese Unexamined Patent Application Publication No. 2000-35956, the action of an action generator is determined based on a reward from the environment, and a state predictor performs supervised learning of how the environment changed as a result of that action, thereby predicting changes in state.

[0009]

[Problems to Be Solved by the Invention] When a system is to use learning techniques to adapt to the complex environment of the real world, the system (learning device) needs to make effective use of both the teacher signal and the reward, not just one of them, in order to learn efficiently. In a real environment, a learning device can often use both a teacher signal and a reward given from outside. For example, in a learning device used to build a design support system, information on the design process of a design expert is available as a teacher signal: what design operation was performed in what situation or, more specifically for layout design, what placement-changing operation was performed in what placement situation. At the same time, information on whether the final design result was good or bad is usually available as a reward. Likewise, in a learning device used in a transport robot, information on a model transport demonstrated by a human operating the robot is available as a teacher signal, and information on whether the transport succeeded is available as a reward. Making effective use of both the teacher signal and the reward that are available in this way is necessary for effective learning.

[0010] However, a learning device that learns based on both a teacher signal and a reward obtained from the environment has not been realized so far.

[0011] In both the conventional technique disclosed in Japanese Patent No. 2856259 and the conventional technique disclosed in Japanese Unexamined Patent Application Publication No. 2000-35956, of the teacher signal and the reward, only the reward was given from the environment.

[0012] As described above, conventional learning devices learned using only the reward, or only the teacher signal, out of the available information given from the environment. As a result, the information given from the environment could not be used effectively, and learning efficiency could not be improved.

[0013] The present invention has been made in view of the above problems, and its object is to provide a learning device that can improve learning efficiency by effectively using both a reward and a teacher signal when both are given from the environment.

[0014]

[Means for Solving the Problems] A learning device of the present invention generates an output value according to an input value based on at least one parameter, and includes: a reward signal input unit that receives, from outside the learning device, a reward signal indicating an evaluation value related to the output value; a teacher signal input unit that receives, from outside the learning device, a teacher signal indicating an expected value of the output value for the input value; and an adjustment unit that adjusts the value of the at least one parameter, based on the reward signal and the teacher signal, so that the evaluation value becomes higher and the output value for the input value approaches the expected value. The above object is thereby achieved.
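The adjustment described in [0014] can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the specification: the model f(a, X) = a * X, the step sizes, the `reward_fn` callback that stands in for the environment, and the simple keep-if-better perturbation used for the reward-driven part.

```python
def output(a, x):
    """Hypothetical model: output value Y = f(a, X) = a * X."""
    return a * x


def adjust(a, x, teacher, reward_fn, lr=0.1, delta=0.05):
    """One adjustment of the single parameter `a` using both signals.

    teacher   -- expected output value for input x (teacher signal)
    reward_fn -- hypothetical callback mapping an output value to a reward
    """
    # Teacher-driven part: gradient step on the squared error (y - teacher)**2 / 2.
    y = output(a, x)
    a = a - lr * (y - teacher) * x

    # Reward-driven part: keep a small perturbation only if it earns more reward.
    base = reward_fn(output(a, x))
    for candidate in (a + delta, a - delta):
        if reward_fn(output(candidate, x)) > base:
            return candidate
    return a
```

Calling `adjust` repeatedly with a reward that grows as the output nears the desired value moves `a` so that the output approaches the teacher signal while the reward increases, which is the joint condition the paragraph states.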

[0015] The learning device may further include a pattern generation unit that generates a first learning pattern based on the reward signal and a second learning pattern based on the teacher signal. The adjustment unit adjusts the value of the at least one parameter based on the first learning pattern so that the evaluation value becomes higher, and adjusts it based on the second learning pattern so that the output value for the input value approaches the expected value. The first learning pattern may be configured so that, when the adjustment unit adjusts the value of the at least one parameter based on the first learning pattern so that the evaluation value becomes higher, the output value for the input value approaches the expected value. The second learning pattern may be configured so that, when the adjustment unit adjusts the value of the at least one parameter based on the second learning pattern so that the output value for the input value approaches the expected value, the evaluation value becomes higher.

[0016] The pattern generation unit may always generate the second learning pattern based on the teacher signal whenever the teacher signal is input to the pattern generation unit, and may generate the first learning pattern based on the reward signal when no teacher signal is input to it.

[0017] The learning device may further include a determination unit that determines the quality of the teacher signal based on a predetermined criterion, and the pattern generation unit may decide, depending on the quality of the teacher signal, whether to generate the second learning pattern based on the teacher signal.
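The selection logic of [0016] and [0017] can be sketched as follows. This is only an illustration under stated assumptions: the `LearningPattern` tuple, the numeric `quality` score with its `min_quality` threshold, the fallback to the reward signal when the teacher signal is rejected, and the rule that a positive reward reinforces the output just produced are all hypothetical and do not appear in the specification.

```python
from typing import NamedTuple, Optional


class LearningPattern(NamedTuple):
    kind: str      # "first" (from the reward signal) or "second" (from the teacher signal)
    x: float       # input value
    target: float  # value the output should move toward (hypothetical encoding)


def generate_pattern(x, y, reward, teacher=None,
                     quality=1.0, min_quality=0.5) -> Optional[LearningPattern]:
    """Return the learning pattern for one time step, or None if there is none.

    quality / min_quality are hypothetical stand-ins for the determination
    unit's judgment and its predetermined criterion.
    """
    # A teacher signal of sufficient quality always takes priority.
    if teacher is not None and quality >= min_quality:
        return LearningPattern("second", x, teacher)
    # Otherwise fall back to the reward signal. Assumption: a positive reward
    # means the output y that was just produced is worth reinforcing.
    if reward > 0:
        return LearningPattern("first", x, y)
    return None
```

With the default `quality`, the function reduces to the behavior of [0016] alone: a teacher signal, when present, always wins over the reward signal.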

[0018] Another learning device of the present invention generates an output value according to an input value, and includes: a first learning unit that executes first learning based on a reward signal and thereby generates a first output value according to the input value based on at least one first parameter; a second learning unit that executes second learning based on a teacher signal and thereby generates a second output value according to the input value based on at least one second parameter; an output unit that selectively outputs one of the first output value and the second output value as the output value; and a learning designating unit that determines whether to designate the first learning by the first learning unit and whether to designate the second learning by the second learning unit. The first learning unit includes a reward signal input unit that receives, from outside the learning device, a reward signal indicating an evaluation value related to the output value, and a first adjustment unit that adjusts the value of the at least one first parameter based on the reward signal so that the evaluation value becomes higher. The second learning unit includes a teacher signal input unit that receives, from outside the learning device, a teacher signal indicating an expected value of the output value for the input value, and a second adjustment unit that adjusts the value of the at least one second parameter based on the teacher signal so that the second output value for the input value approaches the expected value. The above object is thereby achieved.
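The structure in [0018] (two learning units, a learning designating unit, and an output unit) might be arranged as in the sketch below. The one-parameter linear model, the perturbation rule for reward-based learning, the designation rule (second learning only when a teacher signal is present), and the selection rule (output whichever candidate earns more reward, cf. [0021]) are assumptions chosen to keep the example short; the specification leaves these choices open.

```python
class LinearUnit:
    """Hypothetical learner with one parameter: output = a * x."""

    def __init__(self, a=0.0):
        self.a = a

    def output(self, x):
        return self.a * x


class RewardUnit(LinearUnit):
    """First learning unit: adjusts its parameter from a reward signal."""

    def learn(self, x, reward_fn, delta=0.05):
        base = reward_fn(self.output(x))
        for candidate in (self.a + delta, self.a - delta):
            if reward_fn(candidate * x) > base:
                self.a = candidate  # keep the perturbation that raised the reward
                break


class TeacherUnit(LinearUnit):
    """Second learning unit: adjusts its parameter from a teacher signal."""

    def learn(self, x, teacher, lr=0.1):
        self.a -= lr * (self.output(x) - teacher) * x


class LearningDevice:
    def __init__(self):
        self.first = RewardUnit()
        self.second = TeacherUnit()

    def step(self, x, reward_fn, teacher=None):
        # Learning designating unit (assumed rule): the first learning always
        # runs; the second learning runs only when a teacher signal exists.
        self.first.learn(x, reward_fn)
        if teacher is not None:
            self.second.learn(x, teacher)
        # Output unit (assumed rule): output the candidate with the higher reward.
        y1, y2 = self.first.output(x), self.second.output(x)
        return y1 if reward_fn(y1) >= reward_fn(y2) else y2
```

Keeping the two parameter sets separate means a poor teacher signal cannot corrupt what the reward-based unit has learned, and the output unit can still fall back on either candidate.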

[0019] The learning designating unit may determine whether to designate the first learning according to the value of an attention parameter that changes with time.

[0020] The output unit may determine, according to the value of a memory-status parameter that changes with time, whether to output the first output value or the second output value as the output value.

[0021] The output unit may determine, based on the reward signal, whether to output the first output value or the second output value as the output value.

[0022] The learning device may further include a direct imitation unit that outputs the teacher signal as the second output value regardless of the input value.

[0023] The learning designating unit may designate either the first learning by the first learning unit or the second learning by the second learning unit, but not both. When the state in which the learning designating unit designates the first learning changes to the state in which it designates the second learning, the at least one second parameter may be adjusted based on the at least one first parameter. When the state in which it designates the second learning changes to the state in which it designates the first learning, the at least one first parameter may be adjusted based on the at least one second parameter.
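The hand-over of parameters in [0023] can be pictured with a small sketch. It assumes, purely for illustration, that each learning unit has a single parameter and that "adjusting based on" the other unit's parameter means copying it; the specification does not fix either point, and the class and method names are hypothetical.

```python
class SwitchingLearner:
    """Tracks which learning is designated and hands parameters over on a switch."""

    def __init__(self):
        self.params = {"first": 0.0, "second": 0.0}
        self.active = "first"

    def designate(self, mode):
        if mode not in self.params:
            raise ValueError(f"unknown learning mode: {mode}")
        if mode != self.active:
            # Transition: initialize the newly designated learner from the one
            # that was active so far (assumption: a plain copy).
            self.params[mode] = self.params[self.active]
            self.active = mode
```

A copy is the simplest possible transfer; a real device with differently structured learners would need some mapping between the two parameter sets.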

[0024] The learning device may further include a speech recognition unit that recognizes speech input from outside the learning device, and a semantic interpretation unit that extracts semantic information based on the recognition result of the speech recognition unit.

[0025] The learning device may further include an instruction input unit that receives an operation instruction for the learning device, and the semantic information may be information related to at least one of the operation instruction, the teacher signal, and the reward signal.

[0026] Another learning device of the present invention generates an output value according to an input value, and includes: a plurality of learning modules, each of which executes learning and thereby generates a module output value according to the input value; a learning designating unit that determines, based on a predetermined first rule, whether to designate the learning by each of the plurality of learning modules; an action selection unit that selectively outputs, based on a predetermined second rule, one of the module output values output from the plurality of learning modules as a first output value; a proficiency unit that generates a second output value according to the input value based on at least one parameter; and an output unit that selectively outputs, based on a predetermined third rule, one of the first output value and the second output value as the output value. The proficiency unit adjusts the at least one parameter so that the second output value approaches the first output value. When the output unit selects the second output value, it stops the generation of module output values by the plurality of learning modules, the determination by the learning designating unit of whether to designate the learning, and the selective output by the action selection unit of one of the module output values as the first output value. The above object is thereby achieved.

[0027] Another learning device of the present invention generates an output value according to an input value, and includes: at least one classifier, each representing a rule that defines a first output value for the input value; a first generation unit that generates a first new classifier based on a reward signal indicating an evaluation value related to the output value, and adds the first new classifier to the at least one classifier; a second generation unit that generates a second new classifier based on a teacher signal indicating an expected value of the output value for the input value, and adds the second new classifier to the at least one classifier; a setting unit that sets a reliability for each of the at least one classifier to which the first and second new classifiers have been added; and an output unit that generates the output value based on the at least one first output value output from each of those classifiers and on the reliability set for each of them. The above object is thereby achieved.
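A rough sketch of the classifier-based device in [0027] follows. Several details are assumptions introduced only for illustration: classifiers match an input by exact equality, a reward-derived classifier takes the reward clipped to [0, 1] as its reliability, a teacher-derived classifier gets reliability 1.0, and the output is the reliability-weighted average of the matching classifiers' outputs. The specification states only that the output is generated from the first output values and the reliabilities.

```python
from dataclasses import dataclass


@dataclass
class Classifier:
    condition: float    # input value the rule applies to (hypothetical exact match)
    action: float       # first output value the rule prescribes
    reliability: float  # confidence set for this rule


def classifier_from_teacher(x, teacher, reliability=1.0):
    """Second generation unit: build a rule directly from the teacher signal."""
    return Classifier(x, teacher, reliability)


def classifier_from_reward(x, y, reward):
    """First generation unit: build a rule from an output y and the reward it earned."""
    return Classifier(x, y, max(0.0, min(1.0, reward)))


def generate_output(classifiers, x):
    """Output unit: reliability-weighted average over the matching rules."""
    matching = [c for c in classifiers if c.condition == x]
    total = sum(c.reliability for c in matching)
    if total == 0:
        return None  # no applicable rule with non-zero reliability
    return sum(c.action * c.reliability for c in matching) / total
```

Because both kinds of classifier live in one pool and differ only in reliability, rules learned from rewards and rules learned from teacher signals can be combined in a single output computation.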

[0028] A recording medium of the present invention is a computer-readable recording medium on which is recorded a program for causing a learning device to execute processing that generates an output value according to an input value based on at least one parameter. The processing includes the steps of: (a) receiving, from outside the learning device, a reward signal indicating an evaluation value related to the output value; (b) receiving, from outside the learning device, a teacher signal indicating an expected value of the output value for the input value; and (c) adjusting the value of the at least one parameter, based on the teacher signal and the reward signal, so that the output value for the input value approaches the expected value and the evaluation value becomes higher. The above object is thereby achieved.

[0029] Another recording medium of the present invention is a computer-readable recording medium on which is recorded a program for causing a learning device to execute processing that generates an output value according to an input value. The processing includes the steps of: (a) executing first learning based on a reward signal and thereby generating a first output value according to the input value based on at least one first parameter; (b) executing second learning based on a teacher signal and thereby generating a second output value according to the input value based on at least one second parameter; (c) selectively outputting one of the first output value and the second output value as the output value; and (d) determining whether to designate the first learning and whether to designate the second learning. Step (a) includes the steps of: (a1) receiving, from outside the learning device, a reward signal indicating an evaluation value related to the output value; and (a2) adjusting the value of the at least one first parameter based on the reward signal so that the evaluation value becomes higher. Step (b) includes the steps of: (b1) receiving, from outside the learning device, a teacher signal indicating an expected value of the output value for the input value; and (b2) adjusting the value of the at least one second parameter based on the teacher signal so that the first output value for the input value approaches the expected value. The above object is thereby achieved.

[0030] Another recording medium of the present invention is a computer-readable recording medium on which is recorded a program for executing processing that generates an output value according to an input value. The processing includes the steps of: (a) executing a plurality of learning processes and thereby generating a plurality of module output values according to the input value; (b) determining, based on a predetermined first rule, whether to designate each of the plurality of learning processes; (c) selectively outputting, based on a predetermined second rule, one of the plurality of module output values as a first output value; (d) generating a second output value according to the input value based on at least one parameter; and (e) selectively outputting, based on a predetermined third rule, one of the first output value and the second output value as the output value. Step (d) includes the step of (d1) adjusting the at least one parameter so that the second output value approaches the first output value. When the second output value is output as the output value in step (e), the generation of the plurality of module output values in step (a), the determination in step (b) of whether to designate each of the plurality of learning processes, and the selective output in step (c) of one of the plurality of module output values as the first output value are stopped. The above object is thereby achieved.

[0031] Another recording medium of the present invention is a computer-readable recording medium on which is recorded a program for executing processing that generates an output value according to an input value using at least one classifier. Each of the at least one classifier represents a rule that defines a first output value for the input value. The processing includes the steps of: (a) generating a first new classifier based on a reward signal indicating an evaluation value related to the output value, and adding the first new classifier to the at least one classifier; (b) generating a second new classifier based on a teacher signal indicating an expected value of the output value for the input value, and adding the second new classifier to the at least one classifier; (c) setting a reliability for each of the at least one classifier to which the first and second new classifiers have been added; and (d) generating the output value based on the at least one first output value output from each of those classifiers and on the reliability set for each of them. The above object is thereby achieved.

[0032]

[Embodiments of the Invention] First, the principles of learning based on a teacher signal and of learning based on a reward are described.

[0033] FIG. 1A shows the input/output relationship of a learning device 100 that performs learning based on a teacher signal. An input value X is input to the learning device 100 from the environment (the outside of the learning device 100). The learning device 100 has a variable a₁ as an internal parameter.

[0034] The learning device 100 outputs an output value Y according to the input value X. The relationship between the output value Y and the input value X is expressed by (Equation 1).

[0035]

[Equation 1] Y = f₁(a₁, X), where f₁ is a predetermined function.

[0036] In general, an expected value of the output value Y exists for an input value X. This expected value is, for example, information that directly specifies the output value that the learning device 100 should output (or the action it should take) in a given state. Such an expected value is input to the learning device 100 from outside as a teacher signal Y₀. Based on the teacher signal Y₀, the learning device 100 adjusts the value of the variable a₁ so that the output value Y = f₁(a₁, X) for a particular input value X approaches the teacher signal Y₀ (the expected value). This is called learning based on a teacher signal.
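As a concrete illustration of learning based on a teacher signal, the sketch below adjusts a₁ by gradient descent on the squared difference between Y and Y₀. The choice f₁(a₁, X) = a₁ * X, the learning rate, and the use of gradient descent are assumptions made for this example; the specification says only that f₁ is a predetermined function and that a₁ is adjusted so that Y approaches Y₀.

```python
def f1(a1, x):
    """Hypothetical choice of the predetermined function: f1(a1, X) = a1 * X."""
    return a1 * x


def teacher_update(a1, x, y0, lr=0.1):
    """Move a1 so that f1(a1, x) gets closer to the teacher signal y0."""
    error = f1(a1, x) - y0
    return a1 - lr * error * x  # gradient of error**2 / 2 with respect to a1


# Repeated presentation of the same input X = 2.0 with teacher signal Y0 = 6.0.
a1 = 0.0
for _ in range(50):
    a1 = teacher_update(a1, x=2.0, y0=6.0)
```

With these values each update shrinks the remaining error by a constant factor, so a₁ converges toward 3.0 and the output toward the teacher signal 6.0.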

[0037] In the example shown in FIG. 1A, the input value, the output value, and the internal parameter are each a single scalar quantity, but each of them may instead be a plurality of scalar quantities or a vector quantity. When the learning device 100 uses a neural network, the internal parameters can be, for example, the connection weights of the neural network. When the learning device 100 describes its input/output relationship by a set of IF-THEN rules and the output value is determined by a weight coefficient set for each IF-THEN rule, the internal parameters can be, for example, those weight coefficients.
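To make the IF-THEN case in [0037] concrete, the sketch below stores each rule together with a weight coefficient and lets the matching rule with the largest weight decide the output. The particular rules, the weights, and the winner-takes-all selection are hypothetical; the specification does not say how the weight coefficients determine the output value.

```python
# Each rule: (IF condition on the input, THEN output value, weight coefficient).
rules = [
    (lambda x: x < 0, -1.0, 0.2),
    (lambda x: x >= 0, 1.0, 0.7),
    (lambda x: x > 10, 5.0, 0.9),
]


def rule_output(rules, x):
    """Return the output of the matching rule with the largest weight coefficient."""
    matching = [(weight, out) for cond, out, weight in rules if cond(x)]
    if not matching:
        return None
    return max(matching)[1]  # tuples compare by weight first
```

In this representation, learning amounts to changing the weight coefficients, which are the internal parameters of the device.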

【００３８】FIG. 1B shows the input/output relationship of a learning device 200 that performs learning based on rewards. An input value X is input to the learning device 200 from the environment (outside the learning device 200). The learning device 200 has a variable a₂ as an internal parameter.

【００３９】The learning device 200 outputs an output value Y corresponding to the input value X. The relationship between the output value Y and the input value X is expressed by (Equation 2).

【００４０】[0040]

[Equation 2] Y = f₂(a₂, X), where f₂ is a predetermined function.

【００４１】When the learning device 200 outputs the output value Y to the environment, a reward (reward signal) R₀ is input to the learning device 200 from the environment according to the output value Y. A reward does not directly specify the output value that the learning device 200 should output (or the action it should take). Rather, it is information given to the learning device 200 when, as a result of the learning device 200 having output the output value Y to the environment, the state defined by the learning device 200 and the environment in which it exists reaches a particular state. The reward indicates whether that state is desirable, that is, whether the output value Y was desirable.

【００４２】Thus, the reward R₀ indicates an evaluation value associated with the output value Y. When outputting the output value Y brings the state defined by the learning device 200 and its environment into a desirable state, the value of the reward R₀ is positive (a high evaluation value); when it brings about an undesirable state, the value of the reward R₀ is negative (a low evaluation value).

【００４３】The reward R₀ may take one of two values, a predetermined positive value and a predetermined negative value, or it may take continuous values according to how desirable the state is.

【００４４】Based on the reward R₀, the learning device 200 adjusts the value of the variable a₂ so that the evaluation value becomes higher, that is, so that more positive rewards are obtained. This is called learning based on rewards.

【００４５】In the example shown in FIG. 1B, the input value, the output value, and the internal parameter are each a single scalar quantity; however, each may instead be a plurality of scalar quantities or a vector quantity. When the learning device 200 employs a neural network, the internal parameters may be, for example, the connection weights of the neural network. When the learning device 200 describes its input/output relationship by a set of IF-THEN rules and the output value is determined by the weighting coefficient assigned to each IF-THEN rule, the internal parameters may be, for example, those weighting coefficients.

【００４６】The output value Y may also be the value given by f₂(a₂, X) plus a randomly varying component. When the output value Y includes such a component, it can vary even in a situation where the input value X does not change and the value of the variable a₂ is not adjusted because no positive or negative reward R₀ is obtained. This makes it more likely that the learning device avoids a prolonged situation in which no positive or negative reward R₀ is obtained from the environment and the variable a₂ is therefore not adjusted, that is, a situation in which learning based on rewards fails to progress.
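
The effect of the random component can be sketched as follows. The linear form of f₂, the uniform noise distribution, and the names `noisy_output` and `noise_scale` are assumptions made for illustration; the specification only says that a randomly varying component is added.

```python
import random


def noisy_output(a2: float, x: float, noise_scale: float, rng: random.Random) -> float:
    """Return f2(a2, x) plus a randomly varying component.

    f2 is assumed to be linear here; noise_scale is a hypothetical setting
    that bounds the size of the random component.
    """
    return a2 * x + rng.uniform(-noise_scale, noise_scale)


rng = random.Random(0)
# The input x and the parameter a2 stay fixed, yet the output still varies,
# so the device keeps trying different outputs until some reward arrives.
outputs = [noisy_output(a2=1.0, x=0.5, noise_scale=0.1, rng=rng) for _ in range(5)]
```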

【００４７】In the examples shown in FIGS. 1A and 1B, the learning device 100 and the learning device 200 receive a value (input value) from the environment and output a value (output value). However, any known component may be interposed at the interface between each of the learning devices 100 and 200 and the environment. For example, any sensor may be used as the interface between the learning devices 100 and 200 and the environment. The learning devices 100 and 200 can receive input values from the environment via the sensor. Alternatively, the learning device 100 and the learning device 200 may receive the teacher signal and the reward, respectively, via a sensor. In this case, the teacher signal need not be given explicitly to the learning device 100; the learning device 100 may acquire it from human behavior by observing that behavior.

【００４８】Any actuator may also be used as an interface with the environment. The learning devices 100 and 200 can use the actuator to perform an operation or action on the environment according to the output value.

【００４９】The points that any sensor and actuator may be interposed at the interface between the learning device and the environment, and that the learning device may receive the teacher signal and the reward via a sensor, apply in common to the learning devices of all the embodiments described below.

【００５０】Embodiments of the present invention are described below with reference to FIGS. 2 to 17.

【００５１】(Embodiment 1) FIG. 2 shows the configuration of a learning device 250 according to Embodiment 1 of the present invention.

【００５２】The learning device 250 generates an external output value 17 corresponding to an external input value 15. A teacher signal and a reward are input to the learning device 250 from outside. The learning device 250 includes a learning unit 210, a proficiency unit 19, and a proficiency switching unit 20.

【００５３】The learning unit 210 as a whole generates a learning unit output value 211 corresponding to the external input value 15. The learning unit 210 includes a learning designation unit 11, a reward-based learning unit 12, a teacher-signal-based learning unit 13, and an operation selection unit 14.

【００５４】The proficiency unit 19 outputs a proficiency unit output value 214 corresponding to the external input value 15, based on parameters internal to the proficiency unit 19. The proficiency unit 19 adjusts the values of its internal parameters so that the proficiency unit output value 214 becomes equal to the learning unit output value 211. In other words, the proficiency unit 19 learns the input/output relationship of the learning unit 210 as a whole.

【００５５】The proficiency switching unit 20 outputs the learning unit output value 211 as the external output value 17 while learning by the proficiency unit 19 has not yet progressed, and outputs the proficiency unit output value 214 as the external output value 17 once that learning has progressed.
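
The switching behavior can be sketched as below. This passage does not state how the progress of learning by unit 19 is measured, so the sketch assumes a hypothetical progress measure, `agreement_rate` (the fraction of recent steps in which output value 214 matched output value 211), and a hypothetical `threshold`.

```python
def select_external_output(output_211: str, output_214: str,
                           agreement_rate: float, threshold: float = 0.9) -> str:
    """Choose the external output value 17.

    agreement_rate and threshold are illustrative assumptions; the source only
    says that the switch depends on whether learning by unit 19 has progressed.
    """
    if agreement_rate >= threshold:
        return output_214  # learning by unit 19 has progressed: use its output
    return output_211      # otherwise keep using the learning unit 210 output
```

For example, `select_external_output("left", "right", agreement_rate=0.5)` keeps the learning unit output, while an agreement rate of 0.95 switches to the output of unit 19.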

【００５６】The operation of the learning device 250 is described in detail below, taking a specific task as an example.

【００５７】FIG. 3 shows the task used to explain the operation of the learning device 250 in detail. The example shown in FIG. 3 is a simplified ping-pong game. Reference numeral 25 denotes the environment in which the game is played, represented as a region in the two-dimensional xy plane: 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1. Reference numeral 22 denotes a ball, whose coordinates are B(Bx, By). Reference numeral 23 denotes a controller controlled by a teacher, and reference numeral 24 denotes a controller controlled by the learning device 250. The coordinates of the controller 23 are P(Px, Py), and the coordinates of the controller 24 are C(Cx, Cy). Here, the teacher is, for example, a human who knows the rules of the task (the simplified ping-pong game).

【００５８】The ball 22 and the controllers 23 and 24 can each move within the environment 25. However, the y-coordinate Py of the controller 23 and the y-coordinate Cy of the controller 24 are fixed at constant values, so the controllers 23 and 24 can move only leftward or rightward (in the direction of decreasing x or of increasing x).

【００５９】The ball 22 keeps moving in a given direction as long as it does not hit a wall (a boundary line defining the extent of the environment 25), and is reflected when it hits one. When the ball 22 hits the lower wall (the x-axis), it is determined whether the controller 24 is within a predetermined distance of the point of impact. If the controller 24 is within that distance, a positive reward "R+" is given to the learning device 250; if it is not, a negative reward "R−" is given to the learning device 250.
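
The wall reflection and the reward decision can be sketched as follows. The per-step update, the requirement that each step be shorter than the width of the environment, and the value of `catch_radius` (standing in for the "predetermined distance") are assumptions; the function names are hypothetical.

```python
from typing import Tuple


def reflect(pos: float, vel: float) -> Tuple[float, float]:
    """Advance one coordinate by vel and reflect it at the walls 0 and 1.

    Assumes |vel| < 1 so that at most one reflection happens per step.
    """
    pos += vel
    if pos < 0.0:
        return -pos, -vel
    if pos > 1.0:
        return 2.0 - pos, -vel
    return pos, vel


def reward_at_lower_wall(ball_x: float, controller_x: float,
                         catch_radius: float = 0.1) -> str:
    """Reward given when the ball hits the lower wall (the x-axis).

    catch_radius is an assumed value for the predetermined distance.
    """
    return "R+" if abs(ball_x - controller_x) <= catch_radius else "R-"
```

Applying `reflect` independently to the x and y coordinates of the ball 22 gives straight-line motion with reflection at all four walls.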

【００６０】The controller 23 moves according to an operation signal Mp output by the teacher. The controller 24 moves according to an operation signal Mc output by the learning device 250. Each of the operation signals Mp and Mc can take one of three values: "right", "do not move", and "left". When the operation signal Mc is "left", the controller 24 moves leftward (in the direction of decreasing Cx). When the operation signal Mc is "do not move", the controller 24 does not move.
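
A minimal sketch of how an operation signal could move a controller along the x-axis is given below. The step size and the clamping to the range 0 to 1 are assumptions that the specification does not state; `move_controller` is a hypothetical name.

```python
# Direction of travel along x for each of the three operation-signal values.
DIRECTION = {"left": -1, "do not move": 0, "right": 1}


def move_controller(cx: float, signal: str, step: float = 0.05) -> float:
    """Return the new x-coordinate of a controller after one operation signal.

    step is an assumed per-signal displacement; the result is clamped so the
    controller stays inside the environment (0 <= x <= 1), also an assumption.
    """
    new_x = cx + DIRECTION[signal] * step
    return min(1.0, max(0.0, new_x))
```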

【００６１】The task imposed on the learning device 250 in the environment shown in FIG. 3 is to learn a way of operating the controller that increases the cumulative reward obtained. Initially, the learning device 250 has no knowledge of the rules governing the ball's motion or of when rewards are given. The learning device 250 is required to generate an appropriate operation signal Mc (that is, the external output value 17) according to the coordinates B of the ball 22 and the coordinates C of the controller 24 (that is, the external input value 15). The teacher can either display the controller 23 in the environment and demonstrate model operations, or leave the controller 23 undisplayed and watch the learning device 250 operate the controller 24. The coordinates P of the controller 23 displayed in the environment are input to the learning device 250.

【００６２】Referring again to FIG. 3, the operation of the learning unit 210 is described in detail.

【００６３】The reward-based learning unit 12 learns so that more positive rewards are given. The reward-based learning unit 12 is configured by, for example, an Elman-type neural network (reference: J. L. Elman, "Finding structure in time", Cognitive Science, 14, pp. 179-211). This neural network learns by the backpropagation method.

【００６４】Six variables are input to the neural network: the coordinates B(Bx, By) of the ball 22, the coordinates P(Px, Py) of the controller 23, and the coordinates C(Cx, Cy) of the controller 24. The output variable is the operation signal Mc for the controller 24. Because the operation signal Mc can take the three values "right", "do not move", and "left", the neural network is configured with three output units (not shown), one for each value of the operation signal.

【００６５】The neural network of the reward-based learning unit 12 sets, in each of the three output units, a value that depends on the six input variables. The value set in each output unit is calculated based on the connection weights. The operation-signal value corresponding to the output unit holding the largest value is adopted as the value of the operation signal Mc. For example, if the values set in the output units for "right", "do not move", and "left" are 0.2, 0.3, and 0.6, respectively, "left" is adopted as the value of the operation signal Mc. This value "left" is output to the operation selection unit 14 as the output value 212 of the reward-based learning unit 12.
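
The selection rule in this paragraph is an argmax over the three output units. A minimal sketch, using the worked numbers from the text, follows; the ordering of the units in `SIGNALS` and the function name are illustrative assumptions.

```python
from typing import Sequence

# Assumed ordering of the three output units.
SIGNALS = ("right", "do not move", "left")


def select_operation_signal(unit_values: Sequence[float]) -> str:
    """Return the operation-signal value whose output unit holds the largest value."""
    best = max(range(len(SIGNALS)), key=lambda i: unit_values[i])
    return SIGNALS[best]


mc = select_operation_signal([0.2, 0.3, 0.6])  # the example from the text
```

With the values 0.2, 0.3, and 0.6, the unit for "left" is the largest, so `mc` is "left".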

【００６６】The connection weights of the neural network of the reward-based learning unit 12 are initially (before learning has progressed) set at random. The output value 212 of the reward-based learning unit 12 is therefore also random at first. While no reward is obtained, the neural network of the reward-based learning unit 12 calculates the values set in the three output units according to its connection weights, and the output value 212 is generated. Generating the output value 212 from randomly set connection weights in this way is called "trial and error".

【００６７】During such trial and error, a randomly varying component may be added to the output value 212 of the reward-based learning unit 12.

【００６８】When the reward-based learning unit 12 receives a reward from outside the learning device 250, the connection weights of the neural network are adjusted according to the backpropagation method.

【００６９】When a positive reward "R+" is given to the learning device 250, learning by the backpropagation method is performed to reinforce the operation taken at that time. For example, if the values set in the units for "right", "do not move", and "left" are 0.2, 0.3, and 0.6, respectively, "left" is generated as the output value 212 for the operation signal Mc. When a certain condition is satisfied, this output value 212 is output from the learning device 250 as the external output value 17; that is, the controller 24 moves "left". If a positive reward "R+" is then given to the learning device 250, the neural network of the reward-based learning unit 12 is trained on learning data in which "right" is 0.0, "do not move" is 0.0, and "left" is 1.0. The connection weights of the neural network are thereby adjusted so that, for the same input as when the positive reward "R+" was given, "left" becomes more likely to be output as the value of the operation signal.

【００７０】The condition for the output value 212 to be output from the learning device 250 as the external output value 17 is that the operation selection unit 14 selects the output value 212 of the reward-based learning unit as the learning unit output value 211, and that the proficiency switching unit 20 selects the learning unit output value 211 as the external output value 17. The operations of the operation selection unit 14 and the proficiency switching unit 20 are described later.
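
The two-stage selection described here can be sketched as a pair of nested choices. The boolean flags `selects_212` and `uses_211` are hypothetical stand-ins for the decisions made by unit 14 and unit 20, whose actual criteria are described elsewhere in the specification.

```python
def external_output_17(out_212: str, out_213: str, out_214: str,
                       selects_212: bool, uses_211: bool) -> str:
    """Route one of the internal output values to external output value 17.

    selects_212: unit 14 picks output 212 (otherwise output 213) as output 211.
    uses_211:    unit 20 picks output 211 (otherwise output 214) as output 17.
    """
    out_211 = out_212 if selects_212 else out_213
    return out_211 if uses_211 else out_214
```

Output value 212 reaches the environment only when both flags are true, which is the condition stated in the paragraph.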

【００７１】Conversely, if a negative reward "R−" is given to the learning device 250 when the controller 24 has moved "left", the learning device 250 judges that "left" was the wrong operation. The neural network of the reward-based learning unit 12 is trained on learning data in which "right" is 0.8, "do not move" is 0.8, and "left" is 0.0. The connection weights inside the neural network are thereby adjusted so that, for the same input as when the negative reward "R−" was given, a value other than "left" becomes more likely to be output as the value of the operation signal.
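
The way a reward is turned into learning data for backpropagation can be sketched as follows. The target values 1.0, 0.0, and 0.8 are taken from the worked examples in the text; generalizing them from "left" to any chosen operation, the unit ordering in `SIGNALS`, and the ASCII reward labels are assumptions.

```python
from typing import List

# Assumed ordering of the three output units.
SIGNALS = ("right", "do not move", "left")


def training_target(chosen: str, reward: str) -> List[float]:
    """Build the target vector for the three output units from a reward.

    "R+": reinforce the chosen operation (1.0 for it, 0.0 for the others).
    "R-": suppress the chosen operation (0.0 for it, 0.8 for the others).
    """
    if reward == "R+":
        return [1.0 if s == chosen else 0.0 for s in SIGNALS]
    if reward == "R-":
        return [0.0 if s == chosen else 0.8 for s in SIGNALS]
    raise ValueError("reward must be 'R+' or 'R-'")
```

The resulting vector would be paired with the network input at the time of the operation and used as one backpropagation example.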

【００７２】Adjusting the connection weights in this way makes the learning device 250 more likely to obtain the positive reward "R+" and less likely to obtain the negative reward "R−" when it outputs the output value 212 of the reward-based learning unit 12 to the environment as the external output value 17. In other words, the connection weights are adjusted so that the evaluation value indicated by the reward becomes higher.

【００７３】Thus, the reward-based learning unit 12 functions as a reward signal input unit that receives a reward (reward signal) from outside the learning device 250, and also as a first adjustment unit that adjusts the connection weights (at least one first parameter) based on the reward so that the evaluation value indicated by the reward becomes higher.

【００７４】The reward-based learning unit 12 also functions as a first learning unit that, by executing learning based on rewards (first learning), generates the output value 212 (first output value) corresponding to the external input value 15 based on the connection weights (at least one first parameter).

【００７５】FIG. 4 shows the configuration of the teacher-signal-based learning unit 13. The teacher-signal-based learning unit 13 includes: a teacher action storage unit 131 that stores the relationship between the teacher's actions and the state of the environment; a memory-based imitation unit 132 that estimates, from the teacher's actions stored in the teacher action storage unit 131, what output value the teacher would output in the current state of the environment; and a direct imitation unit 133 that outputs the teacher's action itself as the output value regardless of the current state of the environment.

【００７６】The components of the teacher-signal-based learning unit 13 are described below.

【００７７】The teacher action storage unit 131 stores the relationship between the teacher's actions and the state of the environment. In the ping-pong task shown in FIG. 3, the teacher action storage unit 131 stores the relationship between the coordinates B of the ball 22 and the teacher's controller operation Mp. The teacher action storage unit 131 is configured by, for example, an Elman-type neural network. Four variables are input to this neural network: the coordinates B(Bx, By) of the ball 22 and the coordinates P(Px, Py) of the controller 23. The output variable is the operation signal Mc for the controller 24. Because the operation signal Mc can take the three values "right", "do not move", and "left", the neural network has three output units (not shown), one for each value of the operation signal. The learning data for the neural network of the teacher action storage unit 131 are pairs consisting of the four input variables at a given time and the teacher's operation signal Mp at that time.

【００７８】When the learning designation unit 11, described later, designates learning based on the teacher signal, the neural network of the teacher action storage unit 131 learns by adding new learning data at regular intervals. As a result, the Elman-type neural network inside the teacher action storage unit 131 stores the relationship between the ball coordinates B(Bx, By) and the operation Mp ("right", "do not move", or "left") that the teacher took at that time. This amounts to building a model of the teacher's behavior inside the neural network: the teacher's knowledge of the task is stored in the teacher action storage unit 131 in the form of operations taken in particular states of the environment.

【００７９】The connection weights of the neural network are adjusted as the teacher action storage unit 131 executes learning by the backpropagation method. Backpropagation learning is executed using, as learning data, the pair consisting of the coordinates B of the ball 22 and the coordinates P of the controller 23 at a given time and the operation signal Mp indicating the action the teacher took in response. The connection weights are thereby adjusted so that the operation signal Mc output from the neural network for those coordinates B and P approaches the operation signal Mp indicating the teacher's action.

【００８０】Once learning has progressed, the teacher action storage unit 131 can calculate what operation the teacher would perform when the ball 22 and the controller 24 are at given positions.

【００８１】The memory-based imitation unit 132 determines the value of the operation signal Mc based on the teacher's actions stored by the teacher action storage unit 131. Four variables are input to the memory-based imitation unit 132: the coordinates B(Bx, By) of the ball and the coordinates C(Cx, Cy) of the controller 24. The output variable is the operation signal Mc ("right", "do not move", or "left") for the controller 24.

【００８２】When the coordinates B(Bx, By) of the ball and the coordinates C(Cx, Cy) of the controller 24 (that is, the external input value 15) are input, the memory-based imitation unit 132 passes them unchanged to the teacher action storage unit 131. In the teacher action storage unit 131, the received ball coordinates B(Bx, By) and controller 24 coordinates C(Cx, Cy) are input to the neural network. Here, the received coordinates C(Cx, Cy) of the controller 24 are input to the neural network in place of the coordinates P(Px, Py) of the controller 23. The operation signal Mc calculated by the neural network of the teacher action storage unit 131 is returned to the memory-based imitation unit 132. As the teacher action storage unit 131 adjusts the connection weights, the operation signal Mc calculated by the neural network comes closer to the operation signal Mp that the teacher would give for the ball coordinates B and the controller 24 coordinates C. In other words, an appropriate operation signal Mc based on the stored teacher behavior comes to be generated as the output value 213 according to the state of the environment.
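
The substitution of C for P can be sketched as below. `toy_teacher_model` is a hypothetical stand-in for the trained Elman-type network (it simply moves toward the ball), and `imitate_from_memory` is an illustrative name; neither comes from the specification.

```python
from typing import Callable, Tuple

Point = Tuple[float, float]
# A teacher model maps (ball coordinates, controller coordinates) to a signal.
TeacherModel = Callable[[Point, Point], str]


def toy_teacher_model(ball: Point, controller: Point) -> str:
    """Stand-in for the trained network: move the controller toward the ball in x."""
    if ball[0] > controller[0]:
        return "right"
    if ball[0] < controller[0]:
        return "left"
    return "do not move"


def imitate_from_memory(teacher_model: TeacherModel, ball: Point,
                        controller_24: Point) -> str:
    """Query the teacher model with C supplied in the slot it was trained on with P."""
    return teacher_model(ball, controller_24)


signal_213 = imitate_from_memory(toy_teacher_model, ball=(0.8, 0.5),
                                 controller_24=(0.3, 0.0))
```

Because the model was trained on the teacher's own controller position, feeding it the position of controller 24 asks, in effect, what the teacher would do if standing where the learner stands.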

【００８３】The memory-based imitation unit 132 passes the coordinates B of the ball 22 and the coordinates C of the controller 24 (that is, the external input value 15) to the teacher action storage unit 131, and outputs the operation signal Mc received from the teacher action storage unit 131 as the output value 213.

【００８４】The teacher's operation signal Mp indicates the expected value of the operation signal Mc (that is, the expected value of the external output value 17), and is input to the learning device 250 as the teacher signal.

【００８５】The teacher action storage unit 131 functions as a teacher signal input unit that receives the teacher signal from outside the learning device 250, and also as a second adjustment unit that adjusts the connection weights (at least one second parameter) based on the teacher signal so that the output value 213 for the external input value 15 (the ball coordinates B and the controller 24 coordinates C) approaches the expected value.

【００８６】The teacher-signal-based learning unit 13 functions as a second learning unit that, through its internal teacher action storage unit 131 executing learning based on the teacher signal (second learning), generates the output value 213 (second output value) corresponding to the external input value 15 based on the connection weights of the neural network of the teacher action storage unit 131 (at least one second parameter).

【００８７】Unlike the memory-based imitation unit 132, the direct imitation unit 133 does not rely on stored teacher behavior; it imitates the teacher's operation signal Mp at a given time as is. The direct imitation unit 133 takes the teacher's operation signal Mp as input and outputs its value ("right", "do not move", or "left") unchanged as the operation signal Mc. That is, the direct imitation unit 133 imitates the teacher's operation signal Mp as is, regardless of the position of the ball 22 or the position of the controller 23. For example, if the teacher operates the controller 23 "left", the direct imitation unit 133 outputs "left" as the operation signal Mc for the controller 24. The direct imitation unit 133 does not learn. What it performs is simple imitation, but when the environment is complex, or when the teacher operates the controller 23 while taking the position of the controller 24 into account, the behavior of the direct imitation unit 133 is also appropriate for the environment. The behavior of the direct imitation unit 133 is stored in the proficiency unit 19, described later. This makes it possible to reproduce the behavior of the direct imitation unit 133 (that is, the teacher's behavior) even when no teacher is present.

[0088] Note that the direct imitation unit 133 may be omitted.

[0089] Next, the operation of the learning designation unit 11 (FIG. 2) will be described. The reward-based learning unit 12 and the teacher-signal-based learning unit 13 cannot always learn at the same time. The learning designation unit 11 therefore determines whether to designate reward-based learning and whether to designate teacher-signal-based learning.

[0090] As its method of learning designation, the learning designation unit 11 determines the designation according to the state of the environment and a variable indicating the internal state of the learning device 250. In Embodiment 1 of the present invention, information on whether a teacher is present (whether a teacher signal is input) is used as the state of the environment, and an attention parameter is used as the variable indicating the internal state. The attention parameter controls how intently the learning device 250 concentrates on learning. For example, the value of the attention parameter changes with time. In the following description, the attention parameter is assumed to take a value from 0 to 1, decreasing over a certain period of time and then increasing over a certain period of time.
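
The text only states that the attention parameter lies between 0 and 1, falls over a certain period, and then rises over a certain period. One possible reading is a triangular wave, sketched below. The function name `attention`, the linear shape, the starting value of 1.0, and the period lengths `fall_steps` and `rise_steps` are all illustrative assumptions, not values from the specification.

```python
def attention(t: int, fall_steps: int = 100, rise_steps: int = 100) -> float:
    """Triangular attention schedule in [0, 1] (illustrative assumption).

    Starts at 1.0, falls linearly to 0.0 over `fall_steps` time steps,
    rises linearly back to 1.0 over `rise_steps` time steps, then repeats.
    """
    phase = t % (fall_steps + rise_steps)
    if phase < fall_steps:
        # Falling half of the cycle.
        return 1.0 - phase / fall_steps
    # Rising half of the cycle.
    return (phase - fall_steps) / rise_steps
```

Any other schedule that stays within 0 to 1 and alternates between decreasing and increasing phases would fit the description equally well.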

[0091] FIG. 5 shows the processing procedure for learning designation by the learning designation unit 11.

[0092] The processing procedure is described below for each of steps S11 to S15 shown in FIG. 5. The processing of steps S11 to S15 is executed by the learning designation unit 11 at a predetermined timing, and learning is performed in the learning modules designated by the learning designation unit 11 (the proficiency unit 19, the reward-based learning unit 12, and the teacher-signal-based learning unit 13).

[0093] Step S11: Learning by the proficiency unit 19 is designated. Learning is always performed in the proficiency unit 19. Learning by the proficiency unit 19 is described later.

[0094] Step S12: It is determined whether a teacher's operation is present, that is, whether a teacher signal is input. If the determination is "YES", the process proceeds to step S13. If the determination is "NO", the process proceeds to step S14.

[0095] Step S13: Learning based on the teacher signal is designated. When this designation is made, the teacher action storage unit 131 executes learning based on the teacher signal.

[0096] Step S14: It is determined whether the attention parameter is higher than a predetermined value (for example, 0.5) and the output value of the reward-based learning unit 12 is being output to the outside of the learning device 250 as the external output value 17. If the determination is "YES", the process proceeds to step S15. If the determination is "NO", the process ends. Whether the output value of the reward-based learning unit 12 is being output to the outside of the learning device 250 as the external output value 17 can be determined by monitoring the selection made by the action selection unit 14 and the selection made by the proficiency switching unit 20. This is because it depends on whether the following condition is satisfied: the action selection unit 14 has selected the output value 212 of the reward-based learning unit as the learning unit output value 211, and the learning unit output value 211 has been selected as the external output value 17.

[0097] Step S15: Reward-based learning is designated. When this designation is made, the reward-based learning unit 12 executes learning based on the reward.
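
Steps S11 to S15 can be sketched as a single decision function. This is a minimal illustration rather than the patented implementation: the names `LearningDesignation` and `designate_learning` are hypothetical, the threshold 0.5 is the example value given in step S14, and the sketch assumes that the procedure ends after step S13, which the text does not state explicitly.

```python
from dataclasses import dataclass

ATTENTION_THRESHOLD = 0.5  # example value given in step S14


@dataclass
class LearningDesignation:
    proficiency: bool    # learning by unit 19
    teacher_based: bool  # learning by unit 13 (teacher signal)
    reward_based: bool   # learning by unit 12 (reward)


def designate_learning(teacher_present: bool,
                       attention: float,
                       reward_output_is_external: bool) -> LearningDesignation:
    # S11: learning by unit 19 is always designated.
    proficiency = True
    # S12: is a teacher signal being input?
    if teacher_present:
        # S13: designate teacher-signal-based learning (assumed to end here).
        return LearningDesignation(proficiency, True, False)
    # S14: attention must exceed the threshold AND the output of unit 12
    # must currently be the external output value 17.
    if attention > ATTENTION_THRESHOLD and reward_output_is_external:
        # S15: designate reward-based learning.
        return LearningDesignation(proficiency, False, True)
    return LearningDesignation(proficiency, False, False)
```

The flag `reward_output_is_external` stands for the monitoring result described in step S14; how it is obtained is left outside this sketch.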

[0098] As described above, the learning designation unit 11 determines whether to designate learning based on the reward signal (first learning) by the reward-signal-based learning unit 12 (first learning unit), and whether to designate learning based on the teacher signal (second learning) by the teacher-signal-based learning unit 13 (second learning unit).

[0099] The monitoring in step S14 may be executed by the learning designation unit 11 or by the reward-based learning unit 12. In steps S14 and S15, whether the reward-based learning unit 12 executes learning is decided according to whether its output value is being output to the outside of the learning device 250 as the external output value 17. The reason is to guarantee that the reward given to the learning device 250 is an evaluation of an output value that the reward-based learning unit 12 itself generated.

[0100] When the learning designation unit 11 designates learning according to the above procedure, all data from the teacher (teacher signals) are used for learning, and reward-based learning is performed when no teacher is present (when no teacher signal is input). The learning device 250 can therefore continue learning regardless of whether a teacher is present in its surrounding environment.

[0101] The action selection unit 14 (FIG. 2) selects one of the output value 212 of the reward-based learning unit 12 and the output value 213 of the teacher-signal-based learning unit 13, and outputs the selected one to the proficiency switching unit 20 as the learning unit output value 211. In Embodiment 1 of the present invention, as described with reference to FIG. 4, the teacher-signal-based learning unit 13 outputs two kinds of output values to the action selection unit 14: the output value from the memory-based imitation unit 132 and the output value from the direct imitation unit 133. The action selection unit 14 therefore selects one of three output values, output respectively from the reward-based learning unit 12, the memory-based imitation unit 132, and the direct imitation unit 133, and outputs the selected output value to the proficiency switching unit 20 as the learning unit output value 211. This selection is made based on whether a teacher is present and on the value of a memory status parameter, which is a parameter indicating the internal state. The memory status parameter takes a value from 0 to 1; it is set to a low value at the start of learning and increases as learning time elapses. This is because the learning device 250 is not expected to reproduce the teacher's behavior accurately in the early stage of learning.

[0102] FIG. 6 shows the processing procedure for output selection by the action selection unit 14.

[0103] The processing procedure is described below for each of steps S21 to S25 shown in FIG. 6. The processing of steps S21 to S25 is executed by the action selection unit 14 at a predetermined timing.

[0104] Step S21: It is determined whether a teacher's operation is present. Whether a teacher's operation is present is equivalent to whether the teacher's controller 23 is present in the environment. If the determination is "YES", the process proceeds to step S22; if "NO", the process proceeds to step S23.

[0105] Step S22: The output value output from the direct imitation unit 133 is selected.

[0106] Step S23: It is determined whether the memory status is poor, that is, whether the value of the memory status parameter is lower than a predetermined value (for example, 0.5). If the determination is "YES", the process proceeds to step S24; if "NO", the process proceeds to step S25.

[0107] Step S24: The output value 212 output from the reward-based learning unit 12 is selected.

[0108] Step S25: The action based on the memory of the teacher, that is, the output value output from the memory-based imitation unit 132, is selected.
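
Steps S21 to S25 amount to a three-way choice among the candidate output values. The sketch below illustrates that choice under stated assumptions: the enum `Source` and the function `select_action_source` are hypothetical names, and the threshold 0.5 is the example value given in step S23.

```python
from enum import Enum

MEMORY_THRESHOLD = 0.5  # example value given in step S23


class Source(Enum):
    DIRECT_IMITATION = "direct imitation unit 133"
    REWARD_BASED = "reward-based learning unit 12"
    MEMORY_IMITATION = "memory-based imitation unit 132"


def select_action_source(teacher_present: bool, memory_status: float) -> Source:
    # S21: is a teacher's operation present?
    if teacher_present:
        # S22: copy the teacher's current operation.
        return Source.DIRECT_IMITATION
    # S23: is the memory status poor (below the threshold)?
    if memory_status < MEMORY_THRESHOLD:
        # S24: fall back on the reward-based output value 212.
        return Source.REWARD_BASED
    # S25: act from the stored memory of the teacher.
    return Source.MEMORY_IMITATION
```

For example, `select_action_source(False, 0.3)` returns `Source.REWARD_BASED`, which corresponds to the early stage of learning with no teacher present.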

[0109] Next, the operation of the proficiency unit 19 (FIG. 2) will be described. The proficiency unit 19 continually learns the relationship between the positions of the ball 22 and the controller 24, which are inputs from the environment, and the operation signal Mc produced by the learning device 250. The proficiency unit 19 is constituted by, for example, an Elman-type neural network that learns by the backpropagation method. The inputs of the neural network are six variables: the coordinates B (Bx, By) of the ball 22, the coordinates P (Px, Py) of the controller 23, and the coordinates C (Cx, Cy) of the controller 24. The output variable is the operation signal Mc for the controller 24. Because the operation signal Mc can take the three values "right", "stay", and "left", the neural network is configured with three output units (not shown), one for each value. The neural network of the proficiency unit 19 sets each of the three output units to a value that depends on the values of the six input variables. The value set in each output unit is calculated based on the connection weights.
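
As an illustration of the network shape described for unit 19 (six inputs, three output units, Elman-type recurrence), the following is a minimal forward-pass sketch using only the Python standard library. The class `ElmanNet`, the hidden-layer size of 8, the tanh activation, the random weight initialization, and the rule of picking the output unit with the largest value are all assumptions made for the example; the specification fixes none of them.

```python
import math
import random

ACTIONS = ("right", "stay", "left")  # one output unit per value of Mc


class ElmanNet:
    """Elman-type network: the hidden state is fed back as context input."""

    def __init__(self, n_in=6, n_hidden=8, n_out=3, seed=0):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                    for _ in range(rows)]

        self.w_in = mat(n_hidden, n_in)       # input   -> hidden weights
        self.w_ctx = mat(n_hidden, n_hidden)  # context -> hidden weights
        self.w_out = mat(n_out, n_hidden)     # hidden  -> output weights
        self.context = [0.0] * n_hidden       # previous hidden state

    def forward(self, x):
        hidden = [
            math.tanh(sum(w * v for w, v in zip(in_row, x))
                      + sum(w * c for w, c in zip(ctx_row, self.context)))
            for in_row, ctx_row in zip(self.w_in, self.w_ctx)
        ]
        self.context = hidden  # becomes the context for the next time step
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.w_out]


def choose_action(outputs):
    """Pick the action whose output unit has the largest value (assumption)."""
    return ACTIONS[max(range(len(outputs)), key=outputs.__getitem__)]
```

A call such as `ElmanNet().forward([Bx, By, Px, Py, Cx, Cy])` returns three values, one per output unit, which `choose_action` maps to an operation signal.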

[0110] When the proficiency unit 19 learns by the backpropagation method, each item of training data is a pair consisting of the six variables input to the neural network at a given point in time and the learning unit output value 211 at that point in time. The connection weights (at least one parameter) are thereby adjusted so that the proficiency unit output value 214 output by the proficiency unit 19 approaches the learning unit output value 211. Because the learning unit output value 211 is the output value selected by the action selection unit 14, the proficiency unit 19 learns the relationship between the external input value 15 (the coordinates B of the ball 22 and the coordinates C of the controller 24) and the output value output by one of the reward-based learning unit 12, the memory-based imitation unit 132, and the direct imitation unit 133. In the early stage of learning by the proficiency unit 19, learning in the individual learning modules has not yet progressed, so the proficiency unit 19 mainly stores the action history of the learning device 250. Once learning in the individual learning units has progressed, actions appropriate for the environment are supplied to the proficiency unit 19 as training data. At this stage, the proficiency unit 19 stores the degree of proficiency of the learning device 250 as a whole rather than that of the individual learning units.
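
The training pair described above can be sketched as follows. The helper names `make_training_pair` and `squared_error` are hypothetical, and both the one-hot encoding of the learning unit output value 211 over the three output units and the squared-error measure are assumptions; the text only says that the weights are adjusted so that output value 214 approaches output value 211.

```python
ACTIONS = ("right", "stay", "left")  # one output unit per value of Mc


def make_training_pair(ball, teacher_ctrl, own_ctrl, learning_unit_output):
    """Build one (inputs, target) pair for unit 19.

    ball, teacher_ctrl, own_ctrl: (x, y) coordinates B, P and C.
    learning_unit_output: the value 211 selected at the same time step.
    """
    inputs = [*ball, *teacher_ctrl, *own_ctrl]  # the six input variables
    # One-hot target over the three output units (assumed encoding).
    target = [1.0 if a == learning_unit_output else 0.0 for a in ACTIONS]
    return inputs, target


def squared_error(outputs, target):
    """Error that a backpropagation step would reduce (assumed measure)."""
    return 0.5 * sum((o - t) ** 2 for o, t in zip(outputs, target))
```

Repeatedly reducing this error over pairs collected during operation is what moves output value 214 toward output value 211.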

[0111] The inputs of the neural network of the proficiency unit 19 are six variables: the coordinates B (Bx, By) of the ball 22, the coordinates P (Px, Py) of the controller 23, and the coordinates C (Cx, Cy) of the controller 24. These six variables include the external input value 15 (the coordinates of the ball 22 and the coordinates C of the controller 24). The proficiency unit 19 therefore generates the proficiency unit output value 214 corresponding to the external input value 15 based on the connection weights (at least one parameter) of the neural network.

[0112] The proficiency switching unit 20 (FIG. 2) switches between selecting the output value of the proficiency unit 19 (proficiency unit output value 214) and selecting the output value of the action selection unit 14 (learning unit output value 211) as the external output value 17 of the learning device 250. When the output value of the proficiency unit 19 is selected as the external output value 17, the operations of the learning designation unit 11, the reward-based learning unit 12, the teacher-signal-based learning unit 13, and the action selection unit 14 are all stopped. Consequently, when the proficiency unit output value 214 is selected, the computational cost required by the learning device 250 as a whole is reduced.

[0113] The proficiency switching unit 20 selects the output value based on the attention parameter used by the learning designation unit 11 and the memory status parameter used by the action selection unit 14. When the attention of the learning device 250 is low (for example, when the value of the attention parameter is lower than 0.2), or when the memory status of the learning device 250 is good (for example, when the value of the memory status parameter is higher than 0.8), the proficiency switching unit 20 outputs the proficiency unit output value 214 from the proficiency unit 19 as the external output value 17. Otherwise, it outputs the learning unit output value 211 from the action selection unit 14 as the external output value 17. This is because, when the memory status is good, learning by the proficiency unit 19 is expected to have progressed sufficiently, so that the external output value 17 of the learning device 250 can be determined by the computation of the proficiency unit 19 alone.
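
The switching rule of unit 20 can be written as a short function. This is a sketch under stated assumptions: the function name `select_external_output` and its return convention are hypothetical, while the thresholds 0.2 and 0.8 are the example values given in the text.

```python
LOW_ATTENTION = 0.2  # example threshold from the text
GOOD_MEMORY = 0.8    # example threshold from the text


def select_external_output(attention: float,
                           memory_status: float,
                           proficiency_output: str,
                           learning_unit_output: str):
    """Return (external output value 17, whether unit 19 supplied it).

    Output value 214 of unit 19 is used when attention is low or the
    memory status is good; otherwise output value 211 of unit 14 is used.
    """
    use_proficiency = attention < LOW_ATTENTION or memory_status > GOOD_MEMORY
    if use_proficiency:
        return proficiency_output, True
    return learning_unit_output, False
```

When the second element of the result is `True`, units 11, 12, 13 and 14 can be left idle, which is where the saving in computation comes from.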

[0114] In this way, the proficiency switching unit 20 selectively outputs, based on a predetermined rule, one of the output value generated by the action selection unit 14 and the output value generated by the proficiency unit 19 as the external output value 17.

[0115] The amount of computation required to operate the proficiency unit 19 is smaller than the total amount required to operate the learning designation unit 11, the reward-based learning unit 12, the teacher-signal-based learning unit 13, and the action selection unit 14. Accordingly, by including the proficiency unit 19 and the proficiency switching unit 20, the learning device 250 can perform the computation that generates outputs from learning results at a lower computational cost (in both space and time complexity) than a device without them.

[0116] FIG. 7 shows the relationship between the inputs and outputs of the reward-based learning unit 12, the teacher action storage unit 131, the memory-based imitation unit 132, the direct imitation unit 133, and the proficiency unit 19. Each module (each component of the learning device) listed in the "Module" column 701 of FIG. 7 generates an output value corresponding to a given input value. To learn the correspondence between the variables shown in the "Input" column 702 and the variable shown in the "Output" column 703, each module uses the variable shown in the "Teacher" column 704 or the reward ("R+" or "R-") shown in the "Reward" column 705.

[0117] Once learning has progressed, each module can generate an appropriate value for the variable shown in the "Output" column 703 even when not all of the variables shown in the "Input" column 702 of FIG. 7 are available. For example, the reward-based learning unit 12 and the proficiency unit 19 become able to generate an appropriate value as the control signal Mc even when the coordinates P of the teacher's controller are not input (no teacher is present).

[0118] The following describes how the learning device 250 (FIG. 2) described above learns while performing the task of the ping-pong game (FIG. 3).

[0119] When no teacher is present, the learning device 250 learns through the reward-based learning unit 12 using positive and negative rewards. In the early stage of learning, the learning device 250 has no information about how a reward can be obtained. Therefore, when no teacher is present, the learning device 250 executes reward-based learning, outputting random operation signals Mc and obtaining the positive reward "R+" or the negative reward "R-" by trial and error.

[0120] Once the reward-based learning unit 12 has learned the relationship between the timing at which a reward is given and the positions of the ball 22 and of the controller 24 of the learning device 250, it becomes able to output operation signals Mc that control the controller 24 so that the positive reward "R+" is obtained.

[0121] However, reinforcement-learning methods of this kind, which use only rewards, are known to require an enormous amount of time for learning. Therefore, by having a teacher operate the controller 23, the learning device 250 can observe the teacher's operation and store it in the teacher-signal-based learning unit 13. This allows the learning device 250 to associate the position of the ball 22 directly with the operation of the controller 23, so that learning is effective.

[0122] In this way, in the learning device 250, learning proceeds in the reward-based learning unit 12 and in the teacher-signal-based learning unit 13, each by its own learning method.

[0123] The proficiency unit 19 learns the relationship between the external input value 15 and the learning unit output value 211 output by the learning unit 210. The proficiency unit 19 thereby stores both the learning results that the learning device 250 acquired by trial and error and the learning results taught by the teacher. Once learning has progressed, the proficiency unit 19 alone can generate correct actions (appropriate outputs) for the environment, and the reward-based learning unit 12 and the teacher-signal-based learning unit 13 no longer need to operate. The computational cost (space and time complexity) can therefore be reduced.

[0124] Thus, in an environment in which a teacher (for example, a human) is present and both rewards and teacher signals are given from the environment, the learning device 250 makes effective use of both. The learning device 250 can learn based on the teacher signal while a teacher signal is available, and can learn autonomously based on rewards while no teacher signal is available. Because the learning device 250 can learn whether or not a teacher signal is available, it can achieve higher learning efficiency than when learning based on teacher signals alone.

[0125] In addition, the learning device 250 can perform effective learning based on the teacher signal, and can therefore achieve higher learning efficiency than when learning based on rewards alone.

[0126] The reward-based learning unit 12, the teacher action storage unit 131, and the proficiency unit 19 each need to learn a correspondence between inputs and outputs. In the example described above, an Elman-type neural network was used as a configuration capable of learning such a correspondence. However, the configurations of the reward-based learning unit 12, the teacher action storage unit 131, and the proficiency unit 19 are not limited to this. For example, all or some of them may use another type of neural network, or a configuration that does not use a neural network (for example, rules in IF-THEN form). Nor is learning limited to the backpropagation method.

[0127] In the example described above, the action selection unit 14 decided whether to select the output value generated by the reward-based learning unit 12 or the output value generated by the teacher-signal-based learning unit 13 according to the state of the environment (whether a teacher signal is input) and the internal state (the value of the memory status parameter). Instead of, or in addition to, these criteria, the action selection unit 14 may make this decision based on rewards. For example, the action selection unit 14 may decide which output value to select by comparing the cumulative reward received from the environment or a human while the output value generated by the reward-based learning unit 12 is being output as the output value of the learning device 250 with the cumulative reward received from the environment or a human while the output value generated by the teacher-signal-based learning unit 13 is being output as the output value of the learning device 250. This allows the learning device 250 as a whole to act so as to maximize the expected reward for a given task.
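
The reward-based selection criterion can be illustrated with a small bookkeeping class. This is a sketch only: the class `RewardTracker`, its method names, the source labels, the use of numeric rewards, and the tie-breaking rule (ties go to the reward-based unit) are assumptions introduced for the example.

```python
class RewardTracker:
    """Accumulates rewards per output source and reports the better one."""

    SOURCES = ("reward_based", "teacher_based")  # units 12 and 13

    def __init__(self) -> None:
        self.totals = {source: 0.0 for source in self.SOURCES}

    def record(self, active_source: str, reward: float) -> None:
        # Credit the reward to the unit whose output was the device output
        # when the reward arrived.
        self.totals[active_source] += reward

    def preferred_source(self) -> str:
        # Compare cumulative rewards; on a tie the first source wins.
        return max(self.SOURCES, key=lambda s: self.totals[s])


tracker = RewardTracker()
tracker.record("reward_based", 1.0)    # e.g. R+ mapped to +1 (assumed)
tracker.record("reward_based", -1.0)   # e.g. R- mapped to -1 (assumed)
tracker.record("teacher_based", 1.0)
print(tracker.preferred_source())
```

A practical variant might compare average reward per time step instead of raw totals, so that a source that was active for longer is not favored merely for that reason; the text itself speaks only of cumulative values.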

[0128] Note that the proficiency unit 19 and the proficiency switching unit 20 described above may be omitted.

[0129] FIG. 8 shows the configuration of a learning device 260, which is the learning device 250 with the proficiency unit 19 and the proficiency switching unit 20 removed. The learning device 260 has a configuration equivalent to that of the learning unit 210 of the learning device 250 shown in FIG. 2. In FIG. 8, components identical to those shown in FIG. 2 are given the same reference numerals, and their description is omitted.

[0130] Because the learning device 260 has no proficiency switching unit 20, the output value output by the action selection unit 14 is output directly to the outside of the learning device 260. The action selection unit 14 of the learning device 260 thus functions as an output unit that selects one of the output value 212 of the reward-based learning unit 12 and the output value 213 of the teacher-signal-based learning unit 13 and outputs the selected one as the external output value 17.

[0131] The reward-based learning unit 12 of the learning device 260 monitors whether the action selection unit 14 is outputting the output value 212 or the output value 213 as the external output value 17. When reward-based learning has been designated by the learning designation unit 11 and the action selection unit 14 is outputting the output value 212, the reward-based learning unit 12 executes learning based on the reward.

[0132] Compared with the learning device 250 (FIG. 2), the learning device 260 does not offer the benefit of reduced computational cost once learning has progressed, but it does offer a simpler internal configuration and thus a lower cost of building the learning device.

[0133] The learning device may have a speech recognition function so that it can more easily acquire knowledge and instruction from a human who has some knowledge about the environment.

[0134] FIG. 9 shows the configuration of a learning device 270 having a speech recognition function. The learning device 270 has a structure in which a speech interpretation unit 271 is added to the learning device 250. In FIG. 9, components identical to those shown in FIG. 2 are given the same reference numerals, and their description is omitted.

[0135] The speech interpretation unit 271 includes a speech recognition unit 30 and a semantic interpretation unit 31.

[0136] The speech recognition unit 30 performs speech recognition processing on a speech signal uttered by a human and input from outside the learning device 270, and outputs the result of that processing (a phoneme string, a syllable string, or a word string) to the semantic interpretation unit 31. Any configuration having a speech recognition processing function may be adopted for the speech recognition unit 30.

[0137] The semantic interpretation unit 31 receives the speech recognition result from the speech recognition unit 30 and determines whether the human's speech input (i) instructs the learning device to imitate an operation performed by the human, (ii) evaluates the appropriateness of an output value of the learning device, or (iii) specifies a desired output value of the learning device. It then controls the learning device according to the determination result. In this way, the semantic interpretation unit 31 extracts semantic information based on the recognition result from the speech recognition unit 30.

[0138] Thus, with the learning device 270, a person who has some knowledge about the environment can convey his or her intention to the learning device by voice 31.

[0139] FIG. 10 shows examples of words uttered by a human, the meaning that the semantic interpretation unit 31 assigns to each, and the action that the learning device 270 should take in response. When a human tells the learning device 270 "Imitate me," the meaning of the words is an imitation instruction: the learning designating unit 11 causes only the proficiency unit 19 to learn, and the operation selecting unit 14 selects the direct imitation unit 133 and outputs the same value as the teacher signal. In this case, the semantic information extracted by the semantic interpretation unit 31 is information related to an operation instruction (command) for the learning device 270. Such semantic information is input to the learning device 270 via an instruction input unit (not shown) that receives operation instructions for the learning device 270.

[0140] When a human says "Good," "No good," or the like to the learning device 270, the meaning of the words is an evaluation, which is given as a reward after its sign has been determined. That is, "Good" is determined to be a positive reward and "No good" a negative reward. In this case, the semantic information extracted by the semantic interpretation unit 31 is information related to a reward signal for the learning device 270.

[0141] When a human says "right," "left," or the like to the learning device 270, the meaning of the words is a movement instruction. The learning designating unit 11 causes the teacher-signal-based learning unit 13 to learn, in the same way as when a teacher signal is input. The learning device 270 outputs, as the external output value 17, an operation signal exactly as instructed ("right" or "left"). In this case, the semantic information extracted by the semantic interpretation unit 31 is information related to a teacher signal.
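The three-way classification of paragraphs [0139] to [0141] can be illustrated by the following minimal sketch. It is not the implementation of the semantic interpretation unit 31: the word lists, the reward magnitudes of +1.0 and -1.0, and the names `Interpretation` and `interpret` are assumptions introduced only for illustration, and the input is assumed to be an already recognized word string.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical vocabulary; the source only gives these words as examples (FIG. 10).
IMITATE_WORDS = {"imitate me"}
REWARD_WORDS = {"good": 1.0, "no good": -1.0}  # only the sign comes from the source
TEACHER_WORDS = {"right", "left"}


@dataclass
class Interpretation:
    kind: str                      # "command", "reward", "teacher", or "unknown"
    reward: Optional[float] = None
    teacher: Optional[str] = None


def interpret(utterance: str) -> Interpretation:
    """Classify a recognized word string into one of the three meanings."""
    text = utterance.strip().lower()
    if text in IMITATE_WORDS:
        # Imitation instruction: treated as an operation command.
        return Interpretation(kind="command")
    if text in REWARD_WORDS:
        # Evaluation: converted into a signed reward.
        return Interpretation(kind="reward", reward=REWARD_WORDS[text])
    if text in TEACHER_WORDS:
        # Movement instruction: used as a teacher signal.
        return Interpretation(kind="teacher", teacher=text)
    return Interpretation(kind="unknown")
```

For example, `interpret("No good")` yields a negative reward, while `interpret("left")` yields a teacher signal carrying the instructed direction.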

[0142] With this configuration, the learning device 270 not only receives the operations of a human who has some knowledge about the environment; its behavior can also be changed by speech signals from the human, which allows learning to proceed more directly.

[0143] Because the learning device 270 can receive rewards such as "Good" and "No good" by voice, it becomes easy for a human to give rewards to the learning device 270. The learning device 270 can therefore receive rewards on more occasions than it could if spoken rewards were unavailable. As already described, a reward is information given to the learning device when the state defined by the learning device and the environment in which it exists reaches a particular state, and it indicates whether the output value that the learning device output to the environment was desirable. A reward can therefore be an evaluation value for an output value that the learning device output in the past. If rewards are received only rarely, the relationship between a received reward and the output values of the learning device tends to be unclear, and learning may not proceed directly. Because the learning device 270 receives rewards on more occasions, it is easy to identify which output value, output at which point in time, a given reward evaluates, since there is little need to trace far back through earlier output values. The learning device 270 can therefore advance reward-based learning more directly.

[0144] Likewise, because the learning device 270 can receive teacher signals such as "right" and "left" by voice, it becomes easy for a human to give teacher signals to the learning device 270. The learning device 270 can therefore receive teacher signals on more occasions than it could if spoken teacher signals were unavailable, and can learn more appropriately.

[0145] In the examples described above, the reward-based learning unit 12 executes reward-based learning and the teacher-signal-based learning unit 13 executes teacher-signal-based learning. Other methods of learning an input/output relationship are also known, for example self-organizing learning. With a configuration using the learning designating unit 11, the operation selecting unit 14, the proficiency switching unit 20, and the proficiency unit 19, the learning methods are not limited to reward-based learning and teacher-signal-based learning, and a learning device that integrates mutually different learning methods can be realized.

[0146] FIG. 11 shows the configuration of a learning device 280 that integrates three mutually different learning methods. In FIG. 11, components identical to those shown in FIG. 2 are denoted by the same reference numerals, and their description is omitted. The learning device 280 includes learning modules 21A to 21C in place of the reward-based learning unit 12 and the teacher-signal-based learning unit 13 of the learning device 250 (FIG. 2). The learning device 280 also includes a learning designating unit 281 in place of the learning designating unit 11 of the learning device 250, and an operation selecting unit 282 in place of the operation selecting unit 14 of the learning device 250.

[0147] The learning modules 21A to 21C execute learning method 1, learning method 2, and learning method 3, respectively. Each of learning methods 1 to 3 can be, for example, reward-based learning, teacher-signal-based learning, or self-organizing learning. By executing learning, the learning modules 21A to 21C generate module output values 283A to 283C, respectively, according to the external input value 15.

[0148] The learning designating unit 281 determines, based on a predetermined rule (first rule), whether to designate learning by each of the learning modules 21A to 21C.

[0149] The operation selecting unit 282 selectively outputs, as a first output value, one of the module output values output from the learning modules 21A to 21C, based on a predetermined rule (second rule).
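The roles of the first rule and the second rule can be sketched as follows. The source does not specify either rule, so both rules here are assumptions: the first rule designates every module whose training signal is currently available, and the second rule outputs the value of the module with the highest confidence score. The stand-in modules, `designate_learning`, `select_output`, and the confidence scores are all hypothetical.

```python
from typing import Callable, Dict, List

# Stand-in learning modules; real modules would be trainable models.
MODULES: Dict[str, Callable[[float], float]] = {
    "21A": lambda x: x + 1.0,
    "21B": lambda x: 2.0 * x,
    "21C": lambda x: -x,
}


def designate_learning(signal_available: Dict[str, bool]) -> List[str]:
    """First rule (assumed): a module learns when its training signal is available."""
    return [name for name in MODULES if signal_available.get(name, False)]


def select_output(x: float, confidence: Dict[str, float]) -> float:
    """Second rule (assumed): output the value of the most trusted module."""
    # Module output values, corresponding to 283A to 283C in FIG. 11.
    outputs = {name: module(x) for name, module in MODULES.items()}
    best = max(confidence, key=confidence.get)
    return outputs[best]
```

With these assumed rules, `designate_learning({"21A": True, "21C": True})` designates modules 21A and 21C, and `select_output(3.0, {"21A": 0.2, "21B": 0.7, "21C": 0.1})` returns the output of module 21B.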

[0150] In the learning device 280 configured as described above, the proficiency unit 19 learns the input/output relationship of the portion (learning unit 285) that comprises the learning modules 21A to 21C, the learning designating unit 281, and the operation selecting unit 282. Accordingly, once the learning of the proficiency unit 19 has progressed sufficiently, the external output value 17 of the learning device 280 can be determined by the proficiency unit 19 alone. The learning result of the proficiency unit 19 is one in which the redundant calculations of the learning modules 21A to 21C and the operation selecting unit 282 have been eliminated and the calculations of the individual learning modules have been integrated. The amount of calculation needed to compute the external output value 17 is therefore smaller than the total amount of calculation required to operate the learning modules 21A to 21C and the operation selecting unit 282. By including the proficiency unit 19 and the proficiency switching unit 20, the learning device 280 can thus generate output values from the learning results at a lower calculation cost (in both space complexity and time complexity) than it could without them.

[0151] The learning modules 21A to 21C can each be configured as, for example, an Elman-type neural network and can execute learning according to the backpropagation method. However, the configuration of the learning modules 21A to 21C is not limited to this. For example, all or some of the learning modules 21A to 21C may adopt a learning system using another type of neural network, or a learning system that does not use a neural network (for example, IF-THEN rules). Nor is learning limited to the backpropagation method.

[0152] In the example shown in FIG. 11, the number of learning modules is three (21A to 21C), but the number of learning modules is not limited to this. The proficiency unit 19 and the proficiency switching unit 20 may be omitted.

[0153] The plurality of learning modules may include multiple modules that use the same learning method.

[0154] (Embodiment 2) In Embodiment 1, the reward-based learning unit and the teacher-signal-based learning unit learned separately. Each of them represents the knowledge acquired through learning by internal parameters. For example, when the reward-based learning unit and the teacher-signal-based learning unit are configured as neural networks, the knowledge is represented as connection weights.

[0155] When the reward-based learning unit and the teacher-signal-based learning unit represent knowledge in the same way, the knowledge can be shared. Knowledge is represented in the same way when, for example, both units are configured as neural networks that have the same input/output relationship and the same number of intermediate layers and connections; in that case, the knowledge represented as connection weights is interchangeable. When the learning method is switched, copying the knowledge between the reward-based learning unit and the teacher-signal-based learning unit allows one learning module to inherit knowledge from the other.

[0156] FIGS. 12(a) to 12(f) show how the learning degree changes over time for a learning module that executes learning method A and a learning module that executes learning method B, with and without knowledge inheritance between them. Here, the learning degree is a value indicating the range of input values for which a learning module can generate desirable output values, and is defined by, for example, the correct-answer rate.

[0157] FIGS. 12(a) to 12(c) show the change in learning degree over time when learning module A, which executes learning method A, and learning module B, which executes learning method B, represent knowledge in the same way and inherit knowledge (exchange learning results).

[0158] FIGS. 12(d) to 12(f) show the change in learning degree over time when knowledge is not inherited between learning module A and learning module B.

[0159] Learning method A and learning method B can be, for example, reward-based learning and teacher-signal-based learning, respectively. Learning module A and learning module B are assumed not to learn at the same time.

[0160] First, the change in learning degree over time when knowledge is inherited is described with reference to FIGS. 12(a) to 12(c). FIGS. 12(a) to 12(c) show the learning degrees of learning module A and learning module B at times t1 to t3, respectively, where t1 < t2 < t3. At time t1, the learning degree of learning module A is a₁ and that of learning module B is 0 (FIG. 12(a)). Between time t1 and time t2, learning module B learns, and its learning degree increases by b₁ during this period. Because the knowledge of learning module A is copied to learning module B before learning module B learns, learning module B has a learning degree of a₁ + b₁ at time t2 (FIG. 12(b)). Between time t2 and time t3, learning module A learns, and its learning degree increases by a₂ during this period. Because the knowledge of learning module B is copied to learning module A before learning module A learns, learning module A has a learning degree of a₁ + b₁ + a₂ at time t3 (FIG. 12(c)).

[0161] Next, the change in learning degree over time when knowledge is not inherited is described with reference to FIGS. 12(d) to 12(f). FIGS. 12(d) to 12(f) show the learning degrees of learning module A and learning module B at times t1 to t3, respectively; these times are the same as the times t1 to t3 of FIGS. 12(a) to 12(c). At time t1, the learning degree of learning module A is a₁ and that of learning module B is 0 (FIG. 12(d)). As in the example described with reference to FIGS. 12(a) to 12(c), learning module B learns between time t1 and time t2, and its learning degree increases by b₁ during this period. Learning module B therefore has a learning degree of b₁ at time t2 (FIG. 12(e)). Between time t2 and time t3, learning module A learns, and its learning degree increases by a₂ during this period. Learning module A therefore has a learning degree of a₁ + a₂ at time t3 (FIG. 12(f)).

[0162] Comparing the example of FIGS. 12(a) to 12(c) (knowledge inherited) with the example of FIGS. 12(d) to 12(f) (knowledge not inherited) shows that the learning degree of each learning module is higher when knowledge is inherited than when it is not.
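The comparison in FIG. 12 can be reproduced with a small worked sketch. It assumes the idealized additive model used in the description (each learning period simply adds its gain to the learning degree, and a copy transfers the whole learning degree); the function name `learning_degrees` and the sample gains are illustrative only.

```python
from typing import List, Tuple


def learning_degrees(a1: int, b1: int, a2: int, inherit: bool) -> List[Tuple[int, int]]:
    """Return (degree of A, degree of B) at times t1, t2 and t3."""
    deg_a, deg_b = a1, 0
    history = [(deg_a, deg_b)]          # time t1

    # t1 to t2: module B learns and gains b1.
    if inherit:
        deg_b = deg_a                   # knowledge of A is copied to B first
    deg_b += b1
    history.append((deg_a, deg_b))      # time t2

    # t2 to t3: module A learns and gains a2.
    if inherit:
        deg_a = deg_b                   # knowledge of B is copied to A first
    deg_a += a2
    history.append((deg_a, deg_b))      # time t3
    return history
```

With sample gains a₁ = 30, b₁ = 20 and a₂ = 10 (arbitrary units), inheritance gives module A a learning degree of a₁ + b₁ + a₂ = 60 at time t3, whereas without inheritance module A reaches only a₁ + a₂ = 40 and module B only b₁ = 20.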

[0163] The learning device according to Embodiment 2 of the present invention is configured so that knowledge can be inherited in this way between learning modules that execute different learning methods.

[0164] In the learning device 250 of Embodiment 1 (FIG. 2), the proficiency unit 19 was provided in order to integrate the knowledge acquired by reward-based learning and the knowledge acquired by teacher-signal-based learning into a single learning module. In the learning device of Embodiment 2, by contrast, both kinds of knowledge accumulate in each of the reward-based learning unit and the teacher-signal-based learning unit. The proficiency unit of Embodiment 1 is therefore unnecessary.

[0165] FIG. 13 shows the configuration of a learning device 310 in which knowledge can be inherited between learning modules that execute different learning methods. The learning device 310 includes an input unit 311 that receives the external input value 15 from the environment, a learning designating unit 312, a learning unit 313 that executes reward-based learning (reward-based learning unit), a learning unit 314 that executes teacher-signal-based learning (teacher-signal-based learning unit), and an output unit 315.

[0166] A teacher signal and a reward are input to the learning device 310 from outside the learning device 310.

[0167] The reward-based learning unit 313 is, for example, an Elman-type neural network, and executes reward-based learning according to the backpropagation method. The reward-based learning unit 313 receives, via the input unit 311, the external input value 15 that is input to the learning device 310 from outside, and outputs an output value 323 according to the external input value 15.

[0168] In this way, the reward-based learning unit 313 functions as a first learning unit that, by executing learning based on a reward (reward signal) (first learning), outputs the output value 323 (first output value) according to the external input value 15 based on the connection weights inside the neural network (at least one first parameter).

[0169] The reward-based learning unit 313 also functions as a reward signal input unit that receives, from outside the learning device 310, a reward (reward signal) indicating an evaluation value related to the external output value 17. At the same time, it functions as a first adjusting unit that adjusts the connection weights inside the neural network (at least one first parameter) based on the reward so that the evaluation value becomes higher.

[0170] The teacher-signal-based learning unit 314 is, for example, an Elman-type neural network, and executes learning based on the teacher signal (for example, an input/output pattern that a human applies to the environment) according to the backpropagation method. The teacher-signal-based learning unit 314 receives, via the input unit 311, the external input value 15 that is input to the learning device 310 from outside, and outputs an output value 324 according to the external input value 15.

[0171] In this way, the teacher-signal-based learning unit 314 functions as a second learning unit that, by executing learning based on the teacher signal (second learning), outputs the output value 324 (second output value) according to the external input value 15 based on the connection weights inside the neural network (at least one second parameter).

[0172] The teacher-signal-based learning unit 314 also functions as a teacher signal input unit that receives, from outside the learning device 310, a teacher signal indicating the expected value of the external output value 17 for the external input value 15. At the same time, it functions as a second adjusting unit that adjusts the connection weights inside the neural network (at least one second parameter) based on the teacher signal so that the output value 324 for the external input value 15 approaches the expected value.

[0173] The learning designating unit 312 designates either reward-based learning or teacher-signal-based learning, but not both. This decision is made based on, for example, whether a teacher is present in the environment around the learning device 310 (that is, whether a teacher signal is input to the learning device 310). When a teacher is present in the environment around the learning device 310, the learning designating unit 312 designates teacher-signal-based learning; when no teacher is present, it designates reward-based learning.

[0174] The output unit 315 selectively outputs, as the external output value 17, one of the output value 323 (first output value) output by the reward-based learning unit 313 and the output value 324 (second output value) output by the teacher-signal-based learning unit 314. Which of the two the output unit 315 outputs is determined according to, for example, the value of a parameter indicating the internal state of the learning device 310 (for example, a memory-status parameter that changes with time).

[0175] Alternatively, the output unit 315 may output the output value 323 when the learning designating unit 312 designates reward-based learning, and the output value 324 when it designates teacher-signal-based learning. In the learning device 310, the knowledge acquired by reward-based learning and the knowledge acquired by teacher-signal-based learning are both inherited by each learning module. Of the two learning modules, the one currently executing learning (the reward-based learning unit 313 or the teacher-signal-based learning unit 314) can therefore be expected always to generate the output value better suited to the environment.

[0176] The reward-based learning unit 313 monitors whether the output unit 315 is outputting the output value 323 or the output value 324. When the learning designating unit 312 has designated reward-based learning (first learning) and the output unit 315 is outputting the output value 323 (that is, the output value produced by the reward-based learning unit 313: the first output value), the reward-based learning unit 313 (first learning unit) executes reward-based learning (first learning). The reward-based learning unit 313 monitors the output of the output unit 315 in order to guarantee that the reward given to the learning device 310 is an evaluation of an output value that the reward-based learning unit 313 itself generated. However, if the output unit 315 is configured to output the output value 323 whenever reward-based learning (first learning) is designated by the learning designating unit 312, this monitoring may be omitted.
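The output selection of paragraph [0175] and the learning condition of paragraph [0176] can be sketched as follows. This is a minimal illustration, not the implementation of the output unit 315: the function names and the string labels "reward" and "teacher" for the designated learning are assumptions.

```python
from typing import Tuple


def select_external_output(designated: str, out_323: float, out_324: float) -> Tuple[float, int]:
    """Return the external output value and the reference numeral of its source."""
    if designated == "reward":
        return out_323, 323
    if designated == "teacher":
        return out_324, 324
    raise ValueError(f"unknown designation: {designated!r}")


def reward_learning_enabled(designated: str, emitted_id: int) -> bool:
    """Reward-based learning runs only if it is designated AND output 323 was emitted."""
    return designated == "reward" and emitted_id == 323
```

When the output is always chosen by `select_external_output`, the second condition of `reward_learning_enabled` is implied by the first, which corresponds to the case in which the monitoring may be omitted.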

[0177] The teacher-signal-based learning unit 314 (second learning unit) executes teacher-signal-based learning (second learning) when teacher-signal-based learning (second learning) is designated by the learning designating unit 312.

[0178] The reward-based learning unit 313 and the teacher-signal-based learning unit 314 share a common method of representing the learning results acquired by learning. Learning results can therefore be exchanged (knowledge can be inherited) between the reward-based learning unit 313 and the teacher-signal-based learning unit 314. The exchange takes place when the learning designating unit 312 switches its either-or designation of learning.

[0179] For example, when the state in which the learning designating unit 312 designates reward-based learning (first learning) changes to the state in which it designates teacher-signal-based learning (second learning), the connection weights of the neural network of the reward-based learning unit 313 (at least one first parameter) are copied to the connection weights of the neural network of the teacher-signal-based learning unit 314 (at least one second parameter).

[0180] Conversely, when the learning specifying unit 312 transitions from the state in which it specifies learning based on the teacher signal (second learning) to the state in which it specifies learning based on reward (first learning), the connection weights of the neural network of the teacher-signal-based learning unit 314 are copied to the connection weights of the neural network of the reward-based learning unit 313.

[0181] Alternatively, when learning is switched, only some of the connection weights may be copied, instead of making a complete copy between the neural network of the teacher-signal-based learning unit 314 and the neural network of the reward-based learning unit 313. The connection weights of the neural network that is about to learn may also be adjusted on the basis of the connection weights of the neural network that has just finished learning.
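As a purely illustrative sketch that is not part of the original disclosure, the weight transfer described in the three preceding paragraphs might be written in Python as follows. The function name `transfer_weights`, the `ratio` parameter, and the dictionary-of-lists weight layout are assumptions introduced for illustration; the specification does not state how a partial copy or adjustment is computed, and linear blending is only one possible choice.

```python
def transfer_weights(source, target, ratio=1.0):
    """Return new weights for the network that is about to start learning.

    source, target: dicts mapping a layer name to a flat list of weights
                    (hypothetical layout; both networks share the same shape).
    ratio = 1.0     -> complete copy of the source weights.
    0 < ratio < 1   -> target weights are moved part of the way toward the
                       source weights (one assumed form of "adjustment").
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    return {
        name: [(1.0 - ratio) * t + ratio * s
               for s, t in zip(source[name], target[name])]
        for name in target
    }


# Weights of the network that has just finished reward-based learning.
reward_net = {"hidden": [0.2, -0.4], "output": [1.0, 0.5]}
# Weights of the network that is about to start teacher-signal-based learning.
teacher_net = {"hidden": [0.0, 0.0], "output": [0.0, 0.0]}

full_copy = transfer_weights(reward_net, teacher_net)            # complete copy
half_blend = transfer_weights(reward_net, teacher_net, ratio=0.5)  # partial adjustment
```

Because both units use the same representation of learning results, the same function serves either direction of the switch; only the roles of `source` and `target` are exchanged.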

[0182] With the above configuration, the learning device 310 can learn while passing knowledge between the reward-based learning unit 313 and the teacher-signal-based learning unit 314, which improves learning efficiency.

[0183] In the above description, the reward-based learning unit 313 and the teacher-signal-based learning unit 314 each use an Elman-type neural network. However, their configurations are not limited to this. For example, a learning system using another type of neural network, or a learning system that does not use a neural network (for example, rules in IF-THEN form), may be adopted as the reward-based learning unit 313 and the teacher-signal-based learning unit 314. Nor is learning limited to the backpropagation method. It suffices that learning results can be exchanged between the two units. In other words, knowledge can be inherited as long as the reward-based learning unit 313 and the teacher-signal-based learning unit 314 are configured so that the internal parameters of one can be adjusted on the basis of the internal parameters of the other.

[0184] The learning device 310 shown in FIG. 13 has a reward-based learning unit and a teacher-signal-based learning unit and passes knowledge between them. As a result, learning proceeds so that the knowledge acquired by reward-based learning and the knowledge acquired by teacher-signal-based learning accumulate together. A learning device that achieves this kind of accumulation may instead have a single neural network, without separate reward-based and teacher-signal-based learning units.

[0185] FIG. 14 shows the configuration of a learning device 320 that uses a single neural network to realize learning in which the knowledge acquired by reward-based learning and the knowledge acquired by teacher-signal-based learning accumulate together.

[0186] The learning device 320 includes an adjustment unit 322 that executes learning and generates an external output value 17 corresponding to an external input value 15 input from outside the learning device 320, a learning pattern generation unit 324 that generates the learning patterns used for learning by the adjustment unit 322, and a short-term storage unit 325 that temporarily stores input values and output values of the adjustment unit 322.

[0187] The adjustment unit 322 is constituted by an Elman-type neural network and learns according to the backpropagation method.

[0188] A reward and a teacher signal are input to the learning pattern generation unit 324 from outside the learning device 320. The teacher signal is, for example, an input/output pattern representing the relationship between a human's input and output with respect to the environment. Here, an input/output pattern means a pair of an input value and an output value. A human's input/output pattern with respect to the environment is the pair consisting of the state of the environment that the human perceives (the input value the human receives from the environment) and the output value that the human outputs to the environment in response to that state. Because humans have some knowledge of the environment, the output value a human outputs to the environment is considered appropriate for the environment. The learning device 320 is therefore expected to output to the environment the same value that a human would output for that state of the environment. The output value that a human outputs to the environment can thus be used as a teacher signal input from outside the learning device 320. In this sense, the human's input/output pattern with respect to the environment contains the teacher signal.

[0189] The human's input/output pattern with respect to the environment may be input to the learning device 320 by a human operating an arbitrary input device (not shown). Alternatively, the learning device 320 may have a function of determining whether a human is present around the learning device 320 and, if so, detecting the human's input/output pattern with respect to the environment. Such a function is realized by a known configuration, for example an image processing system that monitors the surroundings of the learning device 320.

[0190] FIG. 15 shows the processing procedure by which the learning device 320 executes learning. In this procedure, whenever a human is present around the learning device 320, the device always uses all human behavior (input/output patterns with respect to the environment) as teacher signals. Each step of the procedure shown in FIG. 15 is described below.

[0191] Step S31: An external input value 15 is input to the learning device 320 from the environment (outside the learning device 320).

[0192] Step S32: The adjustment unit 322 generates an external output value 17 corresponding to the external input value 15.

[0193] Step S33: The pair consisting of the external input value 15 input to the learning device 320 in step S31 and the external output value 17 generated for it in step S32 is stored in the short-term storage unit 325 as an input/output relationship.

[0194] Step S34: The external output value 17 generated in step S32 is output to the environment.

[0195] Step S35: It is determined whether a human is present around the learning device 320. If the determination is "YES", the process proceeds to step S36. If the determination is "NO", the process proceeds to step S38.

[0196] Step S36: A learning pattern is generated on the basis of the human's input/output pattern with respect to the environment (the teacher signal). This generation is executed by the learning pattern generation unit 324. Such a learning pattern is expressed as a pair consisting of a specific external input value 15 and the expected value of the external output value 17 for that specific external input value 15.

[0197] Step S37: The adjustment unit 322 executes learning on the basis of the learning pattern generated in step S36. This learning is performed by, for example, the backpropagation method. In the backpropagation method, the values of the connection weights of the adjustment unit 322 (at least one parameter) are adjusted so that the external output value 17 for the specific external input value 15 approaches the expected value.

[0198] Thus, in steps S36 and S37, the learning device 320 executes learning based on the teacher signal.

[0199] Step S38: It is determined whether a positive reward has been input to the learning device 320 from the environment. If the determination is "YES", the process proceeds to step S39. If the determination is "NO", the process proceeds to step S40.

[0200] Step S39: A learning pattern (learning data) is generated on the basis of the input/output relationship stored in the short-term storage unit 325 in step S33. This generation is executed by the learning pattern generation unit 324. Such a learning pattern represents a pair consisting of a specific external input value 15 and the specific external output value 17 that was output in response to it. The learning device 320 is considered to have obtained the positive reward in connection with having generated that specific external output value 17 for that specific external input value 15.

[0201] Step S40: It is determined whether a negative reward has been input to the learning device 320 from the environment. If the determination is "YES", the process proceeds to step S41. If the determination is "NO", no reward has been input, and no reward-based learning pattern is generated.

[0202] Step S41: A learning pattern is generated by changing, according to a predetermined rule, the output part of the input/output relationship stored in the short-term storage unit 325 in step S33. This generation is executed by the learning pattern generation unit 324. Such a learning pattern represents a pair consisting of a specific external input value 15 and the value obtained by changing, according to the predetermined rule, the specific external output value 17 that was output in response to it. The learning device 320 is considered to have obtained the negative reward in connection with having generated that specific external output value 17 for that specific external input value 15.

[0203] Step S42: The adjustment unit 322 executes learning on the basis of the learning pattern generated in step S39 or step S41. This learning is performed by, for example, the backpropagation method. In the backpropagation method, the connection weights of the adjustment unit 322 are adjusted so that, for a specific external input value 15 for which a positive reward was obtained, the specific external output value 17 generated in response becomes more likely to be generated, and, for a specific external input value 15 for which a negative reward was obtained, the specific external output value 17 generated in response becomes less likely to be generated. That is, the connection weights are adjusted so that the evaluation value that the reward indicates for the external output value 17 increases.

[0204] Thus, in steps S38 to S42, the learning device 320 executes learning based on the reward.
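The pattern-generation steps of FIG. 15 (steps S35, S36 and S38 to S41) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the implementation of the specification: the function names, the use of a signed number for the reward, and in particular `invert_output` are hypothetical. The specification says only that the output is changed "according to a predetermined rule" and does not define that rule; flipping binary output components is assumed here purely as an example.

```python
def invert_output(output):
    """Hypothetical 'predetermined rule': flip each binary output component."""
    return [1.0 - o for o in output]


def make_learning_pattern(stored_pair, reward, teacher_pair=None,
                          human_present=False, change_rule=invert_output):
    """Return an (input, target_output) learning pattern, or None.

    stored_pair  : (input, output) recorded in the short-term storage (step S33)
    reward       : > 0 positive reward, < 0 negative reward, 0 no reward (assumed encoding)
    teacher_pair : human input/output pattern, if one was observed
    """
    # Steps S35-S36: a human is present, so the teacher signal is used.
    if human_present and teacher_pair is not None:
        return teacher_pair                      # second learning pattern
    x, y = stored_pair
    # Steps S38-S39: positive reward, reinforce the output actually produced.
    if reward > 0:
        return (x, y)                            # first learning pattern
    # Steps S40-S41: negative reward, train toward a changed output instead.
    if reward < 0:
        return (x, change_rule(y))               # first learning pattern
    # No reward and no teacher signal: nothing to learn from in this cycle.
    return None


stored = ([1, 0], [1.0, 0.0])
positive = make_learning_pattern(stored, reward=1)
negative = make_learning_pattern(stored, reward=-1)
taught = make_learning_pattern(stored, reward=-1,
                               teacher_pair=([1, 0], [0.0, 0.0]),
                               human_present=True)
```

Whichever branch produces the pattern, the result has the same form, a pair of an input and a target output, so a single network can be trained on it with the same supervised update.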

[0205] The learning patterns that the learning pattern generation unit 324 generates in steps S39 and S41 are generated when a positive reward and a negative reward are obtained, respectively. In this way, the learning pattern generation unit 324 generates a reward-based learning pattern (first learning pattern) in steps S39 and S41. The learning pattern generation unit 324 also generates a teacher-signal-based learning pattern (second learning pattern) in step S36.

[0206] In the processing procedure shown in FIG. 15, the learning device 320 determines whether a human is present around it. If a human is present, the learning pattern generation unit 324 always generates the second learning pattern on the basis of the human's input/output pattern with respect to the environment. If no human is present, it generates the first learning pattern on the basis of the reward signal.

[0207] Alternatively, the learning device 320 may be configured without the function of determining whether a human is present around it. In that case, the learning pattern generation unit 324 may always generate the second learning pattern on the basis of the teacher signal whenever a human's input/output pattern with respect to the environment (teacher signal) is input to it, and may generate the first learning pattern on the basis of the reward when no teacher signal is input.

[0208] Using every available teacher signal in this way is particularly effective at the stage where the learning of the learning device 320 has not yet progressed (early in learning), because using all available teacher signals makes it possible to increase the learning speed.

[0209] The adjustment unit 322 thus functions as an adjustment unit that, in step S42, adjusts the connection weights on the basis of the first learning pattern so that the evaluation value indicated by the reward increases and, in step S37, adjusts the connection weights on the basis of the second learning pattern so that the external output value 17 for the external input value 15 approaches the expected value indicated by the teacher signal.

[0210] In the learning device 320, the learning results acquired by reward-based learning and those acquired by teacher-signal-based learning are both accumulated in the adjustment unit 322. It is undesirable for these two kinds of learning results to conflict. It is undesirable for reward-based learning to move the external output value 17 for the external input value 15 away from the expected value indicated by the teacher signal. It is likewise undesirable for teacher-signal-based learning to lower the evaluation value indicated by the reward.

[0211] For this reason, the first learning pattern is preferably constructed so that, when the adjustment unit 322 adjusts the connection weights on the basis of the first learning pattern (that is, when reward-based learning is executed), the external output value 17 for the external input value 15 approaches the expected value indicated by the teacher signal. Similarly, the second learning pattern is preferably constructed so that, when the adjustment unit 322 adjusts the connection weights on the basis of the second learning pattern (that is, when teacher-signal-based learning is executed), the evaluation value indicated by the reward increases.

[0212] To avoid the undesirable situation in which the results of reward-based learning and the results of teacher-signal-based learning conflict, it is also effective for the learning device 320 to judge the quality of the teacher signal and control the learning method accordingly. For example, the learning method is controlled so that teacher-signal-based learning is not executed when the quality of the teacher signal is low.

[0213] FIG. 16 shows a processing procedure in which the learning device 320 judges the quality of the teacher signal and controls the learning method on the basis of that quality. In FIG. 16, steps identical to those shown in FIG. 15 are given the same reference numerals, and their description is omitted. The remaining steps of the procedure shown in FIG. 16 are described below.

[0214] Step S55: The human's input/output relationship with respect to the environment is acquired. It can be acquired by a known configuration, for example an image processing system that monitors the surroundings of the learning device 320.

[0215] Step S56: It is determined whether a reward has been input to the learning device 320 from the environment. If the determination is "YES", the process proceeds to step S57. If the determination is "NO", the process proceeds to step S58.

[0216] Step S57: The cumulative reward value of the learning device 320 is calculated by summing the rewards input to the learning device 320. The summation covers a predetermined period; rewards input to the learning device 320 before that period are excluded from the cumulative reward value.

[0217] Step S58: It is determined whether a human is present around the learning device 320. If the determination is "YES", the process proceeds to step S59. If the determination is "NO", the process proceeds to step S63.

[0218] Step S59: It is determined whether the human has been given a reward by the environment. This determination is made, for example, by using an arbitrary sensor. If the determination is "YES", the process proceeds to step S60. If the determination is "NO", the process proceeds to step S61.

[0219] Step S60: The cumulative reward value of the human is calculated by summing the rewards given to the human by the environment. The summation covers a predetermined period; rewards given to the human before that period are excluded from the cumulative reward value. This predetermined period is equal to the predetermined period in step S57.
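The windowed accumulation used in steps S57 and S60 can be sketched as follows. This is a minimal Python illustration, assuming that each reward arrives with a timestamp and that timestamps are added in non-decreasing order; the class name `WindowedReward` and its methods are hypothetical and do not appear in the specification.

```python
from collections import deque


class WindowedReward:
    """Cumulative reward over the most recent `window` time units."""

    def __init__(self, window):
        self.window = window
        self._events = deque()          # (time, reward), oldest first

    def add(self, time, reward):
        """Record a reward received at `time` (times must not decrease)."""
        self._events.append((time, reward))

    def total(self, now):
        """Sum of rewards received later than `now - window`."""
        # Rewards older than the predetermined period are excluded.
        while self._events and self._events[0][0] <= now - self.window:
            self._events.popleft()
        return sum(reward for _, reward in self._events)


# The same window length is used for the device and for the human,
# as required by step S60.
device = WindowedReward(window=10)
device.add(0, 1.0)
device.add(5, 2.0)
device.add(12, -1.0)
recent = device.total(now=12)    # the reward at time 0 has expired
later = device.total(now=15)     # the reward at time 5 has expired as well
```

Using one class with one window length for both accumulators keeps the two cumulative values directly comparable.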

[0220] Step S61: The cumulative reward value of the learning device 320 is compared with the cumulative reward value of the human, and it is determined whether the human's cumulative reward value is higher than that of the learning device 320. If the determination is "YES", the process proceeds to step S62. If the determination is "NO", the process proceeds to step S63. A "YES" determination means that the quality of the teacher signal given by the human is judged to be high; a "NO" determination means that it is judged to be low. This determination is performed by the learning pattern generation unit 324 of the learning device 320. In step S61, the learning pattern generation unit 324 thus functions as a determination unit that judges the quality of the teacher signal on the basis of a predetermined criterion.

[0221] Step S62: Learning based on the teacher signal is executed. The processing in step S62 is the same as that in steps S36 and S37 described with reference to FIG. 15. That is, when the quality of the teacher signal is judged to be high in step S61, the learning pattern generation unit 324 generates a teacher-signal-based learning pattern (second learning pattern).

[0222] Step S63: Learning based on the reward is executed. The processing in step S63 is the same as that in steps S38 to S42 described with reference to FIG. 15. That is, when the quality of the teacher signal is judged to be low in step S61, the learning pattern generation unit 324 does not generate a teacher-signal-based learning pattern (second learning pattern). The human's behavior is ignored.
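The branch structure of FIG. 16 (steps S58, S61, S62 and S63) reduces to a comparison of two cumulative reward values. The following Python sketch illustrates that decision only; the function name `choose_learning` and the string return values are assumptions made for illustration.

```python
def choose_learning(human_present, human_cumulative, device_cumulative):
    """Select which kind of learning to execute in the current cycle.

    Returns "teacher" for teacher-signal-based learning (step S62)
    or "reward" for reward-based learning (step S63).
    """
    # Step S58: with no human nearby there is no teacher signal to use.
    if not human_present:
        return "reward"
    # Step S61: the teacher signal is judged high quality only when the
    # human's cumulative reward is strictly higher than the device's.
    if human_cumulative > device_cumulative:
        return "teacher"
    return "reward"


early = choose_learning(True, human_cumulative=5.0, device_cumulative=3.0)
mature = choose_learning(True, human_cumulative=3.0, device_cumulative=3.0)
alone = choose_learning(False, human_cumulative=9.0, device_cumulative=1.0)
```

The strict inequality follows the wording of step S61, under which a tie counts as low teacher-signal quality.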

[0223] In the processing procedure shown in FIG. 16, the quality of the teacher signal is judged on the basis of the reward obtained by the human who gives the teacher signal (the teacher signal source). The criterion for judging the quality of the teacher signal is not limited to this. For example, the quality of the teacher signal may be judged to be low if the cumulative value of the rewards input to the learning device 320 from the environment decreases even though teacher-signal-based learning has been continued for a certain period.

[0224] Having the learning device 320 judge the quality of the teacher signal and control the learning method accordingly is particularly effective once the learning of the learning device 320 has progressed to the point where it can act (output external output values 17) better than the human, because it avoids needless imitation of the human.

[0225] In the example described above, the adjustment unit 322 is constituted by an Elman-type neural network that learns according to the backpropagation method. However, the configuration of the adjustment unit 322 is not limited to this. For example, a learning system using another type of neural network, or a learning system that does not use a neural network (for example, rules in IF-THEN form), may be adopted as the adjustment unit 322. Nor is learning limited to the backpropagation method.

[0226] The speech recognition function described with reference to FIG. 9 may also be added to the learning device 310 or 320. With a speech recognition function added, the learning device 310 or 320 has more opportunities to receive teacher signals and reward signals, and can therefore learn efficiently.

[0227] (Embodiment 3) In Embodiment 3 of the present invention, learning based on a teacher signal and learning based on reward are integrated by a learning device that employs a classifier system.

[0228] As described in the following reference, a classifier system is a learning system in which the way of producing output to the environment is described by a set of classifiers.

[0229] Reference: David E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Chapter 6, Addison-Wesley, 1989.

[0230] FIG. 17 shows the configuration of a learning device 400 that uses a classifier system. The learning device 400 includes a classifier system 40 and a teacher-signal-based classifier generation unit 41.

[0231] The classifier system 40 performs processing to generate an external output value 44 corresponding to an external input value 42. The classifier system 40 includes: a set 47 of classifiers 401, each representing a rule that defines a first output value for the external input value 42; a reward-based classifier generation unit 46 (first generation unit) that generates a new classifier on the basis of the reward and adds it to the set 47 of classifiers; a setting unit 45 that sets a reliability for each classifier 401 included in the set 47; and an output unit 48 that generates the external output value 44 on the basis of the first output values output from the respective classifiers 401 and the reliabilities set for the respective classifiers 401.

[0232] As disclosed in the above reference by David E. Goldberg, the classifier system 40 is a system that learns so that the evaluation value indicated by the reward increases, by creating new classifiers (first new classifiers) on the basis of the reward and by changing the reliabilities set for the classifiers 401. That is, the classifier system 40 executes learning based on the reward input from outside the learning device 400.

[0233] The teacher-signal-based classifier generation unit 41 creates a new classifier (second new classifier) based on a teacher signal input from outside the learning device 400, and adds the new classifier to the set 47 of classifiers.
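One way to picture the teacher-signal-based classifier generation unit 41 is as a function that turns an observed pair of external input and teacher signal directly into a rule. The sketch below is an illustration under stated assumptions, not the method of the specification: the condition is taken to be the exact input string, duplicate rules are merged rather than added twice, and the `origin` field and the name `add_from_teacher` are hypothetical bookkeeping.

```python
from dataclasses import dataclass


@dataclass
class Classifier:
    condition: str      # assumed encoding: the input pattern this rule applies to
    action: int         # first output value proposed by the rule
    reliability: float
    origin: str         # "reward" or "teacher" (bookkeeping added for this sketch)


def add_from_teacher(classifier_set, external_input, teacher_signal,
                     initial_reliability=1.0):
    """Create a rule mapping the observed input to the taught output and add it."""
    for clf in classifier_set:
        if clf.condition == external_input and clf.action == teacher_signal:
            # The same rule already exists: keep one copy and raise its reliability.
            clf.reliability = max(clf.reliability, initial_reliability)
            return clf
    new_clf = Classifier(external_input, teacher_signal,
                         initial_reliability, origin="teacher")
    classifier_set.append(new_clf)
    return new_clf


classifier_set = []
add_from_teacher(classifier_set, "101", teacher_signal=1)
```

Unlike rule generation by random search, a single teacher signal yields a usable rule in one step, which is the shortcut the paragraph above describes.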

[0234] When the classifier system 40 performs reward-based learning, it mainly employs a genetic algorithm (see Chapter 1 of the above-mentioned David E. Goldberg reference). With a genetic algorithm, however, a large amount of computation is needed before an appropriate new classifier is generated, because new classifiers are produced mainly from operations between existing classifiers and from randomly generated values. Consequently, when classifiers are added by a genetic algorithm, it is difficult to obtain an appropriate classifier in a short time after the environment changes.
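The reason a genetic algorithm may need many trials can be seen from a minimal sketch of how a new classifier condition is produced. The operators below (one-point crossover between two parent conditions and per-position random mutation) are the textbook ones and are assumed here for illustration; the specification does not state which operators are used. The alphabet `01#` and the function names are likewise assumptions.

```python
import random

ALPHABET = "01#"  # assumed condition alphabet; '#' is a wildcard


def crossover(parent_a, parent_b, rng):
    """One-point crossover: prefix of one parent, suffix of the other."""
    point = rng.randrange(1, len(parent_a))  # requires length >= 2
    return parent_a[:point] + parent_b[point:]


def mutate(condition, rate, rng):
    """Replace each position by a random symbol with probability `rate`."""
    return "".join(
        rng.choice(ALPHABET) if rng.random() < rate else c for c in condition
    )


def ga_offspring(parent_a, parent_b, rate=0.1, rng=None):
    """Produce one new condition from two parents of equal length."""
    rng = rng or random.Random()
    return mutate(crossover(parent_a, parent_b, rng), rate, rng)


child = ga_offspring("1#0#", "0011", rng=random.Random(0))
```

Nothing in these operators looks at the environment, so whether the child is useful is only known after it has been evaluated through rewards. That is why an appropriate classifier may appear only after many generations.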

[0235] Thus, reward-based learning by the classifier system 40 alone has the problem that the ability to follow changes in the environment is insufficient.

[0236] The learning device 400 according to Embodiment 3 of the present invention includes the classifier system 40 and the teacher-signal-based classifier generation unit 41, and can therefore perform learning based on a teacher signal input from outside the learning device 400. With the learning device 400, when a person who has some knowledge of the environment is present near the learning device 400, an appropriate new classifier can be generated easily by creating it from the teacher signal given by that person. Alternatively, the learning device 400 may have a function of observing the behavior of nearby people, and may generate a new classifier from a person's action at a given time and the state of the environment at that time.

[0237] The setting unit 45, which sets a reliability for each classifier 401, may set a high reliability for a new classifier generated based on the teacher signal. Setting a high reliability for such a classifier increases the influence that its first output value has on the external output value 44. In addition, when learning is performed by the genetic algorithm, the search space and the computation time can be reduced by preferentially using classifiers generated based on the teacher signal when generating new classifiers.
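The two uses of teacher-generated classifiers described in this paragraph can be sketched as follows. The numeric values and the weighting scheme are assumptions chosen only for illustration: teacher-generated classifiers start with a higher reliability, and when the genetic algorithm picks parents they are drawn with a larger weight. Classifiers are represented here as hypothetical `(condition, origin)` tuples.

```python
import random


def initial_reliability(origin, teacher_value=1.0, default_value=0.1):
    """Setting unit (assumed policy): teacher-generated rules start out more reliable."""
    return teacher_value if origin == "teacher" else default_value


def pick_parents(classifiers, teacher_bias=5.0, rng=None):
    """Draw two parents for the genetic algorithm, favoring teacher-generated rules.

    `classifiers` is a list of (condition, origin) tuples; `teacher_bias` is the
    relative weight of a teacher-generated rule (an assumed parameter).
    """
    rng = rng or random.Random()
    weights = [teacher_bias if origin == "teacher" else 1.0
               for _, origin in classifiers]
    # Sampling is with replacement, so the same rule may be drawn twice.
    return rng.choices(classifiers, weights=weights, k=2)


population = [("1#0", "reward"), ("##1", "reward"), ("101", "teacher")]
parents = pick_parents(population, rng=random.Random(0))
```

Biasing parent selection toward rules that a person has already endorsed concentrates the search around conditions known to be relevant, which is one way the search space could shrink.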

[0238] In this way, new classifiers generated based on the reward signal and new classifiers generated based on the teacher signal are both added to the set 47 of classifiers. The setting unit 45 sets a reliability for each classifier 401 in the set 47, and the output unit 48 generates the external output value 44 based on the first output values output from the respective classifiers 401 in the set 47 and the reliabilities set for those classifiers.

[0239] Compared with the case where the classifier system 40 learns from rewards alone, the learning device 400 can reduce the computation time required for learning and can follow environmental changes more closely.

[0240] The speech recognition function described with reference to FIG. 9 may be added to the learning device 400. With the speech recognition function added, the learning device 400 has more opportunities to receive teacher signals and rewards, and can therefore learn efficiently.

[0241] With the learning devices of Embodiments 1 to 3 of the present invention, learning is performed based on both the teacher signal and the reward. The learning device of the present invention is particularly suitable for use in an environment where people are present (for example, a home), and can therefore be suitably applied to intelligent homes. While a person is present, the learning device can learn by observing that person's behavior; while no person is present, it can learn autonomously based on rewards. Learning efficiency is therefore higher than when learning is based on only one of the reward and the teacher signal. That is, the time needed to adapt to the user and the environment can be shortened. Because learning is efficient, it is easy for the learning device to acquire knowledge that was not built in beforehand. Consequently, not all of the functions required of the learning device need to be designed in advance when the system is developed, which makes it possible to reduce manufacturing cost.

[0242] The learning processing performed by the learning devices of Embodiments 1 to 3 of the present invention can be recorded on a recording medium in the form of a program. Any type of computer-readable recording medium, such as a floppy (registered trademark) disk or a CD-ROM, can be used as the recording medium. Installing the learning processing program read from the recording medium on a computer makes it possible for that computer to function as a learning device.

【０２４３】[0243]

[Effects of the Invention] According to the learning device of the present invention, when both a reward and a teacher signal are given from the environment, learning can be performed based on both. The learning device can therefore learn whether or not a teacher signal is available, and can achieve higher learning efficiency than when learning is based on the teacher signal alone.

[0244] The learning device of the present invention can also perform effective learning based on the teacher signal, and can achieve higher learning efficiency than when learning is based on the reward alone.

[0245] Furthermore, according to the learning device of the present invention, the knowledge acquired by reward-based learning and the knowledge acquired by teacher-signal-based learning are accumulated together. Learning efficiency can therefore be increased.

[Brief description of the drawings]

[FIG. 1A] Diagram showing the input/output relationship of a learning device 100 that performs learning based on a teacher signal.

[FIG. 1B] Diagram showing the input/output relationship of a learning device 200 that performs learning based on a reward.

[FIG. 2] Block diagram showing the configuration of a learning device 250 according to Embodiment 1 of the present invention.

[FIG. 3] Diagram showing the task used to explain the operation of the learning device 250 in detail.

[FIG. 4] Block diagram showing the configuration of the teacher-signal-based learning unit 13.

[FIG. 5] Flowchart showing the procedure for learning designation by the learning designation unit 11.

[FIG. 6] Flowchart showing the procedure for output selection by the operation selection unit 14.

[FIG. 7] Diagram showing the relationships between the inputs and outputs of the reward-based learning unit 12, the teacher action storage unit 131, the memory-based imitation unit 132, the direct imitation unit 133, and the proficiency unit 19.

[FIG. 8] Block diagram showing the configuration of a learning device 260, which is the learning device 250 without the proficiency unit 19 and the proficiency switching unit 20.

[FIG. 9] Block diagram showing the configuration of a learning device 270 having a speech recognition function.

[FIG. 10] Diagram showing examples of words spoken by a person, the meanings of those words as interpreted by the semantic interpretation unit 31, and the actions the learning device 270 should take in each case.

[FIG. 11] Diagram showing the configuration of a learning device 280 that integrates three mutually different learning methods.

[FIG. 12] (a) to (c) are diagrams showing the change over time in the degree of learning when knowledge is inherited between a learning module A that executes learning method A and a learning module B that executes learning method B; (d) to (f) are diagrams showing the change over time in the degree of learning when knowledge is not inherited between the learning module A and the learning module B.

[FIG. 13] Block diagram showing the configuration of a learning device 310 in which knowledge can be inherited between learning modules that execute different learning methods.

[FIG. 14] Block diagram showing the configuration of a learning device 320 that uses a single neural network to realize learning in which the knowledge acquired by reward-based learning and the knowledge acquired by teacher-signal-based learning are accumulated together.

[FIG. 15] Flowchart showing the procedure by which the learning device 320 performs learning.

[FIG. 16] Flowchart showing the procedure by which the learning device 320 judges the quality of the teacher signal and controls the learning method based on that quality.

[FIG. 17] Block diagram showing the configuration of a learning device 400 that uses a classifier system.

[Explanation of symbols]

11, 281, 312: Learning designation unit
12, 313: Reward-based learning unit
13, 314: Teacher-signal-based learning unit
14: Operation selection unit
19: Proficiency unit
20: Proficiency switching unit
21A to 21C: Learning modules
30: Speech recognition unit
31: Semantic interpretation unit
40: Classifier system
41: Teacher-signal-based classifier generation unit
131: Teacher action storage unit
132: Memory-based imitation unit
133: Direct imitation unit
250, 260, 270, 310, 320: Learning device
271: Speech interpretation unit
311: Input unit
315: Output unit
322: Adjustment unit
324: Learning pattern generation unit
325: Short-term storage unit

Claims

[Claims]

1. A learning device that generates an output value corresponding to an input value based on at least one parameter, comprising: a reward signal input unit that receives, from outside the learning device, a reward signal indicating an evaluation value related to the output value; a teacher signal input unit that receives, from outside the learning device, a teacher signal indicating an expected value of the output value for the input value; and an adjustment unit that adjusts the value of the at least one parameter, based on the reward signal and the teacher signal, such that the evaluation value increases and the output value for the input value approaches the expected value.

2. The learning device according to claim 1, further comprising a pattern generation unit that generates a first learning pattern based on the reward signal and generates a second learning pattern based on the teacher signal, wherein the adjustment unit adjusts the value of the at least one parameter based on the first learning pattern such that the evaluation value increases, and adjusts the value of the at least one parameter based on the second learning pattern such that the output value for the input value approaches the expected value, the first learning pattern is configured such that, when the adjustment unit adjusts the value of the at least one parameter based on the first learning pattern such that the evaluation value increases, the output value for the input value approaches the expected value, and the second learning pattern is configured such that, when the adjustment unit adjusts the value of the at least one parameter based on the second learning pattern such that the output value for the input value approaches the expected value, the evaluation value increases.

3. The learning device according to claim 2, wherein the pattern generation unit generates the second learning pattern based on the teacher signal whenever the teacher signal is input to the pattern generation unit, and generates the first learning pattern based on the reward signal when the teacher signal is not input to the pattern generation unit.

4. The learning device according to claim 2, further comprising a determination unit that determines the quality of the teacher signal based on a predetermined criterion, wherein the pattern generation unit determines, based on the quality of the teacher signal, whether to generate a learning pattern.

5. A learning device that generates an output value according to an input value, comprising: a first learning unit that, by performing first learning based on a reward signal, generates a first output value corresponding to the input value based on at least one first parameter; a second learning unit that, by performing second learning based on a teacher signal, generates a second output value corresponding to the input value based on at least one second parameter; an output unit that selectively outputs one of the first output value and the second output value as the output value; and a learning designation unit that determines whether to designate the first learning by the first learning unit and whether to designate the second learning by the second learning unit, wherein the first learning unit includes: a reward signal input unit that receives, from outside the learning device, a reward signal indicating an evaluation value related to the output value; and a first adjustment unit that adjusts the value of the at least one first parameter based on the reward signal such that the evaluation value increases, and the second learning unit includes: a teacher signal input unit that receives, from outside the learning device, a teacher signal indicating an expected value of the output value for the input value; and a second adjustment unit that adjusts the value of the at least one second parameter based on the teacher signal such that the second output value approaches the expected value.

6. The learning device according to claim 5, wherein the learning specifying unit determines whether to specify the first learning according to a value of an attention parameter that changes according to time.

7. The learning device according to claim 5, wherein the output unit determines, according to the value of a storage status parameter that changes with time, whether to output the first output value as the output value or to output the second output value as the output value.

8. The learning device according to claim 5, wherein the output unit determines, based on the reward signal, whether to output the first output value as the output value or to output the second output value as the output value.

9. The learning device according to claim 5, wherein the learning device further includes a direct imitation unit that outputs the teacher signal as the second output value regardless of the input value.

10. The learning device according to claim 5, wherein the learning designation unit designates the first learning by the first learning unit and the second learning by the second learning unit alternatively, the at least one second parameter is adjusted based on the at least one first parameter when the state in which the learning designation unit designates the first learning transitions to the state in which the learning designation unit designates the second learning, and the at least one first parameter is adjusted based on the at least one second parameter when the state in which the learning designation unit designates the second learning transitions to the state in which the learning designation unit designates the first learning.

11. The learning device according to claim 5, further comprising: a speech recognition unit that recognizes speech input from outside the learning device; and a semantic interpretation unit that extracts semantic information based on the recognition result of the speech recognition unit.

12. The learning device according to claim 11, further comprising an instruction input unit that receives an operation instruction for the learning device, wherein the semantic information is information related to at least one of the operation instruction, the teacher signal, and the reward signal.

13. A learning device that generates an output value according to an input value, comprising: a plurality of learning modules, each of which performs learning to generate a module output value corresponding to the input value; a learning designation unit that determines, based on a predetermined first rule, whether to designate the learning by each of the plurality of learning modules; an operation selection unit that, based on a predetermined second rule, selectively outputs one of the plurality of module output values output from the plurality of learning modules as a first output value; a learning unit that generates a second output value corresponding to the input value based on at least one parameter; and an output unit that, based on a predetermined third rule, selectively outputs one of the first output value and the second output value as the output value, wherein the learning unit adjusts the at least one parameter such that the second output value approaches the first output value, and, when the second output value is selected, the output unit stops the operation of generating the module output values by the plurality of learning modules, the operation of determining whether to designate the learning by the learning designation unit, and the operation of selectively outputting one of the plurality of module output values as the first output value by the operation selection unit.

14. A learning device that generates an output value according to an input value, comprising: at least one classifier, each representing a rule that defines a first output value for the input value; a first generation unit that generates a first new classifier based on a reward signal indicating an evaluation value related to the output value, and adds the first new classifier to the at least one classifier; a second generation unit that generates a second new classifier based on a teacher signal indicating an expected value of the output value for the input value, and adds the second new classifier to the at least one classifier; a setting unit that sets a reliability for each of the at least one classifier to which the first new classifier and the second new classifier have been added; and an output unit that generates the output value based on the at least one first output value output from each of the at least one classifier to which the first new classifier and the second new classifier have been added, and on the reliability set for each of those classifiers.

15. A computer-readable recording medium storing a program for causing a learning device to execute a process of generating an output value according to an input value based on at least one parameter, wherein the process of generating the output value according to the input value includes the steps of: (a) receiving, from outside the learning device, a reward signal indicating an evaluation value related to the output value; (b) receiving, from outside the learning device, a teacher signal indicating an expected value of the output value for the input value; and (c) adjusting the value of the at least one parameter, based on the teacher signal and the reward signal, such that the output value for the input value approaches the expected value and the evaluation value increases.

16. A computer-readable recording medium storing a program for causing a learning device to execute a process of generating an output value according to an input value, wherein the process of generating the output value according to the input value includes the steps of: (a) performing first learning based on a reward signal to generate a first output value corresponding to the input value based on at least one first parameter; (b) performing second learning based on a teacher signal to generate a second output value corresponding to the input value based on at least one second parameter; (c) selectively outputting one of the first output value and the second output value as the output value; and (d) determining whether to designate the first learning and whether to designate the second learning, wherein step (a) includes: (a1) receiving, from outside the learning device, a reward signal indicating an evaluation value related to the output value; and (a2) adjusting the value of the at least one first parameter based on the reward signal such that the evaluation value increases, and step (b) includes: (b1) receiving, from outside the learning device, a teacher signal indicating an expected value of the output value for the input value; and (b2) adjusting the value of the at least one second parameter based on the teacher signal such that the first output value for the input value approaches the expected value.

17. A computer-readable recording medium storing a program for executing a process of generating an output value according to an input value, wherein the process of generating the output value according to the input value includes the steps of: (a) generating a plurality of module output values corresponding to the input value by executing a plurality of learnings; (b) determining, based on a predetermined first rule, whether to designate each of the plurality of learnings; (c) selectively outputting one of the plurality of module output values as a first output value based on a predetermined second rule; (d) generating a second output value corresponding to the input value based on at least one parameter; and (e) selectively outputting one of the first output value and the second output value as the output value based on a predetermined third rule, wherein step (d) includes (d1) adjusting the at least one parameter such that the second output value approaches the first output value, and, when the second output value is output as the output value in step (e), the operation of generating the plurality of module output values in step (a), the operation of determining whether to designate the plurality of learnings in step (b), and the operation of selectively outputting one of the plurality of module output values as the first output value in step (c) are stopped.

18. A computer-readable recording medium storing a program for executing a process of generating an output value according to an input value using at least one classifier, wherein each of the at least one classifier represents a rule that defines a first output value for the input value, and the process of generating the output value according to the input value includes the steps of: (a) generating a first new classifier based on a reward signal indicating an evaluation value related to the output value, and adding the first new classifier to the at least one classifier; (b) generating a second new classifier based on a teacher signal indicating an expected value of the output value for the input value, and adding the second new classifier to the at least one classifier; (c) setting a reliability for each of the at least one classifier to which the first new classifier and the second new classifier have been added; and (d) generating the output value based on the at least one first output value output from each of the at least one classifier to which the first new classifier and the second new classifier have been added, and on the reliability set for each of those classifiers.