JP2011018245A

JP2011018245A - Recognition device and method, program, and recording medium

Info

Publication number: JP2011018245A
Application number: JP2009163192A
Authority: JP
Inventors: Yukiko Yoshiike; 由紀子吉池; Kenta Kawamoto; 献太河本; Kuniaki Noda; 邦昭野田; Kotaro Sabe; 浩太郎佐部
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2009-07-09
Filing date: 2009-07-09
Publication date: 2011-01-27

Abstract

[Problem] To enable an agent performing autonomous learning in a changing environment to appropriately recognize whether the node at which it is currently located is a node already learned as an internal state or a node that should be newly added as an internal state.
[Solution] The value of a variable N is set to 1, and time-series information of length N is acquired in step S202. In step S203, the recognizer outputs a node sequence from the time-series information using the Viterbi algorithm, and in step S204 it determines whether the node sequence can actually occur. If the node sequence is determined not to be one that can actually occur, the node is recognized as an unknown node. If the node sequence is determined to be one that can actually occur, the entropy is calculated. If the entropy is equal to or greater than a threshold, the value of N is incremented and the time-series information is extended in the past direction. If the entropy is below the threshold, the node is recognized as a known node.
[Selected drawing] FIG. 33

Description

The present invention relates to a recognition device and method, a program, and a recording medium, and in particular to a recognition device and method, a program, and a recording medium that enable an agent performing autonomous learning in a changing environment to appropriately recognize whether the node at which it is currently located is a node already learned as an internal state or a node that should be newly added as an internal state.

The use of an HMM (hidden Markov model) has been proposed as a method of treating sensor signals observed from a target system as time-series data and learning them as a probabilistic model having both states and state transitions. The HMM is one of the techniques widely used in speech recognition. An HMM is a state transition model defined by state transition probabilities and an output probability density function for each state, and its parameters are estimated so as to maximize the likelihood. The Baum-Welch algorithm is widely used as the parameter estimation method.
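The likelihood that the parameter estimation maximizes can be made concrete with a small sketch. The following is an illustration only, assuming discrete observation symbols rather than the output probability density functions mentioned above; the names `pi`, `A`, `B`, and `forward_likelihood`, and all numeric values, are hypothetical.

```python
def forward_likelihood(pi, A, B, obs):
    """Return P(obs | model) for a discrete-observation HMM (forward algorithm).

    pi[i]   : initial probability of state i
    A[i][j] : probability of a transition from state i to state j
    B[i][o] : probability of observing symbol o in state i
    """
    n = len(pi)
    # alpha[i] = probability of the observations so far, ending in state i
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

# Toy two-state model; the values are made up for illustration.
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.8, 0.2], [0.3, 0.7]]
likelihood = forward_likelihood(pi, A, B, [0, 0, 1])
```

An estimation procedure such as the Baum-Welch algorithm adjusts `pi`, `A`, and `B` so that this quantity increases for the training data.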

An HMM is a model in which each state can transition to another state according to a state transition probability, so the system is modeled as a process of state transitions. In an HMM, however, which state an observed sensor signal corresponds to is usually determined only probabilistically.

The Viterbi algorithm is therefore widely used as a method of determining, from the observed sensor signals, the state transition process with the highest likelihood. This makes it possible to uniquely determine the state corresponding to the sensor signal at each time. Moreover, even if the sensor signals observed from the system are identical in different situations, they can be treated as different state transition processes according to differences in how the sensor signals change over time before and after each time. Although this does not completely solve the problem of perceptual aliasing, different states can be assigned to the same sensor signal, and the state of the system can be modeled in more detail than with an SOM (self-organizing map) or the like (see, for example, Non-Patent Document 1).
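How the same sensor signal can be assigned to different states is illustrated by the following sketch of the Viterbi algorithm for a discrete-observation HMM. The three-state model, in which states 0 and 2 emit the same symbol, and the names `viterbi`, `pi`, `A`, and `B` are assumptions chosen for illustration.

```python
import math

def viterbi(pi, A, B, obs):
    """Return the most likely state sequence for obs and its log-probability."""
    n = len(pi)

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    score = [log(pi[i]) + log(B[i][obs[0]]) for i in range(n)]
    back = []                      # back[t][j] = best predecessor of state j
    for o in obs[1:]:
        prev, score, ptr = score, [], []
        for j in range(n):
            best = max(range(n), key=lambda i: prev[i] + log(A[i][j]))
            ptr.append(best)
            score.append(prev[best] + log(A[best][j]) + log(B[j][o]))
        back.append(ptr)
    state = max(range(n), key=lambda i: score[i])
    best_score = score[state]
    path = [state]
    for ptr in reversed(back):     # trace the best path backwards
        state = ptr[state]
        path.append(state)
    return path[::-1], best_score

# Hypothetical cyclic model 0 -> 1 -> 2 -> 0; states 0 and 2 both emit symbol 0.
pi = [1 / 3, 1 / 3, 1 / 3]
A = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
B = [[1, 0], [0, 1], [1, 0]]
path, log_prob = viterbi(pi, A, B, [0, 1, 0])
```

In this toy model the symbol 0 is observed at the first and the last time, yet the recovered path assigns it to state 0 and then to state 2, because the symbols observed before and after differ.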

Lawrence R. Rabiner (February 1989). "A tutorial on Hidden Markov Models and selected applications in speech recognition". Proceedings of the IEEE 77 (2): 257-286.

In HMM learning, however, it becomes difficult to estimate the parameters correctly as the number of states and state transitions grows. In particular, the Baum-Welch algorithm does not guarantee that optimal parameters are found, so estimating appropriate parameters becomes extremely difficult when the number of parameters is large. Furthermore, when the system to be learned is unknown, it is difficult to set the structure of the state transition model and the initial values of the parameters appropriately, which also makes parameter estimation difficult.

HMMs are used effectively in speech recognition because the target is limited to speech signals and a great deal of knowledge about speech is available. Another major factor is that many years of extensive research have established, for example, that a left-to-right structure is effective for the HMM in speech recognition. Accordingly, when the target is an unknown system and no information is given for determining the structure and initial values of the HMM in advance, making a large-scale HMM function as a practical model is a very difficult problem.

As described above, the problem addressed by the HMM is that of structuring sensor signals, and action signals are not considered. An extension of the HMM to a framework in which an agent acts on the environment using action signals and can thereby influence future sensor signals is called a partially observable Markov decision process (hereinafter, POMDP).
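One common way to represent the influence of action signals is to index the state transition probabilities by action. The following sketch shows a single filtering step under that assumption; the layout `A[action][i][j]`, the function name `belief_update`, and the two-state example are hypothetical and are not taken from the text above.

```python
def belief_update(belief, A, B, action, obs):
    """Apply `action`, then condition the state distribution on symbol `obs`.

    A[action][i][j] : probability of moving from state i to j under `action`
    B[j][o]         : probability of observing symbol o in state j
    """
    n = len(belief)
    unnorm = [sum(belief[i] * A[action][i][j] for i in range(n)) * B[j][obs]
              for j in range(n)]
    total = sum(unnorm)
    if total == 0.0:
        raise ValueError("observation is impossible under the model")
    return [u / total for u in unnorm]

# Two states; action 0 keeps the state, action 1 swaps it (made-up values).
A = [
    [[1.0, 0.0], [0.0, 1.0]],   # action 0: stay
    [[0.0, 1.0], [1.0, 0.0]],   # action 1: swap
]
B = [[0.9, 0.1], [0.1, 0.9]]
after_swap = belief_update([1.0, 0.0], A, B, action=1, obs=1)
after_stay = belief_update([0.5, 0.5], A, B, action=0, obs=0)
```

The same observation table is shared by all actions here; only the transition table depends on the action.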

Model learning for this problem is very difficult. What has mainly been studied so far is either the estimation of a relatively small number of parameters within a model whose skeleton is given by prior knowledge, or learning driven within a reinforcement learning framework. In addition, many of these methods have problems with learning speed, convergence, or stability, so their practicality is not necessarily high.

HMM learning methods include a batch learning method and an additional learning method. In the batch learning method, when, for example, data for 10,000 steps of transitions and observations is available, a state transition probability table and an observation probability table are generated from those 10,000 steps and stored. In the additional learning method, by contrast, a state transition probability table and an observation probability table are first generated from, for example, 1,000 steps of transitions and observations and stored. The values in both tables are then changed and stored on the basis of the next 1,000 steps of transitions and observations, and so on, so that learning is repeated and the internal model data is updated.
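The difference between the two methods can be sketched in a deliberately simplified setting. The code below assumes that the state at each step is directly visible, so that a transition table can be built from plain counts; in an actual HMM the states are hidden and expected counts would be needed instead. The names `update_counts` and `transition_table` and the data are hypothetical.

```python
from collections import Counter

def update_counts(counts, transitions):
    """Add observed (state, next_state) pairs to the running counts."""
    counts.update(transitions)
    return counts

def transition_table(counts, n_states):
    """Normalise the counts row by row into transition probabilities."""
    table = []
    for i in range(n_states):
        row_total = sum(counts[(i, j)] for j in range(n_states))
        table.append([counts[(i, j)] / row_total if row_total else 0.0
                      for j in range(n_states)])
    return table

data = [(0, 1), (1, 0), (0, 0), (0, 1), (1, 1), (1, 0)]

# Batch learning: one table from all the data at once.
batch = transition_table(update_counts(Counter(), data), 2)

# Additional learning: update the stored counts chunk by chunk.
running = Counter()
for start in range(0, len(data), 2):
    update_counts(running, data[start:start + 2])
incremental = transition_table(running, 2)
```

Under this visible-state assumption the two tables coincide because the counts are kept. With hidden states no such exact bookkeeping is available.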

In conventional HMM learning, a problem arises when learning by the additional learning method. A common approach in HMM learning is to prepare all the data in advance and learn by the batch learning method, but with such learning it is impossible in principle to adapt to the environment and learn from experience. In other words, to perform better in a diverse real world, a function for additional learning that feeds back the results of operation in the real environment is essential. However, the problem of how to reconcile the "learned memory structure" with "new experience" during additional learning remains unsolved. On the one hand, new experience should be reflected promptly to achieve quick adaptation; on the other hand, doing so risks destroying the memory structure established so far.

Conventionally, additional learning has been performed by holding past learning data separately, or by rehearsing past learning data from the current memory, and learning from it in combination with newly obtained data. Even so, there were problems: the "new experience" was not reflected in the separately held past learning data, or the rehearsed past learning data was generated under the influence of the "new experience". Thus, in large-scale HMM learning, it has been difficult to perform additional learning so that the model functions as a practical one.

Furthermore, when the environment to be learned changes, for example, the types of observation symbols and the number of nodes increase, so that the number of nodes, observation symbols, or actions may have to be changed as learning proceeds. In such a case, the agent needs to recognize the change in the environment autonomously and expand the state transition probability table and the observation probability table.

For the agent to recognize a change in the environment autonomously and expand the state transition probability table and the observation probability table, the agent itself must first recognize whether the environment has been newly expanded. That is, the agent must be able to recognize whether the node at which it is currently located is a node already learned as an internal state or a node that should be newly added as an internal state.

The present invention has been made in view of such circumstances, and enables an agent performing autonomous learning in a changing environment to appropriately recognize whether the node at which it is currently located is a node already learned as an internal state or a node that should be newly added as an internal state.

One aspect of the present invention is a recognition device including: observation means for observing an observation symbol based on a sensor signal obtained from an environment; observation symbol storage means for storing the observation symbols observed over time in association with the times at which they were observed; and recognition means for reading out the information stored in the observation symbol storage means as time-series information and recognizing the HMM node at the last time of the time-series information, wherein the recognition means reads out and recognizes time-series information of variable length.

The recognition means may recognize, based on the time-series information, a node sequence corresponding to the length of the time-series information, and may extend the length of the time-series information read from the observation symbol storage means in the past direction until it is determined, based on the state transition probabilities and observation probabilities of the HMM, that the node sequence exists in the environment with a probability equal to or greater than a first threshold and the entropy of the posterior probability of the node at the last time of the time-series information falls below a second threshold.

The recognition means may recognize a node sequence corresponding to the length of the time-series information based on the time-series information extended in the past direction. When it is determined, based on the state transition probabilities and observation probabilities of the HMM, that the node sequence does not exist in the environment with a probability equal to or greater than the first threshold, the recognition means may recognize the node at the last time of the time-series information as an unknown node, that is, an internal state to be newly added, and output this as the recognition result. When it is determined, based on the state transition probabilities and observation probabilities of the HMM, that the node sequence exists in the environment with a probability equal to or greater than the first threshold, and the entropy of the posterior probability of the node at the last time of the time-series information is determined to be less than the second threshold, the recognition means may recognize the node at the last time of the time-series information as a known node, that is, an already learned internal state, and output this as the recognition result.
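The variable-length recognition described above can be sketched as a loop that lengthens the window one step at a time. The sketch makes several assumptions that the text does not specify: observations are discrete symbols without actions, the start distribution is uniform, the posterior at the last time is the normalized forward variable, and a node sequence counts as existing when every transition and observation probability along it is at least `th1` (which must be positive). All function names and the toy model are hypothetical.

```python
import math

def viterbi_path(A, B, obs):
    """Most likely node sequence for obs, assuming a uniform start."""
    n = len(A)
    score = [B[i][obs[0]] / n for i in range(n)]
    back = []
    for o in obs[1:]:
        ptr = [max(range(n), key=lambda i: score[i] * A[i][j]) for j in range(n)]
        score = [score[ptr[j]] * A[ptr[j]][j] * B[j][o] for j in range(n)]
        back.append(ptr)
    state = max(range(n), key=lambda i: score[i])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

def path_is_plausible(A, B, obs, path, th1):
    """Assumed test: every step along the path has probability >= th1."""
    probs = [B[path[0]][obs[0]]]
    for t in range(1, len(path)):
        probs.append(A[path[t - 1]][path[t]])
        probs.append(B[path[t]][obs[t]])
    return min(probs) >= th1

def last_entropy(A, B, obs):
    """Entropy of the posterior over nodes at the last time of obs."""
    n = len(A)
    alpha = [B[i][obs[0]] / n for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    total = sum(alpha)
    return -sum((a / total) * math.log(a / total) for a in alpha if a > 0)

def recognize(A, B, history, th1, th2):
    """Return ("known", node), ("unknown", None) or ("undetermined", None)."""
    for n in range(1, len(history) + 1):
        window = history[-n:]              # extend one step into the past
        path = viterbi_path(A, B, window)
        if not path_is_plausible(A, B, window, path, th1):
            return "unknown", None
        if last_entropy(A, B, window) < th2:
            return "known", path[-1]
    return "undetermined", None            # history exhausted

# Hypothetical model: nodes 0 and 2 both emit symbol 0; symbol 2 is never emitted.
A = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
B = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
known = recognize(A, B, [0, 1, 0], th1=0.01, th2=0.1)
unknown = recognize(A, B, [0, 2], th1=0.01, th2=0.1)
```

With the history `[0, 1, 0]`, a window of length 1 leaves nodes 0 and 2 equally likely, so the window is extended to length 2, after which the posterior concentrates on node 2. With the history `[0, 2]`, the last symbol cannot be produced by any learned node, so the result is an unknown node.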

The recognition device may further include recognition result storage means for storing the recognition result in association with the time of recognition.

The recognition means may identify the time at which the recognition results stored in the recognition result storage means changed from a known node to an unknown node over time, and may withhold output of the recognition result when extending the length of the time-series information read from the observation symbol storage means in the past direction would cause time-series information earlier than the identified time to be read out.

The recognition means may calculate the difference between the entropy of the posterior probability of the node at the last time of the time-series information of length N and the entropy of the posterior probability of the node at the last time of the time-series information of length N+1, and may extend the length of the time-series information read from the observation symbol storage means in the past direction until the calculated difference falls below a third threshold.
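This stopping rule can be sketched independently of how the entropy itself is obtained. In the sketch below, `entropy_at` is a hypothetical callable that returns the entropy for a window of length N, the difference is taken as an absolute value (an assumption), and the returned length is the N+1 at which the difference first falls below `th3`.

```python
def settle_length(entropy_at, max_len, th3):
    """Return the first length N+1 with |H(N) - H(N+1)| < th3, or None."""
    prev = entropy_at(1)
    for n in range(2, max_len + 1):
        cur = entropy_at(n)
        if abs(prev - cur) < th3:
            return n
        prev = cur
    return None                      # no stable length within max_len

# Made-up entropy values for window lengths 1 to 5.
values = [0.69, 0.40, 0.12, 0.11, 0.11]
length = settle_length(lambda n: values[n - 1], max_len=5, th3=0.05)
```

Compared with a fixed entropy threshold, this rule stops once further extension no longer changes the entropy, even if the entropy itself stays above a given level.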

The recognition means may recognize a node sequence corresponding to the length of the time-series information based on the time-series information extended in the past direction, and when it is determined, based on the state transition probabilities and observation probabilities of the HMM, that the node sequence exists in the environment with a probability less than the first threshold, the recognition means may recognize the node at the last time of the time-series information as an unknown node, that is, an internal state to be newly added.

When it is determined, based on the state transition probabilities and observation probabilities of the HMM, that the node sequence exists in the environment with a probability equal to or greater than the first threshold, the recognition means may recognize the node at the last time of the time-series information as a known node, that is, an already learned internal state, if the entropy of the posterior probability of that node is less than the second threshold, and may withhold output of the recognition result if that entropy is equal to or greater than the second threshold.

The recognition device may further include action symbol storage means for identifying an action that the device itself performs on the environment as an action symbol and storing the action symbols obtained over time in association with the times at which the actions were performed, and information of the same temporal length as the information stored in the observation symbol storage means may be read out from the action symbol storage means and used as the time-series information.

One aspect of the present invention is a recognition method including: reading out, as time-series information of variable length, information stored in observation symbol storage means that stores observation symbols, which are based on sensor signals obtained from an environment and observed over time, in association with the times at which they were observed; and recognizing the HMM node at the last time of the time-series information.

One aspect of the present invention is a program that causes a computer to function as a recognition device including: observation means for observing an observation symbol based on a sensor signal obtained from an environment; observation symbol storage means for storing the observation symbols observed over time in association with the times at which they were observed; and recognition means for reading out the information stored in the observation symbol storage means as time-series information and recognizing the HMM node at the last time of the time-series information, wherein the recognition means reads out and recognizes time-series information of variable length.

In one aspect of the present invention, an observation symbol is observed based on a sensor signal obtained from an environment; the observation symbols observed over time are stored in association with the times at which they were observed; the stored information is read out as time-series information of variable length; and the HMM node at the last time of the time-series information is recognized.

According to the present invention, an agent performing autonomous learning in a changing environment can appropriately recognize whether the node at which it is currently located is a node already learned as an internal state or a node that should be newly added as an internal state.

FIG. 1 is a diagram showing an example of a maze.
FIG. 2 is a diagram showing examples of the parts that make up the maze of FIG. 1.
FIGS. 3 to 5 are diagrams explaining changes in the structure of the maze.
FIG. 6 is a diagram explaining the moving directions of the robot.
FIG. 7 is a diagram explaining an ordinary HMM.
FIG. 8 is a diagram explaining the action-extended HMM.
FIG. 9 is a block diagram showing a configuration example of an autonomous behavior learning device according to an embodiment of the present invention.
FIGS. 10 and 11 are diagrams explaining application of the split algorithm.
FIG. 12 is a flowchart explaining an example of split algorithm application processing.
FIGS. 13 and 14 are diagrams explaining application of the forward merge algorithm.
FIG. 15 is a flowchart explaining an example of forward merge algorithm application processing.
FIGS. 16 and 17 are diagrams explaining application of the backward merge algorithm.
FIG. 18 is a flowchart explaining an example of backward merge algorithm application processing.
FIG. 19 is a table comparing the likelihoods of the state transition probability table and the observation probability table in the action-extended HMM.
FIGS. 20 to 26 are diagrams explaining how the learning result changes when the one-state one-observation constraint and the action transition constraint are imposed.
FIG. 27 is a flowchart explaining an example of action-extended HMM learning processing.
FIG. 28 is a diagram explaining a problem that arises when additional learning is performed by a conventional method.
FIG. 29 is a diagram explaining the additional learning method of the present invention.
FIG. 30 is a diagram explaining the effect of an increase in the types of observation symbols.
FIG. 31 is a diagram explaining the effect of an increase in the number of nodes.
FIG. 32 is a diagram explaining the effect of an increase in the number of actions.
FIG. 33 is a flowchart explaining an example of node recognition processing.
FIG. 34 is a flowchart explaining another example of node recognition processing.
FIGS. 35 and 36 are flowcharts explaining still other examples of node recognition processing.
FIG. 37 is a diagram explaining an example in which an unknown node is added.
FIG. 38 is a diagram explaining another example in which an unknown node is added.
FIG. 39 is a diagram explaining an example in which the need to add or delete a node is checked during anchoring.
FIG. 40 is a flowchart explaining an example of unknown node addition processing.
FIG. 41 is a flowchart explaining an example of addition or deletion necessity check processing.
FIG. 42 is a diagram explaining the region of the state transition probability table that is expanded when an unknown node is added.
FIG. 43 is a diagram explaining an example of an unknown node to be added.
FIG. 44 is a diagram showing an example of an unknown node to be added and an action.
FIG. 45 is a diagram showing an example of an unknown node to be added, candidate nodes, and an action.
FIG. 46 is a flowchart explaining an example of state transition probability setting processing at the time of node addition.
FIG. 47 is a flowchart explaining an example of node reverse-action pair list generation processing.
FIG. 48 is a flowchart explaining an example of reverse-action state transition probability setting processing.
FIG. 49 is a flowchart explaining an example of node forward-action pair list generation processing.
FIG. 50 is a flowchart explaining an example of forward-action state transition probability setting processing.
FIG. 51 is a flowchart explaining an example of anchoring processing.
FIG. 52 is a block diagram showing a configuration example of a personal computer.

Embodiments of the present invention will be described below with reference to the drawings.

First, the action-extended HMM will be described.

The autonomous behavior learning device of the present invention, described later, is applied to, for example, a robot that travels through a maze under its own power, recognizes its own position, and learns a route to a destination.

FIG. 1 is a diagram showing an example of a maze. As shown in the figure, the maze is constructed by combining several types of parts such as those shown in FIG. 2. As shown in FIG. 2, each part is a rectangle of the same size, and 15 different types are provided. For example, part 5 forms a horizontal passage and part 10 forms a vertical passage. Parts 7, 11, and 13 each form a T-junction, and part 15 forms a crossroads.
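The part numbers named above are consistent with reading each number from 1 to 15 as a 4-bit mask of open directions: 5 and 10 each have two opposite bits set, 7, 11, and 13 have three, and 15 has all four. The text does not state this encoding, so the following sketch, including the bit assigned to each direction and the name `openings`, is purely an assumption for illustration.

```python
# Assumed bit assignment; the source does not define one.
LEFT, UP, RIGHT, DOWN = 1, 2, 4, 8

def openings(part):
    """Return the set of open directions for a part number from 1 to 15."""
    if not 1 <= part <= 15:
        raise ValueError("part number must be between 1 and 15")
    names = {LEFT: "left", UP: "up", RIGHT: "right", DOWN: "down"}
    return {name for bit, name in names.items() if part & bit}

horizontal = openings(5)    # left and right under the assumed bits
vertical = openings(10)     # up and down under the assumed bits
crossroads = openings(15)   # all four directions
```

Under this reading, the 15 part types are exactly the non-empty combinations of the four directions, which matches the count given above.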

The structure of this maze can also be changed. For example, changing the two parts in the area indicated by the dotted circle in FIG. 3 changes the structure of the maze to that shown in FIG. 4. That is, the structure of the maze can be changed so that a place that could not be passed through in FIG. 3 can be passed through in FIG. 4.

Furthermore, changing the two parts in the area indicated by the dotted circle in FIG. 4 changes the structure of the maze to that shown in FIG. 5. That is, the structure of the maze can be changed so that a place that could be passed through in FIG. 4 cannot be passed through in FIG. 5.

A robot travels through such a maze under its own power. In this example, the maze is two-dimensional and the passages run only horizontally or vertically, so the robot is set up to move in four directions: up, down, left, and right.

FIG. 6 is a diagram explaining the moving directions of the robot. The vertical and horizontal directions in the figure correspond to those in FIG. 1, and the robot shown at the center of the figure moves in one of the up, down, left, or right directions.

Here, a movement of the robot in a given direction is referred to as an action. In the example of FIG. 6, there are four actions, corresponding to the four arrows in the figure.

The robot is also equipped with, for example, a sensor that detects objects, and by analyzing the signal output by the sensor it can identify the type of part on which it is located in the maze. That is, at each position in the maze the robot acquires a sensor signal corresponding to one of the 15 part types described above with reference to FIG. 2.

In the present invention, for example, internal model data corresponding to the structure of the maze is generated from the sensor signals acquired at each position the robot visits while traveling through the maze. Here, the maze is called the environment, and a sensor signal corresponding to one of the 15 part types is called an observation symbol. The present invention uses an HMM to learn the structure of the maze and generate the internal model data described above.

In HMM learning, states are recognized on the basis of observations obtained from the environment. As described above, the environment is, for example, a maze, and an observation corresponds to, for example, an observation symbol identified from a sensor signal corresponding to one of the 15 part types. In the following, the robot is referred to as an agent where appropriate.

In HMM learning, the agent recognizes the state it is in on the basis of observations obtained from the environment. The state referred to here is the state as subjectively recognized by the agent, and it may differ from the state the agent is actually in as observed objectively from outside. For example, an outside observer may find the robot at coordinates (x1, y1) in the two-dimensional maze while the robot itself believes it is at coordinates (x2, y2). In HMM terminology, this subjectively recognized state is called a hidden state, an internal state, a state, or a node.

This embodiment mainly describes an example in which each position in the maze, that is, the position of each part placed in the maze, is associated with a node (state, hidden state, internal state) of the HMM, and an observation symbol is associated with each of those nodes.

An ordinary HMM, however, only structures sensor signals and takes no account of action signals. It does not cover learning in a setting where the agent uses action signals to act on the environment and thereby influences the observation symbols it will observe later. Problems of this kind are treated as a partially observable Markov decision process (hereinafter, POMDP).

The present invention therefore addresses this problem by extending the HMM so that it takes action signals into account. An HMM extended in this way is referred to as an action-extended HMM.

FIG. 7 illustrates an ordinary HMM. As shown in the figure, the HMM learns the probability of a transition (state transition) from one node to another for every possible transition. That is, a state transition probability is set in each cell of a table of size (number of nodes) × (number of nodes), producing a two-dimensional table called the state transition probability table. The HMM also learns the probability that each observation symbol is observed at a given node. That is, an observation probability is set in each cell of a table of size (number of nodes) × (number of observation symbols), producing a two-dimensional table called the observation probability table.

For example, in the state transition probability table of FIG. 7, the nodes listed vertically are the transition source nodes and the nodes listed horizontally are the transition destination nodes. The value in row n, column m of the table is therefore the probability of a transition from the node with index n (the n-th node) to the node with index m (the m-th node). The values in each row (for example, row n) of the state transition probability table sum to 1.

Similarly, the value in row n, column p of the observation probability table of FIG. 7 is the probability that the observation symbol with index p (the p-th observation symbol) is observed at the node with index n (the n-th node). The values in each row (for example, row n) of the observation probability table sum to 1.
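The two tables of an ordinary HMM and their row-sum property can be sketched as follows. This is a minimal illustration only: the 3-node, 2-symbol model, its values, and the helper name `is_row_stochastic` are hypothetical and do not come from FIG. 7.

```python
# Hypothetical 3-node, 2-symbol HMM; the numbers are illustrative only.
transition = [  # transition[n][m] = P(next node m | current node n)
    [0.7, 0.2, 0.1],
    [0.0, 0.5, 0.5],
    [0.3, 0.0, 0.7],
]
observation = [  # observation[n][p] = P(symbol p observed | node n)
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.5],
]


def is_row_stochastic(table, tol=1e-9):
    """Return True if every row is non-negative and sums to 1 within tol."""
    return all(
        all(v >= 0.0 for v in row) and abs(sum(row) - 1.0) <= tol
        for row in table
    )
```

Both tables satisfy the check, while a row such as `[0.5, 0.4]` would fail it.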

FIG. 8 illustrates the action-extended HMM. As shown in the figure, the action-extended HMM generates a state transition probability table for each action. For example, the probabilities of transitioning from one node to another as a result of the move-up action are generated as the state transition probability table for the move-up action. Likewise, the probabilities of transitioning from one node to another as a result of the move-down action are generated as the state transition probability table for the move-down action. State transition probability tables for the move-left and move-right actions are generated in the same way.

For example, if the state transition probability table of FIG. 8 is viewed as a stack of two-dimensional tables, the nodes listed vertically are the transition source nodes for each action and the nodes listed horizontally are the transition destination nodes. The value in row n, column m of the k-th table is therefore the probability of a transition from the node with index n to the node with index m when the action with index k (the k-th action) is executed. The values in each row of each table (for example, row n of the k-th table) sum to 1.

In this way, the action-extended HMM generates one two-dimensional state transition probability table per action, which together amount to a three-dimensional state transition probability table.

In the action-extended HMM, as in an ordinary HMM, an observation probability is set in each cell of a table of size (number of nodes) × (number of observation symbols), producing a two-dimensional observation probability table.

For example, as in FIG. 7, the value in row n, column p of the observation probability table of FIG. 8 is the probability that the observation symbol with index p is observed at the node with index n. The values in each row (for example, row n) of the observation probability table sum to 1.
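As a rough sketch of the data layout described above, the three-dimensional transition table can be held as one node-by-node table per action, alongside a single node-by-symbol observation table. The function name `uniform_action_hmm` and the uniform initialization are assumptions made for illustration; the text does not specify how the tables are initialized.

```python
def uniform_action_hmm(num_nodes, num_symbols, num_actions):
    """Build uniformly initialized tables for an action-extended HMM.

    a[k][n][m]: probability of moving from node n to node m under action k
    b[n][p]:    probability of observing symbol p at node n
    """
    a = [
        [[1.0 / num_nodes] * num_nodes for _ in range(num_nodes)]
        for _ in range(num_actions)
    ]
    b = [[1.0 / num_symbols] * num_symbols for _ in range(num_nodes)]
    return a, b
```

For the maze example, `uniform_action_hmm(num_nodes, 15, 4)` would give four transition tables (up, down, left, right) and one observation table with 15 columns.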

The description so far has covered the case of discrete observation signals, in which 15 observation symbols are obtained from the sensor signals. The action-extended HMM can also be used with continuous observation signals, for example when a gradually varying sensor signal yields a practically unlimited number of observation symbols.

Likewise, the description so far has covered the case of a discrete action set, in which the agent executes one of four actions. The action-extended HMM can also be used with a continuous action set, for example when the agent changes its direction of movement little by little and executes one action out of a practically unlimited number of actions.

This concludes the description of the action-extended HMM.

FIG. 9 is a block diagram showing a configuration example of an autonomous behavior learning device 10 to which the present invention is applied. The autonomous behavior learning device 10 in the figure is configured, for example, as a control device for a robot that moves through a maze such as the one shown in FIG. 1. In this example, the autonomous behavior learning device 10 includes a sensor unit 31, an action output unit 32, an observation buffer 33, a learner 34, a recognizer 35, an action generator 36, an internal model data storage unit 37, a recognition result buffer 38, and an action output buffer 39.

The sensor unit 31 outputs sensor signals (observation signals) for observing the observation symbols described above in an environment such as a maze. Each observation signal output from the sensor unit 31 is stored in the observation buffer 33 in association with the time at which it was output.

For example, the observation symbols o_t, o_{t+1}, o_{t+2}, ..., o_T corresponding to the observation signals acquired at times t, t+1, t+2, ..., T are stored in the observation buffer 33 as the observation symbols at the respective times.

The action output unit 32 is a functional block that outputs, for example, control signals that cause the robot to execute the actions it should execute. Each control signal output from the action output unit 32 is converted into information identifying the corresponding action and stored in the action output buffer 39 in association with the time at which the control signal was output.

For example, the actions c_t, c_{t+1}, c_{t+2}, ..., c_T executed at times t, t+1, t+2, ..., T are stored in the action output buffer 39 as the actions at the respective times.

The learner 34 generates or updates the internal model data on the basis of the information stored in the observation buffer 33 and the action output buffer 39, and stores it in the internal model data storage unit 37.

The internal model data stored in the internal model data storage unit 37 includes the three-dimensional state transition probability table and the two-dimensional observation probability table described above. It also includes frequency variables used to compute the state transition probabilities and frequency variables used to compute the observation probabilities, both of which are described later.

The recognizer 35 recognizes the node at which the robot is currently located, on the basis of the information stored in the observation buffer 33 and the action output buffer 39 and the state transition probability table and observation probability table stored in the internal model data storage unit 37. Each recognition result output from the recognizer 35 is stored in the recognition result buffer 38 in association with the time at which it was output.

The action generator 36 determines the action the robot should execute, on the basis of the internal model data stored in the internal model data storage unit 37, the information stored in the action output buffer 39, and the recognition result output by the recognizer 35. The action generator 36 then controls the action output unit 32 so that it outputs the control signal corresponding to the determined action.

In this way, the autonomous behavior learning device 10 can, for example, move the robot through a maze and have it learn the structure of the maze automatically.

Next, the learning algorithm for the action-extended HMM used in the learner 34 of FIG. 9 is described.

In an ordinary HMM the state transition probability from node s_i to node s_j is modeled by the state transition probability table a_ij, whereas in the action-extended HMM it is modeled as a_ij(c) using an action parameter c.

The Baum-Welch algorithm is used as the learning algorithm. Once the forward and backward probabilities can be computed, the parameters can be estimated with the Baum-Welch algorithm (expectation maximization), so the computation of these probabilities is described below.

Here, the probability that an action c_k belonging to the action set C = {c_1, c_2, ..., c_n} causes a transition from node s_i to node s_j is represented by the three-dimensional probability table a_ij(k) ≡ a_ijk. In this example, the actions form a discrete action set.

First, the computation of the forward probability is described.

Let o_1, o_2, ..., o_{t-1} denote the observation symbols corresponding to the sensor signals acquired by the agent at times 1, 2, ..., t-1, and let c_1, c_2, ..., c_{t-1} denote the actions executed by the agent at those times. The forward probability α_t(j) that the agent is at node s_j when the observation symbol corresponding to the sensor signal acquired at time t is o_t can then be expressed by the recurrence in equation (1).

... (1)

Here, b_j(o) is the observation probability that observation symbol o is obtained at node s_j.
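Equation (1) itself did not survive in this text. A recurrence consistent with the definitions above would be α_t(j) = Σ_i α_{t-1}(i) · a_ij(c_{t-1}) · b_j(o_t); this is a reconstruction, not a quotation of the original formula. The sketch below implements that recurrence with 0-based time indices. The initial distribution `pi` and the function name `forward` are assumptions, since the text does not say how the recursion is initialized.

```python
def forward(a, b, pi, observations, actions):
    """Return alpha, where alpha[t][j] is the forward probability of node j.

    a[k][i][j]: P(node j | node i, action k)
    b[j][o]:    P(symbol o | node j)
    pi[j]:      assumed initial node distribution (not given in the text)
    observations[t]: symbol index observed at step t (t = 0 .. T-1)
    actions[t]:      action index executed between step t and step t+1
    """
    n = len(pi)
    alpha = [[pi[j] * b[j][observations[0]] for j in range(n)]]
    for t in range(1, len(observations)):
        k = actions[t - 1]  # the action that led into step t
        prev = alpha[-1]
        alpha.append([
            sum(prev[i] * a[k][i][j] for i in range(n)) * b[j][observations[t]]
            for j in range(n)
        ])
    return alpha
```

The only change from the ordinary forward algorithm is that the transition table used at each step is selected by the action that was actually executed.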

Next, the computation of the backward probability is described.

Suppose the agent is in state i at time t. The backward probability β_t(i) that the agent executes actions c_t, c_{t+1}, ..., c_{T-1} at times t, t+1, ..., T-1 and that the observation symbols corresponding to the sensor signals acquired at the following times are o_{t+1}, o_{t+2}, ..., o_T can be expressed by the recurrence in equation (2).

... (2)
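Equation (2) is likewise missing from this text. A recurrence consistent with the definitions above would be β_t(i) = Σ_j a_ij(c_t) · b_j(o_{t+1}) · β_{t+1}(j), with β_T(i) = 1; this is a reconstruction under that assumption. A minimal sketch with 0-based time indices (the function name `backward` is hypothetical):

```python
def backward(a, b, observations, actions):
    """Return beta, where beta[t][i] is the backward probability of node i.

    a[k][i][j]: P(node j | node i, action k)
    b[j][o]:    P(symbol o | node j)
    observations[t]: symbol index observed at step t (t = 0 .. T-1)
    actions[t]:      action index executed between step t and step t+1
    """
    n = len(b)
    steps = len(observations)
    beta = [[1.0] * n for _ in range(steps)]  # the last row stays at 1
    for t in range(steps - 2, -1, -1):
        k = actions[t]  # the action that leads out of step t
        for i in range(n):
            beta[t][i] = sum(
                a[k][i][j] * b[j][observations[t + 1]] * beta[t + 1][j]
                for j in range(n)
            )
    return beta
```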

The forward and backward probabilities computed in this way can be used to estimate the state transition probabilities and the observation probabilities.

For a discrete action set, the state transition probabilities and the observation probabilities are estimated as follows.

The state transition probability a_ij(k) is estimated in the M-step of the Baum-Welch algorithm. Here, a_ij(k) is the probability that the agent, when in state i, transitions to state j by executing action k. The estimate a'_ij(k) of the state transition probability is obtained by evaluating equation (3).

... (3)
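Equation (3) is not preserved here. The standard Baum-Welch update, restricted to the time steps at which action k was executed, would read a'_ij(k) = Σ_{t: c_t = c_k} α_t(i) a_ij(k) b_j(o_{t+1}) β_{t+1}(j) / Σ_{t: c_t = c_k} α_t(i) β_t(i). The sketch below implements that form as an assumption about what equation (3) contains. The function name and the fallback for rows without evidence are hypothetical.

```python
def reestimate_transitions(a, b, alpha, beta, observations, actions):
    """One M-step update of the per-action transition tables.

    alpha and beta are the forward and backward probabilities computed
    beforehand; actions[t] is the action taken between step t and t+1.
    """
    num_actions, n = len(a), len(b)
    new_a = [[[0.0] * n for _ in range(n)] for _ in range(num_actions)]
    for k in range(num_actions):
        # Only the steps at which action k was executed contribute.
        steps = [t for t, c in enumerate(actions) if c == k]
        for i in range(n):
            denom = sum(alpha[t][i] * beta[t][i] for t in steps)
            if denom == 0.0:
                new_a[k][i] = list(a[k][i])  # no evidence: keep the old row
                continue
            for j in range(n):
                num = sum(
                    alpha[t][i] * a[k][i][j]
                    * b[j][observations[t + 1]] * beta[t + 1][j]
                    for t in steps
                )
                new_a[k][i][j] = num / denom
    return new_a
```

Because each action's table is updated only from the steps where that action occurred, actions that are rarely executed are estimated from correspondingly little data.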

The observation probability b_j(o) is also estimated in the M-step of the Baum-Welch algorithm. Here, b_j(o) is the probability that the agent, when in state j, acquires a sensor signal corresponding to observation symbol o. The estimate b'_j(o) of the observation probability is obtained by evaluating equation (4).

... (4)
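Equation (4) is not preserved here either. The usual Baum-Welch form would be b'_j(o) = Σ_{t: o_t = o} α_t(j) β_t(j) / Σ_t α_t(j) β_t(j), which the following sketch implements as an assumption. The function name and the uniform fallback for nodes that receive no weight are hypothetical choices.

```python
def reestimate_observations(alpha, beta, observations, num_symbols):
    """One M-step update of the observation probability table."""
    n = len(alpha[0])
    new_b = [[0.0] * num_symbols for _ in range(n)]
    for j in range(n):
        weights = [alpha[t][j] * beta[t][j] for t in range(len(observations))]
        total = sum(weights)
        if total == 0.0:
            # Assumption: fall back to a uniform row for unvisited nodes.
            new_b[j] = [1.0 / num_symbols] * num_symbols
            continue
        for t, symbol in enumerate(observations):
            new_b[j][symbol] += weights[t] / total
    return new_b
```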

Equation (4) applies when discrete observation signals are acquired. When continuous observation signals are acquired, the parameters of the observation probability density function b_j(o) can be re-estimated using the distribution of the observation signals o_t acquired at each time t, weighted by γ_t(j) given in equation (5). Here, γ_t(j) is the weighting coefficient for the case where the agent is in state j at time t.

... (5)
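Equation (5) is missing from this text. In the standard formulation the weight is the state posterior γ_t(j) = α_t(j) β_t(j) / Σ_j' α_t(j') β_t(j'); the sketch below assumes that form. The function name `state_posteriors` is hypothetical.

```python
def state_posteriors(alpha, beta):
    """Return gamma, where gamma[t][j] is the posterior weight of node j at step t.

    Raises ZeroDivisionError if the sequence has zero likelihood under the model.
    """
    gamma = []
    for alpha_t, beta_t in zip(alpha, beta):
        joint = [x * y for x, y in zip(alpha_t, beta_t)]
        total = sum(joint)
        gamma.append([v / total for v in joint])
    return gamma
```

Each row of the result sums to 1, so it can be used directly as a set of per-node weights for the observation at that step.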

Typically, a log-concave or elliptically symmetric probability density such as a Gaussian distribution is used as the model when re-estimating the parameters of the observation probability density function b_j(o).

The mean vector μ'_j and the covariance matrix U'_j of the observation signals in state j can be used as the parameters of a log-concave or elliptically symmetric probability density model such as a Gaussian distribution. The mean vector μ'_j and the covariance matrix U'_j are obtained from equations (6) and (7), respectively.

... (6)

... (7)
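Equations (6) and (7) are not preserved in this text. Assuming the usual weighted estimates, μ'_j = Σ_t γ_t(j) o_t / Σ_t γ_t(j) and U'_j = Σ_t γ_t(j) (o_t - μ'_j)(o_t - μ'_j)^T / Σ_t γ_t(j), the computation can be sketched as follows. The function name `weighted_gaussian` is hypothetical, and `weights` stands for the values γ_t(j) of one fixed state j.

```python
def weighted_gaussian(signals, weights):
    """Weighted mean vector and covariance matrix of a list of signal vectors.

    signals[t]: observation vector at step t (list of floats)
    weights[t]: weight of step t for the state being re-estimated
    """
    total = sum(weights)
    dim = len(signals[0])
    mean = [
        sum(w * x[d] for w, x in zip(weights, signals)) / total
        for d in range(dim)
    ]
    cov = [
        [
            sum(
                w * (x[r] - mean[r]) * (x[c] - mean[c])
                for w, x in zip(weights, signals)
            ) / total
            for c in range(dim)
        ]
        for r in range(dim)
    ]
    return mean, cov
```

The same weighted estimate applies to any signal whose density is modeled per state, as long as a weight per time step is available.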

Next, the case of a continuous action set is described.

Unlike the discrete case, the continuous case requires learning the probability ρ_k(c) that the continuous action c is output from the discrete action c_k. Learning ρ_k(c) makes it possible to label a continuous action c as if it were the discrete action c_k, that is, to map it to a discrete action.

For continuous actions, the forward probability is computed as follows.

Let o_1, o_2, ..., o_{t-1} denote the observation symbols corresponding to the sensor signals acquired by the agent at times 1, 2, ..., t-1, and let c_1, c_2, ..., c_{t-1} denote the discrete actions estimated from the continuous actions executed by the agent at those times. The forward probability α_t(j) that the agent is at node s_j when the observation symbol corresponding to the sensor signal acquired at time t is o_t can then be expressed by the recurrence in equation (8).

... (8)

Here, ρ_k(c) is the probability that the continuous action c is output from the discrete action c_k. How ρ_k(c) is obtained is described later.
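Equation (8) is missing from this text. One plausible form, consistent with the role given to ρ_k(c), sums over the discrete labels: α_t(j) = Σ_i Σ_k α_{t-1}(i) · a_ij(k) · ρ_k(c_{t-1}) · b_j(o_t). The sketch below implements a single step of that assumed recurrence for a scalar continuous action. The function names and the use of a one-dimensional Gaussian density for ρ_k are illustrative assumptions.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian, one possible model for rho_k(c)."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def forward_step_continuous(prev_alpha, a, b, rho, action, symbol):
    """One forward step when the executed action is continuous.

    rho[k] is a callable returning rho_k(c) for a continuous action c;
    a[k][i][j] and b[j][o] are the usual tables.
    """
    n = len(prev_alpha)
    weights = [rho_k(action) for rho_k in rho]  # how well each label explains c
    return [
        sum(
            prev_alpha[i] * a[k][i][j] * weights[k]
            for i in range(n)
            for k in range(len(a))
        ) * b[j][symbol]
        for j in range(n)
    ]
```

A Gaussian label model could be supplied as, for example, `rho = [lambda c: gaussian_pdf(c, 0.0, 0.1), lambda c: gaussian_pdf(c, 1.0, 0.1)]`, where the means and variances are made-up values.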

Suppose the agent is in state i at time t, and let c_t, c_{t+1}, ..., c_{T-1} denote the discrete actions estimated from the continuous actions executed by the agent at times t, t+1, ..., T-1. The backward probability β_t(i) that the observation symbols corresponding to the sensor signals acquired at the following times are o_{t+1}, o_{t+2}, ..., o_T can be expressed by the recurrence in equation (9).

... (9)

For a continuous action set, the state transition probabilities and the observation probabilities are estimated as follows.

As in the discrete case, the state transition probability a_ij(k) is estimated in the M-step of the Baum-Welch algorithm. Here, a_ij(k) is the probability that the agent, when in state i, transitions to state j by executing action k. The estimate a'_ij(k) of the state transition probability is obtained by evaluating equation (10).

... (10)

The observation probability b_j(o) is estimated in exactly the same way as in the discrete action case, so the description is omitted here.

Next, the way to obtain the probability ρ_k(c) that the continuous action c is output from the discrete action c_k is described.

The probability ρ_k(c) can also be estimated in the M-step of the Baum-Welch algorithm, in the same way as the observation probability is estimated for continuous observation signals.

Specifically, ρ_k(c) can be estimated using the distribution of the actions c_t executed at each time t, weighted by ξ_t(i, j, k) given in equation (11).

... (11)

As with the observation probability, ρ_k(c) can be estimated using a Gaussian distribution or a similar density as the model.

In this case, the mean vector ν_k and the covariance matrix V'_k of the action signals generated from the discrete action c_k, which is obtained by labeling the continuous action c, are computed from equations (12) and (13), respectively. The mean vector ν_k and covariance matrix V'_k computed in this way can then be used as the parameters of a model such as a Gaussian distribution.

... (12)

... (13)

In this way, the three-dimensional state transition probability table and the two-dimensional observation probability table of the action-extended HMM can be generated by learning.

With the learning algorithm for the action-extended HMM described so far, state transition probabilities and observation probabilities can be obtained just as with an ordinary HMM.

However, if the number of states (nodes) is N, the number of observation symbols is M, and the number of actions is K, the number of parameters to be computed for the three-dimensional state transition probability table and the two-dimensional observation probability table is N²K + NM. Clearly, in the action-extended HMM the load of the learning process grows at an accelerating rate as N, M, and K increase. For example, in an environment where N is about 250, M is about 15, and K is about 5, on the order of 300,000 parameters must be computed. Determining that many parameters appropriately from a small number of samples is very difficult.
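The parameter count quoted above can be checked with a short calculation. The function name `parameter_count` is introduced here only for illustration.

```python
def parameter_count(num_nodes, num_symbols, num_actions):
    """Number of table entries: N^2 * K transition cells plus N * M observation cells."""
    return num_nodes ** 2 * num_actions + num_nodes * num_symbols
```

For N = 250, M = 15, and K = 5 this gives 250² × 5 + 250 × 15 = 312,500 + 3,750 = 316,250 entries, which matches the figure of roughly 300,000 parameters. The transition table accounts for almost all of them.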

It is possible, however, to reduce the degrees of freedom of the parameters and stabilize learning by, for example, imposing constraints on the model. The techniques needed to learn an action-extended HMM, which is inevitably large, efficiently and appropriately are described next.

In the present invention, a one-state one-observation constraint and an action transition constraint are imposed when learning the action-extended HMM.

First, the one-state one-observation constraint is described. This constraint requires, for example, that in principle only one observation symbol be observed at any given node. Under this constraint the same observation symbol may still be observed at different nodes.

Imposing the one-state one-observation constraint when learning the action-extended HMM limits the ways in which events can be represented, and as a result reduces the degrees of freedom of the parameters needed to generate the state transition probability table and the observation probability table.

One way to realize the one-state one-observation constraint is to add to the objective function a constraint term that makes the observation probabilities sparse, as is done, for example, in learning discrete-observation HMMs.

For example, a constraint term Σ_j H(b_j), multiplied by a weight λ, that drives the observation probabilities toward sparsity can be added to the objective function. Here, H(b_j) is the entropy defined for the observation probability vector b_j over all observation symbols that can be observed at node s_j. Alternatively, the difference between the L1 norm and the L2 norm of the observation probability vectors, Σ_j (||b_j||_1 - ||b_j||_2), can be used as the constraint term.
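Both candidate constraint terms are zero for a row that puts all its probability on one symbol and positive otherwise, which is why penalizing them favors the one-state one-observation structure. A minimal sketch, with hypothetical function names; whether the term is added or subtracted depends on whether the objective is minimized or maximized, which the text does not spell out.

```python
import math


def entropy(p):
    """Shannon entropy H(p) of a probability vector, taking 0 * log 0 as 0."""
    return -sum(v * math.log(v) for v in p if v > 0.0)


def norm_gap(p):
    """Difference between the L1 norm and the L2 norm of a vector."""
    return sum(abs(v) for v in p) - math.sqrt(sum(v * v for v in p))


def sparsity_penalty(b, weight):
    """weight * sum over nodes j of H(b_j), for an observation table b."""
    return weight * sum(entropy(row) for row in b)
```

For a node that observes two symbols with probability 0.5 each, the entropy is log 2 and the norm gap is 1 - sqrt(0.5); for a node with a single certain symbol, both are 0.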

The one-state one-observation constraint can also be realized by a method other than adding the weighted constraint term λ Σ_j H(b_j) to the objective function. One such method is to apply a split algorithm.

FIG. 10 and FIG. 11 illustrate the split algorithm. In both figures the nodes are drawn as circles, and the symbol observed at each node is shown using the part shapes described above with reference to FIG. 2.

FIG. 10 visualizes the contents of the state transition probability table and the observation probability table obtained as a result of the agent's learning. The example in FIG. 10 has three nodes: S10, S20, and S30. In this example the agent observes the crossroads part (part 15 in FIG. 2) at node S10 with a probability of 100%, and when it executes the move-right action at node S10 it moves (transitions) to node S20 with a probability of 100%.

At node S20, part 7 and part 13 of FIG. 2 are each observed with a probability of 50%. Executing the action of moving right at node S20 causes a transition to node S30 with a probability of 100%, and executing the action of moving left at node S20 causes a transition to node S10 with a probability of 100%.

Further, at node S30, part 5 of FIG. 2 is observed with a probability of 100%, and executing the action of moving left at node S30 causes a transition to node S20 with a probability of 100%.

Note that FIG. 10 (and likewise FIG. 11) is a visualization of the contents of the state transition probability table and the observation probability table; in practice, the state transition probability table and the observation probability table corresponding to FIG. 10 are learned as internal model data. When the split algorithm is applied to such internal model data, the contents of the two tables change as shown in FIG. 11.

図１１は、図１０に対応する状態遷移確率テーブルと観測確率テーブルの内容にスプリットアルゴリズムを適用した場合に得られる状態遷移確率テーブルと観測確率テーブルの内容を可視化した図である。 FIG. 11 is a diagram visualizing the contents of the state transition probability table and the observation probability table obtained when the split algorithm is applied to the contents of the state transition probability table and the observation probability table corresponding to FIG.

In the example of FIG. 11, nodes S10, S21, S22, and S30 exist. That is, node S20 of FIG. 10 has been split into node S21 and node S22 in FIG. 11. In this example, part 15 of FIG. 2 is observed at node S10 with a probability of 100%, and executing the action of moving right at node S10 causes a transition to node S21 with a probability of 50% and to node S22 with a probability of 50%.

At node S21, part 7 of FIG. 2 is observed with a probability of 100%. Executing the action of moving right at node S21 causes a transition to node S30 with a probability of 100%, and executing the action of moving left causes a transition to node S10 with a probability of 100%.

At node S22, part 13 of FIG. 2 is observed with a probability of 100%. Executing the action of moving right at node S22 causes a transition to node S30 with a probability of 100%, and executing the action of moving left causes a transition to node S10 with a probability of 100%.

Further, at node S30, part 5 of FIG. 2 is observed with a probability of 100%, and executing the action of moving left at node S30 causes a transition to node S21 with a probability of 50% and to node S22 with a probability of 50%.

このように、スプリットアルゴリズムを適用することにより、一状態一観測制約を実現することが可能である。 Thus, by applying the split algorithm, it is possible to realize one-state one-observation constraint.

In other words, applying the split algorithm means applying the one-state one-observation constraint to a locally optimal solution obtained by the expectation-maximization method, and then performing local optimization by the expectation-maximization method again on the corrected solution. Repeating this process eventually yields a locally optimal solution that satisfies the one-state one-observation constraint.

In the example described above with reference to FIGS. 10 and 11, the node was split so that the observation probability of the symbol observed at each node became 100%; in practice, however, an observation probability of exactly 100% is rare. This is because the one-state one-observation constraint does not, in the strict sense, require that only a single observation symbol ever be observed at a node. Rather, even when several observation symbols are observed at one node, the constraint requires that the observation probability of one of them be greater than or equal to a threshold.

図９の学習器３４により内部モデルデータに対してスプリットアルゴリズムが適用される場合の処理について、図１２のフローチャートを参照して説明する。 Processing when the split algorithm is applied to the internal model data by the learning device 34 of FIG. 9 will be described with reference to the flowchart of FIG.

In step S101, the learning device 34 refers to the observation probability table stored in the internal model data storage unit 37 and searches for one node s_j for which the maximum value of the observation probability b_j is less than or equal to a threshold th1.

ステップＳ１０２において、学習器３４は、ステップＳ１０１の処理の結果、最大値が閾値th1以下のノードs_jが見つかったか否かを判定し、見つかったと判定された場合、処理は、ステップＳ１０３に進む。 In step S102, the learning device 34 determines whether or not a node s _j having a maximum value equal to or less than the threshold value th1 has been found as a result of the process in step S101. If it is determined that the node s _j has been found, the process proceeds to step S103.

ステップＳ１０３において、学習器３４は、観測確率テーブルを参照し、ステップＳ１０２で見つかったと判定されたノードs_jにおける各観測シンボルの観測確率をチェックする。そして、学習器３４は、ノードs_jおいて、観測確率が閾値th2以上となる観測シンボルの数をカウントし、それらの観測シンボルをリストする。 In step S103, the learning device 34 refers to the observation probability table, and checks the observation probability of each observation symbol in the node s _j determined to be found in step S102. Then, the learning device 34 counts the number of observation symbols whose observation probability is equal to or higher than the threshold th2 at the node s _j , and lists those observation symbols.

For example, if there are K observation symbols whose observation probability is greater than or equal to the threshold th2, the observation symbols o_k (k = 1, ..., K) are listed.
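Steps S101 to S103 can be sketched as follows. This is a minimal illustration, not the implementation of the learning device 34: the function name, the list-of-lists table layout, the symbol indexing, and the threshold values 0.9 and 0.3 are all assumptions. The sample table encodes the nodes of FIG. 10.

```python
def find_split_candidate(B, th1, th2):
    """Return (node index j, listed symbols) for the first node whose largest
    observation probability is <= th1, or None if no such node exists.

    B[j][o] is the observation probability of symbol o at node j.
    """
    for j, b_j in enumerate(B):
        if max(b_j) <= th1:                                        # S101 / S102
            symbols = [o for o, p in enumerate(b_j) if p >= th2]   # S103
            return j, symbols
    return None

# Assumed symbol indices: 0 = part 15, 1 = part 7, 2 = part 13, 3 = part 5.
B_fig10 = [
    [1.0, 0.0, 0.0, 0.0],  # S10 observes part 15
    [0.0, 0.5, 0.5, 0.0],  # S20 observes part 7 or part 13
    [0.0, 0.0, 0.0, 1.0],  # S30 observes part 5
]

candidate = find_split_candidate(B_fig10, th1=0.9, th2=0.3)
print(candidate)
```

With these assumed thresholds the node standing for S20 is selected, and K equals the length of the returned symbol list.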

ステップＳ１０４において、学習器３４は、ノードs_jをＫ個に分割する。 In step S104, the learning device 34 divides the node s _j into K pieces.

このとき、ノードs_jが分割された後の観測確率テーブルにおける観測確率および状態遷移確率テーブルにおける状態遷移確率は、次のようにして設定される。 At this time, the observation probability in the observation probability table after the node s _j is divided and the state transition probability in the state transition probability table are set as follows.

Let s_j^k denote the k-th of the K nodes obtained by splitting node s_j, and let b_j^k denote the vector whose elements are the observation probabilities of the respective observation symbols observed at node s_j^k.

In step S104, the learning device 34 sets the vector b_j^k so that only the observation probability for observation symbol o_k is conspicuously large (very close to 1), while the observation probabilities for the other observation symbols are uniform random numbers in a very small range.

また、ノードs_jが分割される前のノードs_iからノードs_jへの状態遷移確率をａ_ijで表すこととし、ノードs_jが分割された後のノードs_iからノードs_j ^kへの状態遷移確率をa^k _ijで表すことにする。 In addition, the state transition probability from the node s _i to the node s _j before the node s _j is divided is represented by a _ij , and the node s _i to the node s _j ^k after the node s _j is divided The state transition probability is represented by a ^k _ij .

In step S104, the learning device 34 sets the state transition probabilities a^k_ij so that the pre-split state transition probability a_ij is divided among them in proportion to the pre-split observation probabilities of the respective observation symbols.

Further, let a_ji denote the state transition probability from node s_j to node s_i before node s_j is split, and let a^k_ji denote the state transition probability from node s_j^k to node s_i after node s_j is split.

ステップＳ１０４において、Ｋ個の状態遷移確率a^k _jiのそれぞれに、状態遷移確率ａ_jiを設定する。 In step S104, a state transition probability a _ji is set for each of the K state transition probabilities a ^k _ji .

このようにして、スプリットアルゴリズムの適用の処理が実行される。 In this way, the split algorithm application process is executed.
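The parameter settings of step S104 can be sketched as follows, under stated assumptions. The table layout (`A[c][i][j]` for action c, `B[j][o]` for symbol o), the function name, the choice to keep the first copy at the original index and append the remaining copies, the value of `eps`, and the renormalization of each new observation vector are all illustrative choices that the text does not specify. The sample data encodes FIG. 10; rows for actions that the text does not describe are left as zeros.

```python
import random

def split_node(A, B, j, symbols, eps=1e-6, rng=None):
    """Split node j into K = len(symbols) nodes, one per listed symbol."""
    rng = rng or random.Random(0)
    n, K = len(B), len(symbols)
    total = sum(B[j][o] for o in symbols)
    ratios = [B[j][o] / total for o in symbols]   # pre-split observation ratios
    new_ids = [j] + list(range(n, n + K - 1))     # index of the k-th copy

    # b_j^k: symbol o_k close to 1, all other symbols tiny uniform random values.
    new_B = [row[:] for row in B] + [None] * (K - 1)
    for k, o_k in enumerate(symbols):
        row = [rng.uniform(0.0, eps) for _ in B[j]]
        row[o_k] = 1.0
        s = sum(row)
        new_B[new_ids[k]] = [p / s for p in row]

    new_A = []
    for A_c in A:
        # a^k_ji: every copy inherits the outgoing probabilities of node j.
        rows = [r[:] for r in A_c] + [A_c[j][:] for _ in range(K - 1)]
        out = []
        for r in rows:
            a_ij = r[j]
            new_r = r + [0.0] * (K - 1)
            # a^k_ij: the incoming probability is prorated by the ratios.
            for k in range(K):
                new_r[new_ids[k]] = a_ij * ratios[k]
            out.append(new_r)
        new_A.append(out)
    return new_A, new_B

# FIG. 10: nodes 0 = S10, 1 = S20, 2 = S30.
A = [
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]],  # move right
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],  # move left
]
B = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.5, 0.5, 0.0], [0.0, 0.0, 0.0, 1.0]]

new_A, new_B = split_node(A, B, j=1, symbols=[1, 2])
print(new_A[0][0])  # moving right from S10 after the split
```

In this sketch node 1 becomes S21 and the appended node 3 becomes S22, so the resulting tables correspond to the structure of FIG. 11. Because each incoming probability is divided by ratios that sum to 1, every row of the state transition probability table keeps its original total.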

次に、アクション遷移制約について説明する。アクション遷移制約は、一状態一観測制約が課されていることを前提とした制約である。 Next, action transition constraints will be described. The action transition constraint is a constraint on the premise that a one-state one-observation constraint is imposed.

The action transition constraint requires that different observation symbols o_j (j = 1, ..., J) be observed at the transition destination nodes s_j (j = 1, ..., J) reachable from a given node s_i by the same action c_k, and likewise at the transition source nodes s_j (j = 1, ..., J) from which node s_i is reachable by the same action c_k. The former is called the forward constraint and the latter the backward constraint. That is, under the action transition constraint, the same observation symbol must not be observed at multiple transition destination (or transition source) nodes reachable by the same action c_k. Note that, even under the action transition constraint, multiple transition destination nodes reachable by the same action c_k may exist as long as those nodes observe different observation symbols.
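A check for the forward and backward constraints can be sketched as follows. The function names, the threshold `th`, the use of the most probable symbol as the symbol of a node (reasonable only because the one-state one-observation constraint is presupposed), and the three-node sample data are assumptions for illustration.

```python
from collections import defaultdict

def dominant_symbol(b):
    """Index of the most probable observation symbol of one node."""
    return max(range(len(b)), key=b.__getitem__)

def violations(A_c, B, direction="forward", th=0.1):
    """Map (pivot node, symbol) to the nodes that share that symbol.

    A_c[i][j] is the transition probability from node i to node j under one
    action. "forward" groups the destinations of each pivot node; "backward"
    groups the sources of each pivot node.
    """
    n = len(B)
    found = {}
    for pivot in range(n):
        by_symbol = defaultdict(list)
        for other in range(n):
            a = A_c[pivot][other] if direction == "forward" else A_c[other][pivot]
            if a >= th:
                by_symbol[dominant_symbol(B[other])].append(other)
        for sym, nodes in by_symbol.items():
            if len(nodes) > 1:
                found[(pivot, sym)] = nodes
    return found

# Hypothetical model: one action takes node 0 to node 1 or node 2.
A_c = [[0.0, 0.5, 0.5], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
B_same = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]   # nodes 1 and 2 share symbol 0
B_diff = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]   # nodes 1 and 2 differ

print(violations(A_c, B_same))   # forward constraint violated at node 0
print(violations(A_c, B_diff))   # allowed: destinations observe different symbols
```

The second call returns an empty mapping, reflecting the point that several destinations under one action are permitted as long as they observe different symbols.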

アクション遷移制約を実現する方式の例としてフォワードマージアルゴリズムおよびバックワードマージアルゴリズムを適用する例が考えられる。 An example of applying a forward merge algorithm and a backward merge algorithm can be considered as an example of a method for realizing an action transition constraint.

図１３と図１４は、フォワードマージアルゴリズムを説明する図である。 13 and 14 are diagrams for explaining the forward merge algorithm.

図１３と図１４では、図中の円でノードが示されており、各ノードで観測されるシンボルとして図２を参照して上述したパーツの図形が表示されている。 In FIG. 13 and FIG. 14, nodes are indicated by circles in the drawings, and the graphics of the parts described above with reference to FIG. 2 are displayed as symbols observed at each node.

図１３は、エージェントの学習の結果得られた状態遷移確率テーブルと観測確率テーブルの内容を可視化した図である。図１３の例は、ノードＳ１０、ノードＳ２１、ノードＳ２２、ノードＳ３１、ノードＳ３２が存在する場合の例を示している。この例の場合、ノードＳ１０において右方向に移動するアクションを実行すると５０％の確率でノードＳ２１に遷移し、５０％の確率でノードＳ２２に遷移する。 FIG. 13 is a diagram visualizing the contents of the state transition probability table and the observation probability table obtained as a result of agent learning. The example of FIG. 13 illustrates an example in which the node S10, the node S21, the node S22, the node S31, and the node S32 exist. In the case of this example, when the action of moving in the right direction is executed in the node S10, the transition is made to the node S21 with a probability of 50% and the transition is made to the node S22 with a probability of 50%.

ノードＳ２１では、図２のパーツ５が１００％の確率で観測され、ノードＳ２２でも図２のパーツ５が１００％の確率で観測される。 At node S21, part 5 in FIG. 2 is observed with a probability of 100%, and at node S22, part 5 in FIG. 2 is observed with a probability of 100%.

さらに、ノードＳ２１において右方向に移動するアクションを実行すると１００％の確率でノードＳ３１に遷移し、ノードＳ２２において右方向に移動するアクションを実行すると１００％の確率でノードＳ３２に遷移する。 Further, if an action moving rightward is executed in the node S21, transition to the node S31 is made with a probability of 100%, and if an action moving rightward is executed in the node S22, the action is changed to the node S32 with a probability of 100%.

なお、図１３（図１４も同じ）は、状態遷移確率テーブルと観測確率テーブルの内容を可視化したものであり、実際には、図１３に対応する状態遷移確率テーブルと観測確率テーブルが内部モデルデータとして学習されている。このような内部モデルデータにフォワードマージアルゴリズムを適用すると、状態遷移確率テーブルと観測確率テーブルの内容は、図１４に示されるように変化する。 FIG. 13 (same for FIG. 14) is a visualization of the contents of the state transition probability table and the observation probability table. Actually, the state transition probability table and the observation probability table corresponding to FIG. As learned. When the forward merge algorithm is applied to such internal model data, the contents of the state transition probability table and the observation probability table change as shown in FIG.

図１４は、図１３に対応する状態遷移確率テーブルと観測確率テーブルの内容にフォワードマージアルゴリズムを適用した場合に得られる状態遷移確率テーブルと観測確率テーブルの内容を可視化した図である。 FIG. 14 is a diagram visualizing the contents of the state transition probability table and the observation probability table obtained when the forward merge algorithm is applied to the contents of the state transition probability table and the observation probability table corresponding to FIG.

図１４の例では、ノードＳ１０、ノードＳ２０、ノードＳ３１、ノードＳ３２が存在する。すなわち、図１３のノードＳ２１とノードＳ２２が図１４のノードＳ２０に併合（マージ）されたのである。この例の場合、ノードＳ２０では図２のパーツ５が１００％の確率で観測され、ノードＳ１０において右方向に移動するアクションを実行すると１００％の確率でノードＳ２０に遷移する。 In the example of FIG. 14, there are a node S10, a node S20, a node S31, and a node S32. That is, the node S21 and the node S22 in FIG. 13 are merged with the node S20 in FIG. In this example, the part 5 of FIG. 2 is observed with a probability of 100% at the node S20, and when the action of moving in the right direction is executed at the node S10, the transition to the node S20 is performed with a probability of 100%.

また、ノードＳ２０において右方向に移動するアクションを実行すると５０％の確率でノードＳ３１に遷移し、５０％の確率でノードＳ３２に遷移する。 Further, when an action of moving in the right direction is executed in the node S20, the node transits to the node S31 with a probability of 50%, and transits to the node S32 with a probability of 50%.

このように、フォワードマージアルゴリズムを適用することにより、アクション遷移制約のうちのforward制約を実現することが可能である。 Thus, by applying the forward merge algorithm, it is possible to realize a forward constraint among the action transition constraints.

In other words, because the action transition constraint does not permit the same observation symbol to be observed at multiple transition destination nodes reachable by the same action c_k, nodes S21 and S22 of FIG. 13 were merged into node S20 of FIG. 14. If there were also a node S23 reached by executing the action of moving right at node S10, and a part other than part 5 were observed at node S23, node S23 would not be a target of the merge. This is because, even under the action transition constraint, multiple transition destination nodes reachable by the same action c_k may exist as long as they observe different observation symbols.

すなわち、１つのノードにおいて所定のアクションを実行した場合に遷移し得る遷移先ノードのそれぞれでの観測確率分布が類似するノードを発見し、発見されたノードが併合（マージ）されるのである。 That is, a node having a similar observation probability distribution at each transition destination node that can transition when a predetermined action is executed in one node is found, and the discovered nodes are merged.

In the example described above with reference to FIGS. 13 and 14, the nodes were merged so that the state transition probability to the node at which a given observation symbol is observed became 100%; in practice, however, a state transition probability of exactly 100% is rare. This is because the forward constraint does not, in the strict sense, prohibit the same observation symbol from being observed at multiple transition destination nodes reachable by the same action c_k.

図９の学習器３４により内部モデルデータに対してフォワードマージアルゴリズムが適用される場合の処理について、図１５のフローチャートを参照して説明する。 Processing when the forward merge algorithm is applied to the internal model data by the learning device 34 of FIG. 9 will be described with reference to the flowchart of FIG.

In step S121, the learning device 34 refers to the state transition probability table stored in the internal model data storage unit 37 and checks the state transition probability table for one particular action c_k.

In step S122, the learning device 34 identifies one transition source node s_i in the state transition probability table checked in step S121, and checks the vector a_ij(k) whose elements are the state transition probabilities from node s_i to each transition destination node. The learning device 34 then lists the transition destination nodes s_j whose state transition probability is greater than or equal to a threshold.

ステップＳ１２３において、学習器３４は、ステップＳ１２２の処理でリストされた遷移先ノードを観測シンボル毎に分類する。 In step S123, the learning device 34 classifies the transition destination nodes listed in the process of step S122 for each observation symbol.

なお、上述したように、アクション遷移制約は、一状態一観測制約が課されていることを前提とした制約だから、遷移先ノードで観測される観測シンボルは、ほぼ１つに特定することが可能である。 Note that as described above, action transition constraints are based on the premise that one-state one-observation constraint is imposed, so it is possible to specify almost one observation symbol observed at the transition destination node. It is.

ステップＳ１２４において、学習器３４は、ステップＳ１２３の処理で分類された同一の観測シンボルのノードをマージする。 In step S124, the learning device 34 merges the nodes of the same observation symbol classified in the process of step S123.

That is, the group of nodes corresponding to observation symbol m, as classified in step S123, is denoted s_j^{m,l} (l = 1, ..., L), and the L nodes s_j^{m,l} are merged into one node s_j^m.

このとき、Ｌ個のノードs_j ^m，lが１つのノードs_j ^mにマージされた後の状態遷移確率テーブルにおける状態遷移確率および観測確率テーブルにおける観測確率は、次のようにして設定される。 At this time, the state transition probability in the state transition probability table after the L nodes s _j ^{m, l} are merged into one node s _j ^m and the observation probability in the observation probability table are set as follows. .

The state transition probability a_ij^m from node s_i to node s_j^m is obtained and set by Expression (14).

a_ij^m = Σ_l a_ij^{m,l}  (l = 1, ..., L)  ... (14)

Here, a_ij^{m,l} represents the state transition probability from node s_i to one node s_j^{m,l} before the merge.

The state transition probability a_ji^m from node s_j^m to node s_i is obtained and set as the simple average of a_ji^{m,l}, or as a weighted average using Σ_k a_kj^{m,l} as the weights.

Ｌ個のノードs_j ^m，lが１つのノードs_j ^mにマージされた後のノードs_j ^mにおける観測シンボルmの観測確率b_j ^mは、b_j ^m，lの単純平均、またはΣ_ka_kj ^m，lによる重み付き平均として求められて設定される。 The observation probability b _j ^m of the observation symbol m at the node s _j ^m after the L nodes s _j ^{m, l} are merged into one node s _j ^m is the simple average of b _j ^{m, l} or Σ _k It is obtained and set as a weighted average by a _kj ^{m, l} .

ステップＳ１２４では、このように、状態遷移確率a_ij ^m、状態遷移確率a_ji ^m、観測確率b_j ^mが設定される。 In step S124, the state transition probability a _ij ^m , the state transition probability a _ji ^m , and the observation probability b _j ^m are set in this way.

このようにして、フォワードマージアルゴリズムの適用の処理が実行される。 In this way, the process of applying the forward merge algorithm is executed.
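Steps S121 to S124 can be sketched as follows, under stated assumptions. The function names, the table layout (`A[c][i][j]`, `B[j][o]`), the threshold `th`, and the symbols assigned to S10, S31, and S32 are illustrative and not given in the text. Incoming probabilities of the merged nodes are summed, while outgoing probabilities and observation probabilities use the simple average; the weighted average that the text also allows is not shown. The sample data encodes FIG. 13.

```python
def dominant_symbol(b):
    return max(range(len(b)), key=b.__getitem__)

def merge_nodes(A, B, group):
    """Merge the nodes in `group` into one node kept at index group[0]."""
    keep, drop, L = group[0], set(group[1:]), len(group)
    survivors = [n for n in range(len(B)) if n not in drop]

    def mean(rows):
        return [sum(col) / L for col in zip(*rows)]

    new_B = [mean([B[g] for g in group]) if n == keep else B[n][:]
             for n in survivors]
    new_A = []
    for A_c in A:
        # Outgoing probabilities of the merged node: simple average of the rows.
        rows = {n: mean([A_c[g] for g in group]) if n == keep else A_c[n]
                for n in survivors}
        # Incoming probabilities of the merged node: sum over merged columns.
        new_A.append([[sum(rows[n][g] for g in group) if m == keep else rows[n][m]
                       for m in survivors] for n in survivors])
    return new_A, new_B

def forward_merge(A, B, c, i, th=0.1):
    """Merge destinations of node i under action c that share a symbol."""
    by_symbol = {}
    for j, a_ij in enumerate(A[c][i]):            # S122: list destinations
        if a_ij >= th:
            by_symbol.setdefault(dominant_symbol(B[j]), []).append(j)  # S123
    for group in by_symbol.values():              # S124: merge one group
        if len(group) > 1:
            return merge_nodes(A, B, group)       # indices shift, so stop here
    return A, B

# FIG. 13: nodes 0 = S10, 1 = S21, 2 = S22, 3 = S31, 4 = S32 (move right only).
A = [[[0.0, 0.5, 0.5, 0.0, 0.0],
      [0.0, 0.0, 0.0, 1.0, 0.0],
      [0.0, 0.0, 0.0, 0.0, 1.0],
      [0.0, 0.0, 0.0, 0.0, 0.0],
      [0.0, 0.0, 0.0, 0.0, 0.0]]]
# Symbol 0 = part 5; symbols 1 to 3 are placeholders for unspecified parts.
B = [[0.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

new_A, new_B = forward_merge(A, B, c=0, i=0)
print(new_A[0][0], new_A[0][1])
```

After the call, index 1 plays the role of node S20 in FIG. 14. The sketch merges at most one group per call because node indices change after a merge; a caller would repeat the call until nothing changes.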

図１６と図１７は、バックワードマージアルゴリズムを説明する図である。 16 and 17 are diagrams for explaining the backward merge algorithm.

図１６と図１７では、図中の円でノードが示されており、各ノードで観測されるシンボルとして図２を参照して上述したパーツの図形が表示されている。 In FIG. 16 and FIG. 17, nodes are indicated by circles in the drawings, and the graphics of the parts described above with reference to FIG. 2 are displayed as symbols observed at each node.

図１６は、エージェントの学習の結果得られた状態遷移確率テーブルと観測確率テーブルの内容を可視化した図である。図１６の例は、ノードＳ１１、ノードＳ１２、ノードＳ２１、ノードＳ２２、ノードＳ３０が存在する場合の例を示している。この例の場合、ノードＳ１１において右方向に移動するアクションを実行すると１００％の確率でノードＳ２１に遷移する。ノードＳ１２において右方向に移動するアクションを実行すると１００％の確率でノードＳ２２に遷移する。 FIG. 16 is a diagram visualizing the contents of the state transition probability table and the observation probability table obtained as a result of agent learning. The example of FIG. 16 illustrates an example in which the node S11, the node S12, the node S21, the node S22, and the node S30 exist. In this example, when an action of moving in the right direction is executed in the node S11, the transition to the node S21 is made with a probability of 100%. When the action of moving in the right direction is executed in the node S12, the transition to the node S22 is made with a probability of 100%.

また、ノードＳ２１では、図２のパーツ７が１００％の確率で観測される。ノードＳ２２では、図２のパーツ７が１００％の確率で観測される。 Further, at the node S21, the part 7 in FIG. 2 is observed with a probability of 100%. At node S22, part 7 in FIG. 2 is observed with a probability of 100%.

さらに、ノードＳ２１において右方向に移動するアクションを実行すると１００％の確率でノードＳ３０に遷移し、ノードＳ２２において右方向に移動するアクションを実行すると１００％の確率でノードＳ３０に遷移する。 Further, when an action moving in the right direction is executed in the node S21, transition to the node S30 is performed with a probability of 100%, and when an action moving in the right direction is executed in the node S22, transition to the node S30 is performed with a probability of 100%.

なお、図１６（図１７も同じ）は、状態遷移確率テーブルと観測確率テーブルの内容を可視化したものであり、実際には、図１６に対応する状態遷移確率テーブルと観測確率テーブルが内部モデルデータとして学習されている。このような内部モデルデータにバックワードマージアルゴリズムを適用すると、状態遷移確率テーブルと観測確率テーブルの内容は、図１７に示されるように変化する。 Note that FIG. 16 (same for FIG. 17) is a visualization of the contents of the state transition probability table and the observation probability table. Actually, the state transition probability table and the observation probability table corresponding to FIG. As learned. When the backward merge algorithm is applied to such internal model data, the contents of the state transition probability table and the observation probability table change as shown in FIG.

図１７は、図１６に対応する状態遷移確率テーブルと観測確率テーブルの内容にバックワードマージアルゴリズムを適用した場合に得られる状態遷移確率テーブルと観測確率テーブルの内容を可視化した図である。 FIG. 17 is a diagram visualizing the contents of the state transition probability table and the observation probability table obtained when the backward merge algorithm is applied to the contents of the state transition probability table and the observation probability table corresponding to FIG.

図１７の例では、ノードＳ１１、ノードＳ１２、ノードＳ２０、ノードＳ３０が存在する。すなわち、図１６のノードＳ２１とノードＳ２２が図１７のノードＳ２０に併合（マージ）されたのである。この例の場合、ノードＳ２０では図２のパーツ７が１００％の確率で観測される。 In the example of FIG. 17, there are a node S11, a node S12, a node S20, and a node S30. That is, the nodes S21 and S22 in FIG. 16 are merged with the node S20 in FIG. In this example, the part 7 in FIG. 2 is observed at the node S20 with a probability of 100%.

また、ノードＳ１１において右方向に移動するアクションを実行すると１００％の確率でノードＳ２０に遷移し、ノードＳ１２において右方向に移動するアクションを実行すると１００％の確率でノードＳ２０に遷移する。 Further, when an action that moves in the right direction is executed in the node S11, a transition is made to the node S20 with a probability of 100%, and when an action that moves in the right direction is executed in the node S12, a transition to the node S20 is made with a probability of 100%.

さらに、ノードＳ２０において右方向に移動するアクションを実行すると１００％の確率でノードＳ３０に遷移する。 Furthermore, when an action of moving in the right direction is executed in the node S20, the transition to the node S30 is made with a probability of 100%.

このように、バックワードマージアルゴリズムを適用することにより、アクション遷移制約のうちのbackward制約を実現することが可能である。 In this way, by applying the backward merge algorithm, it is possible to realize the backward constraint among the action transition constraints.

In other words, because the action transition constraint does not permit the same observation symbol to be observed at multiple transition source nodes that transition by the same action c_k, nodes S21 and S22 of FIG. 16 were merged into node S20 of FIG. 17. If there were also a node S23 that transitions to node S30 by executing the action of moving right, and a part other than part 7 were observed at node S23, node S23 would not be a target of the merge. This is because, even under the action transition constraint, multiple transition source nodes that transition by the same action c_k may exist as long as they observe different observation symbols.

すなわち、1つのノードに対して、共通のアクションによって遷移してくる遷移元ノードのそれぞれでの観測確率分布が類似するノードを発見し、発見されたノードが併合されるのである。 That is, for a single node, a node having a similar observation probability distribution at each of the transition source nodes that are transitioned by a common action is discovered, and the discovered nodes are merged.

図９の学習器３４により内部モデルデータに対してバックワードマージアルゴリズムが適用される場合の処理について、図１８のフローチャートを参照して説明する。 Processing when the backward merge algorithm is applied to the internal model data by the learning device 34 of FIG. 9 will be described with reference to the flowchart of FIG.

In step S141, the learning device 34 refers to the state transition probability table stored in the internal model data storage unit 37 and checks the state transition probability table for one particular action c_k.

In step S142, the learning device 34 identifies one transition destination node s_j in the state transition probability table checked in step S141, and checks the vector a_ij(k) whose elements are the state transition probabilities to node s_j from each transition source node. The learning device 34 then lists the transition source nodes s_i whose state transition probability is greater than or equal to a threshold.

ステップＳ１４３において、学習器３４は、ステップＳ１４２の処理でリストされた遷移元ノードを観測シンボル毎に分類する。 In step S143, the learning device 34 classifies the transition source nodes listed in the process of step S142 for each observation symbol.

なお、上述したように、アクション遷移制約は、一状態一観測制約が課されていることを前提とした制約だから、遷移元ノードで観測される観測シンボルは、ほぼ１つに特定することが可能である。 Note that, as described above, action transition constraints are based on the premise that one-state one-observation constraint is imposed, so it is possible to specify almost one observation symbol observed at the transition source node. It is.

ステップＳ１４４において、学習器３４は、ステップＳ１４３の処理で分類された同一の観測シンボルのノードをマージする。 In step S144, the learning device 34 merges the nodes of the same observation symbols classified in the process of step S143.

That is, the group of nodes corresponding to observation symbol m, as classified in step S143, is denoted s_i^{m,l} (l = 1, ..., L), and the L nodes s_i^{m,l} are merged into one node s_i^m.

このとき、Ｌ個のノードs_i ^m，lが１つのノードs_i ^mにマージされた後の状態遷移確率テーブルにおける状態遷移確率および観測確率テーブルにおける観測確率は、次のようにして設定される。 At this time, the state transition probability in the state transition probability table and the observation probability in the observation probability table after the L nodes s _i ^{m, l} are merged into one node s _i ^m are set as follows. .

The state transition probability a_ij^m from node s_i^m to node s_j is obtained and set as the simple average of a_ij^{m,l}, or as a weighted average using Σ_k a_ki^{m,l} as the weights.

ノードs_jからのノードs_i ^mへの状態遷移確率a_ji ^mは、Σ_la_ji ^m，lにより求められて設定される。 The state transition probability a _ji ^m from the node s _j to the node s _i ^m is obtained and set by Σ _l a _ji ^{m, l} .

Ｌ個のノードs_i ^m，lが１つのノードs_i ^mにマージされた後のノードs_i ^mにおける観測シンボルmの観測確率b_i ^mは、b_i ^m，lの単純平均、またはΣ_ka_ki ^m，lによる重み付き平均として求められて設定される。 The observation probability b _i ^m of the observation symbol m at the node s _i ^m after the L nodes s _i ^{m, l} are merged into one node s _i ^m is the simple average of b _i ^{m, l} or Σ _k It is obtained and set as a weighted average by a _ki ^{m, l} .

このようにして、バックワードマージアルゴリズムの適用の処理が実行される。 In this way, the backward merging algorithm application process is executed.
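Steps S141 to S144 can be sketched in the same way as the forward case; the difference is that the candidate nodes are the transition sources of one destination node, read from a column of the table rather than a row. As before, the function names, the table layout, the threshold `th`, and the symbols assigned to S11, S12, and S30 are assumptions, and only the simple-average option is shown. The sample data encodes FIG. 16.

```python
def dominant_symbol(b):
    return max(range(len(b)), key=b.__getitem__)

def merge_nodes(A, B, group):
    """Merge the nodes in `group` into one node kept at index group[0]."""
    keep, drop, L = group[0], set(group[1:]), len(group)
    survivors = [n for n in range(len(B)) if n not in drop]

    def mean(rows):
        return [sum(col) / L for col in zip(*rows)]

    new_B = [mean([B[g] for g in group]) if n == keep else B[n][:]
             for n in survivors]
    new_A = []
    for A_c in A:
        # From the merged node: simple average of the merged rows.
        rows = {n: mean([A_c[g] for g in group]) if n == keep else A_c[n]
                for n in survivors}
        # To the merged node: sum over the merged columns.
        new_A.append([[sum(rows[n][g] for g in group) if m == keep else rows[n][m]
                       for m in survivors] for n in survivors])
    return new_A, new_B

def backward_merge(A, B, c, j, th=0.1):
    """Merge sources of node j under action c that share a symbol."""
    by_symbol = {}
    for i, row in enumerate(A[c]):                # S142: list sources of node j
        if row[j] >= th:
            by_symbol.setdefault(dominant_symbol(B[i]), []).append(i)  # S143
    for group in by_symbol.values():              # S144: merge one group
        if len(group) > 1:
            return merge_nodes(A, B, group)
    return A, B

# FIG. 16: nodes 0 = S11, 1 = S12, 2 = S21, 3 = S22, 4 = S30 (move right only).
A = [[[0.0, 0.0, 1.0, 0.0, 0.0],
      [0.0, 0.0, 0.0, 1.0, 0.0],
      [0.0, 0.0, 0.0, 0.0, 1.0],
      [0.0, 0.0, 0.0, 0.0, 1.0],
      [0.0, 0.0, 0.0, 0.0, 0.0]]]
# Symbol 0 = part 7; symbols 1 to 3 are placeholders for unspecified parts.
B = [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

new_A, new_B = backward_merge(A, B, c=0, j=4)
print(new_A[0])
```

After the call, index 2 plays the role of node S20 in FIG. 17: both S11 and S12 transition to it when moving right, and it transitions to S30.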

このように一状態一観測制約およびアクション遷移制約を課すことで、学習処理の負荷を軽減することが可能となる。 By imposing the one-state one-observation constraint and the action transition constraint in this way, it is possible to reduce the load of the learning process.

図１９は、アクション拡張型ＨＭＭにおける状態遷移確率テーブルと観測確率テーブルの尤度を比較する表である。同図の最も左側の列は、学習の回数（試行回数）を表している。試行回数の右側の列は、「最初の学習」の列とされており、それぞれの試行回数時に学習された状態遷移確率テーブルと観測確率テーブルの尤度の値が記述されている。「最初の学習」の右側の列は、「スプリット・マージ後」の列とされている。この列には、「最初の学習」によって得られた状態遷移確率テーブルと観測確率テーブルに対して、図１２、図１５、および図１８の処理を施すことにより得られた状態遷移確率テーブルと観測確率テーブルの尤度の値が記述されている。さらに、「スプリット・マージ後」の右側の列は、「増分」の列とされている。この列には、「スプリット・マージ後」の列に記述された尤度値と「最初の学習」の列に記述された尤度値との差分が記述されている。 FIG. 19 is a table for comparing the likelihoods of the state transition probability table and the observation probability table in the action expanded HMM. The leftmost column in the figure represents the number of learning (number of trials). The column on the right side of the number of trials is a column of “first learning”, which describes the likelihood values of the state transition probability table and the observation probability table learned at each trial number. The column on the right side of “first learning” is the column “after split / merging”. In this column, the state transition probability table and the observation probability table obtained by performing the processing of FIG. 12, FIG. 15 and FIG. 18 on the state transition probability table and the observation probability table obtained by “first learning”. The likelihood value of the probability table is described. Further, the column on the right side of “after split / merge” is the column of “increment”. In this column, the difference between the likelihood value described in the “after split / merging” column and the likelihood value described in the “first learning” column is described.

As FIG. 19 shows, performing the processes of FIGS. 12, 15, and 18 improves the likelihood. It can also be seen that, after these processes, the likelihood more often takes a value near “−60”. In other words, when learning yields a likelihood near “−60”, the given environment can be said to have been learned most appropriately. By contrast, the likelihood in the “first learning” column varies widely from trial to trial, showing that it is difficult to learn the given environment most appropriately merely by repeating the learning.

すなわち、一状態一観測制約とアクション遷移制約を課すことで、アクション拡張型ＨＭＭの学習の精度を高めることができるのである。 That is, by imposing one-state one-observation constraint and action transition constraint, it is possible to improve the learning accuracy of the action expanded HMM.

FIGS. 20 to 26 illustrate how the learning result changes when the one-state one-observation constraint and the action transition constraint are imposed.

The following example considers a case in which the part at the circled position in the maze of FIG. 20 is changed so that the maze takes the structure shown in FIG. 21, and the agent learns this maze as its environment.

FIG. 22 visualizes the contents of the state transition probability table and the observation probability table of an agent that has learned the environments shown in FIG. 20 and FIG. 21. In FIG. 22, circles represent nodes, and a line connects nodes between which a transition occurs through the action in the direction indicated by a triangle. The number inside each circle is the index of the node that the circle represents. FIG. 22 shows the tables obtained without imposing the one-state one-observation constraint and the action transition constraint.

By contrast, FIG. 23 visualizes the tables obtained by applying, to the state transition probability table and the observation probability table corresponding to FIG. 22, the processing that imposes the one-state one-observation constraint and the action transition constraint.

In FIG. 23, node 28 of FIG. 22 has been split into node 28 and node 31. Node 17 and node 19 of FIG. 22 have been merged into node 7, and node 12 and node 25 of FIG. 22 have been merged into node 12.

Note that, while the agent was moving through the maze for learning, the maze had the structure of FIG. 20 during one period and the structure of FIG. 21 during another. The positions of the nodes in FIG. 23 therefore do not coincide exactly with the positions of the parts in either FIG. 20 or FIG. 21. For example, nodes 24, 36, 2, and 18 of FIG. 23 show that the agent has properly learned that the structure of the maze can change from one period to another.

In practice, the maze is larger. For example, the agent is made to learn a maze such as the one shown in FIG. 24 as its environment. In this case, visualizing the state transition probability table and the observation probability table obtained without imposing the one-state one-observation constraint and the action transition constraint gives FIG. 25, whereas visualizing the tables obtained with both constraints imposed gives FIG. 26.

Compared with FIG. 25, FIG. 26 is clearly closer to the structure of the actual maze (FIG. 24).

This concludes the description of the techniques needed to learn the action-expanded HMM, which is inevitably large in scale, efficiently and appropriately.

Next, the learning process of the action-expanded HMM performed by the learning device 34 of FIG. 9, as described so far, is explained with reference to the flowchart of FIG. 27.

In step S161, the learning device 34 acquires initial internal model data. The initial internal model data is, for example, the state transition probability table and the observation probability table as they stand immediately after being generated by the robot moving through the maze. The state transition probabilities and observation probabilities set in these tables are generated, for example, from information that pairs the action executed by the robot at each time with the observation symbol observed as a result of that action.

In step S162, the learning device 34 optimizes the internal model data acquired in step S161. Here, the values in the state transition probability table and the observation probability table are updated so as to be optimized, for example by maximum likelihood estimation.

In step S163, the learning device 34 determines whether the internal model data optimized in step S162 satisfies the one-state one-observation constraint and the action transition constraint described above.

For example, even when more than one observation symbol is observed at a node, the one-state one-observation constraint is satisfied if the observation probability of one of those symbols is at or above a threshold. Likewise, the action transition constraint is satisfied if, for example, the probability that the same observation symbol is observed at more than one of the destination nodes reachable by the same action is at or below a threshold.
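The two checks of step S163 might be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the specification: the table layouts `obs_prob[j][o]` and `trans_prob[i][k][j]`, the function names, and the threshold values are all hypothetical, and the action transition check uses only one possible reading of the constraint (no two destinations reachable by the same action share the same dominant observation symbol).

```python
def satisfies_one_state_one_observation(obs_prob, threshold=0.9):
    """obs_prob[j][o]: probability of observing symbol o at node j (assumed layout)."""
    # Each node needs one symbol whose observation probability reaches the threshold.
    return all(max(row) >= threshold for row in obs_prob)


def dominant_symbol(obs_row):
    """Index of the most probable observation symbol for one node."""
    return max(range(len(obs_row)), key=lambda o: obs_row[o])


def satisfies_action_transition(trans_prob, obs_prob, min_transition=0.05):
    """trans_prob[i][k][j]: probability of moving from node i to node j under action k.

    Assumed reading: for every (node, action) pair, the destinations that are
    reachable with non-negligible probability must have distinct dominant symbols.
    """
    for per_action in trans_prob:        # loop over source nodes i
        for dest_probs in per_action:    # loop over actions k
            seen = set()
            for j, p in enumerate(dest_probs):
                if p < min_transition:   # ignore negligible transitions
                    continue
                symbol = dominant_symbol(obs_prob[j])
                if symbol in seen:       # two destinations look the same
                    return False
                seen.add(symbol)
    return True
```

A model that fails the first check would be a candidate for splitting a node, and one that fails the second a candidate for merging nodes, in the sense of the processes of FIG. 12, FIG. 15, and FIG. 18.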

If it is determined in step S163 that the internal model data does not satisfy the one-state one-observation constraint and the action transition constraint, the process proceeds to step S164.

In step S164, the learning device 34 modifies the internal model data so that it satisfies the one-state one-observation constraint and the action transition constraint. Here, the values in the state transition probability table and the observation probability table are changed, for example by executing the processes described above with reference to FIG. 12, FIG. 15, and FIG. 18.

After step S164, the process returns to step S162. Steps S162 to S164 are then repeated until it is determined in step S163 that the one-state one-observation constraint and the action transition constraint are satisfied.

If it is determined in step S163 that the one-state one-observation constraint and the action transition constraint are satisfied, the process proceeds to step S165.

In step S165, the learning device 34 stores the internal model data in the internal model data storage unit 37.

The learning process of the action-expanded HMM is executed in this manner.
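The flow of steps S161 to S165 can be sketched as a simple loop. The callbacks `optimize`, `satisfies_constraints`, `enforce_constraints`, and `save` are hypothetical stand-ins for the processing of steps S162 to S165, and `max_rounds` is a safety limit added here for illustration that the flowchart of FIG. 27 does not contain.

```python
def learn_action_hmm(model, optimize, satisfies_constraints,
                     enforce_constraints, save, max_rounds=100):
    """Run the optimize / check / repair loop on an initial model (step S161)."""
    for _ in range(max_rounds):
        model = optimize(model)               # step S162: e.g. maximum likelihood estimation
        if satisfies_constraints(model):      # step S163: both constraints hold?
            break
        model = enforce_constraints(model)    # step S164: split / merge nodes
    # Step S165: store the model (also reached if max_rounds is exhausted).
    save(model)
    return model
```

With toy callbacks, for example an integer "model" that `optimize` increments and a check `model >= 3`, the loop runs three rounds and then stores the value 3.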

There are two approaches to HMM learning: the batch learning method and the additional learning method. In the batch learning method, when data for 10,000 steps of transitions and observations is available, for example, the state transition probability table and the observation probability table are generated from all 10,000 steps and stored. In the additional learning method, by contrast, the tables are first generated from, for example, 1,000 steps of transitions and observations and stored. The values in the tables are then changed on the basis of the next 1,000 steps and stored again, and so on, so that the internal model data is updated through repeated learning.

Learning of an action-expanded HMM by a robot that moves through a maze on its own, for example, calls for the additional learning method. Batch learning cannot, in principle, adapt to changes such as a change in the structure of the maze, so achieving better performance in a changing environment requires additional learning, which feeds the results of the robot's operation back into the model.

However, the question of how to integrate the "learned memory structure" with "new experience" during additional learning remains unsolved. On the one hand, new experience should be reflected promptly so that adaptation is fast; on the other hand, doing so risks destroying the memory structure established so far.

For example, suppose that a robot learning the structure of the maze shown in FIG. 28 learns once, stores the internal model data, and then keeps moving for a long time within the range indicated by circle 101 in the figure. The internal model data corresponding to positions within the range indicated by circle 102 may then be destroyed. That is, internal model data for circle 102 that had been properly learned and stored may be updated erroneously. In additional learning, the internal model data is updated only on the basis of newly obtained transitions and observations, so a position within circle 101 may be mistakenly recognized as a node corresponding to a position within circle 102.

Conventional approaches to this problem in additional learning include holding the internal model data separately for each range of the maze, and learning while rehearsing, from the current memory, the internal model data obtained by past learning.

Even with these conventional approaches, however, problems remain: for example, new experience is not reflected in the separated past internal model data, or the past internal model data to be rehearsed is generated under the influence of the new experience. It has therefore been difficult, with conventional approaches, to apply additional learning to a large-scale HMM and have it function as a practical model. Batch learning over the data used in past learning together with the data used in new learning would give an appropriate result, for example, but would require enormous storage capacity and computation.

Next, a technique is described that enables stable learning by the additional learning method in the action-expanded HMM, which is inevitably large in scale.

In the present invention, the learning device 34 performs learning by the following additional learning method, so that it achieves better performance in a changing environment while learning stably. Specifically, a frequency variable for estimating the state transition probability and a frequency variable for estimating the observation probability, both described later, are calculated and stored, which makes learning by the additional learning method stable in the action-expanded HMM.

Learning by the batch learning method can, in other words, be regarded as the sum of learning based on the transitions and observations obtained in several time periods. For example, the whole data D_A of transitions and observations used in batch learning can be considered to be composed as shown in FIG. 29. That is, the whole data D_A consists of data set D_1 obtained in the first time period, data set D_2 obtained in the second time period, data set D_3 obtained in the third time period, and so on.

The state transition probability in learning of the action-expanded HMM is estimated by Expression (3) above. Here, consider the case in which several data sets exist, as shown in FIG. 29.

The estimate a´_ij(k)⁽ⁿ⁾ of the state transition probability for the n-th learning data set D_n can be obtained by Expression (15).

... (15)

In the description of the estimation of the state transition probability, the notation t∈D_n stands for (t, t+1 ∈ D_n) unless otherwise noted. The learning data set D_n is assumed to contain information representing the action executed at each time, the node at each time, and the observation symbol at each time.

The numerator of Expression (15) can be said to represent the frequency with which, within the learning data set D_n, executing action c_k caused a transition from node i to node j. The denominator of Expression (15) can be said to represent the frequency with which, within D_n, executing action c_k caused a transition from node i to any other node.

Now define the variable χ_ij(k)⁽ⁿ⁾, which represents the expression corresponding to the numerator of Expression (15), as shown in Expression (16).

... (16)

Expression (17) follows from Expression (16).

... (17)

Expression (18) is derived from Expression (17) and Expression (15).

... (18)

The estimate of the state transition probability can thus be expressed in terms of the variable χ_ij(k)⁽ⁿ⁾.
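The chain from Expression (15) to Expression (18) can be summarized by the following sketch. It is reconstructed only from the verbal description above and is not a copy of the original expressions: N_ij(k)⁽ⁿ⁾ and M_i(k)⁽ⁿ⁾ are placeholder names, introduced here for illustration, for the numerator and the denominator of Expression (15), and the third line assumes that the denominator counts the transitions from node i under action c_k summed over all destination nodes j.

```latex
\begin{aligned}
a'_{ij}(k)^{(n)} &= \frac{N_{ij}(k)^{(n)}}{M_{i}(k)^{(n)}}
  && \text{(form of (15))} \\
\chi_{ij}(k)^{(n)} &\equiv N_{ij}(k)^{(n)}
  && \text{(definition, cf. (16))} \\
\sum_{j} \chi_{ij}(k)^{(n)} &= M_{i}(k)^{(n)}
  && \text{(cf. (17))} \\
a'_{ij}(k)^{(n)} &= \frac{\chi_{ij}(k)^{(n)}}{\sum_{j} \chi_{ij}(k)^{(n)}}
  && \text{(cf. (18))}
\end{aligned}
```

Under this reading, storing χ_ij(k)⁽ⁿ⁾ alone is enough to recover the probability estimate, because the normalizing denominator can be recomputed from the stored values.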

Because the variable χ_ij(k)⁽ⁿ⁾ corresponds to the numerator of Expression (15) and represents the frequency with which, within the learning data set D_n, executing action c_k caused a transition from node i to node j, it is referred to as the frequency variable for estimating the state transition probability.

In the present invention, to make additional learning stable, the estimate of the state transition probability is obtained using the frequency variable χ_ij(k)⁽ⁿ⁾ described above. That is, each time the learning device 34 performs learning based on one learning data set, it updates the frequency variable and stores it in the internal model data storage unit 37 as part of the internal model data.

In other words, when new learning is performed, the frequency variable corresponding to the past learning data sets is read out and updated by adding to it the frequency variable obtained from the new learning. Additional learning is then carried out by computing the estimate of the state transition probability from the updated frequency variable. In this way, a result almost equivalent to batch learning over the learning data sets D_1, D_2, D_3, ... together can be obtained.
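The idea of accumulating a frequency variable and normalizing only when a probability is needed can be sketched as follows. This is a simplified illustration: the class name and method names are hypothetical, and each step is counted as a hard (node, action, next node) triple, whereas the frequency variable of Expression (16) is computed from the model during learning.

```python
from collections import defaultdict


class TransitionCounter:
    """Stores accumulated transition frequencies instead of probabilities."""

    def __init__(self):
        # chi[(i, k)][j]: accumulated frequency of moving i -> j under action k
        self.chi = defaultdict(lambda: defaultdict(float))

    def add_dataset(self, steps):
        """Add one learning data set given as (node, action, next_node) triples."""
        for i, k, j in steps:
            self.chi[(i, k)][j] += 1.0

    def transition_prob(self, i, k, j):
        """Normalize the stored frequencies on demand."""
        row = self.chi.get((i, k))
        if not row:
            return 0.0  # no transition from node i under action k seen so far
        return row.get(j, 0.0) / sum(row.values())


counter = TransitionCounter()
counter.add_dataset([(0, "east", 1), (0, "east", 1), (0, "east", 2)])  # first data set
counter.add_dataset([(0, "east", 2)])                                  # later data set
```

After both calls, the stored frequencies for node 0 and action "east" are 2 for node 1 and 2 for node 2, the same totals that a single pass over all four steps would give.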

Next, the integration of the internal model data obtained from several rounds of learning is described, that is, the integration of the state transition probability estimates a´_ij(k)⁽¹⁾, a´_ij(k)⁽²⁾, ..., a´_ij(k)⁽ⁿ⁾, ... calculated from the learning data sets D_1, D_2, ..., D_n, ....

One conceivable approach is to set weights w_n (Σw_n = 1) and integrate the estimates a´_ij(k)⁽¹⁾, a´_ij(k)⁽²⁾, ..., a´_ij(k)⁽ⁿ⁾, ... as shown in Expression (19).

... (19)

Expression (19) multiplies each of the state transition probability estimates by the weight w_1, w_2, ..., w_n, ... of the corresponding learning data set and sums the results.

However, because the present invention obtains the estimate of the state transition probability from the frequency variable corresponding to each learning data set, as described above, integration by Expression (19) is not suitable.

Since the estimates are obtained from the frequency variables, the integration must take the reliability of each state transition probability estimate into account. That is, the weights must be set in consideration of the amount of data (sequence length) in each learning data set.

The frequency variable corresponding to a learning data set can also vary with the state transition probabilities already set by past learning. For example, a learning data set in which many low-probability transitions occurred tends to yield small frequency variable values, and one in which many high-probability transitions occurred tends to yield large values, because, as noted above, the frequency variable is given by the expression corresponding to the numerator of Expression (15). The weights must therefore also be set in consideration of the magnitude of the frequency variable.

In the present invention, the integrated estimate a´_ij(k) of the state transition probability is obtained by Expression (20).

... (20)

Here, as noted above, the weight w_n must be taken into account. Specifically, for the frequency variable χ_ij(k)⁽ⁿ⁾ corresponding to the learning data set D_n of sequence length T_n, the weight w_n is set so as to satisfy the relationship shown in Expression (21).

... (21)

If the frequency variable χ_ij(k)⁽ⁿ⁾ is accumulated over all data sets in this way, with the weight adjusted for each learning data set according to its sequence length, a result almost equivalent to batch learning over all the data together can be obtained. That is, the frequency variable χ_ij(k) is obtained by Expression (22), and the integrated estimate a´_ij(k) of the state transition probability is obtained from χ_ij(k) as described above with reference to Expression (20).

... (22)

In this way, a result almost equivalent to batch learning over all learning data sets can be obtained without, for example, storing a state transition probability table for each of the learning data sets D_1, D_2, ..., D_n, .... That is, the frequency variable obtained by learning the data set D_n is added to the stored frequency variable obtained by learning up to the data set D_(n−1), and the estimate of the state transition probability is computed from the sum. The result is almost equivalent to batch learning over D_1, D_2, ..., D_n together.
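The weighted accumulation and the final normalization can be sketched as below. The nested-dictionary layout `chi[(i, k)][j]` and the function names are assumptions made for illustration, and the weight `w_n` is simply passed in by the caller: the relationship that fixes its value is Expression (21), which is not reproduced in this text.

```python
def accumulate(total, chi_n, w_n):
    """Add w_n * chi_n to the running total, in the spirit of Expression (22)."""
    for key, row in chi_n.items():           # key = (node i, action k)
        dest = total.setdefault(key, {})
        for j, value in row.items():
            dest[j] = dest.get(j, 0.0) + w_n * value
    return total


def normalize(total):
    """Turn accumulated frequencies into probabilities over destination nodes."""
    probs = {}
    for key, row in total.items():
        s = sum(row.values())
        probs[key] = {j: v / s for j, v in row.items()} if s > 0 else {}
    return probs


total = {}
accumulate(total, {(0, "east"): {1: 3.0, 2: 1.0}}, w_n=1.0)   # first data set
accumulate(total, {(0, "east"): {2: 2.0}}, w_n=0.5)           # second, down-weighted
estimate = normalize(total)
```

Only the single table `total` needs to be kept between learning rounds; the per-data-set tables can be discarded once they have been added.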

The observation probability in learning of the action-expanded HMM, on the other hand, is estimated by Expression (4) above. Here again, consider the case in which several data sets exist, as shown in FIG. 29.

The estimate b´_j(o)⁽ⁿ⁾ of the observation probability for the n-th learning data set D_n can be obtained by Expression (23).

... (23)

Unlike in the description of the state transition probability, the notation t∈D_n here does not stand for (t, t+1 ∈ D_n).

The learning data set D_n is again assumed to contain information representing the action executed at each time, the node at each time, and the observation symbol at each time. The notation o_t = o indicates that the observation symbol at time t is o.

The numerator of Expression (23) can be said to represent the frequency with which, within the learning data set D_n, the observation symbol o was observed at node j. The denominator of Expression (23) can be said to represent the frequency with which, within D_n, any observation symbol was observed at node j.

Now define the variable ω_j(o)⁽ⁿ⁾, which represents the expression corresponding to the numerator of Expression (23), as shown in Expression (24).

... (24)

Expression (25) follows from Expression (24).

... (25)

Expression (26) is derived from Expression (25) and Expression (23).

... (26)

The estimate of the observation probability can thus be expressed in terms of the variable ω_j(o)⁽ⁿ⁾.

Because the variable ω_j(o)⁽ⁿ⁾ corresponds to the numerator of Expression (23) and represents the frequency with which, within the learning data set D_n, the observation symbol o was observed at node j, it is referred to as the frequency variable for estimating the observation probability.

In the present invention, as with the state transition probability, the estimate of the observation probability is obtained using the variable ω_j(o)⁽ⁿ⁾ described above, so that additional learning is stable. That is, each time the learning device 34 performs learning based on one learning data set, it updates the frequency variable and stores it in the internal model data storage unit 37 as part of the internal model data.

When new learning is performed, the frequency variable corresponding to the past learning data sets is read out and updated by adding to it the frequency variable obtained from the new learning. Additional learning is then carried out by computing the estimate of the observation probability from the updated frequency variable.
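The same read, add, and renormalize cycle for the observation side might look like the following sketch. The layout `omega[j][o]`, the function names, and the symbol strings are illustrative assumptions, and samples are counted as hard (node, symbol) pairs for simplicity.

```python
def add_observations(omega, samples):
    """Add one data set of (node j, symbol o) samples to the stored frequencies."""
    for j, o in samples:
        row = omega.setdefault(j, {})
        row[o] = row.get(o, 0.0) + 1.0
    return omega


def observation_prob(omega, j, o):
    """Estimate the observation probability at node j from the stored frequencies."""
    row = omega.get(j, {})
    total = sum(row.values())
    return row.get(o, 0.0) / total if total else 0.0


omega = {}                                                        # stored internal model data
add_observations(omega, [(3, "wall"), (3, "wall"), (3, "door")])  # past learning
add_observations(omega, [(3, "door")])                            # new learning
```

Because only frequencies are stored, the later data set shifts the estimate for node 3 in proportion to how much data it contributes.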

Next, the integration of the internal model data obtained from several rounds of learning is described, that is, the integration of the observation probability estimates b´_j(o)⁽¹⁾, b´_j(o)⁽²⁾, ..., b´_j(o)⁽ⁿ⁾, ... calculated from the learning data sets D_1, D_2, ..., D_n, ....

In this integration, a weight w´_n must be taken into account for the same reason as in the integration of the state transition probability estimates.

In the present invention, the integrated estimate b´_j(o) of the observation probability is obtained by Expression (27).

... (27)

Here, for the frequency variable ω_j(o)⁽ⁿ⁾ corresponding to the learning data set D_n of sequence length T_n, the weight w´_n is set so as to satisfy the relationship shown in Expression (28).

... (28)

If the frequency variable ω_j(o)⁽ⁿ⁾ is accumulated over all data sets in this way, with the weight adjusted for each learning data set according to its sequence length, a result almost equivalent to batch learning over all the data together can be obtained. That is, the frequency variable ω_j(o) is obtained by Expression (29), and the integrated estimate b´_j(o) of the observation probability is obtained from ω_j(o) as described above with reference to Expression (27).

... (29)

In this way, a result almost equivalent to batch learning over all learning data sets can be obtained without, for example, storing an observation probability table and a state transition probability table for each of the learning data sets D_1, D_2, ..., D_n, .... That is, the frequency variable obtained by learning the data set D_n is added to the stored frequency variable obtained by learning up to the data set D_(n−1), and the estimate of the observation probability is computed from the sum. The result is almost equivalent to batch learning over D_1, D_2, ..., D_n together.

By contrast, storing the results of Expression (15) or Expression (23) as they are and performing additional learning on them does not give a result almost equivalent to batch learning over D_1, D_2, ..., D_n together. This is because the results of Expression (15) and Expression (23) are calculated as probabilities and are normalized so that the probabilities of all possible outcomes sum to 1. To obtain a result almost equivalent to batch learning from stored probabilities, it would be necessary, for example, to store a table for each learning data set. For this reason, the present invention stores the frequency variables given by the expressions corresponding to the numerators of Expression (15) and Expression (23).
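A small numerical illustration of why normalized probabilities are a poor thing to store: the sketch below compares pooling raw frequencies with averaging already-normalized probabilities for two data sets of very different length. The counts and the equal-weight averaging are made up for illustration and do not come from the specification.

```python
def normalize(counts):
    """Convert a frequency table into probabilities that sum to 1."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def pooled_estimate(datasets):
    """Add the raw frequencies first, then normalize once (frequency-variable style)."""
    pooled = {}
    for counts in datasets:
        for key, value in counts.items():
            pooled[key] = pooled.get(key, 0) + value
    return normalize(pooled)


def averaged_estimate(datasets):
    """Normalize each data set first, then average the probabilities equally."""
    averaged = {}
    for counts in datasets:
        for key, p in normalize(counts).items():
            averaged[key] = averaged.get(key, 0.0) + p / len(datasets)
    return averaged


long_run = {"A": 98, "B": 2}    # 100 transitions from one node under one action
short_run = {"A": 1, "B": 1}    # only 2 transitions under the same conditions
batch_like = pooled_estimate([long_run, short_run])
naive = averaged_estimate([long_run, short_run])
```

The pooled estimate stays close to what the 100-step data set says, while the averaged estimate lets the 2-step data set pull the result halfway toward 0.5, because normalization has discarded how much data stood behind each probability.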

If the state transition probability and the observation probability are obtained in this way, learning by the additional learning method can deliver better performance in a changing environment while remaining stable.

Moreover, doing so does not require storing all the internal model data corresponding to each past learning, so that, for example, the storage capacity of the internal model data storage unit 37 can be kept small. Furthermore, the amount of computation needed to update the internal model data as a result of learning by the additional learning method can be reduced, so that changes in the environment can be recognized more quickly.

The description of the additional learning method so far has dealt with the case where discrete observation signals are acquired. When continuous observation signals are acquired, the parameters of the observation probability density function b_j(o) may be re-estimated using the signal distribution weighted by the weighting coefficient γ_t(j), which applies when the agent is in state j at time t. In doing so, the weighting coefficient γ_t(j) must be adjusted so as to satisfy expression (30).

... (30)

In this case, γ′_t(j) has a meaning equivalent to a frequency.

The mean vector and the covariance matrix of the observation signal may then be estimated using γ′_t(j) ≡ w′_n γ_t(j).

As parameters of a model with a log-concave or elliptically symmetric probability density, such as a Gaussian distribution, the mean vector μ′_j and the covariance matrix U′_j of the observation signal in state j can be used. The mean vector μ′_j and the covariance matrix U′_j can be obtained by expression (31) and expression (32), respectively.

... (31)

... (32)
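As an illustration only, the following sketch computes a mean vector and a covariance matrix from observations weighted by frequency-like coefficients. It uses the standard weighted estimators; expressions (31) and (32) are not reproduced in this text and may differ, for instance by also accumulating statistics stored from earlier data sets. The function name and the sample data are hypothetical.

```python
def weighted_mean_cov(observations, weights):
    """Weighted mean vector and covariance matrix (standard estimators).

    observations: list of equal-length vectors o_t
    weights:      one frequency-like weight per observation
    """
    total = sum(weights)
    dim = len(observations[0])
    mean = [
        sum(w * o[d] for o, w in zip(observations, weights)) / total
        for d in range(dim)
    ]
    cov = [
        [
            sum(w * (o[a] - mean[a]) * (o[b] - mean[b])
                for o, w in zip(observations, weights)) / total
            for b in range(dim)
        ]
        for a in range(dim)
    ]
    return mean, cov

# Two 2-D observations with equal weight.
mean, cov = weighted_mean_cov([[0.0, 0.0], [2.0, 2.0]], [1.0, 1.0])
```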

As described above, stability during learning by the additional learning method can be ensured. In the additional learning method, however, the internal model data is often updated with a larger weight given to the most recent learning result, because new experience is considered better suited to learning changes in the environment appropriately.

For example, consider giving 100 samples of new data to a learning device that has already finished learning 100,000 samples, and having it learn them by the additional learning method. Because the amount of newly learned data (100) is small relative to what has already been learned (100,000), learning the new data as is gives the new learning an influence of only 0.1%. In such a case, it is hard to say that changes in the environment are being learned appropriately.

It is therefore convenient to be able to specify a learning rate, that is, the degree of influence of the new learning. In the above example, specifying a learning rate of 0.1 (10%) increases the influence by a factor of 100 without changing the amount of newly learned data.
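The arithmetic of this example can be checked with a few lines. Following the text, the influence of the new data is approximated as the ratio of new samples to previously learned samples; the variable names are hypothetical.

```python
old_samples = 100_000   # samples already learned
new_samples = 100       # newly supplied samples

# Influence of the new data when it is learned as is (about 0.1%).
natural_influence = new_samples / old_samples

# Specifying a learning rate of 0.1 (10%) instead.
learning_rate = 0.1
boost = learning_rate / natural_influence   # roughly a factor of 100
```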

In the present invention, learning stability is not impaired even when a learning rate is specified as described above.

As described above, the frequency variable χ_ij(k) for estimating the state transition probability is updated as shown in expression (33). The symbol ⇒ in expression (33) indicates that χ_ij(k) is updated to the right-hand side.

... (33)

The frequency variable ω_j(o) for estimating the observation probability is updated as shown in expression (34). The symbol ⇒ in expression (34) indicates that ω_j(o) is updated to the right-hand side.

... (34)

When a learning rate r (0 ≤ r ≤ 1) is specified, the present invention computes the weight W_n and the weight z_i(k)^(n) shown in expression (35) in order to calculate the frequency variable for estimating the state transition probability. The weight W_n is applied to the frequency variable obtained from the new learning, and the weight z_i(k)^(n) is applied to the frequency variable that is already stored.

... (35)

The frequency variable for estimating the state transition probability is then calculated by expression (36).

... (36)

The weight z_i(k)^(n) in expression (35) is introduced in view of the fact that the weight W_n grows monotonically as learning is repeated; it need not be used in the actual computation.

Likewise, when a learning rate r (0 ≤ r ≤ 1) is specified, the weight W′_n and the weight z_i^(n) shown in expression (37) are computed in order to calculate the frequency variable for estimating the observation probability. The weight W′_n is applied to the frequency variable obtained from the new learning, and the weight z_i^(n) is applied to the frequency variable that is already stored.

... (37)

The frequency variable for estimating the observation probability is then calculated by expression (38).

... (38)

The weight z_i^(n) in expression (37) is introduced in view of the fact that the weight W′_n grows monotonically as learning is repeated; it need not be used in the actual computation.

The description so far of the additional learning method with a specified learning rate has dealt with the case where discrete observation signals are acquired. When continuous observation signals are acquired, the distribution parameters may likewise be estimated after applying the corresponding weight conversion.

In this way, learning stability is not impaired even when a learning rate is specified.

The estimate a′_ij(k) of the state transition probability can be obtained from the frequency variables by expression (20), as described above. In practice, however, the result can become unstable when the denominator Σ_j χ_ij(k) is small. Such instability impairs the reliability of the internal model data obtained by learning, affects subsequent recognition of the environment, and can cause the agent to recognize the environment incorrectly. Moreover, the erroneous recognition result recursively degrades the results of learning by the additional learning method, so this problem needs to be solved.

Here, let N_ik = Σ_j χ_ij(k). To solve the problem of unstable results when N_ik is small, the state transition probability may be multiplied by a penalty coefficient that depends on how small N_ik is. That is, with the penalty coefficient denoted η(N_ik), the estimate a′_ij(k) of the state transition probability may be obtained by expression (39).

$$a'_{ij}(k) = \eta(N_{ik}) \, \frac{\chi_{ij}(k)}{N_{ik}}$$ ... (39)

Here, the function η(x) is a monotonically increasing function whose range satisfies 0 ≤ η(x) ≤ 1 over the domain 0 ≤ x.

The function η(x) is, for example, the function given by expression (40).

... (40)

In expression (40), α (> 0) and β are parameters that are adjusted appropriately as needed; they may, for example, be adjusted according to the specified learning rate r.
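The effect of the penalty coefficient can be sketched as follows. Expression (40) is not reproduced in this text, so the logistic form of `eta` below is only an assumption: it is one monotonically increasing function with values between 0 and 1 that takes two parameters `alpha` (> 0) and `beta`. The function names and default values are hypothetical.

```python
import math

def eta(x, alpha=1.0, beta=5.0):
    """Assumed penalty function: logistic curve in x (not expression (40))."""
    return 1.0 / (1.0 + math.exp(-alpha * (x - beta)))

def penalized_transition_probs(chi_row, alpha=1.0, beta=5.0):
    """Estimate a'_ij(k) for one (i, k) from the frequency variables chi_ij(k).

    Each plain estimate chi_ij(k) / N_ik is scaled by eta(N_ik), so rows
    backed by little evidence are pushed towards zero.
    """
    n_ik = sum(chi_row)
    if n_ik == 0:
        return [0.0 for _ in chi_row]
    scale = eta(n_ik, alpha, beta) / n_ik
    return [scale * c for c in chi_row]

sparse = penalized_transition_probs([0.5, 0.5])    # N_ik = 1
dense = penalized_transition_probs([50.0, 50.0])   # N_ik = 100
```

With these settings the sparsely observed row is almost suppressed, whereas the well-observed row keeps nearly its full probability mass. Note that a penalized row no longer sums to 1.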

As described above, the present invention stores, as internal model data, the frequency variables for estimating the state transition probability and the frequency variables for estimating the observation probability. Consequently, the processing that imposes the one-state one-observation constraint and the action transition constraint described above must also be applied to these frequency variables.

The split algorithm is applied to the frequency variables for estimating the state transition probability and the frequency variables for estimating the observation probability as follows.

An example in which the node s_j is split into K nodes is described here. The k-th of the K nodes obtained by splitting the node s_j is denoted s_j^k, and the state transition probability from the node s_i to the node s_j^k after the split is denoted a^k_ij. Likewise, the state transition probability from the node s_j^k to the node s_i after the split is denoted a^k_ji.

The learning device 34 obtains, by expression (41), the frequency variable ω_j(o) for estimating the observation probability, which corresponds to the observation probability b_j(o) of the observation symbol o.

... (41)

The learning device 34 also sets the frequency variable χ^k_ij for estimating the state transition probability, which corresponds to the state transition probability a^k_ij, to the pre-split frequency variable χ_ij apportioned in proportion to the frequency variables ω_j(o_k) for estimating the observation probability, which correspond to the pre-split observation probabilities of the respective observation symbols.

Furthermore, the learning device 34 sets the frequency variable χ^k_ji for estimating the state transition probability, which corresponds to the state transition probability a^k_ji, to the pre-split frequency variable χ_ji apportioned in the same proportion, that is, by the ratio of the frequency variables ω_j(o_k).
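The proportional apportioning used in the split can be sketched as below. The sketch assumes that `omega_j` holds only the frequency variables ω_j(o_k) of the K observation symbols assigned to the K new nodes; the function name and the sample values are hypothetical.

```python
def split_frequency(chi, omega_j):
    """Apportion one pre-split frequency variable among the K new nodes.

    chi:     pre-split frequency variable (chi_ij or chi_ji)
    omega_j: frequency variables omega_j(o_k), one per new node s_j^k
    """
    total = sum(omega_j)
    return [chi * w / total for w in omega_j]

# A node observed with o_1 three times as often as with o_2 is split in two.
parts = split_frequency(12.0, [3.0, 1.0])
```

The parts sum to the original frequency variable, so the total amount of evidence is preserved by the split.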

The forward merge algorithm is applied to the frequency variables for estimating the state transition probability and the frequency variables for estimating the observation probability as follows.

An example in which L nodes s_j^{m,l} (l = 1, ..., L) are merged into one node s_j^m is described here. The state transition probability from the node s_i to the node s_j^m after the merge is denoted a_ij^m, and the state transition probability from the node s_j^m to the node s_i after the merge is denoted a_ji^m. The vector whose elements are the observation probabilities of the respective observation symbols after the merge is denoted b_j^m.

The learning device 34 sets the frequency variable χ_ij^m for estimating the state transition probability, which corresponds to the state transition probability a_ij^m, to Σ_l χ_ij^{m,l}. Here, χ_ij^{m,l} is the frequency variable for estimating the state transition probability that corresponds to the state transition probability from the node s_i to the node s_j^{m,l} before the merge.

The learning device 34 also sets the frequency variable χ_ji^m for estimating the state transition probability, which corresponds to the state transition probability a_ji^m, to Σ_l χ_ji^{m,l}. Here, χ_ji^{m,l} is the frequency variable for estimating the state transition probability that corresponds to the state transition probability from the node s_j^{m,l} to the node s_i before the merge.

Furthermore, the learning device 34 sets the vector ω_j^m, whose elements are the frequency variables for estimating the observation probability corresponding to the respective elements of the vector b_j^m, to Σ_l ω_j^{m,l}.

When all merges are finished, the learning device 34 recalculates the state transition probabilities and the observation probabilities using the corrected frequency variables for estimating the state transition probability and for estimating the observation probability.
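The merge of frequency variables followed by recalculation of the probabilities can be sketched as follows, here for the observation frequency vectors of two nodes being merged. The function names and the sample values are hypothetical, and the recalculation is shown as a plain normalization without the weights and penalty discussed earlier.

```python
def merge_frequencies(freq_rows):
    """Element-wise sum of the frequency vectors of the nodes being merged."""
    return [sum(column) for column in zip(*freq_rows)]

def normalize(row):
    """Recalculate probabilities from a merged row of frequency variables."""
    total = sum(row)
    return [value / total for value in row]

# omega vectors of two nodes s_j^{m,1} and s_j^{m,2} over two symbols.
omega_merged = merge_frequencies([[4.0, 0.0], [2.0, 2.0]])
b_merged = normalize(omega_merged)
```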

The backward merge algorithm is applied to the frequency variables for estimating the state transition probability and the frequency variables for estimating the observation probability as follows.

Furthermore, the learning device 34 sets the vector ω_i^m, whose elements are the frequency variables for estimating the observation probability corresponding to the respective elements of the vector b_i^m, to Σ_l ω_i^{m,l}.

In this way, the one-state one-observation constraint and the action transition constraint described above are also imposed on the frequency variables for estimating the state transition probability and the frequency variables for estimating the observation probability.

The description so far has covered techniques for stable learning by the additional learning method in an action-extended HMM, which is necessarily large in scale.

The above description has dealt with an example of an action-extended HMM having a three-dimensional state transition probability table and a two-dimensional observation probability table, as described above with reference to FIG. 8. Normally, with N nodes, M observation symbols, and K actions, the number of parameters to be calculated is N²K + NM, and the learning algorithm is defined on the assumption that the values of N, M, and K are constant.
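The parameter count N²K + NM follows directly from the table shapes: K state transition tables of N × N entries each, plus one observation table of N × M entries. A small sketch with hypothetical sizes shows how the count changes when one of the dimensions grows.

```python
def parameter_count(n_nodes, n_symbols, n_actions):
    """Number of table entries: N^2 * K transition + N * M observation."""
    return n_nodes ** 2 * n_actions + n_nodes * n_symbols

base = parameter_count(10, 5, 4)          # N=10, M=5, K=4
one_more_symbol = parameter_count(10, 6, 4)
one_more_node = parameter_count(11, 5, 4)
```

Adding one observation symbol adds N entries, whereas adding one node adds (2N + 1)K + M entries.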

As learning proceeds, however, it may become necessary to change the values of N, M, and K. For example, when a new kind of part is added to the maze in which the robot moves, the number of types of observation symbols increases, so the value of M must be increased.

Next, the measures that can be taken when the number of nodes, the number of observation symbols, or the number of actions must be changed as learning proceeds are described.

FIG. 30 illustrates the effect of an increase in the number of types of observation symbols. As shown in the figure, when the number of types of observation symbols increases, the observation probability table is extended in the row direction (horizontally in the figure). That is, observation probability values for the region 121 must be newly set. The observation probability table is subject to the constraint that the observation probability values in each row sum to 1.0.

In addition, as shown in FIG. 30, because the observation probability table is extended in the row direction (horizontally in the figure), the table of frequency variables for estimating the observation probability must also be extended. That is, frequency variable values for the region 122 must be newly set.

In the present invention, when the observation probability table needs to be extended as shown in FIG. 30, the learning device 34 performs the following processing. Described here is the processing of the learning device 34 in the case where, for example, the robot is instructed to extend the observation probability table as shown in FIG. 30 on the premise that the number of types of observation symbols will increase by a predetermined number.

Let M+i be the index corresponding to a new observation symbol o_{M+i}, and suppose that an (M+i)-th column is added to the observation probability table.

The learning device 34 sets the observation probability values in the (M+i)-th column of the observation probability table to non-zero elements of an appropriate magnitude. The value of these non-zero elements is determined as follows.

As shown in expression (42), with M denoting the number of observation symbols before the new observation symbol is added, all observation probability values in the (M+i)-th column are set to 1/M.

$$b_j(o_{M+i}) = \frac{1}{M} \quad (j = 1, \ldots, N)$$ ... (42)

Alternatively, as shown in expression (43), for each row of the observation probability table, the number of observation symbols whose observation probability b_j(·) is at or above a threshold is counted, and that number n_j is used to determine the observation probability value to be set in the (M+i)-th column. Here, b_j(·) denotes the observation probability of each observation symbol that is at or above the threshold.

... (43)

After setting non-zero elements of an appropriate magnitude in the (M+i)-th column of the observation probability table as shown in expression (42) or expression (43), the learning device 34 adjusts the table so that the observation probability values in each row sum to 1.0. That is, each non-zero observation probability b_j(·) in the observation probability table is updated as shown in expression (44). This completes the extension of the observation probability table.

... (44)

Furthermore, the learning device 34 sets all values in the (M+i)-th column of the table of frequency variables for estimating the observation probability to 0 (zero). This completes the extension of the table of frequency variables for estimating the observation probability.
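The extension steps for one new observation symbol can be sketched as follows, using the 1/M initialization of expression (42). Expression (44) is not reproduced in this text; the sketch assumes that the adjustment is a division of each row by its sum. The function name and the sample tables are hypothetical.

```python
def add_observation_symbol(obs_table, freq_table):
    """Append one column for a new observation symbol.

    obs_table:  N x M observation probabilities, each row summing to 1
    freq_table: N x M frequency variables for the observation probability
    """
    m = len(obs_table[0])              # symbols before the addition
    new_obs = []
    for row in obs_table:
        extended = row + [1.0 / m]     # new column initialized to 1/M
        total = sum(extended)          # assumed renormalization per row
        new_obs.append([p / total for p in extended])
    # The new column of the frequency table starts at zero.
    new_freq = [row + [0.0] for row in freq_table]
    return new_obs, new_freq

obs, freq = add_observation_symbol(
    [[0.5, 0.5], [1.0, 0.0]],
    [[5.0, 5.0], [8.0, 0.0]],
)
```

Entries that were zero before the extension stay zero after the renormalization, so only the non-zero observation probabilities are changed.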

The learning device 34 then performs learning by the additional learning method on a learning data set that includes the new observation symbol, with a predetermined learning rate specified, and updates the internal model data. As a result, the frequency variables for estimating the state transition probability, the frequency variables for estimating the observation probability, and the values of the observation probability table and the state transition probability table are updated.

In this way, the internal model data can be updated appropriately even when a new observation symbol is observed during learning by the additional learning method.

It is also possible to shrink the observation probability table in the column direction when, for example, a given observation symbol becomes unnecessary during learning.

In this case, with k denoting the index of the observation symbol that has become unnecessary, the learning device 34 deletes the k-th column from the observation probability table so that the observation probability b_j(k) at the node s_j no longer exists.

Similarly, the learning device 34 deletes the k-th column from the table of frequency variables for estimating the observation probability so that ω_j(k) no longer exists.

Furthermore, the learning device 34 recalculates each value in the reduced observation probability table using the frequency variables for estimating the observation probability.
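Removing an observation symbol can be sketched in the same style. The recalculation is shown as a plain normalization of the remaining frequency variables, and the uniform fallback for a row with no remaining evidence is an assumption of this example, as are the function name and the sample values.

```python
def remove_observation_symbol(freq_table, k):
    """Delete column k and rebuild the observation probability table."""
    new_freq = [row[:k] + row[k + 1:] for row in freq_table]
    new_obs = []
    for row in new_freq:
        total = sum(row)
        if total > 0:
            new_obs.append([c / total for c in row])
        else:
            # Assumed fallback: no evidence left for this node.
            new_obs.append([1.0 / len(row)] * len(row))
    return new_obs, new_freq

obs, freq = remove_observation_symbol([[2.0, 6.0, 2.0], [0.0, 3.0, 0.0]], 1)
```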

The learning device 34 then performs learning by the additional learning method on learning data sets obtained after the given observation symbol became unnecessary, with a predetermined learning rate specified, and updates the internal model data. As a result, the frequency variables for estimating the state transition probability, the frequency variables for estimating the observation probability, and the values of the observation probability table and the state transition probability table are updated.

Further, when, for example, the maze in which the robot moves is extended in a given direction, the number of nodes increases, so the value of the number of nodes N must be increased.

FIG. 31 illustrates the effect of an increase in the number of nodes. As shown in the figure, when the number of nodes increases, the state transition probability table is extended in both the row and column directions. That is, state transition probability values must be newly set for the inverted-L-shaped region 131-1 in the first state transition probability table in FIG. 31. Similarly, state transition probability values must be newly set for the inverted-L-shaped regions 131-2, 131-3, ... of the state transition probability tables corresponding to the respective actions. In other words, all K state transition probability tables, one per action, must be extended and given new state transition probability values. The state transition probability table is subject to the constraint that the state transition probability values in each row sum to 1.0.

When the number of nodes increases, the observation probability table is also extended in the column direction (vertically in the figure). That is, observation probability values for the region 134 must be newly set. The observation probability table is subject to the constraint that the observation probability values in each row sum to 1.0.

Furthermore, although not shown in the figure, the table of frequency variables for estimating the state transition probability and the table of frequency variables for estimating the observation probability must be extended in the same way and given new values.

In the present invention, when the state transition probability table and the observation probability table need to be extended as shown in FIG. 31, the learning device 34 performs the following processing. Described here is the processing of the learning device 34 in the case where, for example, the robot is instructed to extend the state transition probability table and the observation probability table as shown in FIG. 31 on the premise that the number of nodes will increase by a predetermined number.

Let N+i be the index corresponding to a new node s_{N+i}, and suppose that an (N+i)-th row and an (N+i)-th column are added to the state transition probability table.

The learning device 34 sets the state transition probability values in the (N+i)-th row and the (N+i)-th column of the state transition probability table to small random elements.

Similarly, the learning device 34 adds an (N+i)-th row and an (N+i)-th column to the table of frequency variables for estimating the state transition probability and sets the values to be placed there to small random elements.

The learning device 34 identifies an action c_k that has been executed at the node s_{N+i}. It then sets the state transition probability values in the row corresponding to the node s_{N+i} of the k-th state transition probability table, which corresponds to the action c_k, to a uniform value. The probabilities of transitions to destination states that have actually been experienced may, however, be raised somewhat in view of the actual results of executing the action c_k.

The learning device 34 also identifies a transition-source node s_j from which executing the action c_k has resulted in a transition to the node s_{N+i}. It then sets the state transition probability values in the row corresponding to the node s_j of the k-th state transition probability table, which corresponds to the action c_k, as follows.

In that row, the number of transition-destination nodes s_l whose state transition probability is at or above a threshold is counted, and this number is denoted L. The state transition probability a_{j,N+i}(k) from the node s_j to the node s_{N+i} in the k-th state transition probability table is then set to 1/L.

The learning device 34 then adjusts the table so that the state transition probability values in each row sum to 1.0. That is, each state transition probability a_ij(k) in the state transition probability table is updated as shown in expression (45). This completes the extension of the state transition probability table.

... (45)

Furthermore, the learning device 34 sets all values in the added region of the table of frequency variables for estimating the state transition probability to 0 (zero). This completes the extension of the table of frequency variables for estimating the state transition probability.
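The extension of one state transition probability table by a new node can be sketched as follows. The sketch covers a single action c_k that is assumed to have been executed at the new node, so the new row is set to a uniform value. Expression (45) is not reproduced in this text, and the adjustment is assumed to be a division of each row by its sum. The function name, the threshold, and the magnitude `eps` of the small random elements are hypothetical.

```python
import random

def add_node(trans, predecessors, threshold=0.1, eps=1e-3, rng=None):
    """Extend an N x N transition table (one action) by one new node.

    trans:        N x N transition probabilities, each row summing to 1
    predecessors: indices j of nodes from which the new node was reached
    """
    rng = rng or random.Random(0)
    n = len(trans)                       # index of the new node
    # New column: small random elements for every existing row.
    table = [row + [eps * rng.random()] for row in trans]
    # New row: uniform, assuming the action was executed at the new node.
    table.append([1.0 / (n + 1)] * (n + 1))
    for j in predecessors:
        # L = number of destinations at or above the threshold in row j.
        count = sum(1 for p in trans[j] if p >= threshold)
        if count > 0:
            table[j][n] = 1.0 / count
    # Assumed adjustment: renormalize every row to sum to 1.
    return [[p / sum(row) for p in row] for row in table]

extended = add_node([[0.5, 0.5], [1.0, 0.0]], predecessors=[1])
```

In this example the row of node 1, which had a single likely destination, ends up sharing its probability mass equally between that destination and the new node, while the row of node 0 is almost unchanged.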

また、学習器３４は、観測確率テーブルの第N+i行と第N+i列に設定すべき観測確率値を、適切な大きさの非零要素とする。非零要素の値としては、例えば、１/Ｎのような一様の値とされるが、ノードs_N+iで実際に観測されたことのある観測シンボルの観測確率を引き上げるようにしてもよい。 The learning device 34 sets the observation probability values to be set in the (N + i) th row and the (N + i) th column of the observation probability table as non-zero elements having appropriate sizes. The value of the non-zero element is, for example, a uniform value such as 1 / N, but the observation probability of an observation symbol that has actually been observed at the node s _{N + i} may be increased. Good.

さらに、学習器３４は、観測確率の推定のための頻度変数のテーブルにおいて追加されたノードs_N+iに対応する第N+i 行をすべて０（零）とする。これにより、観測確率の推定のための頻度変数のテーブルの拡張は完了する。 Further, the learning device 34 sets all N + i rows corresponding to the added node s _{N + i} in the frequency variable table for estimating the observation probability to 0 (zero). This completes the expansion of the frequency variable table for estimating the observation probability.

そして学習器３４は、新たなノードを含む学習データセットに対する追加学習方式での学習を、所定の学習率の指定の下に行い、内部モデルデータを更新する。これにより、状態遷移確率の推定のための頻度変数、観測確率の推定のための頻度変数、並びに観測確率テーブルおよび状態遷移確率テーブルの各値が更新される。 Then, the learning device 34 performs learning by the additional learning method for the learning data set including the new node under the designation of a predetermined learning rate, and updates the internal model data. Thereby, the frequency variable for estimating the state transition probability, the frequency variable for estimating the observation probability, and each value of the observation probability table and the state transition probability table are updated.

Alternatively, for example, when the robot is modified so that the directions in which it can move in the maze are extended, the number of actions increases, so the value of the number of actions K must be increased.

FIG. 32 is a diagram explaining the effect of an increase in the number of actions. As shown in the figure, when the number of actions increases, the state transition probability table is extended in the depth direction. That is, for example, the state transition probability values of the third state transition probability table 141 in FIG. 32, which is the state transition probability table corresponding to the newly added action, must be newly set.

Although not shown in the figure, the frequency variable table for estimating the state transition probability must likewise be extended and its values newly set.

In the present invention, when the state transition probability table must be extended as shown in FIG. 32, the learning device 34 performs the following processing. Described here is the processing of the learning device 34 in a case where, for example, the robot is instructed to extend the state transition probability table as shown in FIG. 32 on the premise that the number of actions increases by a predetermined number known in advance.

Assume now that the index corresponding to a new action c_{K+i} is K+i, and that a (K+i)-th state transition probability table is to be added.

The learning device 34 sets all state transition probabilities in the added (K+i)-th state transition probability table to 0.

Similarly, for the frequency variable table for estimating the state transition probability, the learning device 34 adds a (K+i)-th table and sets all values in the added (K+i)-th table to 0. This completes the expansion of the frequency variable table for estimating the state transition probability.

Further, the learning device 34 identifies each node s_j at which the new action c_{K+i} has been executed. The learning device 34 then sets all the state transition probability values in the row corresponding to node s_j of the (K+i)-th state transition probability table to a uniform value. However, the state transition probability to a transition destination node that has actually been reached may be raised somewhat, taking into account the transition results observed when action c_{K+i} was actually executed. This completes the expansion of the state transition probability table.
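The depth-direction expansion described above can be sketched as follows, assuming the tables are held as nested lists indexed as tables[k][j][l]. The function name `add_action_table` and the list layout are hypothetical, and the optional raising of experienced transitions is omitted.

```python
def add_action_table(tables, visited_rows, num_nodes):
    """Append one transition table for a new action and return the new list.

    tables       -- list of K tables, each a num_nodes x num_nodes list of lists
    visited_rows -- indices j of nodes s_j at which the new action was executed
    num_nodes    -- number of nodes in the model
    """
    # Start from an all-zero table for the new action c_{K+i}.
    new_table = [[0.0] * num_nodes for _ in range(num_nodes)]
    # Rows of nodes that have executed the action get a uniform distribution.
    for j in visited_rows:
        new_table[j] = [1.0 / num_nodes] * num_nodes
    return tables + [new_table]


tables = [[[1.0, 0.0], [0.0, 1.0]]]  # K = 1 action, 2 nodes
expanded = add_action_table(tables, visited_rows=[1], num_nodes=2)
print(len(expanded), expanded[1])
```

Rows left at zero carry no information until the subsequent additional learning updates them.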

The learning device 34 then performs learning by the additional learning method on a learning data set that includes execution of the new action, under a specified predetermined learning rate, and updates the internal model data. As a result, the frequency variables for estimating the state transition probability, the frequency variables for estimating the observation probability, and the values of the observation probability table and the state transition probability table are updated.

With the processing described above, learning can be continued even when it becomes necessary, as learning proceeds, to increase the number of nodes, the number of observation symbols, or the number of actions. The processing described above is an example of a case where each table is extended as shown in FIGS. 30 to 32 on the premise that, for example, the number of types of observation symbols increases by a predetermined number that is given to the robot in advance.

However, it may not be possible to know in advance that the observation symbols, nodes, or actions will increase by a predetermined number. That is, when changes in the environment are recognized one after another through the autonomous behavior of the agent, the robot administrator, for example, cannot know in advance by how much the observation symbols, nodes, or actions will increase. Accordingly, further consideration is required when, for example, a new maze part appears at an arbitrary time, the maze is newly extended, or a new movement direction is added while the robot is moving through the maze.

Next, the expansion of the state transition probability table and the observation probability table will be described for a case where, for example, a new maze part appears or the maze is newly extended while the robot is moving through the maze. That is, an example will be described in which the agent autonomously recognizes a change in the environment and extends the state transition probability table and the observation probability table.

When the agent autonomously recognizes a change in the environment and extends the state transition probability table and the observation probability table, the agent itself must first recognize whether the environment has been newly extended. In other words, the agent must be able to recognize whether the node at which it is currently located is a node that is already a learned internal state or a node that should be newly added as an internal state. For example, when the maze is newly extended while the robot is moving through it, the robot cannot autonomously recognize the change in the environment unless, while moving through the extended part, it can recognize that it is located at a node that should be newly added.

The method of node recognition in the autonomous behavior learning apparatus 10 will now be described. Node recognition is performed by the recognizer 35 of FIG. 9. Although details will be given later, four methods are ultimately described here, taking into account that the length of the time-series information has an upper limit and that the entropy value of the recognized current state probability changes.

As described above, the recognizer 35 recognizes the node at which the robot is currently located based on the information stored in the observation buffer 33 and the action output buffer 39 and on the state transition probability table and the observation probability table stored in the internal model data storage unit 37.

Also as described above, the observation symbols o_t, o_{t+1}, o_{t+2}, ..., o_T corresponding to the observation signals acquired at times t, t+1, t+2, ..., T are stored in the observation buffer 33 as the observation symbols at the respective times. Similarly, for example, the actions c_t, c_{t+1}, c_{t+2}, ..., c_T executed at times t, t+1, t+2, ..., T are stored in the action output buffer 39 as the actions at the respective times.

Here, the information that is input to the recognizer 35 and stored in the observation buffer 33 and the action output buffer 39 is referred to as time-series information, and the length of the time-series information is represented by the variable N.

The recognition result output from the recognizer 35 is stored in the recognition result buffer 38 in association with the time at which that recognition result was output.

The recognizer 35 first sets the length N of the time-series information, acquires time-series information of length N from the observation buffer 33 and the action output buffer 39, and performs recognition based on the state transition probability table and the observation probability table stored in the internal model data storage unit 37.

The recognizer 35 outputs a node sequence corresponding to the length N using, for example, the Viterbi algorithm. For example, when N = 3, the recognizer 35 outputs the node sequence s_1, s_2, s_3 as the recognition result. In this case, the recognizer 35 has recognized that the robot was located at node s_1 at time t_1, at node s_2 at time t_2, and at node s_3 at time t_3.

In the process of outputting a node sequence corresponding to the length N using the Viterbi algorithm, the node sequence is estimated and output based on the state transition probability table and the observation probability table stored in the internal model data storage unit 37. When the Viterbi algorithm is used to output a node sequence corresponding to the length N, a plurality of node sequences including the most probable node sequence can be output. Here, it is assumed that the most probable node sequence obtained with the Viterbi algorithm is output.
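A minimal sketch of such a decoding step is given below, assuming one transition table per action and a uniform prior over the starting node. The function name `viterbi`, the convention that actions[t] links step t to step t+1, and the toy two-node model are all hypothetical and are not taken from the specification.

```python
def viterbi(observations, actions, transition, observation_prob):
    """Return (best_path, final_scores) for an action-conditioned model.

    observations     -- observation symbol indices for the N steps
    actions          -- action indices; actions[t] is taken between step t and t+1
    transition       -- transition[k][i][j] = P(s_j | s_i, c_k)
    observation_prob -- observation_prob[i][o] = P(o | s_i)
    """
    n = len(observation_prob)
    # Assumption: uniform prior over the starting node.
    scores = [observation_prob[i][observations[0]] / n for i in range(n)]
    backpointers = []
    for t in range(1, len(observations)):
        table = transition[actions[t - 1]]
        new_scores, pointers = [], []
        for j in range(n):
            # Best predecessor of node j at this step.
            best_i = max(range(n), key=lambda i: scores[i] * table[i][j])
            pointers.append(best_i)
            new_scores.append(
                scores[best_i] * table[best_i][j]
                * observation_prob[j][observations[t]]
            )
        scores = new_scores
        backpointers.append(pointers)
    # Trace the best path backwards from the most probable final node.
    path = [max(range(n), key=lambda i: scores[i])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores


# Toy model: one action that always swaps the two nodes.
transition = [[[0.0, 1.0], [1.0, 0.0]]]
observation_prob = [[0.9, 0.1], [0.1, 0.9]]
path, scores = viterbi([0, 1, 0], [0, 0], transition, observation_prob)
print(path)
```

The returned final scores are unnormalized; dividing them by their sum gives one possible stand-in for the probability of each node at the last time.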

Further, in order to determine whether the node at which the robot is currently located should be newly added, the recognizer 35 determines whether the node sequence output using the Viterbi algorithm is an actually possible node sequence.

Whether the output node sequence is an actually possible node sequence is determined, for example, as follows.

Let the output node sequence (a node sequence of length T) be denoted X, and let the sequence of observation symbols specified from the time-series information (a sequence of observation symbols of length T) be denoted as the observation sequence O. Let the state transition probability table of the internal model data be denoted as a matrix A. Here, the matrix A means the state transition probability table corresponding to each of the actions specified from the time-series information.

The recognizer 35 determines whether the node sequence X and the observation sequence O satisfy Expression (46) and Expression (47).

... (46)

... (47)

Here, P(O|X) denotes the observation probability of each observation symbol constituting the observation sequence O at the corresponding node constituting the node sequence X, and can be determined from the observation probability table. Thres_trans and Thres_obs denote, respectively, the threshold for whether a transition is possible and the threshold for whether an observation is possible.

Accordingly, when it is determined that the node sequence X and the observation sequence O fail to satisfy at least one of Expression (46) and Expression (47), the recognizer 35 determines that the output node sequence is not an actually possible node sequence. The node at which the robot is currently located (the node at the last time of the time-series information) can thereby be recognized as an unknown node that should be newly added.
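The check can be sketched as follows. Expressions (46) and (47) are not reproduced in this text, so the sketch assumes that (46) requires every transition probability along X to exceed Thres_trans and that (47) requires every observation probability along X to exceed Thres_obs. The function name `is_plausible`, the strict inequalities, and the toy model are assumptions.

```python
def is_plausible(path, observations, actions, transition, observation_prob,
                 thres_trans, thres_obs):
    """Return True if the node sequence could actually have occurred."""
    # Assumed reading of (46): each transition along the path is likely enough.
    for t in range(len(path) - 1):
        if transition[actions[t]][path[t]][path[t + 1]] <= thres_trans:
            return False
    # Assumed reading of (47): each symbol is likely enough at its node.
    for node, symbol in zip(path, observations):
        if observation_prob[node][symbol] <= thres_obs:
            return False
    return True


# Toy model: one action that always swaps the two nodes.
transition = [[[0.0, 1.0], [1.0, 0.0]]]
observation_prob = [[0.9, 0.1], [0.1, 0.9]]
ok = is_plausible([0, 1, 0], [0, 1, 0], [0, 0],
                  transition, observation_prob, 0.05, 0.05)
bad = is_plausible([0, 0, 0], [0, 0, 0], [0, 0],
                   transition, observation_prob, 0.05, 0.05)
print(ok, bad)
```

The second sequence fails because the model assigns probability 0 to staying at node 0, which is the situation that leads to an unknown-node result.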

When it is determined that the node sequence X and the observation sequence O satisfy Expression (46) and Expression (47), the recognizer 35 calculates the entropy of the current state transition probability.

Here, let the entropy be E, let the posterior probability of node X_i be P(X_i|O), and let M be the total number of nodes present in the current internal model data. The posterior probability of a node (state) is the probability output by the Viterbi algorithm that corresponds to that node at the last time of the time-series information. In this case, the entropy E can be expressed by Expression (48).

... (48)
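Expression (48) is not reproduced in this text. Given the definitions above, it is presumably the usual entropy over the M nodes, E = -Σ P(X_i|O) log P(X_i|O) summed over i = 1 to M. The sketch below computes that quantity under the further assumption that the posterior is obtained by normalizing the final Viterbi scores. The function name `state_entropy` is hypothetical.

```python
import math


def state_entropy(final_scores):
    """Entropy of the node distribution at the last time step.

    final_scores -- one non-negative score per node (M values in total)
    """
    total = sum(final_scores)
    if total <= 0.0:
        raise ValueError("at least one score must be positive")
    # Assumption: normalizing the scores yields P(X_i | O).
    probs = [s / total for s in final_scores]
    # Zero-probability terms contribute nothing (0 * log 0 is taken as 0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)


certain = state_entropy([1.0, 0.0, 0.0])
ambiguous = state_entropy([0.2, 0.2])
print(certain, ambiguous)
```

A single dominant node gives an entropy near 0, while two equally likely candidates give log 2, which is why a large value signals that the current node is not yet uniquely identified.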

For example, the entropy value calculated by Expression (48) is compared with a predetermined threshold. A value below the threshold means that the output node sequence is an actually possible node sequence and that the situation is one in which it can be uniquely identified. The node at which the robot is currently located (the node at the last time of the time-series information) can thereby be recognized as a node that already exists in the internal model data, that is, a known node (a learned internal state).

Further, it may be determined whether the number of unique nodes included in the output node sequence is equal to or greater than a threshold Thres, and the node at the last time of the time-series information may be recognized as a known node only when that number is equal to or greater than Thres. That is, a threshold on the number of unique nodes in the recognized node sequence is provided in order to guarantee the accuracy of recognition. Here, the number of unique nodes means the number of nodes obtained when nodes are counted only once per distinct index.

For example, when the indices of the output node sequence are "10", "11", "10", "11", "12", "13", the length of the node sequence is 6 but the number of unique nodes is 4. When, for example, the agent repeats transitions between the same nodes, the accuracy of the recognition result is lower even if recognition is based on time-series information of the same length. For this reason, a threshold on the number of unique nodes in the recognized node sequence may be provided to guarantee the accuracy of recognition.
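The unique-node count in this example can be reproduced with a short sketch. The function name `has_enough_unique_nodes` and the threshold values used in the calls are hypothetical.

```python
def has_enough_unique_nodes(path, thres):
    """Return True if the node sequence visits at least thres distinct nodes."""
    # Each index is counted once, however often it appears in the sequence.
    return len(set(path)) >= thres


path = [10, 11, 10, 11, 12, 13]
unique_count = len(set(path))
print(len(path), unique_count)
print(has_enough_unique_nodes(path, 4), has_enough_unique_nodes(path, 5))
```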

On the other hand, an entropy value equal to or greater than the threshold means that the output node sequence is actually possible but cannot be uniquely identified because, for example, a plurality of candidates exist. The recognizer 35 therefore determines that the length of the time-series information should be increased. As a result, for example, the value of the length N of the time-series information is incremented and the processing is repeated.

Next, the node recognition processing performed by the recognizer 35 will be described with reference to the flowchart of FIG. 33. This processing is an example of a first method of node recognition processing by the recognizer 35.

In step S201, the recognizer 35 sets the value of the variable N to 1, which is its initial value.

In step S202, the recognizer 35 acquires time-series information of length N from the observation buffer 33 and the action output buffer 39.

In step S203, the recognizer 35 outputs a node sequence using the Viterbi algorithm, based on the time-series information acquired in step S202.

In step S204, the recognizer 35 determines whether the node sequence output as a result of the processing in step S203 is an actually possible node sequence. At this time, as described above, it is determined whether the node sequence X and the observation sequence O satisfy Expression (46) and Expression (47). When the node sequence X and the observation sequence O satisfy Expression (46) and Expression (47), it is determined in step S204 that the node sequence is actually possible. When the node sequence X and the observation sequence O fail to satisfy at least one of Expression (46) and Expression (47), it is determined in step S204 that the node sequence is not actually possible.

When it is determined in step S204 that the node sequence is not actually possible, the processing proceeds to step S208, and the recognizer 35 recognizes that the node at the last time of the time-series information is an unknown node. The recognition result of step S208 is stored in the recognition result buffer 38 in association with the last time of the time-series information.

On the other hand, when it is determined in step S204 that the node sequence is actually possible, the processing proceeds to step S205.

In step S205, the recognizer 35 calculates the entropy. At this time, as described above, the entropy is calculated by Expression (48).

In step S206, the recognizer 35 compares the entropy value calculated in step S205 with a predetermined threshold and determines whether the entropy value is equal to or greater than the threshold.

When it is determined in step S206 that the entropy value is equal to or greater than the threshold, the processing proceeds to step S209.

In step S209, the recognizer 35 increments the value of the variable N by 1. As a result, time-series information of length N+1 is acquired in the processing of step S202 executed thereafter. Each time the value of the variable N is incremented in step S209, the time-series information acquired in step S202 is extended in the past direction.

In this way, the processing of steps S202 to S206 and step S209 is repeated until it is determined in step S204 that the node sequence is not actually possible, or until it is determined in step S206 that the entropy value is not equal to or greater than the threshold.

When it is determined in step S206 that the entropy value is not equal to or greater than the threshold, the processing proceeds to step S207.

In step S204, it may additionally be determined whether the number of unique nodes included in the output node sequence is equal to or greater than the threshold Thres, and the processing may proceed to step S205 or step S208 only when that number is equal to or greater than Thres.

Alternatively, the processing may proceed to step S204 only when a node sequence whose number of unique nodes is equal to or greater than the threshold Thres is output in step S203, and when the number of unique nodes is less than the threshold Thres, the value of N may be incremented and the time-series information acquired again.

In step S207, the recognizer 35 recognizes that the node at the last time of the time-series information is a known node. At this time, the index of the node at the last time of the time-series information may be output. The recognition result of step S207 is stored in the recognition result buffer 38 in association with the last time of the time-series information.

The node recognition processing is executed in this way.
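The control flow of FIG. 33 can be sketched as a loop over the window length N. The three callables passed in are hypothetical stand-ins for the Viterbi decoding, the check of Expressions (46) and (47), and the entropy of Expression (48). The `max_length` bound and the "undecided" result are assumptions added so that the loop terminates when the buffers are exhausted; they are not part of the flowchart.

```python
def recognize_node(decode, is_plausible, entropy_of, max_length, thres_entropy):
    """Classify the node at the last time step as "known" or "unknown".

    decode(n)          -- returns (path, scores) for the most recent n steps
    is_plausible(path) -- True if the node sequence could actually occur
    entropy_of(scores) -- entropy of the node distribution at the last step
    max_length         -- number of buffered steps available (assumption)
    """
    n = 1                                        # S201
    while n <= max_length:
        path, scores = decode(n)                 # S202, S203
        if not is_plausible(path):               # S204
            return "unknown", None               # S208
        if entropy_of(scores) < thres_entropy:   # S205, S206
            return "known", path[-1]             # S207
        n += 1                                   # S209: extend into the past
    return "undecided", None                     # not part of FIG. 33


# Toy stand-ins: the entropy shrinks as the window grows.
result = recognize_node(
    decode=lambda n: ([7] * n, [1.0 / n]),
    is_plausible=lambda path: True,
    entropy_of=lambda scores: scores[0],
    max_length=10,
    thres_entropy=0.3,
)
print(result)
```

Passing the steps in as callables keeps the sketch independent of how the internal model data is stored.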

In the processing of FIG. 33, it was explained that the acquired time-series information is extended in the past direction each time the value of the variable N is incremented. However, the time-series information cannot be extended to before the time at which a transition from a known node to an unknown node occurred, because an accurate recognition result cannot be obtained from a node sequence that includes an unknown node reached by a transition from a known node.

Accordingly, the node sequence corresponding to the time-series information must not include an unknown node reached by a transition from a known node, and the value of the length N of the time-series information therefore has an upper limit. Whether a given node is an unknown node reached by a transition from a known node can be determined from the information stored in the recognition result buffer 38.

Next, an example of node recognition processing that takes into account the upper limit on the value of the length N of the time-series information will be described with reference to the flowchart of FIG. 34. This processing is an example of a second method of node recognition processing by the recognizer 35.

The processing of steps S221 to S229 is the same as that of steps S201 to S209 in FIG. 33, so a detailed description is omitted.

In the example of FIG. 34, when the value of the variable N is incremented by 1 in step S229, it is determined in step S230 whether the node sequence would come to include an unknown node reached by a transition from a known node. That is, the acquired time-series information is extended in the past direction each time the value of the variable N is incremented, and it is determined whether extending the node sequence in the past direction would cause it to include an unknown node reached by a transition from a known node. In other words, the time-series information is prevented from being extended to before the time at which the transition from a known node to an unknown node occurred.

When it is determined in step S230 that an unknown node reached by a transition from a known node would be included, the processing proceeds to step S231. When it is determined in step S230 that no such node would be included, the processing returns to step S222.

In step S231, the recognizer 35 withholds the recognition result and issues a command to extend the time-series information in the future direction. That is, it outputs a message or the like instructing that further actions be executed so that time-series information is accumulated. At this time, the recognizer 35 outputs control information to, for example, the action generator 36 so that further actions are executed.

That is, because the node cannot be recognized at the present time, or because any recognition result obtained would be uncertain, the recognizer 35 withholds the recognition result and outputs a command to accumulate more time-series information.
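The determination of step S230 can be sketched as a check against past recognition results. The sketch assumes the recognition result buffer can be read as a list of "known" and "unknown" labels ordered from oldest to newest, excluding the current step. The function name `may_extend` and that list layout are hypothetical.

```python
def may_extend(past_labels, n):
    """Return True if the window may grow from length n to length n + 1.

    past_labels -- recognition results for earlier steps, oldest first,
                   each "known" or "unknown" (the current step is excluded)
    n           -- current window length (current step plus n - 1 earlier steps)
    """
    if n > len(past_labels):
        return False  # no older data is left to add
    added = len(past_labels) - n  # index of the step that would be added
    # The added step must not be an unknown node entered from a known node.
    entered_from_known = added > 0 and past_labels[added - 1] == "known"
    return not (past_labels[added] == "unknown" and entered_from_known)


history = ["known", "known", "unknown", "unknown"]
print(may_extend(history, 1), may_extend(history, 2))
```

When the function returns False, the flow of FIG. 34 corresponds to step S231: the result is withheld and more actions are executed to extend the time-series information in the future direction.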

The recognition processing may be executed as shown in FIG. 34.

In the processing described above with reference to FIGS. 33 and 34, whether the node sequence is actually possible is determined by whether the node sequence X and the observation sequence O satisfy Expression (46) and Expression (47). However, this determination may instead be based on the change in the entropy value of the recognized current state probability.

That is, let E_N be the entropy calculated by Expression (48) from time-series information of length N, let E_{N-1} be the entropy calculated by Expression (48) from time-series information of length N-1, and compute ΔE = E_N - E_{N-1}. ΔE is then compared with a predetermined threshold Thres_ent, the number of repetitions of that comparison is compared with a threshold Thres_stable, and the node may be recognized based on the results of these comparisons.

For example, when ΔE < Thres_ent is not satisfied, the time-series information is extended in the past direction, the entropy is calculated again, and it is determined whether ΔE < Thres_ent is satisfied. When ΔE < Thres_ent is satisfied, a counter NC is incremented, and when NC > Thres_stable is satisfied, node recognition is performed.
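The stability test can be sketched as follows, taking the entropies for successive window lengths as input. Several points are assumptions: ΔE is used as a signed difference exactly as written, the counter NC is not reset when ΔE < Thres_ent fails, and the counter is compared with >= as in step S257 of FIG. 35, whereas the prose above states NC > Thres_stable. The function name `stable_length` is hypothetical.

```python
def stable_length(entropies, thres_ent, thres_stable):
    """Return the window length N at which recognition would proceed, or None.

    entropies -- entropies[n - 1] is E_N for window length N = 1, 2, ...
    """
    counter = 0  # counter NC
    for n in range(2, len(entropies) + 1):  # the difference needs N >= 2
        delta = entropies[n - 1] - entropies[n - 2]  # E_N - E_{N-1}
        if delta < thres_ent:
            counter += 1
            if counter >= thres_stable:
                return n
        # Otherwise the window is simply extended one step into the past.
    return None


settled = stable_length([1.0, 0.5, 0.45, 0.44], thres_ent=0.0, thres_stable=2)
unsettled = stable_length([1.0, 1.2], thres_ent=0.1, thres_stable=1)
print(settled, unsettled)
```

If the intent is that the entropy has stopped changing, comparing the absolute value of ΔE with the threshold would be a natural variant, but the text does not state this.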

Next, an example of node recognition processing in which recognition is based on the change in the entropy value of the state probability will be described with reference to the flowchart of FIG. 35. This processing is an example of a third method of node recognition processing by the recognizer 35.

In step S251, the recognizer 35 sets the value of the variable N to 1, which is its initial value.

In step S252, the recognizer 35 acquires time-series information of length N from the observation buffer 33 and the action output buffer 39.

In step S253, the recognizer 35 outputs a node sequence using the Viterbi algorithm, based on the time-series information acquired in step S252.

In step S254, the recognizer 35 calculates the entropy difference. At this time, as described above, let E_N be the entropy calculated by Expression (48) from time-series information of length N, let E_{N-1} be the entropy calculated by Expression (48) from time-series information of length N-1, and compute ΔE = E_N - E_{N-1}. The calculation of step S254 is executed once the value of N is 2 or more.

In step S255, the recognizer 35 determines whether the entropy difference calculated in step S254 is equal to or greater than the threshold Thres_ent. If it is determined in step S255 that the entropy difference calculated in step S254 is not equal to or greater than the threshold, the processing proceeds to step S256.

In step S256, the recognizer 35 increments the value of the counter NC by 1.

In step S257, the recognizer 35 determines whether the value of the counter NC is equal to or greater than the threshold Thres_stable. If it is determined in step S257 that the value of the counter NC is equal to or greater than the threshold Thres_stable, the processing proceeds to step S258.

In step S258, the recognizer 35 determines whether the node sequence output as a result of the processing in step S253 is a node sequence that can actually occur. As described above, it is determined whether the node sequence X and the observation sequence O satisfy equations (46) and (47). If the node sequence X and the observation sequence O satisfy both equation (46) and equation (47), it is determined in step S258 that the node sequence can actually occur. If they fail to satisfy at least one of equation (46) and equation (47), it is determined in step S258 that the node sequence cannot actually occur.

If it is determined in step S258 that the output node sequence cannot actually occur, the processing proceeds to step S262, and the recognizer 35 recognizes the node at the last time of the time-series information as an unknown node. The recognition result of step S262 is stored in the recognition result buffer 38 in association with the last time of the time-series information.

On the other hand, if it is determined in step S258 that the output node sequence can actually occur, the processing proceeds to step S259.

In step S259, the recognizer 35 calculates the entropy. As described above, the entropy is calculated by equation (48).

In step S260, the recognizer 35 compares the entropy value calculated in step S259 with a predetermined threshold and determines whether the entropy value is equal to or greater than the threshold.

If it is determined in step S260 that the entropy value is equal to or greater than the threshold, the processing proceeds to step S263.

In step S263, the recognizer 35 withholds the recognition result and issues a command to extend the time-series information in the future direction. That is, it outputs a message or the like commanding that further actions be executed so that more time-series information is accumulated. At this time, the recognizer 35 outputs, for example, control information to the action generator 36 so that a further action is executed.

On the other hand, if it is determined in step S260 that the entropy value is not equal to or greater than the threshold, the processing proceeds to step S261, and the recognizer 35 recognizes the node at the last time of the time-series information as a known node.

The recognition result of step S261 is stored in the recognition result buffer 38 in association with the last time of the time-series information.

In step S258, it may additionally be determined whether the number of unique nodes included in the output node sequence is equal to or greater than the threshold Thres, so that the processing proceeds to step S259 or step S262 only when the number is equal to or greater than Thres. In this case, if it is determined in step S258 that the number of unique nodes included in the output node sequence is not equal to or greater than the threshold Thres, the processing proceeds to step S265. That is, the value of the variable N is incremented by 1.

If it is determined in step S255 that the entropy difference calculated in step S254 is equal to or greater than the threshold Thres_ent, the processing proceeds to step S264, and the value of the counter NC is set to 0.

After the processing of step S264, or if it is determined in step S257 that the value of the counter NC is not equal to or greater than the threshold Thres_stable, the processing proceeds to step S265.

In step S265, the recognizer 35 increments the value of the variable N by 1. As a result, time-series information of length N + 1 is acquired in the processing of step S252 that is executed next. Each time the value of the variable N is incremented in step S265, the time-series information acquired in step S252 is extended in the past direction.

In this way, the processing of steps S252 to S257 and step S265 is repeated until it is determined in step S255 that the entropy difference is not equal to or greater than the threshold Thres_ent and it is determined in step S257 that the value of the counter NC is equal to or greater than the threshold Thres_stable.

The node recognition processing is executed in this way. In the example of FIG. 35, the processing of steps S255 and S257 confirms that the entropy value has converged, and only then is it determined whether the output node sequence can actually occur. Consequently, more reliable recognition is possible than, for example, in the case described above with reference to FIG. 33.
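The control flow of steps S251 to S265 can be sketched as follows. This is a minimal illustration, not the claimed implementation: the Viterbi decoder, the entropy of equation (48), and the feasibility test of equations (46) and (47) are passed in as hypothetical callables, and the bound `max_len` on N is an assumption added so that the loop terminates.

```python
from typing import Callable, List, Optional, Sequence

def recognize_node(
    get_series: Callable[[int], Sequence],               # S252: last N samples
    viterbi: Callable[[Sequence], List[int]],            # S253: node sequence
    entropy_of: Callable[[Sequence], float],             # equation (48), assumed
    is_feasible: Callable[[List[int], Sequence], bool],  # equations (46), (47)
    thres_ent: float,
    thres_stable: int,
    thres_entropy: float,
    max_len: int,                                        # assumed bound on N
) -> str:
    nc = 0                                   # counter NC
    prev_e: Optional[float] = None
    for n in range(1, max_len + 1):          # S251 (N = 1) and S265 (N += 1)
        series = get_series(n)               # S252
        nodes = viterbi(series)              # S253
        e = entropy_of(series)               # E_N
        if prev_e is not None:               # S254 runs only when N >= 2
            delta_e = e - prev_e             # Delta E = E_N - E_{N-1}
            if delta_e >= thres_ent:         # S255: not yet converged
                nc = 0                       # S264
            else:
                nc += 1                      # S256
                if nc >= thres_stable:       # S257: entropy has converged
                    if not is_feasible(nodes, series):  # S258
                        return "unknown"     # S262
                    if e >= thres_entropy:   # S259, S260
                        return "hold"        # S263: extend toward the future
                    return "known"           # S261
        prev_e = e
    return "hold"                            # assumed fallback when N runs out

# Toy stand-ins (hypothetical): the entropy shrinks as the sequence grows.
toy = dict(
    get_series=lambda n: list(range(n)),
    viterbi=lambda s: [0] * len(s),
    entropy_of=lambda s: 1.0 / len(s),
    thres_ent=0.0, thres_stable=3, thres_entropy=0.5, max_len=10,
)
result = recognize_node(is_feasible=lambda nodes, s: True, **toy)
```

The sketch compares ΔE itself with Thres_ent, as the text does; comparing its magnitude instead would be a variation and is not stated in the source.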

In the processing of FIG. 35 as well, the time-series information cannot be extended back beyond the time at which a transition from a known node to an unknown node occurred. This is because an accurate recognition result cannot be obtained from a node sequence that includes an unknown node reached by a transition from a known node.

Accordingly, the node sequence corresponding to the time-series information must not include a node that was recognized as an unknown node reached by a transition from a known node, and the length N of the time-series information therefore has an upper limit. Whether a given node is an unknown node reached by a transition from a known node can be determined from the information stored in the recognition result buffer 38.

Next, with reference to the flowchart of FIG. 36, an example of node recognition processing will be described in which recognition is based on a change in the entropy of the state probability and the upper limit on the length N of the time-series information is taken into account. This processing is an example of a fourth method of node recognition processing by the recognizer 35.

The processing of steps S281 to S295 is the same as the processing of steps S251 to S265 in FIG. 35, so a detailed description is omitted.

In the example of FIG. 36, when the value of the variable N is incremented by 1 in step S295, it is determined in step S296 whether the node sequence would come to include an unknown node reached by a transition from a known node. That is, each time the value of the variable N is incremented, the acquired time-series information is extended in the past direction, and it is determined whether extending the node sequence in the past direction would cause it to include an unknown node reached by a transition from a known node.

If it is determined in step S296 that such an unknown node would be included, the processing proceeds to step S293. If it is determined in step S296 that no such unknown node would be included, the processing returns to step S282.

In step S293, the recognizer 35 withholds the recognition result and issues a command to extend the time-series information in the future direction. That is, it outputs a message or the like commanding that further actions be executed so that more time-series information is accumulated. At this time, the recognizer 35 outputs, for example, control information to the action generator 36 so that a further action is executed.

In other words, the node cannot be recognized at this point, or any recognition that is possible would be unreliable, so the recognizer 35 withholds the recognition result and outputs a command so that more time-series information is accumulated.

The recognition processing may be executed as shown in FIG. 36.
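The check of step S296 amounts to bounding N by the most recent known-to-unknown transition recorded in the recognition result buffer 38. A minimal sketch follows; the buffer layout (a list of labels, oldest first, with `None` for times not yet recognized) and the name `max_history_length` are assumptions made for illustration.

```python
from typing import List, Optional

def max_history_length(results: List[Optional[str]]) -> int:
    """Largest N such that the last N entries contain no unknown node that was
    reached by a transition from a known node.

    results[t] is "known", "unknown", or None (not yet recognized), oldest
    first. This layout of the recognition result buffer is an assumption.
    """
    n = 0
    for t in range(len(results) - 1, -1, -1):     # walk toward the past
        entered_unknown = (
            results[t] == "unknown" and t > 0 and results[t - 1] == "known"
        )
        if entered_unknown:
            break          # S296: extending further would include this node
        n += 1
    return n

# Hypothetical buffer: two known nodes, a transition into an unknown node,
# then two time steps that have not been recognized yet.
buffer_38 = ["known", "known", "unknown", None, None]
upper_limit = max_history_length(buffer_38)
```

When N would exceed this limit, the sketch corresponds to the branch to step S293, where the recognition result is withheld and more time-series information is gathered.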

With the four methods described above with reference to FIGS. 33 to 36, the robot can recognize whether it is located on a newly added part of the maze (an unknown node) or on a part that existed previously (a known node). The state transition probabilities and observation probabilities for an unknown node recognized in this way are set, and the state transition probability table and the observation probability table are extended.

Although an example of recognition using the action-extended HMM has been described here, the recognition processing of FIGS. 33 to 36 can also be applied to recognition using an ordinary HMM.

When an agent autonomously recognizes a change in the environment and extends the state transition probability table and the observation probability table, the question arises of how many unknown nodes should be newly included in these tables and at what point in time. The number of unknown nodes to be added and the timing of their addition, in the case where a change in the environment is recognized autonomously and unknown nodes are added to the internal model data, are described next.

Here, adding an unknown node to the internal model data means generating a new index that represents the node regarded as an unknown node and, for example, adding a matrix corresponding to that index to the state transition probability table and the like.
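As an illustration of what adding a new index involves, the following sketch appends one node to a state transition probability table and an observation probability table held as nested lists. The table layout (`transition[action][from][to]`, `observation[node][symbol]`) and the placeholder initial values are assumptions; this passage does not specify how the new entries are initialized.

```python
from typing import List

def add_node(
    transition: List[List[List[float]]],   # transition[action][i][j], assumed
    observation: List[List[float]],        # observation[i][k], assumed
) -> int:
    """Append one node in place and return its new index."""
    new_index = len(observation)
    for table in transition:               # one N x N table per action
        for row in table:
            row.append(0.0)                # new column: transitions into the node
        table.append([0.0] * (new_index + 1))   # new row: transitions out of it
    n_symbols = len(observation[0])
    # Placeholder: uniform observation probabilities until they are learned.
    observation.append([1.0 / n_symbols] * n_symbols)
    return new_index

# Hypothetical model with one action, one node and two observation symbols.
transition = [[[1.0]]]
observation = [[0.5, 0.5]]
idx = add_node(transition, observation)
```

The zero-filled row and column are only placeholders; in the described scheme the probabilities for the new node would be set by the subsequent learning and anchoring steps.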

Let N be the time that has elapsed since the agent recognized, by one of the methods described above with reference to FIGS. 33 to 36, that it is located at a node that should be newly added (an unknown node). This time N can also be regarded as the length of the time-series information. In addition, a threshold Thres on the number of unique nodes in the recognized node sequence is introduced here as a threshold for guaranteeing recognition accuracy.

First, the agent repeats actions until the number of unique nodes included in the time-series information of length N reaches the value of Thres. That is, N actions are executed by the action generator 36 and the action output unit 32, and time-series information of length N is accumulated in the observation buffer 33 and the action output buffer 39. The time-series information of length N referred to here means time-series information of temporal length N following the time at which the agent recognized that it was located at an unknown node. In the following, the expression "Lr reaches the value of Thres" is used, where appropriate, to mean "the number of unique nodes included in the node sequence of length Lr recognized from the time-series information of length N reaches the value of Thres".

When Lr becomes equal to or greater than Thres, the recognizer 35 executes the recognition processing described above with reference to FIG. 34 or FIG. 36, based on the time-series information. In this case, the length of the time-series information has the upper limit N.

Let S be the node sequence output in step S223 of FIG. 34 or step S283 of FIG. 36 during this recognition processing, and let Lr be the length of that node sequence.

If the node is recognized as an unknown node in step S228 of FIG. 34 or step S292 of FIG. 36, it is regarded as an unknown node and is added to the internal model data by the learner 34.

Let m_add be the number of nodes that have actually been added since the node became unknown. The number m of unknown nodes to be added can then be expressed by equation (49).

$$m = N - (L_r + m_{\mathrm{add}} + 1) \qquad \cdots (49)$$

Here, m_add is the number of nodes that have already been added, if any, since the agent first recognized that it was located at an unknown node. That is, equation (49) indicates that, once the agent has recognized that it is located at an unknown node, the nodes from the first recognized unknown node onward are added, after subtracting the number of nodes that have already been regarded as unknown nodes and added.

The "1" added on the right-hand side of equation (49) indicates that the node corresponding to the oldest time in the node sequence of length Lr is withheld, because it cannot yet be determined which node it should be connected to.

This is described in more detail with reference to FIG. 37. In FIG. 37, a time axis t runs in the vertical direction, and the nodes to which the agent has transitioned over time are shown as circles. The vertical dotted line in the figure indicates the node at which the agent first recognized that it was located at an unknown node. In this example, node 201 is the node at which the agent first recognized that it was located at an unknown node.

To simplify the description, it is assumed that each execution of an action increases both the number of nodes shown as circles in the figure and the length of the time-series information by 1, and that, unless otherwise stated, all of these nodes are recognized as unique nodes.

As shown in the figure, after the agent first recognizes that it is located at an unknown node, actions are executed one at a time and time-series information is accumulated. Once the length N of the time-series information has become equal to the threshold Thres (in this case, Lr = N), the recognizer 35 executes the recognition processing described above with reference to FIG. 34 or FIG. 36, based on the time-series information. In this example, it is assumed that the node sequence consisting of node 201, node 202, ..., node 211 is output and that node 211 is recognized as an unknown node.

One more action is then executed, and the agent transitions to node 212. At this time, recognition processing is executed based on the time-series information of length Lr corresponding to nodes 202 to 212, and node 212 is assumed to be recognized as an unknown node. No unknown node is added yet at this point.

One more action is then executed, and the agent transitions to node 213. At this time, recognition processing is executed based on the time-series information of length Lr corresponding to nodes 203 to 213, and node 213 is assumed to be recognized as an unknown node. Node 201 is added at this point.

As a result, node 201 is treated as a known node in subsequent recognition processing.

In this case, the length N of the time-series information (the length of time since the agent recognized that it was located at an unknown node) is Thres + 2. Nodes 203 to 213 correspond to the node sequence S, whose length Lr is Thres. From equation (49), the number m of nodes to be added is therefore Thres + 2 - (Thres + 0 + 1) = 1. Accordingly, the single node 201, which was an unknown node, is newly added.

That is, a matrix for a new index representing node 201 is added to the state transition probability table and the like of the internal model data.
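The arithmetic of this example can be checked with a short sketch. It assumes that equation (49) has the form m = N - (Lr + m_add + 1), which is the form implied by the calculation Thres + 2 - (Thres + 0 + 1) = 1 in the text; the function name `nodes_to_add` and the concrete value chosen for Thres are hypothetical.

```python
def nodes_to_add(n: int, lr: int, m_add: int) -> int:
    """Number m of unknown nodes to add, assuming m = N - (Lr + m_add + 1)."""
    return n - (lr + m_add + 1)

thres = 11   # arbitrary illustrative value; it cancels out in the result

# Situation of FIG. 37 when the agent reaches node 213:
# N = Thres + 2, Lr = Thres, and no node has been added yet (m_add = 0).
m_first = nodes_to_add(n=thres + 2, lr=thres, m_add=0)

# One step later, under the same assumptions, N = Thres + 3 and node 201
# has already been added (m_add = 1).
m_next = nodes_to_add(n=thres + 3, lr=thres, m_add=1)
```

Under these assumptions one node is added per step: first node 201, then the next oldest node, while the oldest node of the current sequence S is always withheld.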

In the example described above, nodes 211 to 213 are all recognized as unknown nodes, but it is unclear whether node 201 was truly an unknown node. For example, node 211 was determined to be an unknown node because the node sequence of nodes 201 to 211 was determined not to be a node sequence that can actually occur, which does not necessarily mean that node 211 itself is absent from the existing internal model data. In other words, node 211 is recognized as an unknown node if any one of nodes 201 to 211 is absent from the existing internal model data.

Therefore, even if node 201 is regarded as an unknown node at this point and a matrix for a new index representing node 201 is added to the state transition probability table and the like of the internal model data, the result may in fact duplicate the matrix of an existing index. Thus, it is unclear whether node 201 was truly an unknown node.

For convenience of explanation with reference to FIG. 37, it has been stated here that it is unclear whether node 201 was truly an unknown node; the example of FIG. 37, however, presupposes that node 201 was truly an unknown node. Strictly speaking, therefore, the appropriate statement is that it is unclear whether the nodes added subsequently (node 202, node 203, and so on) were truly unknown nodes.

As described above, even though it is unclear whether node 201 was truly an unknown node, a problem arises if the new index representing node 201 is never added to the internal model data out of excessive concern that it might duplicate the matrix of an existing index. Depending on the agent's situation, learning would then never be completed.

For example, if the maze that constitutes the environment is extended so that a new maze room is created, and the robot that is the agent is confined to that new room, nodes must still be added even when it cannot be certain that the nodes being added were truly unknown nodes.

For this reason, a predetermined number of nodes must be added to the internal model data at a timing a predetermined time after the agent first recognizes that it is located at an unknown node.

The description now returns to FIG. 37. After node 201 has been added to the internal model data, further actions are executed, and recognition processing based on the time-series information continues. If, as a result of recognition processing based on the time-series information corresponding to nodes 212 to 221, node 221 is recognized as a known node, then nodes 212 to 221 were all known nodes. At this time, node 211 is added, and anchoring from node 211 to node 212 is performed. Anchoring is processing that sets the state transition probabilities and the like between an unknown node and a known node when a transition from the unknown node to the known node is recognized. The details of anchoring are described later.

In the recognition processing described above with reference to FIG. 34 or FIG. 36, a command to withhold the recognition result and extend the time-series information in the future direction may be output in step S231 or step S293. In such a case, appropriate recognition cannot be performed with time-series information of length Thres, so the time-series information must be extended in the future direction.

An example in which the recognition processing outputs a command to withhold the recognition result and extend the time-series information in the future direction is described in more detail with reference to FIG. 38. In FIG. 38, as in FIG. 37, a time axis t runs in the vertical direction, and the nodes to which the agent has transitioned over time are shown as circles. The vertical dotted line in the figure indicates the node at which the agent first recognized that it was located at an unknown node. In this example, node 201 is the node at which the agent first recognized that it was located at an unknown node.

As shown in the figure, after the agent first recognizes that it is located at an unknown node, actions are executed one at a time and time-series information is accumulated. Once the length N of the time-series information has become equal to the threshold Thres (in this case, Lr = N), the recognizer 35 executes the recognition processing described above with reference to FIG. 34 or FIG. 36, based on the time-series information. In this example, it is assumed that the node sequence consisting of node 201, node 202, ..., node 211 is output and that nodes 201 to 211 are all recognized as unknown nodes. It is also assumed in this example that nodes 201 to 211 have been added to the internal model data.

As a result, nodes 201 to 211 are treated as known nodes in subsequent recognition processing.

When the agent transitions to node 221, recognition processing is executed based on the time-series information of length Lr, and it is assumed that a command to withhold the recognition result and extend the time-series information in the future direction is output at this point. That is, at this point the node sequence cannot be recognized uniquely, and any recognition that is attempted yields a plurality of candidates.

In such a case, the value of the threshold Thres is incremented by 1, one new action is executed, and the length of the time-series information subject to recognition processing is also incremented by 1. The agent is thereby assumed to transition to node 222. At this point, recognition processing is executed based on the time-series information of length Thres + 1 and a node sequence of length Lr (= Thres + 1) is obtained, but it is assumed that a command to withhold the recognition result and extend the time-series information in the future direction is output at this point as well.

The value of the threshold Thres is then incremented further and further actions are executed, and the agent is assumed to transition to node 231. At this point, recognition processing is executed based on the time-series information of length Thres + q, and node 231 is assumed to be recognized as a known node.

If node 231 is recognized as a known node, then nodes 213 to 231 were all known nodes. At this time, node 212 is added, and anchoring from node 212 to node 213 is performed.

As described above, however, the nodes that were regarded as unknown nodes and added may include nodes that are in fact known nodes. Moreover, even when the agent is in fact transitioning repeatedly to the same nodes (for example, moving back and forth between two nodes), those nodes may be recognized as different unknown nodes.

To keep nodes that are not really unknown nodes from being regarded as unknown nodes and added to the internal model data, a check of whether nodes need to be added or deleted is performed, for example, at the time of anchoring.

アンカリングする際にノードの追加または削除の要否のチェックが行なわれる場合の例について、図３９を参照してさらに詳細に説明する。図３９では、図３７と同様に、図中垂直方向に時間軸ｔが設けられており、時間の経過に伴ってエージェントが遷移したノードが図中の円により示されている。また、同図において、図中垂直方向の点線は、自分が未知ノードに位置していると最初に認識したノードを示すためのものである。この例では、ノード２０１が、自分が未知ノードに位置していると最初に認識したノードとされる。 An example in which the necessity of adding or deleting a node is checked at the time of anchoring will be described in more detail with reference to FIG. In FIG. 39, similarly to FIG. 37, a time axis t is provided in the vertical direction in the figure, and the nodes to which the agent has changed with the passage of time are indicated by circles in the figure. Also, in the figure, a dotted line in the vertical direction in the drawing is for indicating a node that is first recognized as being located at an unknown node. In this example, the node 201 is the node that is first recognized as being located at an unknown node.

同図に示されるように、自分が未知ノードに位置していると最初に認識した後、１ずつアクションが実行されて時系列情報が蓄積されていく。そして、時系列情報の長さＮが閾値Thresと等しくなった（この場合、Ｌｒ＝Ｎ）後、認識器３５は、時系列情報に基づいて、図３４、または図３６を参照して上述した認識処理を実行する。この例の場合、ノード２０１、ノード２０２、・・・ノード２１１のノード列が出力され、ノード２０１乃至ノード２１１は、全て未知ノードであると認識されたものとする。 As shown in the figure, after first recognizing that it is located at an unknown node, actions are executed one by one and time series information is accumulated. Then, after the length N of the time series information becomes equal to the threshold value Thres (in this case, Lr = N), the recognizer 35 is described above with reference to FIG. 34 or FIG. 36 based on the time series information. Perform recognition processing. In the case of this example, it is assumed that the node sequence of the node 201, the node 202,..., The node 211 is output, and the nodes 201 to 211 are all recognized as unknown nodes.

Thereafter, one more action is executed and the agent transitions to node 212; at this point, no unknown node has been added yet.

Thereafter, one more action is executed, and when the agent transitions to node 213, node 201 is added.

Assume that actions continue to be executed in this way and that the agent has transitioned to node 215, by which time nodes 201 to 203 have already been added. That is, nodes 201 to 203 have been regarded as unknown nodes and, for example, nodes with new indexes have been added to the internal model data. If the recognition process based on the time-series information corresponding to nodes 205 to 215 is then executed and node 215 is recognized as a known node, it follows that nodes 205 to 215 were all known nodes.

At this time, whether nodes need to be deleted is checked. That is, the time-series information is extended in the past direction, and the recognition process is executed on the extended time-series information. Assume, for example, that the recognition process based on the time-series information corresponding to nodes 203 to 215 is executed and that nodes 203 to 215 are all recognized as having been known nodes. In that case, node 203 was regarded as an unknown node and, for example, a node with a new index was added to the internal model data, although it was in fact a known node; the node with the added index should therefore be deleted from the internal model data.

Such a recognition result is obtained when, for example, node 203 and node 205 are actually nodes with the same index, and node 204 and node 206 are actually nodes with the same index.

For example, suppose that a new matrix was added to the state transition probability table and the like with u as the index of node 203, but that the deletion check reveals that the index of node 203 is actually f. Assume that the matrix corresponding to index f existed in the state transition probability table and the like before the agent transitioned to node 201. In this case, the matrix corresponding to index u and the matrix corresponding to index f exist in duplicate, so the matrix corresponding to index u needs to be deleted from the state transition probability table and the like.

As a result, the matrix and the like corresponding to index u, which was newly added as the index of node 203, are deleted from the internal model data, and anchoring is performed from node 202 to node 203, which has been recognized as a known node.

In the above example, if a new matrix was added to the state transition probability table and the like with t as the index of node 202, the anchoring sets, for example, the state transition probability from the node with index t to the node with index f.
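The deletion of the duplicate index u and the anchoring from index t to index f can be illustrated with a minimal sketch. It assumes a dict-of-dicts table for a single action; the helper names `delete_node` and `anchor`, the sample probabilities, and the anchoring value of 1.0 are all hypothetical and are not taken from the specification.

```python
def delete_node(table, index):
    """Remove the row and the column of `index` from a dict-of-dicts table."""
    table.pop(index, None)
    for row in table.values():
        row.pop(index, None)


def anchor(table, src, dst, prob):
    """Set the transition probability src -> dst, then renormalize that row."""
    row = table.setdefault(src, {})
    row[dst] = prob
    total = sum(row.values())
    for key in row:
        row[key] /= total


# Hypothetical table: f and g are known nodes, t and u were added as
# presumed unknown nodes (t for node 202, u for node 203).
table = {
    "f": {"f": 0.2, "g": 0.8},
    "g": {"f": 1.0},
    "t": {"u": 1.0},
    "u": {"g": 1.0},
}

delete_node(table, "u")            # index u duplicates the known index f
anchor(table, "t", "f", prob=1.0)  # anchoring: t -> f (value is a placeholder)
print(table["t"])
```

After these two calls, index u no longer appears in any row or column, and the row of index t points at the known index f.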

After the anchoring is performed, learning by the additional learning method is carried out on the basis of the time-series information accumulated so far. That is, with the internal model data immediately after anchoring as the initial value, learning is performed on the time-series information corresponding to nodes 201 to 215 in FIG. 39 and the one node to the left of node 201.

As described above, anchoring is a process of setting the state transition probability and the like between an unknown node and a known node when a transition from an unknown node to a known node is recognized. In the present invention, after the anchoring is performed, learning by the additional learning method is carried out on the basis of the time-series information accumulated so far.

That is, learning by the additional learning method is performed on the basis of the internal model data to which the unknown nodes have been added. Even if nodes that actually have the same index have been added redundantly as different unknown nodes, this learning makes it more likely that they will be merged into a single node through the forward merge algorithm and the backward merge algorithm described above.

In addition, by not executing learning by the additional learning method until anchoring is performed, the number of parameters to be updated in the internal model data can be kept as small as possible, because whether nodes need to be deleted is checked at the time of anchoring. Accordingly, the internal model data can be updated appropriately while the amount of computation is suppressed.

When whether nodes need to be deleted is checked at the time of anchoring in this way, the number m of unknown nodes to be added can be expressed by equation (50) instead of equation (49).

$m = N - (L_r + m_{add})$ ... (50)

In the present case, the length N of the time-series information (the length of time since the agent recognized that it was located at an unknown node) is 11. Nodes 203 to 215 correspond to the node sequence S, and the length Lr of the node sequence S is Thres + 2. From equation (50), the number m of nodes to be added is therefore calculated as Thres + 4 - (Thres + 2 + 3) = -1. Accordingly, one of the three nodes already added, node 203 (the matrix corresponding to its index), is deleted.
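The arithmetic above can be reproduced with a short sketch. The function below assumes that the check at anchoring takes the form m = N - (Lr + m_add), which is the form implied by the calculation Thres + 4 - (Thres + 2 + 3); the function name and the concrete value chosen for Thres are illustrative assumptions only.

```python
def nodes_to_add_at_anchoring(n, lr, m_add):
    """Return m = N - (Lr + m_add).

    A positive m means m nodes are to be added; a negative m means
    |m| of the already-added nodes are to be deleted.
    """
    return n - (lr + m_add)


thres = 11  # illustrative value only; the result does not depend on it
m = nodes_to_add_at_anchoring(n=thres + 4, lr=thres + 2, m_add=3)
print(m)  # -1: one already-added node is deleted
```

Because Thres cancels out in the subtraction, the same result of -1 is obtained for any threshold value.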

Only the case in which a node is deleted has been described here, but depending on the value of m_add, nodes may instead be added. That is, if m calculated by equation (50), or by equation (51) described later, is positive, that number of nodes is added. In practice, therefore, whether nodes need to be added or deleted is checked at the time of anchoring.

Note that if a node that would be deleted is recognized as a known node as a result of the recognition process, that node is not deleted.

If the K-th of the nodes already added as presumed unknown nodes is included in the node sequence S output by the recognition process, the number m of nodes to be deleted can be expressed by equation (51) instead of equation (50).

... (51)

The |m| nodes calculated by equation (51) are the nodes to be deleted.

In this case, the node to be anchored is the ((Lr + K) - N)-th node in the node sequence S.
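As a small worked illustration of the anchoring position, the sketch below evaluates (Lr + K) - N. The function name is hypothetical, and the sample values assume, purely for illustration, that in the FIG. 39 example node 203 is the K = 3rd added node, with Lr = Thres + 2 and N = Thres + 4.

```python
def anchoring_position(lr, k, n):
    """Return the 1-based position (Lr + K) - N within the node sequence S."""
    return (lr + k) - n


thres = 11  # illustrative value only; it cancels out below
pos = anchoring_position(lr=thres + 2, k=3, n=thres + 4)
print(pos)  # 1: the first node of the node sequence S
```

Under these assumed values the position is 1, which is consistent with the anchoring target being the first node of the sequence that starts at node 203.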

After anchoring is performed in this way, learning by the additional learning method is carried out on the basis of the time-series information accumulated so far, and learning by the additional learning method is not executed until anchoring is performed. Consequently, a node that was regarded as an unknown node and added to the internal model data before anchoring is recognized as one of the known nodes in subsequent recognition processes, but only as a provisional known node, so to speak. This is because such a node may ultimately have to be deleted, and because values such as the state transition probabilities between such a node and other nodes may be changed by learning by the additional learning method.

As described above, even when it is not certain that a node to be added was truly an unknown node, a predetermined number of nodes must be added to the internal model data at a timing a predetermined time after the agent first recognized that it was located at an unknown node. In other words, it is very likely that the internal model data before anchoring also contains information corresponding to indexes of nodes that were merely regarded as unknown.

However, if a very large number of nodes that cannot be confirmed to be truly unknown are all uniformly regarded as unknown nodes and added to the internal model data, misrecognition may occur in the recognition process, because a node added as a presumed unknown node is treated as a known node in subsequent recognition processes.

As a result, for example, a known node that has existed all along may be mistakenly recognized as a node that was added as a presumed unknown node, because the recognition process is performed on the basis of the internal model data.

To suppress such misrecognition, nodes that have been added as presumed unknown nodes may be deleted as appropriate before anchoring. In this case, when the value of m given by equation (49) becomes smaller than 0, |m| nodes are deleted.

For example, consider the case in which the threshold Thres on the number of unique nodes is 7. Assume that node 216 (not shown) is the node at which the agent first recognized that it was located at an unknown node, that the agent has now transitioned to node 226 (not shown), and that node 216 has already been added to the internal model data.

Assume that, as a result of the recognition process based on the time-series information corresponding to nodes 219 to 226, node 226 is recognized as an unknown node. At this time, node 217 is added to the internal model data.

Thereafter, an action is executed and the agent transitions to node 227 (not shown); assume that node 227 is recognized as an unknown node by the recognition process at this point. At this time, node 218 is added to the internal model data. Assume, however, that as a result of node 218 having been added to the internal model data, nodes 220, 222, 224, and 226 are recognized as actually being nodes with the same index as node 218.

In this case, to output a node sequence containing at least Thres unique nodes, the time-series information must have the length corresponding to nodes 217 to 227.

In such a case, the length N of the time-series information (nodes 216 to 227) is 12, and the number m_add of nodes already added (nodes 216 to 218) is 3. Nodes 217 to 227 correspond to the node sequence S, and the length Lr of the node sequence S is 11. From equation (49), the number m of nodes to be added is therefore calculated as 12 - (11 + 3 + 1) = -3. Accordingly, the three nodes that were added to the internal model data, namely nodes 216 to 218, are deleted.
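The calculation in this example can be checked with a few lines of code. Equation (49) itself is defined earlier in the specification; the function below only mirrors the arithmetic shown here, 12 - (11 + 3 + 1), and therefore assumes the form m = N - (Lr + m_add + 1). The function name is hypothetical.

```python
def nodes_to_add_before_anchoring(n, lr, m_add):
    """Mirror the arithmetic of the example: m = N - (Lr + m_add + 1).

    A positive m means m nodes are to be added; a negative m means
    |m| of the already-added nodes are to be deleted.
    """
    return n - (lr + m_add + 1)


# Values from the example: nodes 216 to 227 (N = 12), node sequence S of
# nodes 217 to 227 (Lr = 11), and three nodes already added (m_add = 3).
m = nodes_to_add_before_anchoring(n=12, lr=11, m_add=3)
print(m)  # -3: nodes 216 to 218 are deleted
```

With the same assumed form, the earlier step of the example (N = 11 for nodes 216 to 226, Lr = 8 for nodes 219 to 226, m_add = 1) yields m = 1, which matches the addition of node 217.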

By deleting, as necessary, nodes that have been added as presumed unknown nodes before anchoring in this way, misrecognition can be suppressed.

That is, before anchoring, a process is performed in which unknown nodes are added and nodes that were added as presumed unknown nodes are deleted as necessary. This process corresponds to step S316 in FIG. 40, described later.

At the time of anchoring as well, a process is performed in which unknown nodes are added and nodes that were added as presumed unknown nodes are deleted as necessary. This process corresponds to step S318 in FIG. 40, described later.

Next, the unknown node addition process will be described with reference to the flowchart of FIG. 40. This process is executed by the autonomous behavior learning device 10 when the agent autonomously recognizes a change in the environment and the internal model data needs to be expanded.

In step S311, the recognizer 35 sets the variable N to its initial value of 1.

In step S312, the recognizer 35 acquires time-series information of length N from the observation buffer 33 and the behavior output buffer 39.

In step S313, the recognizer 35 determines whether N has become equal to or greater than the threshold Thres on the number of unique nodes. If it is determined that N has not yet reached the threshold Thres, the process proceeds to step S321.

In step S321, the value of the variable N is incremented by 1, and the process returns to step S312.

On the other hand, if it is determined in step S313 that N has become equal to or greater than the threshold Thres, the process proceeds to step S314.

In step S314, the recognizer 35 executes the recognition process described above with reference to FIG. 34 or FIG. 36. In this case, however, the time-series information has already been acquired in step S312, so the recognition process is executed on that time-series information.

In step S315, the learning device 34 determines whether the last node of the node sequence was recognized as an unknown node as a result of the recognition process in step S314. If it is determined in step S315 that the node was recognized as an unknown node, the process proceeds to step S316.

In step S316, the learning device 34 adds or deletes nodes regarded as unknown nodes.

In step S316, nodes are added, for example, in the same way that node 201, regarded as an unknown node in FIG. 37, was added to the internal model data. Also, as described above, nodes that have been added as presumed unknown nodes are deleted before anchoring in order to suppress misrecognition.

On the other hand, if it is determined in step S315 that the node was not recognized as an unknown node, the process proceeds to step S317.

In step S317, the learning device 34 determines whether the last node of the node sequence was recognized as a known node as a result of the recognition process in step S314. If it is determined in step S317 that the node was recognized as a known node, the process proceeds to step S318.

In step S318, the learning device 34 and the recognizer 35 execute the addition or deletion necessity check process, which will be described later with reference to FIG. 41. As a result, for example, as described above with reference to FIG. 39, whether nodes need to be deleted at the time of anchoring is checked, and if deletion is necessary, nodes that were added as presumed unknown nodes are deleted.

In step S319, the learning device 34 performs anchoring. As a result, for example, the state transition probability from a known node to an unknown node and the like are set.

On the other hand, if it is determined in step S317 that the node was not recognized as a known node, the process proceeds to step S320.

In step S320, the recognizer 35 increments the value of the threshold Thres by 1.

That is, a determination in step S317 that the node was not recognized as a known node means that the recognition process withheld its result and output a command to extend the time-series information in the future direction. This is the case, for example, when the process of step S231 or step S293 described above with reference to FIG. 34 or FIG. 36 is performed. In this case, as described above with reference to FIG. 38, for example, the value of the threshold Thres must be incremented and the time-series information must be extended in the future direction.

Accordingly, after step S320, the process proceeds to step S321.

The unknown node addition process is executed in this way.
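The control flow of steps S311 to S321 can be sketched as follows. This is only an outline under stated assumptions: `get_series` and `recognize` are hypothetical callbacks standing in for the buffers and the recognizer 35, the result labels "unknown", "known", and "hold" are invented for the sketch, and the transitions after step S316 (back to step S321) and after step S319 (end of the process) are assumptions, since the text above does not state them.

```python
def unknown_node_addition(get_series, recognize, thres, max_steps=50):
    """Trace which of the steps S316, S318, S319, S320 are reached, and at which N."""
    trace = []
    n = 1                                   # S311: initialize N
    for _ in range(max_steps):
        series = get_series(n)              # S312: series of length N
        if n >= thres:                      # S313: enough data to recognize?
            result = recognize(series)      # S314: recognition process
            if result == "unknown":         # S315: last node unknown
                trace.append(("S316", n))   # add or delete presumed unknown nodes
            elif result == "known":         # S317: last node known
                trace.append(("S318", n))   # addition/deletion necessity check
                trace.append(("S319", n))   # anchoring
                return trace                # assumption: the process ends here
            else:                           # result withheld
                thres += 1                  # S320: raise the threshold
                trace.append(("S320", n))
        n += 1                              # S321: extend toward the future
    return trace


# Scripted recognition results, for illustration only.
scripted = iter(["unknown", "hold", "unknown", "known"])
trace = unknown_node_addition(
    get_series=lambda n: list(range(n)),
    recognize=lambda series: next(scripted),
    thres=3,
)
print(trace)
```

The trace makes it easy to see that provisional additions (S316) can occur several times before a single anchoring (S319) closes the episode.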

Next, a detailed example of the addition or deletion necessity check process in step S318 of FIG. 40 will be described with reference to the flowchart of FIG. 41.

In step S341, the recognizer 35 acquires time-series information of length N. That is, time-series information covering the length of time N since the agent recognized that it was located at an unknown node is acquired. In the example of FIG. 39, time-series information with the length corresponding to nodes 201 to 215 is acquired.

In step S342, the recognizer 35 executes the recognition process on the time-series information of length N, using the recognition process described above with reference to FIG. 34 or FIG. 36. In this case, however, the time-series information has already been acquired in step S341, so the recognition process is executed on that time-series information.

In step S343, the learning device 34 determines whether the last node of the node sequence (the latest node in time) was recognized as a known node as a result of the recognition process in step S342. If it is determined in step S343 that the node was not recognized as a known node, the process proceeds to step S344.

In step S344, the recognizer 35 decrements the length N of the time-series information by 1. The time-series information is thereby shortened from the past side. In the example of FIG. 39, the time-series information that had the length corresponding to nodes 201 to 215 becomes time-series information with the length corresponding to nodes 202 to 215.

In this way, the time-series information is shortened from the past side and the recognition process is repeated until it is determined in step S343 that the node was recognized as a known node.

If it is determined in step S343 that the node was recognized as a known node, the process proceeds to step S345. In the example of FIG. 39, as a result of the recognition process based on the time-series information with the length corresponding to nodes 203 to 215, nodes 203 to 215 are all recognized as having been known nodes. At this time, the number of nodes in the node sequence of nodes 203 to 215 is identified.

In step S345, the learning device 34 identifies the number of nodes and, taking that number as the length Lr of the node sequence S, performs the calculation described above with reference to equation (50).

In step S346, the learning device 34 determines whether there are nodes to be added (or deleted). If it is determined in step S346 that there are nodes to be added (or deleted), the process proceeds to step S347. If it is determined in step S346 that there are no nodes to be added (or deleted), step S347 is skipped.

In step S347, the learning device 34 adds (or deletes) the nodes determined in step S346 to require addition (or deletion). In the example of FIG. 39, equation (50) gives the number m of nodes to be added as Thres + 4 - (Thres + 2 + 3) = -1, so one of the three nodes already added, node 203, is deleted. That is, node 203 was added as an unknown node and, for example, a node with a new index was added to the internal model data, although it was in fact a known node; the node with the added index is therefore deleted from the internal model data.

The addition or deletion necessity check process is executed in this way.
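The loop of steps S341 to S345 can be sketched as below. The sketch makes several assumptions: `is_known_sequence` is a hypothetical stand-in for the recognition process, the original length N is kept separately from the shrinking window (one reading of the FIG. 39 arithmetic, in which N remains Thres + 4 while Lr becomes Thres + 2), and m is taken to be N - (Lr + m_add), the form implied by that arithmetic.

```python
def add_delete_check(series, is_known_sequence, m_add):
    """Return (m, lr) for the check performed at anchoring."""
    n = len(series)                  # S341: length since the first unknown node
    window = list(series)
    # S342 to S344: shorten from the past side until the window is recognized
    # as a sequence of known nodes (or nothing is left).
    while window and not is_known_sequence(window):
        window = window[1:]
    lr = len(window)                 # S345: length Lr of the node sequence S
    m = n - (lr + m_add)             # assumed form of the calculation
    return m, lr                     # S346/S347: the caller adds or deletes |m| nodes


# Illustration with the node numbers of FIG. 39: the hypothetical recognizer
# accepts any window that starts at node 203 or later.
series = list(range(201, 216))       # nodes 201 to 215
m, lr = add_delete_check(series, lambda w: w[0] >= 203, m_add=3)
print(m, lr)  # -1 13
```

Returning m rather than modifying the model inside the function keeps the check separate from the actual addition or deletion of step S347.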

When a new situation is encountered that cannot be represented by the internal model data obtained through learning so far, the number of nodes must be increased to represent the situation and resolve it. For example, if the maze in which the robot moves is extended in a certain direction, the number of nodes increases, so the value of the number of nodes N needs to be increased.

In the conventional technique, as soon as a new node was detected, the internal model data was expanded on the spot and an index representing the new node was added.

In general, however, when a new experience is incorporated, the most important question is how that experience relates to the existing structure, and immediately after a new node is detected, for example, its relationship with the existing structure is often not sufficiently clear.

Hastily adding an index representing a new node to the internal model data may therefore lead to later misrecognition. For example, when new nodes are detected in succession, each new node can define a relationship only with the immediately preceding state, and the longer such a chain continues, the more rapidly its relationship to the existing structure becomes obscure. Moreover, even if learning by the additional learning method is performed on such internal model data, the number of parameters to be adjusted during learning becomes enormous.

In the present invention, therefore, as described above, a predetermined number of unknown nodes are added at a predetermined timing, and learning by the additional learning method is performed on the basis of the internal model data immediately after anchoring. This makes sufficiently effective learning possible not only when new nodes appear sporadically among known nodes but also in difficult environments in which new nodes are detected continuously over a long period.

As described above, the agent can autonomously recognize a change in the environment and expand the state transition probability table and the observation probability table. In doing so, the values of the state transition probabilities, observation probabilities, and the like to be set in the expanded region of each table must be determined.

Examples of expanding each table were described with reference to FIGS. 30 to 32. Here, a method of estimating and setting the state transition probabilities and the like for nodes in the expanded region from the state transition probabilities already stored will be described.

For example, it was explained that when the state transition probability table needs to be expanded as shown in FIG. 31, it must be normalized so that the probabilities in each row of the table sum to 1. In other words, the process described above for the example of FIG. 31 did not take the already stored state transition probabilities from known nodes to known nodes into account when setting state transition probabilities in the added region. However, it is foreseeable that transitions from a plurality of known nodes to an unknown node added to the internal model data can occur.

For example, when a part A of the maze is replaced with another part B, part C, which was adjacent to part A, becomes connected to part B. In such a case, when the robot executes the action for moving from part C to part A, it is likely to move to part B. Likewise, when the robot executes, at part B, the action for moving from part A to part C, it is likely to move to part C. In this example, the HMM node corresponding to part B needs to be newly added as an unknown node, and its state transition probabilities with the known node corresponding to part C should be set with the above in mind.

Accordingly, if the state transition probabilities and the like between an unknown node and known nodes can be set on the basis of the already stored state transition probabilities from known nodes to known nodes, they can be set more appropriately. In other words, setting the state transition probabilities between an unknown node and known nodes on the basis of past experience allows them to be set more appropriately.

Note that the known nodes referred to here also include, for example, nodes that were regarded as unknown nodes and have already been added to the internal model data while the robot was moving through the maze.

The values of the state transition probabilities set in this way have to be determined in consideration of the following patterns.

That is, when a transition from node s_i to node s_j actually occurs, it is necessary to consider whether each of node s_i and node s_j is a known node or a newly added unknown node.

Specifically, three patterns have to be considered: a transition from a known node to an unknown node, a transition from an unknown node to an unknown node, and a transition from an unknown node to a known node.

For example, when the state transition probability table is expanded, the state transition probabilities from known nodes to unknown nodes have to be set in regions 301-1 to 301-3 shown in FIG. 42. The state transition probabilities from unknown nodes to unknown nodes have to be set in regions 303-1 to 303-3, and the state transition probabilities from unknown nodes to known nodes have to be set in regions 302-1 to 302-3.

Further, as described above, all the values in each row (for example, the n-th row) of the state transition probability table sum to 1, so the probabilities in the region labeled "existing states" in FIG. 42 also have to be set anew.

The case shown in FIG. 43 is described as an example.

Suppose that, when the action corresponding to rightward movement in the figure is executed at the transition source node 321, which is a known node, the state transition probability table predicts node 322 or node 323 as the likely transition destination. Suppose, however, that when the action is actually executed at node 321, the agent arrives at node 324. In this case, node 324 is an unknown node.

In the example of FIG. 43, an observation symbol corresponding to part 5 of FIG. 2 is observed at node 321, an observation symbol corresponding to part 12 of FIG. 2 at node 322, and an observation symbol corresponding to part 6 of FIG. 2 at node 323.

In FIG. 43, the reference numerals 321 to 324 are attached to the rectangles representing maze parts, but they actually denote the nodes at which the observation symbols corresponding to those parts are observed. That is, the agent was able to uniquely recognize nodes 321 to 323 on the basis of the learned internal model data, whereas node 324 is recognized as an internal state (node) that has not been stored so far.

In other words, the agent expected that moving rightward in the figure from node 321 would lead to the upward corner (node 322) or the downward corner (node 323).

However, when the agent actually moved rightward from node 321, it came out at a crossroads (node 324). That is, an observation symbol corresponding to part 15 of FIG. 2 is observed at node 324.

Such a situation arises, for example, when the part placed at the position corresponding to node 321 in the maze has been replaced. In such a case, node 324 is considered a node that did not exist in the internal model data until now, so at least node 324 has to be added to the internal model data.

In such a case, a new index corresponding to node 324 is generated, and a row and a column are added to the state transition probability table. Accordingly, the state transition probability from node 321 to node 324 has to be set in the state transition probability table corresponding to the rightward action. The timing at which the new index is actually generated and the row and column are added is as described above with reference to FIGS. 37 to 41.

This state transition probability is set, for example, to the sum of the state transition probabilities from node 321 to node 322 and to node 323 divided by 3. The state transition probabilities from node 321 to node 322 and to node 323 are then rescaled proportionally, weighted according to their respective values.

The candidate nodes s_j^l (l = 1, ..., L) to which a transition can occur from node 321 (call it node s_i) by the rightward action (call it action k′) are obtained, for example, by listing the transition destination nodes s_j whose state transition probability a_ij(k′) is equal to or greater than a threshold.

In the example of FIG. 43, two nodes, node 322 and node 323, to which a transition can occur from node 321 by the rightward action, are listed as candidate nodes. In this case, the value of L is 2.

Let the unknown node 324 be denoted s_new. The state transition probability a_inew(k′) from each known node s_i to s_new for action k′ is set to 1/L.

In the example of FIG. 43, the state transition probability from node 321 to node 324 is set to 1/2.

The state transition probability a_inew(k′) is set in one of regions 301-1 to 301-3 in the example of FIG. 42.

Then normalization is performed so that the state transition probabilities in each row of the state transition probability table for action k′ sum to 1. That is, each value in a row in which a non-zero value has been set as a_inew(k′) is multiplied by L/(L+1).

However, when no transition destination node has a state transition probability a_ij(k′) equal to or greater than the threshold, a_inew(k′) is set to approximately 1 and the normalization described above is performed.
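The update of the known-to-unknown entry can be sketched as follows. This is a minimal illustration in plain Python, not the patent's implementation: the function name `add_unknown_node`, the list-of-rows table layout, and the default threshold are assumptions made for the example, and the row for node s_i is assumed to sum to 1 before the update. After the multiplication by L/(L+1), the new entry equals 1/(L+1), which is 1/3 in the example of FIG. 43.

```python
def add_unknown_node(table, i, threshold=0.1):
    """Grow an N x N table (list of rows) for one action to (N+1) x (N+1)
    and register the observed transition from known node i to the new node.
    The new node gets index N; its own row is left at zero here."""
    n = len(table)
    grown = [list(row) + [0.0] for row in table] + [[0.0] * (n + 1)]
    # Candidate destinations: known nodes whose probability reaches the threshold.
    candidates = [j for j in range(n) if table[i][j] >= threshold]
    L = len(candidates)
    if L > 0:
        grown[i][n] = 1.0 / L
        scale = L / (L + 1.0)        # the row summed to 1, now sums to (L+1)/L
    else:
        grown[i][n] = 1.0            # no plausible known destination
        scale = 1.0 / sum(grown[i])
    grown[i] = [p * scale for p in grown[i]]
    return grown, candidates


# Hypothetical rightward-action table: node 0 leads to node 1 or node 2.
right = [[0.0, 0.5, 0.5],
         [0.0, 1.0, 0.0],
         [0.0, 0.0, 1.0]]
grown, candidates = add_unknown_node(right, 0)
```

Only row i is rescaled, so the rows of the other known nodes keep their stored values.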

Note that the state transition probability of a transition to node 324 caused by executing an action other than action k′ at node 321 only needs to be set to a minute value close to 0, so there is no need to renormalize the rows of the state transition probability table so that they sum to 1.

Further, as indicated by the arrows in FIG. 44, four actions (up, down, left, and right) can be executed at node 324, which is a crossroads, to make a transition to another node. Accordingly, the state transition probabilities from node 324 to each known node also have to be set in the state transition probability tables corresponding to the up, down, left, and right actions. These state transition probabilities are set in one of regions 302-1 to 302-3 in the example of FIG. 42. When transitions from an unknown node to an unknown node are possible, one of regions 303-1 to 303-3 in the example of FIG. 42 is involved as well.

For example, as the state transition probabilities from node 324 to each known node in the table for the upward action, the state transition probabilities from node 322 to each known node are copied. This is because node 322 is an upward corner and, among the nodes reachable from node 321 by the rightward action, is the only one from which the upward action can lead to another known node. The state transition probabilities from node 322 to each known node are left unchanged.

Similarly, as the state transition probabilities from node 324 to each known node in the table for the downward action, the state transition probabilities from node 323 to each known node are copied. This is because node 323 is a downward corner and, among the nodes reachable from node 321 by the rightward action, is the only one from which the downward action can lead to another known node. The state transition probabilities from node 323 to each known node are left unchanged.

Further, the state transition probabilities from node 324 to each known node in the table for the leftward action are set to the average of the state transition probabilities from node 322 to each known node and those from node 323 to each known node. This is because nodes 322 and 323 are the nodes, among those reachable from node 321 by the rightward action, from which the leftward action can lead to another known node. The state transition probabilities from nodes 322 and 323 to each known node are left unchanged.

The state transition probabilities from node 324 to each known node in the table for the rightward action are set, for example, to uniform values. This is because, as shown in FIG. 45, there is no other candidate node from which the rightward action can lead to another known node.
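The copy, average, and uniform rules for the rows of the new node can be sketched together as below. This is an illustrative sketch only: the function name `init_rows_from_new_node`, the dictionary layout keyed by action name, the `donors` mapping, and the renormalization of the averaged row are assumptions of the example. The new node is assumed to occupy the last index of each table.

```python
def init_rows_from_new_node(tables, new, donors):
    """Fill row `new` of every per-action table and return modified copies.

    tables: dict, action name -> square table (list of rows) that already
            contains an all-zero row and column for the new node `new`.
    donors: dict, action name -> known nodes whose rows are used as the
            template for that action (an empty list means "no template").
    """
    out = {}
    for action, table in tables.items():
        table = [list(row) for row in table]
        template = donors.get(action, [])
        # One donor: plain copy. Several donors: element-wise average.
        row = [sum(table[d][j] for d in template) / len(template) if template else 0.0
               for j in range(new)]
        total = sum(row)
        if total == 0:
            # No usable template: uniform over the known nodes.
            row, total = [1.0] * new, float(new)
        table[new] = [p / total for p in row] + [0.0] * (len(table) - new)
        out[action] = table
    return out


# Hypothetical indices: 0 = node 321, 1 = node 322, 2 = node 323, 4 = node 324.
donors = {"up": [1], "down": [2], "left": [1, 2], "right": []}
```

Because only the row of the new node is written, the donor rows stay exactly as stored, which matches the statement that the probabilities of nodes 322 and 323 are not changed.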

Furthermore, the state transition probabilities from each known node other than node 321 to node 324 also have to be set.

Since node 324 is a crossroads, a transition from another node to node 324 can be caused by any of the up, down, left, and right actions. That is, there should be a transition source node that reaches node 324 by the upward action and a transition source node that reaches node 324 by the downward action. Likewise, there should be a transition source node that reaches node 324 by the leftward action and one that reaches it by the rightward action.

In this case, the transition source nodes have to be identified, together with the action that, when executed at each of them, causes the transition to node 324. That is, the backward transition actions to the unknown node have to be identified.

First, nodes similar to node 324 are extracted in order to obtain information on which to base the estimation of the transition source nodes. A node similar to node 324 can be described as a node at which the agent could plausibly be located, supposing that the agent is currently at a node other than node 324.

For example, consider a maze that contains several structurally similar portions. Suppose the agent is on a given part within one of those portions. In that case, the agent may actually be on the corresponding part in a portion other than the one it recognized. Nodes similar to the node recognized by the agent can be extracted in this sense.

Similar nodes can be identified by n-step state recognition using time-series information for the past n steps.

Estimating the current node, or computing the probability that the agent is at each node at the current time t, by using the sequence of actions for the past n steps, c_{t-n}, ..., c_{t-1}, and the sequence of observation symbols for the past n+1 steps, o_{t-n}, ..., o_t, is referred to as "n-step state recognition."

In n-step state recognition, the prior probability π_i corresponding to the node with index i (i = 1, ..., N) is first set, for example, by a predetermined method.

The recognizer 35 then computes the probability δ_{t-n}(i) that the agent is at each node at time t-n by equation (52).

... (52)

The recognizer 35 then computes the probability δ_τ(i) that the agent is at each node, in the order of time τ = t-n+1, ..., t, by the recurrence formula of equation (53).

... (53)

Alternatively, the computation of equation (54) may be performed in place of equation (53).

... (54)

The recognizer 35 further normalizes the probability δ_t(i) that the agent is at each node at the final time t in equation (53) or equation (54), thereby computing the state probability δ′_t(i) for each node at time t by equation (55).

... (55)

Each node whose state probability obtained by equation (55) is equal to or greater than a predetermined threshold is treated as a similar node.

In n-step state recognition, the sequences of actions and observation symbols for the past n steps are used. When n is set to 0, every node at which the observation symbol o_t is observed with a probability equal to or greater than a predetermined threshold becomes a similar node. As n is increased, the number of similar nodes usually decreases. The value of n in n-step state recognition is assumed to be preset to a value suitable for, for example, the estimation performed in the present invention.
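Equations (52) to (55) are not reproduced in this text, so the following sketch substitutes the standard recursions for an action-conditioned HMM: initialization with π_i times the observation probability, a max-product (Viterbi-style) or sum-product (forward-style) update per step, and a final normalization. Which of the two updates corresponds to equation (53) and which to equation (54) is an assumption, as are the function names, the argument layout, and the observation-probability table `B`.

```python
def n_step_state_recognition(pi, A, B, actions, observations, use_max=True):
    """Return the normalized state probability of each node at time t.

    pi[i]       : prior probability of node i
    A[k][i][j]  : probability of moving from node i to node j under action k
    B[i][o]     : probability of observing symbol o at node i
    actions     : c_{t-n}, ..., c_{t-1}   (n items)
    observations: o_{t-n}, ..., o_t       (n + 1 items)
    """
    n_nodes = len(pi)
    combine = max if use_max else sum
    # Initialization at time t-n.
    delta = [pi[i] * B[i][observations[0]] for i in range(n_nodes)]
    # One update per past step, oldest first.
    for c, o in zip(actions, observations[1:]):
        delta = [combine(delta[i] * A[c][i][j] for i in range(n_nodes)) * B[j][o]
                 for j in range(n_nodes)]
    total = sum(delta)
    return [d / total for d in delta] if total > 0 else delta


def similar_nodes(state_prob, threshold):
    """Indices of the nodes whose state probability reaches the threshold."""
    return [i for i, p in enumerate(state_prob) if p >= threshold]
```

With an empty action list (n = 0) the result depends only on the prior and on how likely each node is to emit o_t, which mirrors the remark about n = 0 above.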

Once the similar nodes are obtained, the actions that, when executed at those nodes, can cause a transition to another node are identified. For example, since node 324 is a crossroads, a node similar to node 324 is also likely to be a crossroads. At such a similar node, a transition to another node is possible by the up, down, left, and right movement actions.

Then the known nodes from which those actions can cause a transition to another node are identified. For example, at node 322, a known node reachable from node 321 by the rightward action, executing the leftward action or the upward action can cause a transition to another node. Similarly, at node 323, a known node reachable from node 321 by the rightward action, executing the leftward action or the downward action can cause a transition to another node.

It can then be assumed that the unknown node 324 can be reached from each of the transition destination nodes reached by executing the leftward and upward actions at node 322, by the actions opposite to leftward and upward. In this case, the rightward and downward actions are the opposite actions (backward transition actions).

Likewise, it can be assumed that the unknown node 324 can be reached from each of the transition destination nodes reached by executing the leftward and downward actions at node 323, by the actions opposite to leftward and downward. In this case, the rightward and upward actions are the opposite actions (backward transition actions).

A backward transition action can be estimated, for example, as follows. When action c_z causes a transition from node s_a to node s_b, the action c_z′ that causes the backward transition, that is, the transition from node s_b to node s_a, is estimated.

In estimating the backward transition action, the recognizer 35 identifies the known nodes that are similar nodes, as described above. Each of the known nodes identified here is denoted s_j^q (q = 1, ..., Q).

Then, for each node s_j^q, the recognizer 35 extracts the transition source nodes from which action c_z causes a transition to s_j^q. For this purpose, for example, the nodes s_i whose state transition probability a_ij^q(z) is equal to or greater than a threshold may be listed.

The recognizer 35 then computes, by equation (56), the average value a*(k) of the state transition probabilities from node s_j^q to node s_i^{q,l} over all combinations (s_j^q, s_i^{q,l}) (q = 1, ..., Q; l = 1, ..., L_q).

... (56)

By selecting, among the average values a*(k) obtained in this way, those equal to or greater than a threshold and identifying the actions c_k corresponding to them, the backward transition actions c_z^r′ (r = 1, ..., R) can be identified.
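The estimation of backward transition actions can be sketched as follows. Equation (56) is not reproduced in this text, so the sketch assumes that a*(k) is the plain arithmetic mean over all (s_j^q, s_i^{q,l}) pairs, as the surrounding description suggests. The function name `estimate_backward_actions`, the nested-list layout `A[k][i][j]`, and both default thresholds are assumptions of the example.

```python
def estimate_backward_actions(A, z, similar, src_threshold=0.5, back_threshold=0.5):
    """Estimate which actions undo action z.

    A[k][i][j]: probability of moving from node i to node j under action k
    z         : index of the forward action c_z
    similar   : indices of the known similar nodes s_j^q
    Returns (list of backward action indices, list a_star with one value per action).
    """
    # Pairs (j, i): node i reaches similar node j under the forward action z.
    pairs = [(j, i) for j in similar
             for i in range(len(A[z])) if A[z][i][j] >= src_threshold]
    if not pairs:
        return [], [0.0] * len(A)
    # Mean probability of going back from j to i under each action k.
    a_star = [sum(A[k][j][i] for j, i in pairs) / len(pairs) for k in range(len(A))]
    return [k for k, a in enumerate(a_star) if a >= back_threshold], a_star


# Hypothetical corridor 0 - 1 - 2 with action 0 = right and action 1 = left.
right = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
left = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
backward, a_star = estimate_backward_actions([right, left], z=0, similar=[1])
```

In this corridor the only way to arrive at node 1 by moving right is from node 0, and the action that returns from node 1 to node 0 is the leftward one, so the leftward action is selected.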

Assuming that executing the backward transition action at the transition source nodes identified in this way causes a transition to node 324, the state transition probabilities can be set by the same operation as for setting the state transition probability from node 321 to node 324 described above.

Seen this way, when the row and column corresponding to the index of a node regarded as an unknown node are added to the state transition probability table, all the state transition probabilities in the regions shown in FIG. 42 have to be reset.

That is, when the row and column corresponding to the index of a node regarded as an unknown node are added to the state transition probability table, the actions that can be executed at the unknown node and the transition destination nodes reachable by those actions have to be identified. The pairs of identified actions and transition destination nodes determine the corresponding positions in the state transition probability table, so the state transition probability values are set at those positions and each value in the affected rows is normalized.

Likewise, the transition source nodes from which a transition to the unknown node can occur, and the actions that cause the transition from those nodes to the unknown node, have to be identified. The pairs of identified actions and transition source nodes determine the corresponding positions in the state transition probability table, so the state transition probability values are set at those positions and each value in the affected rows is normalized.

Accordingly, when the agent autonomously recognizes a change in the environment and expands the state transition probability table as described above, the process of setting the state transition probability values in the expanded regions can ultimately be executed, for example, by the procedure shown in FIG. 46.

FIG. 46 is a flowchart explaining the state transition probability setting process performed when a node is added. This process is executed, for example, when the agent autonomously recognizes a change in the environment and adds an unknown node to the state transition probability table and the like.

Here, it is assumed that the unknown node s_new is added to the internal model data, that the node at which the agent was located immediately before the transition to s_new is node s_i′, and that the agent made the transition to s_new as a result of executing action c_k′ at node s_i′.

In step S401, the recognizer 35 executes the node backward action pair list generation process, which is described later with reference to the flowchart of FIG. 47.

As a result, the transition source nodes for the unknown node are identified, together with the backward transition actions to the unknown node.

In step S402, the learner 34 executes the backward action state transition probability setting process, which is described later with reference to the flowchart of FIG. 48.

As a result, the state transition probabilities of a transition to the unknown node caused by executing the backward transition action at the transition source nodes identified in step S401 are set. The values in each row of the state transition probability table are also normalized in accordance with the newly set state transition probabilities.

In step S403, the recognizer 35 executes the node forward action pair list generation process, which is described later with reference to the flowchart of FIG. 49.

As a result, the transition destination nodes from the unknown node are identified, together with the forward transition actions that cause the transition from the unknown node to those destination nodes.

In step S404, the learner 34 executes the forward action state transition probability setting process, which is described later with reference to the flowchart of FIG. 50.

As a result, the state transition probabilities of a transition to the transition destination nodes caused by executing, at the unknown node, the forward transition actions identified in step S403 are set. The values in each row of the state transition probability table are also normalized in accordance with the newly set state transition probabilities.

Next, the node backward action pair list generation process in step S401 of FIG. 46 is described in detail with reference to the flowchart of FIG. 47.

In step S421, the recognizer 35 extracts the candidate nodes s_j^l (l = 1, ..., L) to which a transition can occur when action c_k′ is executed at node s_i′. The candidate nodes s_j^l are obtained, for example, by listing the transition destination nodes s_j whose state transition probability a_i′j(k′) is equal to or greater than a threshold.

In step S422, the recognizer 35 performs n-step state recognition using time-series information for the past n steps.

In step S423, on the basis of the result of step S422, the recognizer 35 extracts the known nodes that are similar to node s_new. Each of the known nodes identified here is denoted s_j^q (q = 1, ..., Q). The nodes similar to s_new are extracted by performing the computations of equations (52) to (55) described above.

In step S424, the recognizer 35 extracts the effective actions of the similar nodes extracted in step S423.

Here, an effective action means an action that, when executed at one of the similar nodes described above, can cause a transition to another node.

ステップＳ４２４では、例えば、アクション毎の評価値E_kが、式（５７）により演算される。なお、この演算は、個々のアクションに対応してそれぞれ行われ、１のアクションに対して１の評価値が得られることになる。 In step S424, for example, the evaluation value E _{k for} each action is calculated by Expression (57). This calculation is performed for each action, and one evaluation value is obtained for one action.

・・・（５７）

... (57)

Here, a_jx^q(k) (q = 1, ..., Q; x = 1, ..., N) is the state transition probability of transitioning to node s_x when action c_k is executed at node s_j^q (q = 1, ..., Q).

Each action k whose evaluation value calculated by equation (57) is equal to or greater than a threshold is then selected as a candidate effective action.

Further, for each selected action k, the state transition probabilities a_jx^q(k) are checked to determine whether at least one pair (q, x) exists for which a_jx^q(k) is equal to or greater than a threshold. If no such pair (q, x) exists, the action k is excluded from the candidate effective actions.

In this way, the effective actions c_k^r (r = 1, ..., R) are extracted in step S424.
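The second check of step S424 (keeping an action only if at least one pair (q, x) has a sufficiently large state transition probability) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the nested-list layout `trans[k][j][x]`, the function name, and the example values are hypothetical, and the preceding selection by equation (57) is assumed to have already produced `candidate_actions`.

```python
def filter_effective_actions(candidate_actions, trans, similar_nodes, threshold):
    """Keep action k only if some similar node has a transition >= threshold.

    trans[k][j][x] is assumed to hold the probability of moving from
    node j to node x when action k is executed (hypothetical layout).
    """
    effective = []
    for k in candidate_actions:
        has_strong_transition = any(
            p >= threshold
            for j in similar_nodes      # index q over the similar nodes
            for p in trans[k][j]        # index x over all destination nodes
        )
        if has_strong_transition:
            effective.append(k)
    return effective


# Hypothetical 3-node model with two actions.
trans = [
    [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.3, 0.3, 0.4]],  # action 0
    [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]],  # action 1
]
effective = filter_effective_actions([0, 1], trans, similar_nodes=[0, 1], threshold=0.8)
```

With these example values, action 1 has no transition of probability 0.8 or more from either similar node, so only action 0 remains.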

In step S425, the recognizer 35 extracts, from the candidate nodes s_j^l extracted in step S421, those that have an action c_k^r extracted in step S424 as an effective action. That is, among the candidate nodes, the nodes s_j^ru (u = 1, ..., U_r) that have the same effective actions as the similar nodes are extracted.

In step S425, for example, an evaluation value E_lr is calculated for each node s_j^l by equation (58). The calculation is performed for each node s_j^l and for each action c_k^r executed at that node, so that one evaluation value is obtained per combination of node and action.

... (58)

Equation (58) is calculated for the case where the action c_k specified by the variable r is executed at the candidate node of index j specified by the variable l. The action k (or c_k) of the state transition probability on the right-hand side of equation (58) is the one specified by the variable r on the left-hand side.

Thus, in step S425, the nodes whose evaluation value calculated by equation (58) is equal to or greater than a threshold are extracted as the nodes s_j^ru.

In step S426, the recognizer 35 generates pairs (s_j^ru, c_k^r) of the nodes extracted in step S425 and the effective actions extracted in step S424, and identifies the transition destination nodes determined by each pair.

For example, the state transition probabilities a_jl^ru(k) (l = 1, ..., N) for executing action c_k^r at node s_j^ru are checked, and the transition destination nodes s_l^q (q = 1, ..., Q_ru) corresponding to state transition probabilities that exceed a threshold are identified.
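The identification of transition destination nodes in step S426 amounts to thresholding one row of the state transition probability table. The following Python sketch illustrates this; the table layout `trans[k][j][l]`, the function name, and the numbers are assumptions made for illustration only.

```python
def transition_destinations(trans, node, action, threshold):
    """Return indices of nodes reachable from `node` under `action`
    with a state transition probability strictly above `threshold`."""
    row = trans[action][node]  # probabilities over all destination nodes
    return [dest for dest, p in enumerate(row) if p > threshold]


# Hypothetical table: trans[k][j][l] for 2 actions and 3 nodes.
trans = [
    [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.3, 0.3, 0.4]],  # action 0
    [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]],  # action 1
]
dests = transition_destinations(trans, node=0, action=0, threshold=0.5)
```

The strict comparison follows the wording "exceed a threshold" used for this step; other steps in the description use "equal to or greater than".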

In step S427, the recognizer 35 estimates the backward transition actions for action c_k^r at node s_j^ru, that is, the actions for transitioning from node s_l^q back to node s_j^ru. The backward transition actions estimated here are denoted c_ruq^v (v = 1, ..., V_ruq). However, this estimation is not performed when the transition destination node is the node s_i′.

The recognizer 35 then generates pairs (s_l^q, c_ruq^v) (l = 1, ..., L; r = 1, ..., R; u = 1, ..., U_r; q = 1, ..., Q_ru; v = 1, ..., V_ruq) of the transition destination nodes identified in step S426 and the backward transition actions.

In step S428, the recognizer 35 adds (s_i′, c_k′) to the pairs (s_l^q, c_ruq^v) generated in step S427, eliminates duplicates, and generates pairs (s_i^x, c_k^x) (x = 1, ..., X) of a transition source node to the unknown node and a backward transition action. Each of these pairs is then listed.
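The list construction of step S428 can be sketched as an order-preserving de-duplication. In this Python sketch the representation of a pair as a `(node_index, action_index)` tuple and the function name are assumptions, not part of the described apparatus.

```python
def build_reverse_pair_list(estimated_pairs, observed_pair):
    """Append the actually observed pair (s_i', c_k') to the estimated
    pairs and drop duplicates while preserving first-seen order."""
    seen = set()
    result = []
    for pair in list(estimated_pairs) + [observed_pair]:
        if pair not in seen:
            seen.add(pair)
            result.append(pair)
    return result


# Hypothetical (node, action) index pairs; (2, 0) appears twice and
# the observed pair (3, 1) is already among the estimated ones.
pairs = build_reverse_pair_list([(2, 0), (3, 1), (2, 0)], observed_pair=(3, 1))
```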

In this way, the node reverse action pair list generation process is executed.

Based on the pairs obtained by the process of FIG. 47, it is assumed that the transition to node s_new occurred by executing action c_k^x at node s_i^x, and the process of step S402 in FIG. 46 is executed.

Next, a detailed example of the reverse action state transition probability setting process of step S402 in FIG. 46 will be described with reference to the flowchart of FIG. 48.

For example, suppose it is assumed that a transition from the transition source node s_i to node s_new occurred by action c_k.

In step S441, the learning device 34 extracts candidate nodes to which a transition from node s_i can occur by action c_k. The candidate nodes s_j^l (l = 1, ..., L) may be obtained, for example, by listing the transition destination nodes s_j for which the state transition probability a_ij(k) is equal to or greater than a threshold.

In step S442, the learning device 34 sets the state transition probability to the unknown node and normalizes.

For example, the state transition probability a_inew(k) from each candidate node s_i to s_new for action c_k is set to 1/L. The table is then normalized so that the state transition probabilities in each row of the state transition probability table for action c_k sum to 1. That is, each value in a row in which a nonzero value has been set as a_inew(k) is multiplied by L/(L+1).

However, if the processing of step S411 finds no transition destination node for which the state transition probability a_ij(k) is equal to or greater than the threshold, the state transition probability is set to a_inew(k) ≈ 1 and the normalization described above is performed.

In this way, the reverse action state transition probability setting process is executed.
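The arithmetic of steps S441 and S442 can be checked with a short Python sketch. It assumes, hypothetically, that one row of the table for action c_k is held as a list summing to 1 and that the new node s_new is appended as the last entry; the function name and values are illustrative. Dividing by the new row total is the same as multiplying by L/(L+1), because the total after adding 1/L is (L+1)/L.

```python
def set_reverse_transition(row, threshold):
    """Extend a row a_ij(k) with an entry for s_new and renormalize.

    L counts destinations with probability >= threshold (step S441).
    The new entry starts at 1/L, or at 1 when no destination qualifies,
    and the row is rescaled so that it sums to 1 again (step S442).
    """
    L = sum(1 for p in row if p >= threshold)
    new_prob = 1.0 / L if L > 0 else 1.0
    extended = list(row) + [new_prob]
    total = sum(extended)
    return [p / total for p in extended]


# Two destinations pass the threshold, so L = 2 and a_inew(k) starts at 1/2.
new_row = set_reverse_transition([0.5, 0.5, 0.0], threshold=0.3)
```

For this example every entry is scaled by 2/3, which leaves 1/3 as the probability of transitioning to s_new.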

Next, a detailed example of the node forward action pair list generation process of step S403 in FIG. 46 will be described with reference to the flowchart of FIG. 49.

In step S461, the recognizer 35 extracts the transition destination nodes s_l^q (q = 1, ..., Q_ru) in the same manner as in step S426 of FIG. 47. That is, pairs of candidate nodes and effective actions are generated, and the transition destination node corresponding to each pair is identified.

In step S462, the recognizer 35 generates pairs of the transition destination nodes s_l^q (q = 1, ..., Q_ru) obtained in step S461 and the actions c_k^r (r = 1, ..., R) for transitioning to those transition destination nodes.

In step S463, the recognizer 35 eliminates duplicates among the pairs obtained in step S462 and generates pairs (s_j^y, c_k^y) (y = 1, ..., Y). Each pair of a transition destination node and an action for transitioning to it is then listed.

In this way, the node forward action pair list generation process is executed.

Based on the pairs obtained by the process of FIG. 49, it is assumed that a transition to node s_j^y occurs by executing action c_k^y at node s_new, and the process of step S404 in FIG. 46 is executed.

Next, a detailed example of the forward action state transition probability setting process of step S404 in FIG. 46 will be described with reference to the flowchart of FIG. 50.

In step S481, the learning device 34 initializes all the state transition probabilities a_newj(k) (j = 1, ..., N; k = 1, ..., K) to very small values.

In step S482, the learning device 34 sets state transition probabilities using the pairs (s_j^y, c_k^y) obtained by the process of FIG. 48. That is, the state transition probability a_newj^y(k) of transitioning to node s_j^y by executing action c_k^y at node s_new is set to 1.

In step S483, the learning device 34 normalizes so that Σ_j a_newj(k) = 1 (k = 1, ..., K) is satisfied.

In this way, the forward action state transition probability setting process is executed.
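Steps S481 to S483 can be sketched in Python as follows. The layout `table[k][j]` for a_newj(k), the value of the small constant `eps`, and the function name are assumptions chosen for illustration; the description only states that the initial values are very small.

```python
def init_forward_transitions(num_nodes, num_actions, pairs, eps=1e-6):
    """Build the transition rows out of the new node s_new.

    table[k][j] approximates a_newj(k). `pairs` holds hypothetical
    (destination_node, action) index pairs from the pair list.
    """
    # Step S481: initialize every entry to a very small value.
    table = [[eps] * num_nodes for _ in range(num_actions)]
    # Step S482: set the listed transitions to 1.
    for j, k in pairs:
        table[k][j] = 1.0
    # Step S483: normalize each action's row so that it sums to 1.
    for k in range(num_actions):
        total = sum(table[k])
        table[k] = [p / total for p in table[k]]
    return table


# Three nodes, two actions; action 0 is assumed to lead from s_new to node 1.
table = init_forward_transitions(3, 2, pairs=[(1, 0)])
```

In this sketch an action with no listed pair keeps a uniform row after normalization, which expresses that nothing is yet known about where that action leads from s_new.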

The above example described the case where the agent autonomously recognizes a change in the environment and adds an unknown node to the state transition probability table; along with this, the unknown node must also be added to the observation probability table. The observation probability table can be updated in this case by, for example, performing the processing described above as the processing carried out by the learning device 34 when the observation probability table needs to be expanded as shown in FIG. 31.

Of course, the table of frequency variables for estimating the state transition probabilities and the table of frequency variables for estimating the observation probabilities are also updated along with the processing described above with reference to FIG. 46.

Next, the setting of state transition probabilities when anchoring is performed will be described.

As described above, anchoring is a process of setting, when a transition to a known node is recognized, the state transition probabilities and the like between a node regarded as an unknown node and the known node.

In other words, when action c_k′ is executed at unknown node s_i′ and a transition to known node s_j′ occurs, anchoring is performed if no node s_j exists for which the state transition probability a_i′j(k′) (j = 1, ..., N) in the state transition probability table of the internal model data is equal to or greater than a threshold. That is, anchoring is performed when a transition from a node regarded as an unknown node to a known node has been confirmed and transitions from that unknown node to nodes other than that known node are unlikely to occur.

In anchoring, the state transition probability from unknown node s_i′ to known node s_j′ by action c_k′ is set. For example, as described above with reference to FIG. 46, each time a node regarded as an unknown node is added to the internal model data, the state transition probabilities from that unknown node to known nodes are estimated and set. When a transition from the unknown node to a known node actually occurs, however, anchoring is performed.

The anchoring process will now be described with reference to the flowchart of FIG. 51. This process is executed, for example, as the process of step S319 in FIG. 40.

In step S501, the learning device 34 sets the state transition probability corresponding to the transition to be anchored to 1. In the above example, the state transition probability a_i′j′(k′) is set to 1.

In step S502, the learning device 34 normalizes the values of the state transition probability table so that Σ_j a_i′j(k′) becomes 1.
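The anchoring condition and steps S501 and S502 can be combined in a short Python sketch. It assumes, for illustration only, that the row a_i′j(k′) for the unknown node and the executed action is a plain list; the function name, the return convention, and the numbers are hypothetical.

```python
def anchor(row, target, threshold):
    """Anchor the observed transition to node `target` if warranted.

    row[j] approximates a_i'j(k'). Returns (anchored, new_row).
    Anchoring is skipped when some entry already reaches the threshold.
    """
    if any(p >= threshold for p in row):
        return False, list(row)
    updated = list(row)
    updated[target] = 1.0                      # step S501
    total = sum(updated)
    return True, [p / total for p in updated]  # step S502


# No entry reaches 0.5, so the observed transition to node 3 is anchored.
done, new_row = anchor([0.2, 0.2, 0.2, 0.2, 0.2], target=3, threshold=0.5)
```

After normalization the anchored entry dominates the row while the remaining entries keep their relative proportions.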

In step S503, the recognizer 35 estimates the backward transition actions for transitioning from known node s_j′ to unknown node s_i′. The backward transition actions are estimated, for example, in the same manner as described above with reference to FIG. 47. As a result, backward transition actions c_z^r (r = 1, ..., R) are estimated.

In step S504, the learning device 34 sets state transition probabilities on the assumption that a transition from known node s_j′ to unknown node s_i′ occurred through execution of each of the backward transition actions estimated in step S503. This process is, for example, the same as that described above with reference to FIG. 48.

In this way, the anchoring process is executed.

Instead of the process described with reference to FIG. 51, the anchoring process may be carried out by setting state transition probabilities through the process described above with reference to FIG. 46, on the assumption that a transition from known node s_j′ to unknown node s_i′ occurred.

That is, although in reality action c_k′ was executed at unknown node s_i′ and a transition to known node s_j′ occurred, it is assumed that a transition from known node s_j′ to unknown node s_i′ occurred by the backward transition actions c_z^r (r = 1, ..., R). The backward transition actions c_z^r (r = 1, ..., R) can be estimated, for example, in the same manner as in step S503.

Specifically, the process described above with reference to FIG. 46 is executed on the assumption that a transition from known node s_j′ to unknown node s_i′ occurred by action c_z^1. The same process is executed again on the assumption that this transition occurred by action c_z^2. Similarly, the process is executed for each of actions c_z^3, ..., c_z^R on the assumption that each caused a transition from known node s_j′ to unknown node s_i′.

In anchoring, the process of FIG. 46 may thus be executed for each action c_z^r (r = 1, ..., R), regarding the transition as having occurred from the immediately preceding node s_j′ (in reality, the known node being anchored to) to unknown node s_i′.

Thus, according to the present invention, the agent can autonomously recognize a change in the environment and expand the state transition probability table and the observation probability table. In doing so, values such as the state transition probabilities and observation probabilities to be set in the expanded region of each table can be set appropriately. Furthermore, the state transition probabilities and the like between an unknown node and known nodes can be set based on the already stored state transition probabilities between known nodes.

The description so far has covered the measures that can be taken when the number of nodes, the number of observation symbols, or the number of actions must be changed as learning proceeds.

As described above, according to the present invention, learning using an action-extended HMM can be performed. This enables learning in a situation where the agent executes actions on the environment using action signals and can thereby influence the observation symbols observed thereafter.

Further, according to the present invention, the inevitably large-scale action-extended HMM can be learned efficiently and appropriately. Specifically, a one-state one-observation constraint is imposed on the internal model data being learned, for example by applying a split algorithm, and an action transition constraint is imposed, for example by applying a forward merge algorithm and a backward merge algorithm. This suppresses growth in the number of parameters to be calculated, so that the inevitably large-scale action-extended HMM can be learned efficiently and appropriately.

Furthermore, according to the present invention, learning by the additional learning method can be performed stably in the inevitably large-scale action-extended HMM. Specifically, calculating and storing frequency variables for estimating the state transition probabilities and frequency variables for estimating the observation probabilities allows learning by the additional learning method in the action-extended HMM to be performed stably.

Further, according to the present invention, the number of nodes, the number of observation symbols, or the number of actions can be changed as learning proceeds.

In this case, for example, the agent can be instructed to expand the internal model data on the premise that the number of nodes increases by a predetermined number, or the agent can autonomously recognize a change in the environment and expand the internal model data.

To allow the agent to autonomously recognize a change in the environment and expand the internal model data, the agent is enabled to recognize whether the node at which it is currently located is a node of a learned internal state or a node of an internal state that should be newly added.

In addition, a predetermined number of unknown nodes are added at a predetermined timing, and learning by the additional learning method is performed based on the internal model data immediately after anchoring. As a result, sufficiently effective learning is possible not only when new nodes appear sporadically among known nodes but also in difficult environments where new nodes are detected continuously over a long period.

Furthermore, when the internal model data is expanded, the state transition probabilities and the like between an unknown node and known nodes can be set based on past experience.

Thus, according to the present invention, efficient and stable learning can be performed when learning autonomously in a changing environment.

The embodiments of the present invention have been described above mainly as applied to an example in which a robot moves through a maze, but other embodiments are of course possible. For example, an action is not limited to one that moves the agent; any act that works on the environment can be an action. Likewise, the observation symbols are not limited to those corresponding to the shapes of maze parts and may correspond to, for example, changes in light or sound.

The series of processes described above can be executed by hardware or by software. When the series of processes is executed by software, the program constituting the software is installed from a network or a recording medium onto a computer built into dedicated hardware or onto, for example, a general-purpose personal computer 700 as shown in FIG. 52 that can execute various functions when various programs are installed.

In FIG. 52, a CPU (Central Processing Unit) 701 executes various processes in accordance with a program stored in a ROM (Read Only Memory) 702 or a program loaded from a storage unit 708 into a RAM (Random Access Memory) 703. The RAM 703 also stores, as appropriate, data that the CPU 701 needs to execute the various processes.

The CPU 701, the ROM 702, and the RAM 703 are connected to one another via a bus 704. An input/output interface 705 is also connected to the bus 704.

Connected to the input/output interface 705 are an input unit 706 including a keyboard and a mouse, an output unit 707 including a display such as an LCD (Liquid Crystal Display) and a speaker, and a storage unit 708 including a hard disk. A communication unit 709 including a modem and a network interface card such as a LAN card is also connected to the input/output interface 705. The communication unit 709 performs communication processing via networks including the Internet.

A drive 710 is also connected to the input/output interface 705 as necessary, and a removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory is mounted as appropriate. A computer program read from the medium is installed in the storage unit 708 as necessary.

When the series of processes described above is executed by software, the program constituting the software is installed from a network such as the Internet or from a recording medium such as the removable medium 711.

This recording medium includes not only the removable medium 711 shown in FIG. 52, which is distributed separately from the apparatus main body to deliver the program to users and on which the program is recorded, such as a magnetic disk (including a floppy disk (registered trademark)), an optical disk (including a CD-ROM (Compact Disk-Read Only Memory) and a DVD (Digital Versatile Disk)), a magneto-optical disk (including an MD (Mini-Disk) (registered trademark)), or a semiconductor memory, but also the ROM 702 on which the program is recorded and the hard disk included in the storage unit 708, which are delivered to users already built into the apparatus main body.

In this specification, the series of processes described above includes not only processes performed in time series in the described order but also processes that are executed in parallel or individually and not necessarily in time series.

The embodiments of the present invention are not limited to those described above, and various modifications can be made without departing from the gist of the present invention.

DESCRIPTION OF SYMBOLS: 10 autonomous behavior learning apparatus, 31 sensor unit, 32 action output unit, 33 observation buffer, 34 learning device, 35 recognizer, 36 action generator, 37 internal model data storage unit, 38 recognition result buffer, 39 action output buffer

Claims

Observation means for observing observation symbols based on sensor signals obtained from the environment;
Observation symbol storage means for storing the observation symbol observed over time in association with the time when the observation symbol was observed;
Recognizing means for reading information stored in the observation symbol storage means as time series information and recognizing a node of the HMM at the last time of the time series information;
The recognition means reads and recognizes the time-series information of variable length.

The recognition means is
Based on the time series information, recognize a node sequence corresponding to the length of the time series information,
Until it is determined, based on the state transition probability and the observation probability of the HMM, that the node sequence exists in the environment with a probability equal to or higher than a first threshold and that the entropy value of the posterior probability of the node at the last time of the time series information is below a second threshold,
The recognition apparatus according to claim 1, wherein a length of the time-series information read from the observation symbol storage unit is extended in a past direction.

The recognition means is
Recognizing a node sequence corresponding to the length of the time series information based on the time series information extended in the past direction,
If it is determined in the environment that the node sequence does not exist with a probability equal to or higher than the first threshold based on the state transition probability and the observation probability of the HMM, the node at the last time of the time series information is: Recognize that it is an unknown node in the internal state to be newly added, and output it as a recognition result.
When it is determined, based on the state transition probability and the observation probability of the HMM, that the node sequence exists in the environment with a probability equal to or higher than the first threshold and that the entropy value of the posterior probability of the node at the last time of the time series information is less than a second threshold, the node at the last time of the time-series information is recognized as a known node in the learned internal state and is output as a recognition result. The recognition device described in 1.

The recognition apparatus according to claim 3, further comprising a recognition result storage unit that stores the recognition result in association with the recognized time.

The recognition means is
Identifying the time when the recognition result stored in the recognition result storage means changes from a known node to an unknown node over time;
When the time series information that is temporally prior to the specified time is read by extending the length of the time series information read from the observation symbol storage means in the past direction,
The recognition apparatus according to claim 4, wherein output of the recognition result is suspended.

The recognition means is
The difference between the entropy value of the posterior probability of the node at the last time of the time series information of length N and the entropy value of the posterior probability of the node at the last time of the time series information of length N + 1 Calculate
The recognition apparatus according to claim 1, wherein the length of the time-series information read from the observation symbol storage unit is extended in the past direction until the calculated difference becomes less than a third threshold.

The recognition means is
Recognizing a node sequence corresponding to the length of the time series information based on the time series information extended in the past direction,
When it is determined in the environment that the node sequence exists with a probability less than a first threshold based on the state transition probability and the observation probability of the HMM,
The recognition apparatus according to claim 6, wherein the node at the last time of the time-series information is recognized as an unknown node having an internal state to be newly added.

The recognition apparatus according to claim 7, wherein, when it is determined, based on the state transition probability and the observation probability of the HMM, that the node sequence exists in the environment with a probability equal to or higher than the first threshold, the recognition means
recognizes the node at the last time of the time-series information as a known node in a learned internal state when the entropy value of the posterior probability of the node at the last time of the time-series information is less than a second threshold, and
suspends output of the recognition result when the entropy value of the posterior probability of the node at the last time of the time-series information is equal to or greater than the second threshold.
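The three possible outcomes in the two claims above (unknown node, known node, suspended output) can be summarized in a short Python sketch. Several points are assumptions made for illustration and are not stated in the claims: the node sequence is obtained with a plain Viterbi pass, the test for whether the node sequence "exists in the environment" is approximated by requiring every transition probability and every observation probability along the path to be at least the first threshold, and the names `Result`, `viterbi`, `path_is_plausible`, and `classify` are hypothetical.

```python
import math
from enum import Enum


class Result(Enum):
    KNOWN = "known"
    UNKNOWN = "unknown"
    SUSPENDED = "suspended"


def viterbi(obs, pi, A, B):
    """Most likely node sequence for obs under the HMM (pi, A, B)."""
    n = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    back = []
    for o in obs[1:]:
        # Best predecessor of each node j at this step.
        prev = [max(range(n), key=lambda i, j=j: delta[i] * A[i][j])
                for j in range(n)]
        delta = [delta[prev[j]] * A[prev[j]][j] * B[j][o] for j in range(n)]
        back.append(prev)
    node = max(range(n), key=lambda i: delta[i])
    path = [node]
    for prev in reversed(back):
        node = prev[node]
        path.append(node)
    return path[::-1]


def path_is_plausible(path, obs, A, B, first_threshold):
    """Assumed test: every transition and observation on the path has
    probability of at least first_threshold."""
    transitions_ok = all(A[i][j] >= first_threshold
                         for i, j in zip(path, path[1:]))
    observations_ok = all(B[s][o] >= first_threshold
                          for s, o in zip(path, obs))
    return transitions_ok and observations_ok


def classify(path, obs, posterior, A, B, first_threshold, second_threshold):
    """Apply the unknown / known / suspended decision to one window."""
    if not path_is_plausible(path, obs, A, B, first_threshold):
        return Result.UNKNOWN
    h = -sum(p * math.log(p) for p in posterior if p > 0.0)
    return Result.KNOWN if h < second_threshold else Result.SUSPENDED
```

In this sketch `posterior` is the distribution over nodes at the last time of the window and is passed in by the caller, so the decision rule stays independent of how that distribution is computed.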

The recognition apparatus according to claim 1, further comprising action symbol storage means that specifies an action to be executed on the environment as an action symbol and stores the action symbols obtained over time in association with the times at which the actions were executed,
wherein information having the same time length as the information read from the observation symbol storage means is read from the action symbol storage means and used as the time-series information.
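Reading action symbols and observation symbols over the same time span, as in the claim above, can be sketched in Python as follows. The class `SymbolStore`, the function `read_time_series`, and the buffer size are hypothetical names and values chosen for illustration. The sketch assumes one observation symbol and one action symbol are stored per time step.

```python
from collections import deque


class SymbolStore:
    """Stores (time, symbol) pairs in arrival order (assumed layout)."""

    def __init__(self, maxlen=1000):
        self._items = deque(maxlen=maxlen)

    def append(self, time, symbol):
        self._items.append((time, symbol))

    def read_last(self, n):
        """Return the n most recent (time, symbol) pairs, oldest first."""
        if n < 1:
            raise ValueError("n must be at least 1")
        return list(self._items)[-n:]


def read_time_series(observations, actions, n):
    """Read windows of equal length n from both stores and merge them
    into (time, observation symbol, action symbol) triples."""
    obs = observations.read_last(n)
    act = actions.read_last(n)
    if [t for t, _ in obs] != [t for t, _ in act]:
        raise ValueError("observation and action windows are not aligned")
    return [(t, o, a) for (t, o), (_, a) in zip(obs, act)]
```

Keeping the two stores separate and aligning them at read time means the window length can be changed freely without restructuring the stored data.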

A recognition method comprising:
reading, as variable-length time-series information, information stored in observation symbol storage means that stores observation symbols, which are based on sensor signals obtained from the environment and observed over time, in association with the times at which the observation symbols were observed; and
recognizing the HMM node at the last time of the time-series information.

A program causing a computer to function as a recognition apparatus comprising:
observation means for observing observation symbols based on sensor signals obtained from the environment;
observation symbol storage means for storing the observation symbols observed over time in association with the times at which the observation symbols were observed; and
recognition means for reading the information stored in the observation symbol storage means as time-series information and recognizing the HMM node at the last time of the time-series information,
wherein the recognition means reads the time-series information with a variable length and performs the recognition.

A recording medium on which the program according to claim 11 is recorded.