JP2019031268A

JP2019031268A - Control policy learning and vehicle control method based on reinforcement learning without active exploration

Info

Publication number: JP2019031268A
Application number: JP2018091189A
Authority: JP
Inventors: 智樹西; Tomoki Nishi
Original assignee: Toyota Motor Engineering and Manufacturing North America Inc; Toyota Engineering and Manufacturing North America Inc
Current assignee: Toyota Motor Engineering and Manufacturing North America Inc
Priority date: 2017-05-12
Filing date: 2018-05-10
Publication date: 2019-02-28
Anticipated expiration: 2038-05-10
Also published as: JP6856575B2

Abstract

Problem: To provide a computer-implemented method for autonomously controlling a vehicle to perform a vehicle operation. Solution: The method includes applying a passive actor-critic reinforcement learning method to passively collected data relating to the vehicle operation, in order to learn a control policy configured to control the vehicle so as to perform the vehicle operation with a minimum expected cumulative cost; and controlling the vehicle in accordance with the control policy to perform the vehicle operation. Selected drawing: FIG. 6

Description

Cross-Reference to Related Applications
This application is a continuation-in-part of, and claims the benefit of, U.S. Patent Application No. 15/205,558, filed July 8, 2016.

The present invention relates to methods for autonomously controlling a vehicle and, more particularly, to a reinforcement learning method for revising and/or optimizing a control policy usable for autonomously controlling an operation of a vehicle.

In certain types of systems, model-free reinforcement learning (RL) techniques can be used to determine an optimal system control policy by actively exploring an environment. However, conventional RL approaches can be difficult to apply to control policies usable for autonomous control of a vehicle, because of the potentially negative consequences of extensive active exploration of all the actions available to the vehicle. In addition, performing active exploration in the manner needed to help ensure vehicle safety may incur a high computational cost. The alternative, a model-based RL technique, may require an accurate system dynamics model of the environment in which the vehicle operates. However, the complex environments in which autonomous vehicles operate can be very difficult to model accurately.

In one aspect of the embodiments described herein, a computer-implemented method is provided for autonomously controlling a vehicle to perform a vehicle operation. The method includes the steps of: applying a passive actor-critic reinforcement learning method to passively collected data relating to the vehicle operation, in order to learn a control policy configured to control the vehicle so as to perform the vehicle operation with a minimum expected cumulative cost; and controlling the vehicle in accordance with the control policy to perform the vehicle operation.

In another aspect of the embodiments described herein, a computer-implemented method is provided for optimizing a control policy usable for controlling a system to perform an operation. The method includes the steps of: providing a control policy usable for controlling the system; and applying a passive actor-critic reinforcement learning method to passively collected data relating to the operation to be performed, so as to revise the control policy such that it is operable to control the system to perform the operation with a minimum expected cumulative cost.

In another aspect of the embodiments described herein, a computer processing system is provided that is configured to optimize a control policy usable for autonomously controlling a vehicle to perform a vehicle operation. The computer processing system includes one or more processors for controlling operation of the computer processing system, and a memory for storing data and program instructions usable by the one or more processors. The memory is configured to store computer code that, when executed by the one or more processors, causes the one or more processors to: a) receive passively collected data relating to the vehicle operation; b) determine a Z-value function usable for estimating a cost-to-go; c) in a critic network of the computer processing system, determine a Z-value using the Z-value function and samples of the passively collected data, and estimate an average cost under an optimal policy using samples of the passively collected data; d) in an actor network of the computer processing system, revise the control policy using the passively collected data, control dynamics for the system, the cost-to-go, and a control gain; and e) iteratively repeat steps (c) and (d) until the estimated average cost converges.

FIG. 1 is a block diagram of a computer processing system configured to determine control inputs to a system (for example, an autonomous vehicle) and to revise and/or optimize a system control policy, in accordance with an embodiment described herein.
FIG. 2 is a schematic diagram illustrating the flow of information during determination of vehicle control inputs and/or revision or optimization of a control policy, in accordance with a method described herein.
FIG. 3 is a schematic block diagram of a vehicle configured for autonomous control using one or more control inputs and a control policy, and incorporating a computer processing system configured to determine control inputs to the vehicle and to revise and/or optimize an autonomous vehicle operation control policy, in accordance with an embodiment described herein.
FIG. 4 is a schematic view of the arrangement of vehicles employed in an example of optimizing a control policy for freeway merging, using a method in accordance with an embodiment described herein.
FIG. 5 is a graphical representation of the optimization performed for the arrangement of vehicles shown in FIG. 4.
FIG. 6 is a flow diagram illustrating an implementation of a method for applying a passive actor-critic reinforcement learning method to learn a control policy configured to control a vehicle, and for controlling the vehicle using the learned control policy.
FIG. 7 is a flow diagram illustrating application of a passive actor-critic (PAC) reinforcement learning method in accordance with an embodiment described herein.
FIG. 8 is a flow diagram illustrating application, by a critic network, of steps of the passive actor-critic (PAC) reinforcement learning method to estimate a Z-value and an average cost under an optimal control policy using samples of the passively collected data, as shown in block 820 of FIG. 7.
FIG. 9 is a flow diagram illustrating application, by an actor network, of steps of the passive actor-critic (PAC) reinforcement learning method to estimate a Z-value and an average cost under an optimal control policy using samples of the passively collected data, as shown in block 830 of FIG. 7.

Embodiments described herein relate to a computer-implemented method for applying a passive actor-critic (pAC) reinforcement learning method to passively collected data relating to a vehicle operation, in order to learn a control policy configured to autonomously control a vehicle so as to perform the vehicle operation with a minimum expected cumulative cost. The vehicle may then be controlled by a computer processing system in accordance with the control policy to perform the vehicle operation. The pAC method does not require an accurate system dynamics model of the environment in which the vehicle operates during a merging operation in order to learn the control policy. Nor does the pAC method use active exploration of the environment to learn the control policy (active exploration may involve, for example, performing actions and monitoring their results in order to determine and revise the control policy). Instead of active exploration, the pAC method described herein uses passively collected data, a partially known system dynamics model, and a known control dynamics model of the vehicle being controlled. In particular embodiments, the pAC method may be used to learn a control policy usable for controlling a vehicle so as to merge the vehicle into a traffic lane between a second vehicle and a third vehicle traveling in that lane.

In the context of this disclosure, "online" means that the computer processing system may learn, and the actor and critic network parameters may be computed and updated, as the system operates (for example, as the vehicle moves). Determining and updating the actor and critic parameters using an online solution may allow for changes in vehicle and system dynamics. Likewise, an autonomous operation is an operation that is performed autonomously.

FIG. 1 is a block diagram of a computer processing system 14 configured to implement methods in accordance with the various embodiments disclosed herein. More specifically, in at least one embodiment, the computer processing system 14 may be configured to determine control inputs in accordance with a method described herein. The computer processing system may also be configured to revise and/or optimize a control policy usable for controlling a system (for example, an autonomous vehicle) to autonomously perform a particular operation or function.

An optimal or optimized control policy may be a control policy configured to control the vehicle so as to perform the vehicle operation with a minimum expected cumulative cost. The optimal control policy may be learned by revising an initial control policy through application of the passive actor-critic (pAC) reinforcement learning method to passively collected data relating to the vehicle operation. The pAC reinforcement learning method may be applied to the initial control policy to iteratively optimize the parameter values of the control policy. The parameters of the initial control policy may be initialized to random values. In one or more arrangements, the optimal control policy is deemed to have been learned when the average cost associated with the policy has converged. The average cost may be deemed to have converged when its value does not vary outside a predetermined range or tolerance zone over a predetermined number of iterations of the pAC method. For example, in the embodiment illustrated in FIG. 5, the average cost reaches a value of approximately 0.3 after 20,000 iterations. If, over a predetermined number of iterations following the 20,000 iterations, the average cost does not vary from 0.3 by more than a certain amount in either direction, the control policy may be considered optimized. The vehicle may then be controlled in accordance with the optimized control policy to perform the vehicle operation. Using the optimized control policy to control the vehicle should then result in the vehicle operation being performed with a minimum expected cumulative cost.
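The convergence test described above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the function name `has_converged`, the choice of the first value in the window as the reference, and the parameter names `window` and `tolerance` are all assumptions made here.

```python
def has_converged(avg_costs, window, tolerance):
    """Return True if the last `window` average-cost estimates stay within
    +/- `tolerance` of the estimate at the start of that window.

    `window` plays the role of the "predetermined number of iterations" and
    `tolerance` the "tolerance zone"; both names are hypothetical.
    """
    if window < 1 or len(avg_costs) < window:
        return False  # not enough iterations observed yet
    recent = avg_costs[-window:]
    reference = recent[0]
    return all(abs(c - reference) <= tolerance for c in recent)
```

For instance, a cost history that settles near 0.3 would pass the test with a tolerance of 0.05, while a history that is still dropping from 0.5 to 0.3 inside the window would not.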

In at least one embodiment, the computer processing system may be incorporated into a vehicle and may be configured to revise and optimize a control policy directed to controlling operation of the vehicle. The information (for example, data, instructions, and/or other information) required by the computer processing system to revise and/or optimize the control policy may be received from and/or collected by any suitable means, for example, from vehicle sensors or from an off-vehicle source such as a remote database reached over a wireless connection. In some embodiments, at least some of the information (for example, data) required by the computer processing system to revise and/or optimize the control policy may be provided to the computer processing system before the vehicle is operated (for example, as data and other information stored in memory). The computer processing system may also be configured to control the vehicle in accordance with the revised or optimized control policy so as to perform the associated autonomous operation.

In at least one embodiment, the computer processing system may be located remotely from the vehicle (for example, as a stand-alone computer processing system) and may be configured to revise and/or optimize the control policy remotely from the vehicle. The optimized or revised control policy generated by the remote computer processing system may then be loaded or installed into a vehicle computer processing system for deployment by the vehicle, to control the vehicle in an actual traffic environment.

Referring to FIG. 1, the computer processing system 14 may include one or more processors 58 (which may include at least one microprocessor) that control the overall operation of the computer processing system 14 and associated components, and that execute instructions stored in a non-transitory computer-readable medium such as memory 54. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by, or in connection with, an instruction execution system, apparatus, or device. The processor 58 may include at least one hardware circuit (for example, an integrated circuit) configured to carry out instructions contained in program code. In arrangements with a plurality of processors 58, the processors may operate independently of one another, or one or more processors may operate in cooperation with one another.

In some embodiments, the computer processing system 14 may include RAM 50, ROM 52, and/or any other suitable form of computer-readable memory. The memory 54 may comprise one or more computer-readable memories. The memory or memories 54 may be a component of the computer processing system 14, or may be operatively connected to the computer processing system 14 for use by it. As used throughout this description, the phrase "operatively connected" may include direct or indirect connections, including connections without direct physical contact.

In one or more arrangements, the computer processing system 14 described herein may incorporate artificial or computational intelligence elements, for example, neural networks or other machine learning algorithms. Further, in one or more arrangements, the hardware and/or software elements configured to perform particular functions or operations described herein may be distributed among a plurality of elements and/or locations. In addition to the computer processing system 14, the vehicle may incorporate additional computer processing systems and/or devices (not shown) to augment or support the control functions performed by the computer processing system 14, or for other purposes.

The memory 54 may store data 60 and/or instructions 56 (for example, program logic) executable by the processor 58 to perform various functions. The data 60 may include passively collected data relating to the vehicle operation to be controlled by the control policy. In addition, the passively collected data may be provided to (or may reside on) other sources for use by the computer processing system 14. Passively collected data may be defined as data that is not collected through active exploration. Passively collected data relating to a freeway merging operation may include, for example, the speeds and accelerations of vehicles traveling in the rightmost lane of the freeway near an on-ramp, and the speeds and accelerations of sample vehicles traveling along the on-ramp and entering the rightmost lane. One example of passively collected data is the data set described at http://www.fhwa.dot.gov/publications/research/operations/06137/, which describes the acquisition of vehicle trajectories around a freeway entrance using cameras mounted on top of a building. In another example, the passively collected data may include data collected by vehicle sensors in response to maneuvers executed by a human driver. Data may be collected relating to the maneuver executed by the human driver, the vehicle environmental conditions under which the maneuver was executed, and the events occurring in the vehicle surroundings after and/or in response to the maneuver, and this data may be provided to the computer processing system. Alternatively, where the computer processing system is installed in a vehicle, the computer processing system 14 may be configured to accumulate and/or receive such passively collected data for online revision and/or optimization of one or more vehicle control policies (such as control policy 101).

The vehicle control dynamics model 87 may be a stimulus-response model describing how the vehicle responds to various inputs. The vehicle control dynamics model 87 may be used to determine (using the passively collected data) the vehicle control dynamics B(x) for the vehicle in a given vehicle state x. The state cost function q(x) is the cost of the vehicle being in state x, and may be learned using known methods such as inverse reinforcement learning. The state cost q(x) and the vehicle control dynamics B(x) may be used both for revising and for optimizing the control policy 101, as described herein. The vehicle control dynamics model 87 for any given vehicle may be determined and stored in a memory, such as memory 54.

Referring again to FIG. 1, embodiments of the computer processing system may also include two learning systems or learning networks that interact with each other: an actor network (or "actor") 83 and a critic network (or "critic") 81. These networks may be implemented using, for example, artificial neural networks (ANNs). For the purposes described herein, the control policy 101 (also denoted by the variable π) may be defined as a function or other relationship that specifies or determines the action u to be taken by the vehicle in response to each state x of a set of vehicle states. Thus, for each state x of the vehicle during execution of an autonomous operation, the vehicle may be controlled to perform the associated action u = π(x). The control policy therefore controls operation of the vehicle to autonomously perform the associated operation, for example, freeway merging. The actor 83 operates on the control policy and may revise and/or optimize the policy using information received from the critic and other information. A vehicle operation autonomously controlled by a control policy may be defined as a driving maneuver or set of driving maneuvers performed to accomplish a specific purpose, such as merging onto a freeway or changing lanes.

The computer processing system 14 may be configured to execute a novel semi-model-free RL method (referred to herein as the passive actor-critic (pAC) method) usable for revising and optimizing a control policy. In this method, the critic learns an evaluation function for the various states of the vehicle, and the actor improves the control policy without active exploration, using instead the passively collected data and a known vehicle control dynamics model. The method avoids the need for active exploration by using a partially known system dynamics model. The method does not require knowledge of the uncontrolled dynamics of the vehicle environment or of the transition noise levels. The method is feasible, for example, for autonomous vehicles, where samples of how the environment evolves under noise are available but the environment may be difficult to explore actively with vehicle sensors.

In particular embodiments described herein, the vehicle operation to be controlled by the control policy is an operation for merging the vehicle into a traffic lane between a second vehicle and a third vehicle traveling in that lane. The control policy may be configured to control the vehicle so as to merge the vehicle midway between the second vehicle and the third vehicle.

Embodiments of the computer processing system 14 described herein determine the state x(t) of the system (for example, the vehicle) by measuring, receiving, and/or accessing various types of input and output information. For example, data may be measured using sensors coupled to, or otherwise in communication with, the system. The computer processing system 14 may determine a control input u to achieve stability and desired motion of the vehicle as characterized by Equation (1), while also minimizing the energy-based cost function described in Equation (2).

For purposes of revising and optimizing the control policy, a discrete-time stochastic dynamical system with state $x \in \mathbb{R}^{n}$ and control input $u \in \mathbb{R}^{m}$ may be defined as follows:

$$\Delta x = A(x_t)\,\Delta t + B(x_t)\,u_t\,\Delta t + C(x_t)\,d\omega \qquad (1)$$

where $\omega(t)$ is Brownian motion, and $A(x_t)$, $B(x_t)u_t$, and $C(x_t)$ denote the passive dynamics, the vehicle control dynamics, and the transition noise level, respectively. $\Delta t$ is the time step size. Systems of this type arise in many situations (for example, models of most mechanical systems conform to these dynamics). As will be understood, the functions $A(x)$, $B(x)$, and $C(x)$ depend on the particular system being modeled. The passive dynamics include changes in the vehicle environment that are not the result of control inputs to the vehicle system.
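To make Equation (1) concrete, the following is a minimal simulation sketch for a scalar state. It is an illustration under stated assumptions rather than anything taken from the patent: the state is one-dimensional, the increment dω is sampled as a zero-mean Gaussian with variance Δt (the usual Euler-Maruyama discretization), and the names `euler_maruyama_step` and `rollout` are hypothetical.

```python
import math
import random


def euler_maruyama_step(x, u, a, b, c, dt, rng):
    """Advance a scalar state by one step of Eq. (1).

    a, b, c are callables standing in for A(x), B(x), C(x).
    Assumption: d_omega ~ N(0, dt), i.e. a Brownian increment over dt.
    """
    d_omega = rng.gauss(0.0, math.sqrt(dt))
    return x + a(x) * dt + b(x) * u * dt + c(x) * d_omega


def rollout(x0, policy, a, b, c, dt, n_steps, seed=0):
    """Simulate n_steps of the controlled system under u = policy(x)."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    xs = [x0]
    for _ in range(n_steps):
        x = xs[-1]
        xs.append(euler_maruyama_step(x, policy(x), a, b, c, dt, rng))
    return xs
```

Passing a policy that always returns zero yields a trajectory driven only by the passive dynamics and noise, which is the kind of sample the pAC method treats as passively collected data.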

In the methods and systems described herein, a Markov decision process (MDP) for a discrete-time dynamical system is a tuple $\langle X, U, P, R \rangle$, where $X \subseteq \mathbb{R}^{n}$ and $U \subseteq \mathbb{R}^{m}$ are the state space and the action space, respectively. $P := \{p(y \mid x, u) \mid x, y \in X,\ u \in U\}$ is the state transition model due to actions, and $R := \{r(x, u) \mid x \in X,\ u \in U\}$ is the immediate cost function with respect to state $x$ and action $u$. As described above, the control policy $u = \pi(x)$ is a function that maps a state $x$ to an action $u$. The cost-to-go function (or value function) $V^{\pi}(x)$ under policy $\pi$, which is the expected cumulative cost, is defined as follows under the infinite-horizon average-cost optimality criterion:

[Equation not reproduced in the source text]

where $V^{\pi}_{\mathrm{avg}}$ is the average cost, $k$ is the time index, and $\Delta t$ is the time step. The optimal cost-to-go function satisfies the following discrete-time Hamilton-Jacobi-Bellman equation:

[Equation (2) not reproduced in the source text]

where $Q^{\pi}(x, u)$ is the action-value function and $g[\cdot]$ is an integral operator. The goal of the MDP is to find the control policy that minimizes the average cost over the infinite horizon, according to the following relation:

[Equation not reproduced in the source text]

Here, values under the optimal control policy may be denoted with a superscript $*$ (for example, $V^{*}$, $V^{*}_{\mathrm{avg}}$).

A linear Markov decision process (L-MDP) for a discrete-time dynamical system is a subclass of general Markov decision processes, with the benefit that exact solutions may be obtained quickly for continuous state and action spaces. Under structured dynamics and separated state and control costs, the Bellman equation can be restructured as a linear differential equation, whose solution reduces to finding the linear eigenfunction of the combined state cost and uncontrolled dynamics. The cost-to-go function (or value function) for the L-MDP can then be obtained efficiently by optimization methods such as quadratic programming (QP) when an accurate dynamics model is available.
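The eigenfunction view can be illustrated on a small discrete-state problem. The sketch below is not the patent's continuous-state method. It assumes the discrete-state average-cost form from the linearly solvable MDP literature, λ z = diag(exp(−q)) P z, where P is the passive transition matrix, z = exp(−V) is the exponentially transformed cost-to-go, and λ = exp(−V_avg). The function name and arguments are hypothetical, and the passive dynamics are assumed fully known, which is exactly the assumption the pAC method is designed to avoid.

```python
import math


def solve_lmdp_average_cost(passive, state_cost, iters=1000, tol=1e-12):
    """Power iteration for lambda * z = diag(exp(-q)) @ P @ z.

    passive    : row-stochastic matrix P as a list of lists (passive dynamics)
    state_cost : list of state costs q(x) >= 0
    Returns (average_cost, z), with z scaled so that max(z) == 1.
    """
    n = len(state_cost)
    z = [1.0] * n
    lam = 1.0
    for _ in range(iters):
        new_z = [
            math.exp(-state_cost[i]) * sum(passive[i][j] * z[j] for j in range(n))
            for i in range(n)
        ]
        new_lam = max(new_z)  # eigenvalue estimate, because max(z) == 1
        new_z = [v / new_lam for v in new_z]
        # Stop only when both the eigenvalue and the eigenvector have settled.
        done = (abs(new_lam - lam) < tol
                and max(abs(p - q) for p, q in zip(new_z, z)) < tol)
        z, lam = new_z, new_lam
        if done:
            break
    return -math.log(lam), z
```

With zero state cost everywhere, the principal eigenvalue is 1 and the recovered average cost is 0, which is a quick sanity check on the transformation.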

The linear formulation of the Markov decision process can be used to define the control cost and to add a condition on the vehicle dynamics, as shown below:

[equation (3) and the second condition]

Here, q(x) ≥ 0 is the state cost function, p(x) is the state transition model under an action, and KL(· || ·) is the Kullback-Leibler (KL) divergence. Equation (3) relates the cost of an action to the amount of stochastic effect it has on the system and adds it to the state cost. The second condition ensures that no action introduces new transitions that are not achievable under the passive dynamics. The stochastic dynamical system represented by equation (1) naturally satisfies this assumption.

In the L-MDP form, the Hamilton-Jacobi-Bellman equation (equation (2)) can be rewritten as a linear differential equation for the exponentially transformed cost-to-go function (hereinafter, the linearized Bellman equation):

[equation (5)]

where Z(x) and Z_avg are, respectively, the exponentially transformed cost-to-go function, called the Z-value, and the average cost under the optimal policy. A Z-value may be the specific value of the Z-value function Z(x) for a corresponding value of the input parameter x. Because the state transition in equation (1) is Gaussian, the KL divergence between the controlled dynamics and the passive dynamics can be expressed as:

[equation]
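To illustrate why Gaussian transitions make this control cost tractable, the following minimal sketch evaluates the KL divergence between two one-dimensional Gaussians that share a variance, which reduces to a quadratic in the mean shift caused by the control. The scalar setting, the function name, and the numeric values are illustrative assumptions and are not taken from this disclosure.

```python
def kl_gaussian_shared_variance(mean_passive, mean_controlled, sigma):
    """KL divergence between N(mean_passive, sigma^2) and N(mean_controlled, sigma^2).

    With a shared variance the divergence is symmetric and equals
    0.5 * ((mean_controlled - mean_passive) / sigma) ** 2.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    shift = mean_controlled - mean_passive
    return 0.5 * (shift / sigma) ** 2


# Hypothetical scalar example: the control shifts the next-state mean by b * u * dt.
b, u, dt, sigma = 1.0, 2.0, 0.1, 0.5
control_cost = kl_gaussian_shared_variance(0.0, b * u * dt, sigma)
```

Under these assumptions the cost grows with the square of the control input and is zero when no control is applied, which is the property the L-MDP formulation relies on.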

The optimal control policy for the L-MDP system is then expressed as:

[equation (7)]

where

[equation]

is the partial derivative of the cost-to-go function V with respect to x at x_k, and the parameter ρ(x_k) is a control gain representing the factor by which the vector B(x_k)^T V_k is multiplied. The Z-value and the average cost can be derived from the linearized Bellman equation by solving for the eigenvalue or the eigenfunction when the system dynamics are fully available.
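The role of the control gain can be sketched numerically. The example below assumes the policy has the form u = -ρ B^T V_x (the negative sign is an assumption, chosen so that the control descends the cost-to-go) and that the exponential transform is Z = exp(-V), so that V_x = -Z_x / Z. The function name and all numbers are hypothetical.

```python
def control_input(rho, B, z_value, z_grad):
    """Compute u = -rho * B^T V_x, with V = -log(Z) so that V_x = -Z_x / Z.

    rho     : scalar control gain at the current state
    B       : control dynamics, a list of n_states rows with n_controls entries each
    z_value : Z-value at the current state (must be positive)
    z_grad  : gradient of Z with respect to the state, length n_states
    """
    if z_value <= 0:
        raise ValueError("Z must be positive")
    v_grad = [-g / z_value for g in z_grad]  # gradient of V = -log Z
    n_states, n_controls = len(B), len(B[0])
    # B^T V_x, evaluated one control channel at a time.
    bt_v = [sum(B[i][j] * v_grad[i] for i in range(n_states)) for j in range(n_controls)]
    return [-rho * c for c in bt_v]


# Hypothetical two-state, one-control example (e.g. position and velocity, acceleration input).
u = control_input(rho=2.0, B=[[0.0], [1.0]], z_value=0.5, z_grad=[0.1, -0.2])
```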

A solution to the eigenvalue problem is discussed by Todorov in "Linearly-solvable Markov Decision Problems", published in Advances in Neural Information Processing Systems, 2006, pp. 1369-1376, Vol. 19, which is incorporated herein by reference in its entirety. A solution to the eigenfunction problem is discussed by Todorov in "Eigenfunction Approximation Methods for Linearly-solvable Optimal Control Problems", published in Adaptive Dynamic Programming and Reinforcement Learning, IEEE Symposium, 2009, pp. 161-168, which is likewise incorporated herein by reference in its entirety.
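When the dynamics are fully known, the eigenvalue approach can be sketched for a small discrete-state problem. The example below assumes the discrete average-cost form λ z = diag(exp(-q)) P z, where P is the passive transition matrix, λ plays the role of Z_avg, and z plays the role of the Z-values. It solves for the principal eigenpair by power iteration. The three-state chain, the costs, and the function name are illustrative assumptions, not part of this disclosure.

```python
import math


def solve_discrete_lmdp(P, q, iters=500):
    """Power iteration for lam * z = diag(exp(-q)) @ P @ z.

    P : row-stochastic passive transition matrix (list of lists)
    q : non-negative state costs, one per state
    Returns (lam, z), with z scaled so that max(z) == 1.
    """
    n = len(q)
    z = [1.0] * n
    lam = 1.0
    for _ in range(iters):
        new_z = [math.exp(-q[i]) * sum(P[i][j] * z[j] for j in range(n)) for i in range(n)]
        lam = max(new_z)  # eigenvalue estimate, since max(z) == 1 before the update
        z = [v / lam for v in new_z]
    return lam, z


# Hypothetical three-state chain: the middle state is free, the outer states cost 1.
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
q = [1.0, 0.0, 1.0]
lam, z = solve_discrete_lmdp(P, q)
average_cost = -math.log(lam)  # average cost per step under the optimal policy
```

Power iteration is used here only because it needs no external libraries; an eigensolver from a numerical package would serve the same purpose. The pAC method described in this document avoids this step altogether because the passive dynamics are not assumed to be known.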

Embodiments of the computer processing system 14 described herein include two learning systems or learning networks that interact with each other: an actor network (or "actor") 83 and a critic network (or "critic") 81. These networks may be implemented using artificial neural networks.

In one or more arrangements, the actor 83 is implemented as an inner-loop feedback controller and the critic 81 is implemented as an outer-loop feedback controller. Both may be located in a feedforward path with respect to the vehicle actuation mechanisms or controls that are operable to effect control commands.

An iteration may be defined as an update of the critic and actor parameters (such as the weights ω for the critic and μ for the actor). In addition, the critic network parameters may be updated while the vehicle is in motion. In the methods described herein, the only data used during the updates of the critic and actor network parameters is passively collected data.

The critic 81 uses the states and state costs reflected in samples of the passively collected data to determine an estimated average cost and an approximated cost-to-go function that produces a minimum value of the vehicle's cost-to-go when applied by the actor network. Using the vehicle states reflected in the samples of passively collected data and the state cost q(x) received from the vehicle control dynamics model 87, the critic 81 evaluates the current state x_k of the vehicle, the estimated next state x_{k+1}, and the state cost q_k under the optimal policy. The critic 81 also uses the linearized version of the Bellman equation described above (equation (5)) to determine an approximated cost-to-go function $\hat{Z}(x)$ (the Z-value function) and an associated estimated Z-value for the current state, and to generate an estimated average cost $\hat{Z}_{avg}$ for use by the actor 83. The estimated next state x_{k+1} can be calculated using the passively collected data and the vehicle dynamics model 87.

To estimate the Z-value, a linear combination of weighted radial basis functions (RBFs) may be used to approximate the Z-value function:

$$\hat{Z}(x) \approx \sum_{j=1}^{N} \omega_j f_j(x)$$

where ω_j are the weights, f_j is the j-th RBF, and N is the number of RBFs. The basis functions may be suitably chosen depending on the nonlinear dynamics of the vehicle system. The Z-value can be approximated using the approximated Z-value function and samples of the passively collected data.
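The weighted-RBF approximation of the Z-value function can be sketched as follows. The sketch assumes Gaussian basis functions with a shared width; the text only requires radial basis functions suited to the vehicle dynamics, so the kernel choice, the function names, and the example centers and weights are all hypothetical.

```python
import math


def gaussian_rbf(x, center, width):
    """Gaussian radial basis function exp(-||x - center||^2 / (2 * width^2))."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq_dist / (2.0 * width ** 2))


def z_hat(x, weights, centers, width):
    """Approximate Z-value: weighted sum of RBFs evaluated at state x."""
    return sum(w * gaussian_rbf(x, c, width) for w, c in zip(weights, centers))


# Hypothetical two-dimensional state with two basis functions.
centers = [[0.0, 0.0], [1.0, 0.0]]
weights = [0.5, 0.25]
z_at_origin = z_hat([0.0, 0.0], weights, centers, width=1.0)
```

Because the approximation is linear in the weights, updating ω is a matter of adjusting the coefficients while the basis functions stay fixed.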

Before the Z-value function is approximated using the linear combination of weighted radial basis functions, the weights used in the weighted radial basis functions may be optimized. The weights ω may be optimized by minimizing the least-squares error between the exponentiated true cost-to-go (or Z-value) and the estimated cost-to-go. Let Z(x_k) and Z_avg be the true Z-values, and let $\hat{Z}(x_k)$ and $\hat{Z}_{avg}$ be the estimated Z-values. Then:

[equation]

where C is a constant value used to avoid convergence to the trivial solution ω = 0. The second and third constraints are needed to satisfy ∀x, 0 < Z_avg Z(x) ≤ 1, which follows from equation (5) and ∀x, q(x) ≥ 0.

Before the weights are optimized, the weights used in the weighted radial basis functions may be updated. Because the true cost-to-go and the true average cost cannot be determined from the information used by the pAC method, the weights ω and the average cost $\hat{Z}_{avg}$ used by the critic network may be updated, prior to use in an iteration step, by Lagrangian-relaxation temporal-difference (TD) learning. In this update, the error between the true cost-to-go and the estimated cost-to-go is approximated by an approximated temporal-difference error e_k, determined from the linearized Bellman equation (LBE) (equation (5)) as follows:

[equation]

where α_1^i and α_2 are learning rates and e_k is the TD error for L-MDPs. δ_ij denotes the Dirac delta function. The superscript i denotes the iteration number. λ_1, λ_2, and λ_3 are Lagrange multipliers for the constraints in equation (9). ω is updated so as to minimize the error with equation (10) and to satisfy the constraints with equation (11).

The values of the multipliers are calculated by solving the following equation:

[equation]

In some cases, a subset of the constraints may not be active. In such cases, the multipliers for those constraints are set to zero and the multipliers for the remaining active constraints are obtained. The critic updates its parameters using state transition samples (x_k, x_{k+1}) under the passive dynamics and the state cost q_k, both of which are independent of the control policy. The weights ω, the estimated Z-value $\hat{Z}$, and the average cost $\hat{Z}_{avg}$ may be updated online according to equations (10)-(11A) while the vehicle is in motion.

Within the computer processing system, the actor 83, which is operatively coupled to the critic network, may determine a control input to apply to the vehicle that produces a minimum value of the cost-to-go. The actor 83 may determine the control input using the estimated cost-to-go $\hat{Z}$ and the estimated average cost Z_avg generated by the critic, the state cost q(x), the control dynamics information B(x) for the current state as determined from the vehicle control dynamics model 87, and the current state and estimated next state of the vehicle that the critic used to estimate the cost-to-go function $\hat{Z}$ and to generate the estimated average cost $\hat{Z}_{avg}$. The control input may be used to modify the control policy π. In particular embodiments, the policy π is modified iteratively in the manner described herein until convergence, at which point it is deemed to be optimized. The actor improves and modifies the control policy using the standard Bellman equation, without active exploration. The control dynamics may be determined from the known control dynamics of the vehicle.

The actor 83 may also apply the control input u(x) to the vehicle systems in real time to autonomously perform a desired operation (for example, freeway merging or a lane change). In some embodiments disclosed herein, the actor 83 may be embodied in an inner-loop feedback controller and the critic 81 may be embodied in an outer-loop feedback controller. Both controllers may be located in a feedforward path with respect to the vehicle actuation controls.

The actor 83 may improve or modify the control policy by estimating the control gain ρ(x_k) using the estimated values from the critic (for example, $\hat{Z}$ and $\hat{Z}_{avg}$), samples under the passive dynamics, and the known control dynamics B_k. Modifying the control policy may include approximating the control gain, optimizing the control gain to provide an optimized control gain, and modifying the control policy using the optimized control gain.

The control gain ρ(x_k) may be approximated with a learned linear combination of weighted radial basis functions:

$$\hat{\rho}(x_k) \approx \sum_{j=1}^{M} \mu_j g_j(x_k)$$

where μ_j is the weight for the j-th radial basis function g_j and M is the number of radial basis functions. ρ(x_k) may be optimized by minimizing the least mean error between the cost-to-go and the action-state value:

[equation]

where V^*, V^*_avg, and $\hat{Q}$ are the true cost-to-go function, the average cost, and the estimated action-state value under the optimal control policy. The optimal control policy can be learned by minimizing the objective function because the true action-value cost equals V^* + V^*_avg if and only if the policy is the optimal policy. When the control gain ρ(x_k) is updated, $\hat{Z}$ and $\hat{Z}_{avg}$ may be used to determine $\hat{V}$ and $\hat{V}_{avg}$ according to the following relationship:

[equation]

Before the control gain is optimized, the control input may be determined, and the value of the action-value function Q may be determined using the control input, samples of the passively collected data, and the approximated control gain. Before the control gain is approximated using the linear combination of weighted radial basis functions, the weights μ used in the weighted radial basis functions may be updated.

The weights μ may be updated using the approximated temporal-difference (TD) error d_k defined below:

[equation]

where β' is the learning rate and L_{k,k+1} is shorthand for the term L(x_k, x_{k+1}).

Because the true cost-to-go and the true average cost cannot be calculated, the standard Bellman equation may be approximated to determine the error d_k:

[equation]

where x_{k+1} is the next state under the passive dynamics and x_{k+1} + B_k u_k Δt is the next state under the controlled dynamics with action u_k. The estimated cost-to-go, the average cost, and their derivatives can be calculated using the estimated Z-value and the average Z-value cost from the critic. In addition, to linearize the TD error with respect to μ, the term

[equation]

may be approximated by the following expression:

[equation]

This procedure improves the policy without active exploration by using the state transition samples (x_k, x_{k+1}) under the passive dynamics, the state cost q_k, and the control dynamics B_k at a given state. Standard actor-critic methods, by contrast, optimize a policy using active exploration. With the actor and critic functions so defined, the computer processing system 14 can implement semi-model-free reinforcement learning using the L-MDP.

In the methods described herein, the policy is optimized with parameters that are learned, using knowledge of the vehicle control dynamics and samples of passively collected data, by minimizing the error between the cost-to-go and the action-state value. The methods described herein make it possible to determine an optimal policy with the dynamics model of the vehicle itself, which is usually available for controlling the vehicle. The methods also use passively collected data on the operations of surrounding vehicles, whose dynamics models are usually unknown. Furthermore, with the methods described herein, the passive dynamics A(x_t) and the transition noise level C(x_t) of the vehicle environment need not be known in order to determine the optimal control policy.

In another aspect, as described herein, a computer-implemented method is provided for optimizing a control policy usable for controlling a system to perform an operation. The method may include: providing a control policy usable for controlling the system; and applying a passive actor-critic reinforcement learning method to passively collected data relating to the operation to be performed, so as to modify the control policy such that it is operable to control the system to perform the operation at a minimum expected cumulative cost. Applying the passive actor-critic reinforcement learning method to the passively collected data may include: a) in a critic network in the computer processing system, estimating a Z-value using samples of the passively collected data and estimating an average cost under the optimal policy using samples of the passively collected data; b) in an actor network in the computer processing system, modifying the control policy using samples of the passively collected data, the control dynamics for the system, the cost-to-go, and the control gain; c) updating the parameters used in modifying the control policy and in estimating the Z-value and the average cost under the optimal policy; and d) iteratively repeating steps (a)-(c) until the estimated average cost converges.

FIGS. 6-9 are flow diagrams illustrating a computer-implemented method, according to an embodiment described herein, for applying a passive actor-critic reinforcement learning method to passively collected data relating to a vehicle operation in order to learn a control policy configured for controlling the vehicle so as to perform the vehicle operation at a minimum expected cumulative cost.

Referring to FIG. 6, in block 710, the processor 58 may receive passively collected data relating to the vehicle operation to be performed. The passively collected data may be received from the memory 54 and/or from a source external to the computer processing system 14.

In block 720, the processor and/or other elements of the computer processing system 14 may iteratively apply the passive actor-critic (PAC) reinforcement learning method described herein to samples of the passively collected data. Applying the PAC method makes it possible to learn a control policy that enables the vehicle to be controlled so as to perform the vehicle operation at a minimum expected cumulative cost. In block 730, the vehicle may be controlled in accordance with the learned control policy to perform the vehicle operation.

FIG. 7 is a flow diagram illustrating application of the passive actor-critic (PAC) reinforcement learning method according to embodiments described herein, as shown in block 720 of FIG. 6.

Referring to FIG. 7, in block 810, the computer processing system may receive an initial control policy that can be adapted for controlling the vehicle to perform the vehicle operation. The parameters of the initial version of the control policy may be initialized to random values using a randomization routine in the computer processing system.

In block 820, the computer processing system may estimate the Z-value and the average cost under the optimal control policy, as described above, using samples of the passively collected data. In block 830, the computer processing system may modify the control policy using the samples of passively collected data, the estimated Z-value, and the estimated average cost under the optimal policy. In block 840, the computer processing system may iteratively repeat the steps shown in blocks 820 and 830 until the estimated average cost converges.
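The control flow of blocks 820 through 840 can be sketched as a simple loop. In the sketch below, `critic_step` and `actor_step` are hypothetical callables standing in for the critic and actor updates described in this document, and the convergence tolerance and iteration cap are assumed parameters. Only the loop structure is illustrated, not the update rules themselves.

```python
def run_pac(samples, critic_step, actor_step, tol=1e-6, max_iters=1000):
    """Alternate critic and actor updates until the estimated average cost converges.

    critic_step(samples) -> (z_params, avg_cost)                  # block 820
    actor_step(samples, z_params, avg_cost) -> policy_params      # block 830
    Returns (policy_params, avg_cost, iterations_used).
    """
    prev_avg_cost = None
    policy_params = None
    avg_cost = None
    for iteration in range(1, max_iters + 1):
        z_params, avg_cost = critic_step(samples)
        policy_params = actor_step(samples, z_params, avg_cost)
        # Block 840: stop once successive average-cost estimates agree within tol.
        if prev_avg_cost is not None and abs(avg_cost - prev_avg_cost) < tol:
            return policy_params, avg_cost, iteration
        prev_avg_cost = avg_cost
    return policy_params, avg_cost, max_iters
```

Comparing successive estimates is one possible convergence test; the text states only that iteration continues until the estimated average cost converges, so other stopping rules would fit equally well.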

FIG. 8 is a flow diagram illustrating application of steps of the passive actor-critic (PAC) reinforcement learning method by the critic network to estimate the Z-value and the average cost under the optimal control policy using samples of the passively collected data, as shown in block 820 of FIG. 7.

In block 910, the weights used in the weighted radial basis functions usable for approximating the Z-value function may be updated. In block 920, those weights may be optimized. In block 930, the Z-value function may be approximated using a linear combination of the weighted radial basis functions. In block 940, the Z-value may be approximated using the approximated Z-value function determined in block 930 and samples of the passively collected data.

FIG. 9 is a flow diagram illustrating application of steps of the passive actor-critic (PAC) reinforcement learning method by the actor network to modify the control policy using samples of the passively collected data together with the estimated Z-value and average cost under the optimal control policy, as shown in block 830 of FIG. 7.

In block 1010, the weights used in the weighted radial basis functions usable for approximating the control gain may be updated. In block 1020, the control gain ρ(x_k) may be approximated using a linear combination of weighted radial basis functions. The control gain may be approximated using relationship (12) described above and samples of the passively collected data:

$$\hat{\rho}(x_k) \approx \sum_{j=1}^{M} \mu_j g_j(x_k)$$

In block 1030, the control input u may be determined using the relationship

[equation]

In block 1040, the value of the action-value function Q may be determined using the control input determined in block 1030, the samples of passively collected data, and the control gain ρ(x_k) approximated in block 1020. The value of the action-value function Q may be determined using the relationship specified above in paragraph [0045],

[equation]

In block 1050, the control gain may be optimized to provide an optimized control gain. The control gain may be optimized using the value of the action-value function Q determined in block 1040 and the relationship specified above in paragraph [0057],

[equation]

Finally, in block 1060, the control policy may be modified or updated using the optimized control gain ρ(x_k) and relationship (7), that is,

[equation (7)]

This relationship is the one specified above in paragraph [0028]. The steps performed by the critic and actor networks described above may be iteratively repeated for additional passively collected data until the estimated average cost converges.

FIG. 2 is a schematic diagram showing the flow of information during determination of the control input and during modification and optimization of the control policy in the computer processing system 14 in accordance with the methods described herein. While a conventional actor-critic method may operate using samples of data actively collected from the environment, the pAC method described herein determines the optimal control policy without active exploration of the environment, using instead passively collected samples and the known vehicle control dynamics. Any information received by either the critic 81 or the actor 83 may be buffered in memory for later use. For example, in situations where not all of the information that the critic or the actor needs to calculate or estimate a parameter value is currently available, the received information may be buffered until the remaining required information is received. The terms

[equation]

and

[equation]

are the partial derivatives of the Z-value function with respect to x at x_k and x_{k+1}, respectively. The term

[equation]

may be used to calculate the partial derivative of the cost-to-go function V at x_k.

FIG. 3 is a functional block diagram illustrating a vehicle 11, according to an example embodiment, incorporating a computer processing system 114 configured in a manner similar to the computer processing system 14 of FIG. 1. The vehicle 11 may take the form of a passenger car, a truck, or any other vehicle capable of performing the operations described herein. The vehicle 11 may be configured to operate fully or partially in an autonomous mode. While operating in an autonomous mode, the vehicle 11 may be configured to operate without human interaction. For example, in an autonomous mode in which a freeway merging operation is being executed, the vehicle may operate the throttle, brakes, and other systems so as to maintain a safe distance from vehicles on the freeway, match speeds with other vehicles, and so on, without input from a vehicle occupant.

In addition to the computer processing system 114, the vehicle 11 may include various systems, subsystems, and components in operative communication with each other, such as a sensor system or array 28, one or more communications interfaces 16, a steering system 18, a throttle system 20, a braking system 22, a power supply 30, a motive power system 26, and other systems and components needed for operating the vehicle as described herein. The vehicle 11 may include more or fewer subsystems than those shown in FIG. 3, and each subsystem may include multiple elements. Further, each of the subsystems and elements of the vehicle 11 may be interconnected. One or more of the described functions and/or autonomous operations of the vehicle 11 may be performed by multiple vehicle systems and/or components operating in conjunction with each other.

The sensor system 28 may include any suitable type of sensor. Various examples of different types of sensors are described herein. However, it will be understood that the embodiments are not limited to the particular sensors described.

The sensor system 28 may include a number of sensors configured to sense information about the external environment of the vehicle 11. For example, the sensor system 28 may include a navigation unit such as a Global Positioning System (GPS), and other sensors such as an inertial measurement unit (IMU) (not shown), a RADAR unit (not shown), a laser rangefinder/LIDAR unit (not shown), and one or more cameras (not shown) comprising devices configured to capture a plurality of images of the interior of the vehicle and/or the external environment of the vehicle 11. The cameras may be still cameras or video cameras. The IMU may incorporate any combination of sensors (for example, accelerometers and gyroscopes) configured to sense changes in the position and orientation of the vehicle 11 based on inertial acceleration. For example, the IMU may sense parameters such as vehicle roll rate, yaw rate, pitch rate, longitudinal acceleration, lateral acceleration, and vertical acceleration. The navigation unit may be any sensor configured to estimate the geographic location of the vehicle 11. To this end, the navigation unit may include one or more transceivers, including a transceiver operable to provide information regarding the position of the vehicle 11 with respect to the Earth. The navigation unit may also be configured to determine or plan a driving route from a given starting point (for example, the current location of the vehicle) to a selected destination, using stored and/or available maps, in a manner known in the art. One or more sensors configured to detect proximity, distance, speed, and other information relating to vehicles moving adjacent to or within a predetermined distance of the vehicle 11 may also be provided.

In a known manner, the vehicle sensors 28 provide data used by the computer processing system 114 in formulating and executing suitable control commands for the various vehicle systems. For example, data from inertial sensors, wheel speed sensors, road condition sensors, and steering angle sensors may be processed in formulating a command to turn the vehicle and executing it in the steering system 18. The vehicle sensors 28 may include any sensors required to support any driver assistance functions and autonomous operating functions incorporated into the vehicle 11. In arrangements in which the sensor system 28 includes a plurality of sensors, the sensors may operate independently of one another. Alternatively, two or more of the sensors may operate in combination with each other. The sensors of the sensor system 28 may be operatively connected to the computer processing system 114 and/or to any other element of the vehicle 11.

Any data collected by the vehicle sensors 28 may also be transmitted to any vehicle system or component that requires or uses the data for the purposes described herein. For example, the data collected by the vehicle sensors 28 may be transmitted to the computer processing system 114, or to one or more dedicated system or component controllers (not shown). Additional particular types of sensors may include any other types of sensors needed to perform the functions and operations described herein.

Information from a particular vehicle sensor may be processed and used to control more than one vehicle system or component. For example, in a vehicle incorporating both automated steering and automated braking control, various road condition sensors may provide data to the computer processing system 114, enabling the computer processing system to process the road condition information in accordance with stored processor-executable instructions and to formulate appropriate control commands for both the steering system and the braking system.

The vehicle 11 may include signal processing means 38 suited to situations in which a sensor output signal or other signal requires pre-processing before use by the computer processing system 114 or by another vehicle system or element, or in which a control signal sent from the computer processing system requires processing before use by an actuatable subsystem or subsystem component (e.g., a component of the steering system or throttle system). The signal processing means may be, for example, an analog-to-digital (A/D) converter or a digital-to-analog (D/A) converter.

The sensor fusion capability 138 may take the form of an algorithm (or a computer program product storing an algorithm) configured to accept data from the sensor system 28 as input. The data may include, for example, data representing information sensed by the sensors of the sensor system 28. The sensor fusion algorithm may process the data received from the sensor system to generate an integrated or composite signal (formed, for example, from the outputs of multiple individual sensors). The sensor fusion algorithm 138 may include, for example, a Kalman filter, a Bayesian network, or another algorithm. The sensor fusion algorithm 138 may further provide various assessments based on data from the sensor system 28. In an example embodiment, the assessments may include evaluations of individual objects or features in the environment of the vehicle 11, evaluation of a particular situation, and evaluation of possible impacts based on the particular situation. Other assessments are also possible. The sensor fusion algorithm 138 may be stored in a memory (such as memory 154) incorporated into, or in operative communication with, the computer processing system 114, and may be executed by the computer processing system in a manner known in the art.
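As an illustration of how a Kalman filter can combine several sensor outputs into one estimate, the following is a minimal sketch of the scalar Kalman measurement update. It is not taken from the specification: the function name `kalman_update`, the example quantities, and the noise variances are all assumptions chosen for illustration.

```python
def kalman_update(mean, var, z, r):
    """Scalar Kalman measurement update.

    mean, var: current estimate and its variance
    z, r:      new sensor reading and its measurement-noise variance
    Returns the fused (mean, variance).
    """
    gain = var / (var + r)            # Kalman gain: how much to trust the reading
    new_mean = mean + gain * (z - mean)
    new_var = (1.0 - gain) * var      # fused estimate is never less certain
    return new_mean, new_var


# Hypothetical example: fuse a prior range estimate with two sensor readings.
estimate = (10.0, 4.0)                                # prior: 10 m, variance 4
estimate = kalman_update(*estimate, z=12.0, r=4.0)    # e.g. a RADAR reading
estimate = kalman_update(*estimate, z=11.0, r=2.0)    # e.g. a LIDAR reading
```

Each update weights the new reading by the ratio of the current uncertainty to the total uncertainty, so a more precise sensor pulls the estimate further.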

As used herein, the term "continuously," when referring to the reception, gathering, monitoring, processing, and/or determination of any information or parameters described herein, means that the computer processing system 114 is configured to receive and/or process any information relating to these parameters as soon as the information exists or is detected, or as soon as possible in accordance with the sensor acquisition cycles and processor processing cycles. As soon as the computer processing system 114 receives, for example, data from sensors or information relating to the status of a vehicle component, the computer processing system may act in accordance with stored programming instructions. Similarly, the computer processing system 114 may receive and process an ongoing or continuous flow of information from the sensor system 28 and from other information sources. This information is processed and/or evaluated in accordance with instructions stored in a memory, in the manner and for the purposes described herein.

FIG. 3 also shows a block diagram of an exemplary computer processing system 114 configured in a manner similar to the computer processing system 114 of FIG. 1, as previously described. In addition to incorporating the functions needed to perform policy revision and to determine control inputs as described herein, the computer processing system 114 may be operatively connected to the other vehicle systems and elements and may otherwise be configured to affect the control and operation of the vehicle 11 and its components. The computer processing system 114 may be configured to control at least some systems and/or components autonomously (without user input) and/or semi-autonomously (with some degree of user input). The computer processing system may also be configured to control and/or execute certain functions autonomously and/or semi-autonomously. The computer processing system 114 may control the functioning of the vehicle 11 based on inputs and/or information received from various subsystems (e.g., power system 26, sensor system 28, steering system 18), from any of the communication interfaces 16, and/or from any other suitable source of information.

In the embodiment of FIG. 3, the computer processing system 114 may include a vehicle control dynamics model 187, a critic 181, an actor 183, and a control policy 201, as previously described with respect to FIG. 1. The computer processing system 114 may be configured to determine control inputs and to revise and/or optimize an autonomous vehicle operation control policy, as previously described. The computer processing system 114 may also be configured to control the vehicle so as to perform a desired operation in accordance with the control inputs and in accordance with a control policy revised or optimized as described herein.

The computer processing system 114 may have some or all of the elements shown in FIG. 3. In addition, the computer processing system 114 may include additional components as needed or desired for particular applications. The computer processing system 114 may also represent, or be embodied in, a plurality of controllers or computer processing devices that process information and/or serve to control individual components or subsystems of the vehicle 11 in a distributed fashion.

The memory 154 may contain data 160 and/or instructions 156 (e.g., program logic) executable by the one or more processors 158 to perform various functions of the vehicle 11. The memory 154 may also contain additional instructions, including instructions to transmit data to, receive data from, interact with, or control one or more of the vehicle systems and/or components described herein (e.g., power system 26, sensor system 28, computer processing system 114, and communication interfaces 16). In addition to the instructions 156, the memory 154 may store data such as roadway maps and path information, among other information. Such information may be used by the vehicle 11 and the computer processing system 114 for route planning and other purposes during operation of the vehicle 11 in autonomous, semi-autonomous, and/or manual modes.

The computer processing system 114 may be configured to coordinate control of the various actuatable vehicle systems and components so as to implement one or more autonomous functions or operations (generally designated 62). These autonomous functions 62 may be stored in the memory 154 and/or in other memories and may be implemented in the form of computer-readable program code that, when executed by a processor, implements one or more of the various processes, instructions, or functions described herein.

The communication interfaces 16 may be configured to allow interaction between the vehicle 11 and external sensors, other vehicles, other computer systems, various external messaging and communications systems (such as a satellite system, a cellular phone/wireless communication system, various vehicle service centers as described herein, etc.), and/or a user. The communication interfaces 16 may include a user interface (e.g., one or more displays (not shown), voice/audio interfaces (not shown), and/or other interfaces) for providing information to, or receiving input from, a user of the vehicle 11.

The communication interfaces 16 may also include interfaces enabling communication in a wide area network (WAN), a wireless telecommunications network, and/or any other suitable communications network. The communication network may include wired communication links and/or wireless communication links. The communication network may include any combination of the above networks and/or other types of networks. The communication network may include one or more routers, switches, access points, wireless access points, and/or the like. In one or more arrangements, the communication network may include vehicle-to-everything (V2X) technologies (including vehicle-to-infrastructure (V2I) and vehicle-to-vehicle (V2V) technologies), which can allow communications between any nearby vehicles and the vehicle 11 and any nearby roadside communications nodes and/or infrastructure.

When used in a WAN networking environment, the computer processing system 114 may include (or be operatively connected to) a modem or other means for establishing communications over the WAN, such as a network (e.g., the Internet). When used in a wireless telecommunications network, the computer processing system 114 may include (or be operatively connected to) one or more transceivers, digital signal processors, and additional circuitry and software for communicating with wireless computer processing devices (not shown) via one or more network devices (e.g., base transceiver stations) in the wireless network. These configurations provide various ways of receiving a constant flow of information from various external sources.

The vehicle 11 may include various actuatable subsystems and elements in operative communication with the computer processing system 114 and with other vehicle systems and/or components, and which are operable in response to control commands received from the computer processing system. The various actuatable subsystems and elements may be controlled manually or automatically (by the computer processing system 114), depending on factors such as a given driving situation, for example which autonomous driving assistance systems (e.g., ACC and/or lane keeping) are activated and/or whether the vehicle is being driven in a fully autonomous mode.

The steering system 18 may include vehicle wheels, rack-and-pinion steering gears, steering knuckles, and/or any other elements or combination of elements (including any computer-system-controllable mechanisms or elements) that may be operable to adjust the heading of the vehicle 11. The power system 26 may include components operable to provide powered motion for the vehicle 11. In an example embodiment, the power system 26 may include an engine (not shown), an energy source (such as gasoline, diesel fuel, or, in the case of a hybrid vehicle, one or more electric batteries), and a transmission (not shown). The braking system 22 may include any combination of elements and/or computer-system-controllable mechanisms configured to decelerate the vehicle 11. The throttle system may include elements and/or mechanisms (for example, an accelerator pedal and/or any computer-system-controllable mechanisms) configured to control the speed of the vehicle 11 by, for example, controlling the operating speed of the engine. FIG. 3 shows just a few examples 18, 20, 22, 26 of vehicle subsystems that may be incorporated into a vehicle. A particular vehicle may incorporate one or more of these systems, or one or more other systems (not shown) in addition to one or more of the systems shown.

The vehicle 11 may be configured so that the computer processing system 114, the sensor system 28, the actuatable subsystems 18, 20, 22, 26, and other systems and elements can communicate with each other using a controller area network (CAN) bus 33 or the like. Via the CAN bus and/or other wired or wireless mechanisms, the computer processing system 114 may transmit messages to (and/or receive messages from) the various systems and components. Alternatively, any of the elements and/or systems described herein may be directly connected to each other without the use of a bus. Likewise, connections between the elements and/or systems described herein may be through another physical medium (such as a wired connection), or the connections may be wireless. Although FIG. 3 shows various components of the vehicle 11, such as the computer processing system 114, the memory 154, and the communication interfaces 16, as being integrated into the vehicle 11, one or more of these components could be mounted or associated separately from the vehicle 11. For example, the memory 154 could exist, in part or in full, separate from the vehicle 11. Thus, the vehicle 11 could be provided in the form of device elements that may be located separately or together. The device elements that make up the vehicle 11 could be communicatively coupled together in a wired or wireless fashion. Thus, in another aspect, as described herein, the computer processing system 114 may be configured to optimize a control policy usable for autonomously controlling a vehicle to perform a vehicle operation. The computer processing system 114 may include one or more processors 158 for controlling operation of the computer processing system 114, and a memory 154 for storing data and program instructions usable by the one or more processors. The memory 154 may be configured to store computer code that, when executed by the one or more processors, causes the one or more processors 158 to: a) receive passively collected data relating to the system; b) determine a Z-value function usable for estimating a cost-to-go for the vehicle; c) in a critic network in the computer processing system, determine a Z-value using the Z-value function and samples of the passively collected data, and estimate an average cost under an optimal policy using samples of the passively collected data; d) in an actor network in the computer processing system, revise the control policy using the passively collected data, the control dynamics for the system, the cost-to-go, and a control gain; and e) iteratively repeat steps (c) and (d) until the estimated average cost converges.
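The control flow of steps (c) through (e) can be sketched as a simple loop. This is only a skeleton under stated assumptions: `critic_update` and `actor_update` are hypothetical callables standing in for the critic and actor computations (whose update equations are not reproduced here), and the tolerance-based convergence test on the estimated average cost is one possible choice, not necessarily the one used in the specification.

```python
def optimize_policy(samples, critic_update, actor_update,
                    tol=1e-4, max_iters=1000):
    """Alternate critic and actor updates until the average cost settles.

    samples:       passively collected data (any container)
    critic_update: hypothetical callable; updates the Z-value estimate and
                   returns the current estimated average cost   (step c)
    actor_update:  hypothetical callable; revises the control policy (step d)
    Returns (last estimated average cost, number of iterations run).
    """
    previous_cost = None
    for iteration in range(1, max_iters + 1):
        average_cost = critic_update(samples)   # step (c)
        actor_update(samples)                   # step (d)
        # Step (e): stop once the estimate changes by less than `tol`.
        if previous_cost is not None and abs(average_cost - previous_cost) < tol:
            return average_cost, iteration
        previous_cost = average_cost
    return previous_cost, max_iters
```

Because the loop only reads stored samples, no new interaction with the environment is needed during optimization, which is the sense in which the learning is passive.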

Referring to FIGS. 4 and 5, in one example of an embodiment of the pAC reinforcement learning method described herein, an autonomous highway merge operation is simulated. Passively collected data relating to the highway merge operation is processed as previously described to learn an optimized control policy configured to control the vehicle so as to perform the vehicle operation with a minimum expected cumulative cost. The vehicle may then be controlled in accordance with the learned control policy to perform the highway merge operation.

The highway merge operation may have a four-dimensional state space and a one-dimensional action space. The passive dynamics A(x_t) of the vehicle environment dynamics and the vehicle control dynamics B(x) can be expressed as follows:

Here, the subscript "0" denotes the vehicle labeled "0," which is behind the merging vehicle in the rightmost lane of the highway (the "following vehicle"); the subscript "1" denotes the vehicle labeled "1," which is the merging automated vehicle on the ramp RR; and the subscript "2" denotes the vehicle labeled "2," which is ahead of the merging vehicle 1 in the rightmost lane of the highway (the "preceding vehicle"). dx₁₂ and dv₁₂ denote the relative position and velocity of the merging vehicle with respect to the preceding vehicle, and the term α₀(x) represents the acceleration of the following vehicle 0. The parameters α, β, and γ are model parameters (such as those used in the Gazis-Herman-Rothery (GHR) car-following model) that can be tuned to human driving behavior in the traffic environment. For the example, it is assumed that the preceding vehicle is driven at a constant speed v₂ = 30 meters/second and that the vehicle control dynamics for the following vehicle are known. If the speed of the following vehicle is lower than that of the preceding vehicle (dv₀₂ < 0), then α = 1.55, β = 1.08, and γ = 1.65; otherwise, α = 2.15, β = −1.65, and γ = −0.89.
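To make the role of the parameters α, β, and γ concrete, the following sketch evaluates a car-following acceleration using the commonly cited form of the GHR model, a = α · v^β · dv / |dx|^γ. The functional form, the use of the absolute gap, and the function names are assumptions made for illustration; the specification's own expression for α₀(x) is not reproduced here. Only the two parameter sets and the switching condition are taken from the text above.

```python
def ghr_acceleration(v_follow, dv, dx, alpha, beta, gamma):
    """Commonly cited GHR car-following form (assumed, not from the source).

    v_follow: speed of the following vehicle in m/s (must be positive)
    dv:       relative velocity with respect to the vehicle ahead in m/s
    dx:       relative position in m (only its magnitude is used here)
    """
    return alpha * (v_follow ** beta) * dv / (abs(dx) ** gamma)


def following_vehicle_acceleration(v0, dv02, dx02):
    """Pick the parameter set by the sign of dv02, as stated in the text."""
    if dv02 < 0:
        alpha, beta, gamma = 1.55, 1.08, 1.65
    else:
        alpha, beta, gamma = 2.15, -1.65, -0.89
    return ghr_acceleration(v0, dv02, dx02, alpha, beta, gamma)


# Hypothetical state: following vehicle at 25 m/s, 40 m behind the
# preceding vehicle, relative velocity dv02 = -5 m/s.
a0 = following_vehicle_acceleration(25.0, -5.0, -40.0)
```

Under this assumed form the sign of the acceleration follows the sign of dv₀₂, and a zero relative velocity yields zero acceleration.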

The state cost q(x) can be expressed as follows:

Here, k₁, k₂, and k₃ are weights for the state cost. When the merging vehicle on the ramp is between the following vehicle and the preceding vehicle (i.e., when dx₁₂ < 0 and dx₁₂ > dx₀₂), k₁ = 1, k₂ = 10, and k₃ = 10; otherwise, k₁ = 10, k₂ = 10, and k₃ = 0. The weights k₁, k₂, k₃ for the state cost may be assigned or tuned manually. Alternatively, the state cost function may be learned from a collected data set using inverse reinforcement learning. The cost is designed to motivate the automated vehicle to merge midway between the following vehicle and the preceding vehicle, at the same velocity as the following vehicle. Initial states were randomly selected within the ranges −100 < dx₁₂ < 100 meters, −10 < dv₁₂ < 10 meters/second, −100 < dx₀₂ < −5 meters, and −10 < dv₀₂ < 10 meters/second. Gaussian radial basis functions were used to approximate the Z-value:

Here, m_i and S_i are the mean and inverse covariance of the i-th radial basis function. For the highway merging simulation, the Z-value was approximated with 4096 Gaussian radial basis functions whose means were set at the vertices of a grid composed of 8 values per state dimension. The standard deviation of each basis was 0.7 times the distance between the two closest bases in each dimension. Because the true value of ρ(x) is constant in this example, g(x) = 1 was used to estimate the control gain ρ(x). The optimal policy was determined using equation (7), as described above. The method optimized the policy from 10,000 samples collected by simulating the passive dynamics. FIG. 5 shows the rate of successful merging within 30 seconds (expressed in terms of the number of iterations required for convergence), starting from 125 different initial states and using successive control inputs determined by the method described herein.
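The basis layout described above (8 values per dimension over 4 state dimensions, giving 8⁴ = 4096 centers, with a standard deviation of 0.7 times the grid spacing) can be sketched as follows. Several details are assumptions made for illustration: the grid is taken to span the initial-state ranges, the inverse covariance is taken to be diagonal, each basis uses the standard Gaussian form exp(−½ (x − m_i)ᵀ S_i (x − m_i)), and the Z-value is taken to be a weighted sum of the bases. None of these is confirmed by the text reproduced here.

```python
import itertools
import math

# Assumed grid extent per state dimension: (dx12, dv12, dx02, dv02).
RANGES = [(-100.0, 100.0), (-10.0, 10.0), (-100.0, -5.0), (-10.0, 10.0)]
N_PER_DIM = 8


def linspace(lo, hi, n):
    """n evenly spaced values from lo to hi inclusive."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


# Basis means m_i: every vertex of the 8 x 8 x 8 x 8 grid (4096 in total).
centers = list(itertools.product(*(linspace(lo, hi, N_PER_DIM)
                                   for lo, hi in RANGES)))

# Standard deviation per dimension: 0.7 times the spacing between neighbors.
sigmas = [0.7 * (hi - lo) / (N_PER_DIM - 1) for lo, hi in RANGES]


def rbf_features(x):
    """Evaluate all Gaussian bases at state x (diagonal inverse covariance)."""
    features = []
    for m in centers:
        d2 = sum(((xi - mi) / s) ** 2 for xi, mi, s in zip(x, m, sigmas))
        features.append(math.exp(-0.5 * d2))
    return features


def z_value(x, weights):
    """Assumed linear approximation: Z(x) is a weighted sum of the bases."""
    return sum(w * f for w, f in zip(weights, rbf_features(x)))
```

With this layout, neighboring bases overlap substantially, so the approximation varies smoothly between grid vertices.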

The state cost function may be designed or tuned to fit a particular merging situation. In one or more arrangements, the computer processing system may be programmed to use inverse reinforcement learning to learn a state cost function for a particular merging situation.
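As a small illustration of how the cost can be adapted to the merging situation, the sketch below selects the weights k₁, k₂, k₃ from the relative positions, using the condition and values given for the highway merge example. The function name `state_cost_weights` is hypothetical, and the sketch covers only the weight selection, not the state cost expression itself.

```python
def state_cost_weights(dx12, dx02):
    """Return (k1, k2, k3) for the highway merge example.

    The merging vehicle counts as being between the following and
    preceding vehicles when dx12 < 0 and dx12 > dx02.
    """
    if dx02 < dx12 < 0:
        return (1.0, 10.0, 10.0)
    return (10.0, 10.0, 0.0)


# Hypothetical states: merging vehicle 20 m behind the preceding vehicle,
# following vehicle 50 m behind it (inside the gap), then outside the gap.
inside_gap = state_cost_weights(dx12=-20.0, dx02=-50.0)
outside_gap = state_cost_weights(dx12=10.0, dx02=-50.0)
```

Manually tuned values like these could be replaced by weights learned through inverse reinforcement learning, as the text notes.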

As will be appreciated by one skilled in the art upon reading this disclosure, various aspects described herein may be embodied as a method, a computer processing system, or a computer program product. Accordingly, those aspects may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, such aspects may take the form of a computer program product stored by one or more computer-readable storage media having computer-readable program code, or instructions, embodied in or on the storage media for performing the functions described herein. In addition, various signals representing data, instructions, or events as described herein may be transferred between a source and a destination in the form of electromagnetic waves traveling through signal-conducting media such as metal wires, optical fibers, and/or wireless transmission media (e.g., air and/or space).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments. In this regard, each block in the flowcharts or block diagrams may represent a module, segment, or portion of code that comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in a block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.

The terms "a" and "an," as used herein, are defined as one or more than one. The term "plurality," as used herein, is defined as two or more than two. The term "another," as used herein, is defined as at least a second or more. The terms "including" and/or "having," as used herein, are defined as comprising (i.e., open language). The phrase "at least one of ... and ...," as used herein, refers to and encompasses any and all possible combinations of one or more of the associated listed items. As an example, the phrase "at least one of A, B, and C" includes A only, B only, C only, or any combination thereof (e.g., AB, AC, BC, or ABC).

In the above detailed description, reference is made to the accompanying drawings, which form a part hereof. In the figures, similar symbols identify similar components, unless context dictates otherwise. The illustrative embodiments described in the detailed description, figures, and claims are not meant to be limiting. Other embodiments may be utilized, and other changes may be made, without departing from the scope of the subject matter presented herein. The aspects of the present disclosure, as generally described herein and illustrated in the figures, can be arranged, substituted, combined, separated, and designed in a wide variety of different configurations, all of which are explicitly contemplated herein. Accordingly, reference should be made to the following claims, rather than to the foregoing specification, as indicating the scope of the invention.

Claims

1. A computer-implemented method for autonomously controlling a vehicle to perform a vehicle operation, the method comprising:
applying a passive actor-critic reinforcement learning method to passively collected data relating to the vehicle operation, to learn a control policy configured to control the vehicle so as to perform the vehicle operation with a minimum expected cumulative cost; and
controlling the vehicle in accordance with the control policy to perform the vehicle operation.

2. The method of claim 1, wherein the vehicle operation is an operation of merging the vehicle into a traffic lane between a second vehicle and a third vehicle traveling in the lane, and wherein the control policy is configured to control the vehicle so as to merge the vehicle midway between the second vehicle and the third vehicle.

3. The method of claim 1, further comprising receiving a control policy that can be adapted to control the vehicle to perform the vehicle operation, wherein the step of applying the passive actor-critic reinforcement learning method to the passively collected data comprises:
a) in a critic network, estimating a Z-value and an average cost under an optimal control policy using samples of the passively collected data;
b) in an actor network operatively coupled to the critic network, modifying the control policy using the samples of the passively collected data, the estimated Z-value, and the estimated average cost under an optimal control policy from the critic network; and
c) iteratively repeating steps (a) and (b) until the estimated average cost converges.
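
As an illustration only, the iteration recited in steps (a) to (c) can be sketched as a loop that alternates a critic update and an actor update until successive estimates of the average cost agree. The names `learn_policy`, `critic_step`, `actor_step`, and `tol`, and the toy critic used in the usage lines, are hypothetical placeholders introduced for this sketch and are not elements of the claimed method.

```python
from typing import Callable, Iterable, List, Tuple


def learn_policy(
    samples: Iterable[float],
    critic_step: Callable[[float], Tuple[float, float]],
    actor_step: Callable[[float, float, float], None],
    tol: float = 1e-6,
) -> Tuple[float, int]:
    """Alternate critic and actor updates until the average-cost estimate converges.

    critic_step(sample) -> (z_value, avg_cost)   # step (a), hypothetical signature
    actor_step(sample, z_value, avg_cost)        # step (b), hypothetical signature
    Returns the last average-cost estimate and the number of samples consumed.
    """
    prev_avg_cost = None
    avg_cost = float("nan")
    n_used = 0
    for sample in samples:
        z_value, avg_cost = critic_step(sample)   # (a) critic estimates
        actor_step(sample, z_value, avg_cost)     # (b) actor modifies the policy
        n_used += 1
        # (c) stop once successive average-cost estimates agree within tol
        if prev_avg_cost is not None and abs(avg_cost - prev_avg_cost) < tol:
            break
        prev_avg_cost = avg_cost
    return avg_cost, n_used


class ToyCritic:
    """Placeholder critic: running estimate of an observed cost; Z-value fixed at 1.0."""

    def __init__(self) -> None:
        self.avg_cost = 0.0

    def __call__(self, observed_cost: float) -> Tuple[float, float]:
        self.avg_cost += 0.5 * (observed_cost - self.avg_cost)
        return 1.0, self.avg_cost


# Toy usage: every passively collected sample reports a cost of 2.0.
policy_updates: List[Tuple[float, float, float]] = []
final_avg_cost, n_used = learn_policy(
    samples=[2.0] * 100,
    critic_step=ToyCritic(),
    actor_step=lambda s, z, c: policy_updates.append((s, z, c)),
)
```

The stopping test compares consecutive estimates, which is one simple reading of "converges"; other convergence criteria would fit the same loop structure.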

4. The method of claim 3, wherein the Z-value is estimated using a linearized version of the Bellman equation.
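
For orientation, in the literature on linearly solvable Markov decision processes, the average-cost form of the linearized Bellman equation is commonly written as below. The symbols are assumptions chosen for this sketch and may differ from the notation of the description: $V$ is the cost-to-go, $Z = \exp(-V)$ is the Z-value, $V_{\mathrm{avg}}$ is the average cost, $q$ is the state cost, and $p$ is the passive (uncontrolled) transition density.

```latex
Z_{\mathrm{avg}}\, Z(x_k)
  = \exp\bigl(-q(x_k)\bigr)\,
    \mathbb{E}_{x_{k+1} \sim p(\cdot \mid x_k)}\bigl[\, Z(x_{k+1}) \,\bigr],
\qquad
Z(x) = \exp\bigl(-V(x)\bigr),
\quad
Z_{\mathrm{avg}} = \exp\bigl(-V_{\mathrm{avg}}\bigr).
```

In this form the equation is linear in $Z$, and the expectation is taken over passive transitions, so it can be evaluated on passively observed state pairs $(x_k, x_{k+1})$.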

5. The method of claim 3, wherein the step of estimating the average cost under an optimal policy comprises updating the average cost prior to the step of modifying the control policy.

6. The method of claim 3, wherein the step of estimating the Z-value comprises:
approximating a Z-value function using a linear combination of weighted radial basis functions; and
approximating the Z-value using the approximated Z-value function and the samples of the passively collected data.

7. The method of claim 6, wherein the step of approximating the Z-value function using a linear combination of weighted radial basis functions comprises optimizing weights used in the weighted radial basis functions.

8. The method of claim 7, wherein the step of approximating the Z-value function using a linear combination of weighted radial basis functions comprises updating the weights used in the weighted radial basis functions prior to the step of optimizing the weights.
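
A minimal sketch of the approximation recited in claims 6 to 8, under stated assumptions: the radial basis functions are taken to be Gaussians with a shared width, and the weights are updated by one gradient step on the squared residual of an average-cost linearized Bellman equation, with the expectation replaced by a single observed transition. The function names, the learning rate `lr`, the clipping of weights at zero, and all numeric values are hypothetical choices for illustration. A practical implementation would also need some normalization to rule out the trivial all-zero weight vector; that is omitted here.

```python
import math
from typing import List, Sequence


def rbf_features(x: Sequence[float], centers: Sequence[Sequence[float]], width: float) -> List[float]:
    """Gaussian radial basis functions f_i(x) = exp(-||x - m_i||^2 / (2 * width^2))."""
    return [
        math.exp(-sum((a - b) ** 2 for a, b in zip(x, m)) / (2.0 * width ** 2))
        for m in centers
    ]


def z_value(x, weights, centers, width) -> float:
    """Approximated Z-value: linear combination of weighted radial basis functions."""
    return sum(w * f for w, f in zip(weights, rbf_features(x, centers, width)))


def bellman_residual(weights, centers, width, x_k, x_next, q_k, z_avg) -> float:
    """Residual z_avg * Z(x_k) - exp(-q_k) * Z(x_next) for one observed transition."""
    return z_avg * z_value(x_k, weights, centers, width) - math.exp(-q_k) * z_value(
        x_next, weights, centers, width
    )


def update_weights(weights, centers, width, x_k, x_next, q_k, z_avg, lr=0.1) -> List[float]:
    """One gradient step on 0.5 * residual^2 with respect to the weights.

    Weights are clipped at zero so the approximated Z-value stays non-negative
    (an assumption of this sketch).
    """
    f_k = rbf_features(x_k, centers, width)
    f_next = rbf_features(x_next, centers, width)
    decay = math.exp(-q_k)
    residual = bellman_residual(weights, centers, width, x_k, x_next, q_k, z_avg)
    grad = [residual * (z_avg * a - decay * b) for a, b in zip(f_k, f_next)]
    return [max(0.0, w - lr * g) for w, g in zip(weights, grad)]


# Toy usage with a one-dimensional state and three basis functions.
centers = [[0.0], [1.0], [2.0]]
weights = [0.5, 0.5, 0.5]
sample = dict(x_k=[0.5], x_next=[1.5], q_k=0.2, z_avg=0.9)

before = bellman_residual(weights, centers, 1.0, **sample)
new_weights = update_weights(weights, centers, 1.0, **sample)
after = bellman_residual(new_weights, centers, 1.0, **sample)
```

In the toy usage the magnitude of the residual shrinks after the update, which is the behavior expected of a small gradient step on a quadratic objective.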

9. The step of modifying the control policy comprises:
approximating a control gain;
optimizing the control gain to provide an optimized control gain; and
modifying the control policy using the optimized control gain.

10. The method of claim 9, further comprising, before optimizing the control gain:
determining a control input; and
determining a value of an action-value function using the control input, the samples of the passively collected data, and the approximated control gain.

11. The method of claim 9, wherein the step of approximating the control gain comprises approximating the control gain using a linear combination of weighted radial basis functions.

12. The method of claim 11, further comprising updating weights used in the weighted radial basis functions prior to the step of approximating the control gain using a linear combination of weighted radial basis functions.
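
A minimal sketch of a control gain approximated as a linear combination of weighted radial basis functions, as recited in claim 11. The Gaussian form of the basis functions and the control law `u = -rho(x) * B(x)^T dV/dx` for a single-input system are assumptions borrowed from the linearly solvable control literature; neither is confirmed by the claims themselves, and all names and numbers below are hypothetical.

```python
import math
from typing import Sequence


def control_gain(
    x: Sequence[float],
    gain_weights: Sequence[float],
    centers: Sequence[Sequence[float]],
    width: float,
) -> float:
    """Control gain rho(x) as a weighted sum of Gaussian radial basis functions."""
    features = [
        math.exp(-sum((a - b) ** 2 for a, b in zip(x, m)) / (2.0 * width ** 2))
        for m in centers
    ]
    return sum(mu * g for mu, g in zip(gain_weights, features))


def control_input(
    x: Sequence[float],
    grad_v: Sequence[float],
    b_column: Sequence[float],
    gain_weights: Sequence[float],
    centers: Sequence[Sequence[float]],
    width: float,
) -> float:
    """Assumed control law for one input: u = -rho(x) * (B(x)^T dV/dx).

    grad_v is the gradient of the cost-to-go at x, and b_column is the column
    of the control dynamics B(x); both are supplied by the caller.
    """
    rho = control_gain(x, gain_weights, centers, width)
    return -rho * sum(b * g for b, g in zip(b_column, grad_v))


# Toy usage: one basis function centered on the query state, so rho(x) equals its weight.
u = control_input(
    x=[0.0],
    grad_v=[0.5],
    b_column=[1.0],
    gain_weights=[2.0],
    centers=[[0.0]],
    width=1.0,
)
```

With these toy values the gain evaluates to 2.0 and the resulting control input opposes the cost-to-go gradient.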

13. A computer-implemented method for optimizing a control policy that can be used to control a system to perform an operation, the method comprising:
providing a control policy that can be used to control the system; and
modifying the control policy by applying a passive actor-critic reinforcement learning method to passively collected data relating to the operation to be performed, such that the control policy is operable to control the system to perform the operation with a minimum expected cumulative cost.

14. Applying the passive actor-critic reinforcement learning method to the passively collected data comprises:
a) in a critic network, estimating a Z-value using samples of the passively collected data and estimating an average cost under an optimal policy using the samples of the passively collected data;
b) in an actor network, modifying the control policy using the samples of the passively collected data, control dynamics for the system, a cost-to-go, and a control gain;
c) updating parameters used to modify the control policy and parameters used to estimate the Z-value and the average cost under an optimal policy; and
d) iteratively repeating steps (a) to (c) until the estimated average cost converges.

15. A computer processing system configured to optimize a control policy that can be used to autonomously control a vehicle to perform a vehicle operation,
the computer processing system comprising one or more processors for controlling operation of the computer processing system, and a memory for storing data and program instructions usable by the one or more processors,
wherein the memory is configured to store computer code that, when executed by the one or more processors, causes the one or more processors to:
a) receive passively collected data relating to the vehicle operation;
b) determine a Z-value function usable to estimate a cost-to-go for the vehicle;
c) in a critic network in the computer processing system:
c1) determine a Z-value using the Z-value function and samples of the passively collected data; and
c2) estimate an average cost under an optimal policy using the samples of the passively collected data;
d) in an actor network in the computer processing system, modify the control policy using the passively collected data, control dynamics for the vehicle, the cost-to-go, and a control gain; and
e) iteratively repeat steps (c) and (d) until the estimated average cost converges.