
CN107423813A - A state space decomposition and sub-goal creation method based on deep learning technology

Info

Publication number: CN107423813A
Authority: CN (China)
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number: CN201710642392.5A
Other languages: Chinese (zh)
Inventors: 王燕清, 郑豪
Current assignee (the listed assignee may be inaccurate): Nanjing Xiaozhuang University
Original assignee: Nanjing Xiaozhuang University
Application filed by Nanjing Xiaozhuang University
Priority application: CN201710642392.5A
Publication: CN107423813A
Current legal status: Pending

Classifications

    • G06N3/045 — Combinations of networks (G PHYSICS › G06 COMPUTING OR CALCULATING; COUNTING › G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS › G06N3/00 Computing arrangements based on biological models › G06N3/02 Neural networks › G06N3/04 Architecture, e.g. interconnection topology)
    • G06N3/082 — Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections (G PHYSICS › G06 › G06N › G06N3/00 › G06N3/02 Neural networks › G06N3/08 Learning methods)
Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Feedback Control In General (AREA)

Abstract

The present invention relates to a state space decomposition and sub-goal creation method based on deep learning technology. A typical reinforcement learning (RL) agent learns to complete a specified task under the reward mechanism specified by the domain. To this end, a framework is developed in which a deep RL agent can use a recurrent attention mechanism to transfer from smaller, simpler domains to more complex ones. A task is presented to the agent as an image together with an instruction specifying the goal. A meta-controller guides the agent toward its goal by designing a sequence of smaller subtasks in the state space, thereby effectively decomposing the goal.

Description

A State Space Decomposition and Subgoal Creation Method Based on Deep Learning Technology

Technical field:

The present invention relates to a state space decomposition and sub-goal creation method based on deep learning technology.

Background technique:

A deep learning framework is designed in which a deep reinforcement learning agent can use a recurrent attention mechanism to transfer from smaller, simpler domains to more complex ones. The learning task is presented to the agent as an image together with an instruction specifying the goal, and a meta-controller guides the agent toward its goal by designing a sequence of subtasks in the state space, thereby effectively decomposing it; the meta-controller creates subgoals within its attention window. The meta-controller learns to decompose the state space and to provide solvable subgoals within a smaller space. Because the meta-controller is solving a delayed-reward problem, it is positively reinforced when the underlying agent solves the original task, and it proposes the sequence of subgoals that maximizes the expectation of this reinforcement. In addition to creating subgoals, the meta-controller fragments the state space so that the underlying agent is presented with a smaller state in which an optimal policy for each subgoal can easily be learned. It accomplishes this with an attention mechanism similar to recurrent attention models: the meta-controller learns to control its attention and passes only part of the state to the agent.

The meta-controller is formulated as a decision process with:

State: S, a representation of past and present states.

Action: A, consisting of the attention location L_attn and the assignment of a subgoal g.

Reward: r, positive if the underlying agent solves the task, and negative otherwise.

Transition: the underlying agent executes its policy given the provided state and subgoal. Since this policy is unknown to the meta-controller, it acts as an additional stochastic factor in the environment.

The meta-controller selects a value for L_attn and a distribution P(g). The state space at that attention location, together with a subgoal g sampled from P(g), is passed to the underlying agent. The agent then chooses an atomic action that moves it toward achieving g. The new agent location L_agent changes the environment of the meta-controller, which then selects a new attention location and subgoal. Two assumptions are made. First, the underlying agent is assumed to have access to the optimal policy for each subgoal. Such goal-dependent policies can be learned with techniques such as Universal Value Function Approximators (UVFAs). UVFAs learn to approximate V(s, g), a value function conditioned on the goal, using function approximators such as deep neural networks. The learned value function V(s, g) can then be used to construct a policy that achieves the goal g. This value function can be trained independently or jointly with the meta-controller, which provides intrinsic rewards for achieving subgoals. Second, the agent is assumed to remain stationary unless both its location and a subgoal are contained in the state provided to it by the meta-controller. In general, the meta-controller is automatically incentivized by its reward structure to direct its attention and provide subgoals in such a way that the underlying agent can solve the given task. In this case, that means keeping the agent location and the subgoal within the attention window.
For example, in the game of Pacman, if the subgoal is to eat the closest pill, the window presented to the underlying agent should contain Pacman and at least one pill. Otherwise, the agent may move randomly and will fail to achieve the overall objective of obtaining a high score. These assumptions simplify the training of the meta-controller, but the approach should extend to the basic setting in which the policy of the underlying agent is also learned.
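The interaction loop above (attend, propose a subgoal, let the agent take an atomic step, repeat) can be sketched in a few lines. This is a toy illustration under assumed names (`meta_policy`, `agent_policy`, a 1-D corridor of cells); it is not the patented implementation, and the real meta-controller samples g from a learned distribution P(g) rather than computing it by rule.

```python
# Minimal sketch (not the patented implementation) of the meta-controller /
# underlying-agent loop. Positions are cells of a toy 1-D corridor.

GOAL = 9  # goal location in a 10-cell corridor (cells 0..9)

def meta_policy(l_attn):
    """Pick a new attention location and a subgoal inside the window.

    The 'attention window' here is 3 cells wide, and the subgoal is the
    window cell closest to the goal -- a rule-based stand-in for
    sampling g ~ P(g) as in the formulation above.
    """
    l_attn = min(l_attn + 1, GOAL)                       # shift attention toward the goal
    window = range(max(0, l_attn - 1), min(GOAL, l_attn + 1) + 1)
    g = max(window)                                      # subgoal must lie inside the window
    return l_attn, g

def agent_policy(l_agent, g):
    """Atomic action of the underlying agent: one step toward subgoal g."""
    return l_agent + (1 if g > l_agent else -1 if g < l_agent else 0)

l_agent, l_attn, steps = 0, 0, 0
while l_agent != GOAL:
    l_attn, g = meta_policy(l_attn)    # meta-controller acts
    l_agent = agent_policy(l_agent, g) # agent moves toward g
    steps += 1

print(steps)  # 9 atomic steps for the 10-cell corridor
```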

Summary of the invention:

The purpose of the present invention is to provide a state space decomposition and sub-goal creation method based on deep learning technology.

The above purpose is achieved through the following technical scheme:

A hierarchical framework is provided in which an agent learns from intrinsic rewards, with a higher-level agent that sets subgoals and operates over a longer time frame. The reward for the high-level agent is provided by the environment for completing the task. Subgoals, in turn, are specified through entities and relations in an object-oriented framework. The method of the present invention further decomposes the state space so that the base agent sees only a small portion of it at any time. The advantage is better computational efficiency, since the base agent can now use a smaller network, and learned policies can be transferred to different parts of the state space without having to explore them explicitly. To achieve this, the higher-level agent, or meta-controller, must learn to integrate the information of the states observed so far. Therefore, the present invention uses a recurrent model, a long short-term memory (LSTM) network, to represent the meta-controller. To train the attention mechanism of the meta-controller, a policy gradient method is adopted, of the kind previously used to train attention mechanisms for classification and simple control tasks. In the present method, no complex sensors are used; the input is simply a 5x5 image, which can easily be incorporated into the setup. Furthermore, instead of producing the continuous output L_attn directly, discrete actions are used to shift attention: up, down, and noop.

Reinforcement learning and Markov decision processes (MDPs)

Reinforcement learning addresses the problem of selecting actions so as to maximize a notion of long-term cumulative reward. It is usually formulated as a Markov decision process (MDP). An MDP is characterized by a tuple <S, A, T, R>. S is the set of states the agent can be in. A(s) is the set of actions available to the agent in each state. Typically, the agent selects an action a from A(s) to execute, which takes it to a new state s' according to the transition function T(s, a, s'). R(s, a) is the scalar reward received when executing action a in state s. Finally, γ is the discount factor. A policy, π : S → A, tells the agent which action to execute in each state. The goal of a reinforcement learning agent is to find the optimal policy π*, which maximizes the long-term expected return, or utility.
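As a concrete (assumed) instance of the tuple <S, A, T, R>, the sketch below builds a three-state deterministic MDP and computes the discounted return Σ γ^t r_t that a policy accumulates; the states, rewards, and γ = 0.9 are illustrative, not taken from the patent.

```python
# A minimal illustration of the MDP tuple <S, A, T, R> and of the
# discounted return that the optimal policy maximizes.

S = ["s0", "s1", "goal"]                               # state set S
A = {"s0": ["right"], "s1": ["right"], "goal": []}     # actions A(s)

# Deterministic transition function T(s, a) -> s'
T = {("s0", "right"): "s1", ("s1", "right"): "goal"}

# Reward function R(s, a): small step cost, +1 for entering the goal
R = {("s0", "right"): -0.01, ("s1", "right"): 1.0}

GAMMA = 0.9  # discount factor

def discounted_return(start, policy, gamma=GAMMA):
    """Roll out a deterministic policy and accumulate gamma**t * r_t."""
    s, t, total = start, 0, 0.0
    while A[s]:                      # terminal states have no actions
        a = policy(s)
        total += (gamma ** t) * R[(s, a)]
        s = T[(s, a)]
        t += 1
    return total

ret = discounted_return("s0", lambda s: "right")
print(round(ret, 4))  # -0.01 + 0.9 * 1.0 = 0.89
```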

Policy gradient

Policy gradient methods directly maximize the expected return by adjusting the policy parameters θ. The expected return of a trajectory sampled from the policy π_θ, parameterized by θ, is

J(θ) = E_{s_{1:T} ~ p(s_{1:T}; θ)} [ Σ_{t=1}^{T} r_t ]    (1)

where p(s_{1:T}; θ) is the distribution over trajectories induced by π_θ, and the sum of the rewards is the return R of the trajectory. In this formulation, no discounting is applied. The policy gradient theorem gives the gradient of J:

∇_θ J(θ) = E [ Σ_{t=1}^{T} ∇_θ log π_θ(a_t | s_t) R_t ]    (2)

A set of trajectories can be sampled from the current policy and the gradient averaged over them. A common modification to reduce variance is to subtract a baseline from the return. The baseline b_t is computed as the average return observed over the past N time steps; here N = 100 is chosen.
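The update in equation (2), with the N-step moving-average baseline just described, can be sketched on a trivial two-armed bandit. The bandit itself, the softmax parameterization, and the learning rate are illustrative assumptions; only the baseline rule (average return over the past N = 100 steps) follows the text.

```python
import math
import random

# Sketch of the policy-gradient (REINFORCE) update with a moving-average
# baseline, as in equations (1)-(2), on an assumed two-armed bandit.

random.seed(42)

theta = [0.0, 0.0]   # policy parameters: one logit per action
N = 100              # baseline window, as chosen in the text
returns = []         # past returns, used to compute the baseline
alpha = 0.1          # learning rate (illustrative)

def pi(theta):
    """Softmax policy pi_theta over the two actions."""
    z = [math.exp(t) for t in theta]
    s = sum(z)
    return [v / s for v in z]

for _ in range(2000):
    p = pi(theta)
    a = 0 if random.random() < p[0] else 1
    r = 1.0 if a == 1 else 0.0        # arm 1 pays +1, arm 0 pays 0
    b = sum(returns[-N:]) / len(returns[-N:]) if returns else 0.0
    returns.append(r)
    # REINFORCE: theta += alpha * grad log pi(a) * (R - b)
    for i in range(2):
        grad_log = (1.0 if i == a else 0.0) - p[i]
        theta[i] += alpha * grad_log * (r - b)

print(pi(theta)[1] > 0.9)  # the policy should strongly prefer arm 1
```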

Environment

The method is evaluated in an environment consisting of a 10x5 grid. The grid is made up of four "spaces", each a horizontal k x 5 bar for some k no greater than 4. These spaces are stacked on top of one another, and each has a distinct color: red, green, blue, or yellow. The environment also generates an instruction, a one-hot vector of length 4 specifying the goal space. An episode terminates when the agent reaches the goal space and receives a positive reward of +1, or when it times out without reaching it. Each step costs −0.01.
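A minimal sketch of this environment follows, under the assumption that the four bars have heights 3+3+2+2 rows (the text only fixes each height at k ≤ 4); the class and helper names are illustrative.

```python
# Minimal sketch of the 10x5 grid environment: four colored horizontal bars
# stacked vertically, a one-hot goal instruction, +1 for reaching the goal
# space, and a -0.01 step cost. The bar heights (3+3+2+2 = 10 rows) are an
# illustrative assumption.

COLORS = ["red", "green", "blue", "yellow"]
BAR_ROWS = [3, 3, 2, 2]            # heights of the four stacked k x 5 bars

def space_of_row(row):
    """Map a grid row (0..9) to the index of the colored space it lies in."""
    top = 0
    for i, k in enumerate(BAR_ROWS):
        if row < top + k:
            return i
        top += k
    raise ValueError("row outside the 10x5 grid")

class GridEnv:
    def __init__(self, goal_space):
        self.goal = goal_space
        # instruction: one-hot vector of length 4 naming the goal space
        self.instruction = [1 if i == goal_space else 0 for i in range(4)]
        self.row, self.col = 0, 0  # agent starts at the top-left corner

    def step(self, action):
        """action in {'up','down','left','right'}; returns (reward, done)."""
        dr, dc = {"up": (-1, 0), "down": (1, 0),
                  "left": (0, -1), "right": (0, 1)}[action]
        self.row = min(9, max(0, self.row + dr))
        self.col = min(4, max(0, self.col + dc))
        if space_of_row(self.row) == self.goal:
            return 1.0, True       # reached the goal space
        return -0.01, False        # step cost

env = GridEnv(goal_space=1)        # goal: the second (green) bar
r1, done1 = env.step("down")       # rows 0-2 are still in the first bar
r2, done2 = env.step("down")
r3, done3 = env.step("down")       # row 3 is the first row of the goal bar
print(r1, done1, r3, done3)        # -0.01 False 1.0 True
```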

Method

Three frameworks are constructed for the meta-controller, which is responsible for providing subgoals to the underlying agent so that it can successfully navigate to the goal space. In the present invention, the meta-controller uses an optimizer with a learning rate of 1e−5. The agent and the attention window (if used) always start at the top-left corner of the grid.

Meta-controller without an attention mechanism

To simplify the problem, the entire state space is given as input to the meta-controller at each step, so no attention mechanism is used. Specifically, the meta-controller receives as input the 10x5 grid image, which contains the spaces and the agent location L_agent. The output of the meta-controller is a distribution over spaces, P(g). A space is sampled from P(g) and provided as an instruction to the base agent. The underlying agent is assumed to move optimally over the full 10x5 grid given the instruction. In this setting, the optimal policy for the meta-controller is to directly output the goal space.

Meta-controller with partial state decomposition

In this setting, the meta-controller has an attention mechanism consisting of a 5x5 window into the 10x5 grid. In addition to the subgoal instruction, it must also output an action to control the attention location. The attention mechanism only partially decomposes the state space: the agent may move optimally to the provided subgoal even when the agent itself is outside the current attention window. The subgoal, however, must lie inside the window. The goal of the meta-controller is to use its attention mechanism to find the location of the goal space and then, at each step, instruct the agent with the color of the space it should enter. The architecture of this meta-controller consists of a state-processing network that receives the 5x5 attention window and the goal instruction as input at each time step. It processes these inputs with a feed-forward convolutional network and uses an LSTM cell to output P(g) and P(a), where P(a) is a probability distribution over attention actions. The convolutional layers of the network use rectified linear unit (ReLU) activations. Attention actions affect the next attention location L_attn, while subgoal instructions affect the next agent location L_agent. The hidden state of the LSTM therefore accumulates the knowledge gained from the sequence of instructions and attention actions taken within an episode. To train this network efficiently with policy gradients, the probabilities of the attention and instruction actions are assumed to be mutually independent.
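The 5x5 window sliding over the 10x5 grid with the discrete up/down/noop actions can be sketched as follows; the helper names and the row-offset encoding of L_attn are assumptions for illustration, not the network described above.

```python
# Sketch of the 5x5 attention window over the 10x5 grid. The window
# position is a row offset moved by the discrete actions up / down / noop;
# the meta-controller sees only the 5x5 slice, never the full grid.

GRID_ROWS, GRID_COLS, WIN = 10, 5, 5

def make_grid():
    """10x5 grid of cell labels; here just (row, col) tuples for clarity."""
    return [[(r, c) for c in range(GRID_COLS)] for r in range(GRID_ROWS)]

def move_attention(l_attn, action):
    """Apply a discrete attention action, clamped to valid offsets 0..5."""
    delta = {"up": -1, "down": 1, "noop": 0}[action]
    return min(GRID_ROWS - WIN, max(0, l_attn + delta))

def attention_window(grid, l_attn):
    """Return the 5x5 slice of the grid starting at row offset l_attn."""
    return [row[:WIN] for row in grid[l_attn:l_attn + WIN]]

grid = make_grid()
l_attn = 0
l_attn = move_attention(l_attn, "down")           # window now covers rows 1-5
window = attention_window(grid, l_attn)
print(len(window), len(window[0]), window[0][0])  # 5 5 (1, 0)
```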

Meta-controller with a constrained attention mechanism

In this setting, the agent does not move unless it lies within the attention window. This means the meta-controller must instruct the agent to move into a space before shifting its attention downward, repeating this process until the agent reaches the goal space. Consequently, if the agent or the subgoal does not appear within the decomposed state space, the goal task cannot be accomplished. As in the partial-state-decomposition experiment, the LSTM units allow the meta-controller to use its memory of the agent and goal locations to guide its behavior when the agent or the goal space does not lie within the attention window at a particular time. The constrained attention framework adds extra steps to the optimal sequence of subgoals required for the agent to reach the goal space.
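The constraint of this setting, that the agent's move is executed only when the agent lies inside the attention window, can be sketched as a single guard (an assumed helper, not code from the patent):

```python
# Sketch of the constrained-attention rule: the underlying agent only
# executes its move when its row lies inside the 5x5 attention window.

WIN = 5

def constrained_step(agent_row, l_attn, action):
    """Move the agent one row up/down only if it is inside the window."""
    inside = l_attn <= agent_row < l_attn + WIN
    if not inside:
        return agent_row                 # move is ignored: agent stays put
    return max(0, min(9, agent_row + (1 if action == "down" else -1)))

# Agent at row 7 with attention covering rows 0-4: the move is ignored.
print(constrained_step(7, 0, "down"))    # 7
# Same move once attention has shifted down to cover rows 3-7: it succeeds.
print(constrained_step(7, 3, "down"))    # 8
```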

Beneficial effects:

The overall contribution of the present invention is a framework that allows an agent to complete a task in a large environment while learning how to do so in a smaller one. By using an attention mechanism, smaller networks are needed and training becomes easier. After developing the three frameworks, the meta-controller learns how the colors of the spaces are represented and how to translate that representation into sub-instructions that guide the agent to the desired goal. The results show that, by using an attention mechanism to decompose a large state space, policies learned in a smaller environment can be scaled up. The ultimate aim of the present invention is to train the underlying agent jointly with the meta-controller and to apply this framework to dynamic and complex environments.

Description of drawings:

Figure 1 shows the state-processing network: the network architecture for the partial-state-decomposition and constrained-attention experiments. The 5x5 attention window and the goal instruction are fed into the meta-controller at each time step, which then outputs the probability distributions.

Figure 2 shows the architecture of the meta-controller with an attention mechanism. It consists of a state-processing network that receives the 5x5 attention window and the goal instruction from the environment at each time step, processes these inputs with a feed-forward convolutional network, and uses an LSTM cell to output P(g) and P(a), where P(a) is a probability distribution over attention actions.

Detailed description:

Embodiment 1: A state space decomposition and sub-goal creation method based on deep learning technology, characterized in that: a deep learning framework is designed in which a deep reinforcement learning agent can use a recurrent attention mechanism to transfer from smaller, simpler domains to more complex ones. The learning task is presented to the agent as an image together with an instruction specifying the goal, and a meta-controller guides the agent toward its goal by designing several subtask sequences in the state space, thereby effectively decomposing it; the meta-controller creates subgoals within its attention window.

Embodiment 2: The meta-controller according to Embodiment 1, characterized in that: the meta-controller learns to decompose the state space and provides solvable subgoals within a smaller space. Because the meta-controller is solving a delayed-reward problem, it obtains positive reinforcement when the underlying agent solves the original task, and it proposes a sequence of subgoals that maximizes the expectation of this reinforcement. In addition to creating subgoals, the meta-controller fragments the state space so that the underlying agent is presented with a smaller state in which an optimal policy for each subgoal can easily be learned. It accomplishes this with an attention mechanism similar to recurrent attention models: the meta-controller learns to control its attention and passes only part of the state to the agent.

Claims (2)

1. A state space decomposition and sub-goal creation method based on deep learning technology, characterized in that: a deep learning framework is designed in which a deep reinforcement learning agent can use a recurrent attention mechanism to transfer from smaller, simpler domains to more complex ones; a learning task is presented to the agent as an image together with an instruction specifying the goal; a meta-controller guides the agent toward its goal by designing several subtask sequences in the state space, thereby effectively decomposing it; and the meta-controller creates sub-goals within its attention window.
2. The meta-controller according to claim 1, characterized in that: the meta-controller learns to decompose the state space and provides solvable sub-goals within a smaller space; because the meta-controller is solving a delayed-reward problem, it obtains positive reinforcement when the underlying agent solves the original task, and it proposes a sequence of sub-goals that maximizes the expectation of this reinforcement; in addition to creating sub-goals, the meta-controller fragments the state space so that the underlying agent is presented with a smaller state in which an optimal policy for each sub-goal can easily be learned; and it accomplishes this with an attention mechanism similar to recurrent attention models, learning to control its attention and passing only part of the state to the agent.
CN201710642392.5A — priority date 2017-07-31, filing date 2017-07-31 — A state space decomposition and sub-goal creation method based on deep learning technology — Pending — CN107423813A (en)

Priority Applications (1)

Application number: CN201710642392.5A — priority date 2017-07-31, filing date 2017-07-31 — A state space decomposition and sub-goal creation method based on deep learning technology


Publications (1)

CN107423813A — published 2017-12-01

Family

ID=60430796

Family Applications (1)

Application number: CN201710642392.5A — A state space decomposition and sub-goal creation method based on deep learning technology — Pending

Country Status (1): CN — CN107423813A (en)


Citations (4)

* Cited by examiner, † Cited by third party

    • CN1629870A * (priority 2003-06-30, published 2005-06-22, Microsoft): Methods and architecture for providing status and forecasts of a user's preference and availability
    • US7415331B2 * (priority 2005-07-25, published 2008-08-19, Lockheed Martin Corporation): System for controlling unmanned vehicles
    • US7734386B2 * (priority 2005-07-25, published 2010-06-08, Lockheed Martin Corporation): System for intelligently controlling a team of vehicles
    • CN105637540A * (priority 2013-10-08, published 2016-06-01, Google): Methods and devices for reinforcement learning


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party

HIMANSHU SAHNI, SAURABH KUMAR, FARHAN TEJANI et al.: "State Space Decomposition and Subgoal Creation for Transfer in Deep Reinforcement Learning", arXiv (Artificial Intelligence) *

Cited By (6)

* Cited by examiner, † Cited by third party

    • CN107992939A * (priority 2017-12-06, published 2018-05-04, 湖北工业大学): Constant-cutting-force gear machining method based on deep reinforcement learning
    • WO2019140767A1 * (priority 2018-01-18, published 2019-07-25, 苏州大学张家港工业技术研究院): Recognition system for security check and control method thereof
    • US11574152B2 (published 2023-02-07): Recognition system for security check and control method thereof
    • CN111464400A * (priority 2020-04-23, published 2020-07-28, 中国人民解放军国防科技大学): Double-end multi-state network reliability evaluation method based on state space decomposition
    • CN114219077A * (priority 2021-12-16, published 2022-03-22, 上海工业自动化仪表研究院有限公司): Binary flow information analysis method based on a neural-network hidden LSTM model
    • CN114219077B (granted, published 2025-09-05): Binary flow information analysis method based on a neural-network hidden LSTM model


Legal Events

    • PB01: Publication
    • SE01: Entry into force of request for substantive examination
    • WD01: Invention patent application deemed withdrawn after publication (application publication date: 2017-12-01)