CN114815816A - An autonomous navigation robot - Google Patents
- Publication number
- CN114815816A (Application CN202210365323.5A)
- Authority
- CN
- China
- Prior art keywords
- robot
- scene
- control strategy
- reinforcement
- controller
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0214—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
- G05D1/0221—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
Description
Technical Field

The invention belongs to the technical field of mobile robots, and in particular relates to an autonomous navigation robot that can plan a reasonable travel path according to changes in its environment and automatically avoid obstacles.
Background Art

Navigation and obstacle avoidance are basic functions a mobile robot needs to complete its tasks. A mobile robot perceives its environment through external sensors and obtains information about every dimension of the surrounding geometric space. Using this geometric information together with an obstacle-avoidance algorithm, the robot can avoid obstacles and plan its path autonomously while it travels.

Implementing autonomous obstacle-avoidance learning on a mobile robot is an important step toward greater robot intelligence. It allows the robot to adopt human-like behavior strategies and to evade dynamic or static obstacles ahead of it in an unknown environment, thereby giving the robot the ability to navigate autonomously.

At present, the main autonomous navigation methods for robots are path planning methods, supervised learning methods, and reinforcement learning methods. Path planning methods require accurate perception of the robot and its environment so that the planned path length is optimal; however, they rely on centralized computation on a central server, which makes them difficult to use in large-scale robot swarms and in unknown environments with dynamic obstacles. Supervised learning methods can make decisions from sensor data and thus allow a mobile robot to avoid dynamic obstacles, but the required training data are difficult to collect: if the robot observes an environment state that never appeared in the training set, it cannot make a correct decision, so generalization is poor. Reinforcement learning methods are trained through interaction between the robot and the environment, need no data set, and adopt policies with a certain degree of randomness. However, an autonomous navigation robot built on a reinforcement learning model alone cannot travel in a straight line along the shortest path in simple scenes; when approaching the target point it may linger near the target instead of reaching it quickly; and in emergency scenes, such as when surrounding obstacles are very dense or an obstacle suddenly appears ahead, it cannot react in time to perform emergency obstacle avoidance.
Summary of the Invention

The invention designs an autonomous navigation strategy for mobile robots in various complex scenes containing dynamic and static obstacles. It compensates for the inability of traditional path planning methods to avoid dynamic obstacles, the poor generalization of supervised learning methods, and the unsatisfactory policies that reinforcement learning methods produce in simple and emergency situations.

To achieve the above object, the invention adopts the following technical solution:

An autonomous navigation robot comprises a sensor, a controller, and a walking mechanism. The sensor detects the distance and angle of obstacles relative to the robot, forming state data. The controller classifies the scene the robot is in according to the state data and the relative position of the target point: for a simple scene it executes a PID control strategy; for a complex scene it executes a reinforcement imitation learning control strategy; for an emergency scene it executes a constrained reinforcement imitation learning control strategy. By executing the corresponding control strategy, the controller computes the linear velocity and angular velocity at which the robot should travel, and the walking mechanism drives the robot to move at these velocities.
In some embodiments of the present application, to prevent the robot from colliding with obstacles while traveling as far as possible, a collision prediction model may be configured in the controller. The collision prediction model predicts, from the state data and the robot's own velocity, whether the robot is about to collide.

In some embodiments of the present application, the simple scene is a scene in which there is no obstacle in front of the robot or the robot has arrived in the vicinity of the target point; the emergency scene is a scene in which the collision prediction model predicts that the robot will collide; and the complex scene is any scene other than the simple and emergency scenes.

In some embodiments of the present application, when executing the PID control strategy, the controller may take the angle between the robot's forward direction and the target point as the error, substitute it into the PID formula to compute the robot's angular velocity, and keep the robot's linear velocity constant. The PID control strategy drives the robot along the shortest path and makes it reach the target point quickly when it is near the target position.
In some embodiments of the present application, the reinforcement imitation learning control strategy executed by the controller may include:

an imitation learning process, which trains the Actor network with data from an expert data set; and

a reinforcement learning process, which uses the Actor network and the Critic network trained in the imitation learning process, together with the state data, the robot's own velocity, and the relative position of the target point, to compute an output action a, and controls the walking mechanism according to the action a to adjust the robot's linear and angular velocities.
In some embodiments of the present application, the autonomous navigation robot is further provided with a memory for storing the action a computed by the controller when executing the reinforcement imitation learning control strategy, the state s the robot reaches after executing action a in the environment, and the reward r obtained for executing action a; the collected (s, a, r) data are stored in an experience pool. When the amount of data in the experience pool satisfies a set condition, the controller computes the loss of the reinforcement learning model and updates the Actor network and the Critic network of the model, thereby optimizing the networks.

In some embodiments of the present application, the constrained reinforcement imitation learning control strategy may use the same Actor network and Critic network as the reinforcement imitation learning control strategy. When executing the constrained reinforcement imitation learning control strategy, the controller first determines whether the robot's linear velocity exceeds a set threshold. If it does, the robot's velocity is set to 0, that is, the robot is stopped, to achieve emergency obstacle avoidance. If it does not, the distance data detected by the sensor are scaled down and the scaled distance data are fed into the reinforcement learning model, so that the value representing the robot's velocity in the output action a computed by the model is reduced. By lowering the robot's travel speed while still navigating with the reinforcement imitation learning control strategy, effective obstacle avoidance is achieved.

Compared with the prior art, the advantages and positive effects of the invention are as follows. The invention uses a scene classification model based on collision prediction to classify the robot's environment at any moment. For simple scenes, the PID control strategy drives the robot straight and quickly to the target point, avoiding the situation in which the robot lingers near the target instead of approaching it rapidly. For complex scenes, the reinforcement imitation learning control strategy navigates the robot safely around obstacles. For emergencies, the constrained reinforcement imitation learning control strategy makes the robot react in time to suddenly appearing obstacles and avoid collisions. Combining the three control strategies allows the robot to reach the target point safely with a shorter travel time and path length, improving efficiency.
Other features and advantages of the invention will become clearer after reading the detailed description of the embodiments in conjunction with the accompanying drawings.
Brief Description of the Drawings

FIG. 1 is a hardware architecture diagram of an embodiment of the autonomous navigation robot proposed by the invention;

FIG. 2 is an overall architecture diagram of an embodiment of the navigation strategy executed by the autonomous navigation robot proposed by the invention;

FIG. 3 is a flowchart of scene classification based on collision prediction;

FIG. 4 is a flowchart of updating the navigation model;

FIG. 5 is a flowchart of an embodiment of the reinforcement imitation learning control strategy;

FIG. 6 is a flowchart of an embodiment of the constrained reinforcement imitation learning control strategy;

FIG. 7 is a schematic diagram of the trajectories of eight robots traveling in a circular scene.
Detailed Description

Specific embodiments of the invention are described in detail below with reference to the accompanying drawings.

As shown in FIG. 1, to realize autonomous navigation, the mobile robot of this embodiment is mainly equipped with a sensor, a controller, a memory, a walking mechanism, and other functional components.

The sensor observes the state of the robot's environment, such as the distance and angle of obstacles relative to the robot, to form state data for use by the navigation strategy. In some embodiments, the sensor may be a lidar sensor, allowing fast and accurate acquisition of the state data.

The controller executes the navigation strategy. It receives the state data detected by the sensor and, combined with the relative position of the target point, classifies the scene the robot is in. It then executes the control strategy corresponding to that scene type, so that the robot reaches the target position safely and quickly, achieving autonomous navigation.

The memory is connected to the controller and stores the various data generated while the robot runs, for example the action a computed by the controller when executing the reinforcement imitation learning control strategy, the state s the robot reaches after executing action a in the environment, and the reward r obtained for executing action a. The controller stores the collected (s, a, r) data in an experience pool (a region of storage set aside in the memory) for updating and optimizing the data models of the navigation strategy.

The walking mechanism drives the robot to move at the angular velocity and linear velocity computed by the controller, so that it finally reaches the target point.
The navigation strategy of this embodiment combines a PID control strategy, a reinforcement imitation learning control strategy, and a constrained reinforcement imitation learning control strategy. When the robot is in a simple scene, the PID control strategy is used, driving the robot straight and quickly to the target point and avoiding the situation in which it lingers near the target instead of reaching it promptly. When the robot is in a complex scene, the reinforcement imitation learning control strategy is used, so that the robot takes a relatively optimized route while avoiding obstacles safely. When the robot is in an emergency scene, the constrained reinforcement imitation learning control strategy is used, so that the robot reacts in time to suddenly appearing obstacles and avoids collisions.

To let the robot automatically invoke the control strategy matching its environment, this embodiment configures a scene classification model based on collision prediction in the controller, as shown in FIGS. 2 and 3. While running, the robot observes the environment state through its sensors in real time and feeds the detected state data and its own velocity (including but not limited to linear and angular velocity) into the collision prediction model, which predicts the probability that the robot will collide, so that the robot can react in advance and avoid obstacles safely.

The collision prediction model may be an existing, mature collision prediction model. To train it, several existing robot navigation algorithms can be run and tested in a variety of simulation environments, collecting a large amount of data of the form "sensor data, robot velocity, whether a collision occurred"; the collision prediction model is then trained on the collected data and finally applied to the real scene.
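The patent does not specify the internal structure of the collision prediction model; the following is a minimal sketch, assuming a small fully connected network over the lidar scan and the robot's velocities that outputs a collision probability (layer sizes, beam count, and class/variable names are illustrative, not taken from the patent):

```python
import torch
import torch.nn as nn

class CollisionPredictor(nn.Module):
    """Maps (lidar scan, linear velocity, angular velocity) to a collision probability."""
    def __init__(self, num_beams: int = 180):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_beams + 2, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),  # probability of a collision within the prediction horizon
        )

    def forward(self, scan: torch.Tensor, v: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        x = torch.cat([scan, v.unsqueeze(-1), w.unsqueeze(-1)], dim=-1)
        return self.net(x).squeeze(-1)

# Example: one forward pass on placeholder data (training would fit this to the
# collected "sensor data, robot velocity, collision or not" tuples, e.g. with a BCE loss).
model = CollisionPredictor()
scan = torch.rand(1, 180)  # normalized lidar distances
p_collision = model(scan, torch.tensor([0.5]), torch.tensor([0.1]))
print(float(p_collision))
```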
In this embodiment, the environment the robot is in may be classified according to the prediction produced by the collision prediction model. For example, if the sensor detects that there is no obstacle in front of the robot, or the robot has already arrived near the target point, the current scene may be defined as a simple scene. If the collision prediction model predicts that a collision will occur (for example, the surrounding obstacles are very dense or an obstacle appears suddenly), the current scene may be defined as an emergency scene. All remaining scenes may be defined as complex scenes.

Referring to FIG. 3, during actual operation the robot periodically feeds the state data observed by its sensors and its own velocity into the collision prediction model to perform collision prediction. Historical collision data can also be fed into the model to improve the accuracy of the prediction.

If the collision prediction model predicts that a collision will occur, the robot executes the control strategy for the emergency scene to avoid obstacles, that is, the constrained reinforcement imitation learning control strategy. If no collision is predicted, the current scene is classified further: if the sensor detects that the robot has reached the vicinity of the target point or there is no obstacle in front of it, the PID control strategy for the simple scene is executed; otherwise the robot is judged to be in a complex scene and the reinforcement imitation learning control strategy is executed.
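A minimal sketch of this classification logic; the numeric thresholds (collision_threshold, front_clear, goal_near) are illustrative assumptions, since the patent does not give values for them:

```python
def classify_scene(p_collision: float,
                   min_front_distance: float,
                   distance_to_goal: float,
                   collision_threshold: float = 0.5,
                   front_clear: float = 2.0,
                   goal_near: float = 0.1) -> str:
    """Return "emergency", "simple", or "complex" following the flow of FIG. 3."""
    if p_collision > collision_threshold:      # collision predicted -> emergency scene
        return "emergency"
    if distance_to_goal < goal_near or min_front_distance > front_clear:
        return "simple"                        # near the target point, or nothing ahead
    return "complex"                           # everything else

# Example usage with placeholder readings.
print(classify_scene(p_collision=0.1, min_front_distance=3.0, distance_to_goal=1.5))  # simple
print(classify_scene(p_collision=0.9, min_front_distance=0.4, distance_to_goal=1.5))  # emergency
```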
In the PID control strategy, the angle between the robot's forward direction and the target point may be taken as the error and substituted into the PID formula to compute the robot's angular velocity, while the robot's linear velocity is kept constant. The three PID parameters, proportional, integral, and derivative, may be set to kP = 0.15, kI = 0.08, and kD = 0.01, respectively.
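A minimal sketch of this PID computation using the gains quoted above; the constant linear velocity and the control time step are illustrative assumptions:

```python
class HeadingPID:
    """PID controller on the heading error (angle between forward direction and target)."""
    def __init__(self, kp: float = 0.15, ki: float = 0.08, kd: float = 0.01, dt: float = 0.1):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, heading_error: float) -> float:
        """Return the angular velocity command for one control cycle."""
        self.integral += heading_error * self.dt
        derivative = (heading_error - self.prev_error) / self.dt
        self.prev_error = heading_error
        return self.kp * heading_error + self.ki * self.integral + self.kd * derivative

pid = HeadingPID()
LINEAR_VELOCITY = 0.5              # assumed constant forward speed (not specified in the patent)
w = pid.step(heading_error=0.3)    # angular velocity steering toward the target
v = LINEAR_VELOCITY
print(v, w)
```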
The reinforcement imitation learning control strategy is formulated by combining imitation learning and reinforcement learning.

Reinforcement learning (RL) is an important branch of machine learning. It is usually described by a Markov decision process (MDP): the machine is situated in an environment, and each state is the machine's perception of the current environment; the machine can only influence the environment through actions, and when the machine executes an action the environment transitions to another state with some probability; at the same time, the environment feeds a reward back to the machine according to an underlying reward function. In summary, reinforcement learning involves four main elements: state, action, transition probability, and reward function, namely:

state s: the machine's perception of the environment; the set of all possible states is the state space;

action a: an action taken by the machine; the set of all actions that can be taken is the action space;

transition probability p: after an action is executed, the current state transitions to another state with some probability;

reward function r: at the same time as the state transition, the environment feeds a reward back to the machine.

The main task of reinforcement learning is therefore to keep trying in the environment and adjust the policy according to the feedback obtained, finally producing a good policy π from which the machine knows which action to execute in which state. The quality of a policy depends on the cumulative reward obtained by following it over the long term; in other words, the cumulative reward is used to evaluate a policy, and the optimal policy is the one that yields the highest final cumulative reward when followed from the initial state.
Imitation learning is a branch of reinforcement learning that addresses the multi-step decision problem in reinforcement learning well.

In the imitation learning stage of this embodiment, expert data are used to train the Actor network, which improves navigation performance and also accelerates convergence of the model in the subsequent training stage.

Referring to FIG. 5, in the imitation learning stage, navigation algorithms that have already been validated (such as ORCA, RL, and Hybrid-RL) can be run and tested in several different simulation environments, and data of the form (s, a) are collected as expert data to form an expert data set.

The Actor network is trained with the data from the expert data set. Through this training the Actor network fits the expert data, so that in the subsequent interaction with the environment the robot does not explore blindly, which speeds up convergence of the model. At the same time, the data in the expert data set correspond to situations in which the robot performs perfectly in simple and emergency cases and executes an ideal policy; an Actor network trained with the expert data set can therefore perform better than conventional methods in simple and emergency situations.
The optimization goal of imitation learning is to reduce the error with respect to the expert data set. In this embodiment, the objective function of imitation learning may therefore be configured to find the weight θ that minimizes the average deviation between the action π_θ(s_i) output by the Actor network and the expert action a_i over the expert data set, where s_i and a_i are data from the expert data set, s_i being the state observed by the robot through its sensors and a_i the action executed by the robot in state s_i; π_θ denotes the Actor network; N is the number of samples in the expert data set; and θ is the weight of the Actor network.

The Actor network is optimized with the computed weight θ.
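A minimal sketch of this imitation learning (behavioral cloning) step; the mean-squared-error measure of the deviation, the network sizes, and the synthetic expert data are assumptions rather than details from the patent:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps an observed state to an action (linear velocity, angular velocity)."""
    def __init__(self, state_dim: int = 182, action_dim: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, action_dim),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

# Placeholder expert data set (s_i, a_i); in the patent these come from ORCA/RL/Hybrid-RL runs.
N = 1024
expert_states = torch.randn(N, 182)
expert_actions = torch.randn(N, 2)

actor = Actor()
optimizer = torch.optim.Adam(actor.parameters(), lr=1e-3)
for epoch in range(10):
    pred = actor(expert_states)
    loss = ((pred - expert_actions) ** 2).mean()  # average deviation from expert actions over N samples
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(float(loss))
```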
Next, in the reinforcement learning stage, the Actor network can be trained further with data collected by the robot during actual operation.

The reinforcement learning model is trained by having the robot continuously explore the environment and collect data, with the goal of maximizing the cumulative reward. The reinforcement learning model does not plan a path for the robot; instead, it directly outputs control commands for navigation based on the state the robot observes in the environment.

As shown in FIGS. 2 and 4, the reinforcement learning model outputs a probability distribution over actions according to the state observed by the robot's own sensors, the robot's own velocity, and the relative position of the target point. An action a for the robot to execute is then sampled at random from this distribution. After executing action a in the environment, the robot arrives at state s; at the same time, executing action a yields an immediate return r (the reward) and a new observation. The robot stores the data collected from the environment in the experience pool in the form (s, a, r). When the amount of data stored in the experience pool satisfies a certain condition, the loss of the reinforcement learning model is computed and used to update the Critic network and the Actor network, optimizing the navigation strategy.
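A minimal sketch of this data-collection loop, assuming a hypothetical environment object with reset/step methods in the style of a Gym interface; the buffer threshold, the stand-in environment, and the random policy are all illustrative:

```python
import random

def collect_experience(env, policy, buffer, batch_threshold=2048):
    """Roll the policy out in the environment and store (s, a, r) tuples in the experience pool."""
    s = env.reset()
    while len(buffer) < batch_threshold:
        a = policy(s)                  # sample an action from the policy's output distribution
        s_next, r, done = env.step(a)  # immediate reward and next observation
        buffer.append((s, a, r))
        s = env.reset() if done else s_next
    return buffer

# Tiny stand-in environment and policy so the sketch runs on its own.
class DummyEnv:
    def reset(self):
        return 0.0
    def step(self, a):
        return random.random(), -abs(a), random.random() < 0.05

buffer = collect_experience(DummyEnv(), policy=lambda s: random.uniform(-1, 1), buffer=[])
print(len(buffer))
```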
In this embodiment, the reward function of the reinforcement learning model may be configured as follows.

Initialize the reward function r = 0.

The reward r is then computed from the robot's positions p_{t-1} and p_t at time steps t-1 and t, the position g of the target point, and the robot's angular velocity w_t at time step t-1, covering the following cases:

if ||p_t - g|| < 0.1, then r = r + r_arrival; otherwise r = r + w_g(||p_{t-1} - g|| - ||p_t - g||);

if the robot is predicted to collide, then r = r + r_collision;

if |w_t| > 0.7, a penalty on large angular velocity, weighted by w_w, is added to r. Here |||| denotes the norm (modulus), and r_arrival, r_collision, w_g, and w_w are all hyperparameters. Hyperparameters are parameters set before the learning process begins, rather than obtained through training. They usually need to be tuned, and choosing a good set of hyperparameters for the learning model improves learning performance. The hyperparameters in this embodiment may be given empirical values.
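A minimal sketch of this reward computation; the hyperparameter values and the exact form of the angular-velocity penalty are assumptions, since the patent only states that the hyperparameters take empirical values:

```python
import math

def reward(p_prev, p_curr, goal, w_prev, collision_predicted,
           r_arrival=15.0, r_collision=-15.0, w_g=2.5, w_w=-0.1):
    """Per-step reward following the three cases described above. Values are illustrative."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    r = 0.0
    if dist(p_curr, goal) < 0.1:                  # reached the target point
        r += r_arrival
    else:                                         # reward progress made toward the target
        r += w_g * (dist(p_prev, goal) - dist(p_curr, goal))
    if collision_predicted:                       # predicted collision
        r += r_collision
    if abs(w_prev) > 0.7:                         # assumed penalty on large angular velocity
        r += w_w * abs(w_prev)
    return r

print(reward(p_prev=(0.0, 0.0), p_curr=(0.1, 0.0), goal=(2.0, 0.0), w_prev=0.9, collision_predicted=False))
```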
The optimization goal of the Critic network is to minimize the error with respect to the cumulative reward values in the experience pool. In this embodiment, the objective function of the Critic network is therefore configured to find the weight φ that minimizes the error between the value estimate V_φ(s_t) and the discounted cumulative reward Σ_{t'=t..T} γ^(t'-t)·r_{t'}, where φ is the weight of the Critic network; γ is the discount factor; t is the time step; T is the maximum number of steps; s_t is the robot's state at time step t; r_{t'} is the reward the robot obtains at time step t'; and V_φ is the Critic function, used to evaluate how good a state s is.
The optimization goal of the Actor network is to maximize the expected value of the cumulative reward. In this embodiment, the objective function of the Actor network is therefore configured to choose the weight θ that maximizes this expectation, expressed through the ratio r_t(θ) formed from the current Actor network π_θ and the pre-update network π_old, together with its clipped version clip(r_t(θ), 1-ε, 1+ε), which limits the value of r_t(θ) to the range [1-ε, 1+ε]. Here λ and ε are hyperparameters; θ is the weight of the Actor network; n is the episode index; T_n is the maximum number of steps in the n-th episode; a_t is the robot's action at time step t; π_θ and π_old are the current Actor network and the Actor network before the update, respectively; E_t is the expectation function; and clip is the limiting function.
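A minimal sketch of one update step consistent with the clipping described above. The patent's exact surrogate objective is not reproduced in this text, so the sketch follows the standard PPO-clip form as an assumption; the advantage estimate, the entropy weight, and the network shapes are likewise illustrative:

```python
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(182, 64), nn.Tanh(), nn.Linear(64, 2))   # outputs the action mean
critic = nn.Sequential(nn.Linear(182, 64), nn.Tanh(), nn.Linear(64, 1))  # outputs V_phi(s)
log_std = nn.Parameter(torch.zeros(2))
optimizer = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()) + [log_std], lr=3e-4)

def ppo_update(states, actions, returns, old_log_probs, eps=0.2, lam_entropy=0.01):
    """One clipped-surrogate update of the Actor and a return-regression update of the Critic."""
    dist = torch.distributions.Normal(actor(states), log_std.exp())
    log_probs = dist.log_prob(actions).sum(-1)
    values = critic(states).squeeze(-1)

    advantages = returns - values.detach()            # simple advantage estimate (assumption)
    ratio = (log_probs - old_log_probs).exp()         # r_t(theta) = pi_theta / pi_old
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)    # limit r_t(theta) to [1-eps, 1+eps]
    actor_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    critic_loss = ((returns - values) ** 2).mean()    # error w.r.t. the cumulative reward
    entropy_bonus = dist.entropy().sum(-1).mean()

    loss = actor_loss + 0.5 * critic_loss - lam_entropy * entropy_bonus
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Example call with placeholder batch data drawn from the experience pool.
B = 64
ppo_update(torch.randn(B, 182), torch.randn(B, 2), torch.randn(B), torch.randn(B))
```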
The Actor network and Critic network are updated with the computed Actor weight θ and Critic weight φ, further training the reinforcement learning model.

The trained Actor network and Critic network can then be applied to the constrained reinforcement imitation learning control strategy for emergency scenes.

In relatively urgent scenes, a conventional deep learning model does take some action to avoid obstacles, but the action is often not decisive enough and is not executed in time, so collisions still occur.

To solve this problem, this embodiment configures the collision prediction model to predict a possible collision several steps in advance. The constrained reinforcement imitation learning control strategy is then executed to constrain the actions the robot performs, thereby preventing the collision.

The constrained reinforcement imitation learning control strategy of this embodiment adds constraints on top of the reinforcement imitation learning control strategy used in complex scenes: by limiting the robot's travel speed, safe obstacle avoidance is achieved.

As shown in FIG. 6, when the robot is in an emergency scene, it first checks whether its linear velocity exceeds a set threshold. If it does, a collision is very likely, so the robot's velocity is set to 0, that is, the robot is stopped, to dodge the suddenly appearing obstacle. Otherwise, the distance data detected by the sensor are scaled down and the scaled distance data are fed into the reinforcement learning model, so that the value representing the robot's velocity (at least the linear velocity) in the output action a is reduced, that is, the robot decelerates; a relatively safe action a is thereby obtained, and executing it does not lead to a collision.

In some embodiments, when the robot's linear velocity in an emergency scene is less than or equal to the set threshold, the distance data detected by the sensor can be reduced by a factor of P, that is, divided by P, where P preferably takes a value in [1.25, 1.5]. At the same time, the scaled distance data are constrained to lie within a set threshold range [Dmin, Dmax], which further ensures that the action a output by the reinforcement learning model is a safe action and prevents the robot from colliding with obstacles.
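A minimal sketch of this constraint step; the velocity threshold, P, Dmin, and Dmax values are illustrative, and the policy argument is a placeholder standing in for the trained reinforcement learning model:

```python
def constrained_action(scan, v_current, policy,
                       v_threshold=0.6, p=1.3, d_min=0.1, d_max=3.5):
    """Emergency-scene action: stop outright at high speed; otherwise shrink the
    perceived distances so the learned policy outputs a slower, safer action."""
    if v_current > v_threshold:
        return 0.0, 0.0                                      # emergency stop: zero linear and angular velocity
    shrunk = [min(max(d / p, d_min), d_max) for d in scan]   # divide by P, clamp to [Dmin, Dmax]
    return policy(shrunk)                                    # action computed on the shrunk distances

# Example usage with a placeholder policy that slows down as obstacles look closer.
def dummy_policy(scan):
    v = 0.5 * min(scan) / 3.5
    return v, 0.0

print(constrained_action([2.0, 1.5, 3.0], v_current=0.8, policy=dummy_policy))  # stopped
print(constrained_action([2.0, 1.5, 3.0], v_current=0.3, policy=dummy_policy))  # slowed
```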
The autonomous navigation strategy executed by the mobile robot of this embodiment is illustrated below with a concrete example.

Eight robots are deployed in a circular scene with a radius of 3 meters, and each robot must travel to the position opposite it across the center of the circle, that is, its target point. All eight robots start running within the same time period and form obstacles for one another. Each robot executes the autonomous navigation strategy of this embodiment; the resulting trajectories are shown in FIG. 7.

While running, each robot decides its next action only from the environment state it observes itself.

First, the robot feeds the observed environment information (state s) into the collision prediction module, which determines from this information what type of environment the robot is currently in.

If the robot is currently in a simple environment, the PID control strategy drives it quickly along a straight line toward its target point.

If the robot is in a complex environment, the reinforcement imitation learning control strategy outputs the action a the robot should execute according to the environment state s.

If the robot is in an emergency environment, a collision is very likely. In this case, if the robot's current speed exceeds the set threshold, the robot is made to stop immediately; otherwise, the constrained reinforcement imitation learning control strategy is used: the constrained environment state is fed into the reinforcement learning model, and the constrained action a obtained is executed by the robot to avoid a collision.

This experimental scene shows that all the robots reach their target positions along short, smooth paths without any collision, successfully completing all navigation tasks.

Of course, the above is only a preferred embodiment of the invention. It should be pointed out that a person of ordinary skill in the art can make several improvements and refinements without departing from the principle of the invention, and these improvements and refinements should also be regarded as falling within the scope of protection of the invention.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210365323.5A CN114815816A (en) | 2022-04-07 | 2022-04-07 | An autonomous navigation robot |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210365323.5A CN114815816A (en) | 2022-04-07 | 2022-04-07 | An autonomous navigation robot |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN114815816A true CN114815816A (en) | 2022-07-29 |
Family
ID=82535189
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210365323.5A Pending CN114815816A (en) | 2022-04-07 | 2022-04-07 | An autonomous navigation robot |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN114815816A (en) |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116257051A (en) * | 2022-09-07 | 2023-06-13 | 上海高仙自动化科技发展有限公司 | Method, device, robot and storage medium for determining control trajectory |
| CN117034102A (en) * | 2023-06-20 | 2023-11-10 | 浙江润琛科技有限公司 | A multi-scene navigation method based on intelligent scene classification |
| CN120105122A (en) * | 2025-05-08 | 2025-06-06 | 西安宏源视讯设备有限责任公司 | A method for lifting, flipping and dimming embedded lighting in low-rise studios |
| CN120370961A (en) * | 2025-06-30 | 2025-07-25 | 西南科技大学 | Intelligent dynamic obstacle avoidance method based on probability speed obstacle method |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105700526A (en) * | 2016-01-13 | 2016-06-22 | 华北理工大学 | Online Sequential Extreme Learning Machine Method with Self-Learning Capability |
| US20180210406A1 (en) * | 2017-01-24 | 2018-07-26 | Fanuc Corporation | Numerical controller and machine learning device |
| US20190047143A1 (en) * | 2017-08-08 | 2019-02-14 | Fanuc Corporation | Control device and learning device |
| CN111781922A (en) * | 2020-06-15 | 2020-10-16 | 中山大学 | A multi-robot collaborative navigation method based on deep reinforcement learning suitable for complex dynamic scenes |
| JP2021034050A (en) * | 2019-08-21 | 2021-03-01 | 哈爾浜工程大学 | Auv action plan and operation control method based on reinforcement learning |
| CN112965499A (en) * | 2021-03-08 | 2021-06-15 | 哈尔滨工业大学(深圳) | Unmanned vehicle driving decision-making method based on attention model and deep reinforcement learning |
| CN113359717A (en) * | 2021-05-26 | 2021-09-07 | 浙江工业大学 | Mobile robot navigation obstacle avoidance method based on deep reinforcement learning |
- 2022-04-07 CN CN202210365323.5A patent/CN114815816A/en active Pending
Patent Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105700526A (en) * | 2016-01-13 | 2016-06-22 | 华北理工大学 | Online Sequential Extreme Learning Machine Method with Self-Learning Capability |
| US20180210406A1 (en) * | 2017-01-24 | 2018-07-26 | Fanuc Corporation | Numerical controller and machine learning device |
| US20190047143A1 (en) * | 2017-08-08 | 2019-02-14 | Fanuc Corporation | Control device and learning device |
| JP2021034050A (en) * | 2019-08-21 | 2021-03-01 | 哈爾浜工程大学 | Auv action plan and operation control method based on reinforcement learning |
| CN111781922A (en) * | 2020-06-15 | 2020-10-16 | 中山大学 | A multi-robot collaborative navigation method based on deep reinforcement learning suitable for complex dynamic scenes |
| CN112965499A (en) * | 2021-03-08 | 2021-06-15 | 哈尔滨工业大学(深圳) | Unmanned vehicle driving decision-making method based on attention model and deep reinforcement learning |
| CN113359717A (en) * | 2021-05-26 | 2021-09-07 | 浙江工业大学 | Mobile robot navigation obstacle avoidance method based on deep reinforcement learning |
Non-Patent Citations (2)
| Title |
|---|
| LIU, L ET AL.: "Robot Navigation in Crowded Environments Using Deep Reinforcement Learning", IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS)/ROBOT NAVIGATION IN CROWDED ENVIRONMENTS USING DEEP REINFORCEMENT LEARNING, 29 August 2020 (2020-08-29), pages 5671 - 5677 * |
| TINGXIANG FAN ET AL.: "Distributed multi-robot collision avoidance via deep reinforcement learning for navigation in complex scenarios", THE INTERNATIONAL JOURNAL OF ROBOTICS RESEARCH/DISTRIBUTED MULTI-ROBOT COLLISION AVOIDANCE VIA DEEP REINFORCEMENT LEARNING FOR NAVIGATION IN COMPLEX SCENARIOS, vol. 39, no. 7, 30 June 2020 (2020-06-30), pages 856 - 892 * |
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116257051A (en) * | 2022-09-07 | 2023-06-13 | 上海高仙自动化科技发展有限公司 | Method, device, robot and storage medium for determining control trajectory |
| CN117034102A (en) * | 2023-06-20 | 2023-11-10 | 浙江润琛科技有限公司 | A multi-scene navigation method based on intelligent scene classification |
| CN120105122A (en) * | 2025-05-08 | 2025-06-06 | 西安宏源视讯设备有限责任公司 | A method for lifting, flipping and dimming embedded lighting in low-rise studios |
| CN120105122B (en) * | 2025-05-08 | 2025-08-19 | 西安宏源视讯设备有限责任公司 | Lifting, turning and dimming method for embedded lamp of low studio |
| CN120370961A (en) * | 2025-06-30 | 2025-07-25 | 西南科技大学 | Intelligent dynamic obstacle avoidance method based on probability speed obstacle method |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Sonny et al. | Q-learning-based unmanned aerial vehicle path planning with dynamic obstacle avoidance | |
| Wang et al. | A two-stage reinforcement learning approach for multi-UAV collision avoidance under imperfect sensing | |
| CN114815816A (en) | An autonomous navigation robot | |
| CN116551703B (en) | Motion planning method based on machine learning in complex environment | |
| IL282278B2 (en) | Autonomous vehicle planning | |
| CN113485323B (en) | Flexible formation method for cascading multiple mobile robots | |
| Angulo et al. | Policy optimization to learn adaptive motion primitives in path planning with dynamic obstacles | |
| CN118170154B (en) | Unmanned aerial vehicle cluster dynamic obstacle avoidance method based on multi-agent reinforcement learning | |
| CN116795138A (en) | A multi-UAV intelligent trajectory planning method for data collection | |
| CN115774455B (en) | Method for planning track of distributed unmanned cluster for avoiding deadlock in complex obstacle environment | |
| CN120467325B (en) | Man-machine hybrid autonomous navigation system in unknown dynamic environment | |
| Liang et al. | Hierarchical reinforcement learning with opponent modeling for distributed multi-agent cooperation | |
| Anas et al. | Comparison of deep Q-learning, Q-learning and SARSA reinforced learning for robot local navigation | |
| CN117506893B (en) | Robotic arm path planning and control method based on dual planning and autonomous learning | |
| Tutuko et al. | Route optimization of non-holonomic leader-follower control using dynamic particle swarm optimization | |
| CN118502457A (en) | Track planning method, device and autonomous system | |
| CN117930864A (en) | Extensible deep reinforcement learning multi-unmanned aerial vehicle path planning cooperative method | |
| Chen et al. | Pursuit-evasion game with online planning using deep reinforcement learning | |
| Han et al. | Obstacle avoidance based on deep reinforcement learning and artificial potential field | |
| Huang et al. | CoDe: A Cooperative and Decentralized Collision Avoidance Algorithm for Small-Scale UAV Swarms Considering Energy Efficiency | |
| CN120087875A (en) | A multi-agent path planning method based on QMIX algorithm | |
| Behjat et al. | Training detection-range-frugal cooperative collision avoidance models for quadcopters via neuroevolution | |
| CN115686076B (en) | UAV path planning method based on incremental developmental deep reinforcement learning | |
| CN116839582A (en) | Unmanned aerial vehicle track planning method based on improved Q learning | |
| Bai et al. | An improved DDPG algorithm based on evolution-guided transfer in reinforcement learning |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination |