
CN116300905A - A Constrained Multi-robot Reinforcement Learning Safe Formation Method Based on 2D Laser Observation - Google Patents


Info

Publication number
CN116300905A
CN116300905A
Authority
CN
China
Prior art keywords
formation
robot
safety
observation
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310164480.4A
Other languages
Chinese (zh)
Inventor
Yue Wang (王越)
Yuxiang Cui (崔瑜翔)
Rong Xiong (熊蓉)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202310164480.4A priority Critical patent/CN116300905A/en
Publication of CN116300905A publication Critical patent/CN116300905A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05D: SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00: Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02: Control of position or course in two dimensions
    • G05D1/021: Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212: Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0219: Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory ensuring the processing of the whole working surface
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02: Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention discloses a constrained multi-robot reinforcement learning safe formation method based on two-dimensional laser observation. Each formation member generates a preliminary control command from its individual observations and the information shared within the formation, using a reinforcement-learning-based navigation planning method; the preliminary command is then optimized under the guidance of a control barrier function so that the final command satisfies the safety constraints and the robot operates safely. With notably low observation and computation cost, the invention integrates multi-robot information and plans multi-robot control, achieving safe and efficient cooperative formation planning: formation control commands are generated from two-dimensional laser information alone, execution safety is guaranteed, and large-scale formation tasks in unknown, dynamic, unstructured environments can be handled.

Description

A Constrained Multi-Robot Reinforcement Learning Safe Formation Method Based on 2D Laser Observation

Technical Field

The invention belongs to the field of mobile robot formation navigation, and in particular relates to a constrained multi-robot reinforcement learning safe formation method based on two-dimensional laser observation.

Background Art

In recent years, with the continuous development of robotics, robot working environments have gradually shifted from static structured scenes to dynamic unstructured scenes, and tasks have transitioned from single-robot mobile operation to multi-robot cooperative operation. Against this background of rapid progress, the problem of multi-robot formation control that simultaneously achieves light weight, flexibility, and safety urgently needs to be solved.

Most existing multi-agent formation navigation planning methods are centralized global planners that assume an omniscient view. They rely on complete offline map construction, online robot localization, and fast formation planning solvers, which makes the architecture heavyweight. Such methods are mostly suitable for small-scale formation tasks in static simple environments or structured scenes, and are difficult to transfer to large-scale formation tasks in complex, dynamic, unstructured environments. As environment complexity and the number of robots grow, neither the effectiveness nor the real-time performance of these methods can be guaranteed.

In recent years, deep-learning-based multi-agent planning methods have adopted an end-to-end approach to improve adaptability; they enable large-scale parallel computation and simultaneous planning for many agents, with remarkable results. However, they generally lack explicit safety guarantees, which makes them difficult to deploy in practical applications.

Summary of the Invention

The technical problem to be solved by the present invention is navigation planning with safety constraints for relatively large formations in complex, dynamic, unstructured environments, based on raw observation information. The invention provides a constrained multi-robot reinforcement learning safe formation method based on two-dimensional laser observation. Formation member robots achieve information sharing and environment perception solely by communicating low-volume sensor observations; on top of this integrated perception they perform highly scalable formation distribution planning; finally, a flexible navigation policy combined with a safety-constrained controller generates the actual execution commands, ensuring that the formation navigation task is both efficient and safe. The technical scheme adopted by the present invention is as follows:

The invention discloses a constrained multi-robot reinforcement learning safe formation method based on two-dimensional laser observation, comprising:

S1: Obtain the formation task parameters, including the formation shape, the number of formation members, and the formation starting position; initialize according to these parameters and obtain preliminary state perception information.

S2: Share state perception information in real time through the internal communication of the multi-robot formation, obtaining the state perception information that the other robots in the formation acquire through sensor observation, including two-dimensional laser observations, velocity state information, and the relative position of the target.

S3: Decouple and integrate the multi-robot state perception information to obtain the environment distribution referenced to the formation-center target position, and build an observation potential field from the formation-center target position, the positions of the formation members, and the obstacle distribution.

S4: Using the observation potential field and the formation task parameters, split the formation into layers and compute the formation target positions within each layer, then assign target positions according to the current positions of the formation members.

S5: Each formation member robot generates an initial control command from the assigned target position and the integrated state perception information, using a policy network trained by reinforcement learning.

S6: Using a control barrier function that describes the robot's safety state, iteratively optimize the initial control command with an observation prediction model and a gradient optimizer until the safety constraints are satisfied, obtaining a final control command that guarantees safety.

S7: Send the final control command to the robot controller for execution, reach the new state, and return to S2, until the formation reaches the final target position. (A per-robot sketch of this loop is given below.)
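
The per-robot execution implied by steps S1-S7 can be summarized as a control loop. The following is a minimal Python sketch under stated assumptions: every function and attribute name (init_task, share_states, build_potential_field, assign_layered_targets, optimize_until_safe, and the robot/task objects) is an illustrative placeholder, not part of the patent text.

```python
# Minimal sketch of the per-robot control loop implied by steps S1-S7.
# All names below are illustrative placeholders.

def formation_loop(robot, task):
    state = robot.init_task(task)                     # S1: shape, member count, start pose
    while not task.formation_at_goal():
        shared = robot.share_states()                 # S2: laser scans, velocities, target offsets
        field = build_potential_field(shared, task)   # S3: goal-centered observation potential field
        goal = assign_layered_targets(field, task,    # S4: layer split + target assignment
                                      shared.positions)[robot.id]
        u0 = robot.policy_network(goal, shared)       # S5: RL policy proposes a command
        u = optimize_until_safe(u0, robot, shared)    # S6: CBF-guided iterative correction
        robot.controller.execute(u)                   # S7: execute, then loop back to S2
```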

As a further improvement, in step S1 the formation task parameters are obtained, including the formation shape, the number of formation members, and the formation starting position. Specifically:

The method supports multiple formation schemes, including fully enclosing formations, semi-enclosing formations, and enclosure at arbitrary angles, and can be extended to formation tasks with a large number of robots.

As a further improvement, in step S2 the state perception information includes two-dimensional laser observations, velocity state information, and the relative position of the target. Specifically:

The two-dimensional laser information can be represented as a fixed-length one-dimensional array determined by the sensor resolution, while the velocity state and the target's relative position can be represented as two-dimensional coordinates. The integrated state perception information is therefore determined by arrays of bounded size: the data volume is small, the communication load is light, and the scheme scales to many robots.

As a further improvement, in step S3 the multi-robot state perception information is decoupled and integrated to obtain the environment distribution referenced to the formation-center target position, and an observation potential field is built from the formation-center target position, the positions of the formation members, and the obstacle distribution. Specifically:

An observation potential field is built from the target position, the obstacle distribution, and the positions of the other formation members; it describes, for the current environment, the execution cost of each position when used as a formation member's target position. The cost at each position in the field is determined by the magnitude F of the resultant of the obstacle repulsion F_ro, the teammate repulsion F_ra, and the target attraction F_a acting on that position, with the specific relation:

F = F_ro + F_ra + F_a

As a further improvement, in step S4 the formation is split into layers according to the observation potential field and the formation task parameters, the formation target positions within each layer are computed, and target positions are assigned according to the current positions of the formation members. Specifically:

The formation is layered according to the observation potential field and the task settings; the layered formations share the same overall shape at nested, proportionally enlarged scales, which preserves the order of the formation as a whole and the safety of each member within it. Inside each layer, the method first takes the minimum of the potential field as the position of the current formation member, then updates the overall environment field with that member's repulsive field, and repeats this cycle iteratively to find the positions of all members in the layer one by one. Once the target formation positions are determined, the method assigns targets by combining the members' current positions with the principles of distance priority and avoidance of crossing trajectories.

As a further improvement, in step S5 the initial control command is generated by a policy network trained with reinforcement learning, from the assigned target position and the integrated state perception information. Specifically:

The policy network learns by autonomous trial and error in a simulation environment and is iteratively optimized using the reward and penalty feedback from the environment until it converges to an autonomous navigation policy with stable performance. The policy network can process large-scale multi-robot formation decision tasks in batches, with higher efficiency and stronger scalability.

As a further improvement, in step S6, using a control barrier function that describes the robot's safety state, the initial control command is iteratively optimized with an observation prediction model and a gradient optimizer until the safety constraints are satisfied, yielding a final control command that guarantees safety. Specifically:

The initial command is safety-evaluated by a control barrier function based on two-dimensional laser observations; the gradient direction for command optimization is judged with the help of the observation prediction model; and a gradient optimizer performs iterative optimization, finally yielding a control command that ensures safety.

The observation-based control barrier function describes how safe the current robot state is in the environment and comprises two parts: the robot's own safety and the safety within the formation. The robot's own safety describes the safety relationship between the robot, moving at its current velocity, and the static or dynamic obstacles in the environment; the intra-formation safety describes the safety relationship between the robot, moving at its current velocity, and the other formation members.

The present invention has the following beneficial effects:

1) Light weight. The invention shares environment perception information through low-volume intra-formation communication and uses a reinforcement-learning-based decision network to batch-process the local decision tasks of a large formation, improving efficiency at both the communication and the information-processing level.

2) Flexibility. The invention represents the dynamic environment by building an observation potential field, assigns targets to any number of formation members through iterative updates, uses a deep network with batch-processing capability to support decisions for any number of members, and takes raw sensor information such as two-dimensional lidar data as input, so the method transfers flexibly to complex real environments.

3) Safety. The invention uses a safety control barrier function based on two-dimensional laser observation as a constraint on control commands: before execution, the output of the reinforcement-learning policy network is safety-evaluated and optimized, and only after the safety constraints are satisfied is it sent to the controller, guaranteeing safe robot operation.

Therefore, the present invention is a highly practical and effective multi-robot formation navigation method with good application prospects.

Brief Description of the Drawings

Fig. 1 is the experimental framework diagram of the constrained multi-robot reinforcement learning safe formation method based on two-dimensional laser observation;

Fig. 2 is the information flow diagram of the constrained multi-robot reinforcement learning safe formation method based on two-dimensional laser observation;

Fig. 3 is the flowchart of the constrained multi-robot reinforcement learning safe formation method based on two-dimensional laser observation.

Detailed Description

The invention discloses a constrained multi-robot reinforcement learning safe formation method based on two-dimensional laser observation. The specific scheme is as follows:

First, before the task begins, the user determines, according to the task settings, the formation shape and the number of formation members, initializes each member's starting position and the formation's final target position, performs parameter initialization on each formation member robot, and places and starts the robots in simulation or in a real scene. Once the task starts, the formation member robots share perception information through intra-formation communication and obtain the state information of the other agents, including two-dimensional laser observations, velocity state information, and the relative position of the target. Each member robot then decouples and integrates the multi-agent state information to obtain the environment distribution referenced to the formation-center target position, and builds an observation potential field from the formation-center target position, the member positions, and the obstacle distribution. Based on the observed potential field, the selected formation shape, and the number of members, each member robot splits the formation into layers, computes the formation target positions within each layer, and assigns targets according to the members' current positions under a common rule. Each member robot then generates an initial control command with a policy network trained by reinforcement learning, from the assigned target position, the two-dimensional laser observations, and the velocity state information. Using a control barrier function that describes the robot's safety state, each member combines an observation prediction model with a gradient optimizer to iteratively optimize the initial command until the safety constraints are satisfied; the resulting safety-guaranteed final command is sent to the robot controller for execution, reaching a new state. Each member robot keeps running this procedure until the formation reaches the final target position and the formation navigation task is complete.

Fig. 1 is the execution framework diagram of the method of the present invention, Fig. 2 is its information flow diagram, and Fig. 3 is its overall flowchart. The specific execution steps of the scheme are as follows.

(1) Before the multi-robot formation starts, the user specifies the formation shape, the number of members, and the formation starting position according to the task requirements, performs parameter initialization on each formation member robot, and places and starts the robots in simulation or in a real scene; the formation then performs its mobile task under the specified parameter settings.

Regarding the formation shape, the method supports multiple schemes, including but not limited to fully enclosing and semi-enclosing formations, and can be extended to formation tasks with a large number of robots. The starting position should be a relatively open area, with the robots matching the preset formation and spread out safely and evenly. Robot parameter initialization includes, but is not limited to, setting the safety distance and the robot's speed limit, according to the task requirements or the execution capability of the robot platform.

(2) The robots share raw sensor state perception information through internal communication, including two-dimensional laser observations, velocity state information, and the relative position of the target.

The two-dimensional laser information can be represented as a fixed-length one-dimensional array determined by the actual sensor model and its settings; in practice this generally depends on the lidar's field of view and angular resolution. Each number in the array is the obstacle distance at the corresponding angle; values beyond the lidar's range are marked as infinity. The velocity state information consists of the robot's current linear and angular velocity and can be represented as a fixed-length one-dimensional array. The target's relative position can be represented as two-dimensional coordinates, likewise transmitted as a fixed-length one-dimensional array. The integrated state perception information is thus determined by arrays of bounded size: the data volume is small, the communication load is light, and the scheme scales to many robots.
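
As a concrete illustration of this bounded-size packet, the shared state can be laid out as a few fixed-length arrays. The field names and the 360-beam scan size below are assumptions for illustration, not values fixed by the patent.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class SharedState:
    """Fixed-size per-robot packet exchanged inside the formation.
    Sizes are illustrative assumptions (e.g. a 360-beam scan)."""
    laser: np.ndarray       # shape (360,): range per beam, np.inf beyond sensor range
    velocity: np.ndarray    # shape (2,): (linear v, angular w)
    target_rel: np.ndarray  # shape (2,): target position relative to the robot

def pack(state: SharedState) -> np.ndarray:
    # Flatten into one bounded-size array for low-overhead broadcast.
    return np.concatenate([state.laser, state.velocity, state.target_rel])
```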

(3) Using the perception information obtained through communication, each robot decouples the data with respect to the formation-center target position, obtaining a local occupancy map centered on that position. From the target position, the obstacle distribution, and the positions of the other formation members, an observation potential field can be built that describes, for the current environment, the execution cost of each position when used as a member's target position. The cost at each position in the field is determined by the magnitude F of the resultant of the obstacle repulsion F_ro, the teammate repulsion F_ra, and the target attraction F_a acting on that position, with the specific relation:

F = F_ro + F_ra + F_a

The obstacle repulsion F_ro and the teammate repulsion F_ra are determined as inversely proportional to the Euclidean distance between the evaluated position and the obstacle or an already-placed robot: the smaller the distance, the larger the repulsion. The target attraction F_a is proportional to the square of the Euclidean distance between the evaluated position and the target: the smaller the distance, the smaller the attraction. A potential field set up this way guides the formation to approach the target position gradually while keeping its distance from obstacles.
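
A minimal sketch of this cost evaluation follows, assuming simple reciprocal repulsion and quadratic attraction magnitudes; the gains k_o, k_a, k_g are illustrative, since the patent fixes only the inverse-distance and squared-distance forms.

```python
import numpy as np

def field_cost(p, obstacles, teammates, goal,
               k_o=1.0, k_a=1.0, k_g=1.0, eps=1e-6):
    """Cost of placing a formation target at position p (2-vector).
    k_o, k_a, k_g are illustrative gains, not values from the patent."""
    f_ro = sum(k_o / (np.linalg.norm(p - o) + eps) for o in obstacles)
    f_ra = sum(k_a / (np.linalg.norm(p - m) + eps) for m in teammates)
    f_a = k_g * np.linalg.norm(p - goal) ** 2
    return f_ro + f_ra + f_a  # F = F_ro + F_ra + F_a
```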

(4) Each formation member layers the formation according to the environment state, the number of members, and the robots' safety distance, then determines the member target positions one by one within each layer, and finally assigns the target positions to the individual robots.

The layered formations share the same overall shape at nested, proportionally enlarged scales, which preserves the order of the formation as a whole and the safety of each member within it. Inside each layer, the method first takes the minimum of the potential field as the position of the current formation member, then updates the overall environment field with that member's repulsive field, and repeats this cycle iteratively to find the positions of all members in the layer one by one. Once the target formation positions are determined, the method assigns targets by combining the members' current positions with the principles of distance priority and avoidance of crossing trajectories. Because all robots use the same decision and planning rules and parameter settings, and the perception information is unified through sharing, the assignment result is guaranteed to be consistent across robots.
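
A sketch of the in-layer placement loop and a greedy distance-priority matching follows, under the assumption that candidate positions are sampled on the layer's scaled formation contour; both simplifications are illustrative, not the patent's exact rules.

```python
import numpy as np

def place_layer(candidates, cost_fn, n_members, k_a=1.0, eps=1e-6):
    """Iteratively pick potential-field minima on one layer's contour.
    cost_fn(p) evaluates the observation potential field at position p."""
    placed = []
    for _ in range(n_members):
        # Current cost = base field + repulsion of members already placed.
        def total(p):
            rep = sum(k_a / (np.linalg.norm(p - q) + eps) for q in placed)
            return cost_fn(p) + rep
        placed.append(min(candidates, key=total))
    return placed

def assign_targets(robots, targets):
    """Greedy closest-pair-first matching: an illustrative stand-in for
    the distance-priority, no-crossing assignment rule."""
    pairs, free_r, free_t = {}, list(enumerate(robots)), list(targets)
    while free_r:
        (i, r), t = min(((ir, u) for ir in free_r for u in free_t),
                        key=lambda rt: np.linalg.norm(rt[0][1] - rt[1]))
        pairs[i] = t
        free_r = [ir for ir in free_r if ir[0] != i]
        free_t = [u for u in free_t if not np.array_equal(u, t)]
    return pairs
```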

(5) Each formation robot generates an initial control command from the assigned target position, the two-dimensional laser observations, and the velocity state information, using a policy network trained by reinforcement learning.

The policy network learns by autonomous trial and error in a simulation environment and is iteratively optimized using the reward and penalty feedback from the environment until it converges to an autonomous navigation policy with stable performance. In practice, the deep network plays self-play games over a shared policy in a randomized multi-agent environment and learns a dynamic obstacle-avoidance policy.

The inputs to the deep policy network are the sequence of local obstacle occupancy maps converted from time-series two-dimensional laser data, the velocity state information, and the target position information. The laser sequence is converted as follows: using the robot's own pose changes from odometry, the method decouples the scans to obtain multiple laser frames in a common observation coordinate system; converting them into maps in the robot's observation frame yields a stacked sequence of local obstacle maps. In this form, static and dynamic obstacles are clearly distinguished: static obstacles appear as stacked ghosts, while a dynamic obstacle appears as a motion trail. The network uses a 3D convolutional network to process the image sequence and linear layers to process the array inputs, and finally outputs the decision command.
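
A compact PyTorch sketch of such an architecture follows; the map resolution (64x64), the stack depth (5 frames), the layer sizes, and the two-dimensional velocity and goal inputs are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class FormationPolicy(nn.Module):
    """3D conv over a stack of local occupancy maps + MLP over vector inputs."""
    def __init__(self, frames=5, map_size=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=(3, 5, 5), stride=(1, 2, 2)),
            nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2)),
            nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():  # infer conv output size for the chosen shapes
            n = self.conv(torch.zeros(1, 1, frames, map_size, map_size)).shape[1]
        self.head = nn.Sequential(
            nn.Linear(n + 4, 128), nn.ReLU(),
            nn.Linear(128, 2),  # (linear velocity, angular velocity)
        )

    def forward(self, maps, vel, goal):
        # maps: (B, 1, frames, H, W); vel, goal: (B, 2)
        z = self.conv(maps)
        return self.head(torch.cat([z, vel, goal], dim=1))
```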

Regarding the reward function, the environment's feedback consists mainly of two parts: a reward for approaching the target and a penalty for approaching obstacles.

R(s_t) = R_g(s_t) + R_c(s_t)

R_g(s_t) is used to guide the agent to gradually approach, and finally reach, the assigned formation target position.

[Equation image: definition of the goal-approach reward R_g(s_t); not reproduced in the text extraction.]

where p_t denotes the robot's position at time t and p* denotes the target position.

To keep the motion safe while approaching the formation target, the robot is penalized to a certain degree when it gets close to, or collides with, an obstacle.

[Equation image: definition of the obstacle-proximity penalty R_c(s_t); not reproduced in the text extraction.]

where r denotes the robot's safety radius and d denotes the minimum obstacle distance at the current moment.
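
Because the exact reward expressions survive only as equation images, the following is a plausible reconstruction under common conventions (a progress term toward the goal, and a penalty that grows as the clearance d falls below the safety radius r); every coefficient is an assumption, not the patent's exact form.

```python
import numpy as np

def reward(p_prev, p_t, p_star, d, r,
           k_g=2.5, k_c=1.0, collision_penalty=-15.0):
    """Illustrative R(s_t) = R_g(s_t) + R_c(s_t); not the patent's exact form."""
    # R_g: positive when the robot moved closer to the target p*.
    r_g = k_g * (np.linalg.norm(p_prev - p_star) - np.linalg.norm(p_t - p_star))
    # R_c: zero when clearance exceeds r, negative as it shrinks, large on contact.
    if d <= 0:
        r_c = collision_penalty
    elif d < r:
        r_c = -k_c * (r - d)
    else:
        r_c = 0.0
    return r_g + r_c
```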

Guided by this reward signal, the policy network's behavior gradually converges to approaching the target position while keeping the robot itself safe, giving the formation member robots a basic navigation capability that adapts flexibly to varying environment distributions and multi-agent interaction states.

The policy network can process large-scale multi-robot formation decision tasks in batches, with higher efficiency and stronger scalability; in practice it can handle fairly large formation tasks.

(6) Each formation member robot evaluates the safety of the initial command with the control barrier function based on two-dimensional laser observation, judges the gradient direction of the command optimization with the help of the observation prediction model, and performs iterative optimization with a gradient optimizer, finally obtaining a final control command that ensures safety.

The safety evaluation function is based on control barrier function theory and evaluates how dangerous the environment is. According to the robot state and the evaluation result, the environment can be divided into an initial set, a safe set, and a dangerous set, and the transition between these sets can be expressed as the effect of the control decision on the current state, as in:

[Equation images: formal definitions of the initial, safe, and dangerous sets and of the control-dependent state-transition condition of the control barrier function; not reproduced in the text extraction.]

In the present invention, the observation-based control barrier function describes how safe the current robot state is in the environment and comprises two parts: the robot's own safety and the safety within the formation. The robot's own safety describes the safety relationship between the robot, moving at its current velocity, and the static or dynamic obstacles in the environment; the intra-formation safety describes the safety relationship between the robot, moving at its current velocity, and the other formation members. In practice, the method uses time-to-collision as the safety measure: the length of time until a collision occurs after the robot, in its current state, receives and executes a given command.
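
A minimal sketch of a time-to-collision safety check by forward rollout follows, assuming a unicycle motion model and a world-model query predict_scan (an assumed interface, in the spirit of the prediction model described next); the horizon, time step, and thresholds are illustrative.

```python
import numpy as np

def time_to_collision(pose, cmd, predict_scan, dt=0.1, horizon=3.0,
                      safety_radius=0.35):
    """Roll the command forward and return the first collision time,
    or np.inf if no collision occurs within the horizon.
    predict_scan(pose, cmd, k) is an assumed world-model interface that
    returns the predicted minimum obstacle distance k steps ahead."""
    x, y, th = pose
    v, w = cmd
    t = 0.0
    for k in range(int(horizon / dt)):
        th += w * dt
        x += v * np.cos(th) * dt
        y += v * np.sin(th) * dt
        t += dt
        if predict_scan((x, y, th), cmd, k) < safety_radius:
            return t
    return np.inf

def is_safe(pose, cmd, predict_scan, ttc_min=1.0):
    # Command is accepted only if the predicted collision lies far enough ahead.
    return time_to_collision(pose, cmd, predict_scan) >= ttc_min
```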

The invention uses a world model based on two-dimensional lidar observation to predict observations arbitrarily far ahead. The world model is trained by supervised learning on the robot's autonomous trial-and-error exploration data in the simulation environment, fitting the mechanism by which observations change under different motion commands, and thereby predicting future observations. The network processes the sequence of obstacle occupancy maps recovered from the time-series lidar data with an LSTM for temporal information and a convolutional neural network for image information, injects the current control command into the image latent space to obtain an integrated perception of the current state, and finally outputs the predicted obstacle occupancy map under that command.
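
A compact PyTorch sketch of such a predictor (CNN encoder, command injected into the latent, LSTM over time, deconvolutional decoder); every layer size and the 64x64 map resolution are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LaserWorldModel(nn.Module):
    """Predicts the next occupancy map from a map sequence and a command."""
    def __init__(self, map_size=64, latent=256):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():  # infer encoder output size
            n = self.enc(torch.zeros(1, 1, map_size, map_size)).shape[1]
        self.lstm = nn.LSTM(n + 2, latent, batch_first=True)  # +2 for (v, w)
        self.dec = nn.Sequential(
            nn.Linear(latent, 32 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (32, 8, 8)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 8, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(8, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, maps, cmd):
        # maps: (B, T, 1, H, W); cmd: (B, 2), repeated along time.
        B, T = maps.shape[:2]
        z = self.enc(maps.reshape(B * T, *maps.shape[2:])).reshape(B, T, -1)
        z = torch.cat([z, cmd.unsqueeze(1).expand(B, T, 2)], dim=-1)
        h, _ = self.lstm(z)
        return self.dec(h[:, -1])  # predicted occupancy map, (B, 1, 64, 64)
```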

With the observation-based safety evaluation function and an effective world model in place, and taking into account the dynamics constraints and the effect of control decisions on the current state, the method uses numerical approximation: it builds the gradient for the optimization iterations in finite-difference form from the discrete changes of the environment.

[Equation image: finite-difference approximation of the safety-evaluation gradient with respect to the control command; not reproduced in the text extraction.]

Using this gradient and the corresponding initial control decision, augmented Lagrangian optimization performs multi-step optimization iterations until the policy satisfies the safety evaluation constraint, yielding the final safe decision and guaranteeing the safety of the executed policy.
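
A sketch of this correction loop using a simplified quadratic-penalty form of the augmented Lagrangian, with a finite-difference gradient of an assumed safety margin h(u) (for example h(u) = TTC(u) - TTC_min); the step size, penalty weight, and iteration count are assumptions.

```python
import numpy as np

def make_safe(u0, h, lr=0.05, rho=10.0, iters=50, eps=1e-3):
    """Minimize ||u - u0||^2 subject to h(u) >= 0 via an augmented-
    Lagrangian-style penalty; h is the (assumed) safety margin function."""
    u, lam = np.asarray(u0, dtype=float).copy(), 0.0
    for _ in range(iters):
        if h(u) >= 0:
            return u  # safety constraint satisfied
        # Finite-difference gradient of h at u (one probe per command dim).
        g = np.array([(h(u + eps * e) - h(u)) / eps
                      for e in np.eye(len(u))])
        viol = -h(u)                      # constraint violation, > 0 here
        grad = 2 * (u - u0) + (lam + rho * viol) * (-g)
        u -= lr * grad                    # gradient step on the penalty
        lam = max(0.0, lam + rho * viol)  # multiplier update
    return u
```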

Obviously, the present invention is not limited to the above embodiments and admits many variations; all variations that a person of ordinary skill in the art can derive or conceive directly from the disclosure of the present invention shall be regarded as falling within the protection scope of the present invention.

Claims (7)

1. A constrained multi-robot reinforcement learning safe formation method based on two-dimensional laser observation, characterized by comprising:

S1: Obtain the formation task parameters, including the formation shape, the number of formation members, and the formation starting position; initialize according to these parameters and obtain preliminary state perception information.

S2: Share state perception information in real time through the internal communication of the multi-robot formation, obtaining the state perception information that the other robots in the formation acquire through sensor observation, including two-dimensional laser observations, velocity state information, and the relative position of the target.

S3: Decouple and integrate the multi-robot state perception information to obtain the environment distribution referenced to the formation-center target position, and build an observation potential field from the formation-center target position, the positions of the formation members, and the obstacle distribution.

S4: Using the observation potential field and the formation task parameters, split the formation into layers and compute the formation target positions within each layer, then assign target positions according to the current positions of the formation members.

S5: Each formation member robot generates an initial control command from the assigned target position and the integrated state perception information, using a policy network trained by reinforcement learning.

S6: Using a control barrier function that describes the robot's safety state, iteratively optimize the initial control command with an observation prediction model and a gradient optimizer until the safety constraints are satisfied, obtaining a final control command that guarantees safety.

S7: Send the final control command to the robot controller for execution, reach the new state, and return to S2, until the formation reaches the final target position.

2. The constrained multi-robot reinforcement learning safe formation method based on two-dimensional laser observation of claim 1, characterized in that in step S1 the formation task parameters are obtained, including the formation shape, the number of formation members, and the formation starting position, specifically: multiple formation schemes are supported, including fully enclosing formations, semi-enclosing formations, and enclosure at arbitrary angles, and the method can be extended to formation tasks with a large number of robots.

3. The constrained multi-robot reinforcement learning safe formation method based on two-dimensional laser observation of claim 1, characterized in that in step S2 the state perception information includes two-dimensional laser observations, velocity state information, and the relative position of the target, specifically: the two-dimensional laser information is represented as a fixed-length one-dimensional array determined by the sensor resolution; the velocity state and the target's relative position are represented as two-dimensional coordinates; the integrated state perception information is determined by arrays of bounded size, so the data volume is small, the communication load is light, and the scheme scales to many robots.

4. The constrained multi-robot reinforcement learning safe formation method based on two-dimensional laser observation of claim 1, 2, or 3, characterized in that in step S3 the multi-robot state perception information is decoupled and integrated to obtain the environment distribution referenced to the formation-center target position, and an observation potential field is built from the formation-center target position, the member positions, and the obstacle distribution, specifically: the observation potential field is built from the target position, the obstacle distribution, and the positions of the other formation members, and describes the execution cost of each position when used as a member's target position in the current environment; the cost at each position in the field is determined by the magnitude F of the resultant of the obstacle repulsion F_ro, the teammate repulsion F_ra, and the target attraction F_a acting on that position, with the specific relation:

F = F_ro + F_ra + F_a.

5. The constrained multi-robot reinforcement learning safe formation method based on two-dimensional laser observation of claim 4, characterized in that in step S4 the formation is split into layers according to the observation potential field and the formation task parameters, the formation target positions within each layer are computed, and target positions are assigned according to the members' current positions, specifically: the formation is layered according to the observation potential field and the task settings, and the layered formations share the same overall shape at nested, proportionally enlarged scales, preserving the order of the formation as a whole and the safety of each member within it; inside each layer, the minimum of the potential field is first taken as the position of the current formation member, the overall environment field is then updated with that member's repulsive field, and this cycle is repeated iteratively to find the positions of all members in the layer one by one; once the target formation positions are determined, targets are assigned by combining the members' current positions with the principles of distance priority and avoidance of crossing trajectories.

6. The constrained multi-robot reinforcement learning safe formation method based on two-dimensional laser observation of claim 4, characterized in that in step S5 the initial control command is generated by a policy network trained with reinforcement learning, from the assigned target position and the integrated state perception information, specifically: the policy network learns by autonomous trial and error in a simulation environment and is iteratively optimized using the reward and penalty feedback from the environment until it converges to an autonomous navigation policy with stable performance.

7. The constrained multi-robot reinforcement learning safe formation method based on two-dimensional laser observation of claim 1, 2, 3, 5, or 6, characterized in that in step S6, using a control barrier function that describes the robot's safety state, the initial control command is iteratively optimized with an observation prediction model and a gradient optimizer until the safety constraints are satisfied, yielding a final control command that guarantees safety, specifically: the initial command is safety-evaluated by the control barrier function based on two-dimensional laser observations, the gradient direction of the command optimization is judged with the help of the observation prediction model, and a gradient optimizer performs iterative optimization, finally yielding a control command that ensures safety; the observation-based control barrier function describes how safe the current robot state is in the environment and comprises the robot's own safety and the intra-formation safety, where the robot's own safety describes the safety relationship between the robot, moving at its current velocity, and static or dynamic obstacles in the environment, and the intra-formation safety describes the safety relationship between the robot, moving at its current velocity, and the other formation members.
CN202310164480.4A 2023-02-25 2023-02-25 A Constrained Multi-robot Reinforcement Learning Safe Formation Method Based on 2D Laser Observation Pending CN116300905A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310164480.4A CN116300905A (en) 2023-02-25 2023-02-25 A Constrained Multi-robot Reinforcement Learning Safe Formation Method Based on 2D Laser Observation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310164480.4A CN116300905A (en) 2023-02-25 2023-02-25 A Constrained Multi-robot Reinforcement Learning Safe Formation Method Based on 2D Laser Observation

Publications (1)

Publication Number Publication Date
CN116300905A true CN116300905A (en) 2023-06-23

Family

ID=86816139

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310164480.4A Pending CN116300905A (en) 2023-02-25 2023-02-25 A Constrained Multi-robot Reinforcement Learning Safe Formation Method Based on 2D Laser Observation

Country Status (1)

Country Link
CN (1) CN116300905A (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110147101A (en) * 2019-05-13 2019-08-20 中山大学 An end-to-end distributed multi-robot formation navigation method based on deep reinforcement learning

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110147101A (en) * 2019-05-13 2019-08-20 中山大学 An end-to-end distributed multi-robot formation navigation method based on deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CUI Y, LIN L, HUANG X, ET AL: "Learning observation-based certifiable safe policy for decentralized multi-robot navigation", 2022 INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA). IEEE, 12 July 2022 (2022-07-12), pages 1 - 7 *
YUXIANG CUI, XIAOLONG HUANG, YUE WANG, RONG XIONG: "Socially-Aware Multi-Agent Following with 2D Laser Scans via Deep Reinforcement Learning and Potential Field", PROCEEDINGS OF THE 2021 IEEE INTERNATIONAL CONFERENCE ON REAL-TIME COMPUTING AND ROBOTICS, 31 August 2021 (2021-08-31), pages 1 - 6 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118377304A (en) * 2024-06-20 2024-07-23 North China Electric Power University (Baoding) (华北电力大学(保定)) Multi-robot hierarchical formation control method and system based on deep reinforcement learning
CN118377304B (en) * 2024-06-20 2024-10-29 North China Electric Power University (Baoding) (华北电力大学(保定)) Deep reinforcement learning-based multi-robot layered formation control method and system
CN119472667A (en) * 2024-11-08 2025-02-18 Beijing Normal University (北京师范大学) Automatic control method and system for unmanned vehicle formation based on large language model

Similar Documents

Publication Publication Date Title
CN116551703B (en) Motion planning method based on machine learning in complex environment
Tran et al. Switching formation strategy with the directed dynamic topology for collision avoidance of a multi‐robot system in uncertain environments
CN113110478A (en) Method, system and storage medium for multi-robot motion planning
Li et al. A behavior-based mobile robot navigation method with deep reinforcement learning
CN116300905A (en) A Constrained Multi-robot Reinforcement Learning Safe Formation Method Based on 2D Laser Observation
CN120085659B (en) Strange environment-oriented self-adaptive robot sensing, guiding and controlling integrated system
CN118778650A (en) Improved A-star algorithm for mobile robot motion planning
Zhao et al. A distributed model predictive control-based method for multidifferent-target search in unknown environments
CN118372260A (en) Multi-robot unknown environment exploration method and system based on asymmetric topological representation
Zheng et al. Obstacle avoidance model for UAVs with joint target based on multi-strategies and follow-up vector field
KR102884842B1 (en) Robot and method of controlling the robot
Guan et al. Intelligent obstacle avoidance algorithm for mobile robots in uncertain environment
Kong et al. Flocking with obstacle avoidance for fixed-wing unmanned aerial vehicles via nonlinear model predictive control
Németh et al. Hierarchical control design of automated vehicles for multi-vehicle scenarios in roundabouts
CN116719343B (en) Manned/unmanned aerial vehicle co-integration area search control method based on biological positive and negative feedback
Singh et al. Learning safe cooperative policies in autonomous multi-UAV navigation
Wang et al. Transformer-based path planning for single-arm and dual-arm robots in dynamic environments
CN117666556A (en) A collaborative path planning and obstacle avoidance method for heterogeneous unmanned clusters
Zhou et al. Optimal Path Planning Algorithm of Indoor AGV Enhanced with HHO and Sector-Wide Field DWA
Wang et al. Research on a 3D environment navigation and path planning method based on deep reinforcement learning
CN116047900A (en) A Framework for Hierarchical Humanoid Control of Unmanned Equipment Platforms
Zhang et al. Research on Multi-UUV Path Planning Method Based on Recycling Task
Kant et al. Deep Reinforcement Learning for Autonomous Drone Navigation
Li et al. Cooperative target-surrounding control of unmanned surface vessels based on MADDPG
Gao et al. Self-organizing pursuit strategy in an unknown environment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination