CN116192885A

CN116192885A - High-availability cluster architecture artificial intelligence experiment cloud platform data processing method and system

Info

Publication number: CN116192885A
Application number: CN202211603530.6A
Authority: CN
Inventors: 贾子琪; 杨浩; 朱世冲; 古超; 周楚亚; 张强; 张腾飞; 陈连山
Original assignee: Nanyang Institute of Technology
Current assignee: Nanyang Institute of Technology
Priority date: 2022-12-13
Filing date: 2022-12-13
Publication date: 2023-05-30

Abstract

The application relates to cloud platform technology and provides a data processing method and system for an artificial intelligence experiment cloud platform with a high-availability cluster architecture. The artificial intelligence cloud platform comprises a plurality of master nodes and a plurality of slave nodes. If a target slave node receives an experimental work task deployment instruction sent by a target master node, it creates a target container according to that instruction; if the target slave node receives an access request from a user terminal and the request passes verification, it connects the target container instance corresponding to the target container with the user terminal; the target slave node receives target operation data from the user terminal and stores it in the key-value database corresponding to the target container; and if the target slave node receives a container operation instruction, it creates or deletes a container according to that instruction. The method and system allow artificial intelligence experiment tasks to be processed in the cloud on the platform, allow nodes to be added to or removed from the cluster at any time, and improve the availability and load capacity of the cluster.

Description

High-availability cluster architecture artificial intelligence experiment cloud platform data processing method and system

Technical field

The present application relates to the field of cloud platform technology, and in particular to a data processing method and system for an artificial intelligence experiment cloud platform with a high-availability cluster architecture.

Background

At present, some enterprises and universities run artificial intelligence experiments on experiment platform clusters, that is, they place the experiment data on a cloud platform cluster and carry out the experiment tasks in the cloud. However, existing cloud platform clusters often cannot add or remove nodes at any time. This limits the number of operators an artificial intelligence experiment can serve, so cloud experiment tasks with varying numbers of participants cannot be handled. Moreover, when an existing cloud platform cluster encounters an abnormal fault such as a power failure, it cannot save the experiment data automatically, which puts the data at considerable risk.

Summary of the invention

The embodiments of the present application provide a data processing method and system for an artificial intelligence experiment cloud platform with a high-availability cluster architecture. They aim to solve the problem that, in the prior art, the cloud platform clusters used for artificial intelligence experiments often cannot add or remove nodes at any time, which limits the number of operators an experiment can serve, so that only experiments with a small number of participants can be carried out.

In a first aspect, an embodiment of the present application provides a data processing method for an artificial intelligence experiment cloud platform with a high-availability cluster architecture. The method is applied to an artificial intelligence experiment cloud platform that includes a plurality of master nodes and a plurality of slave nodes, the plurality of master nodes and the plurality of slave nodes all being communicatively connected. The method includes:

if a target slave node receives an experimental work task deployment instruction sent by a target master node, the target slave node creates a target container according to the experimental work task deployment instruction; wherein the target slave node is any one of the plurality of slave nodes, and the target master node is the master node that is currently active among the plurality of master nodes;

if the target slave node receives an access request from a user terminal and the request passes verification, the target slave node connects the target container instance corresponding to the target container with the user terminal;

the target slave node receives target operation data from the user terminal and stores the target operation data in the key-value database corresponding to the target container;

the target master node sends a container operation instruction to the target slave node;

if the target slave node receives the container operation instruction, it creates or deletes a container according to the container operation instruction.
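
For illustration only, the message handling of the target slave node in the steps above can be sketched in Python as follows. The class `SlaveNode`, its method names, the token-based verification, and the in-memory dictionary standing in for the key-value database are all assumptions made for this sketch; the application does not prescribe them.

```python
from typing import Dict, Set


class SlaveNode:
    """Toy target slave node; names and data structures are hypothetical."""

    def __init__(self, valid_tokens: Set[str]) -> None:
        self.valid_tokens = valid_tokens
        # One key-value store per container (a dict stands in for the database).
        self.stores: Dict[str, Dict[str, str]] = {}
        # Which container each verified user terminal is connected to.
        self.sessions: Dict[str, str] = {}

    def on_deploy_instruction(self, container_id: str) -> None:
        """Create the target container named by a deployment instruction."""
        self.stores.setdefault(container_id, {})

    def on_access_request(self, user: str, token: str, container_id: str) -> bool:
        """Connect the user terminal to the container only if verification passes."""
        if token not in self.valid_tokens or container_id not in self.stores:
            return False
        self.sessions[user] = container_id
        return True

    def on_operation_data(self, user: str, key: str, value: str) -> None:
        """Store operation data in the store of the container the user is connected to."""
        self.stores[self.sessions[user]][key] = value

    def on_container_operation(self, action: str, container_id: str) -> None:
        """Create or delete a container as instructed by the master node."""
        if action == "create":
            self.stores.setdefault(container_id, {})
        elif action == "delete":
            self.stores.pop(container_id, None)
            self.sessions = {u: c for u, c in self.sessions.items() if c != container_id}
        else:
            raise ValueError(f"unknown container operation: {action}")


node = SlaveNode(valid_tokens={"token-1"})
node.on_deploy_instruction("exp-1")
connected = node.on_access_request("student-1", "token-1", "exp-1")
node.on_operation_data("student-1", "cell-1", "print('hello')")
saved = dict(node.stores["exp-1"])          # snapshot before deletion
node.on_container_operation("delete", "exp-1")
```

Keeping one store per container mirrors the statement that operation data is stored in the key-value database corresponding to the target container.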

In a second aspect, an embodiment of the present application provides a data processing system for an artificial intelligence experiment cloud platform with a high-availability cluster architecture. The system runs on the artificial intelligence experiment cloud platform and includes a plurality of master nodes and a plurality of slave nodes, the plurality of master nodes and the plurality of slave nodes all being communicatively connected; wherein the target slave node is any one of the plurality of slave nodes, and the target master node is the master node that is currently active among the plurality of master nodes;

the target slave node is configured to create a target container according to an experimental work task deployment instruction if it receives that instruction from the target master node; wherein the target slave node is any one of the plurality of slave nodes, and the target master node is the master node that is currently active among the plurality of master nodes;

the target slave node is further configured to connect the target container instance corresponding to the target container with a user terminal if it receives an access request from the user terminal and the request passes verification;

the target slave node is further configured to receive target operation data from the user terminal and store the target operation data in the key-value database corresponding to the target container;

the target master node is configured to send a container operation instruction to the target slave node;

the target slave node is further configured to create or delete a container according to the container operation instruction if it receives the container operation instruction sent by the target master node.

In a third aspect, an embodiment of the present application further provides a computer device that includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the data processing method of the first aspect.

In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the data processing method of the first aspect.

The embodiments of the present application provide a data processing method and system for an artificial intelligence experiment cloud platform with a high-availability cluster architecture. The artificial intelligence cloud platform includes a plurality of master nodes and a plurality of slave nodes, all of which are communicatively connected. In the method, if the target slave node receives an experimental work task deployment instruction sent by the target master node, it creates a target container according to that instruction; if the target slave node receives an access request from a user terminal and the request passes verification, it connects the target container instance corresponding to the target container with the user terminal; the target slave node receives target operation data from the user terminal and stores it in the key-value database corresponding to the target container; the target master node sends a container operation instruction to the target slave node; and if the target slave node receives the container operation instruction, it creates or deletes a container accordingly. As a result, artificial intelligence experiment tasks can be processed in the cloud on the platform, nodes can be added to or removed from the cluster at any time, and the availability and load capacity of the cluster are improved.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of an application scenario of the data processing method for an artificial intelligence experiment cloud platform with a high-availability cluster architecture provided by an embodiment of the present application;

Fig. 2 is a schematic flowchart of the data processing method for an artificial intelligence experiment cloud platform with a high-availability cluster architecture provided by an embodiment of the present application;

Fig. 3 is a schematic block diagram of the data processing system for an artificial intelligence experiment cloud platform with a high-availability cluster architecture provided by an embodiment of the present application;

Fig. 4 is a schematic block diagram of a computer device provided by an embodiment of the present application.

Detailed description

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments in this application without creative effort fall within the scope of protection of this application.

It should be understood that, when used in this specification and the appended claims, the terms "comprising" and "including" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or collections thereof.

It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise.

It should be further understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.

Please refer to Fig. 1 and Fig. 2. Fig. 1 is a schematic diagram of an application scenario of the data processing method provided by an embodiment of the present application, and Fig. 2 is a schematic flowchart of the method. The method is applied to an artificial intelligence experiment cloud platform. As shown in Fig. 1, the platform includes a plurality of master nodes and a plurality of slave nodes, all of which are communicatively connected. The platform can be regarded as a Kubernetes cluster made up of these master nodes and slave nodes, that is, a distributed system that manages the orchestration and scheduling of the resources of a single container cluster.

Each of the plurality of master nodes includes an APIServer module (which can be understood as an interface module), a Scheduler module (a scheduling module), a Controller-Manager module (a management control module), and a key-value database (which may be an Etcd database). The APIServer module notifies the slave nodes, according to the master node's decisions, to create, delete, or stop cluster resources. The Scheduler module schedules Pods according to the resource consumption of each slave node in the cluster corresponding to the platform (a Pod is the smallest unit that can be created and managed in the Kubernetes (K8S) system, and the smallest resource object that a user creates or deploys in the resource object model). The Controller-Manager module checks whether each master node and each slave node in the cluster is healthy. The key-value database stores the important configuration information of the cluster and persists the cluster's data resources.

At any moment, only one of the master nodes is running in the active state; it can be denoted Leader-Master-Node, and only the Leader-Master-Node provides services externally. The other master nodes remain in an inactive standby state. If the working master node (the Leader-Master-Node) becomes abnormal, the cluster automatically selects one of the standby master nodes to replace it immediately; the selected node becomes the new active master node and takes over the current work.
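
The active/standby behaviour described above can be sketched as follows. This is a minimal model, not the platform's implementation: the names `MasterNode` and `MasterGroup` are hypothetical, and the rule "promote the first healthy standby" is an assumption, since the text only says that one standby node is selected automatically.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MasterNode:
    name: str
    healthy: bool = True


class MasterGroup:
    """Several master nodes, at most one of which is the active leader."""

    def __init__(self, masters: List[MasterNode]) -> None:
        self.masters = masters
        self.leader: Optional[MasterNode] = None
        self.elect()

    def elect(self) -> Optional[MasterNode]:
        """Keep a healthy leader; otherwise promote the first healthy standby (assumed rule)."""
        if self.leader is not None and self.leader.healthy:
            return self.leader
        self.leader = next((m for m in self.masters if m.healthy), None)
        return self.leader

    def standbys(self) -> List[MasterNode]:
        """All master nodes that are not currently the leader."""
        return [m for m in self.masters if m is not self.leader]


group = MasterGroup([MasterNode("master-1"), MasterNode("master-2"), MasterNode("master-3")])
first_leader = group.leader.name      # the initial Leader-Master-Node
group.masters[0].healthy = False      # simulate a fault on the active master
new_leader = group.elect().name       # a standby takes over
```

Calling `elect()` again while the leader is healthy returns the same node, which reflects the requirement that only one master node is active at a time.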

Each of the plurality of slave nodes can be regarded as a worker node in the cluster corresponding to the platform: it is the node that actually executes the artificial intelligence experiment tasks, and it hosts the containers that run the actual workloads and resources. Besides providing the runtime environment for Pods, each slave node has infrastructure for management and communication. Specifically, each slave node exchanges data with the master nodes through the Kubelet component (an agent component on the slave node). The Kubelet component periodically receives work tasks from the master node's API-Server module in order to handle everything related to the life cycle of the Pods on the master node; the Kubelet also periodically reports all of its work information to the master node through the master node's API-Server module. Different slave nodes access one another through the Kube-proxy component (the network proxy component on the slave nodes of a Kubernetes cluster).
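
The periodic exchange between the agent on a slave node and the interface module of the master node can be sketched as below. `FakeApiServer` and `NodeAgent` are hypothetical in-memory stand-ins written for this sketch; they are not the Kubernetes API or the real Kubelet, and one call to `sync_once` represents a single polling period.

```python
from collections import deque
from typing import Deque, Dict, List


class FakeApiServer:
    """In-memory stand-in for the master node's interface module (hypothetical)."""

    def __init__(self) -> None:
        self.pending: Dict[str, Deque[str]] = {}
        self.reports: List[dict] = []

    def assign(self, node: str, pod: str) -> None:
        """Queue a work task (a Pod name) for a given slave node."""
        self.pending.setdefault(node, deque()).append(pod)

    def fetch_tasks(self, node: str) -> List[str]:
        """Hand over and clear all tasks queued for the node."""
        queue = self.pending.get(node, deque())
        tasks = list(queue)
        queue.clear()
        return tasks

    def report(self, node: str, running: List[str]) -> None:
        """Record the work information reported by a node."""
        self.reports.append({"node": node, "running": list(running)})


class NodeAgent:
    """Agent on a slave node that polls for tasks and reports its state."""

    def __init__(self, name: str, api: FakeApiServer) -> None:
        self.name = name
        self.api = api
        self.running: List[str] = []

    def sync_once(self) -> None:
        """One polling period: receive new tasks, then report current work."""
        self.running.extend(self.api.fetch_tasks(self.name))
        self.api.report(self.name, self.running)


api = FakeApiServer()
agent = NodeAgent("slave-1", api)
api.assign("slave-1", "pod-a")
agent.sync_once()
```

In a real deployment the polling would run on a timer; the single-step form is used here so that the behaviour is easy to follow.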

As shown in Fig. 2, the data processing method for the artificial intelligence experiment cloud platform with a high-availability cluster architecture includes steps S101 to S105.

S101. If the target slave node receives an experimental work task deployment instruction sent by the target master node, it creates a target container according to the experimental work task deployment instruction; wherein the target slave node is any one of the plurality of slave nodes, and the target master node is the master node that is currently active among the plurality of master nodes.

In this embodiment, once a plurality of master nodes and a plurality of slave nodes form the cluster corresponding to the artificial intelligence experiment cloud platform, the platform can serve as a cloud platform for artificial intelligence experiments. Specifically, when a user of the target master node (for example, a platform administrator logged in to a master node of the platform with an administrator-authority user account) uses the user interface to deploy an experiment task on the slave nodes, an experimental work task deployment instruction is generated. The instruction generated in the target master node is sent to the slave nodes, and the target slave node among them creates the target container according to the instruction. The target master node is the master node that is currently active among the plurality of master nodes, which ensures that only one master node in the cluster is working and processing data at any given time.

After the target container has been created in the target slave node according to the experimental work task deployment instruction, the target image corresponding to the instruction must also be added to each target container to obtain the target container instances. Each target container instance can then be associated with the experiment participant of a corresponding user terminal, so that each participant can use the user terminal to connect to the corresponding target container instance and work on the artificial intelligence experiment task. The container runtime environment, model code, and other data are already deployed in each target container instance. The container engine of the target container instance provides the container runtime environment and builds images for different requirements; the private image repository integrates framework images needed for artificial intelligence experiments, such as TensorFlow, Caffe, and PyTorch, and also supports deep neural networks (DNN), convolutional neural networks (CNN), and the object-detection models YOLOv1 to YOLOv5.

It can be seen that the target container is created in a slave node rather than in a master node. This ensures that the slave nodes are the cloud devices that actually run containers in the platform, while the master nodes are the cloud devices that monitor and manage the slave nodes in a unified way. If a slave node fails, the harm is limited because the platform uses a high-availability cluster and high-availability application deployment, which keeps the cloud platform highly reliable.

In one embodiment, before step S101, the method further includes:

the interface module in the target master node establishes a communication connection with the Kubelet agent component of the target slave node.

In this embodiment, when the artificial intelligence experiment cloud platform with a high-availability cluster architecture is built, the plurality of master nodes and the plurality of slave nodes must first be communicatively connected. Specifically, each slave node uses its Kubelet agent component to establish a communication connection with the interface module in the target master node; the target slave node, being one of the slave nodes, does the same. The Kubelet agent component can be thought of as the link through which the target master node and the slave nodes exchange data. The Kubelet component periodically receives work tasks from the master node's API-Server module in order to handle everything related to the life cycle of the Pods on the master node; the Kubelet also periodically reports all of its work information to the master node through the master node's API-Server module. In addition, each slave node of the platform can access the Internet through the Kube-proxy component, or communicate with user terminals over the Internet.

In one embodiment, the step in which the interface module in the target master node establishes a communication connection with the Kubelet agent component of the target slave node includes:

the Keepalived component of the target master node automatically configures the virtual IP address of the artificial intelligence experiment cloud platform through the Virtual Router Redundancy Protocol;

the interface module of the target master node establishes a communication connection with the Kubelet agent component of the target slave node based on the virtual IP address.

In this embodiment, every master node has a Keepalived component and an HAProxy component. The Keepalived component automatically configures the virtual IP address of the platform through the Virtual Router Redundancy Protocol (VRRP), so that the platform exposes a single, unified virtual IP for external access. The HAProxy component provides load balancing for the slave nodes.

Once the Keepalived component of the target master node obtains the virtual IP address of the platform, it configures the address automatically, so that the target master node has the same virtual IP as the platform. The interface module of the target master node connects to the Kubelet agent component of the target slave node through the virtual IP address; likewise, when any other master node switches from the standby state to the active state, its interface module connects to the Kubelet agent component of the target slave node through the same virtual IP address. This architecture ensures the high availability and load capacity of the system.
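
The effect of the shared virtual IP can be sketched as follows. The model keeps only one idea from VRRP, namely that the healthy node with the highest priority holds the virtual address; advertisements, timers, and preemption settings are not modelled. The address, the priorities, and the names `Master`, `vip_owner`, and `connect_kubelet` are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

VIRTUAL_IP = "192.0.2.10"  # documentation-range address, chosen for illustration


@dataclass
class Master:
    name: str
    priority: int
    healthy: bool = True


def vip_owner(masters: List[Master]) -> Optional[Master]:
    """Return the healthy master with the highest priority, or None if all are down."""
    candidates = [m for m in masters if m.healthy]
    return max(candidates, key=lambda m: m.priority, default=None)


def connect_kubelet(masters: List[Master], address: str) -> str:
    """Resolve a slave-side connection to whichever master currently holds the virtual IP."""
    if address != VIRTUAL_IP:
        raise ValueError("slave nodes are expected to use the platform's virtual IP")
    owner = vip_owner(masters)
    if owner is None:
        raise ConnectionError("no healthy master node holds the virtual IP")
    return owner.name


masters = [Master("master-1", priority=150), Master("master-2", priority=100)]
before = connect_kubelet(masters, VIRTUAL_IP)
masters[0].healthy = False          # the active master fails
after = connect_kubelet(masters, VIRTUAL_IP)
```

Because the slave node only ever addresses the virtual IP, its configuration does not change when a standby master becomes active.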

In one embodiment, step S101 includes:

if the experimental work task deployment instruction is a unified experimental task deployment instruction, the target slave node obtains the first target image resource, the GPU resource, and the data storage volume path corresponding to the unified experimental task deployment instruction, and creates the target container according to them;

if the experimental work task deployment instruction is a personalized container deployment instruction, the target slave node obtains the second target image resource corresponding to the personalized container deployment instruction, and creates the target container according to the second target image resource and a pre-stored data storage volume path.
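
The two branches of step S101 can be sketched as a small dispatch function. The data classes, field names, image names, and the value of `PRESTORED_VOLUME_PATH` are hypothetical. The text does not say how GPU resources are chosen for a personalized container, so the sketch leaves that field unset.

```python
from dataclasses import dataclass
from typing import Optional

PRESTORED_VOLUME_PATH = "/data/experiments"  # hypothetical pre-stored path


@dataclass
class DeployInstruction:
    kind: str                          # "unified" or "personalized"
    image: str                         # first or second target image resource
    gpu_count: Optional[int] = None    # carried only by unified instructions
    volume_path: Optional[str] = None  # carried only by unified instructions


@dataclass
class ContainerSpec:
    image: str
    gpu_count: Optional[int]
    volume_path: str


def build_container_spec(instr: DeployInstruction) -> ContainerSpec:
    """Derive the container parameters from the type of deployment instruction."""
    if instr.kind == "unified":
        if instr.gpu_count is None or instr.volume_path is None:
            raise ValueError("a unified instruction must carry GPU and volume settings")
        return ContainerSpec(instr.image, instr.gpu_count, instr.volume_path)
    if instr.kind == "personalized":
        # Only the image comes from the instruction; the volume path is pre-stored.
        return ContainerSpec(instr.image, None, PRESTORED_VOLUME_PATH)
    raise ValueError(f"unknown instruction type: {instr.kind}")


unified = build_container_spec(
    DeployInstruction("unified", "registry.example/tensorflow:lab", 1, "/data/class-a")
)
personal = build_container_spec(
    DeployInstruction("personalized", "registry.example/pytorch:custom")
)
```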

In this embodiment, at least three classes of user accounts with different authority are defined in advance in the artificial intelligence experiment cloud platform: administrator-authority user accounts, first-authority user accounts (for example, teacher accounts), and second-authority user accounts (for example, student accounts). An administrator-authority user account has authority to manage all data of the entire platform; creating namespaces in the slave nodes of the platform is one such authority. A first-authority user account has authority to create multiple containers in its corresponding namespace for second-authority user accounts to log in to and use. A second-authority user account only has authority to log in to the corresponding container in the corresponding namespace to work on artificial intelligence experiment tasks.
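
A minimal sketch of this three-level authority check is shown below. The action names, the `PERMISSIONS` table, and the rule that non-administrator accounts are confined to their own namespace are assumptions derived from the description, not an interface defined by the application.

```python
from enum import Enum
from typing import Optional


class Role(Enum):
    ADMIN = "administrator"
    FIRST = "first-authority"    # for example, a teacher
    SECOND = "second-authority"  # for example, a student


# Hypothetical action names; administrators may manage everything.
PERMISSIONS = {
    Role.ADMIN: {"create_namespace", "create_container", "use_container"},
    Role.FIRST: {"create_container"},
    Role.SECOND: {"use_container"},
}


def is_allowed(role: Role, action: str,
               user_namespace: Optional[str], target_namespace: str) -> bool:
    """Check the role's actions, then confine non-administrators to their own namespace."""
    if action not in PERMISSIONS[role]:
        return False
    if role is Role.ADMIN:
        return True
    return user_namespace == target_namespace


teacher_ok = is_allowed(Role.FIRST, "create_container", "class-a", "class-a")
student_create = is_allowed(Role.SECOND, "create_container", "class-a", "class-a")
student_other_ns = is_allowed(Role.SECOND, "use_container", "class-a", "class-b")
```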

The administrator-authority user account can receive a list of user accounts to be created, provided by the user of a first-authority user account. After logging in to a master node or slave node of the platform, the administrator combines the teacher name and the student class name associated with that user account list to obtain a combined name. The administrator then creates a namespace with that combined name in the slave nodes of the platform. In effect, the administrator creates a class-specific namespace in the platform for the class taught by the teacher with that name. The administrator can then set up the corresponding number of second-authority user accounts according to the list of student names (or student ID numbers) in the user account list, so that the total number of student names in the list equals the total number of second-authority user accounts.
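
The naming and account-planning step can be sketched as follows. Kubernetes restricts namespace names to lowercase alphanumeric characters and hyphens with a maximum length of 63 characters, so the sketch normalizes the combined name accordingly; the functions `namespace_name` and `plan_accounts`, the account-naming scheme, and the sample names are hypothetical. Names written in non-ASCII characters would need transliteration first, which the sketch does not attempt.

```python
import re
from typing import Dict, List


def namespace_name(teacher: str, class_name: str) -> str:
    """Combine teacher name and class name into a namespace-safe label."""
    raw = f"{teacher}-{class_name}".lower()
    slug = re.sub(r"[^a-z0-9]+", "-", raw).strip("-")
    if not slug:
        raise ValueError("combined name contains no usable characters")
    return slug[:63].rstrip("-")


def plan_accounts(namespace: str, students: List[str]) -> List[Dict[str, str]]:
    """Plan one second-authority account per student on the list."""
    return [
        {"namespace": namespace, "account": f"{namespace}-{i:03d}", "student": s}
        for i, s in enumerate(students, start=1)
    ]


ns = namespace_name("Li Wei", "AI Class A")
accounts = plan_accounts(ns, ["student-1", "student-2", "student-3"])
```

The number of planned accounts always equals the number of entries on the student list, matching the one-to-one correspondence described above.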

Of course, when multiple containers are created in each namespace according to actual needs, the administrator of the administrator-authority user account may create the corresponding number of containers according to the requirements of the artificial intelligence experiment task and the student name list included in the user account list; alternatively, the teacher of the first-authority user account may create them on the same basis. In the artificial intelligence experiment cloud platform, the information about each namespace is stored on the master node, while the containers belonging to each namespace are deployed on the slave nodes of the platform.

After the namespace of a class, for example class A, has been created on a slave node of the artificial intelligence experiment cloud platform, the teacher of the first-authority user account can generate a unified experimental task deployment instruction according to the requirements of the artificial intelligence experiment task, specifying in it the first target image resource, the GPU resource and the data storage volume path. The resource container layer then creates containers according to the first target image resource, GPU resource and data storage volume path of that unified experimental task deployment instruction.

The first target image resource can be one or more images chosen for deployment from framework images that integrate TensorFlow, Caffe, PyTorch and other frameworks needed for experiments in the artificial intelligence field, or from artificial intelligence neural network model images that integrate deep neural networks (DNN), convolutional neural networks (CNN) and object detection networks such as YoLoV1 to YoLoV5. More specifically, the first target image resource may use the TensorFlow framework with a convolutional neural network (CNN) deployed on it.

Once the teacher of the first-authority user account has generated the unified experimental task deployment instruction according to the requirements of the artificial intelligence experiment task, and the multiple containers bound to the namespace have been created in the resource container layer accordingly, the initial environment for the artificial intelligence experiment task is in place.
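As a rough illustration of the unified deployment step, the sketch below expands one instruction into one container specification per student. The type names, the field names and the per-student sub-path under the data storage volume path are assumptions made for the example; the application only states that the instruction carries an image resource, a GPU resource and a data storage volume path.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class UnifiedDeployInstruction:
    """Hypothetical shape of a unified experimental task deployment instruction."""
    namespace: str
    image: str        # first target image resource
    gpu_count: int    # GPU resource
    volume_root: str  # data storage volume path


@dataclass(frozen=True)
class ContainerSpec:
    name: str
    namespace: str
    image: str
    gpu_count: int
    volume_path: str


def plan_containers(instr: UnifiedDeployInstruction, student_ids: List[str]) -> List[ContainerSpec]:
    """Produce one identical container spec per student in the class."""
    root = instr.volume_root.rstrip("/")
    return [
        ContainerSpec(
            name=f"exp-{sid}",
            namespace=instr.namespace,
            image=instr.image,
            gpu_count=instr.gpu_count,
            # Assumption: each student gets a private sub-directory of the volume.
            volume_path=f"{root}/{sid}",
        )
        for sid in student_ids
    ]
```

Every container in the class shares the same image and GPU setting, which is what makes the instruction "unified"; only the name and volume path differ.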

Of course, after the namespace of a class such as class A has been created on the target slave node of the artificial intelligence experiment cloud platform, a student of a second-authority user account can also generate a personalized container deployment instruction according to his or her own needs for the artificial intelligence experiment task. A slave node of the platform sends this personalized container deployment instruction to the user terminal used by the teacher of the first-authority user account. Once the teacher approves the instruction on the user terminal, the target slave node of the platform creates a container according to the second target image resource corresponding to the personalized container deployment instruction and the pre-stored data storage volume path. Likewise, the second target image resource can be one or more images chosen for deployment from framework images that integrate TensorFlow, Caffe, PyTorch and other frameworks needed for experiments in the artificial intelligence field, or from artificial intelligence neural network model images that integrate deep neural networks (DNN), convolutional neural networks (CNN) and object detection networks such as YoLoV1 to YoLoV5. More specifically, the second target image resource may use the TensorFlow framework with the YoLoV5 object detection network deployed on it.

Moreover, when a student of a second-authority user account generates a personalized container deployment instruction according to his or her own needs for the artificial intelligence experiment task, no GPU resource is allocated by default; that is, the containers created on the target slave node from a personalized container deployment instruction are ordinary server containers rather than GPU server containers. If the container requirements of the personalized container deployment instruction do call for a GPU server container, the instruction is sent to the user terminal used by the teacher of the first-authority user account, and once the teacher approves it on the user terminal, the target slave node of the platform creates the container according to the second target image resource corresponding to the instruction, the GPU resource and the pre-stored data storage volume path.
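The approval gate and the "no GPU by default" rule can be sketched as below. All names (`PersonalRequest`, `resolve_request`, the dictionary keys) are hypothetical, and the single-GPU grant is an assumption; the application does not say how many GPUs an approved request receives.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass
class PersonalRequest:
    """Hypothetical personalized container deployment instruction."""
    student: str
    image: str               # second target image resource
    wants_gpu: bool = False  # GPU is opt-in, never the default
    approved: bool = False   # set by the teacher on the user terminal


def resolve_request(
    req: PersonalRequest, volume_path: str, gpu_grant: int = 1
) -> Optional[Dict[str, Union[str, int]]]:
    """Return container settings, or None while teacher approval is pending."""
    if not req.approved:
        return None
    return {
        "image": req.image,
        # Ordinary server container unless a GPU was explicitly requested.
        "gpus": gpu_grant if req.wants_gpu else 0,
        "volume_path": volume_path,  # pre-stored data storage volume path
    }
```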

In one embodiment, the target slave node obtaining the first target image resource, GPU resource and data storage volume path corresponding to the unified experimental task deployment instruction includes:

If the target slave node detects a unified experimental task deployment instruction, it obtains the teaching progress information and the teacher teaching tag set corresponding to that instruction, and generates the first target image resource, GPU resource and data storage volume path corresponding to the instruction according to the teaching progress information, the teacher teaching tag set and a preset resource calling strategy.

In this embodiment, when the target slave node detects a unified experimental task deployment instruction, it parses the instruction and determines whether it includes teaching progress information (for example, that the class has reached Chapter 1, Section 5 of the AA artificial intelligence course) and a teacher teaching tag set (for example, tags such as face recognition and convolutional neural network). If parsing yields both the teaching progress information and the teacher teaching tag set, the container information needed to create the container, namely the first target image resource, GPU resource and data storage volume path, is generated according to the teaching progress information, the teacher teaching tag set and the preset resource calling strategy. The resource calling strategy can be understood as a preset mapping table that relates teaching progress information and teacher teaching tag sets to target image resources, GPU resources and data storage volume paths; using the teaching progress information and teacher teaching tag set as the search key, the corresponding target image resource, GPU resource and data storage volume path can be looked up.
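Since the resource calling strategy is described as a mapping table keyed by teaching progress and tag set, a minimal sketch is a dictionary lookup. The key format, the field names in the instruction, and the table entries below are invented for illustration only.

```python
from typing import Dict, FrozenSet, NamedTuple, Optional, Tuple


class Resources(NamedTuple):
    image: str
    gpu_count: int
    volume_path: str


StrategyKey = Tuple[str, FrozenSet[str]]

# Hypothetical strategy table: (teaching progress, tag set) -> resources.
STRATEGY: Dict[StrategyKey, Resources] = {
    ("AA-ch1-sec5", frozenset({"face recognition", "convolutional neural network"})):
        Resources("tensorflow-cnn", 1, "/data/aa/ch1-sec5"),
    ("AA-ch2-sec1", frozenset({"object detection"})):
        Resources("tensorflow-yolov5", 1, "/data/aa/ch2-sec1"),
}


def parse_instruction(instr: dict) -> Optional[StrategyKey]:
    """Extract the lookup key; None if either part is missing or empty."""
    progress = instr.get("teaching_progress")
    tags = instr.get("teaching_tags")
    if not progress or not tags:
        return None
    # A frozenset makes the key independent of tag order.
    return progress, frozenset(tags)


def lookup_resources(
    instr: dict, strategy: Dict[StrategyKey, Resources] = STRATEGY
) -> Optional[Resources]:
    """Look up image, GPU and volume path for a deployment instruction."""
    key = parse_instruction(instr)
    if key is None:
        return None
    return strategy.get(key)
```

An exact-match table is the simplest reading of the text; a real strategy might fall back to partial tag matches, which the application does not describe.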

S102. If the target slave node receives an access request from a user terminal and the request passes verification, the target container instance corresponding to the target container is connected to the user terminal.

In this embodiment, after the target container and the target container instance have been deployed on the target slave node, they can be made available to users for processing artificial intelligence experiment tasks. When a user needs to access the target container on a slave node, an access request carrying the user account information is first sent to the target slave node. When the target slave node has verified the access request successfully, a communication connection is established between the target container instance corresponding to the target container and the user terminal. The user terminal can then access the target slave node to process artificial intelligence experiment tasks.
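The verify-then-connect flow of step S102 might look like the sketch below. The application does not say how the account information is verified, so the shared-secret check, the class name and the method names are all assumptions chosen to keep the example small.

```python
import hmac
from typing import Dict, Optional


class SlaveNode:
    """Toy slave node that binds verified user terminals to container instances."""

    def __init__(self) -> None:
        self._secrets: Dict[str, bytes] = {}     # account -> secret
        self._instances: Dict[str, str] = {}     # account -> container instance
        self.connections: Dict[str, str] = {}    # terminal id -> container instance

    def register(self, account: str, secret: str, instance: str) -> None:
        """Record which container instance belongs to which account."""
        self._secrets[account] = secret.encode("utf-8")
        self._instances[account] = instance

    def handle_access(self, terminal_id: str, account: str, secret: str) -> Optional[str]:
        """Verify the request; on success connect the terminal and return the instance."""
        expected = self._secrets.get(account)
        # compare_digest avoids leaking how many leading bytes matched.
        if expected is None or not hmac.compare_digest(expected, secret.encode("utf-8")):
            return None
        instance = self._instances[account]
        self.connections[terminal_id] = instance
        return instance
```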

S103. The target slave node receives target operation data from the user terminal and stores the target operation data in the key-value database corresponding to the target container.

In this embodiment, after the user terminal is connected to the target container instance on the target slave node, the target container instance can receive target operation data from the user terminal. To improve the security of the target operation data, the data can be stored in the key-value database in the target master node that corresponds to the target container (for example an Etcd database, which is a key-value database). In this way, even if the target slave node fails and stops running, the operation data of every container on it remains stored in the key-value database of the target master node. When the target slave node has been repaired and returns to normal, all of its data is retrieved from the target master node's key-value database for breakpoint recovery.
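The store-on-master, recover-by-node idea can be sketched with an in-memory stand-in for the key-value database. The `KeyValueStore` class is not an Etcd client, and the key layout `/nodes/<node>/<container>/<sequence>` is an assumption; it is chosen so that one prefix query returns everything belonging to a slave node.

```python
import json
from typing import Dict, List, Tuple


class KeyValueStore:
    """In-memory stand-in for the key-value database held by the master node."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get_prefix(self, prefix: str) -> List[Tuple[str, str]]:
        """Return all (key, value) pairs under a prefix, in key order."""
        return [(k, self._data[k]) for k in sorted(self._data) if k.startswith(prefix)]


def record_operation(store: KeyValueStore, node: str, container: str, seq: int, op: dict) -> None:
    """Persist one piece of operation data; zero-padding keeps keys in sequence order."""
    store.put(f"/nodes/{node}/{container}/{seq:08d}", json.dumps(op))


def recover_node(store: KeyValueStore, node: str) -> Dict[str, List[dict]]:
    """Rebuild the per-container operation history of a repaired slave node."""
    history: Dict[str, List[dict]] = {}
    for key, value in store.get_prefix(f"/nodes/{node}/"):
        container = key.split("/")[3]  # ['', 'nodes', node, container, seq]
        history.setdefault(container, []).append(json.loads(value))
    return history
```

The trailing slash in the prefix matters: without it, a query for node `n1` would also match `n10`.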

In one embodiment, the following is further included after step S103:

If the target master node detects that its current working state is abnormal, one master node is randomly selected from the remaining master nodes to serve as the target master node.

In this embodiment, normally only one master node in the cluster is working and handling data processing at any given time, while the other master nodes are in an inactive standby state. If the working target master node (the Leader-Master-Node) becomes abnormal, the cluster of the artificial intelligence experiment cloud platform automatically selects one of the standby master nodes to replace the abnormal node immediately; the selected node becomes the new running, active master control node (Leader-Master-Node) and takes over the current work. A failure of any single master node therefore does not affect the work of the cluster as a whole.
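The failover rule can be sketched as below, mirroring the random selection stated in the text. The class and method names are hypothetical, and the health bookkeeping is reduced to a flag per node; a production cluster would normally rely on a consensus-based or lease-based election rather than a local random choice.

```python
import random
from typing import Dict, List, Optional


class MasterCluster:
    """Toy model: one active master, the rest on standby."""

    def __init__(self, names: List[str], rng: Optional[random.Random] = None) -> None:
        if not names:
            raise ValueError("at least one master node is required")
        self.healthy: Dict[str, bool] = {name: True for name in names}
        self.active: str = names[0]
        self._rng = rng or random.Random()

    def report_failure(self, name: str) -> None:
        """Mark a master as abnormal; fail over only if it was the active one."""
        self.healthy[name] = False
        if name == self.active:
            self.active = self._elect()

    def _elect(self) -> str:
        """Randomly pick one of the remaining healthy masters."""
        candidates = [n for n, ok in self.healthy.items() if ok]
        if not candidates:
            raise RuntimeError("no healthy master node left")
        return self._rng.choice(candidates)
```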

If the target slave node detects that its current working state is abnormal, it obtains the current node container data and stores the current node container data in the key-value database in the target master node that corresponds to the target slave node.

In this embodiment, when the current working state of the target slave node is abnormal, and before it has been restarted for troubleshooting, the target master node can obtain the current node container data of the target slave node again and store it in the key-value database in the target master node that corresponds to the target slave node as a backup. In this key-value database, a persistent volume claim (PVC) and a persistent volume (PV) are created for each user. When the user stops working in and closes the target container instance on the target slave node, or exits the target container instance because of a failure, the current node container data produced by the user's operations on the target container instance before that close or exit is automatically stored in the key-value database in the target master node that corresponds to the target slave node as a backup. When the user next enters the target container instance, the artificial intelligence experiment cloud platform automatically retrieves the current node container data from the key-value database and restores the target container instance to its state at the last exit, so that the user can continue operating on the target container instance after re-entering.

Because all node container data of each container is persisted to a dynamically created persistent volume, the artificial intelligence experiment cloud platform can automatically save the historical node container data to the persistent volume whether the user exits the container instance deliberately or is forced out by a failure, so that after re-entering the container the user can load the historical node container data and continue operating on the container instance.
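The "save on every exit, restore on re-entry" behaviour maps naturally onto a context manager, sketched below. The `ContainerSession` class is hypothetical, and a plain dictionary stands in for the per-user persistent volume; the point is only that the save runs for both a normal close and an exit caused by an error.

```python
from typing import Any, Dict


class ContainerSession:
    """Toy session that restores state on entry and persists it on any exit."""

    def __init__(self, user: str, volumes: Dict[str, Dict[str, Any]]) -> None:
        self.user = user
        self._volumes = volumes  # stand-in for per-user persistent volumes
        # Restore the state saved at the previous exit, if there is one.
        self.state: Dict[str, Any] = dict(volumes.get(user, {}))

    def __enter__(self) -> "ContainerSession":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Runs for a normal close and for an exit caused by an exception.
        self._volumes[self.user] = dict(self.state)
        return False  # do not swallow the failure itself
```

A short usage example, with an invented `epoch` field as the experiment state:

```python
volumes = {}
with ContainerSession("s001", volumes) as session:
    session.state["epoch"] = 3
with ContainerSession("s001", volumes) as session:
    resumed_epoch = session.state["epoch"]  # state from the previous session
```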

S104. The target master node sends a container operation instruction to the target slave node.

In this embodiment, besides operating on artificial intelligence experiment data inside a container on the target slave node, a user may also access the target master node (for example, by logging in to the target master node with an administrator-authority user account) and trigger a container operation instruction for operations such as adding or deleting a container. Once the container operation instruction for adding or deleting a container has been triggered, the target master node sends it to the target slave node.

S105. If the target slave node receives the container operation instruction, it creates or deletes a container according to the container operation instruction.

In this embodiment, the artificial intelligence experiment cloud platform can add or remove cluster nodes at any time based on container operation instructions, so that Pods can be conveniently scaled out and in through the cluster controller. More specifically, when resources on the target slave node are tight, more slave nodes are added and containers are created on them to expand capacity.
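The scale-out behaviour can be illustrated with a first-fit placement that adds a slave node whenever every existing node is full. The fixed per-node container capacity, the node naming scheme and the class itself are assumptions for the sketch; the application does not define how "tight" resources are measured.

```python
from typing import Dict, List


class Cluster:
    """Toy cluster that adds slave nodes when existing ones are full."""

    def __init__(self, node_capacity: int) -> None:
        if node_capacity < 1:
            raise ValueError("node_capacity must be at least 1")
        self.node_capacity = node_capacity
        self.nodes: Dict[str, List[str]] = {}  # slave node -> container names

    def add_node(self) -> str:
        """Join a new, empty slave node to the cluster."""
        name = f"slave-{len(self.nodes) + 1}"
        self.nodes[name] = []
        return name

    def create_container(self, container: str) -> str:
        """Place on the first node with free capacity, scaling out if none has any."""
        for name, containers in self.nodes.items():
            if len(containers) < self.node_capacity:
                containers.append(container)
                return name
        name = self.add_node()
        self.nodes[name].append(container)
        return name

    def delete_container(self, container: str) -> bool:
        """Remove a container wherever it runs; False if it is unknown."""
        for containers in self.nodes.values():
            if container in containers:
                containers.remove(container)
                return True
        return False
```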

With this method, artificial intelligence experiment tasks can be processed in the cloud on the artificial intelligence experiment cloud platform, and nodes can be added to or removed from the cluster at any time, which improves the availability and load capacity of the cluster.

An embodiment of the present application further provides a high-availability cluster architecture artificial intelligence experiment cloud platform data processing system, which is used to execute any embodiment of the foregoing high-availability cluster architecture artificial intelligence experiment cloud platform data processing method. Specifically, refer to FIG. 3, which is a schematic block diagram of the high-availability cluster architecture artificial intelligence experiment cloud platform data processing system 100 provided by an embodiment of the present application.

As shown in FIG. 3, the high-availability cluster architecture artificial intelligence experiment cloud platform data processing system 100 includes multiple master nodes 101 and multiple slave nodes 102. The target slave node is any one of the multiple slave nodes 102, and the target master node is the currently active master node among the multiple master nodes 101.

The target slave node is configured to create a target container according to an experimental work task deployment instruction if it receives that instruction from the target master node; the target slave node is any one of the multiple slave nodes, and the target master node is the currently active master node among the multiple master nodes.

In one embodiment, the target master node is further configured to establish a communication connection with the Kubelet agent component of the target slave node through an interface module in the target master node.

In one embodiment, the target slave node is further configured as follows:

The target slave node is further configured to connect the target container instance corresponding to the target container to the user terminal if it receives an access request from the user terminal and the request passes verification.

The target slave node is further configured to receive target operation data from the user terminal and store the target operation data in the key-value database corresponding to the target container.

In one embodiment, the target master node is further configured so that, if its current working state is detected to be abnormal, one master node is randomly selected from the remaining master nodes to serve as the target master node.

In one embodiment, the target slave node is further configured to obtain the current node container data and store it in the key-value database in the target master node that corresponds to the target slave node if it detects that its current working state is abnormal.

The target master node is configured to send a container operation instruction to the target slave node.

The target slave node is further configured to create or delete a container according to the container operation instruction if it receives the container operation instruction.

With this system, artificial intelligence experiment tasks can be processed in the cloud on the artificial intelligence experiment cloud platform, and nodes can be added to or removed from the cluster at any time, which improves the availability and load capacity of the cluster.

The above high-availability cluster architecture artificial intelligence experiment cloud platform data processing system can be implemented in the form of a computer program, and the computer program can run on a computer device as shown in FIG. 4.

Refer to FIG. 4, which is a schematic block diagram of a computer device provided by an embodiment of the present application. The computer device 500 is a server, and may also be a server cluster. The server may be an independent server, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (Content Delivery Network, CDN), and big data and artificial intelligence platforms.

Referring to FIG. 4, the computer device 500 includes a processor 502, a memory and a network interface 505 connected through a device bus 501, where the memory may include a storage medium 503 and an internal memory 504.

The storage medium 503 can store an operating system 5031 and a computer program 5032. When executed, the computer program 5032 can cause the processor 502 to perform the high-availability cluster architecture artificial intelligence experiment cloud platform data processing method.

The processor 502 provides computing and control capabilities and supports the operation of the entire computer device 500.

The internal memory 504 provides an environment for running the computer program 5032 stored in the storage medium 503. When the computer program 5032 is executed by the processor 502, it can cause the processor 502 to perform the high-availability cluster architecture artificial intelligence experiment cloud platform data processing method.

The network interface 505 is used for network communication, such as transmitting data. Those skilled in the art will understand that the structure shown in FIG. 4 is only a block diagram of the part of the structure relevant to the solution of the present application, and does not limit the computer device 500 to which the solution is applied. A specific computer device 500 may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.

The processor 502 is used to run the computer program 5032 stored in the memory, so as to implement the high-availability cluster architecture artificial intelligence experiment cloud platform data processing method disclosed in the embodiments of the present application.

Those skilled in the art will understand that the embodiment of the computer device shown in FIG. 4 does not limit the specific composition of the computer device. In other embodiments, the computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components. For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments, the structure and function of the memory and the processor are the same as in the embodiment shown in FIG. 4 and are not described again here.

It should be understood that, in the embodiments of the present application, the processor 502 may be a central processing unit (Central Processing Unit, CPU), or may be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor.

Another embodiment of the present application provides a computer-readable storage medium. The computer-readable storage medium may be non-volatile or volatile. It stores a computer program which, when executed by a processor, implements the high-availability cluster architecture artificial intelligence experiment cloud platform data processing method disclosed in the embodiments of the present application.

Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the devices, apparatuses and units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here. Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate clearly the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms of their functions. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may implement the described functions in different ways for each specific application, but such implementations should not be considered to go beyond the scope of the present application.

In the several embodiments provided in the present application, it should be understood that the disclosed devices, apparatuses and methods may be implemented in other ways. For example, the apparatus embodiments described above are only illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation. Units with the same function may be merged into one unit, multiple units or components may be combined or integrated into another apparatus, and some features may be omitted or not performed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses or units, and may be electrical, mechanical or of another form.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present application.

In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, may each exist physically on their own, or two or more of them may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a storage medium. On this understanding, the technical solution of the present application in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a backend server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a magnetic disk or an optical disc.

The above is only a specific implementation of the present application, but the scope of protection of the present application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in the present application, and these modifications or replacements shall fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.

Claims

1. A high-availability cluster architecture artificial intelligence experiment cloud platform data processing method, applied to the artificial intelligence experiment cloud platform, characterized in that the artificial intelligence experiment cloud platform comprises a plurality of master nodes and a plurality of slave nodes, and the plurality of master nodes and the plurality of slave nodes are all communicatively connected; the method comprises:

If the target slave node receives an experimental work task deployment instruction sent by the target master node, the target slave node creates a target container according to the experimental work task deployment instruction; wherein the target slave node is any one of the plurality of slave nodes, and the target master node is the currently active master node among the plurality of master nodes;

If the target slave node receives an access request from the user terminal and the access request passes verification, the target slave node connects the target container instance corresponding to the target container to the user terminal;

The target slave node receives the target operation data of the user terminal, and stores the target operation data in a key-value database corresponding to the target container;

The target master node sends container operation instructions to the target slave node;

If the target slave node receives the container operation instruction, it creates or deletes a container correspondingly according to the container operation instruction.
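As an illustration only, the slave-node behaviour recited in claim 1 can be sketched in Python as follows. The class name `SlaveNode`, its fields, the token-based check standing in for access verification, and the in-memory dictionaries standing in for containers and the key-value database are all assumptions made for this sketch; the claim does not specify them.

```python
from dataclasses import dataclass, field


@dataclass
class SlaveNode:
    """Hypothetical in-memory model of a target slave node (not from the source)."""

    containers: dict = field(default_factory=dict)  # container id -> image name
    kv_store: dict = field(default_factory=dict)    # container id -> {key: value}
    valid_tokens: set = field(default_factory=set)  # stand-in for access verification

    def deploy(self, container_id: str, image: str) -> None:
        # Deployment instruction from the master: create the target container
        # and make sure it has a key-value store of its own.
        self.containers[container_id] = image
        self.kv_store.setdefault(container_id, {})

    def connect(self, container_id: str, token: str) -> bool:
        # Connect the user terminal only if verification passes
        # and the requested container exists on this node.
        return token in self.valid_tokens and container_id in self.containers

    def store_operation(self, container_id: str, key: str, value: str) -> None:
        # Save the user's operation data in the container's key-value store.
        # Raises KeyError if the container was never deployed on this node.
        self.kv_store[container_id][key] = value

    def operate(self, action: str, container_id: str, image: str = "") -> None:
        # Container operation instruction from the master: create or delete.
        if action == "create":
            self.deploy(container_id, image)
        elif action == "delete":
            # The stored operation data is deliberately kept after deletion.
            self.containers.pop(container_id, None)
        else:
            raise ValueError(f"unknown container action: {action}")
```

In this sketch, deleting a container leaves its stored operation data in place; that is a design choice of the example, not something the claim requires.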

2. The high-availability cluster architecture artificial intelligence experiment cloud platform data processing method according to claim 1, characterized in that creating the target container according to the experimental work task deployment instruction comprises:

If the experimental work task deployment instruction is a unified experimental task deployment instruction, the target slave node obtains the first target image resource, GPU resource, and data storage volume path corresponding to the unified experimental task deployment instruction, and creates the target container according to that first target image resource, GPU resource, and data storage volume path;

If the experimental work task deployment instruction is a personalized container deployment instruction, the target slave node obtains the second target image resource corresponding to the personalized container deployment instruction, and creates the target container according to that second target image resource and a pre-stored data storage volume path.
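The two branches of claim 2 amount to a dispatch on the instruction type. The sketch below is one possible reading. The dictionary layout of the instruction, the `kind` values, the `ContainerSpec` type, and the `DEFAULT_VOLUME_PATH` constant are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the "pre-stored data storage volume path".
DEFAULT_VOLUME_PATH = "/data/experiments"


@dataclass(frozen=True)
class ContainerSpec:
    """Resources needed to create a target container (illustrative type)."""

    image: str
    volume_path: str
    gpu_count: int = 0


def build_spec(instruction: dict) -> ContainerSpec:
    """Map a deployment instruction to the resources used to create a container."""
    kind = instruction["kind"]
    if kind == "unified":
        # Unified task: image, GPU resource and volume path all come
        # from the instruction itself.
        return ContainerSpec(
            image=instruction["image"],
            volume_path=instruction["volume_path"],
            gpu_count=instruction["gpu_count"],
        )
    if kind == "personalized":
        # Personalized container: only the image comes from the instruction;
        # the volume path is the pre-stored one.
        return ContainerSpec(image=instruction["image"], volume_path=DEFAULT_VOLUME_PATH)
    raise ValueError(f"unknown deployment instruction kind: {kind}")
```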

3. The high-availability cluster architecture artificial intelligence experiment cloud platform data processing method according to claim 1, characterized in that, before the target slave node, upon receiving the experimental work task deployment instruction sent by the target master node, creates the target container according to the experimental work task deployment instruction, the method further comprises:

The interface module in the target master node establishes a communication connection with the Kubelet proxy component of the target slave node.

4. The high-availability cluster architecture artificial intelligence experiment cloud platform data processing method according to claim 3, characterized in that establishing the communication connection between the interface module in the target master node and the Kubelet proxy component of the target slave node comprises:

The Keepalived component of the target master node automatically configures the virtual IP address of the artificial intelligence experiment cloud platform through the Virtual Router Redundancy Protocol (VRRP);

The interface module of the target master node establishes a communication connection with the Kubelet proxy component of the target slave node based on the virtual IP address.

5. The high-availability cluster architecture artificial intelligence experiment cloud platform data processing method according to claim 1, characterized in that, after the target slave node receives the target operation data of the user terminal and stores the target operation data in the key-value database corresponding to the target container, the method further comprises:

If the target master node detects that its current working state is abnormal, one master node is randomly selected from the remaining master nodes as the target master node.
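The random re-selection in claim 5 can be sketched as below. The function name `elect_new_master`, the use of string identifiers for master nodes, and the optional `rng` parameter are assumptions for this example; the claim states only that the new target master node is chosen at random from the remaining master nodes.

```python
import random
from typing import Optional, Sequence


def elect_new_master(
    masters: Sequence[str],
    failed: str,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick a replacement target master at random from the remaining masters."""
    # Exclude the master whose working state was detected as abnormal.
    candidates = [m for m in masters if m != failed]
    if not candidates:
        raise RuntimeError("no standby master node is available")
    # A seeded random.Random can be passed in to make the choice reproducible.
    chooser = rng if rng is not None else random
    return chooser.choice(candidates)
```

A call such as `elect_new_master(["master-1", "master-2", "master-3"], failed="master-1")` returns either `"master-2"` or `"master-3"`.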

6. The high-availability cluster architecture artificial intelligence experiment cloud platform data processing method according to claim 1, characterized in that, after the target slave node receives the target operation data of the user terminal and stores the target operation data in the key-value database corresponding to the target container, the method further comprises:

If the target slave node detects that its current working state is abnormal, it obtains the current node container data and stores that data in the key-value database, in the target master node, that corresponds to the target slave node.
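A minimal sketch of the step in claim 6 follows. It assumes the health check result is already available as a boolean and that the master-side key-value database can be modelled as a dictionary keyed by slave node identifier. The function name and parameters are hypothetical.

```python
def report_on_failure(
    slave_id: str,
    healthy: bool,
    container_data: dict,
    master_kv: dict,
) -> bool:
    """Copy the slave's container data to the master's store when the slave is unhealthy.

    Returns True if a snapshot was written, False otherwise.
    """
    if healthy:
        return False
    # Store a shallow copy so later changes on the slave do not alter the snapshot.
    master_kv[slave_id] = dict(container_data)
    return True
```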

7. The high-availability cluster architecture artificial intelligence experiment cloud platform data processing method according to claim 2, characterized in that the target slave node obtaining the first target image resource, GPU resource, and data storage volume path corresponding to the unified experimental task deployment instruction comprises:

If the target slave node detects a unified experimental task deployment instruction, it obtains the teaching progress information and the teacher teaching tag set corresponding to the unified experimental task deployment instruction, and generates the first target image resource, GPU resource, and data storage volume path corresponding to the unified experimental task deployment instruction according to the teaching progress information, the teacher teaching tag set, and a preset resource calling strategy.

8. A high-availability cluster architecture artificial intelligence experiment cloud platform data processing system, running on the artificial intelligence experiment cloud platform, characterized in that it comprises a plurality of master nodes and a plurality of slave nodes, and the plurality of master nodes and the plurality of slave nodes are all communicatively connected; wherein the target slave node is any one of the plurality of slave nodes, and the target master node is the currently active master node among the plurality of master nodes;

The target slave node is configured to create a target container according to the experimental work task deployment instruction if it receives the experimental work task deployment instruction sent by the target master node; wherein the target slave node is any one of the plurality of slave nodes, and the target master node is the currently active master node among the plurality of master nodes;

The target slave node is further configured to connect the target container instance corresponding to the target container to the user terminal if an access request from the user terminal is received and passes verification;

The target slave node is further configured to receive target operation data of the user terminal, and store the target operation data in a key-value database corresponding to the target container;

The target master node is used to send container operation instructions to the target slave node;

The target slave node is further configured to correspondingly create or delete a container according to the container operation instruction if receiving the container operation instruction sent by the target master node.

9. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that, when the processor executes the computer program, the high-availability cluster architecture artificial intelligence experiment cloud platform data processing method according to any one of claims 1 to 7 is implemented.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the processor executes the high-availability cluster architecture artificial intelligence experiment cloud platform data processing method according to any one of claims 1 to 7.