
CN116089477B - Distributed Training Method and System - Google Patents


Info

Publication number
CN116089477B
CN116089477B (application CN202310374312.8A)
Authority
CN
China
Prior art keywords
cache
computing
data
data set
service
Prior art date
Legal status
Active
Application number
CN202310374312.8A
Other languages
Chinese (zh)
Other versions
CN116089477A (en)
Inventor
高礼
殷贞玲
Current Assignee
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Priority date
Filing date
Publication date
Application filed by Honor Device Co Ltd
Priority: CN202310374312.8A
Publication of CN116089477A
Application granted
Publication of CN116089477B
Legal status: Active


Classifications

    • G06F16/24552 Database cache management
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU], to service a request
    • G06F9/5083 Techniques for rebalancing the load in a distributed system
    • G06N20/00 Machine learning
    • G06F2209/5018 Thread allocation
    • G06F2209/5021 Priority
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application provides a distributed training method and system. The method includes: creating cache processes of a first cache service, corresponding to a first data set, on multiple cache nodes of a first cluster; creating first computing task processes, corresponding to a first computing task, on multiple first computing nodes of the first cluster, and setting the first cache service as the input of the first computing task, where the first computing nodes belong to a computing node group; determining, based on the first cache service, the first computing task, and the first computing nodes, whether to scale out; and, if scale-out is decided, creating a cache process of the first cache service on the first computing node, so that during training of the first computing task the first computing node reads data from its local cache process to train the first computing task process running on it. In this way, the data required for training can be read locally while a computing task is trained, which improves training speed.

Description

Distributed Training Method and System

Technical Field

The present application relates to the field of terminal devices, and in particular to a distributed training method and system.

Background

At present, a large number of machine learning tasks are trained in cloud-native container environments, which makes it convenient to use computing resources efficiently. The data sets required for training are usually stored in remote storage services such as OBS or NFS, so the training tasks on the computing nodes need to read remote data sets.

The training time of a training task consists of the time to read the training data and the time to compute on it. When a task's computation time is shorter than the remote data-reading time, the reading speed limits the training speed, and training is slow.

Summary of the Invention

To solve the above technical problem, the present application provides a distributed training method and system that can automatically scale the cache processes corresponding to the data set a computing task trains on out to the computing nodes where the task runs, so that the data required for training can be read locally during training, improving training speed.

In a first aspect, the present application provides a distributed training method. The method includes: creating cache processes of a first cache service, corresponding to a first data set, on multiple cache nodes of a first cluster, where the first data set resides in a remote database outside the first cluster; creating first computing task processes, corresponding to a first computing task, on multiple first computing nodes of the first cluster, and setting the first cache service as the input of the first computing task, where the first computing nodes belong to a computing node group; determining, based on the first cache service, the first computing task, and the first computing nodes, whether to scale out; and, if scale-out is decided, creating a cache process of the first cache service on the first computing node, so that during training of the first computing task the first computing node reads data from its local cache process to train the first computing task process running on it. In this way, the cache processes corresponding to the data set required by a computing task can be automatically scaled out to the computing nodes where the task runs, so that the data required for training can be read locally, improving training speed.

According to the first aspect, determining whether to scale out based on the first cache service, the first computing task, and the first computing nodes includes: obtaining first feature data corresponding to the first cache service, second feature data corresponding to the first computing task, and third feature data corresponding to the computing node group; extracting a first feature vector from the first feature data, a second feature vector from the second feature data, and a third feature vector from the third feature data; combining the first, second, and third feature vectors into a combined feature vector; and feeding the combined feature vector into a trained scale-out decision model, which outputs the decision of whether to scale out.
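As a sketch of this pipeline, the following Python fragment concatenates three feature vectors and feeds them to a stand-in binary classifier. The patent does not specify the features or the model internals, so every field name, the `ScaleOutModel` class, and its weights are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical feature extractors for the three feature-data sources named
# in the claim: the cache service, the computing task, and the node group.
def extract_cache_features(stats: dict) -> List[float]:
    return [stats["total_size_gb"], stats["file_count"], stats["workers"]]

def extract_task_features(task: dict) -> List[float]:
    return [task["priority"], task["gpus"], task["mem_gb"]]

def extract_node_features(nodes: dict) -> List[float]:
    return [nodes["free_ssd_gb"], nodes["free_mem_gb"]]

@dataclass
class ScaleOutModel:
    """Stand-in for the trained classification model; a real system would
    load learned weights rather than use these illustrative ones."""
    weights: List[float]
    bias: float

    def predict(self, features: List[float]) -> bool:
        score = sum(w * x for w, x in zip(self.weights, features)) + self.bias
        return score > 0  # binary decision: scale out or not

def decide_scale_out(model: ScaleOutModel, cache_stats: dict,
                     task: dict, nodes: dict) -> bool:
    # Concatenate the three feature vectors into the combined feature vector.
    combined = (extract_cache_features(cache_stats)
                + extract_task_features(task)
                + extract_node_features(nodes))
    return model.predict(combined)
```

In practice the classifier would be trained offline on historical scale-out outcomes; the linear scoring here only marks where that model plugs in.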

According to the first aspect, the scale-out decision model is a classification model.

According to the first aspect, the first feature data includes statistical information, cache setting information, and cache application information of the first data set.

According to the first aspect, the statistical information of the first data set includes its total file size, total number of files, and file format; the cache setting information of the first data set includes the cache capacity, the cache medium, and the number of cache processes; and the cache application information of the first data set includes the number of computing tasks that use the cache of the first data set and the historical information of those tasks.

According to the first aspect, the second feature data includes any one or more of the following: task priority, user information, requested central processing unit (CPU) resources, requested graphics processing unit (GPU) resources, requested memory resources, information about the input data used, the corresponding algorithm type, and historical execution information.

According to the first aspect, the third feature data includes any one or more of the following: the free CPU, GPU, memory, and solid-state drive resources that each computing node can still allocate; the CPU, GPU, memory, and solid-state drive resources that each computing node has already allocated; and the network topology in which each computing node resides.

According to the first aspect, creating cache processes of the first cache service, corresponding to the first data set, on multiple cache nodes of the first cluster includes: receiving a first cache service creation request; obtaining the data volume of the first data set; if the data volume of the first data set is smaller than a data volume threshold, setting the cache capacity of each cache process of the first cache service equal to the data volume of the first data set; setting a cache initialization label and a cache service label for a first cache service resource corresponding to the first data set; sending a first instruction, carrying the first cache service resource, to the first cluster; creating, according to the first instruction, cache processes of the first cache service on multiple cache nodes in the first cluster that carry the cache initialization label; and loading the data of the first data set into the cache processes.
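A minimal Python sketch of this creation flow, assuming a mocked cluster API (`create_workers`) in place of the k8s-style API interface service; the `CacheService` resource shape and the function names are invented for illustration, while the label values follow the embodiment described later.

```python
# Sketch of the cache-service creation flow from the claim above. The label
# values ("cpu" for cache initialization, a per-service cache-id) follow the
# embodiment; the cluster API is a mock, since the real flow goes through a
# k8s-style API interface service, controllers, and schedulers.

def create_cache_service(cluster_api, cache_id: str, dataset, capacity_gb: float):
    resource = {
        "kind": "CacheService",
        "labels": {
            "init": "cpu",          # cache-initialization label: place workers on cache nodes
            "cache-id": cache_id,   # identifies this particular cache service
        },
        "worker_capacity_gb": capacity_gb,
    }
    # The first instruction carries the cache-service resource to the cluster,
    # which creates cache process workers on nodes holding the init label...
    workers = cluster_api.create_workers(resource)
    # ...and the first data set is then loaded into those cache processes.
    for worker in workers:
        worker.load(dataset)
    return workers
```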

According to the first aspect, determining whether to scale out based on the first cache service, the first computing task, and the first computing nodes includes: determining to scale out if the available storage resources of the first computing node exceed the data volume of the first data set.

According to the first aspect, determining whether to scale out based on the first cache service, the first computing task, and the first computing nodes includes: obtaining the priority of the first computing task; and determining to scale out if that priority is higher than a preset level.

According to the first aspect, determining whether to scale out based on the first cache service, the first computing task, and the first computing nodes includes: obtaining the historical training speed of the algorithm of the first computing task; and determining to scale out if that historical training speed is lower than a preset speed value.
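The three rule-based variants above can be sketched as a single predicate. The patent states each rule as a separate implementation, so OR-ing them together here, like the threshold parameters, is an illustrative choice rather than the claimed method.

```python
def should_scale_out(free_storage_gb: float, dataset_gb: float,
                     priority: int, priority_threshold: int,
                     hist_speed: float, speed_threshold: float) -> bool:
    """Combines the three scale-out heuristics from the dependent claims."""
    if free_storage_gb > dataset_gb:   # enough local storage to hold the data set
        return True
    if priority > priority_threshold:  # high-priority task deserves local cache
        return True
    if hist_speed < speed_threshold:   # algorithm historically trains slowly
        return True
    return False
```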

According to the first aspect, each cache process stores all of the data of the first data set.

In a second aspect, the present application provides a distributed training system. The system includes a control node and a first cluster. The control node is configured to: create cache processes of a first cache service, corresponding to a first data set, on multiple cache nodes of the first cluster, where the first data set resides in a remote database outside the first cluster; create first computing task processes, corresponding to a first computing task, on multiple first computing nodes of the first cluster, and set the first cache service as the input of the first computing task, where the first computing nodes belong to a computing node group; determine, based on the first cache service, the first computing task, and the first computing nodes, whether to scale out; and, if scale-out is decided, create a cache process of the first cache service on the first computing node. The first computing node in the first cluster is configured to read data, during training of the first computing task, from the cache process on the first computing node to train the first computing task process running on it.

In a third aspect, the present application provides an electronic device, including a memory and a processor coupled to each other. The memory stores program instructions that, when executed by the processor, cause the electronic device to perform the distributed training method of any implementation of the first aspect.

In a fourth aspect, the present application provides a computer-readable storage medium including a computer program that, when run on an electronic device, causes the electronic device to perform the distributed training method of any implementation of the first aspect.

Brief Description of the Drawings

FIG. 1 is a diagram showing an example of a system structure related to distributed training;

FIG. 2 is a diagram showing an example of the deployment of a distributed cache service related to distributed training;

FIG. 3 is a diagram showing an example of the cache service creation process;

FIG. 4 is a diagram showing an example of the cache service application process;

FIG. 5 is a diagram showing an example of the cache service scale-out process;

FIG. 6 is a schematic diagram of the process by which the scale-out decision model turns its input data into a scale-out decision;

FIG. 7 is a diagram showing examples of the scale-out status and reading modes of different data sets.

Detailed Description

The technical solutions in the embodiments of the present application will be described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.

In this document, the term "and/or" merely describes an association between associated objects, indicating that three relationships are possible; for example, "A and/or B" can mean: A alone, both A and B, or B alone.

The terms "first", "second", and the like in the specification and claims of the embodiments of the present application are used to distinguish different objects, not to describe a specific order of objects. For example, a first target object and a second target object are different target objects, not target objects in a specific order.

In the embodiments of the present application, words such as "exemplary" or "for example" indicate an example, illustration, or explanation. Any embodiment or design described as "exemplary" or "for example" should not be interpreted as preferred over, or more advantageous than, other embodiments or designs; rather, such words are intended to present related concepts in a concrete manner.

In the description of the embodiments of the present application, unless otherwise specified, "multiple" means two or more. For example, multiple processing units means two or more processing units, and multiple systems means two or more systems.

FIG. 1 shows an example of a system structure related to distributed training. Referring to FIG. 1, in this embodiment the system may include cluster 1, a remote server, and a control node. A management service, which is an application on the control node device, is deployed on the control node. The remote server stores data sets; for convenience, the data sets stored on the remote server are referred to herein as remote data sets.

Cluster 1 is located in a cloud-native environment, that is, a Kubernetes (k8s) cluster. The remote server and the control node are both machines outside of k8s.

It should be noted that although FIG. 1 shows only one data set on the remote server, each remote server can store multiple data sets, not just one.

Similarly, although FIG. 1 shows only one remote server, the system may contain multiple remote servers.

It should be noted that the control node may be located inside or outside cluster 1.

It should be understood that the management service can be deployed on multiple nodes, and each node on which it is deployed can act as a control node. The management service may follow a microservice architecture.

Referring again to FIG. 1, cluster 1 includes a computing node group, a cache-initialization node group, and other node groups. The computing node group may include multiple computing nodes, and the cache-initialization node group may include multiple cache nodes. These node groups are partitioned in advance by the administrator. Nodes in the computing node group are called computing nodes, and each carries a GPU label; nodes in the cache-initialization node group are called cache nodes, and each carries a cpu label, which is called the cache-initialization label. A node can therefore be identified as a computing node or a cache node by whether its label is GPU or cpu. It should be noted that the cpu label merely indicates that a node belongs to the cache-initialization node group, and the GPU label merely indicates that a node belongs to the computing node group; here cpu and GPU are only label names. In other embodiments, the cache-initialization node group may use a label other than cpu, and the computing node group may use a label other than GPU.
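The label-based distinction can be sketched as a simple filter over node metadata; the node names and the `labels` sets below are made up for illustration, and a real cluster would query this through the k8s API rather than a dictionary.

```python
# Toy node inventory: the "GPU"/"cpu" label names follow the embodiment;
# "cache-1" stands for a cache-service label already set on one cache node.
NODES = {
    "node-a": {"labels": {"GPU"}},            # computing node
    "node-b": {"labels": {"cpu"}},            # cache node (cache-initialization group)
    "node-c": {"labels": {"cpu", "cache-1"}}, # cache node already hosting a cache service
}

def nodes_with_label(nodes: dict, label: str) -> list:
    """Return node names carrying the given label, sorted for determinism."""
    return sorted(name for name, n in nodes.items() if label in n["labels"])

computing_nodes = nodes_with_label(NODES, "GPU")
cache_nodes = nodes_with_label(NODES, "cpu")
```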

Each computing node runs a cache client daemon. When the daemon detects that its node carries the cache client label client-id, it creates a cache client on that node.

API interface services, controllers, schedulers, and a database are deployed in the other node groups. Devices outside cluster 1, such as the control node, can call the controllers, schedulers, and so on in the cloud-native (k8s) environment through the API interface service to control the nodes in cluster 1 and have them perform corresponding operations. The management service on the control node can therefore control the nodes in cluster 1 by sending instructions or information to the API interface service in the other node groups.

FIG. 2 shows an example of the deployment of a distributed cache service related to distributed training. The distributed training process is described below with reference to FIG. 1 and FIG. 2, and to the subsequent FIG. 3, FIG. 4, and FIG. 5.

First, the creation of the distributed cache service is described.

Referring to FIG. 1, process 1 represents the creation of a cache service and includes the following steps:

1.1. The management service on the control node sends instruction 1, for creating a cache service resource, to the API interface service. Instruction 1 carries the cache service resource, which is tagged with the cache-initialization label cpu and the cache service label cache-id.

Each cache service corresponds to one cache-id, and different cache services have different cache-ids. The cache-initialization label cpu indicates that the cache service should be created on nodes in the cache-initialization node group; the scheduler and controllers in the cluster use it to create the cache service on nodes carrying the same label. The cache service label cache-id indicates which cache service to create.

1.2. According to instruction 1, the API interface service calls the controllers, schedulers, and so on to create cache process workers of the cache service on multiple cache nodes in the cache-initialization node group.

It should be noted that although FIG. 1 shows only one cache node on which a cache process worker has been created, in practice cache process workers can be created on multiple cache nodes.

Which cache nodes in the cache-initialization node group receive cache process workers is determined by the scheduler in the other node groups according to a preset scheduling policy. After a cache process worker is created on a cache node, the controller sets the cache-id label on that node to indicate that it now hosts the cache service labeled cache-id. A cache node without a cache process worker carries the cache-initialization label cpu but no cache-id label.
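This placement-and-tagging step can be sketched as follows, with a trivial "first N candidates" rule standing in for the preset scheduling policy, which the patent does not specify; node names and data structures are illustrative.

```python
def schedule_cache_workers(nodes: dict, cache_id: str, worker_count: int) -> list:
    """Pick cache nodes for new workers, then tag them with the cache-id label."""
    # Candidates: nodes in the cache-initialization group ("cpu" label)
    # that do not yet host this cache service.
    candidates = sorted(name for name, n in nodes.items()
                        if "cpu" in n["labels"] and cache_id not in n["labels"])
    chosen = candidates[:worker_count]       # stand-in for the preset policy
    for name in chosen:
        nodes[name]["labels"].add(cache_id)  # controller tags the node
    return chosen

nodes = {
    "cache-1": {"labels": {"cpu"}},
    "cache-2": {"labels": {"cpu"}},
    "cache-3": {"labels": {"cpu"}},
}
placed = schedule_cache_workers(nodes, "cache-id-7", 2)
```

After this call, the two tagged nodes host the service, while the untagged node keeps only the cpu label, matching the description above.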

1.3,预加载远程服务器中数据集1的原始数据到缓存节点的缓存进程worker中。1.3, preload the original data of dataset 1 in the remote server into the cache process worker of the cache node.

这样,数据集1就被缓存到了缓存节点的缓存进程worker中,集群1中的计算节点在训练计算任务的过程中,无需从远程服务器中读取数据,而是从缓存节点的缓存进程worker中读取数据。In this way, data set 1 is cached in the cache process worker of the cache node. During the process of training computing tasks, the computing nodes in cluster 1 do not need to read data from the remote server, but read data from the cache process worker of the cache node. read data.

1.1至1.3的详细过程请参见图3所示的流程。For the detailed process of 1.1 to 1.3, please refer to the process shown in Figure 3.

图3为示例性示出的缓存服务创建过程的示例图。请参见图3,本实施例中,缓存服务创建过程可以包括如下步骤:Fig. 3 is an exemplary diagram showing a process of creating a cache service. Referring to Fig. 3, in this embodiment, the cache service creation process may include the following steps:

S301. The management service in the control node receives a cache service creation request.

In an application, the user may issue a cache service creation request to the management service by clicking the option for creating a cache service in the management service.

S302. In response to the user's operation of selecting a target data set, the management service acquires information about the target data set, where the information includes the data volume of the target data set.

In this embodiment, the target data set is data set 1 in Fig. 1.

The user may select the target data set from the current data set list. The data set list may include all data sets on the remote servers to which the control node is currently connected. In addition, the database of the management service of the control node stores information about all data sets on the remote servers, including the data volume of each data set. In one example, the data set information stored in the database of the management service may be entered by the user in advance.

For example, suppose the data set list includes five data sets: data set 1, data set 2, data set 3, data set 4, and data set 5, and the data set on the remote server in Fig. 2 is data set 1. When the user selects data set 1 from the list, data set 1 becomes the target data set, and the management service obtains information about data set 1 from the database, including the data volume of data set 1.

S303. The management service determines whether the data volume of the target data set is smaller than a data volume threshold; if so, step S305 is performed; otherwise, step S304 is performed.

In one example, the data volume threshold may be set according to the hard disk storage capacity of a node. For example, the data volume threshold may be set to 60% of the hard disk storage capacity.

In this embodiment, the management service sets the cache capacity of a single cache process worker in the cache service according to the data volume of the target data set and the data volume threshold.

A cache service may include one or more cache process workers. The number of cache process workers included in a cache service may be set according to the data volume of the data set and the data volume threshold.

S304. Set the cache capacity of a single cache process worker equal to the data volume threshold, disable elastic scheduling, and perform step S306.

When the data volume of the target data set is greater than the data volume threshold, the data set is too large for a single worker. In this case, multiple cache process workers are used to cache all the original data of the target data set: the cache capacity of a single cache process worker is set equal to the data volume threshold, and the number of cache process workers equals the data volume of the target data set divided by the data volume threshold. When the division is not exact, the number of cache process workers is obtained by rounding up.

When the data volume of the target data set is greater than the data volume threshold, elastic scheduling is not needed to cache the data of the data set locally on the computing nodes, so elastic scheduling is disabled. The management service may set a status flag for each cache process worker, the status flag indicating whether elastic scheduling is enabled. For example, in one example, a status flag of 1 indicates that elastic scheduling of the cache process worker is enabled and the worker may be scaled out to a computing node, while a status flag of 0 indicates that elastic scheduling of the cache process worker is disabled and the worker cannot be scaled out to a computing node.

S305. Set the cache capacity of a single cache process worker equal to the data volume of the target data set, and enable elastic scheduling.

When the data volume of the target data set is less than or equal to the data volume threshold, one cache process worker can cache all the original data of the target data set, and the cache capacity of a single cache process worker is set equal to the data volume of the target data set. In one example, considering both resource utilization and data reading speed, the initial number of cache process workers may be set to 2. In this way, the impact of a possible single point of failure can be withstood while storage resources are saved as much as possible.

Suppose the data volume of data set 1 in Fig. 2 is smaller than the data volume threshold. Then two cache process workers are set for data set 1, the cache capacity of a single cache process worker is set equal to the data volume of data set 1, and elastic scheduling of the cache process workers is enabled. In this way, the cache process workers of data set 1 can be elastically scheduled.
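The capacity decision of S303 to S305 can be sketched as follows. The function name and return structure are illustrative; only the branch logic (the threshold comparison, rounding up, and the default of 2 workers for a small data set) comes from this embodiment.

```python
import math

def plan_cache_workers(dataset_size: int, threshold: int) -> dict:
    """Sketch of S303-S305: choose the worker count, per-worker cache
    capacity, and whether elastic scheduling is enabled."""
    if dataset_size <= threshold:
        # Small data set: one worker can hold it all; start 2 replicas
        # to tolerate a single point of failure (per S305).
        return {"workers": 2, "capacity": dataset_size, "elastic": True}
    # Large data set: shard across workers whose capacity equals the
    # threshold, rounding up when the division is not exact (per S304).
    return {
        "workers": math.ceil(dataset_size / threshold),
        "capacity": threshold,
        "elastic": False,
    }
```

For example, a 250 GB data set against a 100 GB threshold yields three workers of 100 GB each with elastic scheduling disabled.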

S306. Set affinity labels for the cache service resources.

The affinity labels in this step include the cache service label cache-id and the cache initialization label cpu. The cache service label cache-id controls the creation of cache process workers, and the cache initialization label controls the initial scheduling of cache process workers to the cache initialization node group. The affinity weight of the initialization label may be set lower than that of the cache service label cache-id, so that nodes bearing the cache service label cache-id are scheduled to preferentially.
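As a rough illustration of the weighted-affinity idea (not the actual k8s scheduler), candidate nodes can be scored so that the cache service label outweighs the initialization label. The concrete weight values below are assumptions.

```python
def score_node(node_labels: set, *, cache_id: str) -> int:
    """Illustrative weighted-affinity scoring: a node bearing the cache
    service label cache-id outranks one bearing only the cpu
    initialization label. Weights 100/10 are illustrative."""
    score = 0
    if cache_id in node_labels:
        score += 100   # cache service label: higher affinity weight
    if "cpu" in node_labels:
        score += 10    # initialization label: lower affinity weight
    return score

def pick_node(nodes: dict, cache_id: str) -> str:
    """nodes maps node name -> set of labels; return the preferred node."""
    return max(nodes, key=lambda n: score_node(nodes[n], cache_id=cache_id))
```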

Steps S301 to S306 are completed in the management service before step 1.1 in Fig. 1. After step S306, the management service generates instruction 1 of step 1.1 in Fig. 1 and performs step 1.1, so that cluster 1 receives instruction 1.

S307. Create the cache process workers.

In this step, the k8s scheduler and controller in cluster 1 create the cache process workers according to instruction 1. The scheduler performs node scheduling (that is, determines on which nodes the cache service is created), and the controller creates the cache process workers on the corresponding nodes according to the scheduling result (that is, the nodes selected by the scheduler).

Referring to Fig. 2, suppose the result of this creation is that cache process worker1 and cache process worker2 are created on the two cache nodes (that is, cache node 1 and cache node 2 in Fig. 2), respectively.

S308. Start a preloading task to load the original data of the target data set into the cache process workers.

A timed task of the management service queries k8s as to whether the cache processes have been created; once creation is complete, k8s starts the data preloading task.
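The timed check can be sketched as a simple polling loop. Here `is_ready` and `start_preload` are hypothetical callables standing in for the k8s query and the preload trigger, and the interval and timeout values are assumptions.

```python
import time

def wait_then_preload(is_ready, start_preload, *, interval_s=5.0, timeout_s=600.0):
    """Sketch of the management service's timed task (S308): poll until
    the cache workers exist, then trigger the data preloading task."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_ready():           # e.g. query k8s for worker pod status
            start_preload()      # creation complete: start preloading
            return True
        time.sleep(interval_s)
    return False                 # workers never became ready in time
```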

Through the embodiment shown in Fig. 3, the user may create multiple cache services; after creation, these cache services can be displayed in a cache service list. The cache service list may be displayed on the interface for creating a computing task, so that the user can select a cache service from the list when creating a computing task.

Next, the creation process of a computing task (corresponding to the cache service application process) is described.

Continuing with Fig. 1, process 2 therein represents creating a computing task. Process 2 may include the following steps:

2.1. The management service in the control node sends instruction 2 for creating computing task resources to the api interface service, where instruction 2 carries cache service information and the computing task resources.

The cache service information may include the cache service label cache-id.

2.2. Create computing task processes on the computing nodes.

On which computing nodes of the computing node group the computing task processes are created is determined by the scheduler in the other node group according to the resource status of each computing node and a preset scheduling policy.

2.3. Set the cache client label client-id for the computing nodes where the computing task processes are located.

The function of the client label is to control client creation. When a node is given the client label, the cache client daemon of the management service creates a corresponding cache client on that node.

2.4. The cache client daemon on a computing node for which the cache client label client-id has been set creates a cache client on that computing node after detecting that the label has been set.

The cache client is used to read data from the cache process workers.

Each cache client corresponds to all cache process workers of one cache service; that is, each cache client can read data from all cache process workers of that cache service.
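A minimal sketch of a client that can read from every worker of one cache service is given below. The fall-through lookup strategy is an assumption; this embodiment only states that each client corresponds to all workers of a service.

```python
class CacheClient:
    """Sketch of a cache client bound to all workers of one cache
    service. Workers are modeled as plain dicts for illustration."""

    def __init__(self, workers):
        self.workers = workers   # every worker of the same cache service

    def read(self, key):
        # Try each worker of the service until one holds the data item.
        for worker in self.workers:
            if key in worker:
                return worker[key]
        raise KeyError(key)      # not cached by any worker
```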

For the flow corresponding to steps 2.1 to 2.4, refer to Fig. 4.

Fig. 4 is an exemplary diagram of a cache service application process. Referring to Fig. 4, in this embodiment, the cache service application process may include the following steps:

S401. The management service configures the cache service as the input of the computing task according to the user's operation of selecting the cache service when creating the computing task.

When a cache service is available, the user may select the cache service as input when creating a computing task.

For example, suppose three cache services are currently available: cache service 1, cache service 2, and cache service 3. When the user creates computing task X, any of the three cache services can be selected. Suppose the user selects cache service 1; the management service then configures cache service 1 as the input of computing task X.

Suppose cache service 1 is the cache service created in the embodiment shown in Fig. 3.

S402. After the management service submits the computing task to the api interface service, the scheduler schedules the computing task to computing nodes, that is, creates computing task processes on the computing nodes.

The management service sending instruction 2 for creating computing task resources to the api interface service is regarded as the management service submitting the computing task to the api interface service.

Referring to Fig. 2, suppose this computing task is scheduled to two computing nodes, that is, computing node 1 and computing node 2 in Fig. 2. Computing process 1 on computing node 1 and computing process 2 on computing node 2 run distributed tasks of the computing task; that is, the computing task is distributed to computing node 1 and computing node 2.

S403. The controller in cluster 1 periodically detects the usage of the cache service resources; if a computing node is using the cache service, the controller applies the cache client label client-id corresponding to that cache service to the node.

The database inside cluster 1 stores the usage of the cache resources, which may record, for example, which cache service each computing container uses. The controllers of the other node groups in cluster 1 then apply the cache client label client-id to the nodes where these computing containers are located, according to the detected usage of the cache service resources.

In this way, the computing nodes with computing tasks in Fig. 1 are set with the GPU label and the cache client label client-id.

S404. After perceiving that a computing node has the cache client label, the client daemon creates a cache client on that computing node.

S405. The computing task reads, through the cache client, the data in the cache process workers corresponding to the cache service.

Next, the expansion process of the cache service is described.

Continuing with Fig. 1, process 3 therein represents the expansion process. Process 3 may include the following steps:

3.1. The management service decides to expand, sets the cache service label cache-id for the computing nodes, and updates the corresponding number of cache process workers.

In this way, the computing nodes in Fig. 1 that carry both computing tasks and the cache service are set with the GPU label, the cache service label cache-id, and the cache client label client-id.

3.2. Expand the cache service to the computing nodes, that is, create cache process workers on the computing nodes.

3.3. Read data from the cache process workers on the cache nodes and save it into the cache process workers on the computing nodes.

For the flow corresponding to steps 3.1 to 3.3, refer to Fig. 5.

Fig. 5 is an exemplary diagram of a cache service expansion process. Referring to Fig. 5, in this embodiment, the cache service expansion process may include the following steps:

S501. With elastic scheduling enabled, the management service detects that a computing node using the cache service has no local cache process worker, and triggers an elastic scheduling task. The computing task on the computing node is computing task a.

Herein, computing tasks may also be referred to as training tasks; these tasks perform training using the data in the data sets.

S502. Obtain the input data required by the expansion decision model, input the data into the expansion decision model, and obtain a decision result on whether to expand capacity for computing task a.

The input data required by the expansion decision model may be determined according to the specific model. Based on historical data, the expansion decision model can infer whether to expand at present.

The expansion decision model may be a binary classification model, which can be trained on a training data set formed by collecting data of different scenarios based on human experience. The expansion decision model is a classification model in the field of machine learning; for example, it may be a decision tree, logistic regression, svm, or LGBM model. It is certainly not limited to these models, and other models may also be used as the expansion decision model.

The input data required by the expansion decision model may include the following:

(1) Cache service feature data related to the computing task, including but not limited to:

Statistics of the native data set: total file size, total number of files, and file format.

Cache setting information: cache capacity, cache medium (ram, ssd), and number of cache workers; the cache setting information may also be referred to as cache node details.

Cache application information: the number of computing tasks using the cache of the native data set, and the historical information of computing tasks that used the cache of the native data set. The cache application information may also be referred to as usage task details.

(2) Computing task feature data related to the computing task, including but not limited to:

Task priority, user information, requested cpu resources, requested gpu resources, requested memory resources, information about the input data used, the corresponding algorithm type, and historical execution information.

(3) Computing node group feature data related to the computing task, including but not limited to:

The idle cpu, gpu, memory, and solid-state disk resources each computing node can still allocate, the cpu, gpu, memory, and solid-state disk resources each computing node has already allocated, and the network topology in which each computing node is located. For the process of obtaining the expansion decision from the input data required by the expansion decision model, refer to Fig. 6. Fig. 6 is a schematic diagram exemplarily showing the process of obtaining the expansion decision from the input data required by the expansion decision model.

Referring to Fig. 6, after the above feature data is obtained, a cache service feature vector is extracted from the cache service feature data, a computing task feature vector is extracted from the computing task feature data, and a computing node group feature vector is extracted from the computing node group feature data.

Then, the cache service feature vector, the computing task feature vector, and the computing node group feature vector are combined into one combined feature vector, the combined feature vector is input into the trained expansion decision model, and the expansion decision model outputs the decision result on whether to expand.
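The flow of Fig. 6 can be sketched as follows, assuming a LightGBM/scikit-learn-style `predict` interface on the trained model; the interface and function name are assumptions.

```python
def decide_expansion(cache_vec, task_vec, node_vec, model):
    """Sketch of the Fig. 6 flow: concatenate the cache service, computing
    task, and computing node group feature vectors into one combined
    vector, then ask the trained binary classifier for a decision."""
    combined = list(cache_vec) + list(task_vec) + list(node_vec)
    return bool(model.predict([combined])[0])  # True -> expand
```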

The expansion decision model is a model that has already been trained.

The training process of the expansion decision model may include the following steps:

Construct a first classification model and set initial parameter values;

Obtain several groups of sample data, each group including a combined feature vector sample and corresponding decision result label data;

Train the first classification model with the several groups of sample data to obtain a trained first classification model, and use the trained first classification model as the trained expansion decision model.

The process of obtaining the combined feature vector samples in the sample data is the same as the aforementioned process of obtaining a combined feature vector from the input data required by the expansion decision model, and is not repeated here.

The process of training the first classification model with the several groups of sample data to obtain the trained first classification model may be as follows:

Determine the first classification model obtained after training on the previous group of sample data as the initial classification model corresponding to the current group of sample data;

Input the combined feature vector sample of the current group into the initial classification model to obtain the decision result output by the initial classification model, denoted as the output decision result;

Adjust the parameter values of the initial classification model according to the difference between the output decision result and the decision result label data of the current group, and use the classification model with the adjusted parameter values as the first classification model obtained after training on the current group of sample data;

Determine whether the convergence condition of training is met; if so, stop training and use the first classification model obtained after training on the current group as the trained expansion decision model; otherwise, continue training with the next group of sample data.

The first classification model corresponding to the first group of sample data is the constructed first classification model with the initial parameter values.
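A minimal sketch of this group-by-group training loop is shown below, using a single linear unit as a stand-in for the first classification model. The model form, learning rate, and zero-error convergence test are illustrative assumptions; only the loop structure (initial parameters, per-group prediction, label-difference update, convergence check) follows the steps above.

```python
def train_decision_model(samples, *, lr=0.1, max_epochs=100):
    """Each sample is (combined_feature_vector, label) with label 0 or 1.
    Returns the trained parameters (weights, bias)."""
    dim = len(samples[0][0])
    weights, bias = [0.0] * dim, 0.0          # initial parameter values

    for _ in range(max_epochs):
        errors = 0
        for vec, label in samples:            # one group of sample data
            score = sum(w * x for w, x in zip(weights, vec)) + bias
            pred = 1 if score > 0 else 0      # output decision result
            if pred != label:                 # differs from label data
                errors += 1
                step = lr * (label - pred)    # adjust parameter values
                weights = [w + step * x for w, x in zip(weights, vec)]
                bias += step
        if errors == 0:                       # convergence condition met
            break
    return weights, bias
```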

Of course, the above is merely an exemplary description of a training method and is not intended to limit this embodiment; this embodiment is not limited to the training methods listed above.

In this embodiment, by using a classification model from the field of machine learning to decide whether to expand the cache process workers to the computing nodes, the decision accuracy can be improved.

Of course, besides using a classification model, other methods may also be used to determine whether to expand; this embodiment places no restriction on how the expansion decision is made. For example, an expansion decision method based on preset rules may be used.

In one example, the preset rule may be: expand if the available storage resources of the computing node are greater than the data volume of the data set; otherwise, that is, if the available storage resources of the computing node are less than or equal to the data volume of the data set, do not expand.

In one example, the preset rule may be: expand if the priority of the computing task is higher than a preset level; otherwise, that is, if the priority of the computing task is lower than or equal to the preset level, do not expand.

In one example, the preset rule may be: expand if the historical training speed of the algorithm of the computing task is lower than a preset speed value; otherwise, that is, if the historical training speed is greater than or equal to the preset speed value, do not expand.
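The three example rules can be sketched as one decision function. How the rules combine (storage first, then priority or historical speed) is an assumption; the embodiment presents them as independent alternatives.

```python
def rule_based_decision(free_storage, dataset_size,
                        task_priority=None, min_priority=None,
                        train_speed=None, min_speed=None):
    """Sketch combining the three example preset rules; any rule whose
    inputs are not supplied is skipped."""
    if free_storage <= dataset_size:
        return False                       # not enough local storage
    if task_priority is not None and min_priority is not None:
        return task_priority > min_priority   # high-priority tasks expand
    if train_speed is not None and min_speed is not None:
        return train_speed < min_speed        # slow training -> expand
    return True                            # storage rule alone passed
```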

S503. Determine whether the decision result is to expand; if so, perform step S504; otherwise, perform step S507.

S504. Apply the cache service label cache-id to all computing nodes related to computing task a, count the total number n of nodes bearing the cache service label cache-id, set the number of cache process workers to be expanded equal to the node count n, and update the cloud-native resources.

The cloud-native resources are updated by submitting a request to the cloud-native service to update the number of cache process workers.
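S504 can be sketched as follows; `patch_workers` is a hypothetical callable standing in for the update request to the cloud-native service.

```python
def plan_expansion(nodes, task_nodes, cache_id, patch_workers):
    """Sketch of S504: label every node related to the task with the
    cache service label, count labeled nodes, and request that many
    workers. `nodes` maps node name -> set of labels."""
    for name in task_nodes:
        nodes[name].add(cache_id)          # apply cache service label
    n = sum(1 for labels in nodes.values() if cache_id in labels)
    patch_workers(n)                       # update cloud-native resources
    return n
```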

The process of steps S501 to S504 corresponds to the combination of the expansion decision and step 3.1 in Fig. 1.

S505. According to the cache service label cache-id, the scheduler creates cache process workers on the computing nodes bearing the cache service label cache-id.

Step S505 corresponds to step 3.2 in Fig. 1.

Referring to Fig. 2, the result of this creation is that cache process worker3 and cache process worker4 are created on the two computing nodes (that is, computing node 1 and computing node 2 in Fig. 2), respectively.

S506. The cache client of the computing process on the computing node reads data from the cache process workers of the cache nodes and caches the read data in the local cache process worker of the computing node; the procedure then ends.

Step S506 corresponds to step 3.3 in Fig. 1.

In this way, the computing nodes can subsequently read the data set data directly from their local cache process workers. Taking Fig. 2 as an example, during training, computing process 1 can read data directly from the local cache process worker3; likewise, computing process 2 can read data directly from the local cache process worker4. By elastically scheduling the cache service to the computing nodes, the computing tasks on the computing nodes can read data directly from the local node during distributed training, which increases the data reading speed and thus the distributed training speed of the computing tasks.

S507. The cache client of the computing process on the computing node reads data from the cache process workers of the cache nodes; the procedure then ends.

The difference in data reading speed between the expansion and no-expansion cases is illustrated below by comparison.

Fig. 7 is an exemplary diagram showing the expansion status and reading methods of different data sets. Referring to Fig. 7, among data set 1, data set 2, data set 3, and data set 4, data set 1 has its cache service workers expanded to the computing nodes; therefore, the computing tasks using data set 1 achieve local data reading on the computing nodes, and their training speed is fast. Data set 2 and data set 3 are not expanded and read data from other nodes in the cluster; therefore, the training speed of the computing tasks using data set 2 and data set 3 is slower.

It can be seen that, after expansion, the training speed is fast because the computing tasks can read data locally on the computing nodes.

The distributed training method of this embodiment can automatically expand the cache processes corresponding to the data set required for training a computing task to the computing nodes where the computing task is located, so that the data required for training can be read locally during training, which increases the training speed.

In particular, when the memory resources of the computing nodes are insufficient and disk resources must be used to cache data, this embodiment can adaptively expand the cache processes without changing the cloud-native scheduler.

Moreover, since the distributed training method of this embodiment can read the data required for training locally on the expanded computing nodes without accessing the remote storage service, it can also relieve the pressure on the remote storage service and avoid the performance degradation of the remote storage service caused by a large number of computing tasks reading from it.

Furthermore, since the distributed training method of this embodiment can read the data required for training locally on the expanded computing nodes without accessing the remote storage service, it also reduces the occupation of communication bandwidth and saves communication resources. During distributed training of a large model, parameter exchange between different nodes requires substantial bandwidth; remote data reading would occupy part of that bandwidth and degrade parameter exchange performance, which this embodiment avoids.

This embodiment also provides a distributed training system. The system includes a control node and a first cluster, wherein:

the control node is configured to:

create a cache process of a first cache service corresponding to a first data set in a plurality of cache nodes of the first cluster, where the first data set is a data set located in a remote database outside the first cluster;

create a first computing task process corresponding to a first computing task in a plurality of first computing nodes of the first cluster, and set the first cache service as the input of the first computing task, where the first computing nodes belong to a computing node group;

determine whether to expand capacity according to the first cache service, the first computing task, and the first computing node; and

if capacity expansion is determined, create a cache process of the first cache service in the first computing node;

the first computing node in the first cluster is configured to read data from the cache process on the first computing node during training of the first computing task, so as to complete training of the first computing task process on the first computing node.

For the first cluster, refer to the aforementioned cluster 1, that is, cluster 1 in FIG. 1.
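The control node's expansion decision described above can be sketched as follows. This is a minimal illustration only: the class names, fields, and thresholds (`CacheService`, `min_priority`, `min_speed`, and so on) are hypothetical assumptions, not the actual implementation; the rules mirror the criteria of claims 9 to 11 (available node storage versus data-set size, task priority, historical training speed).

```python
from dataclasses import dataclass

# Hypothetical, simplified models of the entities in the embodiment.
@dataclass
class CacheService:
    dataset_size_gb: float        # total size of the first data set

@dataclass
class ComputeTask:
    priority: int                 # priority level of the first computing task
    history_speed: float          # historical training speed of its algorithm

@dataclass
class ComputeNode:
    free_storage_gb: float        # storage resources still available on the node

def should_expand(cache: CacheService, task: ComputeTask, node: ComputeNode,
                  min_priority: int = 5, min_speed: float = 100.0) -> bool:
    """Rule-based expansion decision: expand when the node can hold the whole
    data set, or the task is high-priority, or its historical training speed
    falls below a threshold (illustrative thresholds)."""
    if node.free_storage_gb > cache.dataset_size_gb:
        return True
    if task.priority > min_priority:
        return True
    if task.history_speed < min_speed:
        return True
    return False

# If expansion is decided, the control node would then create a cache process
# of the first cache service on the compute node, so training reads data locally.
decision = should_expand(CacheService(dataset_size_gb=50.0),
                         ComputeTask(priority=3, history_speed=250.0),
                         ComputeNode(free_storage_gb=200.0))
print(decision)  # → True (the node has room for the full data set)
```

Claim 2 describes an alternative, model-based variant of the same decision, in which feature vectors extracted from the cache service, the task, and the node group are combined and fed to a trained classification model instead of fixed rules.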

An embodiment of the present application further provides an electronic device. The electronic device includes a memory and a processor, the memory is coupled to the processor, and the memory stores program instructions. When the program instructions are executed by the processor, the electronic device is caused to perform the distributed training method performed by the aforementioned electronic device.

It can be understood that, to implement the above functions, the electronic device includes corresponding hardware and/or software modules for performing each function. In combination with the algorithm steps of the examples described in the embodiments disclosed herein, the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods for each specific application to implement the described functions in combination with the embodiments, but such implementations should not be considered beyond the scope of the present application.

This embodiment further provides a computer storage medium. The computer storage medium stores computer instructions. When the computer instructions run on an electronic device, the electronic device is caused to perform the above related method steps to implement the distributed training method in the above embodiments.

This embodiment further provides a computer program product. When the computer program product runs on a computer, the computer is caused to perform the above related steps to implement the distributed training method in the above embodiments.

In addition, an embodiment of the present application further provides an apparatus, which may specifically be a chip, a component, or a module. The apparatus may include a processor and a memory that are connected, where the memory is configured to store computer-executable instructions. When the apparatus runs, the processor may execute the computer-executable instructions stored in the memory, so that the chip performs the distributed training method in the foregoing method embodiments.

The electronic device, the computer storage medium, the computer program product, and the chip provided in this embodiment are all configured to perform the corresponding methods provided above. Therefore, for the beneficial effects they can achieve, refer to the beneficial effects of the corresponding methods provided above, which are not repeated here.

Through the description of the above implementations, those skilled in the art can understand that, for convenience and brevity of description, only the division of the above functional modules is used as an example. In practical applications, the above functions may be assigned to different functional modules as needed; that is, the internal structure of the apparatus is divided into different functional modules to complete all or part of the functions described above.

In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division of modules or units is only a division by logical function, and there may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another apparatus, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be in electrical, mechanical, or other forms.

A unit described as a separate component may or may not be physically separate, and a component shown as a unit may be one physical unit or multiple physical units, that is, it may be located in one place or distributed in multiple different places. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of this embodiment.

In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

Any content of the embodiments of the present application, as well as any content of the same embodiment, may be freely combined. Any combination of the above content falls within the scope of the present application.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present application essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or part of the steps of the methods in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above specific implementations. The above specific implementations are merely illustrative rather than restrictive. Inspired by the present application, those of ordinary skill in the art may devise many other forms without departing from the purpose of the present application and the scope protected by the claims, all of which fall within the protection of the present application.

The steps of the methods or algorithms described in connection with the disclosure of the embodiments of the present application may be implemented in hardware, or by a processor executing software instructions. The software instructions may consist of corresponding software modules, and the software modules may be stored in a random access memory (RAM), a flash memory, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a register, a hard disk, a removable hard disk, a compact disc read-only memory (CD-ROM), or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be a component of the processor. The processor and the storage medium may be located in an ASIC.

Those skilled in the art should be aware that, in one or more of the above examples, the functions described in the embodiments of the present application may be implemented by hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on, or transmitted as one or more instructions or code on, a computer-readable medium. Computer-readable media include computer storage media and communication media, where communication media include any medium that facilitates the transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.


Claims (15)

1. A distributed training method, the method comprising:
creating a cache process of a first cache service corresponding to a first data set in a plurality of cache nodes of a first cluster, wherein the first data set is a data set in a remote database outside the first cluster;
creating a first computing task process corresponding to a first computing task in a plurality of first computing nodes of the first cluster, and setting the first cache service as input of the first computing task, wherein the first computing nodes belong to a computing node group;
determining whether to expand the capacity according to the first cache service, the first computing task and the first computing node;
if the capacity expansion is determined, a cache process of the first cache service is created in the first computing node, so that in the process of training the first computing task, the first computing node reads data from the cache process in the first computing node to complete training of the first computing task process on the first computing node.
2. The method of claim 1, wherein determining whether to expand based on the first cache service, the first computing task, the first computing node, comprises:
obtaining first characteristic data corresponding to the first cache service, second characteristic data corresponding to the first computing task, and third characteristic data corresponding to the computing node group;
extracting a first feature vector from the first characteristic data, extracting a second feature vector from the second characteristic data, and extracting a third feature vector from the third characteristic data;
obtaining a combined feature vector according to the first feature vector, the second feature vector and the third feature vector;
and inputting the combined feature vector into a trained capacity expansion decision model, and outputting a decision result of whether capacity expansion is performed or not by the capacity expansion decision model.
3. The method of claim 2, wherein the capacity expansion decision model is a classification model.
4. The method of claim 2, wherein the first characteristic data comprises statistics, cache setup information, and cache application information for the first data set.
5. The method of claim 4, wherein the statistics of the first data set include a total file size, a total number of files, a file format of the first data set; the cache setting information of the first data set comprises cache capacity, cache media and cache process quantity; the cache application information of the first data set comprises the number of computing tasks of the cache of the first data set and the historical information of the computing tasks of the cache of the first data set.
6. The method of claim 2, wherein the second characteristic data comprises any one or more of the following:
task priority, user information, requested CPU resources, requested GPU resources, requested memory resources, input data information used, corresponding algorithm type, and historical execution information.
7. The method of claim 2, wherein the third characteristic data comprises any one or more of the following:
allocatable CPU information, GPU information, memory information, and solid state disk information of each computing node, and the network topology in which each computing node is located.
8. The method of claim 1, wherein creating a caching process for the first caching service corresponding to the first data set in the plurality of caching nodes of the first cluster comprises:
receiving a first cache service creation request;
acquiring the data volume of the first data set;
if the data volume of the first data set is smaller than a data volume threshold value, setting the cache capacity of a cache process of a first cache service to be equal to the data volume of the first data set;
setting a cache initialization tag and a cache service tag for a first cache service resource corresponding to the first data set;
sending a first instruction to the first cluster, wherein the first instruction carries the first cache service resource;
according to the first instruction, a cache process of a first cache service corresponding to the first data set is created in a plurality of cache nodes with the cache initialization tag in the first cluster;
and loading the data in the first data set into the caching process.
9. The method of claim 1, wherein determining whether to expand based on the first cache service, the first computing task, the first computing node, comprises:
and if the available storage resources of the first computing node are larger than the data volume of the first data set, determining expansion.
10. The method of claim 1, wherein determining whether to expand based on the first cache service, the first computing task, the first computing node, comprises:
acquiring the priority of the first computing task;
and if the priority of the first computing task is higher than a preset level, determining expansion.
11. The method of claim 1, wherein determining whether to expand based on the first cache service, the first computing task, the first computing node, comprises:
acquiring the historical training speed of an algorithm of the first computing task;
and if the historical training speed of the algorithm of the first computing task is smaller than a preset speed value, determining expansion.
12. The method of claim 1, wherein each of the caching processes stores all of the data of the first data set.
13. A distributed training system, the system comprising a control node and a first cluster, wherein:
the control node is configured to:
creating a cache process of a first cache service corresponding to a first data set in a plurality of cache nodes of the first cluster, wherein the first data set is a data set in a remote database outside the first cluster;
creating a first computing task process corresponding to a first computing task in a plurality of first computing nodes of the first cluster, and setting the first cache service as input of the first computing task, wherein the first computing nodes belong to a computing node group;
determining whether to expand the capacity according to the first cache service, the first computing task and the first computing node;
if the capacity expansion is determined, a cache process of the first cache service is created in the first computing node;
the first computing node in the first cluster is configured to read data from the cache process in the first computing node during training of the first computing task, so as to complete training of a first computing task process on the first computing node.
14. An electronic device, comprising:
a memory and a processor, the memory coupled with the processor;
the memory stores program instructions that, when executed by the processor, cause the electronic device to perform the distributed training method of any of claims 1 to 12.
15. A computer readable storage medium comprising a computer program, characterized in that the computer program, when run on an electronic device, causes the electronic device to perform the distributed training method of any of claims 1 to 12.
CN202310374312.8A 2023-04-10 2023-04-10 Distributed Training Method and System Active CN116089477B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310374312.8A CN116089477B (en) 2023-04-10 2023-04-10 Distributed Training Method and System

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310374312.8A CN116089477B (en) 2023-04-10 2023-04-10 Distributed Training Method and System

Publications (2)

Publication Number Publication Date
CN116089477A CN116089477A (en) 2023-05-09
CN116089477B true CN116089477B (en) 2023-08-08

Family

ID=86201108

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310374312.8A Active CN116089477B (en) 2023-04-10 2023-04-10 Distributed Training Method and System

Country Status (1)

Country Link
CN (1) CN116089477B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117332881B (en) * 2023-11-27 2024-04-05 荣耀终端有限公司 Distributed training method and electronic equipment

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105224470A (en) * 2015-09-30 2016-01-06 汉柏科技有限公司 A kind of memory allocation method and device based on simplifying configuration
EP3367310A1 (en) * 2017-02-28 2018-08-29 Fujitsu Limited Method and apparatus for parallelizing layers of deep neural networks onto parallel computing systems
CN110427222A (en) * 2019-06-24 2019-11-08 北京达佳互联信息技术有限公司 Data load method, device, electronic equipment and storage medium
CN111211998A (en) * 2019-12-12 2020-05-29 北京淇瑀信息科技有限公司 Resource allocation method and device capable of elastically expanding capacity and electronic equipment
CN112000473A (en) * 2020-08-12 2020-11-27 中国银联股份有限公司 Distributed training method and device for deep learning model
CN113867959A (en) * 2021-09-29 2021-12-31 苏州浪潮智能科技有限公司 Training task resource scheduling method, device, equipment and medium
WO2022121519A1 (en) * 2020-12-10 2022-06-16 清华大学 Enhancement plug-in and enhancement method for elastic scaling of distributed data stream resource
CN114721844A (en) * 2022-03-10 2022-07-08 云和恩墨(北京)信息技术有限公司 Data caching method and device, computer equipment and storage medium
WO2022199824A1 (en) * 2021-03-25 2022-09-29 Telefonaktiebolaget Lm Ericsson (Publ) Methods for improved federated machine learning in wireless networks
CN115150471A (en) * 2022-06-27 2022-10-04 北京百度网讯科技有限公司 Data processing method, device, equipment, storage medium and program product

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150324690A1 (en) * 2014-05-08 2015-11-12 Microsoft Corporation Deep Learning Training System
US11093862B2 (en) * 2019-03-21 2021-08-17 International Business Machines Corporation Locality aware data loading for machine learning

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105224470A (en) * 2015-09-30 2016-01-06 汉柏科技有限公司 A kind of memory allocation method and device based on simplifying configuration
EP3367310A1 (en) * 2017-02-28 2018-08-29 Fujitsu Limited Method and apparatus for parallelizing layers of deep neural networks onto parallel computing systems
CN110427222A (en) * 2019-06-24 2019-11-08 北京达佳互联信息技术有限公司 Data load method, device, electronic equipment and storage medium
CN111211998A (en) * 2019-12-12 2020-05-29 北京淇瑀信息科技有限公司 Resource allocation method and device capable of elastically expanding capacity and electronic equipment
CN112000473A (en) * 2020-08-12 2020-11-27 中国银联股份有限公司 Distributed training method and device for deep learning model
WO2022033024A1 (en) * 2020-08-12 2022-02-17 中国银联股份有限公司 Distributed training method and apparatus of deep learning model
WO2022121519A1 (en) * 2020-12-10 2022-06-16 清华大学 Enhancement plug-in and enhancement method for elastic scaling of distributed data stream resource
WO2022199824A1 (en) * 2021-03-25 2022-09-29 Telefonaktiebolaget Lm Ericsson (Publ) Methods for improved federated machine learning in wireless networks
CN113867959A (en) * 2021-09-29 2021-12-31 苏州浪潮智能科技有限公司 Training task resource scheduling method, device, equipment and medium
CN114721844A (en) * 2022-03-10 2022-07-08 云和恩墨(北京)信息技术有限公司 Data caching method and device, computer equipment and storage medium
CN115150471A (en) * 2022-06-27 2022-10-04 北京百度网讯科技有限公司 Data processing method, device, equipment, storage medium and program product

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Deep Learning Workload Optimization on Kubernetes Clusters; Chen Pei et al.; Computer Systems & Applications; Vol. 31, No. 9; 114-126 *

Also Published As

Publication number Publication date
CN116089477A (en) 2023-05-09

Similar Documents

Publication Publication Date Title
US11914894B2 (en) Using scheduling tags in host compute commands to manage host compute task execution by a storage device in a storage system
US10817380B2 (en) Implementing affinity and anti-affinity constraints in a bundled application
CN107046508A (en) Message receiving method and network equipment
CN106681826A (en) Resource planning method, system and device for cluster computing architecture
CN116483546B (en) Distributed training task scheduling method, device, equipment and storage medium
WO2018068714A1 (en) Deduplication processing method, and storage device
CN119883578B (en) Task scheduling method and system, electronic device, and storage medium
CN116089477B (en) Distributed Training Method and System
CN114625474A (en) Container migration method and device, electronic equipment and storage medium
CN118312320A (en) Memory management method, system, desktop computer and computer storage medium
US11416152B2 (en) Information processing device, information processing method, computer-readable storage medium, and information processing system
CN117332881B (en) Distributed training method and electronic equipment
JPWO2018181961A1 (en) Virtual network function management device, virtual infrastructure management device, and virtual network function construction method
CN113535346A (en) Method, device, device and computer storage medium for adjusting the number of threads
CN113347238A (en) Message partitioning method, system, device and storage medium based on block chain
US9858204B2 (en) Cache device, cache system, and cache method
CN111090627B (en) Log storage method and device based on pooling, computer equipment and storage medium
JP6285850B2 (en) Process migration method and cluster system
CN113986458A (en) Container set scheduling method, device, equipment and storage medium
CN116743589B (en) Cloud host migration method, device and electronic equipment
CN103593240B (en) The dispatching method of a kind of optimization and management equipment
CN118138589B (en) Service cluster scheduling method, device, equipment and medium
CN111930781B (en) Method and device for processing data request of cache database
US20230043057A1 (en) Computer-readable recording medium storing application control program and application control method
JP2009104373A (en) Parallel computer system, information processor, job management method, and job management program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: Unit 3401, unit a, building 6, Shenye Zhongcheng, No. 8089, Hongli West Road, Donghai community, Xiangmihu street, Futian District, Shenzhen, Guangdong 518040

Patentee after: Honor Terminal Co.,Ltd.

Country or region after: China

Address before: 3401, unit a, building 6, Shenye Zhongcheng, No. 8089, Hongli West Road, Donghai community, Xiangmihu street, Futian District, Shenzhen, Guangdong

Patentee before: Honor Device Co.,Ltd.

Country or region before: China
