
CN112825044B - Task execution method, device and computer storage medium - Google Patents

Task execution method, device and computer storage medium

Info

Publication number
CN112825044B
CN112825044B
Authority
CN
China
Prior art keywords
component
execution
engine
machine learning
message
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911147892.7A
Other languages
Chinese (zh)
Other versions
CN112825044A (en)
Inventor
侯俊雄 (Hou Junxiong)
姜伟浩 (Jiang Weihao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201911147892.7A priority Critical patent/CN112825044B/en
Publication of CN112825044A publication Critical patent/CN112825044A/en
Application granted granted Critical
Publication of CN112825044B publication Critical patent/CN112825044B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Stored Programmes (AREA)

Abstract

The application discloses a task execution method, and belongs to the field of information technology. After a scheduling engine deployed on a machine learning platform receives a to-be-processed DAG task, then, for any component among the plurality of components contained in the DAG task, the scheduling engine selects, from a plurality of execution engines, an execution engine loaded with the machine learning framework corresponding to that component, and instructs the selected execution engine to execute the component based on the loaded framework. Because at least two of the plurality of execution engines are loaded with different machine learning frameworks, the DAG task may contain components that require different frameworks; in other words, the machine learning platform can be used to handle complex tasks that mix machine learning frameworks. For the different components contained in a DAG task, the scheduling engine can therefore call execution engines loaded with different machine learning frameworks to complete them, which improves the universality of the machine learning platform.

Description

Task execution method, device and computer storage medium
Technical Field
The present invention relates to the field of information technologies, and in particular, to a task execution method, a task execution device, and a computer storage medium.
Background
With the advent of the big data age and the continued development of artificial intelligence, machine learning frameworks can be deployed on machine learning platforms so that tasks can be executed through them. Such frameworks include a traditional machine learning framework (Scikit-learn) for small-scale data scenarios, a distributed parallel computing framework (Spark MLlib) for large-scale data scenarios, and the like.
In the related art, a single machine learning framework is deployed on any given machine learning platform. When a task is received, the platform determines whether the deployed machine learning framework can process the task; if so, the task is executed based on that framework, and if not, an execution failure message is fed back.
With this approach, the machine learning platform can only execute tasks that its deployed machine learning framework can process. In other words, if the framework deployed on the platform is changed, the set of tasks the platform can execute changes as well, which severely limits the flexibility with which the platform can be applied.
Disclosure of Invention
The embodiments of the application provide a task execution method, a task execution device, and a computer storage medium, which can improve the universality of a machine learning platform. The technical solution is as follows:
In one aspect, a task execution method is provided, the method including:
the scheduling engine receives a DAG (directed acyclic graph) task to be processed, wherein the DAG task comprises a plurality of components and execution sequence indication information, the execution sequence indication information is used for indicating the execution sequence of the plurality of components, each component in the plurality of components is used for indicating a data processing subtask, and each component corresponds to a machine learning framework;
when executing to any component according to the execution sequence indication information, the scheduling engine selects an execution engine loaded with a machine learning framework corresponding to any component from the plurality of execution engines;
the scheduling engine issues the component to a selected execution engine, so as to instruct the selected execution engine to execute the component based on the loaded machine learning framework, thereby executing the data processing subtask corresponding to the component.
Optionally, after the scheduling engine issues the any component to the selected execution engine, the method further includes:
and sending a component operation message to the selected execution engine, wherein the component operation message carries an identifier of input data corresponding to any component, and the input data corresponding to any component refers to data required when the any component is executed based on a machine learning framework corresponding to any component.
Optionally, the component running message further carries an identifier of a model corresponding to the any component, where the model corresponding to the any component refers to a model required when the any component is executed based on a machine learning framework corresponding to the any component.
Optionally, the component running message further carries a type of the any component, which is used for indicating the selected execution engine to execute the any component according to the model corresponding to the last executed component when the component running message does not carry the identifier of the model corresponding to the any component and the type of the any component is consistent with the type of the last executed component.
Optionally, the component operation message further carries a first operation mode or a second operation mode, where the first operation mode is used to instruct the selected execution engine to cache output data when executing the any component, and the second operation mode is used to instruct the selected execution engine not to cache output data when executing the any component.
Optionally, after sending the component running message to the selected execution engine, the method further includes:
and receiving a component execution completion message sent by the selected execution engine, wherein the component execution completion message carries the identification of any component, the execution result of any component and the identification of output data when any component is executed.
Optionally, after receiving the component execution completion message sent by the selected execution engine, the method further includes:
if the execution result carried by the component execution completion message is successful, updating the output state of any component to be execution completion;
and checking the output states of all components included in the DAG task, and if the output states of all the components are all finished, sending an executor stopping message to each execution engine in the plurality of execution engines, wherein the executor stopping message is used for indicating each execution engine to stop running.
Optionally, after receiving the component execution completion message sent by the selected execution engine, the method further includes:
and if the execution result carried by the component execution completion message is failure, sending an execution stopping message to each execution engine in the plurality of execution engines, wherein the execution stopping message is used for indicating each execution engine to stop running.
Optionally, the method further comprises:
receiving a termination task message;
and sending an execution stopping message to each execution engine in the plurality of execution engines, wherein the execution stopping message is used for indicating each execution engine to stop running.
Optionally, after the sending the stop executor message to each execution engine of the plurality of execution engines, the method further includes:
receiving an executor closed message sent by each execution engine;
updating the state of each execution engine to a closed state;
and when the states of all the execution engines are checked to be updated to be in the closed state, updating the states of the DAG tasks to be ended.
Optionally, the method further comprises:
and receiving an executor registration message sent by any execution engine of the plurality of execution engines, wherein the executor registration message is used for indicating the scheduling engine to communicate with that execution engine based on the executor registration message.
In another aspect, a task execution device is provided, the device including:
a receiving module, used for the scheduling engine to receive a directed acyclic graph (DAG) task to be processed, wherein the DAG task comprises a plurality of components and execution sequence indication information, the execution sequence indication information is used for indicating the execution sequence of the plurality of components, each component in the plurality of components is used for indicating a data processing subtask, and each component corresponds to a machine learning framework;
a selection module, used for selecting, when execution proceeds to any component according to the execution sequence indication information, an execution engine loaded with the machine learning framework corresponding to that component from the plurality of execution engines;
and an issuing module, used for the scheduling engine to issue the component to the selected execution engine and to instruct the selected execution engine to execute the component based on the loaded machine learning framework, so as to execute the data processing subtask corresponding to the component.
Optionally, the apparatus further includes:
the first sending module is configured to send a component operation message to the selected execution engine, where the component operation message carries an identifier of input data corresponding to any component, and the input data corresponding to any component is data required when the any component is executed based on a machine learning framework corresponding to any component.
Optionally, the component running message further carries an identifier of a model corresponding to the any component, where the model corresponding to the any component refers to a model required when the any component is executed based on a machine learning framework corresponding to the any component.
Optionally, the component running message further carries a type of the any component, which is used for indicating the selected execution engine to execute the any component according to the model corresponding to the last executed component when the component running message does not carry the identifier of the model corresponding to the any component and the type of the any component is consistent with the type of the last executed component.
Optionally, the component operation message further carries a first operation mode or a second operation mode, where the first operation mode is used to instruct the selected execution engine to cache output data when executing the any component, and the second operation mode is used to instruct the selected execution engine not to cache output data when executing the any component.
Optionally, the apparatus further includes:
the first receiving module is used for receiving the component execution completion message sent by the selected execution engine, wherein the component execution completion message carries the identification of any component, the execution result of any component and the identification of output data when any component is executed.
Optionally, the apparatus further includes:
the first updating module is used for updating the output state of the component to execution complete if the execution result carried by the component execution completion message is success;
and the checking module is used for checking the output states of all the components included in the DAG task, and sending an executor stopping message to each execution engine in the plurality of execution engines if the output states of all the components are all executed, wherein the executor stopping message is used for indicating each execution engine to stop running.
Optionally, the apparatus further includes:
and the second sending module is used for sending an executor stopping message to each execution engine in the plurality of execution engines if the execution result carried by the component execution completion message is failure, and used for indicating each execution engine to stop running.
Optionally, the apparatus further includes:
the second receiving module is used for receiving the termination task message;
and the third sending module is used for sending an executor stopping message to each execution engine in the plurality of execution engines and used for indicating each execution engine to stop running.
Optionally, the apparatus further includes:
the third receiving module is used for receiving the executor closed message sent by each execution engine;
the second updating module is used for updating the state of each execution engine to be in a closed state;
and the third updating module is used for updating the state of the DAG task to be ended when the state of all execution engines is checked to be updated to be the closed state.
Optionally, the apparatus further includes:
and the fourth receiving module is used for receiving an executor registration message sent by any execution engine of the plurality of execution engines, and is used for indicating the scheduling engine to communicate with that execution engine based on the executor registration message.
In another aspect, a task execution device is provided, the task execution device including a processor, a communication interface, a memory, and a communication bus;
the processor, the communication interface and the memory complete communication with each other through the communication bus;
the memory is used for storing a computer program;
the processor is used for executing the program stored in the memory to realize the task execution method.
In another aspect, a computer-readable storage medium is provided, in which a computer program is stored which, when executed by a processor, implements the steps of the task execution method provided above.
The technical solution provided by the embodiments of the application can bring at least the following beneficial effects:
in the embodiments of the application, after a scheduling engine deployed on a machine learning platform receives a DAG task to be processed, then, when execution proceeds to any component according to the execution sequence indication information included in the DAG task, the scheduling engine selects, from a plurality of execution engines, an execution engine loaded with the machine learning framework corresponding to that component, and instructs the selected execution engine to execute the component based on the loaded framework. Because at least two of the plurality of execution engines are loaded with different machine learning frameworks, the DAG task may contain components that require different frameworks; in other words, the machine learning platform can be used to handle complex tasks that mix machine learning frameworks. For the different components contained in a DAG task, the scheduling engine can therefore call execution engines loaded with different machine learning frameworks to complete them, which improves the universality of the machine learning platform.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a system architecture diagram of a machine learning platform according to an embodiment of the present application;
FIG. 2 is a scheduling flow diagram provided by an embodiment of the present application;
FIG. 3 is a flowchart of a task execution method according to an embodiment of the present application;
FIG. 4 is a flowchart of another task execution method provided by an embodiment of the present application;
FIG. 5 is another scheduling flow diagram provided by an embodiment of the present application;
FIG. 6 is another scheduling flow diagram provided by an embodiment of the present application;
FIG. 7 is a block diagram of a task execution device provided in an embodiment of the present application;
fig. 8 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Before explaining the embodiments of the present application in detail, a system architecture related to the embodiments of the present application will be described.
Fig. 1 is a system architecture diagram of a machine learning platform according to an embodiment of the present application. As shown in fig. 1, the machine learning platform has deployed thereon a scheduling engine 101, a plurality of execution engines 102, and a data storage layer 103.
The scheduler engine 101 may be connected with any execution engine 102 in a wireless or wired manner for communication. The data storage layer 103 is used to hold relevant data of the execution engine 102 and the scheduling engine 101.
The system architecture adopts a two-layer scheduling engine / execution engine structure that decouples scheduling from execution: the scheduling engine 101 can call different execution engines 102 to execute the corresponding tasks. The scheduling engine 101 may in principle be implemented in any programming language, while the execution engines 102 may be loaded with different machine learning frameworks, which may themselves be implemented in different programming languages.
In the embodiments of the application, the tasks processed by the machine learning platform are DAG tasks. A DAG task is a task whose execution flow follows the order indicated by preset execution sequence indication information. Moreover, a DAG task may include different components that require different machine learning frameworks, so through DAG tasks the machine learning platform can process tasks composed of multiple components.
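As an illustration only, one possible in-memory representation of such a DAG task is sketched below in Python; the patent does not prescribe a concrete data layout, and all class and field names here are hypothetical:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Component:
    """One node of the DAG: indicates a single data processing subtask."""
    component_id: str
    component_type: str  # e.g. "train", "predict", "transform"
    framework: str       # machine learning framework this component requires

@dataclass
class DagTask:
    """A to-be-processed DAG task: components plus execution order info."""
    task_id: str
    components: Dict[str, Component]
    # Execution sequence indication information, here encoded as
    # (upstream component id, downstream component id) edges.
    edges: List[Tuple[str, str]] = field(default_factory=list)
```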
The scheduling engine 101 is responsible for scheduling the various components that a DAG task includes, for example, issuing a component to an execution engine loaded with the matching machine learning framework. The machine learning frameworks loaded by the execution engines include a traditional machine learning framework (Scikit-learn), a native distributed parallel computing framework (Pure Spark), a distributed parallel computing framework (Spark MLlib), Python-based machine learning frameworks, and other machine learning frameworks.
Fig. 2 is a schematic diagram of a scheduling flow provided in an embodiment of the present application. As shown in fig. 2, a scheduling engine interface is deployed on the scheduling engine, and the scheduling engine interface may be a logic interface or a physical interface. The scheduler engine receives DAG tasks through the scheduler engine interface.
As shown in fig. 2, the DAG task may include a first component and a second component. Assume that the machine learning framework needed to execute the first component is a universal parallel computing framework (Spark), while the machine learning framework needed to execute the second component is a Python-based machine learning computing framework.
As shown in fig. 2, each execution engine is deployed with an execution engine interface, which may be a logical interface or a physical interface, and each execution engine receives the components issued by the scheduling engine through this interface. As shown in fig. 2, execution engine A is loaded with the universal parallel computing framework and execution engine B is loaded with the Python-based machine learning computing framework. The scheduling engine therefore issues the first component to execution engine A through execution engine interface A, instructing execution engine A to execute the first component using the universal parallel computing framework; meanwhile, the scheduling engine issues the second component to execution engine B through execution engine interface B, instructing execution engine B to execute the second component using the Python-based machine learning computing framework.
In addition, the scheduling engine 101 is also responsible for controlling the execution mode in which each execution engine 102 executes its components. The execution modes include: executing the components serially, executing the components in parallel, and whether a component's result is persisted to disk, i.e., whether the execution result of the component is saved to the hard disk.
The dispatch engine 101 may also be referred to as a dispatch node or scheduler, etc., and the execution engine 102 may also be referred to as an executor. The present application is not particularly limited thereto.
In addition, fig. 1 is only illustrated with 4 execution engines 102, but this does not constitute a limitation on the number of execution engines in the machine learning platform.
Next, a task execution method provided in the embodiment of the present application will be described.
Fig. 3 is a flowchart of a task execution method according to an embodiment of the present application, and as shown in fig. 3, the task execution method may include the following steps:
step 301: the scheduling engine receives a DAG task to be processed, wherein the DAG task comprises a plurality of components and execution sequence indication information, the execution sequence indication information is used for indicating the execution sequence of the plurality of components, each component in the plurality of components is used for indicating one data processing subtask, and each component corresponds to one machine learning framework.
Step 302: when executing to any component according to the execution sequence indication information, the scheduling engine selects an execution engine loaded with a machine learning framework corresponding to the any component from a plurality of execution engines.
Step 303: the scheduling engine issues the any component to a selected execution engine for instructing the selected execution engine to execute the any component based on the loaded machine learning framework to execute the data processing subtask corresponding to the any component.
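Read as pseudocode, steps 301 to 303 amount to the scheduling loop sketched below; the `topological_order`, `select_engine`, and `issue` helpers are assumptions introduced for illustration, not names from the patent:

```python
def run_dag(scheduler, task):
    # Walk the components in the order given by the execution sequence
    # indication information (assumed here to yield a topological order).
    for component in scheduler.topological_order(task):
        # Step 302: select an execution engine loaded with the machine
        # learning framework corresponding to this component.
        engine = scheduler.select_engine(component.framework)
        # Step 303: issue the component so that the selected engine
        # executes it based on its loaded framework.
        scheduler.issue(engine, component)
```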
In the embodiments of the application, after a scheduling engine deployed on a machine learning platform receives a DAG task to be processed, then, for any component among the plurality of components included in the DAG task, the scheduling engine selects, from a plurality of execution engines, an execution engine loaded with the machine learning framework corresponding to that component, and instructs the selected execution engine to execute the component based on the loaded framework. Because at least two of the plurality of execution engines are loaded with different machine learning frameworks, the DAG task may contain components that require different frameworks; in other words, the machine learning platform can be used to handle complex tasks that mix machine learning frameworks. For the different components contained in a DAG task, the scheduling engine can therefore call execution engines loaded with different machine learning frameworks to complete them, which improves the universality of the machine learning platform.
Fig. 4 is a flowchart of another task execution method provided in an embodiment of the present application, and as shown in fig. 4, the task execution method may include the following steps:
step 401: the scheduling engine receives a DAG task to be processed, wherein the DAG task comprises a plurality of components and execution sequence indication information, the execution sequence indication information is used for indicating the execution sequence of the plurality of components, each component in the plurality of components is used for indicating one data processing subtask, and each component corresponds to one machine learning framework.
In one possible implementation, as shown in fig. 2, the scheduling engine may receive DAG tasks through a scheduling engine interface. When the scheduling engine interface receives a DAG task, it can instantiate the task, that is, parse the DAG task into a DAG instance that the scheduling engine can directly interpret, and issue that DAG instance to the scheduling engine; the scheduling engine then derives each component and the execution sequence indication information from the DAG instance.
Step 402: when executing to any component according to the execution sequence indication information, the scheduling engine selects an execution engine loaded with a machine learning framework corresponding to the any component from a plurality of execution engines.
Since the multiple components contained in the DAG task need to be executed by the execution engines loaded with different types of machine learning frameworks, for any component, the scheduler engine needs to select an appropriate execution engine from the multiple execution engines to execute the component.
In one possible implementation, step 402 may be implemented as follows: acquire the machine learning framework loaded by each of the stored plurality of execution engines, and select, according to the framework loaded by each execution engine, an execution engine loaded with the machine learning framework corresponding to the component.
As shown in fig. 2, for the first component, since the machine learning framework corresponding to the first component is a general parallel computing framework and the execution engine a is loaded with the general parallel computing framework, the execution engine a is determined as the selected execution engine for the first component.
For the second component, since the machine learning framework corresponding to the second component is the Python-based machine learning computing framework and execution engine B is loaded with that framework, execution engine B is determined as the selected execution engine for the second component.
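A minimal selection routine consistent with this description might look as follows; the framework-to-engine registry and the selection policy are assumptions:

```python
class SchedulingEngine:
    def __init__(self):
        # Maps a framework name to the execution engines loaded with it.
        self.engines_by_framework = {}

    def select_engine(self, framework):
        """Select an execution engine loaded with the given framework."""
        candidates = self.engines_by_framework.get(framework, [])
        if not candidates:
            raise RuntimeError("no execution engine loaded with " + framework)
        # Any balancing policy could be applied; this sketch takes the first.
        return candidates[0]
```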
In addition, each execution engine registers with the scheduling engine in advance so that the scheduling engine can communicate with it later. As shown in fig. 5 and fig. 6, the registration of each execution engine may be implemented as follows: for any execution engine, after the execution engine is started, it sends an executor registration message (Register Executor Message) to the scheduling engine; the scheduling engine receives and stores the executor registration message, so that it can subsequently communicate with the execution engine based on that message.
The executor registration message includes information such as the IP (internet protocol) address of the execution engine, port information, and the type of the execution engine. The scheduling engine records the information carried in the executor registration message both locally and in a database, for subsequent communication with the execution engine based on that message.
In addition, when the machine learning platform is used, RPC (remote procedure call) protocol services may be deployed in the scheduling engine and in each execution engine respectively, so that the scheduling engine and any execution engine can communicate using the RPC protocol.
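The registration handshake could be sketched as follows. The message fields (IP address, port, engine type) come from the description above; the function and the `db.save` persistence call are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class RegisterExecutorMessage:
    ip: str           # IP (internet protocol) address of the execution engine
    port: int         # port information used for RPC communication
    engine_type: str  # which machine learning framework the engine has loaded

def on_register_executor(scheduler, db, msg):
    """Record a registration locally and in the database so the scheduling
    engine can later communicate with this engine over RPC."""
    scheduler.engines_by_framework.setdefault(msg.engine_type, []).append(msg)
    db.save("executors", msg)  # hypothetical persistence call
```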
Step 403: the scheduling engine issues any component to the selected execution engine, and sends a component running message to the selected execution engine, wherein the component running message is used for instructing the selected execution engine to execute the any component based on the loaded machine learning framework so as to execute the data processing subtasks corresponding to the any component.
As shown in fig. 2, for the first component, the first component is issued to the execution engine a through the execution engine interface a on the execution engine a. For the second component, the second component is issued to the execution engine B through the execution engine interface B on the execution engine B.
After the scheduling engine issues the component to the selected execution engine, a component running message (Run Component Message) needs to be sent to the selected execution engine, where the component running message carries information that needs to be used when the selected execution engine executes the component, so that the selected execution engine executes the component according to the component running message.
In one possible implementation, as shown in fig. 6, the component running message carries an identifier of the input data corresponding to the component, so that the selected execution engine can process that input data and thereby execute the data processing subtask corresponding to the component. The input data corresponding to a component refers to the data required when the component is executed based on its corresponding machine learning framework. The identifier of the input data may be an ID (identity) of the input data.
For example, when the data processing subtask corresponding to a component is to train a model, the input data corresponding to the component includes a series of training samples, and the selected execution engine determines a model from those training samples. A model in the embodiments of the application is usually an algorithm: inputting the initial parameters into the algorithm yields the final required data.
When the data processing subtasks corresponding to the components further include evaluating the determined model, the input data corresponding to any component may further include a series of test samples, and the selected execution engine evaluates the determined model according to the series of test samples.
For another example, when the data processing subtask corresponding to the component is simply converting the input data, for example, the data processing subtask is y=2x, and at this time, the component operation message only needs to carry the identifier of the input data x, and the selected execution engine can directly obtain the output data y according to the input data x.
In another possible implementation manner, as shown in fig. 5, when the data processing subtasks corresponding to the component are used for processing the input data based on one model, the component running message may carry, in addition to the identifier of the input data, the identifier of the model corresponding to any component, so that the execution engine selected later processes the input data based on the model. The model corresponding to any component refers to a model required by the scheduling engine when executing the any component based on the machine learning framework corresponding to the any component. Wherein the identification of the model may be an ID of the model.
For example, when the data processing subtask corresponding to the component processes input data x based on the model y = f(x), the component running message carries the identifier of the model y = f(x) and the identifier of the input data x. The execution engine fetches the model y = f(x) and the input data x from the data center according to those identifiers and inputs x into the model; the model outputs a y, completing execution of the data processing subtask corresponding to the component.
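Putting these pieces together, the component running message and the engine-side handling of the y = f(x) example might be sketched like this; `data_center` and all helper names are hypothetical stand-ins for the data storage layer:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RunComponentMessage:
    component_id: str
    input_data_id: str              # identifier (ID) of the input data
    model_id: Optional[str] = None  # identifier of the model, if one is carried
    component_type: str = ""        # type of the component
    run_mode: str = "cache"         # first mode "cache" or second mode "no_cache"

def handle_run_component(engine, data_center, msg):
    """Engine side: fetch data (and model) by identifier, then execute."""
    x = data_center.get_data(msg.input_data_id)
    if msg.model_id is not None:
        f = data_center.get_model(msg.model_id)  # e.g. the model y = f(x)
        return f(x)                              # outputs y for input x
    # No model identifier carried: fall back to the reuse rule described below.
    return engine.run_without_model(msg, x)
```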
In another possible implementation manner, as shown in fig. 5, when the data processing subtask corresponding to the component is to process input data based on a model, the component running message may carry, in addition to the identifier of the input data, the type of any component, so that the selected execution engine executes any component according to the model corresponding to the last executed component when the component running message does not carry the identifier of the model corresponding to any component, and the type of any component is consistent with the type of the last executed component.
In addition, in the above implementation manner, if the component running message does not carry the identifier of the model corresponding to any component, and the type of any component is inconsistent with the type of the component executed last, the execution engine may also temporarily train out a model, and then process the input data through the temporarily trained model.
It should be noted that training a model is generally resource-intensive and time-consuming. Therefore, after a model is obtained through training, it is often saved for reuse when subsequently executing components of the same type; this improves the efficiency of executing components and saves computing resources.
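The reuse rule for a running message that omits the model identifier could be expressed as below; the cached `last_model` / `last_component_type` attributes and the temporary-training fallback are assumptions:

```python
def resolve_model(engine, data_center, msg):
    """Pick the model for a component whose running message may omit its id."""
    if msg.model_id is not None:
        return data_center.get_model(msg.model_id)
    if msg.component_type == engine.last_component_type:
        return engine.last_model              # reuse the saved model
    return engine.train_temporary_model(msg)  # costly fallback: train anew
```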
In any of the above implementations, as shown in fig. 6, after the execution engine receives the component running message, the execution engine obtains the input data, or the input data and the model based on the component running message, and the selected execution engine executes any component according to the input data, or the input data and the model based on the loaded machine learning framework.
In addition to the identifier of the input data, the identifier of the model, and the type of the component, the component running message may further carry a first operation mode or a second operation mode, where the first operation mode is used to instruct the selected execution engine to cache output data when executing the component, and the second operation mode is used to instruct the selected execution engine not to cache output data when executing the component.
As shown in fig. 6, the selected execution engine detects the operation mode during execution of the component: if the operation mode is the first operation mode, the selected execution engine saves the output data to memory; if the operation mode is the second operation mode, the output data is written to disk.
In addition, as shown in fig. 6, before the execution engine detects the operation mode, it may first determine whether the type of the component is consistent with the type of the component to be processed next. If they are inconsistent, the probability that the output data will be used later is small, so the output data is saved to the hard disk, i.e., written to disk, without consulting the operation mode. If they are consistent, the probability that the output data will be used later is high, and whether to cache the output data is then determined according to the operation mode.
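That decision logic, read as a sketch (the attribute and method names are assumptions):

```python
def handle_output(engine, msg, output):
    # If the next component to be processed has a different type, the output
    # is unlikely to be reused: write it straight to disk, ignoring the mode.
    if msg.component_type != engine.next_component_type:
        engine.write_to_disk(output)
    elif msg.run_mode == "cache":   # first operation mode: cache in memory
        engine.cache_in_memory(output)
    else:                           # second operation mode: persist to disk
        engine.write_to_disk(output)
```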
In addition, as shown in fig. 6, the execution engine may also save the model used when executing any component for reuse in the execution of a subsequent component of the same type.
Step 404: the scheduling engine receives the component execution completion message sent by the selected execution engine, wherein the component execution completion message carries the identification of any component, the execution result of any component and the identification of output data when any component is executed.
As shown in fig. 5, after the execution engine finishes running a component, it generally needs to send a component execution completion message (Comp Complete Message) to the scheduling engine to indicate that execution of the component has ended. To facilitate managing DAG tasks, the scheduling engine is pre-configured with an output state for each component; the output state indicates whether the component has finished executing and is set to incomplete before the component is issued. Accordingly, if the execution result carried by the component execution completion message is success, the scheduling engine updates the output state of the component from incomplete to execution complete.
In addition, after the scheduling engine updates the output state of any component from incomplete to finished execution, the scheduling engine can also check the output states of all components included in the DAG task, if the output states of all components are finished execution, the scheduling engine sends a stop executor message (Stop Executor Message) to each execution engine in the plurality of execution engines, and the stop executor message is used for indicating each execution engine to stop running.
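The bookkeeping described in the two paragraphs above amounts to the following sketch; the `output_state` map and the helper names are hypothetical:

```python
def on_component_complete(scheduler, msg):
    """Handle a component execution completion (Comp Complete Message)."""
    if msg.result == "success":
        scheduler.output_state[msg.component_id] = "complete"
        # Check the output states of all components of the DAG task; if every
        # one is execution complete, tell all execution engines to stop.
        if all(s == "complete" for s in scheduler.output_state.values()):
            for engine in scheduler.all_engines():
                scheduler.send_stop_executor(engine)  # Stop Executor Message
```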
In addition, as shown in fig. 5, after the scheduling engine updates the output state of a component to execution complete, it may also detect whether the next component to be executed is ready; if so, it obtains the component running message of that component, and completes its execution through steps 402 and 403 above.
In addition, as shown in fig. 5, if the execution result carried by the component execution completion message is failure, a problem occurred while executing the component. In order to ensure that subsequent components can execute properly, a stop executor message is sent to each of the plurality of execution engines, instructing each execution engine to stop running. After every execution engine has stopped, the problem encountered while executing the component is investigated, and each execution engine is restarted once the problem is resolved.
In addition, as shown in FIG. 5, if the scheduling engine receives a terminate task message (Stop Task Message) indicating that it is currently necessary to terminate running the DAG task, an executor stop message is sent to each of the plurality of execution engines. The terminate task message may be triggered by an administrator, that is, during execution of the DAG task by the machine learning platform, the administrator may terminate the DAG task according to actual requirements.
In addition, in any of the above scenarios where a stop executor message is sent, the scheduling engine may, after sending the stop executor message, further receive an executor closed message (Executor Stopped Message) from each execution engine, indicating that the execution engine was closed successfully. To facilitate managing the DAG task, the scheduling engine is pre-configured with a state for each execution engine, which indicates whether that execution engine is running. The scheduling engine can therefore update the state of each execution engine to closed based on the executor closed message it sends. When the scheduling engine detects that the states of all execution engines have been updated to closed, it updates the state of the DAG task to ended and shuts itself down.
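Closing the loop, the shutdown bookkeeping could be sketched as follows, again with hypothetical names:

```python
def on_executor_stopped(scheduler, engine_id):
    """Handle an executor closed message (Executor Stopped Message)."""
    scheduler.engine_state[engine_id] = "closed"
    # Once every execution engine reports closed, mark the DAG task as
    # ended and close the scheduling engine itself.
    if all(s == "closed" for s in scheduler.engine_state.values()):
        scheduler.task_state = "ended"
        scheduler.shutdown()
```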
In addition, the component execution completion message may also carry execution error information, which is used to indicate some error information that occurs in the process of executing the component. The component execution completion message may also carry other information such as the identification of the component output port, and will not be described in detail herein.
In addition, the scheduling engine may also write a log entry when it receives a component execution completion message; the log is used to record when each component was executed and which component it was.
In the embodiments of the application, after a scheduling engine deployed on a machine learning platform receives a DAG task to be processed, then, for any component among the plurality of components included in the DAG task, the scheduling engine selects, from a plurality of execution engines, an execution engine loaded with the machine learning framework corresponding to that component, and instructs the selected execution engine to execute the component based on the loaded framework. Because at least two of the plurality of execution engines are loaded with different machine learning frameworks, the DAG task may contain components that require different frameworks; in other words, the machine learning platform can be used to handle complex tasks that mix machine learning frameworks. For the different components contained in a DAG task, the scheduling engine can therefore call execution engines loaded with different machine learning frameworks to complete them, which improves the universality of the machine learning platform.
Fig. 7 is a block diagram of a task execution device illustrated in an embodiment of the present application; the device may be implemented in software, hardware, or a combination of both. The task execution device 700 may include:
a receiving module 701, configured to receive a DAG task to be processed by the scheduling engine, where the DAG task includes a plurality of components and execution sequence indication information, where the execution sequence indication information is used to indicate an execution sequence of the plurality of components, and each component in the plurality of components is used to indicate a data processing subtask, and each component corresponds to a machine learning framework.
A selection module 702, configured to select, when execution proceeds to any component according to the execution sequence indication information, an execution engine loaded with the machine learning framework corresponding to that component from the plurality of execution engines.
And an issuing module 703, configured for the scheduling engine to issue the component to the selected execution engine and to instruct the selected execution engine to execute the component based on the loaded machine learning framework, so as to execute the data processing subtask corresponding to the component.
Optionally, the apparatus further comprises:
the first sending module is configured to send a component running message to the selected execution engine, where the component running message carries an identifier of input data corresponding to any component, and the input data corresponding to any component is data required when the any component is executed based on a machine learning framework corresponding to the any component.
Optionally, the component running message further carries an identifier of a model corresponding to the any component, where the model corresponding to the any component is a model required when the any component is executed based on the machine learning framework corresponding to the any component.
Optionally, the component running message further carries a type of the any component, which is used for indicating the selected execution engine to execute the any component according to the model corresponding to the last executed component when the component running message does not carry the identifier of the model corresponding to the any component and the type of the any component is consistent with the type of the last executed component.
Optionally, the component running message further carries a first running mode or a second running mode, where the first running mode is used to instruct the selected execution engine to cache output data when executing the any component, and the second running mode is used to instruct the selected execution engine not to cache output data when executing the any component.
Optionally, the apparatus further comprises:
the first receiving module is used for receiving the component execution completion message sent by the selected execution engine, wherein the component execution completion message carries the identification of any component, the execution result of any component and the identification of output data when any component is executed.
Optionally, the apparatus further comprises:
the first updating module is used for updating the output state of the component to execution complete if the execution result carried by the component execution completion message is success;
and the checking module is used for checking the output states of all the components included in the DAG task, and sending an executor stopping message to each execution engine in the plurality of execution engines if the output states of all the components are all executed, wherein the executor stopping message is used for indicating each execution engine to stop running.
Optionally, the apparatus further comprises:
and the second sending module is used for sending an executor stopping message to each execution engine in the plurality of execution engines if the execution result carried by the component execution completion message is failure, and used for indicating each execution engine to stop running.
Optionally, the apparatus further comprises:
the second receiving module is used for receiving the termination task message;
and the third sending module is used for sending a stop executor message to each execution engine in the plurality of execution engines and used for indicating each execution engine to stop running.
Optionally, the apparatus further comprises:
the third receiving module is used for receiving the executor closed message sent by each execution engine;
The second updating module is used for updating the state of each execution engine to be in a closed state;
and the third updating module is used for updating the state of the DAG task to be finished when the state of all execution engines is checked to be updated to be the closed state.
Optionally, the apparatus further comprises:
and the fourth receiving module is used for receiving an executor registration message sent by any execution engine of the plurality of execution engines, and is used for indicating the scheduling engine to communicate with that execution engine based on the executor registration message.
In the embodiments of the application, after a scheduling engine deployed on a machine learning platform receives a DAG task to be processed, then, for any component among the plurality of components included in the DAG task, the scheduling engine selects, from a plurality of execution engines, an execution engine loaded with the machine learning framework corresponding to that component, and instructs the selected execution engine to execute the component based on the loaded framework. Because at least two of the plurality of execution engines are loaded with different machine learning frameworks, the DAG task may contain components that require different frameworks; in other words, the machine learning platform can be used to handle complex tasks that mix machine learning frameworks. For the different components contained in a DAG task, the scheduling engine can therefore call execution engines loaded with different machine learning frameworks to complete them, which improves the universality of the machine learning platform.
It should be noted that: when the task execution device provided in the above embodiments executes tasks, the division into the above functional modules is only used as an example; in practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the task execution device provided in the above embodiments and the task execution method embodiments belong to the same concept; the specific implementation process of the device is detailed in the method embodiments and is not repeated here.
Fig. 8 is a schematic structural diagram of a server according to an embodiment of the present application. The server may be a server in a backend server cluster, and either the scheduling engine or the execution engines in the above embodiments may be implemented by this server. Specifically:
The server 800 includes a Central Processing Unit (CPU) 801, a system memory 804 including a Random Access Memory (RAM) 802 and a Read Only Memory (ROM) 803, and a system bus 805 connecting the system memory 804 and the central processing unit 801. The server 800 also includes a basic input/output system (I/O system) 806 for facilitating the transfer of information between various devices within the computer, and a mass storage device 807 for storing an operating system 813, application programs 814, and other program modules 815.
The basic input/output system 806 includes a display 808 for displaying information and an input device 809, such as a mouse, keyboard, or the like, for user input of information. Wherein both the display 808 and the input device 809 are connected to the central processing unit 801 via an input output controller 810 connected to the system bus 805. The basic input/output system 806 may also include an input/output controller 810 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input output controller 810 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 807 is connected to the central processing unit 801 through a mass storage controller (not shown) connected to the system bus 805. The mass storage device 807 and its associated computer-readable media provide non-volatile storage for the server 800. That is, the mass storage device 807 may include a computer readable medium (not shown) such as a hard disk or CD-ROM drive.
Computer readable media may include computer storage media and communication media without loss of generality. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any apparatus or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, DVD or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that computer storage media are not limited to the ones described above. The system memory 804 and mass storage device 807 described above may be collectively referred to as memory.
According to various embodiments of the present application, the server 800 may also be operated through a remote computer connected via a network such as the Internet. That is, the server 800 may connect to the network 812 through a network interface unit 811 connected to the system bus 805, or the network interface unit 811 may be used to connect to other types of networks or remote computer systems (not shown).
The memory also includes one or more programs, which are stored in the memory and configured to be executed by the CPU. The one or more programs include instructions for performing the task execution method provided by the embodiments of the present application.
The present application also provides a non-transitory computer-readable storage medium; when instructions in the storage medium are executed by a processor of a server, the server is enabled to perform the task execution method provided by the above embodiments.
The present application also provides a computer program product containing instructions that, when executed on a server, cause the server to perform the task execution method provided in the above embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disk.
The foregoing description of the preferred embodiment of the present invention is not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.

Claims (15)

1. A task execution method, wherein a scheduling engine and a plurality of execution engines are deployed on a machine learning platform, each execution engine is loaded with a machine learning framework, and at least two of the plurality of execution engines are loaded with different machine learning frameworks; the method comprises the following steps:
the scheduling engine receives a Directed Acyclic Graph (DAG) task to be processed, wherein the DAG task comprises a plurality of components and execution sequence indication information, the execution sequence indication information indicates the execution order of the plurality of components, each of the plurality of components indicates a data processing subtask, and each component corresponds to a machine learning framework;
when execution reaches any component according to the execution sequence indication information, the scheduling engine selects, from the plurality of execution engines, an execution engine loaded with the machine learning framework corresponding to the component;
the scheduling engine issues the component to the selected execution engine and instructs the selected execution engine to execute the component based on the loaded machine learning framework, so as to perform the data processing subtask corresponding to the component.
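As an illustration only (the claim language itself is authoritative), the engine-selection flow of claim 1 can be sketched in Python; every class, method, and field name below is hypothetical rather than taken from the patent, and the sketch assumes the components arrive already ordered by the execution sequence indication information.

from dataclasses import dataclass, field

@dataclass
class Component:
    # One DAG node: a data processing subtask tied to a machine learning framework.
    component_id: str
    framework: str

@dataclass
class ExecutionEngine:
    engine_id: str
    framework: str  # the machine learning framework this engine has loaded

    def execute(self, component: Component) -> None:
        # Stand-in for running the component's subtask on the loaded framework.
        print(f"{self.engine_id} ({self.framework}) executes {component.component_id}")

@dataclass
class SchedulingEngine:
    engines: list = field(default_factory=list)

    def select_engine(self, component: Component) -> ExecutionEngine:
        # Pick an execution engine whose loaded framework matches the
        # framework the component corresponds to.
        for engine in self.engines:
            if engine.framework == component.framework:
                return engine
        raise LookupError(f"no engine loaded with {component.framework}")

    def run_dag(self, components: list) -> None:
        # Issue each component, in order, to a matching execution engine.
        for component in components:
            self.select_engine(component).execute(component)

For example, a DAG whose first component corresponds to one framework and whose second corresponds to another would be routed by select_engine to two different engines, which is what allows the platform to mix frameworks within a single task.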
2. The method of claim 1, wherein, after the scheduling engine issues the component to the selected execution engine, the method further comprises:
sending a component run message to the selected execution engine, wherein the component run message carries an identifier of input data corresponding to the component, the input data being the data required when the component is executed based on its corresponding machine learning framework.
3. The method of claim 2, wherein the component run message further carries an identifier of a model corresponding to the component, the model being the model required when the component is executed based on its corresponding machine learning framework.
4. The method of claim 2, wherein the component run message further carries the type of the component; when the component run message carries no model identifier and the type of the component is consistent with the type of the last executed component, the type instructs the selected execution engine to execute the component according to the model corresponding to the last executed component.
5. The method of claim 2, wherein the component run message further carries a first run mode or a second run mode, the first run mode instructing the selected execution engine to cache output data when executing the component, and the second run mode instructing it not to cache output data when executing the component.
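Claims 2 to 5 together enumerate the fields that a component run message may carry. The following sketch encodes them as a Python dataclass; all field names are invented for illustration and are not the patent's wire format.

from dataclasses import dataclass
from enum import Enum
from typing import Optional

class RunMode(Enum):
    CACHE_OUTPUT = 1  # first run mode: cache output data (claim 5)
    NO_CACHE = 2      # second run mode: do not cache output data (claim 5)

@dataclass
class ComponentRunMessage:
    component_id: str
    input_data_id: str              # identifier of the component's input data (claim 2)
    model_id: Optional[str] = None  # identifier of the required model, if any (claim 3)
    component_type: str = ""        # when model_id is None and this type matches the
                                    # last executed component's type, the engine reuses
                                    # that component's model (claim 4)
    run_mode: RunMode = RunMode.CACHE_OUTPUT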
6. The method of claim 2, wherein, after sending the component run message to the selected execution engine, the method further comprises:
receiving a component execution completion message sent by the selected execution engine, wherein the component execution completion message carries the identifier of the component, the execution result of the component, and the identifier of the output data produced when the component was executed.
7. The method of claim 6, wherein, after receiving the component execution completion message sent by the selected execution engine, the method further comprises:
if the execution result carried by the component execution completion message is success, updating the output state of the component to completed;
and checking the output states of all components included in the DAG task, and if the output states of all the components are completed, sending a stop executor message to each of the plurality of execution engines, the stop executor message instructing each execution engine to stop running.
8. The method of claim 6, wherein, after receiving the component execution completion message sent by the selected execution engine, the method further comprises:
if the execution result carried by the component execution completion message is failure, sending a stop executor message to each of the plurality of execution engines, the stop executor message instructing each execution engine to stop running.
9. The method of claim 1, wherein the method further comprises:
receiving a terminate task message;
and sending a stop executor message to each of the plurality of execution engines, the stop executor message instructing each execution engine to stop running.
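The bookkeeping of claims 6 to 9 — marking a component's output state, detecting that the whole DAG has finished, and stopping every engine on failure or on a terminate task message — can be sketched as follows. The class and the send() method assumed on engines are hypothetical illustrations, not the patent's interfaces.

class DagTracker:
    # Scheduler-side state for one DAG task; engines are assumed to expose
    # a hypothetical send() method for control messages.
    def __init__(self, component_ids, engines):
        self.output_state = {cid: "pending" for cid in component_ids}
        self.engines = engines

    def broadcast_stop(self):
        # Stop executor message to every execution engine (claims 7-9).
        for engine in self.engines:
            engine.send("stop_executor")

    def on_component_finished(self, component_id, result):
        # Component execution completion message handling (claims 6-8).
        if result == "success":
            self.output_state[component_id] = "completed"
            if all(s == "completed" for s in self.output_state.values()):
                self.broadcast_stop()  # every component is done (claim 7)
        else:
            self.broadcast_stop()      # any failure stops all engines (claim 8)

    def on_terminate_task(self):
        # Terminate task message (claim 9).
        self.broadcast_stop()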
10. The method of any one of claims 7 to 9, wherein, after sending the stop executor message to each of the plurality of execution engines, the method further comprises:
receiving an executor closed message sent by each execution engine;
updating the state of each execution engine to a closed state;
and when it is determined that the states of all the execution engines have been updated to the closed state, updating the state of the DAG task to ended.
11. The method of claim 1, wherein the method further comprises:
receiving an executor registration message sent by any execution engine of the plurality of execution engines, wherein the executor registration message instructs the scheduling engine to communicate with that execution engine based on the message.
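Claims 10 and 11 complete the engine lifecycle: an engine registers before the scheduling engine communicates with it, and acknowledges shutdown with an executor closed message. A minimal sketch, with all names invented for illustration:

class EngineRegistry:
    # Scheduler-side view of execution engine lifecycle.
    def __init__(self):
        self.engine_state = {}
        self.dag_state = "running"

    def on_executor_registered(self, engine_id):
        # Executor registration message: record the engine so the scheduling
        # engine can communicate with it from now on (claim 11).
        self.engine_state[engine_id] = "running"

    def on_executor_closed(self, engine_id):
        # Executor closed message received after a stop executor message (claim 10).
        self.engine_state[engine_id] = "closed"
        if all(s == "closed" for s in self.engine_state.values()):
            self.dag_state = "ended"  # all engines down: the DAG task has ended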
12. A task execution device, wherein a scheduling engine and a plurality of execution engines are deployed on a machine learning platform, each execution engine is loaded with a machine learning framework, and at least two of the plurality of execution engines are loaded with different machine learning frameworks; the device comprises:
a receiving module, configured to receive a Directed Acyclic Graph (DAG) task to be processed, wherein the DAG task comprises a plurality of components and execution sequence indication information, the execution sequence indication information indicates the execution order of the plurality of components, each of the plurality of components indicates a data processing subtask, and each component corresponds to a machine learning framework;
a selection module, configured to select, from the plurality of execution engines, an execution engine loaded with the machine learning framework corresponding to any component when execution reaches that component according to the execution sequence indication information;
and an issuing module, configured to issue the component to the selected execution engine and to instruct the selected execution engine to execute the component based on the loaded machine learning framework, so as to perform the data processing subtask corresponding to the component.
13. The device of claim 12, wherein the device further comprises:
a first sending module, configured to send a component run message to the selected execution engine, wherein the component run message carries an identifier of input data corresponding to the component, the input data being the data required when the component is executed based on its corresponding machine learning framework.
14. A task execution device, the device comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the steps of the method of any one of claims 1 to 11.
15. A computer-readable storage medium having stored thereon instructions which, when executed by a processor, implement the steps of the method of any one of claims 1 to 11.
CN201911147892.7A 2019-11-21 2019-11-21 Task execution method, device and computer storage medium Active CN112825044B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911147892.7A CN112825044B (en) 2019-11-21 2019-11-21 Task execution method, device and computer storage medium

Publications (2)

Publication Number Publication Date
CN112825044A CN112825044A (en) 2021-05-21
CN112825044B (en) 2023-06-13

Family

ID=75907503

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911147892.7A Active CN112825044B (en) 2019-11-21 2019-11-21 Task execution method, device and computer storage medium

Country Status (1)

Country Link
CN (1) CN112825044B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107733977A (en) * 2017-08-31 2018-02-23 北京百度网讯科技有限公司 A kind of cluster management method and device based on Docker
CN108062246A (en) * 2018-01-25 2018-05-22 北京百度网讯科技有限公司 For the resource regulating method and device of deep learning frame
CN108628669A (en) * 2018-04-25 2018-10-09 北京京东尚科信息技术有限公司 A kind of method and apparatus of scheduling machine learning algorithm task

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9794145B2 (en) * 2014-07-23 2017-10-17 Cisco Technology, Inc. Scheduling predictive models for machine learning systems
US10698954B2 (en) * 2016-06-30 2020-06-30 Facebook, Inc. Computation platform agnostic data classification workflows
DE102018110138A1 (en) * 2017-10-18 2019-04-18 Electronics And Telecommunications Research Institute Workflow engine framework
US11526728B2 (en) * 2018-04-09 2022-12-13 Microsoft Technology Licensing, Llc Deep learning model scheduling
US11144883B2 (en) * 2018-05-04 2021-10-12 International Business Machines Corporation Intelligent scheduling of events

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Xiao, Wencong et al. Gandiva: Introspective cluster scheduling for deep learning. 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 595-608 *
Lu, Zhonghua et al. Scheduling extension of TensorFlow GPU clusters based on Hadoop YARN. E-Science Technology & Application (《科研信息化技术与应用》), vol. 8, no. 6, pp. 33-42 *

Similar Documents

Publication Publication Date Title
CN108897557B (en) Updating method and device of microservice architecture
CN113094125B (en) Business process processing method, device, server and storage medium
WO2017193737A1 (en) Software testing method and system
CN107105009A (en) Job scheduling method and device based on Kubernetes system docking workflow engines
CN104462243B (en) A kind of ETL scheduling system and methods of combination data check
CN111026541B (en) Rendering resource scheduling method, apparatus, device and storage medium
CN107870948A (en) Method for scheduling task and device
WO2019024679A1 (en) Method for upgrading network function and upgrade management entity
CN112395736A (en) Parallel simulation job scheduling method of distributed interactive simulation system
CN109144701A (en) A kind of task flow management method, device, equipment and system
CN113448988A (en) Method and device for training algorithm model, electronic equipment and storage medium
CN111147541B (en) Node processing method, device and equipment based on parameter server and storage medium
CN113658351B (en) Method and device for producing product, electronic equipment and storage medium
CN112825044B (en) Task execution method, device and computer storage medium
CN110381143B (en) Job submission and execution methods, devices, equipment and computer storage media
CN116302448B (en) Task scheduling method and system
CN111190725B (en) Task processing method, device, storage medium and server
CN115499493B (en) Asynchronous transaction processing method, device, storage medium and computer equipment
CN115525413A (en) Cluster-based model training method, system, device, medium and product
CN112600906B (en) Resource allocation method, device and electronic device for online scene
CN111581042B (en) A cluster deployment method, deployment platform and server to be deployed
CN111553379A (en) Image data processing method and system based on asynchronous training
CN117573329B (en) Multi-brain collaborative task scheduling method, task scheduling device and storage medium
CN117076558B (en) A method and system for online analysis of massive data based on multiple computing engines
CN115454450B (en) Method and device for resource management of data job, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant