Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein, but rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the exemplary embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
Large language models (LLMs) are becoming increasingly important owing to their high performance in natural language processing tasks. However, efficient LLM services face key challenges in computing resource optimization, memory management, and inference speed, due to the following drawbacks.
First, the resources used by a large language model are statically allocated; memory and computing resources cannot be adjusted dynamically according to the actual requirements of an inference task, so resources are either wasted or insufficient.
Second, a state-unaware inference service cannot effectively utilize the context information shared between requests, causing repeated consumption of computing resources.
Third, a centralized inference architecture is limited in scalability when processing large-scale or highly concurrent requests; it cannot adapt to growing computing demands, nor can it achieve optimal resource allocation and task scheduling.
Because of these problems, LLM inference tasks cannot be processed at the desired speed and efficiency, and user experience suffers.
Fig. 1 shows a schematic structural diagram of a model inference optimization system according to an embodiment of the disclosure; the system includes a plurality of terminals 120 and a server cluster 140.
The terminal 120 may be a mobile terminal such as a mobile phone, a game console, a tablet computer, an electronic book reader, smart glasses, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a smart home device, an AR (Augmented Reality) device, or a VR (Virtual Reality) device; alternatively, the terminal 120 may be a personal computer (PC) such as a laptop or desktop computer.
An application that provides model inference optimization may be installed in the terminal 120.
The terminal 120 is connected to the server cluster 140 through a communication network. Optionally, the communication network is a wired network or a wireless network.
The server cluster 140 is a single server, a group of several servers, a virtualized platform, or a cloud computing service center. The server cluster 140 provides background services for applications that provide model inference optimization. Optionally, the server cluster 140 performs the primary computing while the terminal 120 performs secondary computing; or the server cluster 140 performs secondary computing while the terminal 120 performs the primary computing; or a distributed computing architecture is used between the terminal 120 and the server cluster 140.
In some alternative embodiments, the server cluster 140 is used to store model inference optimization models, and the like.
The clients of the applications installed on different terminals 120 may be identical, or they may be clients of the same type of application built for different operating-system platforms. The specific form of the application client may also differ by terminal platform; for example, the client may be a mobile-phone client, a PC client, or a World Wide Web (Web) client.
Those skilled in the art will appreciate that there may be more or fewer terminals 120: there may be only one terminal, or there may be tens, hundreds, or more. The embodiments of the present application do not limit the number of terminals or the device type.
Optionally, the system may further comprise a management device (not shown in fig. 1), which is connected to the server cluster 140 via a communication network. Optionally, the communication network is a wired network or a wireless network.
Alternatively, the wireless network or wired network described above uses standard communication techniques and/or protocols. The network is typically the Internet, but may be any network including, but not limited to, a local area network (LAN), metropolitan area network (MAN), wide area network (WAN), mobile, wired, or wireless network, a private network, or any combination thereof including a virtual private network. In some embodiments, data exchanged over the network is represented using techniques and/or formats including HyperText Markup Language (HTML), Extensible Markup Language (XML), and the like. All or some of the links may also be encrypted using conventional encryption techniques such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), Internet Protocol Security (IPsec), etc. In other embodiments, custom and/or dedicated data communication techniques may also be used in place of, or in addition to, the techniques described above.
As shown in fig. 2, a model inference optimization method according to one embodiment of the present disclosure includes:
Step S202: in response to a received inference request for model inference, a global scheduler determines a first execution instance based on a local awareness policy over the configured global hint tree, the first execution instance being capable of reusing the context cache of the inference request.
The global hint tree is used to store and organize information related to model inference so as to assist task-scheduling and resource-allocation decisions; each node in the tree stores specific hint information and a KV (key-value) cache index for context retrieval.
The local awareness policy is a decision scheme, adopted by the global scheduler during task allocation, that is based on the local resources and load conditions of the execution instances; it takes the actual environment and state around each execution instance into account so as to ensure that a task is allocated to the most suitable execution unit.
The global scheduler is used to schedule resources of cloud nodes in the data center.
In some embodiments, the first execution instance is a computing node or execution environment selected by the global scheduler for the model-inferred pre-filled task according to a local awareness policy of the global hint tree, and the first execution instance may be a memory instance.
In some embodiments, when the global scheduler receives an inference request, it first parses the hint identifier. The hint identifier contains key information about the request, such as the inference task type. Based on the parsed hint identifier, the scheduler searches the global hint tree for tree nodes storing matching specified hint information and, from the index of the matched tree node, obtains the instance location of the context cache of the inference request; this instance location points to a storage area holding historical computation results or intermediate data related to the current request.
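The lookup described above can be sketched as a small prefix tree whose nodes carry an index to the cache's instance location. This is a minimal illustrative sketch: the node layout, token granularity, and the `(instance_id, block_id)` location format are assumptions, not data structures prescribed by the embodiment.

```python
class HintNode:
    """One node of the global hint tree: children keyed by hint token,
    plus an optional index to a KV cache held on some execution instance."""
    def __init__(self):
        self.children = {}
        self.kv_location = None  # e.g. ("instance-1", block_id); hypothetical format

class GlobalHintTree:
    def __init__(self):
        self.root = HintNode()

    def insert(self, hint_tokens, kv_location):
        # Store hint information along a path; the final node records
        # where the corresponding context cache lives.
        node = self.root
        for tok in hint_tokens:
            node = node.children.setdefault(tok, HintNode())
        node.kv_location = kv_location

    def match(self, hint_tokens):
        # Walk the longest stored prefix of the request's hint identifier
        # and return the instance location of the deepest cached node.
        node, best = self.root, None
        for tok in hint_tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            if node.kv_location is not None:
                best = node.kv_location
        return best
```

The scheduler would then use the returned instance location to decide where the pre-filling task can reuse the cache; `match` returning `None` corresponds to a request with no reusable context.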
Step S204: the pre-filling task of the model inference is scheduled to the first execution instance, so that a pre-filling operation is performed on the first execution instance to obtain a key-value cache that includes the context cache.
Here, the context cache of an inference request is understood as a caching mechanism that stores historical computation results and intermediate data related to the request, including useful information generated by previously processing similar requests or earlier stages of the same request.
The pre-filling (Prefill) task preprocesses the input data of the inference request and, in combination with the reusable context cache, generates a richer key-value cache, providing the necessary data support for the subsequent decoding task.
In some embodiments, the queried context cache is analyzed to determine the parts relevant to the current inference request. The relevant key-value pairs are multiplexed directly in the pre-filling operation, skipping the recomputation of that data and thereby saving computing resources and time. While processing the new input data, the first execution instance generates new key-value pairs that record the results of that processing; these new pairs are combined with the pairs multiplexed from the context cache to form the key-value cache including the context cache.
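The prefix-reuse behavior above can be sketched as follows. The list-based KV representation and the `compute_kv` callback are hypothetical stand-ins for the model's real attention cache and forward computation.

```python
def prefill_with_reuse(prompt, cached_prefix, cached_kv, compute_kv):
    """Reuse KV entries for the longest matching prompt prefix and
    compute entries only for the remaining (new) tokens.

    cached_prefix : tokens whose KV entries are already in cached_kv
    compute_kv    : callable producing one KV entry per new token
    Returns (merged_kv_cache, number_of_reused_entries).
    """
    # Length of the matching prefix between the prompt and the cached prefix.
    n = 0
    while (n < len(prompt) and n < len(cached_prefix)
           and prompt[n] == cached_prefix[n]):
        n += 1
    reused = cached_kv[:n]                       # multiplexed, no recomputation
    fresh = [compute_kv(t) for t in prompt[n:]]  # new key-value pairs
    return reused + fresh, n
```

Only tokens past the matched prefix trigger computation, which is the source of the time and resource savings claimed for the pre-filling stage.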
Step S206: the decoding task of the model inference is scheduled to a second execution instance, so that the key-value cache including the context cache is decoded on the second execution instance to obtain an inference result.
The second execution instance is a computing node that executes the decoding task in the model inference; the second execution instance and the first execution instance may share the same memory pool or may be located in different memory areas.
The decoding task is the subsequent key stage of the model inference process. It is executed on the second execution instance, takes as input the key-value cache (including the context cache) generated by the pre-filling task, and produces an inference result through a specific decoding algorithm and model structure.
In some embodiments, the second execution instance receives the key-value cache including the context cache and performs the decoding operation using the cached data. Because the key-value cache was enriched and optimized in the pre-filling stage, it contains a large amount of information related to the current inference request, which can be used more effectively during decoding; the data in the key-value cache are processed according to the decoding algorithm, and the inference result is restored step by step.
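The step-by-step restoration of the result can be sketched as a simple autoregressive loop. The `step_fn` callback stands in for one forward pass of the model over the key-value cache and is purely illustrative.

```python
def decode(kv_cache, step_fn, max_steps, eos="<eos>"):
    """Repeatedly produce one token from the KV cache until an
    end-of-sequence token or the step limit; each emitted token is
    appended to the output so later steps can condition on it."""
    output = []
    for _ in range(max_steps):
        token = step_fn(kv_cache, output)
        if token == eos:
            break
        output.append(token)
    return output
```

Because `step_fn` reads the pre-built key-value cache rather than reprocessing the full prompt, each decode step only pays for one new token.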
Step S208: the global scheduler feeds the inference result back to the requesting end.
In this embodiment, after receiving a model inference request, the global scheduler determines a first execution instance according to the configured global hint tree and the local awareness policy. The global hint tree stores information including the inference task type, historical data features, model parameters, and execution-instance resource attributes, while the local awareness policy considers working conditions such as the hardware resources and workload of each execution instance. The first execution instance performs a pre-filling operation on the new input data by using the context cache to generate a key-value cache including the context cache; the decoding task of the model inference is then scheduled to a second execution instance, which decodes using that key-value cache, thereby obtaining the inference result.
As shown in fig. 3, in one embodiment of the present disclosure, determining, by the global scheduler, a first execution instance based on the local awareness policy of the configured global hint tree in response to a received inference request includes:
Step S302: in response to the inference request, the global scheduler parses the hint identifier of the request.
In some embodiments, the global scheduler analyzes the inference request and extracts hint identifiers from it. These may be identifiers of the request type (e.g., text-processing inference, image-recognition inference, or other types), request-source identifiers (a specific user, application, etc.), feature identifiers of the input data (e.g., the language of a text, the resolution range of an image), or special attribute identifiers that may affect how the request is processed (e.g., real-time requirements or special accuracy requirements).
Step S304: searching the global hint tree for tree nodes that store specified hint information matching the hint identifier.
In some embodiments, each node in the global hint tree stores defined hint information covering various possible inference scenarios, task features, and related attributes; the scheduler attempts to find a matching tree node by comparing the hint identifier of the inference request with the specified hint information stored at each node.
Step S306: obtaining, based on the index of the matched tree node, the instance location where the context cache of the inference request is stored.
In some embodiments, from the index information of the matched tree node, the global scheduler obtains the instance location of the context cache of the inference request. This instance location is a pointer or address to the storage area holding the context information associated with the current request. The information in the context cache exists as key-value pairs accumulated while similar inference requests were processed previously; it has significant reference value for the current request and can help accelerate processing.
Step S308: perceiving the resource environment and load information of the instance location based on the local awareness policy, so as to determine the first execution instance from the perception result.
In some embodiments, after obtaining the instance location of the context cache, the global scheduler uses the local awareness policy to perceive the resource environment and load information corresponding to that location. The policy perceives the computing power (such as the number and frequency of CPU cores and the performance parameters of the GPU), the available memory size, the network bandwidth, and the current workload (such as the number of tasks already allocated to the instance and their resource occupancy). By comprehensively evaluating this information, the scheduler determines the most suitable first execution instance among all available execution instances: the selected instance can not only reuse the context cache but also offers sufficient resources and a reasonable load.
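One way to combine these factors into a selection rule is a weighted score over candidate instances. The field names and weight values below are illustrative assumptions, not parameters fixed by the embodiment.

```python
def pick_first_instance(candidates, required_mem_gb):
    """Select the execution instance that can both reuse the most cached
    context and still offer adequate resources and a reasonable load."""
    def score(inst):
        if inst["free_mem_gb"] < required_mem_gb:
            return float("-inf")                  # cannot host the prefill task
        return (2.0 * inst["cached_prefix_len"]   # favor cache reuse
                + 0.5 * inst["free_mem_gb"]       # favor spare memory
                - 1.0 * inst["load"])             # penalize busy instances
    best = max(candidates, key=score)
    if score(best) == float("-inf"):
        raise RuntimeError("no instance satisfies the memory requirement")
    return best
```

The hard memory cutoff ensures feasibility first; among feasible instances, cache-reuse potential dominates, matching the goal of reducing repeated computation.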
In this embodiment, the context cache location is determined by searching the global hint tree for a node matching the hint identifier, and the first execution instance is selected in combination with the local awareness policy. Existing context cache information can thus be used accurately, reducing repeated computation; at the same time, tasks are distributed according to the actual resource environment and load of each instance, so resources are allocated reasonably, inference is accelerated, and user waiting time is reduced.
In one embodiment of the present disclosure, scheduling a model-inferred pre-fill task to a first execution instance to perform a pre-fill operation based on the first execution instance to obtain a key-value cache comprising a context cache, includes:
Specifically, a reusable context cache is queried based on the instance location and used as a first key-value cache; the pre-filling task is scheduled to the first execution instance so that an accelerated pre-filling operation is performed on the inference request based on the first execution instance and the first key-value cache, obtaining a second key-value cache; and the key-value cache including the context cache is determined from the first key-value cache and the second key-value cache.
In some embodiments, a reusable context cache is looked up in the corresponding storage area based on the previously determined instance location, which points to a historical data storage location associated with the current inference request. The first execution instance analyzes the key-value pairs in the first key-value cache to determine which portions are relevant to the current request, multiplexes the relevant pairs directly in the pre-filling operation, and skips recomputation of that data. The new input data in the current request (i.e., the portion not covered by the first key-value cache) is processed using the pre-filling algorithm, and during this processing the first execution instance generates new key-value pairs recording the results.
In some embodiments, these new key-value pairs are combined with the key-value pairs multiplexed from the first key-value cache to form the second key-value cache. During combination, the key-value pairs may be optimized, for example reorganized according to their use frequency, relevance, and other factors, so as to improve the access and storage efficiency of the cache.
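The reorganization mentioned above could, for example, order the merged pairs by access frequency so that hot entries are scanned first. The frequency map is a hypothetical bookkeeping structure, not one the embodiment specifies.

```python
def reorganize_kv(kv_pairs, access_freq):
    """Sort merged key-value pairs so frequently used entries come first,
    improving lookup locality in the combined cache. Unseen keys are
    treated as frequency 0 and sink to the end."""
    return sorted(kv_pairs,
                  key=lambda pair: access_freq.get(pair[0], 0),
                  reverse=True)
```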
In this embodiment, reusing the context cache (the first key-value cache) prevents duplicate computation of existing results and reduces consumption of computational resources (e.g., CPU and GPU time). The accelerated pre-filling operation uses the information in the first key-value cache to skip part of the computation steps, reducing the processing time of the pre-filling stage. Because the first key-value cache contains information from previously handling similar inference requests, multiplexing it in the current request maintains the consistency and accuracy of the inference process. The optimization applied while generating the second key-value cache and integrating the final key-value cache (such as organizing key-value pairs by relevance and use frequency) also helps improve inference accuracy, so the system can output more reliable results.
In one embodiment of the present disclosure, scheduling the pre-filling task to the first execution instance to perform an accelerated pre-filling operation on the inference request based on the first execution instance and the first key-value cache, obtaining a second key-value cache, includes:
performing, by the first execution instance, the pre-filling operation on the inference request based on multiplexing key-value pairs having different association relationships, to obtain the second key-value cache.
In this embodiment, the context cache pairs undergo correlation analysis and multiplexing so that previous computation results are used directly and the data-reprocessing step is skipped. This ensures that the information used during pre-filling is a targeted multiplexing operation relevant to the current inference task; combined with the processing of the new input data, the second key-value cache generated by the pre-filling operation contains more accurate and comprehensive information, providing better data support for subsequent inference steps and improving the accuracy of the inference result.
As shown in fig. 4, in one embodiment of the present disclosure, before determining, by the global scheduler, the first execution instance based on the local awareness policy of the configured global hint tree in response to a received inference request, the method further includes:
Step S402: registering the memory resources and computing capacity of each cloud node of the data center into a memory pool, and generating corresponding execution instances so as to configure the information of the execution instances in the global scheduler.
In some embodiments, each cloud node has its own memory resources and computing power in the data center. First, the memory resources and the computing power information of the cloud nodes are registered in a memory pool. Memory pools are a mechanism for unified management of memory resource allocation.
Through the registration process, a corresponding execution instance is generated for each cloud node. These execution instances represent logical units that can perform tasks on the cloud node. The details of these execution instances (including their memory resources, computing power, etc. related attributes) are then configured into a global scheduler. The global scheduler will use this information to make task allocation and scheduling decisions to reasonably allocate tasks to execution instances on the various cloud nodes in subsequent steps.
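The registration step above can be sketched as follows; the attribute names and the dictionary-based instance records are illustrative assumptions, not the embodiment's actual registration API.

```python
class MemoryPool:
    """Central registry of cloud-node memory and compute; registering a
    node yields an execution-instance record the scheduler can consult."""
    def __init__(self):
        self.instances = {}

    def register(self, node_id, mem_gb, tflops):
        # One execution instance per registered cloud node, carrying its
        # memory resources, computing power, and current load.
        instance = {"node": node_id, "mem_gb": mem_gb,
                    "tflops": tflops, "load": 0}
        self.instances[node_id] = instance
        return instance

class GlobalScheduler:
    def __init__(self, pool):
        # The scheduler is configured with the pool's instance information
        # and later uses it for task allocation decisions.
        self.pool = pool

    def known_instances(self):
        return list(self.pool.instances)
```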
Step S404: constructing a tree framework based on a multi-layer radix tree structure.
In some embodiments, a radix tree is a special data structure that can effectively organize and store data, especially hierarchical data; the multi-layer structure can partition and store information along different dimensions or attributes, providing a basic architecture for subsequently storing and managing information about execution instances and inference tasks.
Step S406: initializing the tree framework based on the information of the execution instances, to obtain an initialized tree structure.
In some embodiments, the constructed tree framework is initialized based on the execution-instance information. This process may include storing the relevant attributes of each execution instance (such as memory size and computing-capability type) into the nodes or levels of the tree framework according to certain rules; for example, instances may be placed in a specific level or branch according to their computing capability, so that an execution instance can be found quickly according to a task's computing requirements. Through this initialization, the tree framework is populated with content about the actual execution instances.
Step S408: configuring the initialized tree structure based on the mapping relationship between hint information and context caches, and on the execution instances storing those context caches, to obtain the global hint tree.
In some embodiments, the initialized tree structure is further configured by taking into account the mapping between hint information and context caches, together with the information of the execution instances storing those caches. The hint information may include the type of the inference task, the characteristics of the input data, special user requirements, and the like; each piece of hint information is associated with its corresponding context cache and the execution instance storing that cache, and stored into the initialized tree structure.
In some embodiments, if an inference task is image recognition and the input is a color image of a specific resolution, then the hint information associated with this type of task, the context cache accumulated from previously processing similar images, and the information of the execution instance storing that cache are all integrated into the corresponding location of the tree structure. Through such configuration, the initialized tree structure becomes a global hint tree that provides a comprehensive decision basis for the global scheduler when processing inference requests, helping it quickly find a suitable execution instance and use the context cache to accelerate inference.
In this embodiment, the memory resources and computing power of each cloud node of the data center are registered into the memory pool and execution instances are generated, so that resources are managed centrally and the global scheduler can better allocate tasks according to this information. In addition, the multi-layer radix tree structure can quickly locate and retrieve information according to different attributes, reducing the time complexity of finding a suitable execution instance and the related cache information and improving the system's response speed. The resulting global hint tree organically combines hint information, context caches, and execution instances: when processing an inference request, the global scheduler can use the tree to determine the most suitable execution instance and accelerate inference using the context cache. Processing speed is thereby improved, historical data (the context cache) is used more effectively, inference accuracy and efficiency are further improved, and the whole data center performs better on complex inference tasks.
In some embodiments, the global scheduler uses a multi-layer radix tree structure to construct the global hint tree. Each node in the tree stores not only hint information but also an index to the instance locations where the relevant KV caches are stored. When a new inference request arrives, the scheduler queries the tree to determine the first execution instance with the highest cache-reuse potential; this strategy allows the system to intelligently allocate requests to the most appropriate instance, maximizing resource utilization and response speed.
As shown in FIG. 5, the present disclosure also provides a dependency framework to support and leverage dependencies between and within requests to integrate a variety of existing inter-request and intra-request optimization techniques into a unified system.
To build context caches on PD shared inference instances, index APIs such as insert and match operations can be used; in addition, for disaggregated inference, active Key-Value (KV) caches can be transferred from a Prefill instance to a Decode instance by calling a transfer API.
As shown in FIG. 5, based on the global scheduler and the global hint tree, the dependency framework dispatches PD (Parallel Distribution) instances, Prefill instances, Decode instances, and SP (Storage Provider) instances to the elastic memory pool, and the KV cache is applied in combination with index operations to achieve inference optimization.
In one embodiment of the present disclosure, before determining, by the global scheduler, the first execution instance based on the local awareness policy of the configured global hint tree in response to a received inference request, the method further includes:
detecting a key-value cache to be stored; determining a storage location for the key-value cache based on an index interface and a configured fixed quantity of memory blocks; and establishing, based on the storage location, an index relationship between the key-value cache to be stored and the associated context information.
In some embodiments, KV caches are stored in fixed-size memory blocks whose allocation and release are managed through simple API calls; by configuring an indexing mechanism, hint marks are mapped to the corresponding KV cache locations, so the cache can be retrieved and updated quickly.
In some embodiments, large-page techniques may also be used to reduce memory fragmentation, and the data-transfer API may be optimized to reduce communication overhead, especially when handling KV cache transfers across instances. Because KV caches are discrete, transferring only one memory block per network API call may lead to excessive network calls and increased communication overhead; by configuring a memory-block aggregation policy, multiple small memory blocks are aggregated into a large block for transmission, reducing the number of network API calls. By designing a universal API, different network environments and hardware configurations, including networks of different speeds and different types of memory media, can be accommodated, which improves the efficiency and scalability of the LLM service; intelligent scheduling and memory management significantly reduce latency and resource consumption.
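The aggregation policy described above amounts to batching block transfers so that each network call carries several blocks instead of one. The fixed batch size below is an illustrative parameter, not a value given by the embodiment.

```python
def aggregate_transfers(block_ids, blocks_per_call):
    """Group discrete KV cache blocks so that each network API call
    carries up to blocks_per_call blocks instead of a single block,
    reducing the number of calls and the per-call overhead."""
    return [block_ids[i:i + blocks_per_call]
            for i in range(0, len(block_ids), blocks_per_call)]
```

For example, transferring 10 discrete blocks with a batch size of 4 takes 3 network calls rather than 10.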
As shown in fig. 6, in one embodiment of the present disclosure, scheduling the decoding task of the model inference to the second execution instance to decode, on the second execution instance, the key-value cache including the context cache and obtain an inference result, includes:
Step S602: detecting, by the global scheduler, the computing power of the registered execution instances.
In some embodiments, the global scheduler is responsible for managing and monitoring the registered execution instances in the system. It initiates a mechanism for detecting the computing power of each execution instance, including the number and frequency of CPU cores, the performance parameters of the GPU (if present), the current CPU and GPU load, the amount of free memory, and the like.
Step S604: selecting the second execution instance based on the detection result.
In some embodiments, a suitable second execution instance is selected according to the detected computing power. The selection comprehensively considers the requirements of the decoding task: for example, decoding a high-resolution image with a complex deep-learning model may require an instance with strong GPU computing power and enough memory to load the model and cache data, whereas simple statistical decoding over a large amount of text may favor an instance with more CPU cores, sufficient memory, and a lower current load.
Step S606, the decoding task is scheduled to the second execution instance.
In some embodiments, the global scheduler assigns and transmits the decoding task to the selected second execution instance. This involves the task's scheduling and delivery mechanisms; parameters and information related to the decoding task, such as model configuration parameters and the relevant specifications of the input data, may also be delivered to the second execution instance so that it can properly prepare the decoding environment and perform the decoding operation.
In step S608, the transfer interface is invoked to transfer the key value cache including the reuse context from the first execution instance to the second execution instance, so as to perform a decoding operation on the key value cache including the reuse context based on the second execution instance, thereby obtaining an inference result.
In some embodiments, invoking the transfer interface transfers the key-value cache containing the reused context from the first execution instance to the second execution instance.
After receiving the key-value cache, the second execution instance performs the decoding operation; during decoding, the inference result is progressively reconstructed using the information in the key-value cache according to a preset decoding model and algorithm logic.
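The transfer step can be sketched as a minimal TRANSFER-style call that moves a request's key-value cache from the prefill instance to the decode instance. The class and function names here are hypothetical placeholders, not the disclosed API.

```python
from typing import Dict, List

class ExecutionInstance:
    """Toy execution instance holding KV caches keyed by request id."""
    def __init__(self, name: str):
        self.name = name
        self.kv_store: Dict[str, List[float]] = {}

def transfer_kv_cache(src: ExecutionInstance, dst: ExecutionInstance,
                      request_id: str) -> int:
    """Illustrative transfer-interface call: move the key-value cache
    (including the reused context) from the prefill instance to the
    decode instance, so decoding can proceed without recomputation."""
    kv = src.kv_store.pop(request_id)   # removed from the prefill side
    dst.kv_store[request_id] = kv       # now resident on the decode side
    return len(kv)                      # number of entries moved (toy metric)
```

In a real system the payload would be GPU tensors moved over RDMA or a similar fabric; the sketch only captures the ownership handoff.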
In this embodiment, the second execution instance is selected by detecting the computing power of the execution instances, ensuring that the decoding task is allocated to the execution environment that best suits its resource needs; the decoding task is scheduled exclusively to the appropriate second execution instance, and the key-value cache including the context cache is transmitted to it. The second execution instance can then fully utilize the cached information during decoding and reduce the amount of data processed, thereby significantly improving the decoding speed and accelerating the entire inference process.
In one embodiment of the present disclosure, before the global scheduler determines, in response to a received inference request for model inference, the first execution instance based on the local awareness policy of the configured global hint tree, the method further comprises:
receiving an inference request; in response to the received inference request, analyzing the scale and/or task characteristics of the inference request; and, based on the analysis result, pushing the inference request to an edge execution instance deployed at an edge node and/or pushing the inference request to the global scheduler so that the global scheduler analyzes the inference request.
In one embodiment of the present disclosure, pushing the inference request to an edge execution instance deployed at an edge node and/or pushing the inference request to the global scheduler based on the analysis result includes: when the inference request is a delay-sensitive task, scheduling the prefill task to the edge node and scheduling the decode task to the global scheduler.
In this embodiment, inference requests are reasonably distributed according to their scale and characteristics, preventing all tasks from pouring into the global scheduler or the edge nodes. Small-scale tasks with strict real-time requirements are processed at the edge nodes, fully utilizing local edge resources, while large-scale complex tasks are coordinated and processed by the global scheduler, making better use of cloud resources and achieving optimal resource configuration.
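The routing decision above can be sketched as a small policy function. The token-count cutoff and the request fields are illustrative assumptions; the only behavior taken from the text is that delay-sensitive requests run prefill at the edge and hand decode to the global scheduler.

```python
def route_request(request: dict) -> dict:
    """Decide where each stage of an inference request runs.

    - Delay-sensitive tasks: prefill at the edge, decode handed to the
      global scheduler (as described in the embodiment above).
    - Small requests: handled entirely at the edge.
    - Large requests: coordinated by the global scheduler in the cloud.
    The 512-token cutoff is a hypothetical small-scale threshold.
    """
    if request["delay_sensitive"]:
        return {"prefill": "edge", "decode": "global_scheduler"}
    if request["prompt_tokens"] < 512:
        return {"prefill": "edge", "decode": "edge"}
    return {"prefill": "global_scheduler", "decode": "global_scheduler"}
```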
In some embodiments, lightweight memory pool instances are deployed at edge nodes near the user for handling small-scale or low-latency reasoning requests.
For requests requiring a large amount of computing resources, the edge node may forward the task to the disclosed cloud-based system for processing.
The edge nodes and the cloud memory pool instances keep the KV cache data synchronized, ensuring the consistency of the context information.
For delay-sensitive inference tasks, the prefill stage is preferentially carried out on the edge node, and the result is then transmitted to the cloud for the decode stage.
The global scheduler can intelligently decide whether to execute the task at the edge or the cloud according to the characteristics of the task and the resource condition of the edge node.
Processing data at the edge node reduces the amount of data transmitted to the cloud, thereby improving data security and privacy protection.
In this embodiment, the cooperation of edge computing and cloud computing further reduces the response time of inference requests, which is particularly suitable for application scenarios with high real-time requirements. Data synchronization and intelligent task distribution ensure computing efficiency and data consistency while enhancing the flexibility and reliability of the system, allowing it to adapt to different workloads and performance requirements, whether on a cloud service platform or in an edge computing environment. In addition, in an extended scheme, processing part of the data at the edge nodes reduces the volume of data transmission and enhances data security and privacy, and the integration of edge computing enables applications with high real-time requirements to respond quickly.
In one embodiment of the disclosure, the method further comprises continuously monitoring memory resources and computing power of the cloud node during the process of executing the pre-filling task and/or the decoding task, so as to adjust task allocation based on monitoring results.
In this embodiment, the resources are adjusted according to the actual requirements by configuring the dynamic memory allocation and management mechanism of the memory pool.
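The dynamic memory pool mentioned above can be illustrated with a toy block allocator: blocks are handed out on demand and returned to a free list when a request completes. This is a sketch under assumed semantics, not the disclosed memory pool design.

```python
class MemoryPool:
    """Toy elastic memory pool: fixed-size blocks allocated on demand
    and recycled through a free list, mimicking dynamic KV-cache
    memory management."""
    def __init__(self, total_blocks: int):
        self.free = list(range(total_blocks))
        self.owner = {}  # block id -> request id

    def allocate(self, request_id: str, n_blocks: int):
        """Reserve n_blocks for a request, or return None if the pool is
        short (the caller would then optimize locally or migrate)."""
        if len(self.free) < n_blocks:
            return None
        blocks = [self.free.pop() for _ in range(n_blocks)]
        for b in blocks:
            self.owner[b] = request_id
        return blocks

    def release(self, request_id: str) -> int:
        """Return all of a request's blocks to the free list."""
        freed = [b for b, r in self.owner.items() if r == request_id]
        for b in freed:
            del self.owner[b]
            self.free.append(b)
        return len(freed)
```

A failed `allocate` models the "insufficient memory" condition that triggers the intra-node optimization and migration steps described next.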
If the cloud node executing the prefill task has insufficient memory resources or an overloaded computing capacity, resource optimization within the node is considered first; if internal optimization cannot meet the requirement, other idle cloud nodes with sufficient memory and computing capacity, or with a lighter load, are sought. Part of the prefill tasks are migrated to the new node, and the input data and associated computing tasks are reassigned. During migration, to ensure data integrity and consistency, unprocessed input data and generated intermediate results are securely transferred to the new node, for example through appropriate data transfer protocols and verification mechanisms.
For a shortage of resources on a cloud node executing decode tasks, optimization is first attempted within the node, similarly to prefill tasks. If computing resources are short, the execution order of the decoding algorithm can be adjusted so that the parts with greater influence on the result are processed first, or approximate algorithms can be adopted to reduce the computation while keeping the quality of the decoding result within an acceptable range. If the internal optimization is ineffective, other suitable cloud nodes are sought to share the load, and certain subtasks of the decode task (such as the decoding of different data blocks or different parts) can be distributed to those nodes.
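The migration step for overloaded nodes can be sketched as follows. The node dictionaries, the 90% memory threshold, and the "most free memory wins" target choice are all illustrative assumptions standing in for the real monitoring and migration machinery.

```python
def rebalance(nodes, mem_limit: float = 0.9):
    """Migrate tasks off any node whose memory use exceeds mem_limit,
    choosing as target the node with the most free memory.

    Each node is a dict: {"name", "mem_total", "mem_used", "tasks"},
    each task a dict: {"id", "mem"}. Returns the migration plan as
    (task_id, source_node, target_node) tuples.
    """
    plan = []
    for node in nodes:
        while node["mem_used"] / node["mem_total"] > mem_limit and node["tasks"]:
            task = node["tasks"].pop()
            # Pick the peer with the largest amount of free memory.
            target = max((n for n in nodes if n is not node),
                         key=lambda n: n["mem_total"] - n["mem_used"])
            target["tasks"].append(task)
            node["mem_used"] -= task["mem"]
            target["mem_used"] += task["mem"]
            plan.append((task["id"], node["name"], target["name"]))
    return plan
```

The real system would additionally transfer unprocessed input data and intermediate results with integrity checks, as the paragraph above notes; the sketch shows only the placement decision.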
As shown in fig. 7, a model inference optimization method according to another embodiment of the present disclosure is applied to a cloud service platform, where the cloud service platform has a plurality of data centers, each of which is deployed with a high-performance GPU and a CPU server cluster, and the model inference optimization method includes:
in step S702, an inference system is deployed on the cloud service platform, and each data center is configured with an elastic memory pool and an instance of a global scheduler.
In step S704, the memory resources and computing power of each server are registered in the memory pool, and a resource index is established.
In step S706, the global scheduler is configured with the memory pool instance information of each data center, and initializes the global hint tree.
In step S708, when the user submits an inference request, the request is first sent to the global scheduler; the scheduler analyzes the characteristics of the request, such as its hint prefix, and searches the global hint tree for a matching node to determine the best memory pool instance.
In step S710, the memory pool checks whether the hint prefix of the request exists in the index, and if so, retrieves the corresponding KV cache.
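The prefix lookup in steps S708–S710 behaves like a longest-prefix match over a trie keyed by prompt tokens. The sketch below is an illustrative data structure under that assumption; the disclosed global hint tree may differ in detail.

```python
class HintTreeNode:
    """Node of a toy global hint tree: children keyed by token, with an
    optional payload naming the memory pool instance that already holds
    the KV cache for the prefix ending here."""
    def __init__(self):
        self.children = {}
        self.instance = None

class GlobalHintTree:
    def __init__(self):
        self.root = HintTreeNode()

    def insert(self, prefix_tokens, instance: str):
        """Record that `instance` caches the KV state for this prefix."""
        node = self.root
        for tok in prefix_tokens:
            node = node.children.setdefault(tok, HintTreeNode())
        node.instance = instance

    def longest_match(self, tokens):
        """Return (matched_length, instance) for the longest cached
        prefix of `tokens`; (0, None) when nothing matches."""
        node, best = self.root, (0, None)
        for i, tok in enumerate(tokens):
            if tok not in node.children:
                break
            node = node.children[tok]
            if node.instance is not None:
                best = (i + 1, node.instance)
        return best
```

A hit tells the scheduler both which memory pool instance to route to and how many prompt tokens need no recomputation during prefill.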
In step S712, according to the scheduling result, prefill tasks are distributed to instances with sufficient memory resources to generate the KV cache, and decode tasks are distributed to instances with stronger computing power to generate output using the retrieved KV cache.
In step S714, the memory pool dynamically adjusts memory allocation based on the retrieved KV cache, optimizing the storage and access of the KV cache.
In step S716, the scheduler monitors the load condition of each instance in real time, dynamically adjusts task allocation, and realizes load balancing.
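The load balancing of step S716 can be illustrated with a least-load dispatcher: instances are kept ordered by outstanding work, and each new task goes to the lightest one. This is a deliberately minimal sketch of the idea, not the scheduler's actual policy.

```python
import heapq

class LoadBalancer:
    """Least-load dispatch: keep instances in a min-heap keyed by
    outstanding task count, and always assign the next task to the
    currently lightest instance."""
    def __init__(self, instance_names):
        self.heap = [(0, name) for name in instance_names]
        heapq.heapify(self.heap)

    def dispatch(self, task_id: str) -> str:
        """Assign task_id to the least-loaded instance and return its name."""
        load, name = heapq.heappop(self.heap)
        heapq.heappush(self.heap, (load + 1, name))
        return name
```

Over a stream of tasks this keeps per-instance load within one task of even, avoiding the overloaded/idle imbalance the step describes; a production scheduler would update the heap from real-time load reports rather than a simple counter.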
In step S718, the inference result is output and returned to the requesting end through the scheduler.
In this embodiment, through dynamic resource allocation and intelligent scheduling, the cloud platform system can use computing and memory resources more efficiently, reducing resource waste. The context caching and disaggregated inference architecture significantly reduces repeated computation and speeds up inference, thereby reducing user waiting time. By separating tasks between prefill-only and decode-only instances, the system can concentrate on the computational optimization of each stage, improving overall computing efficiency. The global scheduler monitors and adjusts task allocation in real time to achieve load balancing, avoiding the situation where some nodes are overloaded while others are idle.
As shown in fig. 8, according to another embodiment of the present disclosure, the global scheduler receives an inference request and matches it to an optimal execution (memory) instance based on the local awareness policy and the global hint tree.
The prefill task is scheduled to a first execution instance (a prefill instance) and the decode task to a second execution instance (a decode instance), optionally in combination with a PD-colocated (prefill-decode colocated) instance, to execute the inference operation and obtain an inference result; the global hint tree is then updated based on the KV cache generated during inference.
Each instance comprises a CPU and a GPU; the CPU is attached to DRAM, and the GPU is packaged with HBM. A CPU memory pool optimizes memory management from the perspective of the CPU's access to and use of memory: in a multi-task environment, multiple processes or threads may compete for memory resources at the same time, and the CPU needs to acquire and process memory data efficiently. The CPU memory pool can design the size and allocation strategy of memory blocks according to factors such as the CPU's cache structure and instruction execution pattern. The DRAM and the HBM store the historical KV cache and the active KV cache, respectively.
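The active/historical split between HBM and DRAM behaves like a two-tier cache: recently used KV entries stay in the fast tier, and least-recently-used entries are demoted to the larger, slower tier. The sketch below simulates both tiers in ordinary Python dictionaries; capacities and promotion-on-reuse are illustrative assumptions.

```python
from collections import OrderedDict

class TieredKVCache:
    """Two-tier KV cache sketch: active entries live in simulated HBM,
    and least-recently-used entries are demoted to simulated DRAM,
    mirroring the active/historical KV cache split."""
    def __init__(self, hbm_capacity: int):
        self.hbm = OrderedDict()   # active KV cache (LRU order)
        self.dram = {}             # historical KV cache
        self.cap = hbm_capacity

    def put(self, key, kv):
        self.hbm[key] = kv
        self.hbm.move_to_end(key)          # mark as most recently used
        while len(self.hbm) > self.cap:
            old_key, old_kv = self.hbm.popitem(last=False)
            self.dram[old_key] = old_kv    # demote the LRU entry to DRAM

    def get(self, key):
        if key in self.hbm:
            self.hbm.move_to_end(key)
            return self.hbm[key]
        if key in self.dram:               # promote back to HBM on reuse
            self.put(key, self.dram.pop(key))
            return self.hbm[key]
        return None                        # cache miss: must recompute
```

Promotion on reuse keeps contexts that are actively being decoded in the fast tier while long-idle histories occupy only the cheaper tier.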
To implement sequence parallelism (SP) inference, the TRANSFER API can be used to exchange the output of the attention mechanism, i.e., to transport the KV cache, between different instances; this generic distributed transfer API simplifies data transport between instances and hides the underlying hardware heterogeneity within the system.
In this embodiment, the distributed memory pool and the global scheduler support horizontal expansion of the system, allowing it to adapt to ever-increasing computing requirements and high-concurrency scenarios. In addition, the dynamic memory management mechanism of the memory pool optimizes memory use, reduces memory fragmentation, and improves memory access speed, and improving resource utilization and optimizing the computing flow help reduce the operating cost of a data center. The cloud platform system can provide efficient services both in a large-scale cloud computing environment and in an edge computing environment close to the user, and its fast response and efficient inference services improve the experience and satisfaction of end users. Thus configured, the cloud platform system not only advances large language model serving technology, but also provides a practical solution for the actual deployment and operation of artificial intelligence applications.
It is noted that the above-described figures are merely schematic illustrations of processes involved in a method according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
A model inference optimizing apparatus 900 according to an embodiment of the present disclosure is described below with reference to fig. 9. The model inference optimizing apparatus 900 shown in fig. 9 is only one example, and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
The model inference optimizing apparatus 900 is represented in the form of hardware modules or software modules. The components of the model inference optimization apparatus 900 may include, but are not limited to: a determining unit 902 configured to determine, by the global scheduler, a first execution instance based on the local awareness policy of the configured global hint tree in response to a received inference request for model inference, the first execution instance being capable of reusing a context cache of the inference request; a first scheduling unit 904 configured to schedule a prefill task of the model inference to the first execution instance to perform a prefill operation based on the first execution instance and obtain a key-value cache including the context cache; a second scheduling unit 906 configured to schedule a decode task of the model inference to a second execution instance to decode the key-value cache including the context cache based on the second execution instance and obtain an inference result; and a feedback unit 908 configured to feed back, by the global scheduler, the inference result to the requesting end.
As shown in fig. 10, the electronic device 1000 is embodied in the form of a general purpose computing device. The components of the electronic device 1000 may include, but are not limited to, at least one processing unit 1010 described above, at least one memory unit 1020 described above, and a bus 1030 that connects the various system components, including the memory unit 1020 and the processing unit 1010.
Wherein the storage unit stores program code that is executable by the processing unit 1010 such that the processing unit 1010 performs steps according to various exemplary embodiments of the present disclosure described in the above section of the present specification. For example, the processing unit 1010 may perform the scheme as described in fig. 2.
The memory unit 1020 may include readable media in the form of volatile memory units such as Random Access Memory (RAM) 10201 and/or cache memory unit 10202, and may further include Read Only Memory (ROM) 10203.
The storage unit 1020 may also include a program/utility 10204 having a set (at least one) of program modules 10205, such program modules 10205 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 1030 may represent one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 1000 can also communicate with one or more external devices 1070 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 1000, and/or with any device (e.g., router, modem, etc.) that enables the electronic device 1000 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 1050. Also, electronic device 1000 can communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 1060. As shown, the network adapter 1060 communicates with other modules of the electronic device 1000 over the bus 1030. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with the electronic device 1000, including, but not limited to, microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a terminal device, or an electronic device, etc.) to perform the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, a computer-readable storage medium having stored thereon a program product capable of implementing the method described above in the present specification is also provided. In some possible implementations, aspects of the present disclosure may also be implemented in the form of a program product comprising program code for causing an electronic device to carry out the steps according to the various exemplary embodiments of the disclosure as described in the "exemplary methods" section of this specification, when the program product is run on the electronic device.
A program product for implementing the above-described method according to an embodiment of the present disclosure may employ a portable compact disc read-only memory (CD-ROM) and include program code, and may be run on an electronic device, such as a personal computer. However, the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of a readable storage medium include an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the internet of things terminal, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Furthermore, although the steps of the methods in the present disclosure are depicted in a particular order in the drawings, this does not require or imply that the steps must be performed in that particular order, or that all illustrated steps be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following the general principles thereof and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.