Detailed Description
Hereinafter, embodiments of the present application will be described with reference to the accompanying drawings. It should be understood that the description is only illustrative and is not intended to limit the scope of the application. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the application. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the present application.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a convention should be interpreted in the sense in which one of skill in the art would generally understand the convention (e.g., "a system having at least one of A, B and C" would include, but not be limited to, systems having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.).
Graphics processors are widely used in servers and data centers to handle large-scale computing tasks (e.g., graphics processing, deep learning, graph computation, etc.). However, a traditional discrete graphics processor has limited device memory capacity and has difficulty handling large-scale applications (e.g., social network analysis, recommendation systems). To solve this problem, the present application proposes a unified virtual memory (Unified Virtual Memory, abbreviated as UVM) management mechanism. UVM allows the CPU and the graphics processor to share the same virtual address space through demand paging and automatic data migration, thereby reducing the burden of manually managing memory and improving the portability and usability of programs.
The relationship between virtual memory management techniques and thread block operation is further illustrated by the description of the processor in FIG. 1.
FIG. 1 shows a schematic diagram of a processor according to an embodiment of the application.
As shown in FIG. 1, the processor 100 may include a general-purpose graphics processor (GPGPU), a type of graphics processor that uses hardware designed for graphics tasks to perform general-purpose computing tasks that would otherwise be processed by a central processor.
As shown in FIG. 1, the processor 100 may include a plurality of streaming multiprocessors (Streaming Multiprocessor, abbreviated SM, also called compute units) 110 and a storage unit 120.
Illustratively, the SM is the basic computational unit of a general-purpose graphics processor, including an instruction scheduler, registers, computational cores, and the like. These general-purpose computations may have no relation to graphics processing; the streaming multiprocessor can handle non-graphics data thanks to the powerful parallel processing capability and programmable pipeline of general-purpose graphics processors.
As shown in FIG. 1, on each streaming multiprocessor 110, a plurality of thread blocks 111 (Thread Block, TB for short) may run in parallel. Each thread block includes a plurality of thread bundles 111-1 (Warps), and each thread bundle 111-1 includes multiple threads, illustrated here as a pipeline.
Alternatively, the thread bundle 111-1 is the smallest unit of operation on the SM; typically, 32 threads form one thread bundle. Each thread bundle 111-1 operates in single-instruction multiple-thread (Single Instruction Multiple Threads, SIMT for short) mode.
Illustratively, threads in the thread bundle 111-1 access virtual memory addresses based on task requests, and a translation lookaside buffer (Translation Lookaside Buffer, TLB for short) walks the page table to find the physical memory address that has a mapping relationship with the virtual memory address. The data to be executed by the thread is then fetched, based on the physical memory address, from the physical memory pages stored in the storage unit 120 of the processor 100 so that the thread can run.
If the data is not stored in the storage unit 120 of the processor 100 but is instead stored in the extended storage space of the virtual memory space, it is determined that the thread has a page fault. The page to be migrated, containing the data stored in the extended storage space, may be migrated to the storage unit 120 of the processor 100 via the high-speed serial computer expansion bus standard (Peripheral Component Interconnect Express, PCIe for short), and the thread may be interrupted during the migration.
The extended storage space may include, but is not limited to, a disk and may also include a storage unit configured on the central processor.
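The translation and fault-detection flow described above can be sketched as follows. This is an illustrative Python model only, not the patented implementation; the function names, the page table layout, and the `"device"`/`"extended"` location labels are all hypothetical.

```python
PAGE_SIZE = 4096  # assumed page size for illustration

def translate(page_table, virtual_address):
    """Simplified TLB/page-table walk: map a virtual address to
    (storage location, physical address)."""
    vpn, offset = divmod(virtual_address, PAGE_SIZE)
    location, ppn = page_table[vpn]  # raises KeyError if unmapped
    return location, ppn * PAGE_SIZE + offset

def has_page_fault(page_table, virtual_address):
    """A page fault occurs when the required data resides in the
    extended storage space rather than the processor's storage unit."""
    location, _ = translate(page_table, virtual_address)
    return location != "device"  # "device" = storage unit 120 of processor 100
```

For example, with `page_table = {0: ("device", 7), 1: ("extended", 3)}`, an access inside virtual page 1 faults and would trigger a PCIe migration, while an access inside virtual page 0 resolves directly to device memory.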
The processor is described above. A method of handling page fault exceptions is described below.
FIG. 2 shows a flow chart of an exception handling method according to an embodiment of the application.
As shown in FIG. 2, the method includes operations S210-S230.
In operation S210, in a case where it is determined that a first thread block run by the processor has a page fault exception, the operation of the first thread block is interrupted.
In operation S220, a second thread block is run on the processor.
In operation S230, in response to the current time being the target time, the virtual memory pages to be migrated that caused the page fault exceptions are merged into a batch of migration pages, and the batch is migrated from the extended storage space of the virtual memory space to the storage unit of the processor.
The first thread block may include a plurality of thread bundles. The presence of the page fault exception in the first thread block indicates that the plurality of thread bundles all have page fault exceptions, and the presence of a page fault exception in a thread bundle indicates that the data required to run the thread bundle is stored in the extended storage space.
Alternatively, the extended storage space may include storage space in the server other than the storage unit configured on the processor. The time required to read data from the extended storage space is longer than if the data were read directly from the storage unit configured on the processor.
Alternatively, the extended storage space may include, but is not limited to, a disk, and may also include a storage unit configured on the central processor.
During the running of a thread bundle, the physical memory address of the data required to run the thread bundle is queried from the page table based on the virtual memory address. The storage location of the data is determined based on the physical memory address.
A page fault exception of the first thread block indicates that the plurality of thread bundles of the first thread block all have page fault exceptions, and a page fault exception of a thread bundle indicates that the data required to run the thread bundle is stored in the extended storage space.
Interrupting the operation of the first thread block may refer to blocking the first thread block, i.e., halting the first thread block running on the processor.
After the first thread block is interrupted, a second thread block can be determined from the plurality of standby thread blocks to run the second thread block on the processor, thereby reducing idle time of the processor and improving resource utilization.
A plurality of pages to be migrated may be collected over a predetermined length of time and consolidated to form a batch of migration pages for batch migration from the extended storage space into the storage unit of the processor.
The plurality of pages to be migrated includes at least one of: data required to run the first thread block that caused a page fault exception, and data required to run the second thread block that caused a page fault exception.
Because the second thread block is used to switch out the first thread block, the idle time of the processor can be reduced, and the batch of migration pages collected within the predetermined period can include data required to run both the first thread block and the second thread block. The method is thus adapted to the exception handling mechanism of the virtual memory: the number of pages to be migrated in each batch is increased, which further reduces the performance degradation caused by migrating batches of pages.
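The flow of operations S210-S230 can be sketched as follows. This is a toy software model of the described mechanism, not the actual hardware scheduler; the class name, method names, and block/page labels are all hypothetical.

```python
class BatchingFaultHandler:
    """Toy model of operations S210-S230: interrupt a faulted thread block,
    switch in a standby block, and merge faulted pages into one batch."""

    def __init__(self, standby_blocks):
        self.standby_blocks = list(standby_blocks)
        self.pending_pages = []  # pages to be migrated, across thread blocks
        self.running = None

    def on_block_fault(self, block, faulted_pages):
        # S210: interrupt the faulted thread block and record its pages
        self.pending_pages.extend(faulted_pages)
        # S220: run a standby (second) thread block instead of idling
        self.running = self.standby_blocks.pop(0) if self.standby_blocks else None
        return self.running

    def on_target_time(self):
        # S230: merge all collected pages and migrate them as one batch
        batch, self.pending_pages = self.pending_pages, []
        return batch
```

Under this model, faults from both TB1 and its replacement TB2 accumulate into the same batch, which is the behavior the embodiment credits with increasing the batch size.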
The exception handling method is described above in its entirety. The advantages of the exception handling method provided by the present application will be further described below by way of the examples provided in FIGS. 3A and 3B.
FIG. 3A shows a flow diagram of an exception handling method according to a related example of the present application.
As shown in FIG. 3A, "A Fault, B Fault, C Fault" are trigger signals of page fault events, and the corresponding pages to be migrated include "Page A, Page B, Page C".
When the processor runs the first thread block, a plurality of pages to be migrated of the first thread block are collected within a predetermined time length, including Page A and Page B; Page A and Page B are merged into one batch of migration pages and then migrated to the storage unit of the processor for fault recovery.
Because the time between the exception determination moment of Page C and the exception determination moments of Page A and Page B exceeds the predetermined duration, Page C is merged into the next batch of migration pages. Batches of migration pages are executed serially, strictly in batch order: Batch N is processed first, and Batch N+1 is processed only after Batch N completes. Only after the batch including Page A and Page B has been processed can the batch including Page C be processed.
FIG. 3B is a flow chart of an exception handling method according to an embodiment of the application.
As shown in FIG. 3B, while the processor is running the first thread block TB1, multiple pages to be migrated of the first thread block TB1, e.g., Page A and Page B, are collected within a predetermined period. In a case where it is determined that all the thread bundles in the first thread block TB1 have page faults, the operation of the first thread block TB1 is interrupted, and the second thread block TB2 is switched in to replace the first thread block TB1. The idle time of the processor can thus be reduced, and the batch of migration pages collected within the predetermined period can include the pages to be migrated Page A and Page B, whose faults arose from running the first thread block TB1, as well as the page to be migrated Page C, whose fault arose from running the second thread block TB2. The method therefore matches the exception handling mechanism of the virtual memory, increases the number of pages to be migrated in each batch, and further reduces the performance degradation caused by migrating batches of pages.
Compared with the related example, the exception handling method provided by the embodiment of the disclosure reduces the idle time of the processor and improves its resource utilization. In addition, within the same period, the switching frequency among thread blocks is increased, so that the data volume of pages to be migrated in each batch of migration pages is increased, the migration efficiency is improved, and the efficiency of exception handling is improved.
The advantages of the application are further illustrated above by way of example. How to determine that the first thread block has a page fault exception is described below.
In accordance with an embodiment of the present application, prior to operation S210 shown in FIG. 2, the exception handling method may further include determining that the first thread block has a page fault exception in a case where it is determined that all of the plurality of thread bundles belonging to the first thread block have page fault exceptions.
A first thread block runs on a processor provided by an embodiment of the present disclosure. The first thread block includes M thread bundles, and each thread bundle includes X threads. The X threads may execute the same instruction to process different data. M and X are positive integers greater than 1.
When any one of the X threads has a page fault, because the X threads execute the same instruction, the processing of the whole thread bundle is delayed by the faulted thread. It is therefore determined that the thread bundle has a page fault exception, and the operation of the thread bundle is interrupted.
In a case where all of the M thread bundles belonging to the first thread block have page fault exceptions, it may be determined that the first thread block has a page fault exception, and the first thread block is interrupted. In a case where any one of the M thread bundles belonging to the first thread block operates normally, the first thread block is not interrupted.
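The fault-propagation rule above reduces to two predicates: under SIMT, one faulted thread stalls its whole thread bundle, and a thread block is interrupted only when every one of its M bundles has faulted. A minimal sketch (illustrative function names, not hardware behavior):

```python
def warp_has_fault(thread_faults):
    """SIMT rule: any one faulted thread stalls the whole thread bundle.
    thread_faults: list of X booleans, one per thread in the bundle."""
    return any(thread_faults)

def block_has_fault(warp_fault_flags):
    """A thread block is interrupted only when all M of its thread
    bundles have page fault exceptions."""
    return all(warp_fault_flags)
```

For example, a bundle of 32 threads with a single faulting thread is itself treated as faulted, but a block with one normally running bundle is not interrupted.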
According to the embodiment of the application, the exception handling method matches the actual SIMT execution mode, which can improve the accuracy and granularity of exception handling, the task execution efficiency, and the running stability.
According to the embodiment of the application, determining whether the plurality of thread bundles belonging to the first thread block all have page fault exceptions may include querying the page table, based on the virtual memory address of a target thread running on the processor, for the data required by the target thread to obtain a query result. In a case where the query result indicates that the data required by the target thread is stored in the extended storage space, it is determined that the thread bundle to which the target thread belongs has a page fault exception, and the execution of that thread bundle is interrupted.
Multiple threads in the thread bundle respectively process different data according to the same instruction.
Each target thread handles its own task and fetches the data required for the task. Alternatively, the virtual memory address may be used to walk the page table. The page table characterizes the mapping relationship between virtual memory addresses and physical memory addresses. In a case where the query result of any thread indicates that the data is stored in the extended storage space, it is determined that the thread has a page fault exception. In this case, it is determined that the thread bundle to which the target thread belongs has a page fault exception, and that thread bundle is interrupted.
In a case where the query result indicates that the data is stored in the storage unit of the processor, the thread runs normally.
According to the embodiment of the application, during parallel task processing by the processor, threads with the same instruction but different data to be processed are combined into one thread bundle for processing, which improves parallelism. At the same time, when any target thread has a page fault exception, the thread bundle to which the target thread belongs is determined to have a page fault exception, which avoids needlessly delaying the other threads in the same thread bundle.
The above describes how to determine that the first thread block has a page fault exception; the following describes how to interrupt the first thread block.
According to an embodiment of the present application, for operation S210 as shown in FIG. 2, interrupting the operation of the first thread block may include saving the context of the first thread block to global memory.
The context of the first thread block represents the operational state of the first thread block.
Alternatively, the context may be stored to shared memory or global memory. Global memory may refer to memory space external to the processor that is available for access by the processor, and shared memory may refer to memory space provided on the processor. The shared memory has low delay but limited capacity, and the global memory has high delay but large capacity. The storage position of the context can be flexibly set according to actual conditions.
Alternatively, the interruption of the first thread block may be accomplished by dynamically switching contexts. For example, when all thread bundles of the first thread block in the active state are interrupted by page faults, a switch may be made to a standby thread block, which continues to run on the processor as the second thread block.
Specifically, the context of the first thread block may be saved, and its state may be managed and marked, e.g., marked as the interrupted state, by the scheduler of the GPU, completing the interrupt.
Alternatively, the global memory may be storage space outside the processor that is accessible to the processor. The interrupt of the first thread block is completed by switching the context.
According to the embodiment of the application, the interrupt of the first thread block is completed by processing its context, which simplifies the interrupt mechanism and improves processing efficiency.
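The save-context-and-mark sequence above can be modeled in a few lines. This is an illustrative sketch under the assumption that a context is a small dictionary of state; the function names, dictionary keys, and state labels are hypothetical, and a real GPU scheduler performs this in hardware.

```python
def interrupt_block(block, global_memory, scheduler_state):
    """Save the block's context to global memory and mark it interrupted."""
    global_memory[block["id"]] = dict(block["context"])  # persist running state
    scheduler_state[block["id"]] = "interrupted"

def resume_block(block_id, global_memory, scheduler_state):
    """Restore a previously saved context and mark the block active again."""
    scheduler_state[block_id] = "active"
    return dict(global_memory[block_id])
```

In this model, interrupting TB1 and later resuming it round-trips its context through global memory, matching the described save/switch/restart cycle.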
The interruption of the first thread block is described above. The activation of the second thread block is described below.
According to an embodiment of the present application, for operation S220 shown in FIG. 2, running the second thread block on the processor may include migrating the data required to run the second thread block and the context of the second thread block into the storage unit of the processor. The context of the second thread block is used to characterize the operating state of the second thread block.
A standby thread block may refer to a thread block in the inactive state. In a case where a standby thread block is determined to be the second thread block, the context of the second thread block is loaded from the global memory into the storage unit of the processor.
Optionally, the context may be used to support the execution of the second thread block. The pages to be migrated that contain the data to be processed may be migrated together into the storage unit of the processor.
According to the embodiment of the application, the context and the pages to be migrated are migrated to the storage unit of the processor, and the standby thread block in the inactive state is converted into the second thread block in the active state so that it runs on the processor. The activation mode is simple, the activated content is comprehensive and effective, and the activation efficiency and the stability of thread block switching are improved.
The above describes how the second thread block is activated; the following describes how a batch of migration pages is determined.
According to an embodiment of the present application, for operation S230 shown in FIG. 2, in response to the current time being the target time, merging the plurality of pages to be migrated that caused page fault exceptions in the first thread block and the second thread block into a batch of migration pages may include: when the determined data amount of the plurality of pages to be migrated exceeds a predetermined data amount threshold, sorting the plurality of pages to be migrated according to their exception determination times to obtain a time ordering result; based on the time ordering result, taking the pages to be migrated that fit within the predetermined data amount threshold as the current batch of migration pages; and taking the remaining pages to be migrated as the next batch of migration pages.
The migration time of the previous batch of migration pages can be recorded, and when the duration between the current time and that migration time reaches a predetermined duration, the current time is taken as the target time. The plurality of pages to be migrated determined within the predetermined duration may be merged directly as a batch of migration pages, or may first be treated as candidate batch migration pages. The data amount of the candidate batch is then determined. In a case where the data amount exceeds the predetermined data amount threshold, a part of the candidates is selected as the batch of migration pages. The pages may be, but are not limited to being, selected randomly; they may also be sorted according to their exception determination times to obtain a time ordering result. Based on the time ordering result, the pages whose exceptions were determined earliest are merged into the batch currently being processed, so that faults are recovered early for the task requests that were responded to earliest, thereby improving the task processing effect.
According to another embodiment of the application, in response to the current time being the target time, merging the plurality of pages to be migrated that have page fault exceptions into a batch of migration pages may include merging all of the plurality of pages to be migrated into one batch in a case where their determined data amount does not exceed the predetermined data amount threshold.
By using the data amount as an additional constraint alongside the predetermined duration as the merging condition for batch migration pages, the batch processing of pages to be migrated is bounded by the data amount threshold, which improves processing efficiency while reducing resource occupancy.
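The batch-formation rule above can be sketched as a single function: at the target time, sort the collected pages by exception determination time, keep the earliest pages that fit within the data amount threshold, and defer the rest to the next batch. The field names and threshold value are illustrative assumptions, not part of the specification.

```python
def form_batch(pages, threshold_bytes):
    """pages: list of dicts with 'fault_time' and 'size' keys.
    Returns (current_batch, deferred_pages)."""
    ordered = sorted(pages, key=lambda p: p["fault_time"])
    batch, total = [], 0
    for page in ordered:
        if total + page["size"] > threshold_bytes and batch:
            break  # remaining pages go to the next batch
        batch.append(page)
        total += page["size"]
    return batch, ordered[len(batch):]
```

Keeping the earliest-faulting pages in the current batch mirrors the stated design goal of recovering first the faults of the task requests that were responded to earliest.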
The general manner in which batch migration pages are determined is described above. How the migration of a batch of migration pages is performed is described below.
According to the embodiment of the application, migrating the batch of migration pages from the extended storage space to the storage unit of the processor may include: in a case where it is determined that the free capacity of the storage unit of the processor can accommodate the batch of migration pages, locking the target page table entries in the page table to prevent write operations on them; migrating the batch of migration pages from the extended storage space to the storage unit of the processor; and updating the mapping relationship between the virtual memory addresses and the physical memory addresses of the target page table entries in the page table.
Locking a target page table entry in the page table may refer to restricting write operations on that entry. After the batch migration is executed, the mapping relationship between the virtual memory address and the physical memory address of the target page table entry can be updated. After the page table is updated, the target page table entry is unlocked, completing the fault repair. After an SM on the processor becomes idle, the thread blocks corresponding to the batch of migration pages may be restarted.
According to the batch migration operation described above, the page table can be updated at the same time the batch of migration pages is migrated, realizing a linked operation and avoiding new exceptions caused by inconsistency between migration and page table update.
The above describes how batch migration is performed; the following describes how the processing efficiency of batch migration can be improved.
According to a preferred example of the present application, during operation S230 shown in FIG. 2, the exception handling method may further include evaluating a plurality of physical memory pages stored in the storage unit to obtain a page evaluation result. In a case where the evaluation result indicates a mismatch between a target physical memory page and a target thread block, the target physical memory page is treated as an eviction page and is migrated from the storage unit of the processor to the extended storage space.
The target physical memory page is allocated for use by a target thread block, which is a thread block running on the processor in parallel with the first thread block or the second thread block.
Optionally, the physical memory pages are stored in the storage unit of the processor, whose capacity is fixed. Before each migration of a batch of migration pages, storage space needs to be freed in the storage unit, e.g., by migrating out an existing physical memory page, which serializes the operations.
Fig. 4A shows a migration flow diagram of a page to be migrated according to a related example of the present application.
As shown in FIG. 4A, the pages to be migrated, Page A and Page B, each caused a page fault exception and are stored in the extended storage space awaiting migration to the storage unit of the processor. The physical memory pages Page X and Page Y are stored in the storage unit of the processor and need to be migrated out to the extended storage space. Because the capacity of the storage unit is fixed, when the storage unit is full, the physical memory page Page X must be migrated out first, then Page A migrated in, then Page Y migrated out, then Page B migrated in; this series of operations is executed serially.
When the storage unit is full, old physical memory pages need to be migrated out first to make room.
With the embodiment of the application, physical memory pages that are used infrequently or are unsuitable for the running thread blocks are migrated out in advance; this pre-eviction mechanism can improve page migration efficiency.
A target physical memory page is mismatched with the target thread block when the target thread block either does not access the data contained in the target physical memory page, or accesses that data less frequently than a predetermined threshold.
Fig. 4B shows a migration flow diagram of a page to be migrated according to an embodiment of the present application.
As shown in FIG. 4B, the first thread block TB1 and the target thread block TBT run in parallel on the processor. When the first thread block TB1 is to be switched to the second thread block TB2, before determining that the pages to be migrated include the page Page-TB1 of the first thread block TB1 and the pages Page-TB2 of the second thread block TB2, the target physical memory page Page-TBT that does not fit the target thread block TBT is identified in advance and migrated from the storage unit to the extended storage space in advance.
By identifying in advance the target physical memory pages that are mismatched with the target thread block and migrating them out, the available space of the storage unit is dynamically expanded, and the rigid serial execution of migrate-in and migrate-out operations is avoided.
When the storage unit of the processor is detected to be full, the target physical memory page is migrated out in advance even if no page migration request has yet arrived, so that usable storage space is freed and the migration of the next batch of migration pages can be performed immediately, thereby reducing the seriality of page migration and improving management efficiency.
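The pre-eviction rule above can be sketched as a selection function: a resident page is mismatched with its thread block when the block has not accessed it, or has accessed it less often than a threshold, and such pages are chosen for early eviction. The field names and threshold value are illustrative assumptions.

```python
def pick_eviction_pages(resident_pages, access_threshold):
    """resident_pages: list of dicts with 'page' and 'access_count' keys.
    Returns pages to evict early to the extended storage space."""
    return [p["page"] for p in resident_pages
            if p["access_count"] < access_threshold]
```

For example, a page such as a hypothetical `Page-TBT` that its thread block never accessed would be evicted ahead of any migration request, while a frequently accessed page stays resident.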
The above describes how to migrate a batch of migration pages. How the second thread block is determined from the standby thread blocks is described below.
In accordance with an embodiment of the present application, before performing operation S220 shown in FIG. 2, the exception handling method may further include determining the second thread block from among the standby thread blocks.
Optionally, determining the second thread block from the standby thread blocks may include sorting the plurality of standby thread blocks based on the priorities of their tasks to be executed to obtain a thread block ordering result, and determining the second thread block from the standby thread blocks based on the thread block ordering result. This is not limiting, however: candidate standby thread blocks may also be determined from the standby thread blocks based on the thread block ordering result, and in a case where it is determined that part of the data required by a candidate standby thread block is already stored in the storage unit of the processor, that candidate is taken as the second thread block.
Determining the second thread block from the standby thread blocks based on the thread block ordering result may include sorting the plurality of standby thread blocks by priority level to obtain the thread block ordering result, taking the standby thread block with the highest priority as the candidate standby thread block, and directly taking the candidate standby thread block as the second thread block.
Alternatively, it may be checked whether part of the data required by the candidate standby thread block is already stored in the storage unit of the processor; in a case where it is, the candidate standby thread block is taken as the second thread block. In a case where none of the data required by the candidate standby thread block is stored in the storage unit of the processor, filtering continues according to the thread block ordering result until a standby thread block is obtained and taken as the second thread block.
Compared with directly taking the highest-priority standby thread block as the second thread block, confirming that part of the data required by the candidate is already stored in the storage unit of the processor can reduce the amount of data migrated in the batch of migration pages for running the second thread block and improve processing speed. It also avoids the problem that the chosen second thread block, being unable to run on the processor because it still requires a batch of migration pages, is interrupted again, leaving the processor to continue idling.
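The selection rule above can be sketched as follows: sort the standby blocks by task priority, then prefer the highest-priority candidate some of whose required data is already resident in the processor's storage unit, falling back to pure priority order if no candidate qualifies. All names and the data layout are illustrative assumptions.

```python
def pick_second_block(standby_blocks, resident_pages):
    """standby_blocks: list of dicts with 'id', 'priority', 'needed_pages'.
    resident_pages: set of pages already in the processor's storage unit."""
    ordered = sorted(standby_blocks, key=lambda b: b["priority"], reverse=True)
    for candidate in ordered:
        # prefer a candidate whose data is (partly) already resident
        if any(p in resident_pages for p in candidate["needed_pages"]):
            return candidate["id"]
    # fall back to the highest-priority block if none has resident data
    return ordered[0]["id"] if ordered else None
```

This mirrors the embodiment's filtering: priority decides the ordering, and data residency decides which candidate actually becomes the second thread block.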
The above describes how the second thread block is determined from the standby thread blocks. How to determine the number of standby thread blocks will be described below.
According to an embodiment of the present application, the number of standby thread blocks may be determined as a fixed value. But is not limited thereto. The number of standby thread blocks may also be determined to be dynamically adjustable.
In accordance with a preferred embodiment of the present application, setting the number of standby thread blocks to be dynamically adjustable may include determining the number of standby thread blocks based on the running state information of the processor and the hardware metrics of the processor over a historical period.
The number of standby thread blocks configurable for the current period may be estimated based on the running state information of the processor and the hardware metrics of the processor during the historical period.
The operating state information of the processor may characterize the operating performance of the processor, and the hardware metrics of the processor may characterize the maximum capabilities of the supportable resources of the processor.
Estimating the number of standby thread blocks configurable in the current period based on the running state information and hardware metrics of the processor in the historical period allows multiple factors to be considered together, guarantees the scalability and dynamic adjustability of the standby thread blocks, reduces processor idle time through dynamic switching of thread blocks, and increases the frequency of batch merging, ultimately improving resource utilization. In addition, the running state information helps improve both the load handling capacity and the stable running performance of the processor.
According to embodiments of the present application, function fitting may be used to determine the number of standby thread blocks. For example, the running state information and hardware metrics of the processor in the historical period are substituted as parameters into an evaluation function to obtain the number of standby thread blocks.
According to another embodiment of the application, prediction by an action evaluation model can also be employed. For example, state features are derived based on the running state information and the hardware metrics. Based on the state features, the number of standby thread blocks is evaluated to obtain an action evaluation result for adjusting the number of standby thread blocks. The number of standby thread blocks in the current period is then obtained based on the historical number of standby thread blocks in the historical period and the action evaluation result.
Specifically, the state features may be input into the action evaluation model, resulting in an action evaluation result for adjusting the number of standby thread blocks. For example, suppose the number of standby thread blocks configured during the historical period is Y. Based on the action evaluation result, a predetermined number can be added to or subtracted from Y to obtain the number of standby thread blocks in the current period.
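The adjustment of Y by a predetermined number can be sketched as below. The action names, the step size, and the lower bound of one block are illustrative assumptions.

```python
PREDETERMINED_STEP = 2  # assumed fixed adjustment granularity

def adjust_standby_count(historical_count, action):
    """action: one of 'increase', 'decrease', 'keep', as produced by the
    action evaluation model; historical_count corresponds to Y above."""
    if action == "increase":
        return historical_count + PREDETERMINED_STEP
    if action == "decrease":
        # Assumed floor: never drop below one standby block.
        return max(1, historical_count - PREDETERMINED_STEP)
    return historical_count
```

The model thus nudges the historically configured count rather than predicting an absolute number, which is the fine-adjustment behavior described in the surrounding text.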
According to another alternative embodiment of the present application, the status feature may be input into the action evaluation model to obtain a thread block number evaluation result indicating the number of standby thread blocks.
Compared with directly obtaining a thread block number evaluation result, determining the number of standby thread blocks in the current period jointly from the action evaluation result and the number configured in the historical period allows the historical number to be finely adjusted by the action evaluation result, improving scaling flexibility. At the same time, basing the result on the historically configured number reduces the prediction error, avoiding the large error that direct determination may cause.
According to the embodiment of the application, obtaining the state features based on the running state information and the hardware metrics may include respectively extracting features from the running state information and the hardware metrics to obtain running state features and index features, and performing feature fusion on the running state features and the index features to obtain the state features.
Fig. 5 shows a schematic diagram of determining an action evaluation result according to an embodiment of the present application.
As shown in fig. 5, the running state information 510 and the hardware metrics 520 may be respectively input to the feature extraction module M510 to obtain a running state feature 530 and an index feature 540. The running state feature 530 and the index feature 540 are input into the fusion module M520 for feature fusion, so as to obtain a state feature 550. The state feature 550 is input to the action evaluation model M530, resulting in an action evaluation result 560.
Alternatively, the fusion module may comprise a concatenation or dot-product module. The feature extraction module M510 may include an encoder-decoder, a convolutional neural network, or a long short-term memory (LSTM) network. The action evaluation model may include at least one of a convolutional neural network, a large language model, and a random forest model, which will not be described in detail herein.
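A minimal sketch of the pipeline, choosing concatenation for the fusion module (one of the options named above) and stubbing the feature extractors as plain normalization, since the actual networks are not specified here. All names are illustrative.

```python
def extract_features(raw, scale):
    # Stand-in for the feature extraction module M510:
    # normalize raw metric values into [0, 1].
    return [min(v / scale, 1.0) for v in raw]

def fuse(run_state_feat, index_feat):
    # Concatenation-based fusion module M520, producing the
    # state feature vector fed to the action evaluation model.
    return run_state_feat + index_feat
```

A real deployment would replace both stubs with learned modules; the point is only the data flow: two feature vectors in, one fused state feature out.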
Estimating the action evaluation result using the action evaluation model M530 improves processing efficiency.
According to an embodiment of the present application, the running state information 510 may include at least one of, for example, a page processing performance recognition result, a page access attribute recognition result, and a thread running performance recognition result. The hardware metrics may include at least one of a processing metric and a hardware resource metric.
Specifically, the processing performance of the physical memory pages to be migrated can be identified to obtain a page processing performance identification result, which may include a premature recovery rate or a premature recovery ratio.
Identifying the processing performance of the physical memory pages to be migrated may include determining the premature recovery rate per unit time, or determining the change in the proportion of physical memory pages reclaimed prematurely per unit time. A lower page processing performance identification result indicates a greater need to increase concurrency.
The access attribute of the physical memory pages is identified to obtain a page access attribute identification result, which may include at least one of the access frequency of physical memory pages and the proportion of shared physical memory pages. A higher access frequency reflects greater load pressure and a greater need to increase concurrency; likewise, a higher proportion of shared physical memory pages indicates a greater need to increase concurrency.
For example, identifying the access attributes of physical memory pages may include determining the proportion of physical memory pages commonly accessed by multiple thread blocks among the total accessed physical memory pages, or determining the total number of physical memory pages accessed per unit time.
The running performance of the thread blocks running on the processor is identified to obtain a thread running performance identification result, which may include the switching frequency of the thread blocks. A higher switching frequency represents greater switching overhead and a greater need to reduce concurrency.
Identifying the running performance of a thread block running on the processor may include determining the number of thread block context switches per unit time.
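The three running-state signals described above can be collected per monitoring window as sketched below; the field names and the guard against division by zero are illustrative assumptions.

```python
def running_state_info(premature_evictions, total_evictions,
                       shared_page_accesses, total_page_accesses,
                       context_switches, window_cycles):
    """Assemble the running state information from raw per-window counters."""
    return {
        # Page processing performance: premature reclamation rate.
        "premature_rate": premature_evictions / max(1, total_evictions),
        # Page access attribute: share of pages accessed by multiple blocks.
        "shared_ratio": shared_page_accesses / max(1, total_page_accesses),
        # Thread running performance: context switches per unit time.
        "switch_freq": context_switches / window_cycles,
    }
```

These three values correspond to the page processing performance, page access attribute, and thread running performance identification results that feed the state features.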
The processing index and the hardware resource index of the batch migration page are determined, and the hardware metric is determined based on at least one of the processing index and the hardware resource index.
The processing index of the batch migration page may comprise the total number of pages migrated in a single batch and the merging time of the batch migration page. The hardware resource index may include the hardware resource occupancy time of the processor.
According to the embodiment of the application, the reference factors influencing the concurrency are determined according to the actual running condition, and the number of standby thread blocks is estimated by comprehensively referencing the reference factors, so that the estimation is accurate and effective.
According to an alternative embodiment of the application, reinforcement learning may be adopted to optimally train the action evaluation model, which evaluates the number of standby thread blocks based on the state features to obtain the action evaluation result for adjusting that number, thereby improving the pertinence and intelligence of the action evaluation model.
With continued reference to fig. 5, a model evaluation result 590 may be determined based on the current performance monitoring result 570 for the current time period and the historical performance monitoring result 580 for the historical time period. Based on the model evaluation result 590, the action evaluation model M530 is optimally trained, and an optimally trained action evaluation model is obtained.
According to an embodiment of the application, determining the model evaluation result based on the current performance monitoring result of the current period and the historical performance monitoring result of the historical period may include: determining a first evaluation result for the task execution performance index based on the processed task amounts of the current period and the historical period; determining a second evaluation result for the unreasonable-management index based on the change rate of the premature recovery rate of the physical memory pages to be migrated in the current period and the historical period; determining a third evaluation result for the resource occupancy index based on the resource occupancy monitoring result of the current period; and determining the model evaluation result based on the first, second, and third evaluation results.
Optionally, determining the resource occupancy monitoring result may include determining a context switch duration of the thread block. The context switch frequency of the thread block is determined. A bandwidth occupancy for the transmission context is determined. And determining a resource occupation monitoring result based on the switching time length, the switching frequency and the bandwidth occupation rate.
Optionally, determining the model evaluation result may include: performing data conversion on the first evaluation result and the second evaluation result, respectively, to determine a first target value representing the performance gain and a second target value representing the management performance loss; determining a third target value representing the resource occupancy based on the sub-evaluation result for the time occupancy index and the sub-evaluation result for the bandwidth resource occupancy index in the third evaluation result; obtaining a target value from the first, second, and third target values; and obtaining the model evaluation result based on the target value and an evaluation threshold.
The first evaluation result can serve as a core index measuring the instantaneous change in the computing performance of the processor, defined as the difference in the average number of instructions executed per cycle between adjacent monitoring periods. The change of the current period's instruction throughput relative to the previous period's instruction throughput may be used as the processed task amount, i.e., the first evaluation result.
The first evaluation result may be data-converted to yield a first target value reflecting the performance gain.
For example: first evaluation result = instruction count of the current period − instruction count of the historical period.
The second evaluation result may include the premature recovery rate.
The premature recovery rate (PrematureRate) refers to the proportion of physical memory pages in the processor that are evicted prematurely, and is used to measure the performance loss caused by an improper recovery strategy. It may be defined as in formula (1):
PrematureRate = (number of prematurely evicted physical memory pages) / (total number of evicted physical memory pages)    (1)
Data-converting the second evaluation result to determine the second target value indicative of the management performance loss may include determining the premature recovery ratio based on the premature recovery rate. The premature recovery ratio refers to the change in the premature recovery rate per unit time, and may be used as the second target value ΔEvictionRate, calculated as in formula (2). An increase in the premature recovery ratio indicates an increase in the recovery-decision error rate.
ΔEvictionRate = PrematureRate(current period) − PrematureRate(historical period)    (2)
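The premature-recovery definitions of formulas (1) and (2) can be transcribed directly; the exact denominator of (1) in the original filing may differ, so this is a hedged reading.

```python
def premature_rate(prematurely_evicted, total_evicted):
    # Formula (1): share of evicted pages that were removed prematurely.
    # Denominator guarded against zero; assumed to be total evictions.
    return prematurely_evicted / max(1, total_evicted)

def delta_eviction_rate(rate_now, rate_prev):
    # Formula (2): change in the premature recovery rate per unit time;
    # a positive value signals a rising recovery-decision error rate.
    return rate_now - rate_prev
```

For instance, if 3 of 12 evicted pages were reclaimed prematurely this period against a 10% rate last period, the ratio has risen, flagging worsening recovery decisions.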
The third evaluation result may represent the resource occupancy monitoring result. It may include monitoring results of at least one dimension, such as the sub-evaluation result for the time occupancy index, the frequency penalty factor, and the sub-evaluation result for the bandwidth resource occupancy index.
The sub-evaluation result T_occ for the time occupancy index can be determined by formula (3):
T_occ = (ContextSize / MemoryBandwidth) × CyclesPerTransfer    (3)
where ContextSize denotes the amount of context data for a single thread block, MemoryBandwidth denotes the global memory bandwidth of the processor, and CyclesPerTransfer denotes the number of processor clock cycles consumed by each data transfer operation.
The frequency penalty factor F_penalty can be determined using formula (4):
F_penalty = ContextSwitchCount / MonitoringCycles    (4)
where ContextSwitchCount represents the number of thread block context switches within the monitoring window, and MonitoringCycles represents a fixed monitoring period.
The sub-evaluation result B_occ for the bandwidth resource occupancy index can be determined using formula (5):
B_occ = MemBWUtil    (5)
where MemBWUtil denotes the global memory bandwidth occupancy.
The third evaluation result may be converted as in formula (6) to obtain the third target value R_occ:
R_occ = T_occ × F_penalty + B_occ    (6)
where R_occ denotes the third target value, T_occ the sub-evaluation result for the time occupancy index, F_penalty the frequency penalty factor, and B_occ the sub-evaluation result for the bandwidth resource occupancy index.
The target value for the first target value, the second target value, and the third target value can be calculated by referring to formula (7):
R = w1 × ΔIPC − w2 × ΔEvictionRate − w3 × R_occ    (7)
where ΔIPC represents the first target value, ΔEvictionRate the second target value, R_occ the third target value, R the target value, and w1, w2, w3 the weights.
The target value can be compared with the evaluation threshold. When the target value is greater than the evaluation threshold, the model evaluation result indicates that the evaluation performance of the current action evaluation model is good and no optimization is needed. When the target value is smaller than the evaluation threshold, the model evaluation result indicates that the evaluation performance of the current action evaluation model is poor, and optimization can be performed to improve it.
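An end-to-end numeric sketch of the target value computation and the threshold decision, using one plausible reading of formulas (3) through (7). The combination in (6), the weight defaults in (7), and all function names are assumptions consistent with the surrounding text, not the filed formulas.

```python
def third_target(context_size, mem_bandwidth, cycles_per_transfer,
                 switch_count, monitoring_cycles, mem_bw_util):
    """Third target value: resource occupancy cost (assumed combination)."""
    t_occ = (context_size / mem_bandwidth) * cycles_per_transfer  # formula (3)
    f_pen = switch_count / monitoring_cycles                      # formula (4)
    b_occ = mem_bw_util                                           # formula (5)
    return t_occ * f_pen + b_occ                                  # formula (6)

def target_value(delta_ipc, delta_eviction, r_occ, w=(1.0, 1.0, 1.0)):
    # Formula (7): performance gain minus management loss and resource cost.
    return w[0] * delta_ipc - w[1] * delta_eviction - w[2] * r_occ

def needs_optimization(target, threshold):
    # A target below the evaluation threshold indicates poor evaluation
    # performance, so the action evaluation model should be retrained.
    return target < threshold
```

With a 1024-byte context, 256 bytes/cycle bandwidth, 2 cycles per transfer, 4 switches over an 8-cycle window, and 50% bandwidth occupancy, the resource cost is 4.5; an IPC gain of 10 against an eviction-rate rise of 0.5 yields a target of 5.0.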
According to the model training method provided by the embodiment of the application, the monitoring results of the current period and the adjacent historical period can be compared, so that the model evaluation result is grounded in reality and is therefore genuine and effective. In addition, the model evaluation result combines reference evaluation data considered from different angles, so its reference value is comprehensive and effective; optimization training using it is efficient, and the trained model is accurate.
Fig. 6 shows a block diagram of an exception handling apparatus according to an embodiment of the present application.
As shown in fig. 6, the exception handling apparatus 600 of the embodiment of the present application includes an interrupt module 610, an activation module 620, and a migration module 630.
The interrupt module 610 is configured to interrupt operation of a first thread block running on the processor if it is determined that the first thread block has a page fault abnormality, where the first thread block includes a plurality of thread bundles, a page fault abnormality of the first thread block indicates that all of the plurality of thread bundles have page fault abnormalities, and the data required for running the thread bundles is stored in the extended storage space. In an embodiment, the interrupt module 610 may be configured to perform the operation S210 described above, which is not described herein.
An activation module 620 is configured to run a second thread block on the processor, where the second thread block is determined from the plurality of standby thread blocks. In an embodiment, the activation module 620 may be configured to perform the operation S220 described above, which is not described herein.
And the migration module 630 is configured to merge, as a batch migration page, a plurality of pages to be migrated, which result in a page fault abnormality, in response to the current time being the target time, and batch migrate the batch migration page from the extended storage space to a storage unit of the processor, where the plurality of pages to be migrated include at least one of data for running a first thread block, which result in the page fault abnormality, and data for running a second thread block, which result in the page fault abnormality. In an embodiment, the migration module 630 may be configured to perform the operation S230 described above, which is not described herein.
According to an embodiment of the present application, the exception handling apparatus 600 further includes a thread block ordering module, a candidate determination module, and a second thread block determination module.
And the thread block sequencing module is used for sequencing the plurality of standby thread blocks based on the priority of the task to be executed of the standby thread blocks to obtain a thread block sequencing result.
And the candidate determining module is used for determining candidate standby thread blocks from the standby thread blocks based on the thread block ordering result.
And the second thread block determining module is used for taking the candidate standby thread block as the second thread block in the case where it is determined that partial data required for running the candidate standby thread block is stored in a physical memory page, wherein the physical memory page is stored in a storage unit of the processor.
According to an embodiment of the application, the activation module 620 includes an activation sub-module.
And the activating submodule is used for migrating data required by running the second thread block and the context of the second thread block into a storage unit of the processor, wherein the context of the second thread block is used for representing the running state of the second thread block.
According to an embodiment of the application, the interrupt module 610 includes a context save sub-module.
And the context saving submodule is used for saving the context of the first thread block to the global memory, wherein the context of the first thread block represents the running state of the first thread block.
The migration module 630 includes a time ordering sub-module and a first management page migration sub-module, according to an embodiment of the present application.
And the time ordering sub-module is used for ordering the plurality of pages to be migrated according to the moment at which the abnormality was determined, in the case where the determined data quantity of the plurality of pages to be migrated exceeds a preset data quantity threshold, so as to obtain a time ordering result.
The first management page migration submodule is used for taking, based on the time ordering result, the plurality of pages to be migrated that satisfy the preset data quantity threshold as the batch migration page.
The migration module 630 also includes a second management page migration submodule according to an embodiment of the present application.
And the second management page migration submodule is used for merging the plurality of pages to be migrated as batch migration pages under the condition that the determined data quantity of the plurality of pages to be migrated does not exceed a preset data quantity threshold value.
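The batch assembly logic of the two submodules above can be sketched as one function: when the pending pages exceed the data-amount threshold, keep the earliest faults according to the time ordering result; otherwise merge everything. The tuple layout and names are illustrative assumptions.

```python
def build_batch(pending_pages, threshold):
    """pending_pages: list of (fault_time, page_size);
    threshold: preset data quantity threshold in the same size units."""
    # Time ordering result: sort by the moment the abnormality was determined.
    ordered = sorted(pending_pages, key=lambda p: p[0])
    batch, total = [], 0
    for fault_time, size in ordered:
        if total + size > threshold:
            break  # remaining pages wait for a later batch
        batch.append((fault_time, size))
        total += size
    return batch
```

When the total pending size is under the threshold, the loop never breaks, which reproduces the second submodule's behavior of merging all pages to be migrated.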
According to an embodiment of the present application, the exception handling apparatus 600 further includes a page evaluation module and a premigrating module.
And the page evaluation module is used for evaluating the plurality of pages to be migrated stored in the storage unit to obtain a page evaluation result.
And the premigrating module is used for migrating, page by page, the target physical memory page from a storage unit of the processor to the extended storage space in the case where the page evaluation result indicates that the target physical memory page does not match the target thread block, wherein the target physical memory page is allocated to the target thread block for use, and the target thread block is a thread block running in parallel with the first thread block or the second thread block on the processor.
According to an embodiment of the application, the migration module 630 includes a lock sub-module and an update sub-module.
And the lock submodule is used for, in the case where it is determined that the storage unit of the processor has storage space for the batch migration page, locking a target page table entry in the page table to prevent a write operation on the target page table entry, and migrating the batch migration page in batch from the extended storage space to the storage unit of the processor.
And the updating sub-module is used for updating the mapping relation between the virtual memory address and the physical memory address of the target page table entry in the page table.
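The lock-then-update flow of the two submodules can be illustrated with an in-memory page table guarded by a mutex. This is a hedged sketch of the concurrency pattern, not driver code; the class shape and the use of a single table-wide lock are assumptions.

```python
import threading

class PageTable:
    def __init__(self):
        self._entries = {}            # virtual address -> physical address
        self._lock = threading.Lock()

    def migrate_batch(self, batch):
        """batch: list of (virtual_addr, new_physical_addr) pairs."""
        with self._lock:              # block concurrent writes to the entries
            for vaddr, new_paddr in batch:
                # Update the virtual-to-physical mapping after the batch
                # migration page lands in the processor's storage unit.
                self._entries[vaddr] = new_paddr

    def resolve(self, vaddr):
        return self._entries.get(vaddr)
```

Holding the lock across the whole batch mirrors the text: no write may touch the target page table entries until the batch migration and mapping update complete.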
The exception handling apparatus 600 further includes a query module and a bundle page fault abnormality determination module, according to an embodiment of the present application.
And the inquiry module is used for inquiring the data required by the target thread running on the processor from the page table based on the virtual memory address of the target thread to obtain an inquiry result, wherein the page table represents the mapping relation between the virtual memory address and the physical memory address.
And the bundle page fault abnormality determination module is used for determining that the thread bundle to which the target thread belongs has a page fault abnormality in the case where the query result indicates that the data required by the target thread is stored in the extended storage space, and interrupting the execution of that thread bundle, wherein a plurality of threads in the thread bundle process different data according to the same instruction.
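The query-and-fault decision of these two modules can be sketched as below. The location tags and the treatment of a missing mapping as a fault are illustrative assumptions.

```python
EXTENDED = "extended"   # assumed tag: data resides in extended storage space
DEVICE = "device"       # assumed tag: data resides in the processor's memory

def check_page_fault(page_table, vaddr):
    """page_table: dict mapping virtual address -> (physical_addr, location).
    Returns True when the thread bundle (warp) must be interrupted."""
    entry = page_table.get(vaddr)
    # A missing mapping or data in the extended storage space both mean the
    # required data is not resident, so the bundle takes a page fault.
    if entry is None or entry[1] == EXTENDED:
        return True
    return False
```

Since every thread in the bundle executes the same instruction on different data, one thread's non-resident address suffices to interrupt the whole bundle.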
According to an embodiment of the present application, the number of standby thread blocks configured on a processor is determined by a standby block determination module as follows.
And the standby block determining module is used for determining the number of standby thread blocks based on the running state information of the processor and the hardware index of the processor in the history period.
The standby block determination module comprises a state characteristic determination sub-module, an action evaluation sub-module and a standby block determination sub-module according to an embodiment of the application.
And the state characteristic determining submodule is used for obtaining state characteristics based on the running state information and the hardware index.
And the action evaluation sub-module is used for evaluating the number of the standby thread blocks based on the state characteristics to obtain an action evaluation result for adjusting the number of the standby thread blocks.
And the standby block determining submodule is used for obtaining the number of standby thread blocks in the current period based on the historical number of the standby thread blocks in the historical period and the action evaluation result.
According to the embodiment of the application, the state characteristic determining submodule comprises a characteristic extracting unit and a fusing unit.
And the feature extraction unit is used for respectively carrying out feature extraction on the running state information and the hardware index to obtain running state features and index features.
And the fusion unit is used for carrying out feature fusion on the running state features and the index features to obtain state features.
According to an embodiment of the present application, the exception handling apparatus 600 further includes a first recognition module, a second recognition module, a third recognition module, a fourth recognition module, a state determination module, a fifth recognition module, and a hardware determination module.
The first identification module is used for identifying the processing performance of the physical memory page to be migrated to obtain a page processing performance identification result.
And the second identification module is used for identifying the access attribute of the physical memory page to obtain a page access attribute identification result.
And the third identification module is used for identifying the running performance of the thread block running on the processor and obtaining a thread running performance identification result.
And the state determining module is used for obtaining the running state information based on the page processing performance identification result, the page access attribute identification result and the thread running performance identification result.
And the fifth identification module is used for determining the processing index and the hardware resource index of the batch migration page.
And the hardware determining module is used for determining the hardware index based on the processing index and the hardware resource index.
According to an embodiment of the present application, the exception handling apparatus 600 includes a model evaluation module and an optimization module.
And the model evaluation module is used for determining a model evaluation result based on the current performance monitoring result of the current period and the historical performance monitoring result of the historical period.
The optimizing module is used for carrying out optimizing training on the action evaluating model based on the model evaluating result to obtain an action evaluating model for optimizing training, wherein the action evaluating model is used for evaluating the number of standby thread blocks based on the state characteristics to obtain an action evaluating result for adjusting the number of the standby thread blocks.
According to an embodiment of the application, the model evaluation module comprises a first evaluation sub-module, a second evaluation sub-module, a third evaluation sub-module and a fourth evaluation sub-module.
And the first evaluation submodule is used for determining a first evaluation result of the task execution performance index based on the respective processed task quantity of the current time period and the historical time period.
And a second evaluation sub-module for determining a second evaluation result for the unreasonable management index based on the premature recovery rate change rate of the batch migration page for each of the current period and the historical period.
And the third evaluation sub-module is used for determining a third evaluation result aiming at the resource occupation index based on the resource occupation monitoring result of the current time period.
And the fourth evaluation sub-module is used for determining a model evaluation result based on the first evaluation result, the second evaluation result and the third evaluation result.
According to an embodiment of the application, the fourth evaluation sub-module comprises a first conversion unit, a second conversion unit, a fusion unit and an evaluation determination unit.
The first conversion unit is used for respectively carrying out data conversion on the first evaluation result and the second evaluation result, and determining a first target value representing the performance gain and a second target value representing the management performance loss.
And the second conversion unit is used for determining a third target value representing the resource occupation based on the sub-evaluation result aiming at the time occupation index and the sub-evaluation result aiming at the bandwidth resource occupation index in the third evaluation result.
And the fusion unit is used for obtaining the target values for the first target value, the second target value and the third target value.
And the evaluation determining unit is used for obtaining a model evaluation result based on the target value and the evaluation threshold value.
Any number of the interrupt module 610, the activation module 620, and the migration module 630 may be combined in one module/unit/sub-unit, or any one of them may be split into multiple modules/units/sub-units, according to embodiments of the present application. Alternatively, at least some of the functionality of one or more of these modules/units/sub-units may be combined with at least some of the functionality of other modules/units/sub-units and implemented in one module/unit/sub-unit. At least one of the interrupt module 610, the activation module 620, and the migration module 630 may be implemented, at least in part, as hardware circuitry, such as a field programmable gate array (FPGA), a programmable logic array (PLA), a system-on-chip, a system-on-substrate, a system-in-package, or an application specific integrated circuit (ASIC), or in hardware or firmware by any other reasonable manner of integrating or packaging circuitry, or in any one of, or a suitable combination of, these three implementations. Alternatively, at least one of the interrupt module 610, the activation module 620, and the migration module 630 may be at least partially implemented as a computer program module that, when executed, performs the corresponding functions.
The present application also provides a computer-readable storage medium that may be included in the apparatus/device/system described in the above embodiments, or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present application.
According to embodiments of the application, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to an embodiment of the application, the computer-readable storage medium may include ROM and/or RAM and/or one or more memories other than ROM and RAM as described above.
Embodiments of the present application also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowcharts. When the computer program product runs in a computer system, the program code enables the computer system to implement the methods provided by the embodiments of the present application.
The above-described functions defined in the system/apparatus of the embodiment of the present application are performed when the computer program is executed by a processor. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the application.
According to embodiments of the present application, program code for carrying out the computer programs provided by embodiments of the present application may be written in any combination of one or more programming languages. In particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or in assembly/machine languages. Programming languages include, but are not limited to, Java, C++, Python, C, or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments of the application can be combined and/or recombined in a variety of ways, even if such combinations or recombinations are not explicitly recited in the present application. In particular, the features recited in the various embodiments of the application can be combined and/or recombined in various ways without departing from the spirit and teachings of the application. All such combinations and/or recombinations fall within the scope of the application.