
CN102934081B - Operation space method, system and apparatus - Google Patents


Info

Publication number
CN102934081B
Authority
CN
China
Prior art keywords
codelet
data
codelets
code
memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201180019116.4A
Other languages
Chinese (zh)
Other versions
CN102934081A (en)
Inventor
Rishi L. Khan
Daniel Orozco
Guang R. Gao
Current Assignee
ET INTERNATIONAL Inc
Original Assignee
ET INTERNATIONAL Inc
Priority date
Filing date
Publication date
Application filed by ET INTERNATIONAL Inc filed Critical ET INTERNATIONAL Inc
Priority claimed from PCT/US2011/032316 (WO2011130406A1)
Publication of CN102934081A
Application granted
Publication of CN102934081B
Status: Expired - Fee Related


Abstract

The present invention, referred to as a runspace, relates to the fields of computing system management, data processing, and data communications. In particular, it provides a collaborative method and system for economical computing, especially for decomposable multi-component tasks that can be executed on multiple processing elements: a metric space representation of code and data locality guides the allocation and migration of code and data; analyses identify code regions that offer improved run-time opportunities; and a low-power, local, secure memory management system distributes calls to compressed code segments that access local memory. The runspace provides mechanisms to support hierarchical allocation, optimization, monitoring, and control, and to support resilient, energy-efficient, large-scale computing.

Description

Runspace method, system and apparatus

Inventor:

Rishi L. Khan

Related Applications

This application claims the benefit of the following applications:

[1] U.S. Provisional Application No. 61/323,362, filed April 13, 2010.

[2] U.S. Provisional Application No. 61/377,067, filed August 25, 2010.

[3] U.S. Provisional Application No. 61/386,472, filed September 25, 2010.

Each of the above applications is incorporated herein by reference in its entirety.

Technical Field

The present invention, referred to as a runspace, relates generally to the fields of computing system control, data processing, and data communications, and in particular to methods and systems for providing economical computing, including the execution of large multi-component tasks distributed over multiple processing elements.

Background of the Invention

Modern high-end computer architectures contain thousands or even millions of processing elements, large amounts of distributed memory, and various levels of non-local memory, network components, and storage infrastructure. These systems pose great challenges for the static and dynamic optimization of the resources consumed by executing applications. Traditionally, computer architectures have strived to present applications with a single, simple address space and to provide straightforward semantics for sequential code execution and data access. The resulting paradigm served well for many years, but it hinders optimal resource allocation now that virtually all hardware acceleration is achieved by distributing computation and data through parallel processing rather than through faster clock rates. The present invention anticipates the stage at which semiconductor manufacturers confront the physical or cost-efficiency limits of shrinking circuit sizes, making parallelism the most promising avenue for performance improvement. In applications where maximum performance is critical, traditional OS resource allocation through interrupts and preemption already impairs performance. The main challenge in achieving efficient distributed computing is therefore to provide system software that makes optimal use of the physical system while presenting application programmers with a usable abstract model of computation.

Summary of the Invention

The present invention provides systems and methods for compiling and running computer programs so as to achieve the goal of the most economical program execution. These systems and methods involve: at compile time, determining the most efficient execution environment for a given program segment, called a codelet; and at run time, placing and scheduling each codelet accordingly in its most efficient execution environment for execution.

Embodiments of the present invention include methods for efficiently allocating data processing system resources to application tasks. The methods involve: obtaining a set of codelets that implement certain data processing tasks; determining the dependencies among these codelets; and, based on those dependencies and on the availability of data processing system resources and the relative costs of using the various resources, dynamically placing and scheduling the codelets for execution on a given data processing system using the identified resources.
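The obtain-dependencies-then-schedule flow described above can be sketched as a dependency-counting ready queue, in which a codelet becomes runnable once all of its input dependencies are satisfied. This is a minimal illustrative sketch; the class and function names are assumptions, not taken from the patent.

```python
from collections import deque

class Codelet:
    """A non-preemptive unit of work plus the codelets that depend on it."""
    def __init__(self, name, action, num_deps=0):
        self.name = name
        self.action = action          # callable executed when the codelet is enabled
        self.num_deps = num_deps      # count of unsatisfied input dependencies
        self.dependents = []          # codelets signaled when this one completes

def run_codelet_set(codelets):
    """Execute codelets in dependency order; return the completion order."""
    ready = deque(c for c in codelets if c.num_deps == 0)
    order = []
    while ready:
        c = ready.popleft()
        c.action()                    # runs to completion, non-preemptively
        order.append(c.name)
        for d in c.dependents:        # signal dependents (the CSIG of the glossary)
            d.num_deps -= 1
            if d.num_deps == 0:
                ready.append(d)
    return order

# Example: b and c depend on a; d depends on both b and c.
a = Codelet("a", lambda: None)
b = Codelet("b", lambda: None, num_deps=1)
c = Codelet("c", lambda: None, num_deps=1)
d = Codelet("d", lambda: None, num_deps=2)
a.dependents = [b, c]
b.dependents = [d]
c.dependents = [d]
print(run_codelet_set([a, b, c, d]))  # ['a', 'b', 'c', 'd']
```

A real run-time system would additionally weigh resource availability and cost when choosing which ready codelet to dispatch; the queue here dispatches in plain FIFO order.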

According to embodiments of the present invention, methods for pursuing user- or system-defined goals in the execution of a computer program are based on decomposing a given computer program into a set of abstract models, including codelets, sets of cooperating codelets, sets of cooperating abstract models, and data shared among the members of a given abstract model. Furthermore, in various embodiments, these methods include the steps of: obtaining program run-time information about the abstract models, performance, and resource utilization associated with the program; and using that run-time information to guide the subsequent placement or execution scheduling of the abstract models in ongoing or subsequent runs of the computer program or parts of it. Additional embodiments of these methods include steps implemented at least in part by the run-time system: defining proximity targets for the members of an abstract model in memory space or execution time; initially placing data and scheduling the execution of the codelets in the abstract model; and migrating members of the abstract model when doing so benefits a given user- or system-defined goal, where placement and migration are carried out in a coordinated fashion so as to maximize the actual proximity of the abstract model's members according to the defined goals.

Other aspects of the invention include a method for the optimal parallel placement and execution of a software program, involving the steps of: a) querying the run-time system to discover the number of processing cores available to execute the program; b) determining the maximum number of processing units into which the program can be divided; c) based on the quantities determined in steps a) and b), dividing the program into an optimal number and size of processing units, e.g. codelets; and d) managing the parallel execution of the program according to the division of step c).
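Steps a) through d) can be illustrated with a minimal partitioning heuristic. The grain-size rule below (cap the unit count at both the available parallelism and the divisible maximum) is an assumption for illustration, not a rule prescribed by the patent.

```python
def partition(work_items, cores_available, max_units):
    """Split work into at most min(cores_available, max_units) chunks."""
    if not work_items:
        return []
    n_units = max(1, min(cores_available, max_units, len(work_items)))
    size = -(-len(work_items) // n_units)   # ceiling division
    return [work_items[i:i + size] for i in range(0, len(work_items), size)]

# 10 work items, 4 cores reported by the run-time system, at most 8 divisible units.
chunks = partition(list(range(10)), cores_available=4, max_units=8)
print(len(chunks), chunks[0])  # 4 [0, 1, 2]
```

Step d), managing parallel execution, would then hand each chunk to a codelet scheduled on its own core.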

According to embodiments of the present invention, systems optimally place and schedule the execution of codelet sets on given data processing hardware. These systems include digital hardware- and software-based means for: exchanging, among a set of processing resources, information related to metrics relevant to the optimal placement of codelet sets among the processing resources; determining on which of the processing resources the codelets of the set awaiting execution should be located; and, according to that determination, using the processing resources to place and schedule the execution of the codelets, wherein at least some of these means are employed dynamically during system run time. A further aspect of the invention concerns a data processing system composed of multiple cores, wherein the system includes: a) a set of system management agents, comprising one or more of the following: a data percolation manager, a codelet scheduler, a codelet migration manager, a load balancer, a power regulator, or a performance manager; and b) means by which that set of agents acts in a coordinated fashion to pursue system-wide goals, which goals, in various embodiments providing dynamic run-time system behavior, are time-varying.

The invention also encompasses application and system software programs for implementing various combinations of the methods of the invention, as well as hardware systems that run these programs and related hardware and software products.

Brief Description of the Drawings

Figure 1 shows an exemplary runspace architecture.

Figure 2 shows an exemplary runspace allocation with multiple ranges.

Figure 3 shows an exemplary runspace run-time system.

Figure 4 shows an exemplary case of run-time performance monitoring and allocation.

Figure 5 illustrates the run-time behavior of code in a runspace system.

Figure 6 shows an exemplary interaction hierarchy.

Figure 7 shows an exemplary self-optimizing operating system.

Figure 8 illustrates explicit and implicit application instructions.

Figure 9 shows an exemplary micro memory management unit.

Figure 10 shows an exemplary application use case.

Figure 11 shows an exemplary grouping of runspaces over time.

Figure 12 shows a computing system using codelet sets.

Figure 13 shows a codelet set representation system.

Figure 14 shows an example of codelet set transformation.

Figure 15 shows meta-level codelet set distribution.

Figure 16 shows codelet set execution and migration.

Figure 17 shows a double-ended queue concurrent access mechanism: write/enqueue.

Figure 18 shows a double-ended queue concurrent access mechanism: read/dequeue.

Figure 19 shows concurrent access through an atomic addition array (A): write.

Figure 20 shows concurrent access through an atomic addition array (B): write.

Figure 21 shows concurrent access through an atomic addition array (C): read.

Figure 22 shows a linked-list atomic addition array (A).

Figure 23 shows a linked-list atomic addition array (B).

Figure 24 shows a linked-list atomic addition array (C).

Figure 25 shows a linked-list atomic addition array (D).

Figure 26 shows a linked-list atomic addition array (E).

Figure 27 shows concurrent access through a shared array with steering.

Figure 28 shows a combining network distributing increments.

Figure 29 shows single-task and multitask execution of concurrent access through an atomic addition array (A).

Figure 30 shows single-task and multitask execution of concurrent access through an atomic addition array (B).

Figure 31 shows single-task and multitask execution of concurrent access through an atomic addition array (C).

Figure 32 shows single-task and multitask execution of concurrent access through an atomic addition array (D).

Figure 33 shows single-task and multitask execution of concurrent access through an atomic addition array (E).

Figure 34 shows a codelet set computing system scenario.

Figure 35 shows a generic exemplary architecture at the chip level.

Figure 36 shows a generic architecture at the board/system level.

Figure 37 shows the assignment of codelets and codelet sets.

Figure 38 shows a double-buffered computation (A).

Figure 39 shows a double-buffered computation (B).

Figure 40 shows a double-buffered computation (C).

Figure 41 shows a double-buffered computation (D).

Figure 42 shows a double-buffered computation (E).

Figure 43 shows a double-buffered computation (F).

Figure 44 shows a double-buffered computation (G).

Figure 45 shows matrix multiplication with SRAM and DRAM.

Figure 46 shows matrix multiplication with double buffering/DRAM.

Figure 47 shows an example of computing LINPACK DTRSM (a linear algebra function).

Figure 48 shows run-time initialization of codelet sets in the DTRSM case.

Figure 49 shows a quicksort example.

Figure 50 shows extensible system functions with application codelet sets.

Figure 51 shows the conversion of legacy code to multitasks.

Figure 52 shows black-box code running with multitask code.

Figure 53 shows improved black-box code running with multitask code.

Detailed Description

Glossary of Terms Used in This Invention

· Application: a set of instructions comprising one or more related specific tasks that a user wishes to perform.

· Application programming interface (API): a programmer-accessible set of routines that exposes system functionality to programs written by application developers who may not have access to the internal components of the system, who may want a simpler or more consistent interface than the one obtainable through the system's basic functions, or who may want an interface that conforms to a particular interoperability standard.

· Codelet: a set of instructions that, once its inputs are available, can generally be executed continuously to completion.

· Codelet set: a group of codelets that can be treated as a unit with respect to dependency analysis or execution.

· Computation region: a set of processing elements grouped by locality or function. These regions may hierarchically include other computation regions. Examples of hierarchical regions include systems, nodes, sockets, cores, and/or hardware threads.

· Concurrent system: a set of concurrent processes and the set of objects manipulated by those processes.

· Core: a processing unit of a computing device. Such processing units include, but are not limited to, CPUs (central processing units), GPUs (graphics processing units), FPGAs (field-programmable gate arrays), or subsets of the foregoing.

· Dependency: a directed arc between two codelet sets indicating that one codelet set must finish before the other can begin.

· Fractal tuning structure: a mechanism that safely and reliably provides efficient use of resources at multiple scales in a system, with similar strategies used at each level.

· GACT/generalized actor: a user or group of users, or a group of users and software agents, or a computing entity acting as a user to achieve some goal.

· GCS/generalized computing system: one or more computers including programmable processors, memory, and I/O devices that provide data access and computing services.

· CSIG/codelet signal: a communication between codelets, or between a monitoring system and at least one codelet, that enables a codelet whose dependencies have been satisfied or conveys status and completion information.

· Hierarchical execution model: a multi-level execution model in which applications are decomposed at several levels, including decomposition into codelets at the base level of granularity.

· Linearizable: describes one or more operations that appear to occur instantaneously in a concurrent processing system. Linearizability is typically achieved by instructions that complete consecutively (as a group) or are undone (rollback recovery), and by systems that provide "atomic" operations through specific instructions or locking around critical sections.

· Lock-free synchronization: non-blocking synchronization of shared resources that guarantees (at least) system-wide progress.

· Local area network (LAN): connects computers and other network devices over relatively small distances, usually within a single organization.

· Node: a device consisting of one or more computer processors and, optionally, memory, network interfaces, and peripherals.

· Overprovisioning: providing more than the minimum number of processing elements and local memories to allow greater latitude in resource allocation; for example, replacing a small number of processing elements running highly sequential tasks at a high clock rate with a larger number of processing elements running more widely distributed code and data at a lower clock rate.

· Multitask: a group of related tasks that can be treated as a unit with respect to a set of computing resources. Typically, the tasks in a multitask have similar resource requirements and may seek allocation of a batch of resources. The tasks may also have complementary resource requirements, and load balancing may be performed by means of distributed requests.

· Proximity: locality, as in memory space or computation space, or the state of being close in time or in dependencies.

· Queue: a data structure that accepts enqueued elements and removes and returns dequeued elements. Elements may be enqueued or dequeued at any position, including but not limited to the beginning, end, or middle of the queue.

· Run-time system (RTS): a collection of software designed to support the execution of computer programs.

· Scalability: the ability of a computer system, architecture, network, or process to efficiently meet larger processing demands through the use of additional processors, memory, and connectivity.

· Self-aware control system: a system that uses a model of its own performance and constraints, allowing high-level goals to be stated declaratively in terms of model properties.

· Signal: an event that enables a codelet set. During execution, signals may be emitted by codelets.

· Task: a unit of work in a software program.

· Thread: a long-lived run-time processing entity that is confined to a particular processing element.

· Wait-free synchronization: non-blocking synchronization of shared resources that guarantees both system-wide progress and per-thread progress.

· Wide area network (WAN): connects computers and other network devices over a potentially large geographic area.

The present invention provides methods and systems for representing, manipulating, and executing codelet sets. A codelet is a generally non-preemptive group of instructions that can typically execute continuously to completion once its dependencies have been satisfied. A codelet set is a group of codelets that can be treated as a unit with respect to dependency analysis or execution. Codelet sets differ from traditional programming and execution models in significant ways. Applications are decomposed into independent code segments that can execute with a minimal amount of system coordination. According to embodiments of the present invention, rather than relying on centralized control and resource allocation, the system code (itself implemented with codelet sets) merely initializes the platform for a codelet set to run by enabling the set's initial codelets. These codelets have no prior dependencies, so they are enabled as soon as the codelet set is enabled. Codelet set applications need not remain entirely in literal code space during their execution. In fact, the translation of some infrequently used codelet set elements may be deferred, even indefinitely, if they are not needed for the particular data supplied during a particular run or execution.

Features of embodiments of the codelet set approach include:

· decomposing computational tasks into abstract models that minimize inter-model dependencies;

· structuring an abstract dependency map that guides initial codelet enablement and the initial and ongoing allocation of computational resources;

· using a computational representation with at least the same expressive power as Petri nets;

· migrating executing or about-to-execute codelet sets to exploit the locality of resources such as local memory, specific data, and intermediate results, and the locality of cooperating codelets, thereby minimizing communication latency;

· migrating codelet sets to obtain a better global allocation of resources, to allow some processing resources to be powered down for energy savings or capacity reservation, or, for example in heterogeneous systems, to utilize resources better suited to a given processing task;

· using multitasks, i.e. related tasks that can be treated as a unit with respect to a set of computational resources or that can be managed by a representative agent task acting to acquire needed resources or additional tasks for the group;

· using atomic addition arrays, which efficiently mediate concurrent access by codelets working on shared data or other processing inputs or resources where the order of access is potentially important;

· using linked-list atomic addition arrays, which permit the efficiency of predominantly local access while supporting essentially unbounded growth of concurrent data stores;

· using multi-turn/multi-generation atomic addition arrays to retain the benefits of strictly local storage while supporting large numbers of pending operations; and

· using combining networks to provide cascaded incrementation of memory accesses, thereby avoiding the bottleneck of a single global next function.
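The atomic addition array named in the list above can be sketched as a fixed-size buffer in which each writer claims a unique slot via an atomic fetch-and-add on a shared counter, so concurrent writers never collide. This is an illustrative sketch: Python's lock stands in for the hardware atomic-add instruction, and all names are assumptions.

```python
import threading

class AtomicAdditionArray:
    """Writers claim distinct slots via fetch-and-add; no slot is written twice."""
    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.next_free = 0
        self._lock = threading.Lock()   # stands in for a hardware atomic add

    def _fetch_and_add(self):
        with self._lock:                # atomically read and bump the counter
            idx = self.next_free
            self.next_free += 1
            return idx

    def enqueue(self, item):
        idx = self._fetch_and_add()     # unique index, even under concurrency
        if idx >= len(self.slots):
            raise IndexError("array full")
        self.slots[idx] = item
        return idx

# 100 concurrent writers, one slot each: no item is lost or overwritten.
arr = AtomicAdditionArray(100)
threads = [threading.Thread(target=arr.enqueue, args=(i,)) for i in range(100)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(arr.slots) == list(range(100)))  # True
```

The linked-list and multi-generation variants in the list above would extend this by chaining additional arrays when `next_free` passes the capacity, rather than raising an error.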

Key concepts and aspects of the present invention are described below with reference to the accompanying drawings. Note that in the following description, steps and orderings of steps are given for illustrative purposes, and many other orderings, subsets, and supersets will be apparent to practitioners once the invention has been described. The aim of brevity precludes enumerating every combination of steps falling within the legitimate scope of the invention.

Overview:

The runspace is structured to exploit the highly parallel architecture of many processing elements, in which both data and code are distributed in a coherent multi-level organization. Runspace systems and methods achieve optimal use of processing resources by maintaining a metric space model in which distance measures are applied to code and data. A good level for task allocation is the codelet level, a codelet being a group of instructions that can execute non-preemptively to completion once its input conditions have been satisfied.

In embodiments of the present invention, runspace methods and systems allocate computational resources to computational tasks by performing one or more of the following steps: obtaining a set of codelets implementing a set of tasks; obtaining the set of data specifications requested by the codelets; constructing a metric space representing the locality of the codelets and of the data they will access; obtaining a statically defined initial arrangement of codelets with respect to distances in the metric space; using the metric space representation to place codelets or data initially; obtaining the dynamically available run-time resource requests of codelets and data; and using the metric space representation to place or move codelets or data dynamically.
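As a minimal sketch of the metric-space placement step above, the function below places each codelet on whichever resource minimizes its total distance to the data it accesses, using plain Euclidean coordinates. The distance measure, coordinates, and names are assumptions for illustration; the patent leaves the metric abstract.

```python
import math

def place_codelets(codelets, resources, data_locations):
    """Assign each codelet to the resource closest to its data.

    codelets:       {codelet name: [data items it accesses]}
    resources:      {resource id: (x, y) coordinate in the metric space}
    data_locations: {data item: (x, y) coordinate in the metric space}
    """
    placement = {}
    for name, data_items in codelets.items():
        def cost(res):
            # total distance from this resource to every data item used
            return sum(math.dist(resources[res], data_locations[d])
                       for d in data_items)
        placement[name] = min(resources, key=cost)
    return placement

resources = {"core0": (0.0, 0.0), "core1": (10.0, 0.0)}
data_locations = {"A": (1.0, 0.0), "B": (9.0, 0.0)}
codelets = {"reader_of_A": ["A"], "reader_of_B": ["B"]}
print(place_codelets(codelets, resources, data_locations))
# {'reader_of_A': 'core0', 'reader_of_B': 'core1'}
```

The dynamic steps in the paragraph above would re-run this placement as data and resource availability change, migrating codelets whose nearest resource has moved.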

Additionally, in embodiments, the runspace prepares allocation opportunities and exploits them at run time by: at compile time, analyzing potential code and data allocations for references that indicate opportunities for manipulating, merging, or migrating codelets and data; and then performing run-time migration of those codelets, merged codelets, or data to exercise the opportunities presented by the actual code and data allocations.

Additionally, to support fine-grained execution of codelets, embodiments of the runspace provide safe and efficient local memory access through one or more of the following actions: decomposing application code into codelets; providing local routing tables containing logical and physical addresses; mapping the physical addresses of groups of closely related codelets to exact address spaces, where each exact address space is accessible to the group of related codelets associated with it; and treating any access by a given codelet group to space outside its exact address space as an error.
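The address-space confinement just described can be sketched as a routing-table lookup that validates every access against the address range mapped to the accessing codelet group. The table layout and names below are illustrative assumptions, not the patent's micro-MMU design.

```python
class AccessError(Exception):
    pass

class MicroMMU:
    """Maps each codelet group to an exact physical address range and
    rejects any access outside that range."""
    def __init__(self):
        self.routing_table = {}   # group id -> (base, limit) physical range

    def map_group(self, group, base, limit):
        self.routing_table[group] = (base, limit)

    def translate(self, group, logical_addr):
        base, limit = self.routing_table[group]
        physical = base + logical_addr
        if not (base <= physical < limit):
            # access outside the group's exact address space is an error
            raise AccessError(f"group {group!r} cannot access {physical:#x}")
        return physical

mmu = MicroMMU()
mmu.map_group("grp0", base=0x1000, limit=0x2000)
print(hex(mmu.translate("grp0", 0x10)))   # 0x1010
try:
    mmu.translate("grp0", 0x1500)         # lands past limit 0x2000
except AccessError as e:
    print("error:", e)
```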

The present invention also provides methods and systems for representing, manipulating, and executing codelet sets. A codelet set is a group of codelets that can be treated as a unit with respect to dependency analysis or execution. Codelet sets provide a mechanism for developing and executing distributed applications, and a mechanism for the composability of applications: codelet sets may contain codelet sets, and they can be constructed hierarchically and reused. Although codelets can run to completion without preemption as long as their dependencies are satisfied, they can also run on preemptive systems, either to emulate a non-preemptive many-core architecture or because some other property of preemptive computing is desirable for the distributed application represented by the codelet sets. Further, a preemptive OS can be given hints to minimize preemption, such as core affinity and process priority. In this way, a codelet runspace can coexist with other legacy applications on current computer systems.
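The hierarchical composability described above can be sketched as codelet sets that contain both codelets and nested codelet sets, flattened recursively for dependency analysis or execution. The class and example names are illustrative assumptions.

```python
class CodeletSet:
    """A reusable group of codelets and nested codelet sets."""
    def __init__(self, name, members):
        self.name = name
        self.members = members     # codelet names or nested CodeletSet objects

    def flatten(self):
        """Yield every codelet in this set, recursing into nested sets."""
        for m in self.members:
            if isinstance(m, CodeletSet):
                yield from m.flatten()
            else:
                yield m

# A "transpose" set is built once and reused inside a larger "matmul" set.
inner = CodeletSet("transpose", ["load", "swap", "store"])
outer = CodeletSet("matmul", ["init", inner, "multiply", "reduce"])
print(list(outer.flatten()))
# ['init', 'load', 'swap', 'store', 'multiply', 'reduce']
```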

According to embodiments of the present invention, rather than relying on centralized control and resource allocation, the system code (itself implemented with codelet sets) merely initializes the platform for a codelet set to run by enabling the set's initial codelets. According to the invention, an application program is decomposed into independent code segments that can execute with a minimal amount of system coordination.

Overview of System Utilization and Management:

In embodiments of the present invention, such as the examples examined in more detail below, the runspace execution model pervades all levels of system utilization and monitoring. At the fine-grained level, the execution model provides a series of codelets and their respective dependencies. The fine-grained nature of codelets allows the run-time system to allocate resources efficiently and dynamically, while monitoring performance and energy consumption and enabling scheduling changes to match the performance and power requirements of the application.

运行空间系统分配可用资源给给定应用并提供API以访问芯片外资源,例如磁盘、外围设备、其它节点存储器等。应用区域(即应用可用的节点)由超管理器定义。小码细粒度本质允许运行时间系统有效并动态地分配资源,同时监测性能和能耗并且使调度改变能够符合应用和系统的性能和功率目标。The runspace system allocates available resources to a given application and provides APIs to access off-chip resources such as disks, peripherals, other node memory, and the like. Application areas (i.e. nodes where applications are available) are defined by the hypervisor. The fine-grained nature of codelets allows the runtime system to efficiently and dynamically allocate resources while monitoring performance and energy consumption and enabling scheduling changes to meet application and system performance and power goals.

在根据本发明的实施方案的系统101中,如图1中所示,有5个用于在示范性运行空间结构101中系统利用和管理的组件:(1)共享长期文件系统和开始应用的传统操作系统(OS),(2)在初级控制系统资源分配的超管理器,(3)管理芯片外资源的微OS,(4)提供任务同步性并管理能量消耗和性能的运行时间系统,和(5)提供微OS便携性并允许访问新外围设备的硬件抽象层。根据这些实施方案,线程虚拟机(TVM)代替常规OS来提供到硬件的直接访问和小码之间的细粒度同步。TVM在这里不应理解为单独的组件,实际上,它由运行时间系统和微OS实施。In a system 101 according to an embodiment of the present invention, as shown in FIG. 1, there are five components for system utilization and management in the exemplary runspace structure 101: (1) a traditional operating system (OS) that shares the long-term file system and starts applications, (2) a hypervisor that controls system resource allocation at a primary level, (3) a micro-OS that manages off-chip resources, (4) a runtime system that provides task synchronization and manages energy consumption and performance, and (5) a hardware abstraction layer that provides micro-OS portability and allows access to new peripherals. According to these embodiments, a threaded virtual machine (TVM) replaces a conventional OS to provide direct access to hardware and fine-grained synchronization between codelets. The TVM should not be understood here as a separate component; in fact, it is implemented by the runtime system and the micro-OS.

图1概述了组件之间的整体交互。Figure 1 outlines the overall interaction between components.

超管理器 HyperManager :

超管理器基于用户参数和可选地应用中指定的参数来分配全局资源给给定应用。这包括应使用多少节点,和在某些实施方案中,节点的连接度。超管理器设定应用区域并定义运行在每个节点上的微OS。然后超管理器载入应用特定参数(例如,命令行参数、环境变量等)并指导运行时间系统开始应用。运行时间系统通过开始核心上始于主程序开始指针的一个或多个小码来开始用户应用。用户应用可请求在运行时间繁衍更多小码。另外,用户应用直接与运行时间系统交互来进行任务同步。所有芯片外I/O由微OS调解,所述微OS串联化通过串联导线管(例如磁盘I/O、以太网、节点到节点通信等)通道的请求和响应。另外,微OS帮助运行时间系统在节点间到其它运行时间系统组件通信。硬件抽象层为获得微OS便携性提供公共API到其它平台来发现新外围设备。The hypervisor allocates global resources to a given application based on user parameters and, optionally, parameters specified in the application. These include how many nodes should be used and, in some embodiments, the connectivity of the nodes. The hypervisor sets up the application region and defines the micro-OS running on each node. The hypervisor then loads application-specific parameters (e.g., command-line arguments, environment variables, etc.) and directs the runtime system to start the application. The runtime system starts the user application by starting one or more codelets on a core, beginning at the main program's start pointer. The user application may request that more codelets be spawned at runtime. In addition, the user application interacts directly with the runtime system for task synchronization. All off-chip I/O is mediated by the micro-OS, which serializes requests and responses channeled through serial conduits (e.g., disk I/O, Ethernet, node-to-node communication, etc.). In addition, the micro-OS helps the runtime system communicate across nodes with other runtime system components. The hardware abstraction layer provides a common API for discovering new peripherals on other platforms, giving the micro-OS portability.

下面的段落概述了系统利用和维护中涉及的不同组件的整体结构和功能性。The following paragraphs outline the overall structure and functionality of the different components involved in system utilization and maintenance.

线程虚拟机(TVM):Threaded Virtual Machine (TVM):

TVM提供框架来将工作划分成称为小码的小的非抢占块并在运行时间有效地调度它们。TVM用一薄层能够直接与硬件对接的系统软件来替换OS,并且大体上使应用程序员免受结构复杂性的困扰。不同于常规OS,TVM能够公开对达成性能至关重要的资源。TVM provides a framework to divide work into small non-preemptive chunks called codelets and schedule them efficiently at runtime. TVM replaces the OS with a thin layer of system software that interfaces directly with the hardware and largely shields the application programmer from architectural complexity. Unlike a conventional OS, TVM is able to expose resources that are critical to achieving performance.

图2示出了TVM的实施方案。TVM将任何控制流、数据依赖性或同步条件抽象化为统一数据非循环图(DAG),运行时间可将DAG分解成小码机制。在DAG顶部,TVM还重叠另一DAG,所述DAG使用范围概念表示程序的局部性。在本发明的实施方案中,小码可访问父级(例如,201)上建立的任何变量或状态,但同级(例如,202和203或204和205)不能访问彼此的存储器空间。使用这个范围,编译器和运行时间可确定给定图的合适的工作集和可用并发性,从而允许运行时间使用功率优化模型设定紧密度并载入平衡特征来调度资源到小码执行和系统状态或范围变量的渗滤。Figure 2 shows an embodiment of a TVM. The TVM abstracts any control flow, data dependency or synchronization condition into a unified data acyclic graph (DAG), a mechanism by which the runtime can decompose the DAG into codelets. On top of this DAG, the TVM overlays another DAG that expresses program locality using the notion of scope. In an embodiment of the invention, a codelet can access any variable or state established in its parent (e.g., 201), but siblings (e.g., 202 and 203, or 204 and 205) cannot access each other's memory space. Using this scoping, the compiler and runtime can determine the appropriate working set and available concurrency for a given graph, allowing the runtime to use power-optimization models to set affinity and load-balancing characteristics when scheduling resources for codelet execution and for the percolation of system state or scope variables.
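The scope rule described above can be illustrated with a short sketch (illustrative only; the class, variable names and values below are not taken from the embodiment): a codelet resolves a variable by walking upward through its ancestor scopes, never sideways into a sibling's memory space.

```python
class Scope:
    """Illustrative scope node in the locality DAG described above."""

    def __init__(self, parent=None, **variables):
        self.parent = parent
        self.vars = dict(variables)

    def lookup(self, name):
        # Walk upward through ancestors only, never sideways to siblings.
        scope = self
        while scope is not None:
            if name in scope.vars:
                return scope.vars[name]
            scope = scope.parent
        raise KeyError(name)

root = Scope(shared_state=42)        # e.g. the parent scope 201 in FIG. 2
left = Scope(parent=root, x=1)       # sibling scopes such as 202 ...
right = Scope(parent=root, y=2)      # ... and 203

assert left.lookup("shared_state") == 42    # parent state is visible
assert right.lookup("shared_state") == 42
try:
    left.lookup("y")                        # sibling state must stay invisible
    raise AssertionError("sibling scope leaked")
except KeyError:
    pass
```

Because sibling working sets are provably disjoint under this rule, a compiler or runtime can schedule sibling codelets concurrently without further analysis.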

不同于常规OS框架,TVM维持分形语义结构并对运行时间给出调度和渗滤控制,以最优地执行任务。通过遵循这个分形本质,使能的编程模型将能够提供实质信息给运行时间系统。因此,不同于具有不可预知和不复杂的缓存机制的单片线程,粒度和运行时间额外负担的静态和动态本质都被尽可能紧地管理,以提供更大的功率效率。Unlike conventional OS frameworks, the TVM maintains a fractal semantic structure and gives the runtime scheduling and percolation control to execute tasks optimally. By following this fractal nature, the enabled programming models will be able to provide substantial information to the runtime system. Thus, unlike monolithic threads with unpredictable and complex cache mechanisms, both the granularity and the static and dynamic aspects of runtime overhead are managed as tightly as possible to provide greater power efficiency.

运行时间系统:Runtime system:

运行时间系统在软件中实施为用户库,且在硬件中被运行时间系统核心实施以服务若干执行核心。在实施方案中,所述运行时间系统核心可能与执行核心不同,或可能具有特定的硬件来帮助更有效地运行时间操作。在实施方案中,执行核心可执行运行时间系统任务,并且可能有或可能没有专用的核心来进行运行时间系统任务执行。The runtime system is implemented in software as a user library and in hardware by a runtime system core to serve several execution cores. In an embodiment, the runtime system core may be distinct from the execution core, or may have specific hardware to help the runtime operate more efficiently. In an embodiment, an execution core may execute runtime system tasks, and there may or may not be a dedicated core for runtime system task execution.

根据本发明的实施方案,配置并执行动态运行时间系统涉及最优地分配数据处理资源给数据处理任务的方法。所述方法涉及:在编译时,分析潜在代码和数据分配、放置和迁移;和在运行时,放置或迁移小码或数据以练习实际代码和数据分配所提供的机会;以及在某些实施方案中,将至少一些数据从一个场所复制到另一个场所,预期迁移一个或多个小码;和移动小码到其它未充分利用的处理器。According to embodiments of the present invention, configuring and executing a dynamic runtime system involves a method of optimally allocating data processing resources to data processing tasks. The method involves: at compile time, analyzing potential code and data allocation, placement and migration; at run time, placing or migrating codelets or data to exploit the opportunities presented by the actual code and data allocation; and, in some embodiments, copying at least some data from one location to another in anticipation of migrating one or more codelets, and moving codelets to other, under-utilized processors.

本发明的实施方案涉及一种数据处理系统,所述数据处理系统包括最优地定位系统中小码集的硬件和软件。所述系统的元件包括基于数字硬件和软件的构件,所述构件用来:(i)在系统中处理资源集间交换与度量相关的信息,所述度量与处理资源间小码集的最优放置有关;(ii)确定将所述集中一个或多个小码定位于处理资源中哪个处理资源;和(iii)根据所述确定将一个或多个小码映射到一个或多个处理资源。在各种实施方案中,所述映射可涉及次优数据局部性所触发的数据和/或小码迁移。在某些场景中,根据迁移的成本,迁移体积小码和数据。在实施方案中,迁移成本驱动程序包括以下一个或多个:将被迁移的数据或代码的量、迁移距离、同步的额外负担、存储器带宽利用和可用性。Embodiments of the present invention relate to a data processing system including hardware and software for optimally locating codelet sets in the system. Elements of the system include digital hardware- and software-based means for: (i) exchanging, among a set of processing resources in the system, information related to metrics concerning the optimal placement of codelet sets among the processing resources; (ii) determining on which of the processing resources to locate one or more codelets of the set; and (iii) mapping the one or more codelets to one or more processing resources according to the determination. In various implementations, the mapping may involve data and/or codelet migration triggered by sub-optimal data locality. In some scenarios, codelets and data are migrated depending on the cost of the migration. In embodiments, migration cost drivers include one or more of the following: the amount of data or code to be migrated, the migration distance, the overhead of synchronization, and memory bandwidth utilization and availability.

运行时间系统可使用编译时注释或来自指定小码最优效率环境的当前或先前执行的注释。本发明的实施方案中的相关方法涉及:编译并运行计算机程序以寻求最大节约型程序执行。所述方法,在程序编译时,确定称为小码的程序部分的最优效率执行环境;并且相应地,在程序运行时间,定位小码以在它们最优效率执行环境执行。此外,在某些实施方案中,所述最优环境的确定是基于例如以下的程序源代码的指示来进行的:(i)编译器指令;(ii)函数调用,其中调用的一种类型的函数提供与所述函数的最优执行环境相关的信息;(iii)具有例如跨步、工作集、浮点使用的某些特征的循环体,其中最优执行环境已由类似数据处理平台上的类似循环的系统运行来预先确定。执行给定小码的最优效率执行环境可由例如以下的标准来定义:能耗、处理硬件资源使用、完成时间、给定能耗预算的最短完成时间。The runtime system may use compile-time annotations, or annotations from current or previous executions, that specify a codelet's optimal-efficiency environment. A related method in embodiments of the present invention involves compiling and running a computer program for maximally economical program execution. The method, at program compile time, determines optimal-efficiency execution environments for portions of the program called codelets; and correspondingly, at program run time, locates the codelets so that they execute in their optimal-efficiency execution environments. Furthermore, in some embodiments, the determination of the optimal environment is based on indications in the program source code such as: (i) compiler directives; (ii) function calls, where one type of called function provides information about its own optimal execution environment; (iii) loop bodies with certain characteristics, such as stride, working set or floating-point usage, whose optimal execution environment has been predetermined by systematically running similar loops on a similar data processing platform. The optimal-efficiency execution environment for a given codelet may be defined by criteria such as: energy consumption, processing hardware resource usage, completion time, or the minimum completion time within a given energy budget.

内部硬件/软件运行时间堆栈:Internal hardware/software runtime stack:

在本发明的实施方案中,例如在图3示出的系统300中,运行时间系统核心301与事件池存储器302共置。在实施方案中,运行时间系统的任务可在专用运行时间系统核心上操作,或由执行核心操作。事件池302含有即将运行的细粒度小码、应用和系统目标(例如性能和功率目标)和数据可用性事件。事件池302可为例如列表的实际共享数据结构,或例如资源利用改变时(例如当队列具有自由空间;处理元件可用于工作;或互斥锁可用时)拨叫的回呼系统的分布式结构。运行时间系统核心301响应事件池302中的事件。根据本发明的实施方案,有5个在运行时间系统核心301上运行的管理器:(1)数据渗滤管理器;(2)小码调度器;(3)小码集迁移管理器;(4)负载均衡器和(5)运行时间性能监测器/调节器。在某些实施方案中,这些管理器通过在极其邻近状态操作并共享运行时间状态而协作工作。图4示出在一个示范性实施方案的运行时间系统核心301上运行的管理器的输入、输出和交互401。当合适时,数据渗滤管理器渗滤数据依赖性(即当可用时预取输入数据)和代码依赖性(即预取指令缓存)。当满足所有输入依赖性时,小码调度器将小码放置在工作队列中,从而在某些场景中重排队列中就绪小码的优先顺序。执行核心反复地从工作队列接受任务并运行它们到完成为止。在运行小码的过程中,执行核心可创建小码或线程,并把它们放置到事件池里。运行时间性能监测器/调节器监测执行核心的功率和性能,并且可作出调整来减小功率(例如,按比例降低核心的频率和/或电压;关掉核心;或将工作队列的一些或全部工作迁移到芯片上其它计算区域并关掉核心)或提高性能(例如,按比例提高频率和/或电压;打开核心;从其它计算区域招收更多工作或打开不同计算区域并把它们结合到应用)。负载均衡器分析工作队列和事件池并确定工作是否应局部地进行(即在这个计算区域中)或迁移到别处。小码迁移管理器与所述节点和远程节点上其它运行时间系统核心一起工作,以发现小码集的最优目的地并合适地迁移它们。小码迁移也可由不良数据局部性触发:如果小码集中许多小码请求定位在另一个节点上的数据,那么重新定位所述代码可能比重新定位所述数据更好。In an embodiment of the present invention, for example in the system 300 shown in FIG. 3, the runtime system core 301 is co-located with the event pool memory 302. In embodiments, tasks of the runtime system may operate on a dedicated runtime system core or be handled by an execution core. The event pool 302 contains fine-grained codelets that are about to run, application and system goals (e.g., performance and power goals), and data availability events. The event pool 302 may be an actual shared data structure such as a list, or a distributed structure such as a callback system that is invoked when resource utilization changes (e.g., when a queue has free space, a processing element becomes available for work, or a mutex becomes available). The runtime system core 301 responds to events in the event pool 302. According to embodiments of the present invention, there are five managers running on the runtime system core 301: (1) the data percolation manager; (2) the codelet scheduler; (3) the codelet set migration manager; (4) the load balancer; and (5) the runtime performance monitor/governor.
In some embodiments, these managers work cooperatively by operating in close proximity and sharing runtime state. Figure 4 shows the inputs, outputs and interactions 401 of the managers running on the runtime system core 301 of an exemplary embodiment. The data percolation manager percolates data dependencies (i.e., prefetches input data when it becomes available) and code dependencies (i.e., prefetches instruction caches) when appropriate. When all input dependencies are satisfied, the codelet scheduler places codelets in a work queue, in some scenarios re-prioritizing the ready codelets in the queue. The execution cores repeatedly accept tasks from the work queue and run them to completion. While running a codelet, an execution core may create codelets or threads and place them in the event pool. The runtime performance monitor/governor monitors the power and performance of the execution cores and may make adjustments to reduce power (e.g., scale down the frequency and/or voltage of cores; turn cores off; or migrate some or all of the work in the work queue to other compute regions on the chip and turn cores off) or to increase performance (e.g., scale up frequency and/or voltage; turn cores on; recruit more work from other compute regions, or turn on different compute regions and incorporate them into the application). The load balancer analyzes the work queue and event pool and determines whether work should be done locally (i.e., within this compute region) or migrated elsewhere. The codelet migration manager works with the other runtime system cores on the node and on remote nodes to discover the optimal destinations for codelet sets and migrate them appropriately. Codelet migration can also be triggered by poor data locality: if many codelets in a codelet set request data located on another node, it may be better to relocate the code than the data.
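The scheduler behavior described above (a codelet waits in the event pool until its input dependencies are satisfied, then moves to the work queue and runs to completion) can be sketched as follows. This is a simplified single-threaded model, not the patented implementation; all names are illustrative.

```python
from collections import deque

class Codelet:
    def __init__(self, name, deps, body):
        self.name, self.body = name, body
        self.unmet = set(deps)              # unsatisfied input dependencies

class Scheduler:
    def __init__(self):
        self.event_pool = []                # codelets still waiting on inputs
        self.work_queue = deque()           # ready codelets

    def add(self, codelet):
        if codelet.unmet:
            self.event_pool.append(codelet)
        else:
            self.work_queue.append(codelet)

    def satisfy(self, dep):
        # A data-availability event: promote any codelet whose last
        # unmet dependency this was from the event pool to the work queue.
        for codelet in list(self.event_pool):
            codelet.unmet.discard(dep)
            if not codelet.unmet:
                self.event_pool.remove(codelet)
                self.work_queue.append(codelet)

    def run_all(self):
        done = []
        while self.work_queue:
            codelet = self.work_queue.popleft()
            codelet.body(self)              # runs to completion, no preemption
            done.append(codelet.name)
        return done

sched = Scheduler()
sched.add(Codelet("produce", [], lambda s: s.satisfy("data")))
sched.add(Codelet("consume", ["data"], lambda s: None))
order = sched.run_all()
assert order == ["produce", "consume"]      # consumer enabled by producer
```

In the actual architecture this loop is distributed: the event pool and queue live with the runtime system core, while execution cores pull ready codelets concurrently.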

这些管理器也以协作方式一起通信,以获取具有例如给定能耗预算的最小完成时间的共同利益的目标。例如,如果性能管理器想把功率调低而负载均衡器想局部迁移更多工作,那么将这两个管理器共置在RTS核心上意味着它们可同时传递它们目标的最好的动作过程并进行快速、确定性动作。因此,这些子系统提供基于广义角色(GACT)目标建构内部性能模型并获取设定点的控制结构。所述系统的目标是以由GACT约束的能量比例方式来提供最小能耗的最高性能。在本发明的实施方案中,这些功能依靠运行时间系统核心来通过发送载入和功率指示符并接收目标对象来非同步地与主运行时间系统核心通信。主运行时间系统核心的工作是监测芯片上给定应用的整体性能/功率分布,并合适地调整每个计算区域的性能(所述性能可包括个体核心的频率、电压和开/关状态)。These managers also communicate with one another cooperatively to achieve goals of common interest, such as the minimum completion time for a given energy budget. For example, if the performance manager wants to lower the power while the load balancer wants to migrate more work locally, co-locating the two managers on the RTS core means that they can simultaneously communicate the best course of action for their objectives and take quick, deterministic action. Thus, these subsystems provide a control structure that builds internal performance models and derives set points based on generalized role (GACT) objectives. The goal of the system is to provide the highest performance at minimum energy consumption, in an energy-proportional manner constrained by the GACT. In embodiments of the invention, these functions rely on the runtime system cores to communicate asynchronously with the main runtime system core by sending load and power indicators and receiving target objectives. The job of the main runtime system core is to monitor the overall performance/power profile of a given application on the chip and to adjust the performance of each compute region appropriately (which may include the frequency, voltage and on/off state of individual cores).

分配到应用的每个节点的主运行时间系统核心非同步地与所谓应用的头节点的主运行时间系统核心通信,并交换性能度量和目标对象,例如完成时间、能耗和最大资源约束(例如,存储器空间、节点、网络链路等)。运行时间系统硬件的分层和分形管理结构反映了执行模型的分层本质。全体地,运行应用的节点的主运行时间系统核心如下面在超管理器部分中描述的那样执行超管理器任务。运行时间系统彼此通信并提供反馈(例如,局部运行时间核心确定工作量很低,告诉主运行时间核心,并接收更多工作),使得整个系统就是自知的。The main runtime system core of each node assigned to the application communicates asynchronously with the main runtime system core of the so-called head node of the application and exchanges performance metrics and target objects such as completion time, energy consumption and maximum resource constraints (e.g. , memory space, nodes, network links, etc.). The hierarchical and fractal management structure of the runtime system hardware reflects the hierarchical nature of the execution model. Collectively, the main runtime system core of the node running the application performs hypervisor tasks as described below in the hypervisor section. The runtime systems communicate with each other and provide feedback (for example, a local runtime core determines that the workload is low, tells the main runtime core, and accepts more work), so that the entire system is self-aware.

在自知操作系统的实施方案中,监测区域的分形分层网络实现数据处理系统的管理。例如,在基本簇,区域可以是:集群、节点、套接口、核心、硬件线程。每个叶域处的进程(可以是调度器)监测硬件和应用(例如,能耗、载入、程序完成的进展等)的健康状况。阶层中较高级别的监测器聚合来自它们子域的信息(且可能可选地添加在它们区域的信息——或要求所有监测都由子域进行)并把信息向上传递到它们的父域。当硬件组件失效时,被沿链接向上报告。阶层中任何级别可选来重新开始在失效的硬件上运行或沿链接向上传递的小码。一旦一个级别选择重新开始小码,那么它可向下委派任务到它的子级别执行。也可用这种方式迁移使能的小码。如果一个级别发现它的队列变得太满或消耗太大功率,那么它可用与上面描述的方式相同的方式来迁移使能小码。最终,如果一个级别发现它任务太少,那么它可从它父级别请求工作,且这个请求可沿链接向上直到能发现一个合适的施体为止。In an embodiment of the self-aware operating system, a fractal, hierarchical network of monitoring domains manages the data processing system. For example, in a basic cluster, the domains may be: cluster, node, socket, core, hardware thread. A process (which may be a scheduler) at each leaf domain monitors the health of the hardware and the application (e.g., energy consumption, load, progress toward program completion, etc.). Monitors higher in the hierarchy aggregate information from their subdomains (optionally adding information about their own domain, or requiring that all monitoring be done by subdomains) and pass the information up to their parent domains. When a hardware component fails, the failure is reported up the chain. Any level in the hierarchy may choose to restart the codelets that were running on the failed hardware, or pass the failure further up the chain. Once a level chooses to restart the codelets, it can delegate the task down to its sub-levels for execution. Enabled codelets can also be migrated in this way. If a level finds that its queues are becoming too full or consuming too much power, it can migrate enabled codelets in the same manner described above. Finally, if a level finds that it has too few tasks, it can request work from its parent level, and this request can travel up the chain until a suitable donor is found.
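The fractal monitoring hierarchy described above can be modeled in miniature as follows (an illustrative sketch only; the domain names, load values and restart predicate are hypothetical): leaves report their metrics, parents aggregate them upward, and a failure escalates until some level is willing to restart the affected codelet.

```python
class Domain:
    """One monitoring domain (hardware thread, core, node, ...)."""

    def __init__(self, name, children=(), load=0):
        self.name, self.children, self.load = name, list(children), load

    def report(self):
        # Aggregate load from subdomains and pass it up to the parent.
        return self.load + sum(child.report() for child in self.children)

def escalate(failure_path, can_restart):
    # Walk from the leaf where the hardware failed up toward the root;
    # the first level willing to restart the codelet handles it (and may
    # then delegate execution back down to its sub-levels).
    for domain in failure_path:
        if can_restart(domain):
            return domain
    return None

threads = [Domain(f"hw-thread-{i}", load=i) for i in range(4)]
core = Domain("core-0", threads)
node = Domain("node-0", [core])

assert node.report() == 0 + 1 + 2 + 3          # aggregated leaf loads
handler = escalate([threads[1], core, node],
                   can_restart=lambda d: d.name == "node-0")
assert handler is node                          # failure rose two levels
```

The same upward walk serves the work-stealing case in the text: a level with too few tasks sends a request up the chain until a donor domain is found.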

运行时间系统用户API:Runtime System User API:

小码可通过调用运行时间库调用来创建另外的小码,以定义另外小码的数据依赖性、引数和程序计数器。可通过数据依赖性或控制依赖性来实现同步性。例如,通过繁衍依赖于具有参与屏障的角色的数目(看图5)的变量等式的小码来实施屏障。每个参与小码原子地添加一个到屏障变量。可用类似方式实施互斥:具有临界区域的小码把互斥锁捕获用做数据依赖性,并当完成时释放锁。然而,如果临界区域短,那么在某些场景中(在没有死锁并且当所述锁在空间局部存储器中时),核心只等待锁可能是富有成效的。最终,存储器中的原子操作()允许许多类型的隐式无阻塞同步,例如比较和交换队列入口和原子添加增量/减量。A codelet may create additional codelets by making runtime library calls that define the new codelets' data dependencies, arguments and program counters. Synchronization can be achieved through data dependencies or control dependencies. For example, a barrier is implemented by spawning a codelet that depends on a variable equaling the number of participants in the barrier (see Figure 5). Each participating codelet atomically adds one to the barrier variable. A mutex can be implemented in a similar fashion: a codelet with a critical section takes acquisition of the mutex as a data dependency, and releases the lock when done. However, if the critical section is short, then in some scenarios (when there is no deadlock and when the lock is in spatially local memory) it may be more fruitful for the core to simply wait for the lock. Finally, atomic operations in memory allow many kinds of implicit non-blocking synchronization, such as compare-and-swap queue entries and atomic add increments/decrements.
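The barrier construction described above (and in Figure 5) can be sketched as follows. The single-threaded counter below stands in for the atomic add each participating codelet performs on the barrier variable; the class and names are illustrative, not from the embodiment.

```python
class Barrier:
    """Illustrative barrier: a continuation codelet depends on the
    barrier variable equaling the number of participants."""

    def __init__(self, participants, continuation):
        self.participants = participants
        self.count = 0                      # the barrier variable
        self.continuation = continuation    # codelet enabled by the barrier

    def arrive(self):
        # Models the atomic add-one each participating codelet performs.
        # When the count reaches the participant total, the dependency
        # "count == participants" is satisfied and the codelet is enabled.
        self.count += 1
        if self.count == self.participants:
            self.continuation()

log = []
barrier = Barrier(3, lambda: log.append("after-barrier"))
for _ in range(3):
    barrier.arrive()                        # three codelets reach the barrier

assert log == ["after-barrier"]             # fired exactly once, after all 3
```

On real hardware the increment would be a fetch-and-add instruction, so no lock is needed; the mutex pattern in the text follows the same shape with lock acquisition as the data dependency.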

微OSmicro-OS

微OS在节点边界提供非本地节点资源和安全。在本发明的实施方案中,微OS具有两个组件:(1)在执行核心上运行的特定小码;和(2)用户通过系统调用来调用的库函数。特定小码用于基于事件的中断驱动执行或串联设备的非同步轮询和将数据放置到队列中。典型设备包括以太网、将这个节点连接到其它节点的开关的端口,和主动提供的输入(可能是来自磁盘-I/O的非同步响应)的其它来源。另外,可为例如在诸如TCP/IP的可靠通信协议上的转发中继操作的时序事件保留小码。这些小码分析发送者和接收者,以保证允许属于具有节点的应用的特定来源访问节点上的资源或应用专用资源(例如,磁盘上的刮痕空间)。到共享资源(例如,全局文件系统)的访问是通过例如用户、群、作用或访问级别的能力的构件来认证的。The micro-OS provides non-local node resources and security at the node boundary. In an embodiment of the present invention, the micro-OS has two components: (1) specific codelets that run on the execution cores; and (2) library functions that users invoke through system calls. The specific codelets are used for event-based, interrupt-driven execution or for asynchronous polling of serial devices and placing their data into queues. Typical devices include Ethernet, the ports of the switch connecting this node to other nodes, and other sources of unsolicited input (perhaps asynchronous responses from disk I/O). In addition, codelets may be reserved for timed events, such as the forwarding and relay operations of a reliable communication protocol such as TCP/IP. These codelets analyze senders and receivers to ensure that only specific sources belonging to the applications owning the node are allowed to access resources on the node or application-specific resources (e.g., scratch space on disk). Access to shared resources (e.g., the global file system) is authenticated by means of capabilities such as user, group, role or access level.

库函数允许用户应用直接访问硬件,而不需要介入或额外调度。这些函数中的一些可直接在硬件(例如,LAN、节点到节点或磁盘写入)中实施。其它函数使用低级别支持来直接发送并通过缓冲从非同步输入轮询线程接收数据,例如从另一个节点请求磁盘访问。库调用指导用户访问被分配到它应用的数据。用户或系统库可指定是否阻塞等待响应(例如,我们知道它很快回来)或调度小码以在结果上运行数据依赖性。Library functions allow user applications to directly access hardware without intervention or additional scheduling. Some of these functions can be implemented directly in hardware (eg, LAN, node-to-node, or disk writing). Other functions use low-level support to directly send and receive data from an asynchronous input polling thread via buffering, such as requesting disk access from another node. Library calls direct the user to access data assigned to its application. A user or system library can specify whether to block waiting for a response (e.g. we know it will be back soon) or to schedule codelets to run data dependencies on the result.

库函数被设计成能量有效并通过紧密地与运行时间系统耦合来隐藏延迟。例如,调用文件系统读的小码可使文件系统请求;创建小码来在文件系统响应上处理具有数据依赖性的响应;并离开。这样允许执行核心在其它小码上工作,同时数据在传输中(而不是保持在I/O等待状态)。如果没有足够的并发性,那么运行时间系统可关掉核心或调低核心的频率,来允许面对长时延读取操作的更慢计算。Library functions are designed to be energy efficient and hide latency by being tightly coupled to the runtime system. For example, a codelet that invokes a filesystem read may make a filesystem request; create a codelet to handle responses with data dependencies on filesystem responses; and leave. This allows the execution core to work on other codelets while data is in transit (rather than being held in an I/O wait state). If there is not enough concurrency, the runtime system can turn off cores or reduce the frequency of cores to allow slower calculations in the face of long-latency read operations.
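The latency-hiding pattern described above can be sketched as follows: the file-read codelet issues the request, registers a continuation codelet whose data dependency is the response, and returns immediately so the core can run other codelets in the meantime. This is an illustrative model, not the actual micro-OS API; the class names and the path are hypothetical.

```python
class MicroOS:
    """Illustrative stand-in for the micro-OS I/O mediation layer."""

    def __init__(self):
        self.pending = []                   # outstanding I/O requests

    def read_async(self, path, continuation):
        # Issue the file-system request, register the continuation
        # codelet with a data dependency on the response, and return.
        self.pending.append((path, continuation))

    def deliver_responses(self, filesystem):
        # Later, responses arrive; each one satisfies a data dependency
        # and thereby enables the corresponding continuation codelet.
        for path, continuation in self.pending:
            continuation(filesystem[path])
        self.pending.clear()

micro_os = MicroOS()
results = []
micro_os.read_async("/data/input", results.append)   # hypothetical path
busy_work = sum(range(10))      # the core runs other codelets meanwhile
micro_os.deliver_responses({"/data/input": b"payload"})

assert results == [b"payload"]  # continuation ran once the data arrived
```

Because the requesting codelet completes instead of blocking in an I/O wait state, the runtime is free to schedule other work or, as the text notes, to slow or power down cores when too little concurrency remains.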

本发明的实施方案用两种模式提供安全:高性能计算(HPC)模式,其中一个应用拥有全部节点;和非HPC模式,其中多个应用可在一个节点上共存。在HPC模式下,大体上在节点边界(即除了内核/用户存储器空间和只读存储器以外不检查芯片上访问)上执行安全是足够的。用户应用知道它们应用中节点(即节点0到N-1,其中N是应用中节点的数目)的逻辑映射也是足够的。微OS知道将节点ID物理映射到逻辑ID,并在适当时重写地址。另外,当微OS从节点边界外部获取输入时,它验证数据就是那个节点的。因此,芯片上安全包括保护内核代码与用户代码分开,并保护用户的只读存储器不写。在非HPC模式下,微OS允许节点与外部外围设备通信但大体上不与其它节点通信。用相同方式验证输入。由如超管理器部分描述的超管理器配置的硬件来执行另外的安全。可在粗粒度应用级别或细粒度小码级别执行安全。在小码级别,因为在运行时间已知数据依赖性和数据块的尺寸,所以可通过使用防护指针(像M机器上使用的指针)由硬件或由使用无效页面或金丝雀(用于ProPolice或堆栈保护中)的软件来在数据对象周围保证安全。Embodiments of the present invention provide security in two modes: High Performance Computing (HPC) mode, where one application owns all nodes; and non-HPC mode, where multiple applications can co-exist on a node. In HPC mode, it is generally sufficient to enforce security on node boundaries (ie no checking of on-chip accesses other than kernel/user memory space and ROM). It is also sufficient for user applications to know the logical mapping of nodes in their application (ie, nodes 0 to N-1, where N is the number of nodes in the application). The microOS knows to physically map node IDs to logical IDs, and rewrite addresses when appropriate. Additionally, when the microOS gets input from outside the node boundary, it verifies that the data belongs to that node. Therefore, on-chip security includes keeping the kernel code separate from user code and protecting the user's ROM from writing. In non-HPC mode, the microOS allows nodes to communicate with external peripherals but generally not with other nodes. Validate input in the same way. Additional security is performed by hardware configured by the hypervisor as described in the hypervisor section. Security can be enforced at the coarse-grained application level or at the fine-grained codelet level. 
At the codelet level, since data dependencies and block sizes are known at run time, it can be implemented by hardware by using guard pointers (like those used on M machines) or by using invalid pages or canaries (used in ProPolice or stack protection) to provide security around data objects.

超管理器:HyperManager:

超管理器负责把资源分配给用户应用。在本发明的实施方案中,它物理上常驻在所有节点上且部分在主系统上。每个芯片上的一个或多个小码集可用于超管理器功能。它们常驻在运行时间系统核心和执行核心中,并且大体上遵循和系统中其它小码集相同的细粒度执行模型。主软件上超管理器的实施方案维持所有资源分配到系统中所有应用的状态。当开始一个应用时,广义角色(GACT)可指定一组执行环境变量,例如节点数目和功率和性能目标。超管理器把应用放置在系统中并分配资源,从而应用空间中的节点是连续的并很好地匹配GACT应用请求。一旦分配一组节点,主超管理器就与每个节点上的超管理器例子通信来分配所述节点,传递应用代码图像和用户环境(如果有就包括功率和性能目标),并用信号通知运行时间系统以开始应用。超管理器通知微OS和运行时间系统:资源被分配到应用。The hypervisor is responsible for allocating resources to user applications. In an embodiment of the present invention, it physically resides on all nodes and partly on the host system. One or more codelet sets on each chip may be used for hypervisor functions. They reside in the runtime system cores and execution cores, and generally follow the same fine-grained execution model as the other codelet sets in the system. The embodiment of the hypervisor on the host software maintains the state of all resources allocated to all applications in the system. When starting an application, a generalized role (GACT) can specify a set of execution environment variables, such as the number of nodes and the power and performance goals. The hypervisor places the application in the system and allocates resources such that the nodes in the application's space are contiguous and well matched to the GACT's application request. Once a set of nodes is allocated, the master hypervisor communicates with the hypervisor instance on each node to allocate the nodes, passes the application code image and user environment (including power and performance targets, if any), and signals the runtime system to start the application. The hypervisor informs the micro-OS and the runtime system that the resources have been allocated to the application.
The hypervisor instance on each node then monitors application performance and works with the other hypervisor instances and runtime system cores on the other nodes allocated to the application to achieve the power/performance goals by managing the relationships among power, performance, security and resilience, thereby maintaining an energy-proportional runtime power budget (see Figure 6 for the hierarchy 601 of interactions among the overall system, the hypervisor and the runtime system). Micro-OS threads and libraries provide security for the application's data and environment on all nodes allocated to the application.

在多个应用可共存于一个节点的非HPC模式下,超管理器从核心集创建计算区域。每个应用的RAM都被分割,且用户应用不能写入彼此的DRAM或芯片上SRAM。这可由功率效率的基本存储器管理单元(MMU)或旧有机器上广义虚拟存储器管理器(VMM)实现。在应用启动阶段,超管理器确定每个分割的地址前缀和大小,并且MMU可立即重写应用地址。大体上,可用这种方式来访问映射到应用存储器空间的地址。In non-HPC mode where multiple applications can co-exist on one node, the hypervisor creates compute regions from core sets. Each application's RAM is partitioned, and user applications cannot write to each other's DRAM or on-chip SRAM. This can be implemented by a power-efficient basic memory management unit (MMU) or by a generalized virtual memory manager (VMM) on legacy machines. During the application startup phase, the hypervisor determines the address prefix and size of each partition, and the MMU can immediately rewrite the application address. In general, addresses mapped into application memory space can be accessed in this manner.

硬件抽象层:Hardware abstraction layer:

硬件抽象层(HAL)允许微OS和用户应用询问硬件设备可用性并以统一的方式与硬件交互。设备可以是执行核心、磁盘、网络接口、其它节点等。系统的部分可被用户应用通过文件描述符访问。例如打开、读、写和关闭的微OS库函数调用向应用提供基本硬件抽象层。驱动程序与HAL交互,其中一系列存储器读并写。HAL实施例把这些请求转换成与硬件平台相关的总线事务。这样允许用户在不同的基本平台上再次使用驱动代码。The hardware abstraction layer (HAL) allows the micro-OS and user applications to query hardware device availability and interact with hardware in a uniform manner. Devices may be execution cores, disks, network interfaces, other nodes, and so on. Parts of the system can be accessed by user applications through file descriptors. Micro-OS library function calls such as open, read, write and close provide the application with a basic hardware abstraction layer. Drivers interact with the HAL through a series of memory reads and writes. The HAL implementation translates these requests into bus transactions specific to the hardware platform. This allows driver code to be reused on different underlying platforms.

另外,应用可向硬件或运行时间系统询问可用于应用的节点的数目、芯片上执行核心的数目和存储器可用性,以帮助确定怎样划分所述问题。例如,如果存在一千个核心,那么应用可将一百万个迭代循环划分成一千个迭代小码;而如果只有四个核心,那么它会把工作划分成更粗粒度块,因为不能从硬件获得更多并发性并且更少小码的额外负担更低。在各种实施方案中,块的最优尺寸例如可以是(1)可平行进行的工作单元的最大数目除以可用于应用的处理元件的量的圆形整数商;(2)块之间的不同大小,从而最小块尺寸和最大块尺寸之间的最大差异被最小化;或(3)允许在提供的时间预算中完成应用分割,同时保持在提供的能耗预算内的最大尺寸。Additionally, an application can query the hardware or runtime system for the number of nodes available to the application, the number of execution cores on the chip, and the memory availability, to help determine how to partition the problem. For example, if there are a thousand cores, the application may divide a million-iteration loop into codelets of a thousand iterations each; whereas if there are only four cores, it may divide the work into coarser-grained blocks, because no additional concurrency can be obtained from the hardware and fewer codelets incur lower overhead. In various embodiments, the optimal size of a block may be, for example: (1) the rounded integer quotient of the maximum number of work units that can be performed in parallel divided by the number of processing elements available to the application; (2) sizes differing between blocks such that the maximum difference between the smallest and largest block sizes is minimized; or (3) the largest size that allows the application's partitioning to complete within a provided time budget while remaining within a provided energy budget.
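The block-sizing arithmetic described above can be sketched as follows. The function (a hypothetical helper, not part of the embodiment) distributes N work units over P processing elements so that block sizes differ by at most one, which minimizes the difference between the smallest and largest blocks, per criterion (2) above.

```python
def partition(total_units, processors):
    """Split total_units of work into one block per processor,
    with block sizes differing by at most one unit."""
    base, extra = divmod(total_units, processors)
    # 'extra' blocks receive one additional unit; the rest get 'base'.
    return [base + 1] * extra + [base] * (processors - extra)

# A thousand cores: a million-iteration loop becomes thousand-iteration blocks.
blocks = partition(1_000_000, 1000)
assert len(blocks) == 1000 and all(b == 1000 for b in blocks)

# Only four cores: coarser-grained blocks, still balanced within one unit.
blocks = partition(10, 4)
assert sorted(blocks) == [2, 2, 3, 3]
assert max(blocks) - min(blocks) <= 1
```

Criteria (1) and (3) in the text would adjust the divisor or cap the block size using the time and energy budgets, but the balancing step stays the same.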

自优化操作系统:Self-optimizing OS:

操作系统服务由微OS和运行时间系统执行,并由超管理器管理。这些组件一起组成了图7中示出实施方案中描述的示范性自知操作系统701。运行时间系统的自优化本质由以下来实现:(1)执行系统的自知特征;(2)OS的自知特征;和(3)在(1)和(2)之间的交互。如图7所示,OS、超管理器、运行时间系统和执行单元彼此通信,并和它们的邻近级别通信,以提供反馈观察-确定-控制循环。Operating system services are performed by the microOS and runtime system and managed by the hypervisor. Together, these components make up the exemplary self-aware operating system 701 described in the embodiment shown in FIG. 7 . The self-optimizing nature of the runtime system is achieved by: (1) the self-awareness of the execution system; (2) the self-awareness of the OS; and (3) the interaction between (1) and (2). As shown in Figure 7, the OS, hypervisor, runtime system, and execution units communicate with each other and with their adjacent levels to provide a feedback observe-determine-control loop.

这部分描述了自优化系统模型701的实施方案。This section describes an embodiment of the self-optimizing system model 701 .

(1)嵌入执行系统中的自优化循环:执行模型的实施方案以两种小码为特征:非同步任务和数据流小码。在两种类型中,相应小码活动的引起是事件驱动的。至少在非同步任务的情况下,小码的调用可另外取决于计算载入、能量消耗、错误率或在任务可分配到的特定物理区域上的其它条件。自优化也可应用到性能自知监测和自适应。(1) Self-optimizing loops embedded in the execution system: Implementations of the execution model feature two types of codelets: asynchronous task and dataflow codelets. In both types, the invocation of the corresponding codelet activity is event-driven. At least in the case of asynchronous tasks, invocation of codelets may additionally depend on computational load, energy consumption, error rates, or other conditions on the particular physical area to which the task may be allocated. Self-optimization can also be applied to performance-aware monitoring and adaptation.

(2) Self-optimizing loops embedded in the operating system: a self-optimizing OS observes itself, reflects on its behavior, and adapts. It is goal-directed; ideally, it is sufficient for a system client to specify goals, and it is the system's job to figure out how to achieve them. To support these self-optimizing functionalities, the OS observer agents of the embodiment (namely the runtime system cores and the hypervisor) are equipped with performance monitoring facilities that can be programmed to observe all aspects of program execution and system resource utilization, and with energy-efficiency monitoring facilities that, when requested by the OS, can observe system power consumption over different time intervals or at specific locations/regions.

In an embodiment, the OS decision agent (code running on runtime system cores) is equipped with suitable model builders and learning capabilities so that it can take timely and effective actions to self-correct and adapt in order to achieve its goals. In some embodiments, the OS self-optimization loop may use control-theoretic methods to achieve its goals. FIG. 7 shows the interaction between (1) and (2): the control loop in the OS is connected to the control loop in each execution system. The OS control loop can query the execution systems about their operating status, resource usage, energy efficiency, and error states in order to make informed decisions for system-level global control and adjustment. At the same time, each individual execution system can ask the OS for help in solving problems within its own control that are better solved at the OS level.

To use the runspace system and method effectively, an application developer can provide directives that the system annotates at compile time and that produce a better initial static allocation, a better runtime (dynamic) allocation, or both. FIG. 8 shows an explicit language element (801) in C with which the application programmer signals a "resource stall" region to the system, which may indicate that the code can be migrated to a low-power, slower execution unit. Reference numeral 802 denotes an implicit directive: a specific API call that uses low-fidelity floating-point computation. Such computations can be performed cheaply on a floating-point unit with very few mantissa bits, allowing greater specialization and a better match between capability and demand within the system's computational regions. These are some examples of user-specified directives that the runtime can use to make dynamic decisions. In addition, directives can be used to profile and annotate an application so that, in subsequent runs, the runtime can make better dynamic decisions based on the hints the annotations provide.
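As a rough illustration of the low-fidelity floating-point idea, the following sketch discards low-order mantissa bits of an IEEE-754 `float`. The helper name and the bit-masking approach are assumptions for illustration, not the patent's API:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Keep only the top kept_mantissa_bits of a float's 23-bit mantissa,
 * emulating the reduced precision a small floating-point unit would carry. */
static float reduce_fidelity(float x, int kept_mantissa_bits)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);                 /* read the raw IEEE-754 bits */
    uint32_t drop = (uint32_t)(23 - kept_mantissa_bits);
    bits &= ~((1u << drop) - 1u);                   /* zero the discarded bits */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

Truncating toward fewer mantissa bits never touches the sign or exponent, so values that are exact in the reduced format (such as 1.0) pass through unchanged.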

FIG. 9 illustrates an exemplary micro memory management unit. Reference numeral 901 is a processing unit with local code execution and four local physical memory blocks. Reference numerals 902 and 903 are two memory blocks owned by the same controlling task (owner X) and accessible to the codelets associated with that task. 902 has logical address L00 and physical address 00, while 903 has physical address 10 and logical address L01. Reference numeral 904 shows how memory accesses beyond L01 appear to X-owned codelets: any local logical address from L02 upward is a fault for codelets owned by X. Reference numeral 905 shows the memory portion resident at physical location 01, which appears logically as L00 to codelets owned by Y; no other local physical memory is accessible to Y's codelets. Reference numeral 906 shows the memory portion resident at physical location 11, which appears logically as L00 to codelets owned by Z; no other local physical memory is accessible to Z's codelets.
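The per-owner translation of FIG. 9 can be sketched as a lookup table. The table contents mirror the figure (owner X maps L00 and L01; anything beyond faults), while the function and constant names are hypothetical:

```c
#include <assert.h>

#define NO_MAPPING (-1)   /* returned on an access fault */

/* Translate an owner's local logical block number to a physical block
 * number; logical addresses outside the owner's table are faults. */
static int translate(const int map[], int n_logical, int logical)
{
    if (logical < 0 || logical >= n_logical)
        return NO_MAPPING;            /* no such logical block for this owner */
    return map[logical];
}

/* Owner X's table from FIG. 9: L00 -> physical 00, L01 -> physical 10
 * (binary 10, i.e. block 2). */
static const int x_map[] = { 0, 2 };
```

Owners Y and Z would each carry a one-entry table mapping L00 to their single physical block, so every other local physical block is unreachable from their codelets.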

FIG. 10 shows a simple usage example involving the runspace system, in which a generalized agent 1001 specifies tasks (generally by compiling source code), starts an application 1003, and obtains results 1004. Concurrently, another GACT 1005 performs monitoring and system maintenance 1006. In a typical environment, the runspace system is used over a local area network (LAN) and/or wide area network (WAN) 1007, through interaction with a conventional front-end server 1008 that communicates with a high-end computer (HEC) 1009.

FIG. 11 shows an example of code and data locality observed in a runspace, with codelets and data allocated over time. Additional attributes of the runspace may include peripheral resource demands or allocations, processor operating envelopes and constraints, task urgency or deferrability, and the like. The runspace system uses a metric-space distance model to initially assign code and data to suitable local processing elements, and can dynamically migrate code and data when doing so is considered beneficial for optimizing system performance relative to the current goals. The system may use policy-driven optimization techniques for dynamic allocation and exhaustive optimization methods at compile time. In addition, the system can learn from past performance data to improve future allocations of particular codelets, subroutines, tasks, and applications.
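A minimal sketch of metric-space placement follows, using a one-dimensional coordinate as a stand-in for the patent's locality metric; the names and coordinates are illustrative assumptions:

```c
#include <assert.h>
#include <float.h>

/* Pick the processing element whose position minimizes the distance to the
 * codelet's data, i.e. the greedy initial assignment described above. */
static int nearest_element(const double pe_pos[], int n, double data_pos)
{
    int best = -1;
    double best_d = DBL_MAX;
    for (int i = 0; i < n; i++) {
        double d = pe_pos[i] > data_pos ? pe_pos[i] - data_pos
                                        : data_pos - pe_pos[i];
        if (d < best_d) { best_d = d; best = i; }   /* keep the closest one */
    }
    return best;
}

/* Illustrative coordinates of three processing elements in the metric space. */
static const double pe_pos[] = { 0.0, 1.0, 5.0 };
```

A fuller implementation would fold the other runspace attributes (operating envelopes, urgency, and so on) into the distance function rather than using raw position alone.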

Cross-cutting interactions:

Execution model: the runtime system and micro-OS manage, migrate, and spawn codelets. They select which codelet version to run according to runtime goals. As described above, the runtime system cores manage the data dependencies among codelets, migrating data and codelets together and spawning the correct codelet version based on runtime constraints.

Dependability is a combination of security and resilience. According to an embodiment, the security aspect of the invention includes providing codelets with security tags, where the tags indicate restrictions or permissions to be considered when the codelets and their associated data are allocated. A memory access outside the data bounds or the specified permissions raises a security exception to be handled by the runtime system. In HPC mode, an application owns its nodes entirely. At the core level, security is provided by user/kernel-space memory and instruction-set enforcement. At the application level, security is provided by the host system, which defines the set of nodes on which the application runs, and by the hypervisor, which relays that information to the micro-OS running on the allocated nodes. At the system level, security is provided by a job manager on the host system that schedules and assigns nodes to applications in a mutually exclusive manner. In non-HPC mode, the system is further subdivided into mutually exclusive chip regions and memory portions, and memory and resources are mapped so as to prevent applications on the same chip from accessing each other's data.
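The bounds-based check described above might look like the following sketch, where the tag layout (a base/limit pair) is an assumed simplification of the patent's security tags:

```c
#include <assert.h>

enum access_result { ACCESS_OK, SECURITY_EXCEPTION };

/* A codelet's data bounds; a fuller tag would also carry permission bits. */
struct codelet_tag { unsigned long base, limit; };

/* Any access of len bytes starting at addr that falls outside [base, limit]
 * becomes a security exception for the runtime system to handle. */
static enum access_result check_access(const struct codelet_tag *t,
                                       unsigned long addr, unsigned long len)
{
    unsigned long last = addr + len - 1;             /* last byte touched */
    if (len == 0 || addr < t->base || last < addr    /* wrapped around */
        || last > t->limit)
        return SECURITY_EXCEPTION;
    return ACCESS_OK;
}

/* Illustrative tag covering one 4 KiB region. */
static const struct codelet_tag tagX = { 0x1000ul, 0x1ffful };
```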

Resilience is maintained through fractal monitoring of system health and re-execution of failed codelets. Within a computational region, local runtime cores monitor the health of the execution cores. Node-level runtime cores monitor the local runtime cores, and the node-level runtime cores are in turn monitored by the host system. When a component fails, the codelets that were running on its cores are restarted (if they had not created state changes in the program) or the application is restarted from a checkpoint (if the program state is non-deterministic).
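Codelet re-execution can be sketched as a retry loop; `flaky` is a hypothetical codelet, used only to exercise the loop, that fails twice before succeeding:

```c
#include <assert.h>
#include <stddef.h>

/* Re-run a stateless (side-effect-free) codelet until it succeeds or the
 * retry budget is exhausted, at which point the caller would escalate to a
 * checkpoint restart. */
static int run_with_retry(int (*codelet)(void *), void *arg, int max_tries)
{
    for (int i = 0; i < max_tries; i++)
        if (codelet(arg) == 0)
            return 0;                  /* success */
    return -1;                         /* give up: restart from checkpoint */
}

static int attempts;                   /* demo state for the fault injection */
static int flaky(void *arg)            /* fails twice, then succeeds */
{
    (void)arg;
    return ++attempts < 3 ? -1 : 0;
}
```

Only codelets that made no externally visible state change are safe to re-run this way, which is exactly the distinction drawn in the paragraph above.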

The efficiency goal seeks to maximize performance and minimize energy consumption given a set of application and system goals. This is achieved through frequency and voltage scaling at the execution-core level, based on code dependencies and the availability of work. In addition, codelets and data are migrated to where they can communicate with one another most efficiently (for example, by keeping interacting codelets closer together) and consume the least power (for example, by moving codelets together so that the power regions of unused clusters can be shut down, eliminating idle energy consumption).

Self-optimization: self-optimization is maintained through fractal monitoring of the health and performance of the network and through runtime-system rescheduling, so that application and system goals are achieved while dependability and efficiency are maintained.

Description of embodiments:

Operational examples and application scenarios of embodiments of the invention are described below with further reference to the accompanying drawings.

FIG. 12 shows a computing system that uses codelet sets. The important steps include: 1201 providing a codelet-set representation system on the GCS; 1202 obtaining a codelet-set representation from a GACT; 1203 converting the codelet sets into executable or interpretable instructions and dependency representations; 1204 using the instructions on the GCS for meta-level distribution and codelet-set allocation; 1205 performing dynamic, concrete distribution and migration of executable instances of the codelet sets; 1206 executing the codelet sets; and 1207 enabling new codelet sets based at least in part on the dependencies.

FIG. 13 shows a codelet-set representation system comprising the following steps: 1301 providing a specification system for designating codelet sets; 1302 providing a GACT with mechanisms to construct and modify codelet sets and to obtain an initial analysis of a codelet set; 1303 providing a GACT with mechanisms to execute codelet sets on actual or simulated resources; 1304 providing a GACT with mechanisms to monitor running codelet sets or to view the historical traces of codelet sets; 1305 providing a GACT with mechanisms to manipulate codelet sets dynamically; and 1306 providing a GACT with mechanisms to profile codelet-set performance and resource utilization.

FIG. 14 shows an example of codelet-set conversion comprising the following steps: 1401 extracting codelet-set descriptors from the representation; 1402 converting executable instructions; 1403 applying resource-invariant optimizations; 1404 constructing, grouping, and distributing directives to guide runtime allocation, distribution, and migration; 1405 applying resource-specific optimizations; and 1406 generating executable text and enabling the initial codelets.

FIG. 15 shows an example of meta-level codelet-set distribution comprising the following steps: 1501 using directives to initially assign codelet sets to computation and data resources; 1502 monitoring concrete-level codelet-set execution and resource utilization; 1503 collecting opportunities to modify the codelet-set distribution; 1504 constructing directives that improve the initial (compile-time) codelet-set distribution; and 1505 providing resource information and arbitration to support dynamic (runtime) migration of codelet sets.

FIG. 16 shows codelet-set execution and migration, comprising the following steps: 1601 using codelet-set distribution directives to distribute codelet-set text to communication resources or to simulated computation resources; 1602 providing a mapping between codelet-set executable text and the distribution directives; 1603 configuring the codelet sets to return resources and results to the system until completion; 1604 monitoring resource utilization and the loading of enabled-codelet queues; 1605 using codelet signals to obtain or convey status information, or to monitor the codelet system; 1606 monitoring to identify and commit resources, or cascading requests upward to higher-level monitors; and 1607 removing codelet sets from the enabled queues when appropriate and migrating them together with their data.

FIG. 17 shows a double-ended queue concurrent-access mechanism: 1702 write and 1703 enqueue. The other states of the queue are empty at 1701 and housekeeping at 1704.

FIG. 18 shows the dequeue concurrent-access mechanism, this time performing 1801 a consistency check, 1802 an empty queue, 1803 a non-empty queue, and 1804 read and dequeue. Note that one advantage of this system is that the processes using it handle the cleanup tasks as part of their normal operation, so the queue is very robust.

FIG. 19 shows concurrent access through an atomic-addition array (A): write. The states include 1901, the initial state, and 1902, the atomic update of the write pointer.

FIG. 20 shows concurrent access through an atomic-addition array (B): write. The states include 2001, the data write, and 2002, the flag update that makes the data visible to readers.

FIG. 21 shows concurrent access through an atomic-addition array (C): read. The states include 2101, data ready to be read and the read pointer updated; 2102, the read started; and 2103, the read completed and the flag updated.
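The write/read protocol of FIGS. 19-21 can be sketched with C11 atomics. The slot count and helper names are assumptions, and the code is exercised serially here even though the operations shown are the atomic ones a concurrent version would use:

```c
#include <assert.h>
#include <stdatomic.h>

#define SLOTS 8

static _Atomic unsigned wr;              /* shared write pointer */
static int slot_data[SLOTS];
static _Atomic int slot_ready[SLOTS];    /* 1 = slot visible to readers */

/* FIGS. 19-20: claim a slot by atomically advancing the write pointer,
 * store the data, then set the flag that publishes it to readers.
 * Returns the claimed slot index. */
static unsigned aa_write(int value)
{
    unsigned i = atomic_fetch_add(&wr, 1u) % SLOTS;
    slot_data[i] = value;
    atomic_store(&slot_ready[i], 1);
    return i;
}

/* FIG. 21: read a ready slot and clear its flag to mark it consumed. */
static int aa_read(unsigned i)
{
    assert(atomic_load(&slot_ready[i]) == 1);
    atomic_store(&slot_ready[i], 0);
    return slot_data[i];
}
```

Because the slot is claimed with a single fetch-and-add, two writers can never receive the same index, and the per-slot flag keeps readers from observing a slot before its data write completes.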

FIG. 22 shows a linked list, specifically an atomic-addition array (A).

FIG. 23 shows a linked list, specifically an atomic-addition array (B).

FIG. 24 shows a linked list, specifically an atomic-addition array (C).

FIG. 25 shows a linked list, specifically an atomic-addition array (D).

FIG. 26 shows a linked list, specifically an atomic-addition array (E).

FIG. 27 shows concurrent access through a shared array with steering.

FIG. 28 shows combining-network distribution of increments.

FIG. 29 shows single-task and multitask execution of concurrent access through an atomic-addition array (A).

FIG. 30 shows single-task and multitask execution of concurrent access through an atomic-addition array (B).

FIG. 31 shows single-task and multitask execution of concurrent access through an atomic-addition array (C).

FIG. 32 shows single-task and multitask execution of concurrent access through an atomic-addition array (D).

FIG. 33 shows single-task and multitask execution of concurrent access through an atomic-addition array (E).

FIG. 34 shows a codelet-set computing system scenario, showing the roles of different users with respect to the system.

FIG. 35 shows a generic exemplary structure at the microchip level. Note that the memory levels are non-specific and are intended to convey a hierarchy of local memory (with fast access) versus non-local memory. For example, L1 may be implemented as a register file, SRAM, and so on.

FIG. 36 shows a generic board-level/system-level structure; it likewise shows the spectrum of performance and globality.

FIG. 37 shows the specification of codelets and codelet sets. There are many equivalent ways to specify codelet sets. Specifications are typically signaled by native-language constructs, through the use of a specific metalanguage, or even by non-executing annotations or by selections made through an integrated development environment. Codelet sets are composable and can be defined to trigger other codelets or codelet sets. A GACT builds functionality by constructing codelet sets from basic codelets and then composing those sets into larger sets that comprise entire applications. The function setDependency allows a dependency to be expressed between two elements of one codelet set or between elements of different codelet sets. In one embodiment, the function implementSet is called at runtime to construct the dependency graphs and convert them into pointers. Additionally, in an embodiment, the compiler is modified to generate dependency information from the code even when that information is not provided by the GACT.
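Dependency signaling of the kind setDependency/implementSet would set up can be sketched as a per-codelet counter of unsatisfied dependencies; the struct layout and helper names below are hypothetical:

```c
#include <assert.h>

/* Each codelet tracks how many of its declared dependencies are still
 * unsatisfied; it becomes ready to fire when the counter reaches zero. */
struct codelet { int pending; int fired; };

static void set_dependency(struct codelet *c) { c->pending++; }

static int signal_codelet(struct codelet *c)   /* returns 1 if it fires now */
{
    if (--c->pending == 0) {
        c->fired = 1;                          /* all inputs satisfied: fire */
        return 1;
    }
    return 0;
}

static int two_input_demo(void)                /* a codelet with two producers */
{
    struct codelet c = { 0, 0 };
    set_dependency(&c);
    set_dependency(&c);
    if (signal_codelet(&c)) return 0;          /* first signal must not fire it */
    if (!signal_codelet(&c)) return 0;         /* second signal must fire it */
    return c.fired;
}
```

An implementSet-style pass would walk the declared graph once, initialize each counter, and replace the symbolic dependencies with direct pointers to the downstream codelets.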

FIG. 38 shows double-buffered computation (A). Note that each codelet set has initialization and cleanup routines to start the system, clean up, and fire exit dependencies. In some embodiments, the initialization and cleanup tasks can be optimized statically at compile time or dynamically at runtime. The runtime system is isomorphic when represented as a Petri net, a graph of places and transitions. Places extend the dataflow model and allow data dependencies, control-flow dependencies, and resource dependencies to be represented. In one embodiment, the system executes high-priority tasks before lower-priority ones. This allows certain system-critical codelets to be scheduled, for example tasks that maintain the system's concurrent resource access. If all execution cores worked on Comp1 and then on Comp2, most cores would suddenly have no work until copy1 and copy2 completed. Therefore, codelets that produce more codelets are given higher priority so that the run queue is never empty. In the description that follows, once the system is started it continuously executes at least some computation codelets, because the copy codelets have high priority whenever they are available.

Also in the double-buffering example, the exemplary index bound of 1024 indicates that when Init finishes, it enables 1024 Comp1 codelets. Similarly, an exemplary index bound of 8 fires eight copy codelets in the copy codelet set. Note that the number 8 is used because the system may have many processors among which DRAM bandwidth must be arbitrated. Thus, although the overhead (context switching) is small, a codelet system can use very few execution cores to achieve the same sustained bandwidth, yielding improved application throughput. In another embodiment, the system can dynamically provide a place with tokens flowing into copy1 and returning from it, such that the place always holds 8 tokens. Similarly, the same optimization can be applied to copy2. Finally, in another embodiment, the two places can be fused into a single place, and both copy functions can draw from the same pool of DRAM-bandwidth tokens. In that case, if the computation takes longer than the copying, the system can guarantee that copy1 and copy2 do not occur at the same time. This is an example of the expressive power of Petri-net resource constraints, covering for example memory bandwidth, execution units, power, network, and locks, and it illustrates how codelet sets can exploit that expressive power to enable the construction of highly parallel, highly scalable applications. Note that at 2702, deltaT implies that SignalSet(buffer_set[0]) executes before SignalSet(buffer_set[1]).
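A Petri-net place holding resource tokens, like the pool of 8 DRAM-bandwidth tokens above, can be sketched as follows; the names and the serial demo are illustrative:

```c
#include <assert.h>

struct place { int tokens; };                 /* a Petri-net place */

static struct place dram_tokens = { 8 };      /* 8 bandwidth tokens, as above */

/* A copy codelet may fire only if it can take a token. */
static int try_take(struct place *p)          /* 1 if a token was taken */
{
    if (p->tokens > 0) { p->tokens--; return 1; }
    return 0;
}

static void give_back(struct place *p) { p->tokens++; }

/* Attempt n firings; report how many actually obtained a token. */
static int fire_n(struct place *p, int n)
{
    int fired = 0;
    for (int i = 0; i < n; i++)
        fired += try_take(p);
    return fired;
}
```

In a shared-memory runtime the token count would be an atomic counter (or semaphore), but the Petri-net semantics are the same: the place bounds how many copy codelets can be in flight at once.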

FIG. 39 shows double-buffered computation (B). At 3901, Init Set 1 is signaled; at 3902, Init Set 2 is signaled, and an exemplary number of 1024 computation codelets begin.

FIG. 40 shows double-buffered computation (C). At 4001, task Comp2 is in the queue, but when the system operates in first-come-first-served mode the execution cores continue working on Comp1 unless there is a priority difference. At 4002, Comp1 completes and the high-priority "cleanup" task is placed; Comp2 now proceeds. In other embodiments, work can be handled in orders other than first-in-first-out, for example last-in-first-out to give stack-like semantics. That embodiment is suitable for work sharing in recursive applications.

FIG. 41 shows double-buffered computation (D). At 4101, Comp2 can continue, but at least one execution unit is used for the high-priority task copy(8). At 4102, Comp2 is still proceeding, but more execution units have been assigned to the copy function. The system cleans up the resources after the copy.

FIG. 42 shows double-buffered computation (E). At 4201, the system checks whether the completion flag is in buffer 1. At 4202, the Comp1 codelets are initialized.

FIG. 43 shows double-buffered computation (F). In 4301, the Comp1 codelets are queued behind the existing Comp2 codelets. In 4302, Comp2 completes while Comp1 continues.

FIG. 44 shows double-buffered computation (G). Finally, at 4401, the high-priority codelets of copy set 2 are initialized while Comp1 continues. Note that codelets can receive signals at any time, even during their execution. This allows the migration of code and data to make better use of computational resources. In summary, some notable aspects include: (a) prioritization; (b) balancing concurrency against queue space; and (c) extensions beyond dataflow, which can include, for example, early signals, event streams, and/or letting the programmer influence scheduling.

FIG. 45 shows matrix multiplication with SRAM and DRAM. At 4501, the system copies blocks of matrices A and B from DRAM to SRAM and computes matrix C in SRAM. At 4502, each block of C is copied back to its proper location in DRAM.
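The blocked multiply of FIG. 45 can be sketched as a tile-by-tile triple loop, with each tile pair standing in for a block copied from DRAM into SRAM; the sizes and names are illustrative:

```c
#include <assert.h>

#define N 4   /* matrix dimension (illustrative) */
#define B 2   /* tile size: one "SRAM block" in the figure */

/* Blocked C = A * B: process one pair of B-by-B tiles at a time, as a
 * codelet working on blocks copied into fast local memory would. */
static void block_matmul(const double a[N][N], const double b[N][N],
                         double c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            c[i][j] = 0.0;
    for (int bi = 0; bi < N; bi += B)
        for (int bj = 0; bj < N; bj += B)
            for (int bk = 0; bk < N; bk += B)        /* one tile pair */
                for (int i = bi; i < bi + B; i++)
                    for (int j = bj; j < bj + B; j++)
                        for (int k = bk; k < bk + B; k++)
                            c[i][j] += a[i][k] * b[k][j];
}

/* Self-check: multiplying by the identity must reproduce A exactly. */
static int block_matmul_identity_check(void)
{
    double a[N][N], id[N][N] = {{0}}, c[N][N];
    for (int i = 0; i < N; i++) {
        id[i][i] = 1.0;
        for (int j = 0; j < N; j++)
            a[i][j] = (double)(i * N + j);
    }
    block_matmul(a, id, c);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (c[i][j] != a[i][j]) return 0;
    return 1;
}
```

In the codelet version, the two innermost tile copies become copy codelets and the tile multiply becomes a compute codelet, which is what makes the double buffering of FIG. 46 possible.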

FIG. 46 shows matrix multiplication with double buffering of DRAM. In this case, codelets are used to double-buffer the DRAM accesses, reducing their latency; this is illustrated in the bracketed portion of the code at 4602.

FIG. 47 shows an example of computing LINPACK DTRSM (double-precision triangular solve with multiple right-hand sides). 4701 shows the initial dependencies. Once the matrix multiplication has processed the first row and column, the system can move on to the next data set.

FIG. 48 shows the runtime initialization of the codelet sets for the DTRSM case. Note that Init() is called with a parameter indicating how many codelets will be produced. 4802 shows some optimizations that can be performed on the codelet-set implementation of DTRSM.

FIG. 49 shows a quicksort example. At 4901, the control-flow path depends on the data. If the prior dependencies have been resolved/satisfied, dependencies can be set conditionally based on codelet outputs or intermediate state. 4902 shows a Petri-net representation of the quicksort graph. Under this representation, threads work on the upper half until the swap codelet has no input data (either because there is no data or because all dirty data is on one side). When an execution unit has no high-priority codelet, it takes a low-priority codelet, for example one waiting at a barrier. At that point, the "move" codelet cleans up and moves the pivot to the correct position.
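The two-sided exchange behind FIG. 49 can be sketched as a partition step: elements "dirty" with respect to the pivot are swapped across the boundary until none remain, after which a final "move" places the pivot. This is a conventional partition, assumed here as an approximation of the figure:

```c
#include <assert.h>

/* Partition a[0..n-1] around the pivot a[n-1]; returns the pivot's final
 * slot.  The inner swap mirrors the swap codelet: it fires only while both
 * sides still hold a dirty element. */
static int partition(int a[], int n)
{
    int pivot = a[n - 1], lo = 0, hi = n - 2;
    while (lo <= hi) {
        if (a[lo] <= pivot) { lo++; continue; }      /* clean on the left  */
        if (a[hi] >  pivot) { hi--; continue; }      /* clean on the right */
        int t = a[lo]; a[lo] = a[hi]; a[hi] = t;     /* swap two dirty ones */
    }
    int t = a[lo]; a[lo] = a[n - 1]; a[n - 1] = t;   /* "move": place pivot */
    return lo;
}

/* Self-check: after partitioning, the pivot sits between the two halves. */
static int partition_demo(void)
{
    int a[] = { 3, 7, 1, 9, 2, 5 };
    int p = partition(a, 6);
    if (a[p] != 5) return 0;
    for (int i = 0; i < p; i++)     if (a[i] > 5)  return 0;
    for (int i = p + 1; i < 6; i++) if (a[i] <= 5) return 0;
    return 1;
}
```

In the codelet rendering, the two scans and the swap become separate codelets whose firing is gated by the availability of dirty elements, which is what the Petri-net places at 4902 express.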

FIG. 50 shows extensible system functions together with application codelet sets. Because system functionality integrates fluidly with codelet-set applications, system designers have great flexibility in balancing system overhead against system services. For some uses and applications there may be almost no system software, while in other cases extensive monitoring and debugging may produce more system tasks than the application tasks running at a given time.

FIG. 51 shows the conversion of existing program code to multitasking, showing how a multitask input/output table can be used to develop concurrent evaluation of the code through codelet sets. Priorities are constructed so that the sequential tasks necessary to enable one or more subsequent concurrent tasks have high priority. Mapping specific elements of the input variable set to the output variable set allows a receiving function to begin processing as soon as the first instance is available. Counting the number of separable instances allows the system software to distribute codelet execution so as to permit high CPU utilization and exploit data locality.

FIG. 52 shows "black box" code running together with multitask code, illustrating a scenario 5201 in which library code has been converted into codelet sets but black-box user code 5202 remains inherently sequential. In other embodiments, the prioritization can be conservative, assuming that subsequent processing requires all black-box values; or statistical, based on performance evaluations of previous runs or of previous cycles of the same run; or both.

FIG. 53 shows improved black-box code running together with multitask code. In this case, portions of the black-box code can be marked by the user so that they become available for concurrent execution. The multitask 5302 preceding the original black-box task corresponds to function F2 5303. The initial portion of the black-box task 5304 corresponds to the refactored function BB1a and is converted to run concurrently, using the result of 5302 when it becomes available. The next portion of the black-box function remains inherently sequential and remains a black box that must complete before subsequent operations. Note that in an embodiment, speculative execution of subsequent functions can be performed, providing a way to obtain concurrency during the execution of 5306. Reference numeral 5308 is the third portion of the refactored black-box function; it corresponds to function BB1c and allows concurrent execution of the library call 5310 corresponding to MP2 5311.

Additional remarks:

Various embodiments of the invention can address optimizing the performance of an application program relative to certain performance measures or certain resource constraints. Exemplary performance measures or constraints can involve, but are not limited to: the total running time of the program; the running time of the program in particular sections; the maximum delay before a particular instruction is executed; the number of processing units used; the amount of memory used; register-file usage; cache memory usage; level-1 cache usage; level-2 cache usage; level-3 cache usage; level-N cache usage, where N is a positive number; static random-access memory usage; dynamic random-access memory usage; global memory usage; virtual memory usage; the number of processors available for use but not used to execute the program; the amount of memory available for use but not used to execute the program; energy consumption; peak energy consumption; the lifetime cost of the computing system; the number of register updates required; the amount of memory clearing required; the effectiveness of security enforcement; and the cost of security enforcement.

Conclusion:

This detailed description provides illustrations of exemplary system operating scenarios and application examples for the embodiments of the invention discussed previously. Specific application, structural and logical implementation examples are provided in this patent application and related patent applications to describe possible implementations of the inventive concepts and related invention utilization scenarios. There are, of course, numerous alternative ways of implementing or utilizing the principles of the invention set forth above, in whole or in part. For example, elements or process steps explicitly described or illustrated herein may, in various embodiments, be combined with each other or with additional elements or steps. The elements may also be further subdivided without departing from the spirit and scope of the invention. Additionally, in various embodiments, aspects of the invention can be implemented using application and system software, general and special purpose microprocessors, custom hardware logic, and various combinations of the above. In general, those skilled in the art will be able to develop different versions of the described embodiments and various modifications thereof; even where such versions and modifications are not individually described herein, they are, in accordance with the principles of the invention, included within its spirit and scope. Accordingly, the specification and drawings are intended to be illustrative only and not restrictive, with the true scope of the invention being indicated by the following claims.

Claims (10)

1. A method for allocating data processing system resources to the tasks of one or more application programs, the method comprising:
a) obtaining a set of codelets configured to carry out at least one task, wherein each codelet is a block of instructions configured to be executed non-preemptively to completion when the inputs required by that codelet are available;
b) determining dependencies among the codelets in the set; and
c) based at least in part on the dependencies among the codelets and on the availability of data processing system resources, dynamically mapping a given codelet in the set to a collection of data processing system resources to execute the given codelet.
2. The method of claim 1, wherein the mapping includes a function selected from the group consisting of: placing, positioning, repositioning, moving and migrating.
3. The method of claim 1, wherein the mapping includes a function selected from the group consisting of: determining a start time for executing the given codelet, and determining a location for executing the given codelet.
4. The method of claim 1, wherein the mapping is performed based on at least one criterion selected from the group consisting of: 1) improving a performance metric of the application program, 2) improving utilization of the data processing system resources, and 3) maximizing an application program performance metric while observing a given set of resource consumption objectives.
5. The method of claim 1, wherein the mapping is performed to optimize application program performance with respect to a measure selected from the group consisting of:
the total run time of the program; the run time of the program in a particular portion; the maximum delay before executing a particular instruction; the amount of processing units used; the amount of memory used; the use of register files; the use of cache memory; the use of level-1 cache; the use of level-2 cache; the use of level-3 cache; the use of level-N cache, where N is a positive number; the use of static random access memory; the use of dynamic random access memory; the use of global memory; the use of virtual memory; the amount of processors available for use other than for executing the program; the amount of memory available for use other than for executing the program; energy consumption; peak energy consumption; the lifetime cost of the computing system; the amount of register updates required; the amount of memory clears required; the effectiveness of security enforcement; and the cost of security enforcement.
6. The method of claim 1, wherein the mapping is performed to operate within a resource constraint selected from the group consisting of:
the total run time of the program; the run time of the program in a particular portion; the maximum delay before executing a particular instruction; the amount of processing units used; the amount of memory used; the use of register files; the use of cache memory; the use of level-1 cache; the use of level-2 cache; the use of level-3 cache; the use of level-N cache, where N is a positive number; the use of static random access memory; the use of dynamic random access memory; the use of global memory; the use of virtual memory; the amount of processors available for use other than for executing the program; the amount of memory available for use other than for executing the program; energy consumption; peak energy consumption; the lifetime cost of the computing system; the amount of register updates required; the amount of memory clears required; the effectiveness of security enforcement; and the cost of security enforcement.
7. The method of claim 1, wherein the mapping is performed to pursue a time-varying mix of objectives, wherein the mix changes over time due to a factor selected from the group consisting of: predetermined changes and dynamically occurring changes.
8. The method of claim 1, further comprising applying compile-time instructions to assist in carrying out one or more of the obtaining, the determining, or the dynamic mapping.
9. The method of claim 8, wherein the compile-time instructions are selected from the group consisting of: required floating point units, required floating point precision, frequency of access, locality of access, stalled access, read-only data types, initially read-only data types, finally read-only data types, and conditionally read-only data types.
10. A data processing system comprising a plurality of cores, including:
a) a set of system management agents, the set including one or more of: a data percolation manager, a codelet scheduler, a codelet migration manager, a load balancer, a power regulator, or a performance manager, wherein each codelet is a block of instructions configured to be executed non-preemptively to completion when the inputs required by that codelet are available; and
b) components through which the set of system management agents cooperatively optimizes program execution among the plurality of cores.
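The method of claim 1 — obtaining codelets, determining inter-codelet dependencies, and dynamically mapping codelets whose required inputs are available — can be sketched with a simple dataflow scheduler. The `Codelet` class and the scheduler loop below are illustrative assumptions, not the patented implementation:

```python
from collections import deque

class Codelet:
    """A non-preemptively executed instruction block: it runs to completion
    once all of its required inputs (dependencies) are available."""
    def __init__(self, name, fn, deps=()):
        self.name, self.fn, self.deps = name, fn, list(deps)

def run_codelets(codelets):
    """Fire each codelet as soon as its dependencies have produced results.
    Assumes the dependency graph is acyclic (a cycle would loop forever)."""
    done, order = {}, []
    remaining = deque(codelets)
    while remaining:
        c = remaining.popleft()
        if all(d in done for d in c.deps):          # required inputs available?
            done[c.name] = c.fn(*(done[d] for d in c.deps))
            order.append(c.name)
        else:
            remaining.append(c)                     # not ready yet; requeue

    return done, order

# Small example graph: "c" consumes the outputs of "a" and "b".
cs = [
    Codelet("c", lambda a, b: a + b, deps=("a", "b")),
    Codelet("a", lambda: 2),
    Codelet("b", lambda: 3),
]
done, order = run_codelets(cs)
```

Even though "c" is submitted first, the scheduler defers it until "a" and "b" have completed; in a real runtime the same readiness test would also consult resource availability to decide where, not just when, each codelet executes.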
CN201180019116.4A 2010-04-13 2011-04-13 Operation space method, system and apparatus Expired - Fee Related CN102934081B (en)

Applications Claiming Priority (7)

Application Number Priority Date Filing Date Title
US32336210P 2010-04-13 2010-04-13
US61/323,362 2010-04-13
US37706710P 2010-08-25 2010-08-25
US61/377,067 2010-08-25
US38647210P 2010-09-25 2010-09-25
US61/386,472 2010-09-25
PCT/US2011/032316 WO2011130406A1 (en) 2010-04-13 2011-04-13 Runspace method, system and apparatus

Publications (2)

Publication Number Publication Date
CN102934081A CN102934081A (en) 2013-02-13
CN102934081B true CN102934081B (en) 2016-11-30


Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090164399A1 (en) * 2007-12-19 2009-06-25 International Business Machines Corporation Method for Autonomic Workload Distribution on a Multicore Processor


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A dynamic matching and scheduling algorithm for heterogeneous computing systems; M. Maheswaran et al.; Heterogeneous Computing Workshop, 1998 (HCW 98) Proceedings, Seventh; 1998-03-30; pp. 57-69 *
Automatic Calibration of Performance Models on Heterogeneous Multicore Architectures; Cédric Augonnet et al.; Euro-Par 2009 – Parallel Processing Workshops; 2009-08-28; pp. 56-65 *
Flextream: Adaptive Compilation of Streaming Applications for Heterogeneous Architectures; Amir H. Hormati et al.; Parallel Architectures and Compilation Techniques, 2009 (PACT '09), 18th International Conference on; 2009-09-16; pp. 214-223 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20161130

Termination date: 20180413