
WO2025085693A1 - Framework for domain-specific embedded systems - Google Patents


Info

Publication number
WO2025085693A1
Authority
WO
WIPO (PCT)
Prior art keywords
api
api function
application
software application
runtime
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2024/051862
Other languages
French (fr)
Inventor
Joshua Andrew MACK
Ali Akoglu
Sahil HASSAN
Serhan GENER
Hasan Umit SULUHAN
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Arizona
Original Assignee
University of Arizona
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Arizona filed Critical University of Arizona
Publication of WO2025085693A1 publication Critical patent/WO2025085693A1/en
Pending legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/60Software deployment
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation

Definitions

  • the present disclosure relates to a development and runtime framework for a wide range of heterogeneous System on Chip (SoC) architectures.
  • SoC System on Chip
  • DSSoC Domain-Specific System on Chip
  • This patent document describes, among other things, a framework that brings together application development, resource management, and accelerator design into a single compilation and runtime toolchain.
  • a platform for deploying a software application on an embedded system that comprises multiple types of processing elements includes a set of Application Programming Interface (API) functions, a set of program modules each comprising hardware-specific implementation of the set of API functions corresponding to one of the multiple types of processing elements, and a compiler configured to receive a set of instructions representing the software application and generate a binary object by compiling the set of instructions, wherein the binary object is configured to link to at least part of the set of program modules that corresponds to one or more target types of processing elements.
  • API Application Programming Interface
  • a method for deploying and executing a software application on multiple types of processing elements includes receiving a set of instructions representing the software application, wherein the set of instructions comprises one or more invocations of a set of Application Programming Interface (API) functions.
  • the method also includes generating a binary object by compiling the set of instructions, wherein the binary object is configured to link to a set of program modules that corresponds to one or more target types of processing units, and wherein each of the set of program modules comprises hardware-specific implementation of the set of API functions corresponding to a target type of processing units.
  • a method for enabling execution of multiple software applications on an embedded system includes receiving, by a runtime module deployed on the embedded system, a first invocation of an Application Programming Interface (API) function by a first software application and a second invocation of the API function by a second software application.
  • the method also includes determining, by the runtime module, a mapping between the API function and available computing resources on the embedded system and scheduling the first invocation of the API function by the first software application and the second invocation of the API function by the second software application according to the mapping.
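The platform summarized above exposes hardware-agnostic API functions to the application developer. As a rough sketch (all names and signatures here are hypothetical, not actual framework symbols), an application might call an FFT API whose symbol is resolved at link time either to a CPU implementation or to an accelerator dispatch routine in the runtime library:

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using cfloat = std::complex<float>;

// Hardware-agnostic API declaration the application compiles against.
// Which processing element runs it is decided by the later link step:
// here a stand-in CPU implementation; on a DSSoC the same symbol would
// resolve to an accelerator dispatch routine in the runtime library.
void api_fft(std::vector<cfloat>& signal) {
    // Placeholder O(N^2) DFT; a real CPU module would use an FFT algorithm,
    // and an accelerator module would instead push the buffer over MMIO/DMA.
    const std::size_t n = signal.size();
    std::vector<cfloat> out(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t t = 0; t < n; ++t)
            out[k] += signal[t] *
                      std::polar(1.0f, -2.0f * 3.14159265f *
                                           static_cast<float>(k * t) /
                                           static_cast<float>(n));
    signal = out;
}
```

An application invoking `api_fft` compiles unchanged whether the CPU static archive or the accelerator-backed runtime library is linked in, which is the hardware-agnostic property the claims describe.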
  • FIG. 1 is a schematic block diagram of an integrated compiler and runtime system in a Compiler-Integrated, Extensible, DSSoC Runtime (CEDR) configuration.
  • CEDR Compiler-Integrated, Extensible, DSSoC Runtime
  • FIG. 2 illustrates an example application structure that can be represented in the system in accordance with one or more embodiments of the present disclosure.
  • FIG. 3 is a schematic block diagram of an example library and Application Programming Interface (API) compilation methodology in accordance with one or more embodiments of the present disclosure.
  • FIG. 4A illustrates an example of synchronization methodology dispatching heterogeneous kernels from the example library to an example runtime system in accordance with one or more embodiments of the present disclosure.
  • FIG. 4B is a schematic block diagram of example queue operations of an example runtime system in accordance with one or more embodiments of the present disclosure.
  • FIG. 5 is a graphical comparison between runtime overhead in an example system in accordance with one or more embodiments of the present disclosure and prior art systems.
  • FIG. 6A illustrates an example comparison between prior art systems and systems in accordance with one or more embodiments of the present disclosure.
  • FIG. 6B illustrates another example comparison between prior art systems and systems in accordance with one or more embodiments of the present disclosure.
  • FIG. 7A illustrates another example comparison between prior art systems and systems in accordance with one or more embodiments of the present disclosure.
  • FIG. 8 is a flowchart representation of a method for deploying and executing a software application on multiple types of process elements in accordance with one or more embodiments of the present technology.
  • FIG. 9 is a flowchart representation of a method for enabling execution of multiple software applications on an embedded system in accordance with one or more embodiments of the present technology.
  • DSSoC software abstractions can be coupled with intelligent, intermediate runtime systems that are capable of arbitrating or scheduling requests from all applications to the PEs across the DSSoC.
  • each application is configured to map its computational kernels to as many of the system's heterogeneous processing elements as possible.
  • the programming abstractions should be agnostic to the underlying hardware.
  • a runtime can be designed to choose the composition and capabilities of accelerators on a DSSoC, determine a scheduling heuristic for an application domain, and provide a set of software abstractions for programmers writing applications for DSSoCs.
  • FaaS Function-as-a-Service
  • Other researchers have presented an approach for dynamically scheduling Function-as-a-Service (FaaS) computations to heterogeneous, network-connected computing resources, but their definition of heterogeneity primarily focuses on networked collections of CPU-only or CPU-GPU systems, leaving a large unsolved problem in the area of a FaaS-based methodology for dispatching work to function-specific accelerators like FFTs.
  • Other researchers have introduced a runtime built to allow efficient execution on heterogeneous SoCs where each API can have a number of implementations on heterogeneous resources, but only linear chains of kernels are supported, and parallel dispatch through a non-blocking execution methodology is not supported.
  • DSSoC platforms should also enable a productive programming and deployment experience in such a way that multiple users can coexist and share the underlying hardware as a service by supporting execution of any combination of dynamically arriving applications.
  • Systems and methods in accordance with embodiments of the present disclosure include a framework that brings together application development, resource management, and accelerator design into a single compilation and runtime toolchain.
  • DSSoC hardware can be the target platform to execute a user application that results from the use of the framework.
  • frameworks in accordance with embodiments of the present disclosure enable compilation and execution of a user application for test purposes only. When tests are complete, the framework enables the application to be built as a shared object.
  • the shared object is provided to a runtime environment in accordance with embodiments of the present disclosure where the runtime environment includes implementations of the API functions utilized by the application.
  • the shared object is linked to implementations of the API functions in a runtime library in accordance with embodiments of the present disclosure.
  • the resulting task is placed on a queue and processed, possibly in a multi-threaded manner.
  • the CEDR configuration 100 is composed of two components: a compilation workflow and a runtime workflow 125.
  • the compilation workflow is used to transform user applications into applications that the runtime workflow 125 can execute using conventional methods.
  • the runtime workflow 125 includes, but is not limited to including, worker threads 123, a queue of tasks, and a main event loop of the runtime workflow 125 that receives, parses, launches, and manages applications.
  • Each processing element (PE) 117 in the system - for example, but not limited to, accelerator or CPU core - is paired with a worker thread 123 that manages executing tasks on that compute resource.
  • each worker thread 123 is assigned via its processor affinity to run on the corresponding resource (the CPU core).
  • the respective worker thread 123 is assigned via its affinity to a CPU core in the system, and that CPU core is responsible for coordinating any configuration updates or data transfers that the accelerator core requires.
  • the runtime workflow 125 periodically pushes work to the worker threads 123 by scheduling tasks to them from the queue of tasks according to a heuristic as part of the input by the user application 101 to the runtime configuration 103.
  • the runtime configuration 103 includes features such as application programming interface (API)-based performance counters. As tasks are completed, the worker threads 123 signal completion back to a main thread, and JavaScript Object Notation (JSON) directed acyclic graph (DAG) dependencies of the tasks are then pushed to the back of the queue.
  • JSON JavaScript Object Notation
  • DAG directed acyclic graph
  • the runtime workflow 125 dynamically updates that task's function pointer such that its worker thread 123 invokes a function that is compatible with that resource.
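The function-pointer update can be pictured as follows; this is a minimal sketch with invented names, not CEDR's actual data structures:

```cpp
#include <map>
#include <string>

// A scheduled task carries a function pointer that the runtime rebinds to an
// implementation compatible with whichever resource the task was assigned to.
using Kernel = int (*)(int);

int fft_on_cpu(int x)   { return x + 1; }  // stand-in CPU kernel
int fft_on_accel(int x) { return x * 2; }  // stand-in accelerator kernel

struct Task {
    std::string api;
    Kernel run = nullptr;  // rebound by the runtime at scheduling time
};

void bind_to_resource(Task& t, const std::string& pe) {
    static const std::map<std::string, Kernel> impls{
        {"cpu", fft_on_cpu}, {"fft_accel", fft_on_accel}};
    t.run = impls.at(pe);  // the worker thread now invokes the right variant
}
```

The worker thread never needs to know which variant it is calling; it simply invokes `t.run` after the runtime has rebound it.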
  • the applications 101 are parsed by application parser 119 and their executions are started by placing the head nodes of their DAGs into the queue.
  • the runtime workflow 125 continues indefinitely until an IPC command is received instructing the runtime workflow 125 to shut down.
  • the runtime workflow 125 serializes the logs 107.
  • the logs 107 are related to, for example, but not limited to, task execution and performance counter measurements, and can be used, for example, among other things, for online or offline analysis.
  • the runtime workflow 125 can be configured to, for example, but not limited to, accommodate concepts of iteration 201 and conditional branching 203 like that shown in the right half of FIG. 2.
  • A DAG-based application representation is unable to express control-flow concepts such as iteration or conditional branching. This is problematic when trying to schedule applications that have structures like the for-loop structure 205 of FIG. 2.
  • Kernel1, Kernel2, and Kernel3 can be individually compatible with accelerators on the system, but because a DAG-based program representation does not allow for a sufficiently granular program representation similar to the one shown in the right half of FIG. 2, the entire for-loop structure 205 must be collapsed to a single DAG node and presented to CEDR as a single unit to be scheduled. Because it is unlikely that an accelerator exists on the system that can handle this specific sequence of iterated kernels, such a single DAG node is likely to be supported only on the CPU, and the benefits of acceleration in the associated application are reduced.
  • the present document discloses a new API-based development workflow that allows calls to domain-specific accelerators such as, but not limited to, FFT, general matrix multiplication (GEMM), and 2D convolution (CONV2D) accelerators.
  • FIG. 3 is a schematic block diagram of an example library and API compilation methodology in accordance with one or more embodiments of the present disclosure.
  • the structure of the library 301 is shown in the left side of FIG. 3.
  • a set of APIs for use in application code are exposed to developers through the header file 303, which includes high level kernel declarations that do not contain any implementation details of the underlying operation.
  • Each module includes implementation of the set of API functions corresponding to one of the multiple types of PEs (e.g., CPU core, FPGA cores, etc.)
  • the library 301 provides an FFT module.
  • the FFT module provides physical implementations of the high-level APIs as desired.
  • the configuration header file 305 provides global information about the platform in use such as base addresses for accelerators’ interfaces to enable driverless memory-mapped I/O (MMIO) control and dispatch of tasks.
  • MMIO memory-mapped I/O
  • the library also provides standard implementations that can be leveraged across the platforms that are supported for the set of APIs.
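The configuration header can be pictured as a small compile-time platform description; the addresses and register offsets below are hypothetical examples, not values for any real board:

```cpp
#include <cstdint>

// Illustrative platform configuration: base addresses of each accelerator's
// register interface, enabling driverless MMIO control. On a real platform
// these regions would be mapped (e.g., via /dev/mem) and written directly.
constexpr std::uintptr_t FFT0_BASE   = 0xA0000000;        // hypothetical AXI base
constexpr std::uintptr_t FFT0_CTRL   = FFT0_BASE + 0x00;  // control register
constexpr std::uintptr_t FFT0_STATUS = FFT0_BASE + 0x04;  // status register
constexpr std::uintptr_t FFT0_DATA   = FFT0_BASE + 0x10;  // data FIFO window

// Offset arithmetic helper used when dispatching to a specific accelerator.
constexpr std::uintptr_t reg_addr(std::uintptr_t base, std::uintptr_t off) {
    return base + off;
}
```

Because these values are compile-time constants, the library modules can address their accelerators without any per-call lookup or kernel driver involvement.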
  • the user sets up the configuration file 305, chooses which modules from the library 301 to enable, and receives two outputs: a static library archive 307 containing the implementations of implemented APIs and a “runtime” library shared object 309 containing both the implementations that are included in the static library archive 307 and accelerator implementations through their respective library modules.
  • the user receives one output — the “runtime” library shared object 309 with the accelerator implementations through their respective library modules.
  • the workflow as shown in the right side of FIG. 3 is navigated.
  • One key benefit of this compilation approach is the way it enables rapid application bring-up and evaluation prior to testing in complex heterogeneous computing scenarios.
  • a user can begin by treating the library 301 as if it were another CPU-based library.
  • the code is compiled by building the application as a shared object that avoids linking in implementations for the library API calls.
  • the shared object application is then provided to the runtime workflow similar to the runtime workflow 125 shown in FIG. 1.
  • the runtime workflow is provided with the corresponding library-rt shared object containing the system's API implementations, and it builds a mapping from each API and resource type pairing to a physical implementation of that API on that resource if one exists. For example, the runtime can determine that an API call to FFT is mapped to an FFT accelerator and another API call to GEMM is mapped to a matrix-specific accelerator.
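The mapping the runtime builds can be sketched as a lookup table keyed by (API, resource type); all entries below are illustrative stand-ins for symbols that, in the real flow, would be discovered while parsing the library-rt shared object:

```cpp
#include <map>
#include <string>
#include <utility>

using ApiResource = std::pair<std::string, std::string>;

std::map<ApiResource, std::string> build_mapping() {
    // Hypothetical entries; the real table is populated from the symbols
    // found in the library-rt shared object at startup.
    return {
        {{"FFT", "cpu"}, "fft_cpu_impl"},
        {{"FFT", "fft_accel"}, "fft_accel_impl"},
        {{"GEMM", "cpu"}, "gemm_cpu_impl"},
        {{"GEMM", "gemm_accel"}, "gemm_accel_impl"},
    };
}

// Resolve an API call for a target resource; an empty string means no
// implementation of that API exists on that resource.
std::string resolve(const std::map<ApiResource, std::string>& m,
                    const std::string& api, const std::string& res) {
    auto it = m.find({api, res});
    return it == m.end() ? std::string{} : it->second;
}
```

A failed lookup (e.g., GEMM on an FFT accelerator) simply means the scheduler must place that task on a resource for which an implementation does exist.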
  • the shared object is parsed and a new system thread is spawned that executes that application’s main function. As the application’s main function executes, it periodically encounters library API calls.
  • the library API calls are linked during binary parsing against implementations in library-rt 309 that correspondingly call a function inside the runtime workflow to place the task into a queue, and from there, CEDR’s existing heuristics are able to process it in the same fashion as the existing DAG-based methodology.
  • because the runtime workflow is multi-threaded, involving the user application thread 403, the runtime workflow thread 405, and the eventual worker thread 408 of the executing resource, synchronization is needed so that the user application can recognize completion of the underlying API call.
  • before pushing the requested task to the queue 401, the user application thread 403 initializes variables 407 that can receive updates on the progress of the task. After dispatching the task to the runtime workflow thread 405, the user application thread 403 reaches a wait barrier 411 and goes to sleep.
  • upon task completion, the corresponding worker thread 408 signals completion, such as via the Pthread barrier condition signal 450 shown in FIG. 4A, back to the application thread 403. After this, the application thread 403 wakes and resumes its computation.
  • the process described in FIGs. 4A-4B ensures functional correctness relative to the single-threaded execution of the user application.
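The blocking hand-off can be sketched as follows, with C++ threading primitives standing in for the Pthread barrier scheme described above (all names are invented for illustration):

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

struct TaskSync {
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
};

int run_blocking_api(int input) {
    TaskSync sync;
    int result = 0;
    // The worker thread stands in for the runtime's per-PE worker.
    std::thread worker([&] {
        int r = input * input;  // placeholder kernel computation
        std::lock_guard<std::mutex> lk(sync.m);
        result = r;
        sync.done = true;
        sync.cv.notify_one();   // completion signal back to the app thread
    });
    // The application thread reaches its wait barrier and sleeps until
    // the worker signals completion, matching the flow of FIG. 4A.
    {
        std::unique_lock<std::mutex> lk(sync.m);
        sync.cv.wait(lk, [&] { return sync.done; });
    }
    worker.join();
    return result;
}
```

Because the application thread sleeps until the signal arrives, the API call behaves exactly like a single-threaded function call from the application's point of view, which is the functional-correctness property noted above.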
  • non-blocking variations of APIs allow full control over the task synchronization primitives thus enabling parallelism in the user application 403.
  • the user can invoke the non-blocking variations of APIs to improve the performance of the application.
  • in FIG. 5, the results of experiments using a system in accordance with embodiments of the present disclosure are graphically shown.
  • Xilinx Zynq Ultrascale+ ZCU102 and NVIDIA Jetson AGX Xavier development boards may be used.
  • Three representative real-world applications can be used for evaluations covering the radar processing, communications system, and autonomous vehicle domains with Pulse Doppler (PD), WiFi TX (TX), and Lane Detection (LD).
  • Pulse Doppler calculates velocity of an object by measuring distance of the object using 256-point FFTs and the frequency shift between transmitted and received signals.
  • WiFi TX generates packets of 64 bits and prepares for transmission over an arbitrary channel through scrambler, encoder, modulation, and forward error correction processes.
  • WiFi TX relies on 128-point inverse FFT for each packet transmitted.
  • Lane Detection is a convolution-intensive routine from the autonomous vehicle domain. It has been shown that implementing convolution in the frequency domain rather than the spatial domain through a combination of FFT and pointwise product (ZIP) operations can reduce algorithmic complexity and inference time.
  • a workload composed of these three applications allows evaluation of various scenarios where a heterogeneous SoC is shared by multiple applications in an interleaved manner.
  • An example scenario could involve Lane Detection running as a continuous process where Pulse Doppler and WiFi TX applications arrive dynamically and are executed periodically. Such scenarios allow studying of the relationship between degree of SoC heterogeneity, scheduling overhead and quality of schedules achieved by various heuristics targeting autonomous vehicles domain.
  • the Lane Detection application stresses the FFT accelerator on the emulated heterogeneous SoC, with the number of 1024-point FFTs and IFFTs scaling to 16384 and 8192 instances, respectively, for a 960x540 image.
  • WiFi TX and Pulse Doppler are lower latency applications, with the number of FFTs scaling to 100 and 512, respectively.
  • FFT and ZIP can be used as key functions that are supported with accelerator based execution.
  • Each application is implemented via the hardware agnostic API calls for each function.
  • the three applications can be compiled using the CEDR compilation toolchain described in Section II, and binaries to be executed on heterogeneous SoC configurations emulated on both the ZCU102 and Jetson development boards can be prepared.
  • the ZCU102 is formed of 4 ARM cores running at 1.2 GHz and programmable FPGA fabric where FFT accelerators were invoked running at 300 MHz.
  • the FFT accelerator can be implemented using Xilinx FFT IP supporting up to 2048-point FFTs.
  • the FFT accelerator uses direct memory access (DMA) to manage data transfers between the ARM cores and the accelerator through AXI4-Stream.
  • the Jetson board is formed of 8 ARM cores running at 2.3 GHz and a Volta GPU running at 1.3 GHz, where FFT and ZIP accelerators are implemented as CUDA kernels.
  • the data transfers between ARM cores and the accelerators can be handled with standard cudaMemcpy functions using the PCIe interface.
  • Heterogeneous SoCs can be composed by varying the number and types of processing elements on the ZCU102 platform from the pool of 3 ARM cores along with 8 FFT accelerators.
  • RR Round Robin
  • EFT Earliest Finish Time
  • ETF Earliest Task First
  • HEFTRT Heterogeneous Earliest Finish Time
  • the scheduling overhead captures the time spent by the runtime in making scheduling decisions. This time is proportional to the number of scheduling rounds made by the runtime as well as the complexity of the scheduling algorithm.
  • the application execution time is the time difference between the beginning and completion of an application’s execution, including the overhead of scheduling decisions in between. Lower execution times indicate the scheduler’s capability to manage the workload efficiently.
  • each metric, e.g., execution time, may refer to its corresponding averaged-per-application version, i.e., average execution time per application.
  • Runtime overhead is defined as the overall time spent by CEDR to receive, manage, and terminate applications in a given workload. This overhead excludes the overhead of task scheduling.
  • five instances for each of the Pulse Doppler and WiFi TX applications are used as a workload, and runtime overhead across the sweeping range of the injection rate on the ZCU102 platform with 3 ARM CPUs and 1 FFT accelerator is collected.
  • the X-axis shows the injection rate and the Y-axis shows the runtime overhead.
  • the runtime overhead reduces and then saturates at an injection rate of around 200 Mbps for both API- and DAG-based CEDR.
  • the applications arrive at the runtime in an increasingly overlapping manner as the injection rate increases and, in turn, the queue size grows. This gives the runtime the opportunity to manage multiple tasks concurrently rather than serially, which in turn enables the runtime to complete processing the same number of applications in a shorter span of time, thereby reducing the runtime overhead.
  • the saturation of the trend lines indicates that beyond a certain injection rate, the runtime becomes oversubscribed, where the applications within the workload get executed by CEDR with maximum concurrency.
  • the API-based CEDR achieves a 19.52% reduction on average in runtime overhead with respect to the DAG-based CEDR. This reduction can be attributed to the simplification of runtime steps in API-based CEDR compared to DAG-based CEDR.
  • the runtime overhead involves time required for receiving and parsing application DAG files via IPC to construct application DAG, parsing shared object, pushing tasks to the queue, popping completed tasks from the queue, and finally terminating the completed applications.
  • in the API-based CEDR, two factors contribute to the reduced overhead. First, API-based CEDR does not need to parse DAG files when applications are submitted via IPC. Second, pushing tasks to the queue is eliminated as it is handled by the application thread.
  • FIGS. 6A and 6B illustrate the average execution time per application for the DAG- and API-based CEDR executions, respectively, with respect to injection rate, where individual line plots represent execution using different schedulers.
  • Both FIGS. 6A and 6B show similar saturation trends as the injection rate increases, where the system becomes oversubscribed at around 200 Mbps.
  • the mean magnitude of the saturated region of API-based execution deviates by 32% compared to that of DAG-based execution.
  • the ETF scheduler demonstrates a significantly higher execution time in both plots, while the other schedulers perform similarly to each other in each setup.
  • the ETF scheduler in the oversubscribed region shows an average execution time of 700 ms in the DAG-based CEDR of FIG. 6A.
  • FIGS. 7A and 7B illustrate the scheduling overhead with respect to injection rate and different schedulers for DAG- and API-based CEDR executions, respectively.
  • the scheduling overhead is stable for the other schedulers across the injection rates with very close overhead values.
  • the ETF scheduler, however, shows a remarkably different trend in the API-based CEDR of FIG. 7B, where the scheduling overhead reduces to around 1.15 ms in the saturated region from a scale of around 70 ms in DAG-based CEDR. This reduction is due to the smaller number of tasks that need scheduling in the API-based CEDR. This further demonstrates that the ETF scheduler's execution overhead is more sensitive to the queue size than that of the remaining schedulers.
  • the schedulers other than ETF show an increase in execution time from around 200 ms on the DAG-based CEDR of FIG. 6A to around 350 ms on the API-based CEDR of FIG. 6B in the oversubscribed region. This is primarily due to the way the worker and application threads are managed in API-based CEDR compared to DAG-based CEDR. In DAG-based CEDR, the whole application code is executed on the worker threads as DAG task nodes; hence the available CPU cores are shared only among worker threads.
  • in API-based CEDR, however, both application and worker threads are launched on the available CPU resources, where only the worker threads execute the application portions with heterogeneity support.
  • DAG-based CEDR spawns 4 worker threads, while API-based CEDR launches an additional 10 application threads (five instances of each application), leading to increased thread contention on the underlying CPUs.
  • the same experiment may be performed on the Jetson with a configuration of 3 CPU cores and 1 GPU. With the availability of a total of 7 CPU cores, the 4 worker threads (3 CPU and 1 GPU) and 10 application threads have more resources to share between them. This reduces the thread contention compared to the ZCU102.
  • API-based CEDR better exploits the available resources through concurrent execution of worker and application threads.
  • ETF is most sensitive to the heterogeneity with the highest scheduling overhead.
  • the benefit of reduced queue size due to API-based execution results in reduction in scheduling time that is larger in magnitude than the increase in execution time due to thread contention.
  • Lane Detection has a large number of FFT instances that can stress both the runtime system and the schedulers as the queue size is expected to grow substantially.
  • the autonomous vehicle workload includes a single instance of Lane Detection as a long latency job while lower latency WiFi TX and Pulse Doppler applications arrive dynamically.
  • FIG. 8 is a flowchart representation of a method for deploying and executing a software application on multiple types of processing elements in accordance with one or more embodiments of the present technology.
  • the method 800 includes, at operation 810, receiving a set of instructions representing the software application.
  • the set of instructions comprises one or more invocations of a set of Application Programming Interface (API) functions (e.g., FFT, GEMM, CONV2D kernels).
  • the instructions can include information, such as directives (also referred to as macros), to indicate the types of processing elements that are available to execute the software application (e.g., CPU core, ARM core, FFT accelerators, etc.).
  • the method 800 includes, at operation 820, generating a binary object by compiling the set of instructions.
  • the binary object is configured to link to a set of program modules that corresponds to the one or more target types of processing units.
  • Each of the set of program modules comprises hardware-specific implementation of the set of API functions corresponding to a target type of processing units.
  • the multiple types of processing elements comprise a Central Processing Unit (CPU). In some embodiments, the multiple types of processing elements comprise a domain-specific accelerator. In some embodiments, the method includes generating a second binary object comprising only a CPU-based implementation of the set of API functions. In some embodiments, the method includes determining, by a runtime module included in the binary object, a mapping between each of the set of API functions and one or more computing resources of the one or more target types of hardware systems. In some embodiments, the runtime selects the proper computing resources based on the current state of the computing resources of the target system, without any directives or indications from the software application, in addition to the mapping between the API function(s) and the computing resources.
  • CPU Central Processing Unit
  • the method includes enqueueing, by the runtime module upon an invocation of an API function by the software application, a task corresponding to the invocation of the API function into a task queue; and scheduling tasks in the task queue based on the mapping.
  • FIG. 9 is a flowchart representation of a method for enabling execution of multiple software applications on an embedded system in accordance with one or more embodiments of the present technology.
  • the method 900 includes, at operation 910, receiving, by a runtime module deployed on the embedded system, a first invocation of an Application Programming Interface (API) function by a first software application and a second invocation of the API function by a second software application.
  • the method 900 includes, at operation 920, determining, by the runtime module, a mapping between the API function and available computing resources on the embedded system.
  • the method 900 includes, at operation 930, scheduling the first invocation of the API function by the first software application and the second invocation of the API function by the second software application according to the mapping.
  • both invocations can be related to the same FFT function.
  • Only one dedicated FFT accelerator is available, but the CPU core is largely idle and is available to provide parallel computing power.
  • the two invocations of the same FFT function can be scheduled on the FFT accelerator and the CPU core respectively to achieve optical parallelism.
  • the available computing resources comprise at least one domainspecific processing unit.
  • the method includes linking to a program module that comprises hardware-specific implementation of the API function corresponding to the at least one domain-specific processing unit.
  • the method includes invoking a hardware-specific implementation of the API function on the at least one domain-specific processing unit.
  • a method for deploying and executing a software application on multiple types of processing elements comprising receiving a set of instructions representing the software application, wherein the set of instructions comprises one or more invocations of a set of Application Programming Interface (API) functions, and generating a binary object by compiling the set of instructions, wherein the binary object is configured to link to a set of program modules that corresponds to one or more target types of processing units, and wherein each of the set of program modules comprises hardware-specific implementation of the set of API functions corresponding to a target type of processing units.
  • a method for enabling execution of multiple software applications on an embedded system comprising receiving, by a runtime module deployed on the embedded system, a first invocation of an Application Programming Interface (API) function by a first software application and a second invocation of the API function by a second software application, determining, by the runtime module, a mapping between the API function and available computing resources on the embedded system, and scheduling the first invocation of the API function by the first software application and the second invocation of the API function by the second software application according to the mapping.
  • a non-transitory, computer-readable storage medium comprising instructions recorded thereon, wherein the instructions when executed by at least one data processor of an embedded system, cause the embedded system to implement the method of any of solutions 1 to 12.
  • Implementations of the subject matter and the functional operations described in this patent document can be implemented in various systems, digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
  • Implementations of the subject matter described in this specification can be implemented as one or more computer program products, e.g., one or more modules of computer program instructions encoded on a tangible and non-transitory computer readable medium for execution by, or to control the operation of, data processing apparatus.
  • the computer readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or a combination of one or more of them.
  • “data processing unit” or “data processing apparatus” encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers.
  • the apparatus can include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
  • a computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
  • a computer program does not necessarily correspond to a file in a file system.
  • a program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code).
  • a computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
  • the processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output.
  • the processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
  • processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer.
  • a processor will receive instructions and data from a read only memory or a random access memory or both.
  • the essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data.
  • a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks.
  • a computer need not have such devices.
  • Computer readable media suitable for storing computer program instructions and data include all forms of nonvolatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices.
  • the processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
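The scheduling behavior enumerated above, in which two invocations of the same FFT API function are split across a dedicated accelerator and an otherwise idle CPU core, can be sketched as follows. All type and function names here are hypothetical illustrations, not identifiers from the disclosed framework:

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Each (API name, resource) pairing resolves to a concrete implementation;
// the scheduler spreads concurrent invocations of the same API across every
// resource that implements it.
enum class Resource { CpuCore, FftAccelerator };

struct Invocation {
    std::string api;      // e.g., "FFT"
    Resource mapped_to;   // resource chosen by the runtime
};

class RuntimeMapper {
  public:
    void register_impl(const std::string& api, Resource r) {
        impls_[api].push_back(r);
    }
    // Round-robins a given API's invocations across its implementations, so
    // two applications calling FFT can execute in parallel.
    Invocation schedule(const std::string& api) {
        std::vector<Resource>& options = impls_.at(api);
        Resource r = options[next_[api]++ % options.size()];
        return Invocation{api, r};
    }
  private:
    std::map<std::string, std::vector<Resource>> impls_;
    std::map<std::string, std::size_t> next_;
};
```

With an FFT accelerator registered ahead of the CPU implementation, the first invocation lands on the accelerator and the second on the CPU core, mirroring the two-application scenario above.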


Abstract

A framework that brings together application development, resource management, and accelerator design into a single compilation and runtime toolchain. In one example aspect, a platform for deploying a software application on an embedded system that comprises multiple types of processing elements includes a set of Application Programming Interface (API) functions; a set of program modules each comprising hardware-specific implementation of the set of API functions corresponding to one of the multiple types of processing elements; and a compiler. The compiler is configured to receive a set of instructions representing the software application and generate a binary object by compiling the set of instructions to link to at least part of the set of program modules that corresponds to one or more target types of processing elements.

Description

FRAMEWORK FOR DOMAIN-SPECIFIC EMBEDDED SYSTEMS
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of U.S. Patent Application No. 63/591,327, filed on Oct. 18, 2023, entitled FRAMEWORK FOR DOMAIN-SPECIFIC EMBEDDED SYSTEMS, which is hereby incorporated by reference in its entirety.
GOVERNMENT FUNDING
[0002] This invention was made with government support under Grant No. FA8650-18-2-7860 awarded by DOD/DARPA. The government has certain rights in the invention.
FIELD OF THE DISCLOSURE
[0003] The present disclosure relates to a development and runtime framework for a wide range of heterogeneous System on Chip (SoC) architectures.
BACKGROUND
[0004] As technology scaling reaches its limits, system designers are exploring an increasingly diverse range of methodologies for building systems that can maximize their compute performance within limited size, weight, power, and cost budgets. One such methodology in the literature is the design and fabrication of Domain- Specific System on Chip (DSSoC) devices.
SUMMARY
[0005] This patent document describes, among other things, a framework that brings together application development, resource management, and accelerator design into a single compilation and runtime toolchain.
[0006] In one example aspect, a platform for deploying a software application on an embedded system that comprises multiple types of processing elements is disclosed. The platform includes a set of Application Programming Interface (API) functions, a set of program modules each comprising hardware-specific implementation of the set of API functions corresponding to one of the multiple types of processing elements, and a compiler configured to receive a set of instructions representing the software application and generate a binary object by compiling the set of instructions, wherein the binary object is configured to link to at least part of the set of program modules that corresponds to one or more target types of processing elements.
[0007] In another example aspect, a method for deploying and executing a software application on multiple types of processing elements is disclosed. The method includes receiving a set of instructions representing the software application, wherein the set of instructions comprises one or more invocations of a set of Application Programming Interface (API) functions. The method also includes generating a binary object by compiling the set of instructions, wherein the binary object is configured to link to a set of program modules that corresponds to one or more target types of processing units, and wherein each of the set of program modules comprises hardware-specific implementation of the set of API functions corresponding to a target type of processing units.
[0008] In yet another example aspect, a method for enabling execution of multiple software applications on an embedded system is disclosed. The method includes receiving, by a runtime module deployed on the embedded system, a first invocation of an Application Programming Interface (API) function by a first software application and a second invocation of the API function by a second software application. The method also includes determining, by the runtime module, a mapping between the API function and available computing resources on the embedded system and scheduling the first invocation of the API function by the first software application and the second invocation of the API function by the second software application according to the mapping.
[0009] These, and other aspects are described in the present document.
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] FIG. 1 is a schematic block diagram of an integrated compiler and runtime system in a Compiler-Integrated, Extensible, DSSoC Runtime (CEDR) configuration.
[0011] FIG. 2 illustrates an example application structure that can be represented in the system in accordance with one or more embodiments of the present disclosure.
[0012] FIG. 3 is a schematic block diagram of an example library and Application Programming Interface (API) compilation methodology in accordance with one or more embodiments of the present disclosure.
[0013] FIG. 4A illustrates an example of synchronization methodology dispatching heterogeneous kernels from the example library to an example runtime system in accordance with one or more embodiments of the present disclosure.
[0014] FIG. 4B is a schematic block diagram of example queue operations of an example runtime system in accordance with one or more embodiments of the present disclosure.
[0015] FIG. 5 is a graphical comparison between runtime overhead in an example system in accordance with one or more embodiments of the present disclosure and prior art systems.
[0016] FIG. 6A illustrates an example comparison between prior art systems and systems in accordance with one or more embodiments of the present disclosure.
[0017] FIG. 6B illustrates another example comparison between prior art systems and systems in accordance with one or more embodiments of the present disclosure.
[0018] FIG. 7A illustrates another example comparison between prior art systems and systems in accordance with one or more embodiments of the present disclosure.
[0019] FIG. 7B illustrates yet another example comparison between prior art systems and systems in accordance with one or more embodiments of the present disclosure.
[0020] FIG. 8 is a flowchart representation of a method for deploying and executing a software application on multiple types of processing elements in accordance with one or more embodiments of the present technology.
[0021] FIG. 9 is a flowchart representation of a method for enabling execution of multiple software applications on an embedded system in accordance with one or more embodiments of the present technology.
DETAILED DESCRIPTION
[0022] Software and programming abstractions can be used to support scenarios where multiple users can coexist and interleave their applications across the DSSoC’s heterogeneous pool of processing elements (PEs) while utilizing compute resources effectively in a dynamic way. Many of the predominant heterogeneous compute frameworks such as CUDA, OpenCL, or SYCL assume an environment where an application expert performs offline analysis across a number of possible implementations for a given application, determines the optimal static mapping for all computational kernels in that application, and produces a fixed binary that represents a single, expertly-tuned instance of the application. In an environment of widespread heterogeneous computation, these greedy, inflexible mappings ignore runtime resource contention and can lead to system inefficiencies when such programs are required to share system resources with an arbitrary number of other heterogeneous applications. When resource contention issues are not solved by the operating system, DSSoC software abstractions can be coupled with intelligent, intermediate runtime systems that are capable of arbitrating or scheduling requests from all applications to the PEs across the DSSoC. To enable this runtime to effectively arbitrate resources among applications, each application is configured to map its computational kernels to as many of the system’s heterogeneous processing elements as possible. Namely, the programming abstractions should be agnostic to the underlying hardware. A runtime can be designed to choose the composition and capabilities of accelerators on a DSSoC, determine a scheduling heuristic for an application domain, and provide a set of software abstractions for programmers writing applications for DSSoCs.
[0023] Many DSSoC design efforts start with simulation-based modeling in both high-level simulation and cycle-accurate simulation. In early design space exploration (DSE) scenarios, many of these options are highly effective at narrowing down the scope of designs that are worth exploring on hardware. Compared to a Compiler-Integrated, Extensible, DSSoC Runtime (CEDR), these works are complementary, as the designs that are narrowed down via early DSE can then be modeled on commercial FPGA platforms and evaluated to a greater extent with CEDR in order to collect ground-truth hardware measurements that feed back into future cycles of chip design.
[0024] Works associated with application runtimes can be segmented by those that target accelerator-rich heterogeneous platforms and those that do not. Some researchers have proposed a runtime controller for applications on heterogeneous platforms that includes the ability to perform cluster-level mapping of tasks and monitoring of power or execution metrics, but without the ability to launch simultaneous applications or adjust their scheduling policy. Other researchers have proposed a hardware-based runtime that provides support for applications that are mapped with API calls but are unable to support interchangeable and platform-independent scheduling policies. Still other researchers have presented an approach for dynamically scheduling Function-as-a-Service (FaaS) computations to heterogeneous, network-connected computing resources, but their definition of heterogeneity primarily focuses on networked collections of CPU-only or CPU-GPU systems, leaving a large unsolved problem in the area of a FaaS-based methodology for dispatching work to function-specific accelerators like FFTs. Other researchers have introduced a runtime built to allow efficient execution on heterogeneous SoCs where each API can have a number of implementations on heterogeneous resources, but only linear chains of kernels are supported, and parallel dispatch through a non-blocking execution methodology is not supported. Other researchers have presented a performance evaluation framework built on a task-based runtime that supports heterogeneous execution of accelerators. Other researchers have proposed a device-agnostic, adaptive scheduling approach for scheduling machine learning kernels on heterogeneous architectures. Still other researchers have presented a heterogeneous runtime system that incorporates resource discovery, adaptive scheduling, data movement, and programming model capabilities.
[0025] In a general-purpose computing context, heterogeneous computing systems can be difficult to program and utilize effectively. DSSoC devices restrict the applications used on a given system to a particular domain to build software and programming abstractions for the finalized hardware. In traditional heterogeneous programming paradigms, massive amounts of effort are put into offline performance analysis by domain experts to determine the portions of an application that must be accelerated, the types of accelerators needed, and an effective implementation strategy for a target hardware configuration. Low-performance serial implementations are replaced with optimized heterogeneous implementations, and a static binary that represents a single, expert-tuned instance of the application is produced. Such static and offline resource allocation decisions result in a greedily optimized implementation that assumes the application does not share heterogeneous accelerators with other applications, which can lead to drastic mismanagement or under-utilization of the target hardware.
[0026] To ensure that different applications can share the underlying hardware effectively despite being unaware that the other applications exist, there is a need for an intelligent runtime system and programming framework to enable effective utilization of DSSoC platforms. DSSoC platforms should also enable a productive programming and deployment experience in such a way that multiple users can coexist and share the underlying hardware as a service by supporting execution of any combination of dynamically arriving applications.
[0027] Systems and methods in accordance with embodiments of the present disclosure include a framework that brings together application development, resource management, and accelerator design into a single compilation and runtime toolchain. In an example configuration, DSSoC hardware can be the target platform to execute a user application that results from the use of the framework. With respect to application development, frameworks in accordance with embodiments of the present disclosure enable compilation and execution of a user application for test purposes only. When tests are complete, the framework enables the application to be built as a shared object. The shared object is provided to a runtime environment in accordance with embodiments of the present disclosure where the runtime environment includes implementations of the API functions utilized by the application. When the API calls are encountered, the shared object is linked to implementations of the API functions in a runtime library in accordance with embodiments of the present disclosure. The resulting task is placed on a queue and processed, possibly in a multi-threaded manner.
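The dispatch path described above, where an application calls a hardware-agnostic API whose runtime-library implementation enqueues a task rather than executing it inline, can be sketched as follows. The function and structure names (e.g., `cedr_fft`) are hypothetical stand-ins, not identifiers from the disclosed framework:

```cpp
#include <deque>
#include <functional>
#include <string>

// A task wraps an API invocation; the callable is later rebound to the
// resource-specific implementation chosen by the scheduler.
struct Task {
    std::string api_name;
    std::function<void()> run;
};

// Stand-in for the runtime's queue of pending tasks.
static std::deque<Task> g_ready_queue;

// What the application sees is an ordinary function declaration in the API
// header; what the runtime library links in is this enqueueing wrapper.
void cedr_fft(float* data, int n) {
    g_ready_queue.push_back(
        Task{"FFT", [data, n] { /* resource-specific FFT bound later */ }});
}
```

From the application's perspective nothing changes between the CPU-only test build and the runtime build; only the linked implementation differs.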
[0028] The programming methodology presented herein enables more productive programming of domain-specific architectures while supporting concurrent execution of heterogeneous kernels. While runtime overhead is reduced, as systems evolve into highly heterogeneous architectures with large numbers of distinct processing elements, runtime systems should be designed in ways that allow them to cope with that growth in heterogeneity.
[0029] Described herein is a framework in accordance with embodiments of the present disclosure. Referring now to FIG. 1, a design of an example CEDR configuration 100 is shown. The CEDR configuration 100 is composed of two components: a compilation workflow and a runtime workflow 125. The compilation workflow is used to transform user applications into applications that the runtime workflow 125 can execute using conventional methods.
[0030] The runtime workflow 125 includes, but is not limited to including, worker threads 123, a queue of tasks, and a main event loop of the runtime workflow 125 that receives, parses, launches, and manages applications. Each processing element (PE) 117 in the system - for example, but not limited to, accelerator or CPU core - is paired with a worker thread 123 that manages executing tasks on that compute resource. When the PE 117 is a CPU core, each worker thread 123 is assigned via its processor affinity to run on the corresponding resource (the CPU core). When the PE 117 is an accelerator core, the respective worker thread 123 is assigned via its affinity to a CPU core in the system, and that CPU core is responsible for coordinating any configuration updates or data transfers that the accelerator core requires. Within the main event loop of the runtime workflow 125, the runtime workflow 125 periodically pushes work to the worker threads 123 by scheduling tasks to them from the queue of tasks according to a heuristic as part of the input by the user application 101 to the runtime configuration 103. [0031] The runtime configuration 103 includes features such as application programming interface (API)-based performance counters. As tasks are completed, the worker threads 123 signal completion back to a main thread, and the JavaScript Object Notation (JSON) directed acyclic graph (DAG) dependencies of the tasks are then pushed to the back of the queue. To support heterogeneous execution, when a task is scheduled to a given resource, the runtime workflow 125 dynamically updates that task's function pointer such that its worker thread 123 invokes a function that is compatible with that resource. In some embodiments, as applications 101 are submitted over the IPC channel, the applications 101 are parsed by application parser 119 and their executions are started by placing the head nodes of their DAGs into the queue.
The runtime workflow 125 continues indefinitely until an IPC command is received instructing the runtime workflow 125 to shut down. When the shutdown command is received, the runtime workflow 125 serializes the logs 107. The logs 107 are related to, for example, but not limited to, task execution and performance counter measurements, and can be used, for example, among other things, for online or offline analysis.
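The function-pointer update described above, where the runtime rebinds a scheduled task to an implementation compatible with the chosen resource before handing it to a worker thread, can be sketched as follows. The mapping table and implementation names are hypothetical illustrations under the assumption of one CPU and one FFT-accelerator implementation:

```cpp
#include <map>
#include <string>
#include <utility>

// Each task carries a pointer that the runtime fills in at scheduling time,
// so the worker thread simply calls task.impl() for whatever resource it owns.
using KernelFn = void (*)();

inline void fft_on_cpu() { /* CPU implementation placeholder */ }
inline void fft_on_accel() { /* accelerator-dispatch placeholder */ }

struct SchedTask {
    std::string api = "FFT";
    KernelFn impl = nullptr;  // bound when the task is scheduled
};

// (API, resource) -> implementation, built once at runtime startup.
static const std::map<std::pair<std::string, std::string>, KernelFn> kImpls = {
    {{"FFT", "cpu"}, &fft_on_cpu},
    {{"FFT", "fft_accel"}, &fft_on_accel},
};

// Called by the scheduler after it picks a resource for the task.
void bind_to_resource(SchedTask& t, const std::string& resource) {
    t.impl = kImpls.at({t.api, resource});
}
```

Rebinding at scheduling time, rather than at compile time, is what lets the same binary exploit whichever PE the heuristic selects.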
[0032] Referring now to FIG. 2, the runtime workflow 125 can be configured to, for example, but not limited to, accommodate concepts of iteration 201 and conditional branching 203 like that shown in the right half of FIG. 2. Applications that are represented with a DAG-based format are unable to represent control flow concepts of iteration or conditional branching. This is problematic when trying to schedule applications that have structures like the for-loop structure 205 of FIG. 2. Kernel1, Kernel2, and Kernel3 can be individually compatible with accelerators on the system, but because a DAG-based program representation cannot allow for a sufficiently granular program representation similar to the one shown in the right half of FIG. 2, the entire for-loop structure 205 must be collapsed to a single DAG node and presented to CEDR as a single unit to be scheduled. Because it is unlikely that an accelerator exists on the system that can handle this specific sequence of iterated kernels, such a single DAG node is likely to be supported only on the CPU, and the benefits of acceleration in the associated application are reduced. The present document discloses a new API-based development workflow that allows calls to domain-specific accelerators such as, but not limited to, FFT, general matrix multiplication (GEMM), and 2D convolution operation (CONV2D) accelerators.
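The granularity advantage described above can be illustrated with a minimal sketch: because each kernel call is an individual API invocation, every loop iteration produces independently schedulable tasks instead of one opaque DAG node. The `cedr_kernel*` names are hypothetical stand-ins for the hardware-agnostic API calls:

```cpp
#include <string>
#include <vector>

// Stand-in for the runtime's task queue; each API call records a task here.
static std::vector<std::string> g_dispatched;

inline void cedr_kernel1() { g_dispatched.push_back("Kernel1"); }
inline void cedr_kernel2() { g_dispatched.push_back("Kernel2"); }
inline void cedr_kernel3() { g_dispatched.push_back("Kernel3"); }

// The for-loop stays in ordinary application code: the runtime sees
// 3 * num_frames individually schedulable tasks, not one collapsed node.
void process_frames(int num_frames) {
    for (int i = 0; i < num_frames; ++i) {
        cedr_kernel1();  // each call may map to a different PE
        cedr_kernel2();
        cedr_kernel3();
    }
}
```

Under a DAG-only representation, the same loop would surface as a single CPU-bound unit; here each of the enqueued tasks remains individually eligible for acceleration.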
[0033] FIG. 3 is a schematic block diagram of an example library and API compilation methodology in accordance with one or more embodiments of the present disclosure. The structure of the library 301 is shown in the left side of FIG. 3. A set of APIs for use in application code are exposed to developers through the header file 303, which includes high-level kernel declarations that do not contain any implementation details of the underlying operation. As different DSSoC platforms develop different accelerator implementations of these kernels, they are incorporated through a set of program modules (also referred to as library modules). Each module includes implementation of the set of API functions corresponding to one of the multiple types of PEs (e.g., CPU cores, FPGA cores, etc.). As an example, for a platform with a Fast Fourier Transform (FFT) accelerator, the library 301 provides an FFT module. The FFT module provides physical implementations of the high-level APIs as desired. The configuration header file 305 provides global information about the platform in use, such as base addresses for accelerators’ interfaces to enable driverless memory-mapped I/O (MMIO) control and dispatch of tasks. In some embodiments, the set of APIs includes standard implementations that can be leveraged across the supported platforms. As such, in some embodiments at compilation time, the user sets up the configuration file 305, chooses which modules from the library 301 to enable, and receives two outputs: a static library archive 307 containing the implementations of the implemented APIs and a “runtime” library shared object 309 containing both the implementations that are included in the static library archive 307 and accelerator implementations through their respective library modules.
In some embodiments, e.g., when the user passes the initial functional test stage, the user receives one output — the “runtime” library shared object 309 with the accelerator implementations through their respective library modules.
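The split between the hardware-agnostic API header and the per-platform configuration header described above can be sketched as follows. The declarations, field names, and the MMIO base address are all illustrative assumptions, not values from the disclosed framework:

```cpp
// ---- api.h (sketch): hardware-agnostic declarations visible to apps ----
// No implementation details; the linked module supplies the body.
void cedr_fft(float* data, int n, bool inverse);
void cedr_gemm(const float* a, const float* b, float* c,
               int m, int n, int k);

// ---- platform_config.h (sketch): per-platform accelerator information ----
struct AccelConfig {
    unsigned long mmio_base;   // base address for driverless MMIO control
    int max_fft_points;        // capability limit of the accelerator IP
};

// Hypothetical values for one emulated platform's FFT accelerator.
constexpr AccelConfig kFftAccel{0xA0000000UL, 2048};
```

An FFT library module for this platform would include both headers: the first fixes the signatures it must implement, the second tells it where to reach the accelerator.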
[0034] After the library 301 is compiled, the workflow as shown in the right side of FIG. 3 is navigated. One key benefit of this compilation approach is the way that it enables rapid application bring up and evaluation prior to testing in complex heterogeneous computing scenarios. In early stages of development integration, a user can begin by treating the library 301 as if it were another CPU-based library. When functional testing of the code is complete, the code is compiled by building the application as a shared object that avoids linking in implementations for the library API calls.
[0035] The shared object application is then provided to the runtime workflow similar to the runtime workflow 125 shown in FIG. 1. During startup, the runtime workflow is provided with the corresponding library-rt shared object containing the system’s API implementations, and it builds a mapping from each API and resource type pairing to a physical implementation of that API on that resource if one exists. For example, the runtime can determine that an API call to FFT is mapped to an FFT accelerator and another API call to GEMM is mapped to a matrix-specific accelerator. When an application is received by the main event loop of the runtime workflow, the shared object is parsed and a new system thread is spawned that executes that application’s main function. As the application’s main function executes, it periodically encounters library API calls. The library API calls are linked during binary parsing against implementations in library-rt 309 that correspondingly call a function inside the runtime workflow to place the task into a queue, and from there, CEDR’s existing heuristics are able to process it in the same fashion as the existing DAG-based methodology.
[0036] Referring now to FIGS. 4A and 4B, since the runtime workflow is multi-threaded, involving the user application thread 403, the runtime workflow thread 405, and the eventual worker thread 408 of the executing resource, there is a need for synchronization so that the user application can recognize completion of the underlying API call. Before pushing the requested task to the queue 401, the user application thread 403 initializes variables 407 that can receive updates on the progress of the user application. After dispatching the user application to the runtime workflow thread 405, the user application thread 403 reaches a wait barrier 411 and goes to sleep. As the task propagates through to the scheduler 413 and the eventual API implementation 415, the corresponding worker thread 408 signals task completion, such as the Pthread barrier condition signal 450 shown in FIG. 4A, back to the application thread 403. After this, the application thread 403 wakes and resumes its computation. The process described in FIGS. 4A-4B ensures functional correctness relative to the single-threaded execution of the user application. In another embodiment, non-blocking variations of the APIs allow full control over the task synchronization primitives, thus enabling parallelism in the user application. When the user is certain about the data dependency and/or the parallel nature of the function calls, the user can invoke the non-blocking variations of the APIs to improve the performance of the application.
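The blocking synchronization described above, where the application thread sleeps until the worker thread signals completion, can be sketched with a condition variable. The framework is described as using a Pthread barrier; a condition variable is used here as a portable stand-in, and the worker's completion is simulated inline for brevity:

```cpp
#include <condition_variable>
#include <mutex>

// Per-task completion state initialized by the application thread before
// the task is dispatched (the "variables 407" role in FIG. 4A).
struct Completion {
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
};

// Invoked by the worker thread once the resource-specific kernel finishes.
void worker_finish(Completion& c) {
    std::lock_guard<std::mutex> lk(c.m);
    c.done = true;
    c.cv.notify_one();
}

// Blocking API wrapper: enqueue the task, then sleep until signaled.
bool blocking_api_call(Completion& c) {
    // Real flow: the task is pushed to the runtime queue here and a worker
    // thread later calls worker_finish(c); simulated inline in this sketch.
    worker_finish(c);
    std::unique_lock<std::mutex> lk(c.m);
    c.cv.wait(lk, [&c] { return c.done; });  // application thread sleeps
    return c.done;
}
```

A non-blocking variant would return immediately after enqueueing and hand the `Completion` object to the application, which matches the parallelism-enabling APIs mentioned above.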
[0037] Referring now to FIG. 5, the results of experiments using a system in accordance with embodiments of the present disclosure are graphically shown. For these experiments, Xilinx Zynq Ultrascale+ ZCU102 and NVIDIA Jetson AGX Xavier development boards may be used. Three representative real-world applications can be used for evaluations covering the radar processing, communications, and autonomous vehicle domains: Pulse Doppler (PD), WiFi TX (TX), and Lane Detection (LD). Pulse Doppler calculates the velocity of an object by measuring the distance of the object using 256-point FFTs and the frequency shift between transmitted and received signals. WiFi TX generates packets of 64 bits and prepares them for transmission over an arbitrary channel through scrambler, encoder, modulation, and forward error correction processes. WiFi TX relies on a 128-point inverse FFT for each packet transmitted. Lane Detection is a convolution-intensive routine from the autonomous vehicle domain. It has been shown that implementing convolution in the frequency domain rather than the spatial domain through a combination of FFT and pointwise product (ZIP) operations can reduce algorithmic complexity and inference time.
A workload composed of these three applications allows evaluation of various scenarios where a heterogeneous SoC is shared by multiple applications in an interleaved manner. An example scenario could involve Lane Detection running as a continuous process where Pulse Doppler and WiFi TX applications arrive dynamically and are executed periodically. Such scenarios allow studying of the relationship between the degree of SoC heterogeneity, scheduling overhead, and quality of schedules achieved by various heuristics targeting the autonomous vehicle domain. The Lane Detection application stresses the FFT accelerator on the emulated heterogeneous SoC with the number of 1024-point FFTs and IFFTs scaling to 16384 and 8192 instances respectively for a 960x540 image. WiFi TX and Pulse Doppler are lower latency applications with the number of FFTs scaling to 100 and 512 respectively. Driven by these three applications, FFT and ZIP can be used as key functions that are supported with accelerator-based execution. Each application is implemented via the hardware-agnostic API calls for each function. The three applications can be compiled using the CEDR compilation toolchain described above, and binaries to be executed on heterogeneous SoC configurations that are emulated on both ZCU102 and Jetson development boards can be prepared. The ZCU102 is formed of 4 ARM cores running at 1.2 GHz and programmable FPGA fabric where the invoked FFT accelerators run at 300 MHz. The FFT accelerator can be implemented using the Xilinx FFT IP supporting up to 2048-point FFTs. The FFT accelerator uses direct memory access (DMA) to manage data transfers between the ARM cores and the accelerator through AXI4-Stream. The Jetson board is formed of 8 ARM cores running at 2.3 GHz and a Volta GPU running at 1.3 GHz, where FFT and ZIP accelerators are implemented as CUDA kernels. The data transfers between the ARM cores and the accelerators can be handled with standard cudaMemcpy functions using the PCIe interface.
Heterogeneous SoCs can be composed by varying the number and types of processing elements on the ZCU102 platform from the pool of 3 ARM cores along with 8 FFT accelerators. The Jetson platform can then be utilized to demonstrate the portability of the integrated compiler and runtime system, and cross-platform comparisons in terms of factors that contribute to the runtime and scheduling overhead can be conducted. The amount of data processed by an application is considered a frame, measured in Megabits (Mb). Injection rate is defined as the rate at which frame instances are generated per second and is measured in Mbps. In some implementations, 29 injection rates between 10 and 2000 Mbps are used, where each injection rate defines a periodic rate of job arrivals, along with the associated input data, for the given workload. Round Robin (RR), Earliest Finish Time (EFT), Earliest Task First (ETF), and the runtime variant of the Heterogeneous Earliest Finish Time (HEFTRT) scheduling heuristics can be executed along with the CEDR management thread using one of the ARM cores on the target SoC. The metrics of average scheduling overhead per application and average execution time per application can be used for performance evaluation. The scheduling overhead captures the time spent by the runtime in making scheduling decisions. This time is proportional to the number of scheduling rounds made by the runtime as well as the complexity of the scheduling algorithm. The application execution time is the time difference between the beginning and completion of an application's execution, including the overhead of scheduling decisions in between. Lower execution times indicate the scheduler's capability to manage the workload efficiently. To make these two metrics comparable across different runtime configurations, the metrics are normalized with respect to the number of applications, and the average over 25 trials is obtained to reduce the effect of noise.
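The per-application normalization described above can be sketched as follows; this is an illustrative calculation only, with hypothetical trial values, and is not part of the disclosed system:

```python
import statistics

def normalized_metric(trial_totals, num_apps):
    """Normalize a workload-level metric (e.g., total execution time in ms)
    per application, then average over trials to reduce the effect of noise."""
    per_app = [total / num_apps for total in trial_totals]
    return statistics.mean(per_app)

# Hypothetical totals (ms) for 3 trials of a 10-application workload;
# the disclosure averages over 25 trials.
print(normalized_metric([2000.0, 2100.0, 1900.0], 10))  # -> 200.0
```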
For brevity, each metric (e.g., execution time) refers to its corresponding averaged-per-application version (e.g., average execution time per application).
[0039] The runtime overhead of API-based CEDR with respect to the DAG-based CEDR is illustrated in FIG. 5. Runtime overhead is defined as the overall time spent by CEDR to receive, manage, and terminate applications in a given workload. This overhead excludes the overhead of task scheduling. In the experiment illustrated in FIG. 5, five instances of each of the Pulse Doppler and WiFi TX applications are used as a workload, and the runtime overhead across the sweeping range of the injection rate on the ZCU102 platform with 3 ARM CPUs and 1 FFT accelerator is collected. The X-axis shows the injection rates, and the Y-axis shows the runtime overhead. As the injection rate grows, the runtime overhead reduces and then saturates at an injection rate of around 200 Mbps for both API- and DAG-based CEDR.
[0040] As shown in FIG. 5, with the increase in injection rate the applications arrive at the runtime in an increasingly overlapping manner and, in turn, the queue size grows. This gives the runtime the opportunity to manage multiple tasks concurrently rather than serially, which in turn enables the runtime to complete processing the same number of applications in a shorter span of time, thereby reducing the runtime overhead. The saturation of the trend lines indicates that beyond a certain injection rate the runtime becomes oversubscribed, where the applications within the workload are executed by CEDR with maximum concurrency. Throughout the saturated region, the API-based CEDR achieves a 19.52% reduction on average in runtime overhead with respect to the DAG-based CEDR. This reduction can be attributed to the simplification of runtime steps in API-based CEDR compared to DAG-based CEDR. For the DAG-based CEDR, the runtime overhead involves the time required for receiving and parsing application DAG files via IPC to construct the application DAG, parsing the shared object, pushing tasks to the queue, popping completed tasks from the queue, and finally terminating the completed applications.
For the API-based CEDR, two factors contribute to the reduced overhead. First, API-based CEDR does not need to parse DAG files when applications are submitted via IPC. Second, pushing tasks to the queue is eliminated as it is handled by the application thread.
[0041] An experiment to validate the application execution time and scheduling overhead trends in API-based CEDR against DAG-based performance trends has been conducted. For this experiment, parameters such as hardware configuration, workload composition, and scheduling heuristics may be taken from the DAG-based CEDR work. The hardware can be composed of 3 ARM CPUs, 1 FFT accelerator, and 1 MMULT accelerator on the ZCU102 platform. The workload consists of WiFi TX and Pulse Doppler applications with five instances each.
[0042] FIGS. 6A and 6B illustrate the average execution time per application for the DAG- and API-based CEDR executions, respectively, with respect to injection rate, where individual line plots represent execution using different schedulers. Both FIGS. 6A and 6B show similar saturation trends as the injection rate increases, where the system becomes oversubscribed at around 200 Mbps. Furthermore, the mean magnitude of the saturated region of the API-based execution deviates by 32% from that of the DAG-based execution. From the scheduler perspective, the ETF scheduler demonstrates a significantly higher execution time in both plots, while the other schedulers perform similarly to each other in each setup. In the oversubscribed region, the ETF scheduler shows an average execution time of 700 ms in the DAG-based CEDR of FIG. 6A, while 425 ms has been observed with the API-based CEDR of FIG. 6B. The execution time reduction for ETF can be attributed to the smaller queue size, as API-based CEDR only schedules the libCEDR API calls, i.e., the portions of the application that have support for heterogeneous execution. In DAG-based CEDR, the whole application, including non-accelerated regions, is divided into tasks that are scheduled by the CEDR scheduler.
[0043] FIGS. 7A and 7B illustrate the scheduling overhead with respect to injection rate and different schedulers for the DAG- and API-based CEDR executions, respectively. In both plots, except for the ETF scheduler, the scheduling overhead is stable across the injection rates, with very close overhead values for the other schedulers. The ETF scheduler, however, shows a remarkably different trend in the API-based CEDR of FIG. 7B, where the scheduling overhead in the saturated region reduces to around 1.15 ms, from a scale of around 70 ms in DAG-based CEDR. This reduction is due to the smaller number of tasks that need scheduling in the API-based CEDR. This further demonstrates that ETF's execution overhead is more sensitive to the queue size than that of the remaining schedulers.
[0044] Referring back to FIGS. 6A and 6B, while ETF observes a reduction in average execution time with the API-based CEDR of FIG. 6B, the schedulers other than the ETF scheduler observe an increase in execution time from around 200 ms on the DAG-based CEDR of FIG. 6A to around 350 ms on the API-based CEDR of FIG. 6B in the oversubscribed region. This is primarily due to the way the worker and application threads are managed in API-based CEDR compared to DAG-based CEDR. In DAG-based CEDR, the whole application code is executed on the worker threads as DAG task nodes, hence the available CPU cores are shared only among worker threads. In API-based CEDR, however, both application and worker threads are launched on the available CPU resources, where only the worker threads execute the application portions with heterogeneity support. For the experiment presented on the ZCU102 with 3 CPU cores, DAG-based CEDR spawns 4 worker threads while API-based CEDR launches an additional 10 application threads (five instances of each application), leading to increased thread contention on the underlying CPUs. The same experiment may be performed on the Jetson with a configuration of 3 CPU cores and 1 GPU. With the availability of a total of 7 CPU cores, the 4 worker threads (3 CPU and 1 GPU) and 10 application threads have more resources to share between them. This reduces the thread contention compared to the ZCU102. Compared to DAG-based CEDR, which spawns only 4 worker threads to execute the workload while underutilizing the available CPU cores, API-based CEDR better exploits the available resources through concurrent execution of worker and application threads.
[0045] Referring now to FIGS. 7A and 7B, ETF is the most sensitive to heterogeneity, with the highest scheduling overhead. The benefit of the reduced queue size due to API-based execution results in a reduction in scheduling time that is larger in magnitude than the increase in execution time due to thread contention.
[0046] Experimental evaluations can demonstrate the versatility of the CEDR framework by introducing Lane Detection as a new application to the workload, increasing the number of FFT accelerators on the ZCU102 to 8, and performing execution time performance analysis with respect to changes in injection rate using the same workload on a Jetson platform. Lane Detection has a large number of FFT instances that can stress both the runtime system and the schedulers as the queue size is expected to grow substantially. The autonomous vehicle workload includes a single instance of Lane Detection as a long latency job while lower latency WiFi TX and Pulse Doppler applications arrive dynamically.
[0047] FIG. 8 is a flowchart representation of a method for deploying and executing a software application on multiple types of processing elements in accordance with one or more embodiments of the present technology. The method 800 includes, at operation 810, receiving a set of instructions representing the software application. The set of instructions comprises one or more invocations of a set of Application Programming Interface (API) functions (e.g., FFT, GEMM, CONV2D kernels). In some embodiments, the instructions can include information, such as directives (also referred to as macros), to indicate the types of processing elements that are available to execute the software application (e.g., CPU core, ARM core, FFT accelerators, etc.). This information, once compiled, can help the runtime determine the appropriate processing element(s) to be used for executing the software application. The method 800 includes, at operation 820, generating a binary object by compiling the set of instructions. The binary object is configured to link to a set of program modules that corresponds to the one or more target types of processing units. Each of the set of program modules comprises a hardware-specific implementation of the set of API functions corresponding to a target type of processing units.
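The idea of a hardware-agnostic API resolved against hardware-specific program modules can be sketched as follows. This is a hypothetical model only: the registry, processing-element names, and dispatch policy are invented for illustration and are not part of the disclosed toolchain.

```python
# Hypothetical registry: one hardware-agnostic API name ("fft") mapped to
# per-processing-element implementations (stand-ins for linked modules).
IMPLEMENTATIONS = {
    "fft": {
        "cpu": lambda data: ("cpu_fft", data),
        "fft_accel": lambda data: ("accel_fft", data),
    },
}

def dispatch(api_name, data, available_pes):
    """Resolve an API call against the first available processing element
    type that has an implementation for it."""
    table = IMPLEMENTATIONS[api_name]
    for pe in available_pes:
        if pe in table:
            return table[pe](data)
    raise RuntimeError(f"no implementation of {api_name} for {available_pes}")

# The same application call works whether or not an accelerator is present.
print(dispatch("fft", [1, 2, 3], ["fft_accel", "cpu"]))  # uses the accelerator
print(dispatch("fft", [1, 2, 3], ["cpu"]))               # falls back to CPU
```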
[0048] In some embodiments, the multiple types of processing elements comprise a Central Processing Unit (CPU). In some embodiments, the multiple types of processing elements comprise a domain-specific accelerator.
[0049] In some embodiments, the method includes generating a second binary object comprising only a CPU-based implementation of the set of API functions. In some embodiments, the method includes determining, by a runtime module included in the binary object, a mapping between each of the set of API functions and one or more computing resources of the one or more target types of hardware systems. In some embodiments, the runtime selects the proper computing resources based on the current state of the computing resources of the target system, without any directives or indications from the software application, in addition to the mapping between the API function(s) and the computing resources. In some embodiments, the method includes enqueueing, by the runtime module upon an invocation of an API function by the software application, a task corresponding to the invocation of the API function into a task queue; and scheduling tasks in the task queue based on the mapping.
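The enqueue-then-schedule flow described above can be modeled with a toy runtime; the class name, mapping format, and first-candidate policy below are illustrative assumptions, not the disclosed implementation:

```python
from collections import deque

class RuntimeSketch:
    """Toy model: API invocations become queued tasks; a scheduler
    later drains the queue using an API-to-resource mapping."""

    def __init__(self, mapping):
        self.mapping = mapping   # api name -> ordered list of candidate resources
        self.queue = deque()

    def invoke(self, api_name, payload):
        # Called on the application thread when an API function is invoked.
        self.queue.append((api_name, payload))

    def schedule(self):
        # Called by the runtime; assigns each queued task per the mapping.
        assignments = []
        while self.queue:
            api_name, payload = self.queue.popleft()
            assignments.append((api_name, self.mapping[api_name][0], payload))
        return assignments

rt = RuntimeSketch({"fft": ["fft_accel", "cpu"]})
rt.invoke("fft", "frame0")
print(rt.schedule())  # [('fft', 'fft_accel', 'frame0')]
```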
[0050] In some embodiments, the set of API functions comprises a blocking API function that is configured to block a process or a thread until a completion of the API function. In some embodiments, the set of API functions comprises a non-blocking API function that allows a concurrent execution of another API function.
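The blocking/non-blocking distinction can be illustrated with Python futures; `fake_fft` is a hypothetical stand-in for a kernel, not an API from the disclosure:

```python
from concurrent.futures import ThreadPoolExecutor

def fake_fft(n):
    # Stand-in for work that would run on an accelerator or worker thread.
    return n * 2

pool = ThreadPoolExecutor(max_workers=2)

# Non-blocking style: launch the call, overlap other work, synchronize later.
handle = pool.submit(fake_fft, 21)   # returns immediately with a handle
other = fake_fft(1)                  # overlapped work on the caller's thread

# Blocking style: result() blocks the calling thread until completion.
print(handle.result())               # -> 42
pool.shutdown()
```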
[0051] FIG. 9 is a flowchart representation of a method for enabling execution of multiple software applications on an embedded system in accordance with one or more embodiments of the present technology. The method 900 includes, at operation 910, receiving, by a runtime module deployed on the embedded system, a first invocation of an Application Programming Interface (API) function by a first software application and a second invocation of the API function by a second software application. The method 900 includes, at operation 920, determining, by the runtime module, a mapping between the API function and available computing resources on the embedded system. The method 900 includes, at operation 930, scheduling the first invocation of the API function by the first software application and the second invocation of the API function by the second software application according to the mapping. For example, both invocations can be related to the same FFT function. Only one dedicated FFT accelerator is available, but the CPU core is largely idle and is available to provide parallel computing power. The two invocations of the same FFT function can be scheduled on the FFT accelerator and the CPU core, respectively, to achieve optimal parallelism.
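A greedy earliest-finish-time placement, of the kind heuristics such as EFT use, would produce exactly this accelerator-plus-CPU split. The sketch below is a hypothetical simplification with invented resource names and costs, not the scheduler disclosed herein:

```python
def schedule_invocations(invocations, resources, busy_until, cost):
    """Greedy earliest-finish-time placement: assign each invocation to
    the resource that would finish it soonest, then advance that
    resource's availability."""
    placement = {}
    for inv in invocations:
        best = min(resources, key=lambda r: busy_until[r] + cost[r])
        placement[inv] = best
        busy_until[best] += cost[best]
    return placement

# Two invocations of the same FFT API arriving from two applications:
# one dedicated accelerator (fast) plus an idle CPU core (slower).
# The second call spills onto the CPU to run in parallel.
placement = schedule_invocations(
    ["fft_app1", "fft_app2"],
    ["fft_accel", "cpu0"],
    {"fft_accel": 0.0, "cpu0": 0.0},   # both resources start idle
    {"fft_accel": 1.5, "cpu0": 2.0},   # per-invocation execution cost
)
print(placement)  # {'fft_app1': 'fft_accel', 'fft_app2': 'cpu0'}
```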
[0052] In some embodiments, the available computing resources comprise at least one domain-specific processing unit. In some embodiments, the method includes linking to a program module that comprises hardware-specific implementation of the API function corresponding to the at least one domain-specific processing unit. In some embodiments, the method includes invoking a hardware-specific implementation of the API function on the at least one domain-specific processing unit.
[0053] Example solutions related to the disclosed techniques are described below.
[0054] 1. A method for deploying and executing a software application on multiple types of processing elements, comprising receiving a set of instructions representing the software application, wherein the set of instructions comprises one or more invocations of a set of Application Programming Interface (API) functions, and generating a binary object by compiling the set of instructions, wherein the binary object is configured to link to a set of program modules that corresponds to one or more target types of processing units, and wherein each of the set of program modules comprises hardware-specific implementation of the set of API functions corresponding to a target type of processing units.
[0055] 2. The method of solution 1, wherein the multiple types of processing elements comprise a Central Processing Unit (CPU).
[0056] 3. The method of solution 1 or 2, wherein the multiple types of processing elements comprise a domain-specific accelerator.
[0057] 4. The method of any of solutions 1 to 3, comprising generating a second binary object comprising only a CPU-based implementation of the set of API functions.
[0058] 5. The method of any of solutions 1 to 4, further comprising determining, by a runtime module included in the binary object, a mapping between each of the set of API functions and one or more computing resources of the one or more target types of hardware systems.
[0059] 6. The method of solution 5, further comprising enqueueing, by the runtime module upon an invocation of an API function by the software application, a task corresponding to the invocation of the API function into a task queue and scheduling tasks in the task queue based on the mapping.
[0060] 7. The method of any of solutions 1 to 6, wherein the set of API functions comprises a blocking API function that is configured to block a process or a thread until a completion of the API function.
[0061] 8. The method of any of solutions 1 to 7, wherein the set of API functions comprises a non-blocking API function that allows a concurrent execution of another API function.
[0062] 9. A method for enabling execution of multiple software applications on an embedded system, comprising receiving, by a runtime module deployed on the embedded system, a first invocation of an Application Programming Interface (API) function by a first software application and a second invocation of the API function by a second software application, determining, by the runtime module, a mapping between the API function and available computing resources on the embedded system, and scheduling the first invocation of the API function by the first software application and the second invocation of the API function by the second software application according to the mapping.
[0063] 10. The method of solution 9, wherein the available computing resources comprise at least one domain-specific processing unit.
[0064] 11. The method of solution 10, comprising linking to a program module that comprises hardware-specific implementation of the API function corresponding to the at least one domain-specific processing unit.
[0065] 12. The method of solution 10 or 11, further comprising invoking a hardware-specific implementation of the API function on the at least one domain-specific processing unit.
[0066] 13. A platform for deploying a software application on an embedded system that comprises multiple types of processing elements, comprising a set of Application Programming Interface (API) functions, a set of program modules each comprising hardware-specific implementation of the set of API functions corresponding to one of the multiple types of processing elements, and a compiler configured to receive a set of instructions representing the software application and generate a binary object by compiling the set of instructions, wherein the binary object is configured to link to at least part of the set of program modules that corresponds to one or more target types of processing elements, and wherein the compiler is configured to implement the method of any of solutions 1 to 12.
[0067] 14. The platform of solution 13, wherein the multiple types of processing elements comprise a Central Processing Unit (CPU).
[0068] 15. The platform of solution 13 or 14, wherein the multiple types of processing elements comprise a domain-specific accelerator.
[0069] 16. The platform of any of solutions 13 to 15, wherein the compiler is further configured to generate a second binary object comprising only a CPU-based implementation of the set of API functions.
[0070] 17. The platform of any of solutions 13 to 16, wherein the binary object comprises a runtime module that is configured to determine a mapping between each of the set of API functions and one or more resources of the one or more target types of processing elements.
[0071] 18. The platform of solution 17, wherein the runtime module is configured to, upon an invocation of an API function by the software application, enqueue a task corresponding to the invocation of the API function into a task queue and schedule tasks in the task queue based on the mapping.
[0072] 19. The platform of any of solutions 13 to 18, wherein the set of API functions comprises a blocking API function that is configured to block a process or a thread until a completion of the API function.
[0073] 20. The platform of any of solutions 13 to 19, wherein the set of API functions comprises a non-blocking API function that allows a concurrent execution of another API function.
[0074] 21. A non-transitory, computer-readable storage medium comprising instructions recorded thereon, wherein the instructions when executed by at least one data processor of an embedded system, cause the embedded system to implement the method of any of solutions 1 to 12.
[0075] Implementations of the subject matter and the functional operations described in this patent document can be implemented in various systems, digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Implementations of the subject matter described in this specification can be implemented as one or more computer program products, e.g., one or more modules of computer program instructions encoded on a tangible and non-transitory computer readable medium for execution by, or to control the operation of, data processing apparatus. The computer readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or a combination of one or more of them. The term “data processing unit” or “data processing apparatus” encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
[0076] A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
[0077] The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
[0078] Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Computer readable media suitable for storing computer program instructions and data include all forms of nonvolatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
[0079] While this patent document contains many specifics, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this patent document in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
[0080] Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Moreover, the separation of various system components in the embodiments described in this patent document should not be understood as requiring such separation in all embodiments.
[0081] Only a few implementations and examples are described and other implementations, enhancements and variations can be made based on what is described and illustrated in this patent document.

Claims

1. A platform for deploying a software application on an embedded system that comprises multiple types of processing elements, comprising: a set of Application Programming Interface (API) functions; a set of program modules each comprising hardware-specific implementation of the set of API functions corresponding to one of the multiple types of processing elements; and a compiler configured to: receive a set of instructions representing the software application; and generate a binary object by compiling the set of instructions, wherein the binary object is configured to link to at least part of the set of program modules that corresponds to one or more target types of processing elements.
2. The platform of claim 1, wherein the multiple types of processing elements comprise a Central Processing Unit (CPU).
3. The platform of claim 1, wherein the multiple types of processing elements comprise a domain-specific accelerator.
4. The platform of claim 1, wherein the compiler is further configured to: generate a second binary object comprising only a CPU-based implementation of the set of API functions.
5. The platform of claim 1, wherein the binary object comprises a runtime module that is configured to determine a mapping between each of the set of API functions and one or more resources of the one or more target types of processing elements.
6. The platform of claim 5, wherein the runtime module is configured to, upon an invocation of an API function by the software application: enqueue a task corresponding to the invocation of the API function into a task queue; and schedule tasks in the task queue based on the mapping.
7. The platform of claim 1, wherein the set of API functions comprises a blocking API function that is configured to block a process or a thread until a completion of the API function.
8. The platform of claim 1, wherein the set of API functions comprises a non-blocking API function that allows a concurrent execution of another API function.
9. A method for deploying and executing a software application on multiple types of processing elements, comprising: receiving a set of instructions representing the software application, wherein the set of instructions comprises one or more invocations of a set of Application Programming Interface (API) functions; and generating a binary object by compiling the set of instructions, wherein the binary object is configured to link to a set of program modules that corresponds to one or more target types of processing units, wherein each of the set of program modules comprises hardware-specific implementation of the set of API functions corresponding to a target type of processing units.
10. The method of claim 9, wherein the multiple types of processing elements comprise a Central Processing Unit (CPU).
11. The method of claim 9, wherein the multiple types of processing elements comprise a domain-specific accelerator.
12. The method of claim 9, further comprising: generating a second binary object comprising only a CPU-based implementation of the set of API functions.
13. The method of claim 9, further comprising: determining, by a runtime module included in the binary object, a mapping between each of the set of API functions and one or more computing resources of the one or more target types of hardware systems.
14. The method of claim 13, further comprising: enqueueing, by the runtime module upon an invocation of an API function by the software application, a task corresponding to the invocation of the API function into a task queue; and scheduling tasks in the task queue based on the mapping.
15. The method of claim 9, wherein the set of API functions comprises a blocking API function that is configured to block a process or a thread until a completion of the API function.
16. The method of claim 9, wherein the set of API functions comprises a non-blocking API function that allows a concurrent execution of another API function.
17. A method for enabling execution of multiple software applications on an embedded system, comprising: receiving, by a runtime module deployed on the embedded system, a first invocation of an Application Programming Interface (API) function by a first software application and a second invocation of the API function by a second software application; determining, by the runtime module, a mapping between the API function and available computing resources on the embedded system; and scheduling the first invocation of the API function by the first software application and the second invocation of the API function by the second software application according to the mapping.
18. The method of claim 17, wherein the available computing resources comprise at least one domain-specific processing unit.
19. The method of claim 18, comprising: linking to a program module that comprises hardware-specific implementation of the API function corresponding to the at least one domain-specific processing unit.
20. The method of claim 18 or 19, comprising: invoking a hardware-specific implementation of the API function on the at least one domain-specific processing unit.
PCT/US2024/051862 2023-10-18 2024-10-17 Framework for domain-specific embedded systems Pending WO2025085693A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202363591327P 2023-10-18 2023-10-18
US63/591,327 2023-10-18

Publications (1)

Publication Number Publication Date
WO2025085693A1 true WO2025085693A1 (en) 2025-04-24

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120320071A1 (en) * 2008-06-06 2012-12-20 Munshi Aaftab A Multi-dimensional thread grouping for multiple processors
US20190205737A1 (en) * 2017-12-30 2019-07-04 Intel Corporation Machine learning accelerator mechanism
US20210390004A1 (en) * 2020-06-16 2021-12-16 Nvidia Corporation Accelerated fifth generation (5g) new radio operations
WO2022133718A1 (en) * 2020-12-22 2022-06-30 Alibaba Group Holding Limited Processing system with integrated domain specific accelerators


Similar Documents

Publication Publication Date Title
US11449364B2 (en) Processing in a multicore processor with different cores having different architectures
Augonnet et al. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures
Grandl et al. GRAPHENE: Packing and dependency-aware scheduling for data-parallel clusters
US7886283B2 (en) Phantom serializing compiler and method of operation of same
Chen et al. A unifying response time analysis framework for dynamic self-suspending tasks
Mack et al. CEDR: A compiler-integrated, extensible DSSoC runtime
Mack et al. User-space emulation framework for domain-specific SoC design
TW202236089A User-space emulation framework for heterogeneous SoC design
US12081636B2 (en) Distribution of machine learning workflows on webscale infrastructures
Cavicchioli et al. Novel methodologies for predictable CPU-to-GPU command offloading
Zahaf et al. A C-DAG task model for scheduling complex real-time tasks on heterogeneous platforms: preemption matters
Mack et al. GNU Radio and CEDR: Runtime scheduling to heterogeneous accelerators
Mack et al. CEDR-API: Productive, performant programming of domain-specific embedded systems
Gener et al. A unified portable and programmable framework for task-based execution and dynamic resource management on heterogeneous systems
Mandava et al. Nimblock: Scheduling for fine-grained FPGA sharing through virtualization
Protze et al. MPI detach-asynchronous local completion
Schuchart et al. Global task data-dependencies in pgas applications
Wu et al. Cgmbe: a model-based tool for the design and implementation of real-time image processing applications on cpu–gpu platforms
Tran et al. Efficient contention-aware scheduling of SDF graphs on shared multi-bank memory
WO2025085693A1 (en) Framework for domain-specific embedded systems
WO2025085692A1 (en) Framework for domain-specific embedded systems
Barthou et al. SPAGHETtI: Scheduling/placement approach for task-graphs on HETerogeneous architecture
Zahaf Energy efficient scheduling of parallel real-time tasks on heterogeneous multicore systems
Medvedev IMB-ASYNC: a revised method and benchmark to estimate MPI-3 asynchronous progress efficiency
John The elastic phase oriented programming model for elastic hpc applications

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 24880613

Country of ref document: EP

Kind code of ref document: A1