
WO2013178864A1 - A method and apparatus for deferring processor selection - Google Patents


Info

Publication number
WO2013178864A1
Authority
WO
WIPO (PCT)
Prior art keywords
processor
processors
kernel
data
execution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/FI2012/050519
Other languages
French (fr)
Inventor
Eero Aho
Tomi Aarnio
Kimmo Kuusilinna
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Inc
Original Assignee
Nokia Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Inc filed Critical Nokia Inc
Priority to PCT/FI2012/050519 priority Critical patent/WO2013178864A1/en
Publication of WO2013178864A1 publication Critical patent/WO2013178864A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5055Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering software capabilities, i.e. software resources associated or available to the machine
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5044Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering hardware capabilities

Definitions

  • the present application relates to applications which can execute across heterogeneous computing platforms.
  • OpenCL Open Computing Language
  • open computing language frameworks allow program code to be developed which can be distributed across, and executed on, a diverse selection of computer processing cores such as a single-core central processing unit (CPU), multi-core processing units, graphics processing units (GPU), digital signal processors (DSP), image signal processors (ISP), and image video engines (IVE).
  • CPU central processing unit
  • GPU graphics processing units
  • DSP digital signal processors
  • ISP image signal processors
  • IVE image video engines
  • an open computing language framework may provide provision in the language syntax which enables a particular type of processor to be explicitly defined for executing a particular kernel.
  • an open computing language framework application may be developed such that the framework determines which type of processor is used to execute a kernel.
  • the processing capacity of the various processors on the device may vary considerably as the device is used. For instance, the processing capacity of a particular processor may be dependent on other functionalities which may also require access to the various processors on the device.
  • the availability of processing capacity can also be affected by other factors such as available battery power.
  • a method comprising: collating processor static data for at least one processor of a plurality of processors; collating processor dynamic data for the at least one processor of the plurality of processors; and selecting at least one processor from the plurality of processors for the execution of a kernel based at least in part on the collated processor static data for the at least one processor and the collated processor dynamic data for the at least one processor.
  • the processor dynamic data may comprise information relating to the dynamic operating conditions of the at least one processor.
  • the dynamic operating conditions of the at least one processor may comprise at least one of: an indication of the computing load of the at least one processor for a period of time whilst the at least one processor is in an active state; an indication of the power consumption of the at least one processor for the period of time whilst the at least one processor is in an active state; an indication of the operating mode of the at least one processor for the period of time, wherein the operating mode may comprise at least one of: an active state; a sleeping state; an idle state; and a powered down state; and an indication of current battery power level.
  • the processor static data may comprise information relating to the hardware configuration of the at least one processor.
  • the information relating to the hardware configuration of the at least one processor may comprise at least one of: memory capacity of the at least one processor; operating clock frequency of the at least one processor; pipeline depth count of the at least one processor; configuration of computational registers of the at least one processor; performance of the at least one processor; level of parallelism of the at least one processor; expected power consumption of the at least one processor; memory organization of the at least one processor; latencies in the computing system of the at least one processor; and direct memory access capability of the at least one processor.
  • the information relating to the hardware configuration of the at least one processor may further comprise information relating to the type of the at least one processor.
  • the type of the at least one processor may comprise at least one of: central processing unit; digital signal processor; graphics processing unit; image signal processors; and image video engines.
  • the processor static data may further comprise application related data, wherein application related data may comprise at least one of: maximum thread count; vector width; and register consumption for a thread.
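Taken together, the static and dynamic data items enumerated above amount to two simple records per processor: one fixed by the hardware, one sampled at selection time. A minimal sketch in Python follows; the field names and types are illustrative assumptions only, since the claims do not prescribe any particular data layout.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProcessorStaticData:
    """Hardware-configuration facts that do not change at run time."""
    processor_type: str          # e.g. "CPU", "GPU", "DSP", "ISP", "IVE"
    memory_capacity_mb: int
    clock_frequency_mhz: int
    pipeline_depth: int
    parallelism: int             # level of parallelism of the processor
    expected_power_mw: int
    max_thread_count: int        # application-related static data
    vector_width: int            # application-related static data

@dataclass
class ProcessorDynamicData:
    """Operating conditions sampled over a recent time window."""
    load: float                  # computing load in [0.0, 1.0] while active
    power_consumption_mw: float  # measured power draw over the window
    operating_mode: str          # "active", "sleeping", "idle", "powered_down"
    battery_level: float         # current battery power level in [0.0, 1.0]
```

The static record is frozen because it describes fixed hardware capabilities, whereas the dynamic record is mutable so it can be refreshed each time a selection is made.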
  • the method may further comprise: collating application developer specific data for the selection of the at least one processor from the plurality of processors, and wherein selecting the at least one processor from the plurality of processors for the execution of a kernel may be further based at least in part on the collated application developer specific data.
  • Selecting the at least one processor from the plurality of processors for the execution of a kernel can be performed by an algorithm which may combine at least two of: the collated processor static data for the at least one processor; the collated processor dynamic data for the at least one processor; and the collated application developer specific data.
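An algorithm of the kind just described, combining at least two of the collated data sources into a single ranking, might be sketched as follows. The scoring formula, weights, and dictionary keys are assumptions for illustration only; the claims leave the combination function open.

```python
def select_processor(candidates, developer_preference=None):
    """Rank processors and return the name of the best candidate.

    `candidates` maps a processor name to a (static, dynamic) pair of dicts
    of collated data.  `developer_preference`, if given, names a processor
    type favoured by the application developer (developer specific data).
    """
    best_name, best_score = None, float("-inf")
    for name, (static, dynamic) in candidates.items():
        if dynamic["operating_mode"] == "powered_down":
            continue  # not currently available for kernel execution
        # Static contribution: raw capability of the hardware.
        score = static["clock_frequency_mhz"] * static["parallelism"]
        # Dynamic contribution: discount busy devices.
        score *= (1.0 - dynamic["load"])
        # Dynamic contribution: penalise power-hungry devices on low battery.
        if dynamic["battery_level"] < 0.2:
            score /= 1.0 + static["expected_power_mw"] / 1000.0
        # Developer-specific contribution: bias toward the requested type.
        if developer_preference and static["type"] == developer_preference:
            score *= 2.0
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

Under this sketch an idle GPU with high parallelism would typically outrank a lightly loaded CPU, while a fully loaded or powered-down GPU would not be chosen.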
  • the context is preferably part of an open computing language framework host application.
  • the open computing language framework host application is preferably an OpenCL host application. Or the open computing language framework host application can be a WebCL host application.
  • the at least one processor can be a compute device, and the plurality of processors can be a plurality of compute devices.
  • the selection of the at least one processor may be deferred until the kernel is instantiated for execution on the at least one processor.
  • the selection of the at least one processor can be further deferred until there is a transfer of data to the selected at least one processor.
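The deferral described in the two points above, postponing the device decision until the kernel is instantiated for execution, or further still until data is first transferred to it, can be pictured as a lazily resolved handle. The class and method names below are hypothetical; `choose_device` stands in for whatever selection algorithm is in use.

```python
class DeferredKernel:
    """Kernel handle whose target processor is resolved lazily."""

    def __init__(self, source, choose_device):
        self.source = source
        self._choose_device = choose_device  # invoked only when needed
        self.device = None                   # no device chosen yet

    def _resolve(self):
        # Selection is deferred to this point, so the device is chosen
        # using up-to-date dynamic data rather than at compile time.
        if self.device is None:
            self.device = self._choose_device()
        return self.device

    def transfer_data(self, buffer):
        # Further deferral: the first data transfer forces the decision.
        return f"copied {len(buffer)} bytes to {self._resolve()}"

    def execute(self):
        # Instantiating the kernel for execution also forces the decision.
        return f"ran kernel on {self._resolve()}"
```

The selection function is called at most once per kernel handle, so later data transfers and executions reuse the already chosen device.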
  • an apparatus comprising at least one processor of a plurality of processors and at least one memory including computer code for one or more programs, the at least one memory and the computer code configured with the at least one processor of the plurality of processors to cause the apparatus at least to: collate processor static data for the at least one processor of the plurality of processors; collate processor dynamic data for the at least one processor of the plurality of processors; and select at least one processor from the plurality of processors for the execution of a kernel based at least in part on the collated processor static data for the at least one processor and the collated processor dynamic data for the at least one processor.
  • the processor dynamic data may comprise information relating to the dynamic operating conditions of the at least one processor.
  • the dynamic operating conditions of the at least one processor may comprise at least one of: an indication of the computing load of the at least one processor for a period of time whilst the at least one processor is in an active state; an indication of the power consumption of the at least one processor for the period of time whilst the at least one processor is in an active state; an indication of the operating mode of the at least one processor for the period of time, wherein the operating mode may comprise at least one of: an active state; a sleeping state; an idle state; and a powered down state; and an indication of current battery power level.
  • the processor static data may comprise information relating to the hardware configuration of the at least one processor.
  • the information relating to the hardware configuration of the at least one processor may comprise at least one of: memory capacity of the at least one processor; operating clock frequency of the at least one processor; pipeline depth count of the at least one processor; configuration of computational registers of the at least one processor; performance of the at least one processor; level of parallelism of the at least one processor; expected power consumption of the at least one processor; memory organization of the at least one processor; latencies in the computing system of the at least one processor; and direct memory access capability of the at least one processor.
  • the information relating to the hardware configuration of the at least one processor may further comprise information relating to the type of the at least one processor.
  • the type of the at least one processor may preferably comprise at least one of: central processing unit; digital signal processor; graphics processing unit; image signal processors; and image video engines.
  • the processor static data may further comprise application related data, wherein application related data may preferably comprise at least one of: maximum thread count; vector width; and register consumption for a thread.
  • the apparatus may be further caused to collate application developer specific data for the selection of the at least one processor from the plurality of processors, and wherein the apparatus caused to select the at least one processor from the plurality of processors for the execution of a kernel may be further caused to select the at least one processor from the plurality of processors further based at least in part on the collated application developer specific data.
  • Selecting the at least one processor from the plurality of processors for the execution of a kernel may preferably be performed by an algorithm which combines at least two of: the collated processor static data for the at least one processor; the collated processor dynamic data for the at least one processor; and the collated application developer specific data.
  • the apparatus may be further caused to determine a context for the execution of the kernel on the selected at least one processor.
  • the apparatus caused to determine the context for the execution of the kernel on the selected at least one processor may be further caused to check a cache memory in order to determine if a previous instance of the kernel has been executed on the selected at least one processor.
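The cache check described above, determining whether a previous instance of the kernel has already been executed on the selected processor so that an existing context can be reused rather than rebuilt, might look like the minimal sketch below. The cache key and the `build_context` callable are assumptions for illustration.

```python
# Cache of execution contexts, keyed by (kernel, device) pair.
_context_cache = {}

def get_context(kernel_id, device, build_context):
    """Return a cached execution context if this kernel has previously run
    on this device; otherwise build a fresh one and remember it."""
    key = (kernel_id, device)
    if key in _context_cache:
        return _context_cache[key]           # previous instance found: reuse
    ctx = build_context(kernel_id, device)   # first run on this device
    _context_cache[key] = ctx
    return ctx
```

Reusing a cached context avoids repeating expensive setup work, such as recompiling the kernel for the selected device.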
  • the context is preferably part of an open computing language framework host application.
  • the open computing language framework host application may be an OpenCL host application.
  • the open computing language framework host application may also be a WebCL host application.
  • the at least one processor is preferably a compute device, and the plurality of processors are preferably a plurality of compute devices.
  • the selection of the at least one processor may be deferred until the kernel is instantiated for execution on the at least one processor.
  • the selection of the at least one processor can be further deferred until there is a transfer of data to the selected at least one processor.
  • an apparatus configured to collate processor static data for at least one processor of the plurality of processors; collate processor dynamic data for the at least one processor of the plurality of processors; and select at least one processor from the plurality of processors for the execution of a kernel based at least in part on the collated processor static data for the at least one processor and the collated processor dynamic data for the at least one processor.
  • the processor dynamic data may comprise information relating to the dynamic operating conditions of the at least one processor.
  • the dynamic operating conditions of the at least one processor may comprise at least one of: an indication of the computing load of the at least one processor for a period of time whilst the at least one processor is in an active state; an indication of the power consumption of the at least one processor for the period of time whilst the at least one processor is in an active state; an indication of the operating mode of the at least one processor for the period of time, wherein the operating mode may comprise at least one of: an active state; a sleeping state; an idle state; and a powered down state; and an indication of current battery power level.
  • the processor static data may comprise information relating to the hardware configuration of the at least one processor.
  • the information relating to the hardware configuration of the at least one processor may comprise at least one of: memory capacity of the at least one processor; operating clock frequency of the at least one processor; pipeline depth count of the at least one processor; configuration of computational registers of the at least one processor; performance of the at least one processor; level of parallelism of the at least one processor; expected power consumption of the at least one processor; memory organization of the at least one processor; latencies in the computing system of the at least one processor; and direct memory access capability of the at least one processor.
  • the information relating to the hardware configuration of the at least one processor may further comprise information relating to the type of the at least one processor.
  • the type of the at least one processor may preferably comprise at least one of: central processing unit; digital signal processor; graphics processing unit; image signal processors; and image video engines.
  • the processor static data may further comprise application related data, wherein application related data comprises at least one of: maximum thread count; vector width; and register consumption for a thread.
  • the apparatus may be further configured to: collate application developer specific data for the selection of the at least one processor from the plurality of processors, and wherein the apparatus configured to select the at least one processor from the plurality of processors for the execution of a kernel may be further configured to select the at least one processor from the plurality of processors further based at least in part on the collated application developer specific data.
  • selecting the at least one processor from the plurality of processors for the execution of a kernel may be performed by an algorithm which combines at least two of: the collated processor static data for the at least one processor; the collated processor dynamic data for the at least one processor; and the collated application developer specific data.
  • the apparatus may be further configured to determine a context for the execution of the kernel on the selected at least one processor.
  • the apparatus configured to determine the context for the execution of the kernel on the selected at least one processor may be further configured to check a cache memory in order to determine if a previous instance of the kernel has been executed on the selected at least one processor.
  • the context is preferably part of an open computing language framework host application.
  • the open computing language framework host application is preferably an OpenCL host application.
  • the open computing language framework host application is preferably a WebCL host application.
  • the at least one processor is preferably a compute device, and the plurality of processors are preferably a plurality of compute devices.
  • the selection of the at least one processor may be deferred until the kernel is instantiated for execution on the at least one processor.
  • the selection of the at least one processor may be further deferred until there is a transfer of data to the selected at least one processor.
  • an apparatus comprising means for collating processor static data for the at least one processor of the plurality of processors; means for collating processor dynamic data for the at least one processor of the plurality of processors; and means for selecting at least one processor from the plurality of processors for the execution of a kernel based at least in part on the collated processor static data for the at least one processor and the collated processor dynamic data for the at least one processor.
  • a computer readable medium comprising computer program code thereon, the computer program code configured to realize the actions of the method as discussed herein.
  • Figure 1 shows schematically an electronic device employing embodiments of the invention
  • Figure 2 shows schematically a block diagram of a system for the provision of deferring processor allocation for the execution of a kernel in an open computing language framework environment
  • Figure 3 shows a block diagram of a flowchart of an initial phase for the deferred selection of a compute device in an open computing language framework
  • Figure 4 shows a block diagram of a flowchart for the collection of information relating to the deferred selection of a compute device
  • Figure 5 shows a block diagram of a flowchart for the execution phase for a deferred selection of a compute device
  • Figure 6 shows a block diagram of a flowchart for the deferred selection of a compute device for an embodiment
  • Figure 7 shows schematically a further electronic device employing embodiments of the invention.
  • Figure 1 shows a schematic block diagram of an exemplary electronic device 10 or apparatus, according to some embodiments of the application.
  • the electronic device 10 is in some embodiments a mobile terminal, mobile phone or user equipment for operation in a wireless communication system.
  • the electronic device 10 may be a multimedia device comprising a digital video camera.
  • the electronic device 10 comprises a microphone 11, which is linked via an analogue-to-digital converter 14 to a processor block 21.
  • the processor block 21 is further linked via a digital-to-analogue converter 32 to loudspeakers 33.
  • the processor 21 is further linked to a transceiver (TX/RX) 13, to a user interface (UI) 15 and to a memory 22.
  • the processor 21 may comprise a number of individual processing units or processing cores.
  • the processor 21 may comprise a combination of a central processing unit (CPU) together with other application specific processing cores such as a graphics processing unit (GPU), a digital signal processor (DSP) or a multimedia accelerator.
  • the processor 21 may comprise a multi-core CPU.
  • the processor 21 in some embodiments may, for example, be embodied as various means including one or more microprocessors with accompanying digital signal processor(s), one or more processor(s) without an accompanying digital signal processor, one or more coprocessors, one or more multi-core processors, one or more controllers, processing circuitry, one or more computers, various other processing elements including integrated circuits such as, for example, an ASIC (application specific integrated circuit) or FPGA (field programmable gate array), one or more other hardware processors, or some combination thereof.
  • ASIC application specific integrated circuit
  • FPGA field programmable gate array
  • the processor 21 may comprise a plurality of processors.
  • the plurality of processors may be in operative communication with each other and may be collectively configured to perform one or more functionalities of the electronic device 10 as described herein.
  • the plurality of processors may be embodied on a single computing device or distributed across a plurality of computing devices collectively configured to function as the electronic device 10.
  • the processor 21 is configured to execute instructions stored in the memory 22 or otherwise accessible to the processor 21. These instructions, when executed by the processor 21, may cause the electronic device 10 to perform one or more of the functionalities of the electronic device 10 as described herein. As such, whether configured by hardware or software methods, or by a combination thereof, the processor 21 may comprise an entity capable of performing operations according to embodiments of the present invention while configured accordingly. Thus, for example, when the processor 21 is embodied as an ASIC, FPGA or the like, the processor 21 may comprise specifically configured hardware for conducting one or more operations described herein. Alternatively, as another example, when the processor 21 is embodied as an executor of instructions, such as may be stored in the memory 22, the instructions may specifically configure the processor 21 to perform one or more algorithms and operations described herein.
  • the processor 21 (or one of the processing units of the processor 21) may be configured to execute various program codes 23.
  • the implemented program codes 23, in some embodiments, can comprise code according to an open computing language framework.
  • the implemented program codes 23 may comprise WebCL or OpenCL code which may, in the case of WebCL code, enable web applications to harness at least one processing core of the processor 21 .
  • the WebCL code residing in the program codes 23 may be configured to enable web applications to access a combination of a multi-core CPU or a single-core CPU and a graphics processing unit of the processor 21.
  • the implemented program codes 23 may in some embodiments be stored for example in the memory 22 for retrieval by the processor 21 whenever needed.
  • the memory 22 in some embodiments may further provide a section 24 for storing data, for example data that has been processed in accordance with the application.
  • the user interface 15 in some embodiments enables a user to input commands to the electronic device 10, for example via a keypad, and/or to obtain information from the electronic device 10, for example via a display. It is to be understood again that the structure of the electronic device 10 could be supplemented and varied in many ways.
  • Figure 7 shows a schematic block diagram of a further exemplary device 70 or apparatus, according to other embodiments of the application.
  • the electronic device 70 in some embodiments is a personal computer.
  • the electronic device 70 can comprise a processor module 71 which may comprise a single processor or a number of processor cores or units.
  • the processor module 71 can comprise a CPU 74 or a multi-core CPU.
  • the multi-core CPU may have a number of individual processors integrated on a single silicon wafer.
  • the processor module 71 can be configured to execute instructions stored in the memory 72 or otherwise accessible to the processor 71. These instructions, when executed by the processor 71, can cause the electronic device 70 to perform one or more of the functionalities of the electronic device 70 as described herein.
  • the processor module 71 may be configured to execute various program codes which may either reside in the memory 72, or may equally reside in a storage device 73 which may be remote to the processor module 71 .
  • the program codes stored in the memories 72 and 73 may as before comprise code according to the open computing language framework.
  • the storage device 73 may be a hard disk which as depicted in Figure 7 may reside inside the electronic device 70.
  • the further processing module 75 may comprise an application specific processor such as a graphics processor unit (GPU) 76, and also memory 77 which may be configured for use with the GPU 76.
  • GPU graphics processor unit
  • the GPU 76 may be configured to execute instructions or program codes stored in the memory 77, and additionally the memory 77 may also be arranged as a data store.
  • the processing module 75 may also be arranged to have access to the remote storage device 73 within the electronic device 70.
  • the processing module 71, the further processing module 75 and the storage device 73 may be interconnected and arranged to be operatively communicative with each other via a bus 80.
  • bus 80 may be arranged to allow data and program codes to be passed between the processing modules 71 and 75, and the storage device 73.
  • processing module 71 may be arranged to control the further processing module 75 via the bus 80.
  • the structure of the electronic device 70 could be supplemented and varied in many ways.
  • the electronic device 70 may comprise even further processing modules, which may be configured to operatively communicate to other processing modules via the bus 80.
  • FIG. 2 there is illustrated a block diagram of a system for the provision of deferring processor allocation for the execution of a kernel in an open computing language framework environment.
  • Figure 2 depicts the open computing language (OpenCL) host application 2110 as being executed on a central processing unit 210 in the processor 21 of the electronic device 10.
  • OpenCL open computing language
  • an open computing language framework environment host application may also be implemented in any one of DirectCompute, C++ AMP (Accelerated Massive Parallelism) and RenderScript.
  • the OpenCL host 2110 could be implemented on system computing resources other than the CPU 210.
  • the OpenCL host may control access to further processing units in the processor 21.
  • the OpenCL host application 2110 may control access to other processing units such as a further CPU, a DSP or a GPU.
  • the further processing units may be depicted as processors 212 and 213 in Figure 2.
  • Each of these further processing units, in other words the DSP 212 and GPU 213 may each be a compute device 2120 and 2130 respectively.
  • there is also depicted in Figure 2 a compute device 2100 being provisioned for use on the same CPU device 210 as the host application 2110.
  • the CPU device 210 executing the host application 2110 may also be used as a hosting processor for a compute device 2100.
  • Figure 2 also shows a web browser which may be executed on a processing unit of the processor 21.
  • Figure 2 depicts the web browser 2111 as being executed on the same central processing unit as that of the OpenCL host application.
  • the web browser may be executed on a processing unit other than the processing unit executing the OpenCL host application.
  • the web browser 2111 in Figure 2 is shown as having a WebCL implementation 2112.
  • the WebCL implementation 2112 may comprise the functionality to communicate with the OpenCL host application 2110 in order that the web browser can access the various processing units or compute devices of the processor 21.
  • the WebCL implementation may be termed a WebCL application programming interface (API) or a WebCL wrapper.
  • API application programming interface
  • the WebCL implementation 2112 within the web browser 2111 may be arranged to control access to the various processing units in the processor 21 for the execution of kernels from the web browser.
  • the WebCL implementation 2112 can arrange for a kernel to be executed on a specific processing unit or compute device within the processor 21.
  • the WebCL implementation (or WebCL API) 2112 is called through the web browser 2111.
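The call chain just described, a web application invoking the WebCL API through the browser, which in turn delegates kernel execution to the OpenCL host, can be pictured as a thin wrapper. Real WebCL bindings are JavaScript; the sketch below approximates the delegation in Python for consistency with the other examples, and all class and method names are illustrative.

```python
class OpenCLHost:
    """Stands in for the OpenCL host application 2110, which controls
    access to the compute devices (e.g. CPU, DSP, GPU)."""

    def __init__(self, devices):
        self.devices = devices

    def enqueue(self, kernel, device):
        # The host application is the only component that talks to the
        # compute devices directly.
        assert device in self.devices, f"unknown device: {device}"
        return f"{kernel} enqueued on {device}"

class WebCLWrapper:
    """Stands in for the WebCL implementation 2112 inside the browser:
    it forwards kernel requests from web content to the OpenCL host."""

    def __init__(self, host):
        self.host = host

    def run_kernel(self, kernel, device):
        return self.host.enqueue(kernel, device)
```

A web application never touches the devices directly; it calls the browser's WebCL layer, which relays the request to the host.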
  • Figure 2 also depicts a web page or web based application 2113 which may comprise Hyper Text Markup Language (HTML) for parsing on the web browser 2111.
  • HTML Hyper Text Markup Language
  • Other examples of languages which may be used for web based applications include HTML5, JavaScript (JS), OpenCL C kernel language, and OpenGL shading language.
  • the web page or application 2113 may execute at a higher level than the web browser 2111. In other words, the web application 2113 may call the web browser 2111.
  • the software hierarchy between the web application 2113 and the web browser may be depicted in Figure 2 by the web application communicating with the web browser 2111 via the link 2115. It is to be appreciated that although the web browser and web application are depicted in Figure 2 as both being executed on the same processor 210, in other embodiments the web browser 2111 and web application 2113 may each be executed on a different processor.
  • the system may further comprise a network 212.
  • the network 212 may comprise one or more wireline networks, one or more wireless networks, or some combination thereof.
  • the network 212 may comprise the Internet.
  • the network 212 may comprise, in some embodiments, a Content Delivery Network (CDN), which may also be referred to as a Content Distribution Network.
  • CDN Content Delivery Network
  • the network 212 may comprise a wired access link connecting one or more terminal apparatuses to the rest of the network 212 using, for example, Digital Subscriber Line (DSL) technology.
  • DSL Digital Subscriber Line
  • the network 212 may comprise a public and mobile network (for example, a cellular network), such as may be implemented by a network operator (for example, a cellular access provider).
  • the network 212 may operate in accordance with universal terrestrial radio access network (UTRAN) standards, evolved UTRAN (E-UTRAN) standards, current and future implementations of Third Generation Partnership Project (3GPP) LTE (also referred to as LTE-A) standards, current and future implementations of International Telecommunications Union (ITU) International Mobile Telecommunications - Advanced (IMT-A) systems standards, and/or the like.
  • the electronic device 10 may be capable of communicating with the network 212 in accordance with an air interface standard of an applicable cellular system, and/or any number of different wireline or wireless networking techniques, comprising but not limited to Wi-Fi, wireless local access network (WLAN) techniques such as Bluetooth™ (BT), Ultra-wideband (UWB), Institute of Electrical and Electronics Engineers (IEEE) 802.11, 802.16, and/or the like. More particularly, the electronic device 10 may be capable of operating in accordance with various first generation (1G), second generation (2G), 2.5G, third-generation (3G) communication protocols, fourth-generation (4G) communication protocols, Internet Protocol Multimedia Subsystem (IMS) communication protocols (for example, session initiation protocol (SIP)), and/or the like.
  • the mobile terminal may be capable of operating in accordance with 2G wireless communication protocols IS-136 (Time Division Multiple Access (TDMA)), Global System for Mobile communications (GSM), IS-95 (Code Division Multiple Access (CDMA)), and/or the like.
  • the mobile terminal may be capable of operating in accordance with 2.5G wireless communication protocols General Packet Radio Service (GPRS), Enhanced Data GSM Environment (EDGE), and/or the like.
  • the mobile terminal may be capable of operating in accordance with 3G wireless communication protocols such as Universal Mobile Telecommunications System (UMTS), Code Division Multiple Access 2000 (CDMA2000), Wideband Code Division Multiple Access (WCDMA), Time Division-Synchronous Code Division Multiple Access (TD-SCDMA), and/or the like.
  • the mobile terminal may be additionally capable of operating in accordance with 4G wireless communication protocols such as Long Term Evolution (LTE) or Evolved Universal Terrestrial Radio Access Network (E-UTRAN) and/or the like. Additionally, for example, the mobile terminal may be capable of operating in accordance with fourth-generation (4G) wireless communication protocols such as LTE Advanced and/or the like as well as similar wireless communication protocols that may be developed in the future.
  • a web server 213 which may be arranged to operatively communicate with the electronic device 10 via the network 212.
  • the web server 213 may be further arranged to provide the web application 2113 for execution in the electronic device 10.
  • an OpenCL kernel can be part of the web application 2113.
  • the kernel is hosted on the electronic device 10 by the OpenCL host.
  • the function of the OpenCL host may include the functionalities of: creating a context for the kernel; assigning a compute device for the kernel; instructing the kernel to execute, whereby the kernel may use the assigned compute device to perform specific calculations; and managing the flow of data to and from the compute device.
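By way of illustration, the host responsibilities listed above may be sketched as a minimal flow. The names (`Context`, `create_context`, `run_kernel`) are illustrative stand-ins, not the real OpenCL host API:

```python
# Minimal sketch of the OpenCL-style host responsibilities described above.
# All names are hypothetical; they do not model the actual OpenCL API.

class Context:
    """Holds the compute device assigned for a kernel."""
    def __init__(self, device):
        self.device = device

def create_context(device):
    # 1. Create a context and assign a compute device for the kernel.
    return Context(device)

def run_kernel(context, kernel, data):
    # 2. Move input data to the assigned device,
    # 3. execute the kernel there (one invocation per work-item),
    # 4. and return the results to the host.
    device_buffer = list(data)                    # data transfer host -> device
    results = [kernel(x) for x in device_buffer]  # kernel executes on device
    return results                                # data transfer device -> host

ctx = create_context("GPU")
out = run_kernel(ctx, lambda x: x * x, [1, 2, 3])
```

The sketch only models the control and data flow; in a real framework the kernel would be compiled for, and dispatched to, the assigned compute device.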
  • Both OpenCL and WebCL standards can enable application developers to determine a particular processing unit or compute device to execute a kernel of code. According to the current OpenCL standard a particular processing unit or compute device for the execution of a kernel can be selected when creating an OpenCL context.
  • an OpenCL context can be created with the clCreateContextFromType function.
  • This function can enable the application developer to pre-set the type of compute device (or processing unit) for execution of the kernel associated with the OpenCL context.
  • the clCreateContextFromType function may allow a particular compute device to be explicitly selected from the following device type directive list: CL_DEVICE_TYPE_GPU, CL_DEVICE_TYPE_CPU, CL_DEVICE_TYPE_ACCELERATOR.
  • the application developer can specify that the choice of compute device for the execution of the kernel associated with the OpenCL context can be determined by the OpenCL environment. This can be instigated by selecting one of the following device type directives: CL_DEVICE_TYPE_DEFAULT or CL_DEVICE_TYPE_ALL.
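In the OpenCL headers these device type directives are bitfield constants, and clCreateContextFromType keeps the devices whose type matches the supplied mask. A pure-Python mock of that filtering (the platform list and function name below are illustrative) might look like:

```python
# Device-type directives as defined as a bitfield in the OpenCL headers.
CL_DEVICE_TYPE_DEFAULT     = 1 << 0
CL_DEVICE_TYPE_CPU         = 1 << 1
CL_DEVICE_TYPE_GPU         = 1 << 2
CL_DEVICE_TYPE_ACCELERATOR = 1 << 3
CL_DEVICE_TYPE_ALL         = 0xFFFFFFFF

# Hypothetical platform: (device name, device type) pairs.
PLATFORM = [("cpu0", CL_DEVICE_TYPE_CPU), ("gpu0", CL_DEVICE_TYPE_GPU)]

def create_context_from_type(device_type):
    """Mock of clCreateContextFromType: keep devices matching the bitmask."""
    return [name for name, dtype in PLATFORM if dtype & device_type]

gpu_ctx = create_context_from_type(CL_DEVICE_TYPE_GPU)   # explicit pre-set choice
any_ctx = create_context_from_type(CL_DEVICE_TYPE_ALL)   # left to the environment
```

Passing CL_DEVICE_TYPE_ALL corresponds to the case above where the environment, rather than the developer, determines the compute device.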
  • compute device selection in an open computing language framework such as OpenCL and WebCL may be deferred until after the application is running. In other words compute device selection is no longer predetermined by the application developer at the point at which the code is laid down.
  • the selection of the particular device to execute a kernel may be further enhanced by adding additional criteria which can influence the choice of compute device.
  • the additional criteria may be pre-determined for the create context function, and as such may be used to bias any dynamic decision in favour of a particular compute device.
  • the additional criteria may include directives such as "always use GPU", "execute with minimum energy", or "execute with high priority.”
  • the open computing language framework may incorporate a create context function in which an algorithm is deployed in order to determine which compute device should be used to execute the kernel.
  • the create context function in an OpenCL or WebCL environment may be dynamic and responsive to variations in operating conditions for each compute device, and this may be achieved by means of an algorithm.
  • the algorithm which may be used to determine a particular compute device may take as input parameters: pre-determined biasing factors such as the directives "always use GPU", “execute with minimum energy”, or “execute with high priority”, and dynamic based performance information from each of the compute devices.
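A simple sketch of such an algorithm, combining a pre-determined biasing directive with dynamic per-device performance information, is shown below. The directive strings match those quoted above; the default least-loaded policy is an assumption:

```python
# Illustrative compute device selection: a biasing directive plus dynamic
# per-device load steer the choice. The fallback policy is an assumption.

def select_device(devices, load, directive=None):
    """devices: list of (name, kind); load: name -> utilisation in [0, 1]."""
    if directive == "always use GPU":
        gpus = [name for name, kind in devices if kind == "GPU"]
        if gpus:
            return gpus[0]          # biasing directive wins when satisfiable
    # Default policy: pick the least-loaded device right now.
    return min((name for name, _ in devices), key=lambda n: load[n])

devs = [("cpu0", "CPU"), ("gpu0", "GPU")]
pick_low = select_device(devs, {"cpu0": 0.1, "gpu0": 0.9})
pick_gpu = select_device(devs, {"cpu0": 0.1, "gpu0": 0.9}, "always use GPU")
```

Directives such as "execute with minimum energy" would analogously replace the load metric with a per-device energy estimate.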
  • the algorithm in the OpenCL or WebCL framework can assist in the selection of the compute device in a runtime execution environment.
  • the information can be of a dynamic nature (in other words, dynamic compute device data) and may include the computational loading on each processor which can be selected as a compute device.
  • Environmental factors may also be taken into account such as the amount of available battery capacity, and whether the available battery capacity is sufficient to change the state of a processor from a "sleep state" to an "awake state”.
  • Further examples of compute device dynamic data may include current computational loading on a compute device, which may incorporate factors such as activity time, memory consumption and register consumption.
  • static related information may also be factored into the choice of processor as a compute device. For example, developer preferences as to which type of processor is used to execute a particular kernel may be taken into account at the time the compute device is selected. Further examples of static related information may include compute device properties such as performance, memory capacity, number of computational registers, the level of parallelism, hardware pipeline depth, compute device clock frequency, power consumption, memory organisation of the compute device, latencies in the computing system of the compute device, and the direct memory access capability of the compute device. Furthermore, static related information may incorporate data relating to preferred values for an application, such as thread count, vector width and register consumption per thread. For example, OpenCL allows the preferred vector width to be queried from each compute device.
  • Figure 3 illustrates a flowchart for an initial phase of a deferred selection of a compute device in an open computing language framework environment.
  • Figure 3 can show the initial phase of a deferred compute device selection in an OpenCL or WebCL framework environment.
  • the processor 21 may be arranged, as part of the initial phase for deferred compute device selection, to collate static system information regarding the processing system upon which the kernel is to be run.
  • the static system information may comprise details regarding the configuration and type of the various processing units in the processor 21. For example, details may be included relating to how many CPU processing units are available, or whether the processor 21 comprises a GPU or DSP or other function specific computational devices. Further examples of static system information may include information on memory configurations or cache memory size available to each processing unit. Static system information may also include details of clock speeds and the like for each compute device in the processor 21.
  • the processor static data may comprise information relating to the hardware configuration of the compute devices.
  • hardware configuration information may comprise one or more of: memory capacity of a processor (or compute device); operating clock frequency of a processor; pipeline depth count of a processor; configuration of computational registers of a processor; performance of the at least one processor; level of parallelism of the at least one processor; expected power consumption of the at least one processor; memory organization of the at least one processor; latencies in the computing system of the at least one processor; and direct memory access capability of the at least one processor.
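The per-device static data enumerated above may be pictured as a simple record. All field names below are illustrative choices, not names defined by the OpenCL or WebCL specifications:

```python
# Sketch of the per-device static (hardware configuration) data listed above.
from dataclasses import dataclass

@dataclass
class StaticDeviceData:
    name: str
    device_type: str          # e.g. "CPU", "GPU", "DSP"
    memory_capacity_mb: int   # memory capacity of the compute device
    clock_mhz: int            # operating clock frequency
    pipeline_depth: int       # hardware pipeline depth count
    register_count: int       # configuration of computational registers
    parallel_lanes: int       # level of parallelism (e.g. SIMD width)
    power_watts: float        # expected power consumption
    has_dma: bool             # direct memory access capability

# Hypothetical GPU entry as might be collated in processing step 301.
gpu = StaticDeviceData("gpu0", "GPU", 512, 400, 8, 4096, 32, 1.5, True)
```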
  • information relating to the hardware configuration may also comprise information relating to the type of a particular compute device.
  • the type of a compute device may comprise one of: central processing unit; digital signal processor; graphics processing unit; image signal processors; and image video engines.
  • the processor 21 may also be arranged, as part of the initial phase for deferred compute device selection, to collate information relating to platform criteria.
  • the platform criteria may provide a set of default optimization criteria for the execution of a kernel by a web application.
  • Such criteria may include information such as the maximum performance, minimum power, or minimum latency required by a processor (or compute device) for the execution of the kernel.
  • platform criteria may include information as to which type of compute device is preferred for execution with the kernel.
  • the platform criteria may include information such as whether a CPU, GPU, or DSP is the preferred processor for execution of the kernel code.
  • the static system information and platform criteria collated at processing steps 301 and 302 may then be used as inputs to an algorithm for the selection of compute devices in an open computing language framework such as OpenCL and WebCL.
  • the algorithm may be used as an indicator as to which compute device may be used to execute a kernel of OpenCL and WebCL code.
  • the algorithm may be used to pre-select a preferred set of compute devices to be used for execution of the OpenCL or WebCL kernels.
  • the compute device used to execute the OpenCL and WebCL kernel may then be selected from the preferred set at a time instance just before the kernel is executed. Delaying the final compute device selection until just before an OpenCL or WebCL kernel is executed can have the advantage of allowing dynamic conditions relating to the set of compute devices to be taken into account just before the point of execution of the kernel.
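The two-phase scheme described above can be sketched as follows: a preferred set is pre-selected from static data, and the final pick is made from that set just before kernel launch using dynamic conditions. The memory threshold and load metric are illustrative assumptions:

```python
# Two-phase sketch of the deferred selection described above.

def preselect(static_mem_mb):
    """Phase 1 (steps 301-303): keep devices whose memory capacity is
    sufficient for the kernel, judged on static data only."""
    return [name for name, mem in static_mem_mb.items() if mem >= 128]

def final_pick(preferred, load):
    """Phase 2, just before kernel execution: choose the least-loaded
    device from the statically preferred set, using dynamic data."""
    return min(preferred, key=lambda name: load[name])

static = {"cpu0": 256, "gpu0": 512, "dsp0": 64}
preferred = preselect(static)                       # dsp0 dropped statically
chosen = final_pick(preferred, {"cpu0": 0.8, "gpu0": 0.2})
```

Deferring `final_pick` to launch time is what allows the momentary load figures to influence the outcome.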
  • an advantage of the compute device selection algorithm is that the choice of compute device can be made according to the availability of compute devices in the processor 21. It is to be further understood therefore that the choice of compute device for the execution of an OpenCL or WebCL kernel may not have to be made by the application developer when the application code is written.
  • the compute device selection algorithm has the effect of enabling the selection of the most suitable compute device for the execution of a particular kernel.
  • the step of performing the compute device selection algorithm is shown as the processing step 303 in Figure 3.
  • Figure 4 shows a flow chart for the collection of dynamic information relating to the deferred selection of a compute device.
  • the collection of dynamic information may be run as a background task, and comprise the collection of information metrics relating to current operating conditions of the various compute devices.
  • information metrics relating to current operating conditions may comprise information relating to the current battery status of the electronic device 10.
  • the information gathered may also comprise various processing load statistics of each compute device such as processing activity time, memory consumption, and register consumption.
  • the information gathered by the processing step may include data on whether a particular device is in an idle state, or whether the processor is in a powered down or sleep state.
  • the operating state of the compute device may affect the performance of said device in terms of processing time and power consumption. For example, if a compute device is in a sleep state then the amount of time and power consumption required to return to an active state may need to be taken into account when selecting a particular compute device.
  • the processor dynamic data may comprise information relating to the dynamic operating conditions of a compute device.
  • the information may comprise at least one of: an indication of the computing load of the at least one processor for a period of time whilst the at least one processor is in an active state; an indication of the power consumption of the at least one processor for the period of time whilst the at least one processor is in an active state; an indication of the operating mode of the at least one processor for the period of time, wherein the operating mode may comprise at least one of: an active state; a sleeping state; an idle state; and a powered down state; and an indication of the current battery power level.
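These dynamic fields, together with the sleep-state consideration mentioned above, can be sketched as a record plus a usability check. Field names and the wake-up cost figure are assumptions for illustration:

```python
# Sketch of the dynamic per-device data listed above, including the cost of
# waking a sleeping device. Names and thresholds are illustrative.
from dataclasses import dataclass

@dataclass
class DynamicDeviceData:
    load: float            # computing load over the sampling period, 0..1
    power_mw: float        # power consumption over the sampling period
    mode: str              # "active", "idle", "sleeping" or "powered_down"
    battery_pct: float     # current battery power level of the whole device

def usable(d: DynamicDeviceData, wake_cost_pct: float = 2.0) -> bool:
    """A sleeping or powered-down device is only a viable candidate if the
    available battery capacity can afford the transition back to active."""
    if d.mode in ("sleeping", "powered_down"):
        return d.battery_pct >= wake_cost_pct
    return True

ok = usable(DynamicDeviceData(0.0, 0.0, "sleeping", 50.0))
low = usable(DynamicDeviceData(0.0, 0.0, "sleeping", 1.0))
```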
  • the step of collating information relating to the current (dynamic) operating conditions of the various compute devices in the processor 21 is shown as the processing step 401 in Figure 4. With further reference to Figure 4, there is also shown a delay processing step 403.
  • the processing step 403 may perform a delay before the background process repeats the processing step 401 of collating information relating to the current operating conditions.
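The background loop of Figure 4 (collate in step 401, delay in step 403, repeat) can be sketched as below. The probe function is a placeholder for the real platform hooks, and the loop is bounded here purely so the sketch terminates:

```python
# Sketch of the Figure 4 background task: collect dynamic metrics (step 401),
# then delay (step 403), repeated. probe_devices() stands in for real hooks.
import time

def probe_devices():
    # Placeholder: would read load/memory/register statistics per device.
    return {"cpu0": {"load": 0.3}, "gpu0": {"load": 0.6}}

def background_collector(samples_out, iterations=3, delay_s=0.01):
    for _ in range(iterations):
        samples_out.append(probe_devices())   # processing step 401
        time.sleep(delay_s)                   # delay processing step 403

samples = []
background_collector(samples)
```

In practice this would run indefinitely as a background task, with the most recent sample consulted at device-selection time.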
  • a compute device may be selected from a number of compute devices for the execution of a kernel.
  • the compute device may be selected dependent on the collated static data for each compute device and the collated dynamic data for each compute device.
  • With reference to Figure 5 there is shown a flow diagram of the execution phase of the host application code for a deferred selection of a compute device for an open computing language framework application, such as an OpenCL/WebCL host application.
  • processing step 501 which signifies the start of the execution of the WebCL/OpenCL host application code.
  • the particular processing unit or compute device executing the host application code may then perform a check to determine if the code being executed is a call to a create context function.
  • the decision step of determining whether the host application code being executed is a call to a create context function is depicted as processing decision step 503 in Figure 5.
  • the decision step 503 has two outputs.
  • the first output branch 503a indicates that the host application code being executed is not a call to a create context routine.
  • the host application execution cycle is passed back to the processing step 501, where the host application code continues to be executed.
  • the second output branch 503b may be taken if the executed host application code is a call to a create context routine.
  • the execution phase may proceed to the next phase of the host application execution cycle.
  • the next phase of the host application execution cycle may incorporate a processing step whereby additional application developer criteria may be incorporated into the compute device decision process.
  • the application developer may specify a preference as to what type of compute device may be used to execute a particular kernel. This information may be stored within the host application code in the form of a data structure which may then be incorporated into the compute device decision step.
  • application developer criteria may include such indicators as "always use GPU", “execute with minimum energy”, or "execute with high priority”. The step of incorporating application developer criteria into the compute device selection is shown as processing step 505 in Figure 5.
  • the host application execution cycle may incorporate a code analysis processing step whereby the application (host and kernel) code can be analysed to obtain information about the code.
  • information about the code may include details about the code structure and variables set by the application.
  • the step of analysing the application code is shown as processing step 507 in Figure 5.
  • the host application execution cycle may now be in a condition whereby all criteria have been analysed in order that a particular compute device may be selected for the execution of a particular kernel.
  • the host application execution cycle may now continue to execute the WebCL/OpenCL host code. This phase of the execution cycle may be depicted as the processing step 509 in Figure 5.
  • the host application execution cycle may be required to determine when to trigger the process of selecting the compute device for the execution of a kernel.
  • the processing step of selecting the compute device may be triggered to start on at least one of the following conditions: start of input data transfers to a compute device; whether there has been sufficient information gathered to uniquely select a single compute device; and the actual start of a kernel execution.
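The three trigger conditions listed above can be expressed as a simple disjunction, as in the sketch below (the state-flag names are illustrative):

```python
# Sketch of decision step 511: compute device selection is triggered when
# any of the listed conditions holds. Flag names are assumptions.

def should_select_device(state):
    return (state.get("input_transfer_started", False)       # data -> device
            or state.get("single_candidate_identified", False)  # enough info
            or state.get("kernel_execution_started", False))    # kernel start

idle = should_select_device({})
go = should_select_device({"input_transfer_started": True})
```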
  • the step of determining whether sufficient conditions have been met in order to trigger the start of the compute device selection is depicted as the decision step 511 in Figure 5.
  • the decision step 511 has two outputs.
  • the first output branch 511a indicates that conditions for triggering the compute device selection have not been met.
  • the host application execution cycle is passed back to the processing step 509, where the host application code continues to be executed.
  • the second output branch 511b may be taken if the conditions for triggering the compute device selection have been met. In this second decision instance the execution phase may proceed to the next phase of the application execution cycle, processing step 513.
  • the next phase of the host application execution cycle may be the selection of the compute device for processing of a kernel.
  • the compute device selection can be based at least on the information gathered during processing steps 301 and 302, in other words the static system information and platform criteria.
  • the information gathered during processing steps 301 and 302 may not be directly used by the compute device selection processing step. Instead, the output of the compute device selection algorithm 303, of which the static system information and platform criteria are inputs, may be used as an input to the compute device selection step. Furthermore, the compute device selection may also be made in light of the dynamic data relating to the various compute devices collated during processing step 401. Therefore, the dynamic information may also be used as an input to the process of selecting the compute device.
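Combining these inputs, processing step 513 can be sketched as below: the statically pre-selected set (the output of step 303) is narrowed using dynamic load (step 401), with any explicit developer criterion (step 505) taking precedence. The precedence order and load policy are illustrative assumptions:

```python
# Sketch of processing step 513: combine the static pre-selection with
# dynamic load and developer criteria. Policy details are assumptions.

def select_compute_device(preferred_set, dynamic_load, developer_pref=None):
    if developer_pref in preferred_set:
        return developer_pref      # explicit developer criterion wins
    # Otherwise: least dynamic load among the statically preferred devices.
    return min(preferred_set, key=lambda d: dynamic_load[d])

pick = select_compute_device(["cpu0", "gpu0"], {"cpu0": 0.7, "gpu0": 0.3})
forced = select_compute_device(["cpu0", "gpu0"], {"cpu0": 0.7, "gpu0": 0.3},
                               developer_pref="cpu0")
```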
  • the compute device selection processing step may also use the developer criteria gathered during processing step 505, and the code analysis step 507, as additional inputs on which to base the compute device selection.
  • the step of selecting the compute device is shown as processing step 513 in Figure 5.
  • The processing step of executing the kernel on the selected compute device is shown as processing step 515 in Figure 5.
  • the selection of the compute device may be deferred until the kernel is instantiated for execution on a compute device.
  • the execution of the kernel on a compute device may be delayed until there is a transfer of data to the compute device.
  • Figure 6 shows one possible implementation with a flow chart for the deferred selection of a compute device for embodiments deploying the WebCL open language computing framework.
  • processing step 601 which signifies the start of the execution of the application code.
  • the WebCL implementation code 2112 may collate static system information regarding the processing system upon which the WebCL kernel is to be run as in processing step 301. Furthermore the WebCL implementation code may also collate information relating to platform specific criteria as in processing step 302. In some embodiments the platform specific criteria and the static system information may form inputs to a compute device selection algorithm, as in processing step 303. As before, the output of the compute device selection algorithm may be used to assist in a further stage of the compute device selection processing step.
  • the step of collating static system information and platform criteria is shown in Figure 6 as the processing step 601 .
  • the processing step 603 signifies the execution of the WebCL host application code.
  • the processing unit executing the WebCL API implementation 2112 may as before perform a check in order to determine if the code being executed is a call to a create context routine.
  • the decision step of determining whether the host application code being executed is a call to a create context routine is depicted as processing step 605 in Figure 6.
  • the first output branch 605a may be followed if the WebCL host application code being executed is not a call to a create context routine. For this first decision instance the WebCL host application cycle returns to a state whereby the host application code continues to be executed, whilst monitoring for the execution of a call to a create context routine.
  • the second output branch 605b may be taken if the host application code is a call to a create context routine.
  • the second decision instance may lead to the next phase of the WebCL application execution cycle. As before the next phase of the WebCL host application execution cycle may incorporate a processing step whereby application developer criteria may be incorporated into the decision as to which compute device to select.
  • The step of incorporating application developer criteria into the compute device selection is shown as processing step 607 in Figure 6.
  • the WebCL host application execution cycle may then enter into a further decision stage in which the static system information and platform criteria collated in processing step 601 together with the developer criteria of processing step 607 are used to select a compute device for the execution of a kernel.
  • the selection of the compute device may be made in light of the output from the compute device selection algorithm of which static system information and platform criteria are inputs.
  • the cache memory of the parallel execution framework may be checked in order to ascertain if the selected device is in cache memory.
  • a cache memory may have program data stored in cache memory as a result of an earlier kernel being executed on the selected device.
  • the cache memory may comprise the previous device selection if the same web application has been executed previously. In this instance the context for the device may be directly created without any further analysis.
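The caching behaviour described here (and depicted by steps 609, 610 and 622 of Figure 6) can be sketched as follows: if the same application has run before, the cached device choice is reused and the context is created directly; otherwise the full selection runs and the cache is updated. The dict-based cache and names below are illustrative:

```python
# Sketch of the device-selection cache: reuse a previous choice for the same
# web application (decision 609), else run the full selection and update the
# cache (step 622). The dict and function names are assumptions.

selection_cache = {}

def get_device(app_id, select_fn):
    if app_id in selection_cache:        # decision step 609, branch 609a
        return selection_cache[app_id]   # context can be created directly
    device = select_fn()                 # full analysis, steps 611..620
    selection_cache[app_id] = device     # cache update, step 622
    return device

calls = []
def full_selection():
    calls.append(1)                      # count how often full analysis runs
    return "gpu0"

first = get_device("maps-app", full_selection)
second = get_device("maps-app", full_selection)   # served from cache
```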
  • The creation of the context for the selected device already resident in cache memory is depicted as processing step 610 in Figure 6.
  • the decision stage of determining if the selected device is already resident in cache is depicted as 609 in Figure 6.
  • the decision stage 609 has two output branches.
  • the first output branch 609a corresponds to the positive decision of the selected compute device being in cache memory.
  • the WebCL host application execution cycle proceeds to the processing step 610 where the call to the create context routine for the selected compute device is executed.
  • the kernel can be executed on the selected compute device.
  • The step of executing the kernel on the selected compute device is shown as processing step 614 in Figure 6.
  • the second output branch 609b of the decision step 609 may be taken if the selected compute device is not in cache memory. Taking the second output branch 609b may result in the WebCL host application execution cycle proceeding to the processing step 611.
  • the processing step 611 can result in the WebCL host application execution cycle performing the create OpenCL context operation for at least one of the possible compute devices in the processor 21.
  • the processing step 611 may result in the creation of an OpenCL context for all compute devices in the processor 21.
  • processing step 611 may result in the creation of an OpenCL context for a subset of the compute devices available on the processor 21.
  • the WebCL host application execution cycle may then execute application code for each CL context generated for each compute device in the previous processing step 611. Executing application code for each CL context is shown as processing step 613 in Figure 6.
  • the code executed for each CL context may then be checked in decision step 615 in order to ascertain if the compileKernel instruction is to be executed.
  • the WebCL host application execution cycle returns to executing the application code for each CL context along the decision branch 615a, and accordingly the following step of determining whether the application code being executed is a compileKernel call is repeated.
  • the WebCL kernel code is analyzed and compiled corresponding to each CL context for each device of the set of devices.
  • WebCL host application code for each CL context is executed, during which it is checked whether the code is the execution of a kernel.
  • the WebCL host application code for each CL context may be checked for execution of a kernel by determining if the kernel execution function is called. In other embodiments the WebCL host application code for each CL context may be checked for execution of a kernel by monitoring the movement of data from a host to a compute device. The step of determining whether the WebCL host application code for a context is an execution of a kernel is depicted as the decision step 618 in Figure 6.
  • the decision step 618 has two output branches.
  • the first output branch 618a may be taken if it has been determined by the decision step 618 that the WebCL host application code being executed is not an executeKernel function call.
  • This first output branch (or decision instance) may result in the WebCL host application execution cycle being returned to the processing step 617, where the WebCL host application code for each CL context continues to be executed.
  • accordingly the decision step 618 of checking whether the WebCL host application code is an execution of a kernel is repeated.
  • the second output branch 618b may be taken if it has been determined by the decision step 618 that the host code being executed is an executeKernel function call.
  • the second output branch 618b results in the WebCL host application execution cycle moving to the processing step 619 in which dynamic information relating to the selection of a compute device may be collated.
  • the dynamic information may comprise information relating to the varying operating conditions of the various compute devices, and such information may comprise processing load statistics, memory consumption and register consumption for each compute device (or processing unit).
  • the WebCL host application execution cycle enters a processing step whereby a particular CL context and compute device is selected based on the previously collated static and dynamic information relating to each compute device together with the platform criteria as well as developer criteria.
  • the particular CL context selected is a CL context selected from the set of CL contexts opened in the processing step 611.
  • the particular CL context is the CL context which best matches the criteria derived from the static and dynamic information together with the platform and developer criteria gathered during earlier processing steps.
  • The processing step of selecting a particular CL context from the set of CL contexts created in processing step 611 is shown as processing step 620 in Figure 6.
  • Processing step 622 may update the device selection cache according to the selected CL context and compute device of processing step 620.
  • open computing language framework environment as described herein is not limited to implementations in OpenCL and WebCL.
  • Other embodiments may implement the features of the open computing language framework described herein using other systems of implementation such as the Open Graphics Library (OpenGL) and its derivatives such as OpenGL Embedded Systems (OpenGL ES), and WebGL.
  • Some other embodiments may implement the features of the open computing language framework described herein using proprietary systems such as DirectX, Silverlight3D and Macromedia Flash Stage 3D.
  • user equipment is intended to cover any suitable type of wireless user equipment, such as mobile telephones, portable data processing devices or portable web browsers.
  • the various embodiments described above may be implemented in hardware or special purpose circuits, software, logic or any combination thereof.
  • some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto.
  • firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto.
  • While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
  • the embodiments of the application may be implemented by computer software executable by a data processor, such as in the processor entity, or by hardware, or by a combination of software and hardware.
  • any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions.
  • the software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disks or floppy disks, and optical media such as, for example, digital versatile discs (DVD), compact discs (CD), and the data variants thereof.
  • the memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory.
  • the data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
  • Embodiments of the inventions may be practiced in various components such as integrated circuit modules.
  • the design of integrated circuits is by and large a highly automated process.
  • Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
  • circuitry may refer to all of the following: (a) hardware-only circuit implementations (such as implementations in only analogue and/or digital circuitry); (b) to combinations of circuits and software (and/or firmware), such as and where applicable: (i) to a combination of processor(s) or (ii) to portions of processor(s)/software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions; and (c) to circuits, such as a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation, even if the software or firmware is not physically present.
  • circuitry would also cover an implementation of merely a processor (or multiple processors) or portion of a processor and its (or their) accompanying software and/or firmware.
  • circuitry would also cover, for example and if applicable to the particular claim element, a baseband integrated circuit or applications processor integrated circuit for a mobile phone or a similar integrated circuit in a server, a cellular network device, or other network device.
  • processor and memory may comprise, but are not limited to, in this application: (1) one or more microprocessors, (2) one or more processor(s) with accompanying digital signal processor(s), (3) one or more processor(s) without accompanying digital signal processor(s), (4) one or more special-purpose computer chips, (5) one or more field-programmable gate arrays (FPGAs), (6) one or more controllers, (7) one or more application-specific integrated circuits (ASICs), or detector(s), processor(s) (including dual-core and multiple-core processors), digital signal processor(s), controller(s), receiver, transmitter, encoder, decoder, memory (and memories), software, firmware, RAM, ROM, display, user interface, display circuitry, user interface circuitry, user interface software, display software, circuit(s), antenna, antenna circuitry, and circuitry.

Description

A METHOD AND APPARATUS FOR DEFERRING PROCESSOR SELECTION
Field of the Application

The present application relates to applications which can execute across heterogeneous computing platforms.
Background of the Application

Open computing language frameworks such as Open Computing Language (OpenCL) provide an environment whereby computer programs can be developed for execution across a number of heterogeneous computing platforms. In particular OpenCL enables a programming environment whereby developers can write code which can be executed on a number of computing platforms such as servers, personal computers and handheld devices.
Additionally, open computing language frameworks allow program code to be developed which can be distributed across and executed on a diverse selection of computer processing cores such as single core central processing unit (CPU), multi core processing units, graphics processing units (GPU), digital signal processors (DSP), image signal processors (ISP), and image video engines (IVE).
In order to facilitate the portability of open computing language framework code, the functionality and technical features of the underlying hardware (or computing processor) may be isolated from the syntax of the open computing language framework. Detachment of the open computing language framework from the underlying hardware processor therefore enables an application to be executed on a multitude of processors. For instance, an open computing language framework may make provision in the language syntax which enables a particular type of processor to be explicitly specified for executing a particular kernel. Alternatively, an open computing language framework application may be developed such that the framework determines which type of processor is used to execute a kernel. However, for some devices such as handheld portable devices the processing capacity of the various processors on the device may vary considerably as the device is used. For instance, the processing capacity of a particular processor may be dependent on other functionalities which may also require access to the various processors on the device. Furthermore, the availability of processing capacity can also be affected by other factors such as available battery power.
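The two device-selection styles contrasted above can be pictured with a minimal sketch. This is illustrative only: the device table and helper functions below are hypothetical stand-ins invented for the example, not part of the OpenCL or WebCL APIs.

```python
# Illustrative only: a hypothetical device table standing in for the
# compute devices an open computing language framework might expose.
DEVICES = [
    {"name": "cpu0", "type": "CPU"},
    {"name": "gpu0", "type": "GPU"},
    {"name": "dsp0", "type": "DSP"},
]

def select_explicit(device_type):
    """Developer fixes the processor type when the code is written."""
    for dev in DEVICES:
        if dev["type"] == device_type:
            return dev
    raise LookupError("no device of type " + device_type)

def select_framework_default():
    """Framework decides by itself; here, trivially, the first device."""
    return DEVICES[0]
```

In the explicit style the choice is frozen at development time, so a kernel pinned to `"GPU"` runs there even when the GPU is heavily loaded; the deferred selection described in this application revisits that choice at run time instead.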
Summary of the Application
This application proceeds from the consideration that whilst an open computing language framework program allows the selection of a particular processor core for the execution of a kernel, the selection is made when the open computing language framework code is written. Little or no consideration is therefore given to the dynamics of the varying processing capacities of the different processing units in the device at the time when a kernel is going to be run.
The following embodiments aim to address the above problem.
There is provided according to an aspect of the application a method comprising: collating processor static data for at least one processor of a plurality of processors; collating processor dynamic data for the at least one processor of the plurality of processors; and selecting at least one processor from the plurality of processors for the execution of a kernel based at least in part on the collated processor static data for the at least one processor and the collated processor dynamic data for the at least one processor. The processor dynamic data may comprise information relating to the dynamic operating conditions of the at least one processor.
The dynamic operating conditions of the at least one processor may comprise at least one of: an indication of the computing load of the at least one processor for a period of time whilst the at least one processor is in an active state; an indication of the power consumption of the at least one processor for the period of time whilst the at least one processor is in an active state; an indication of the operating mode of the at least one processor for the period of time, wherein the operating mode may comprise at least one of: an active state; a sleeping state; an idle state; and a powered down state; and an indication of current battery power level.
The processor static data may comprise information relating to the hardware configuration of the at least one processor.
The information relating to the hardware configuration of the at least one processor may comprise at least one of: memory capacity of the at least one processor; operating clock frequency of the at least one processor; pipeline depth count of the at least one processor; configuration of computational registers of the at least one processor; performance of the at least one processor; level of parallelism of the at least one processor; expected power consumption of the at least one processor; memory organization of the at least one processor; latencies in the computing system of the at least one processor; and direct memory access capability of the at least one processor.
The information relating to the hardware configuration of the at least one processor may further comprise information relating to the type of the at least one processor. The type of the at least one processor may comprise at least one of: central processing unit; digital signal processor; graphics processing unit; image signal processors; and image video engines. The processor static data may further comprise application related data, wherein application related data may comprise at least one of: maximum thread count; vector width; and register consumption for a thread.
The method may further comprise: collating application developer specific data for the selection of the at least one processor from the plurality of processors, and wherein selecting the at least one processor from the plurality of processors for the execution of a kernel may be further based at least in part on the collated application developer specific data.
Selecting the at least one processor from the plurality of processors for the execution of a kernel can be performed by an algorithm which may combine at least two of: the collated processor static data for the at least one processor; the collated processor dynamic data for the at least one processor; and the collated application developer specific data.
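As one concrete, purely illustrative reading of such a combining algorithm, the sketch below folds static hardware data and dynamic load data into a single score per processor and selects the highest-scoring one. The field names, weights and scoring formula are assumptions made for the example, not values taken from this application.

```python
def score(static, dynamic, developer_weight=1.0):
    """Combine static and dynamic data into one illustrative figure of merit."""
    # Static part: raw capability, e.g. clock rate times level of parallelism.
    merit = static["clock_mhz"] * static["parallel_lanes"]
    # Dynamic part: discount a processor that is not in an active state,
    # and scale by the fraction of its capacity currently free.
    if dynamic["state"] != "active":
        merit *= 0.5
    merit *= 1.0 - dynamic["load"]
    # Expected power consumption acts as an energy penalty.
    merit -= static["power_mw"] * 0.1
    # Application developer specific data enters as a preference weight.
    return merit * developer_weight

def select_processor(processors):
    """Pick the processor whose combined score is highest."""
    return max(processors, key=lambda p: score(p["static"], p["dynamic"],
                                               p.get("weight", 1.0)))
```

With such a score a heavily loaded GPU can lose to a lightly loaded CPU even though its static capability is far higher, which is exactly the run-time effect that a purely static, development-time selection cannot capture.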
The method may further comprise determining a context for the execution of the kernel on the selected at least one processor. Determining a context for the execution of the kernel on the selected at least one processor may further comprise: checking a cache memory in order to determine if a previous instance of the kernel has been executed on the selected at least one processor. The context is preferably part of an open computing language framework host application. The open computing language framework host application is preferably an OpenCL host application. Alternatively, the open computing language framework host application can be a WebCL host application.
The at least one processor can be a compute device, and the plurality of processors can be a plurality of compute devices.
The selection of the at least one processor may be deferred until the kernel is instantiated for execution on the at least one processor.
The selection of the at least one processor can be further deferred until there is a transfer of data to the selected at least one processor.
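The deferral described above can be pictured as follows. In this sketch (the class and callback names are hypothetical) the device field stays unset while the kernel is merely declared; only when the kernel is enqueued, at the latest possible moment before its data would be transferred, is a selection callback run against a fresh snapshot of device load.

```python
class DeferredKernel:
    """Kernel whose target processor is chosen only at enqueue time."""

    def __init__(self, source):
        self.source = source
        self.device = None  # deliberately left unselected at creation time

    def enqueue(self, select, load_snapshot):
        # Selection is deferred to here, so it can use load figures
        # measured just before execution / data transfer.
        self.device = select(load_snapshot)
        return self.device
```

Deferring the decision to `enqueue` means that two instantiations of the same kernel, minutes apart, may legitimately land on different processors as the device's load and battery conditions change.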
According to a further aspect there is provided an apparatus comprising at least one processor of a plurality of processors and at least one memory including computer code for one or more programs, the at least one memory and the computer code configured with the at least one processor of the plurality of processors to cause the apparatus at least to: collate processor static data for the at least one processor of the plurality of processors; collate processor dynamic data for the at least one processor of the plurality of processors; and select at least one processor from the plurality of processors for the execution of a kernel based at least in part on the collated processor static data for the at least one processor and the collated processor dynamic data for the at least one processor.
The processor dynamic data may comprise information relating to the dynamic operating conditions of the at least one processor. The dynamic operating conditions of the at least one processor may comprise at least one of: an indication of the computing load of the at least one processor for a period of time whilst the at least one processor is in an active state; an indication of the power consumption of the at least one processor for the period of time whilst the at least one processor is in an active state; an indication of the operating mode of the at least one processor for the period of time, wherein the operating mode may comprise at least one of: an active state; a sleeping state; an idle state; and a powered down state; and an indication of current battery power level. The processor static data may comprise information relating to the hardware configuration of the at least one processor.
The information relating to the hardware configuration of the least one processor may comprise at least one of: memory capacity of the at least one processor; operating clock frequency of the at least one processor; pipeline depth count of the at least one processor; configuration of computational registers of the at least one processor; performance of the at least one processor; level of parallelism of the at least one processor; expected power consumption of the at least one processor; memory organization of the at least one processor; latencies in the computing system of the at least one processor; and direct memory access capability of the at least one processor.
The information relating to the hardware configuration of the at least one processor may further comprise information relating to the type of the at least one processor.
The type of the at least one processor may preferably comprise at least one of: central processing unit; digital signal processor; graphics processing unit; image signal processors; and image video engines. The processor static data may further comprise application related data, wherein application related data may preferably comprise at least one of: maximum thread count; vector width; and register consumption for a thread.
The apparatus may be further caused to collate application developer specific data for the selection of the at least one processor from the plurality of processors, and wherein the apparatus caused to select the at least one processor from the plurality of processors for the execution of a kernel may be further caused to select the at least one processor from the plurality of processors further based at least in part on the collated application developer specific data.
Selecting the at least one processor from the plurality of processors for the execution of a kernel may preferably be performed by an algorithm which combines at least two of: the collated processor static data for the at least one processor; the collated processor dynamic data for the at least one processor; and the collated application developer specific data.
The apparatus may be further caused to determine a context for the execution of the kernel on the selected at least one processor.
The apparatus caused to determine the context for the execution of the kernel on the selected at least one processor may be further caused to check a cache memory in order to determine if a previous instance of the kernel has been executed on the selected at least one processor.
The context is preferably part of an open computing language framework host application. The open computing language framework host application may be an OpenCL host application. The open computing language framework host application may also be a WebCL host application. The at least one processor is preferably a compute device, and the plurality of processors are preferably a plurality of compute devices.
The selection of the at least one processor may be deferred until the kernel is instantiated for execution on the at least one processor.
The selection of the at least one processor can be further deferred until there is a transfer of data to the selected at least one processor.
According to a further aspect there is provided an apparatus configured to: collate processor static data for at least one processor of a plurality of processors; collate processor dynamic data for the at least one processor of the plurality of processors; and select at least one processor from the plurality of processors for the execution of a kernel based at least in part on the collated processor static data for the at least one processor and the collated processor dynamic data for the at least one processor.
The processor dynamic data may comprise information relating to the dynamic operating conditions of the at least one processor. The dynamic operating conditions of the at least one processor may comprise at least one of: an indication of the computing load of the at least one processor for a period of time whilst the at least one processor is in an active state; an indication of the power consumption of the at least one processor for the period of time whilst the at least one processor is in an active state; an indication of the operating mode of the at least one processor for the period of time, wherein the operating mode may comprise at least one of: an active state; a sleeping state; an idle state; and a powered down state; and an indication of current battery power level.
The processor static data may comprise information relating to the hardware configuration of the at least one processor.
The information relating to the hardware configuration of the at least one processor may comprise at least one of: memory capacity of the at least one processor; operating clock frequency of the at least one processor; pipeline depth count of the at least one processor; configuration of computational registers of the at least one processor; performance of the at least one processor; level of parallelism of the at least one processor; expected power consumption of the at least one processor; memory organization of the at least one processor; latencies in the computing system of the at least one processor; and direct memory access capability of the at least one processor.
The information relating to the hardware configuration of the at least one processor may further comprise information relating to the type of the at least one processor.
The type of the at least one processor may preferably comprise at least one of: central processing unit; digital signal processor; graphics processing unit; image signal processors; and image video engines. The processor static data may further comprise application related data, wherein application related data comprises at least one of: maximum thread count; vector width; and register consumption for a thread.
The apparatus may be further configured to: collate application developer specific data for the selection of the at least one processor from the plurality of processors, and wherein the apparatus configured to select the at least one processor from the plurality of processors for the execution of a kernel may be further configured to select the at least one processor from the plurality of processors further based at least in part on the collated application developer specific data. Selecting the at least one processor from the plurality of processors for the execution of a kernel may be performed by an algorithm which combines at least two of: the collated processor static data for the at least one processor; the collated processor dynamic data for the at least one processor; and the collated application developer specific data.
The apparatus may be further configured to determine a context for the execution of the kernel on the selected at least one processor.
The apparatus configured to determine the context for the execution of the kernel on the selected at least one processor may be further configured to check a cache memory in order to determine if a previous instance of the kernel has been executed on the selected at least one processor.
The context is preferably part of an open computing language framework host application.
The open computing language framework host application is preferably an OpenCL host application. The open computing language framework host application is preferably a WebCL host application.
The at least one processor is preferably a compute device, and the plurality of processors are preferably a plurality of compute devices. The selection of the at least one processor may be deferred until the kernel is instantiated for execution on the at least one processor.
The selection of the at least one processor may be further deferred until there is a transfer of data to the selected at least one processor.
According to a further aspect of the application there is provided an apparatus comprising means for collating processor static data for at least one processor of a plurality of processors; means for collating processor dynamic data for the at least one processor of the plurality of processors; and means for selecting at least one processor from the plurality of processors for the execution of a kernel based at least in part on the collated processor static data for the at least one processor and the collated processor dynamic data for the at least one processor.

According to a yet further aspect there is provided a computer readable medium comprising computer program code thereon, the computer program code configured to realize the actions of the method as discussed herein.
For a better understanding of the present invention, reference will now be made by way of example to the accompanying drawings in which:
Figure 1 shows schematically an electronic device employing embodiments of the invention;
Figure 2 shows schematically a block diagram of a system for the provision of deferring processor allocation for the execution of a kernel in an open computing language framework environment;
Figure 3 shows a block diagram of a flowchart of an initial phase for the deferred selection of a compute device in an open computing language framework;
Figure 4 shows a block diagram of a flowchart for the collection of information relating to the deferred selection of a compute device;
Figure 5 shows a block diagram of a flowchart for the execution phase for a deferred selection of a compute device;
Figure 6 shows a block diagram of a flowchart for the deferred selection of a compute device for an embodiment; and
Figure 7 shows schematically a further electronic device employing embodiments of the invention.
Description of Some Embodiments of the Application

The following describes apparatus and methods for the provision of open computing language frameworks, and in particular web based open computing language (WebCL), in which web applications are able to access heterogeneous processors from within a web browser. In this regard reference is first made to Figure 1, which shows a schematic block diagram of an exemplary electronic device 10 or apparatus, according to some embodiments of the application.
The electronic device 10 is in some embodiments a mobile terminal, mobile phone or user equipment for operation in a wireless communication system.
In other embodiments the electronic device 10 may be a multimedia device comprising a digital video camera. The electronic device 10 comprises a microphone 11, which is linked via an analogue-to-digital converter 14 to a processor block 21. The processor block 21 is further linked via a digital-to-analogue converter 32 to loudspeakers 33. The processor 21 is further linked to a transceiver (TX/RX) 13, to a user interface (UI) 15 and to a memory 22. The processor 21 may comprise a number of individual processing units or processing cores. For example the processor 21 may comprise a combination of a central processing unit (CPU) together with other application specific processing cores such as a graphics processing unit (GPU), a digital signal processor (DSP) or a multimedia accelerator. In some embodiments the processor 21 may comprise a multi-core CPU.
The processor 21 in some embodiments may, for example, be embodied as various means including one or more microprocessors with accompanying digital signal processor(s), one or more processor(s) without an accompanying digital signal processor, one or more coprocessors, one or more multi-core processors, one or more controllers, processing circuitry, one or more computers, various other processing elements including integrated circuits such as, for example, an ASIC (application specific integrated circuit) or FPGA (field programmable gate array), one or more other hardware processors, or some combination thereof.
Accordingly, although illustrated in Figure 1 as a single processor, in some embodiments the processor 21 may comprise a plurality of processors. The plurality of processors may be in operative communication with each other and may be collectively configured to perform one or more functionalities of the electronic device 10 as described herein.
In some embodiments the plurality of processors may be embodied on a single computing device or distributed across a plurality of computing devices collectively configured to function as the electronic device 10.
In some embodiments, the processor 21 is configured to execute instructions stored in the memory 22 or otherwise accessible to the processor 21. These instructions, when executed by the processor 21, may cause the electronic device 10 to perform one or more of the functionalities of the electronic device 10 as described herein. As such, whether configured by hardware or software methods, or by a combination thereof, the processor 21 may comprise an entity capable of performing operations according to embodiments of the present invention while configured accordingly. Thus, for example, when the processor 21 is embodied as an ASIC, FPGA or the like, the processor 21 may comprise specifically configured hardware for conducting one or more operations described herein. Alternatively, as another example, when the processor 21 is embodied as an executor of instructions, such as may be stored in the memory 22, the instructions may specifically configure the processor 21 to perform one or more algorithms and operations described herein.
The processor 21 (or one of the processing units of the processor 21) may be configured to execute various program codes 23. The implemented program codes 23, in some embodiments, can comprise code according to an open computing language framework. In particular the implemented program codes 23 may comprise WebCL or OpenCL code which may, in the case of WebCL code, enable web applications to harness at least one processing core of the processor 21. For example, in some embodiments the WebCL code residing in the program codes 23 may be configured to enable web applications to access a combination of a multi-core CPU, or a single-core CPU and a graphics processing unit, of the processor 21.

The implemented program codes 23 may in some embodiments be stored for example in the memory 22 for retrieval by the processor 21 whenever needed. The memory 22 in some embodiments may further provide a section 24 for storing data, for example data that has been processed in accordance with the application. The user interface 15 in some embodiments enables a user to input commands to the electronic device 10, for example via a keypad, and/or to obtain information from the electronic device 10, for example via a display. It is to be understood again that the structure of the electronic device 10 could be supplemented and varied in many ways.
There is also shown in Figure 7 a schematic block diagram of a further exemplary device 70 or apparatus, according to other embodiments of the application.
The electronic device 70 in some embodiments is a personal computer.
The electronic device 70 can comprise a processor module 71 which may comprise a single processor or a number of processor cores or units. For example the processor module 71 can comprise a CPU 74 or a multi-core CPU. The multi-core CPU may have a number of individual processors integrated on a single silicon wafer.
The processor module 71 can be configured to execute instructions stored in the memory 72 or otherwise accessible to the processor module 71. These instructions, when executed by the processor module 71, can cause the electronic device 70 to perform one or more of the functionalities of the electronic device 70 as described herein.
The processor module 71 may be configured to execute various program codes which may either reside in the memory 72, or may equally reside in a storage device 73 which may be remote to the processor module 71. The program codes stored in the memories 72 and 73 may as before comprise code according to the open computing language framework. In some embodiments the storage device 73 may be a hard disk which as depicted in Figure 7 may reside inside the electronic device 70.

With reference to Figure 7 there is also shown a further processing module 75. In some embodiments the further processing module 75 may comprise an application specific processor such as a graphics processor unit (GPU) 76, and also memory 77 which may be configured for use with the GPU 76.
The GPU 76 may be configured to execute instructions or program codes stored in the memory 77, and additionally the memory 77 may also be arranged as a data store.
The processing module 75 may also be arranged to have access to the remote storage device 73 within the electronic device 70.
The processing module 71, the further processing module 75 and the storage device 73 may be interconnected and arranged to be operatively communicative with each other via a bus 80.
In embodiments the bus 80 may be arranged to allow data and program codes to be passed between the processing modules 71 and 75, and the storage device 73.
In some embodiments the processing module 71 may be arranged to control the further processing module 75 via the bus 80.
It is to be understood that the structure of the electronic device 70 could be supplemented and varied in many ways. For example the electronic device 70 may comprise even further processing modules, which may be configured to operatively communicate to other processing modules via the bus 80.
It would be appreciated that the schematic structures described in figures 2 and 7, and the method steps in figures 3, 4, 5 and 6 represent only a part of the operation of a complete system comprising some embodiments of the application as shown implemented in the electronic device shown in figure 1.
With reference to Figure 2 there is illustrated a block diagram of a system for the provision of deferring processor allocation for the execution of a kernel in an open computing language framework environment. In particular Figure 2 depicts the open computing language (OpenCL) host application 2110 as being executed on a central processing unit 210 in the processor 21 of the electronic device 10. In other embodiments an open language framework environment host application may also be implemented in any one of DirectCompute, C++ AMP (Accelerated Massive Parallelism) and RenderScript. In addition, the OpenCL host 2110 could be implemented on other system computing resources than the CPU 210.
Where elements similar to those shown in Figure 1 are described, the same reference numbers are used.
The OpenCL host may control access to further processing units in the processor 21. For instance, the OpenCL host application 2110 may control access to other processing units such as a further CPU, a DSP or a GPU. The further processing units may be depicted as processors 212 and 213 in Figure 2. Each of these further processing units, in other words the DSP 212 and the GPU 213, may each be a compute device, 2120 and 2130 respectively.
There is also depicted in Figure 2 a compute device 2100 being provisioned for use on the same CPU device 210 as the host application 2110. In other words, the CPU device 210 executing the host application 2110 may also be used as a hosting processor for a compute device 2100.
Additionally, there is shown in Figure 2 a web browser which may also be executed on a processing unit of the processor 21. Figure 2 depicts the web browser 2111 as being executed on the same central processing unit as that of the OpenCL host application. However, it is to be appreciated that in other embodiments the web browser may be executed on a processing unit other than the processing unit executing the OpenCL host application.
The web browser 2111 in Figure 2 is shown as having a WebCL implementation 2112. In embodiments the WebCL implementation 2112 may comprise the functionality to communicate with the OpenCL host application 2110 in order that the web browser can access the various processing units or compute devices of the processor 21.
In some embodiments the WebCL implementation may be termed a WebCL application programming interface (API) or a WebCL wrapper. It is to be appreciated that the WebCL implementation 2112 within the Web Browser 2111 may be arranged to control access to the various processing units in the processor 21 for the execution of kernels from the web browser. In other words the WebCL implementation 2112 can arrange for a kernel to be executed on a specific processing unit or compute device within the processor 21.
It is to be appreciated in embodiments that the WebCL implementation (or WebCL API) 2112 is called through the Web browser 2111.
Figure 2 also depicts a web page or web based application 2113 which may comprise Hyper Text Markup Language (HTML) code for parsing on the web browser 2111. Other examples of languages which may be used for web based applications include HTML5, JavaScript (JS), OpenCL C kernel language, and OpenGL shading language. The web page or application 2113 may execute at a higher level than the web browser 2111. In other words the web application 2113 may call the web browser 2111. The software hierarchy between web application 2113 and web browser may be depicted in Figure 2 by the web application communicating with the web browser 2111 via the link 2115. It is to be appreciated that although the web browser and web application are depicted in Figure 2 as both being executed on the same processor 210, in other embodiments the web browser 2111 and web application 2113 may each be executed on different processors. The system may further comprise a network 212. The network 212 may comprise one or more wireline networks, one or more wireless networks, or some combination thereof.
According to some embodiments, the network 212 may comprise the Internet. The network 212 may comprise, in some embodiments, a Content Delivery Network (CDN), which may also be referred to as a Content Distribution Network. In various embodiments, the network 212 may comprise a wired access link connecting one or more terminal apparatuses to the rest of the network 212 using, for example, Digital Subscriber Line (DSL) technology. In some embodiments, the network 212 may comprise a public and mobile network (for example, a cellular network), such as may be implemented by a network operator (for example, a cellular access provider). The network 212 may operate in accordance with universal terrestrial radio access network (UTRAN) standards, evolved UTRAN (E-UTRAN) standards, current and future implementations of Third Generation Partnership Project (3GPP) LTE (also referred to as LTE-A) standards, current and future implementations of International Telecommunications Union (ITU) International Mobile Telecommunications-Advanced (IMT-A) systems standards, and/or the like. It will be appreciated, however, that where references herein are made to a network standard and/or terminology particular to a network standard, the references are provided merely by way of example and not by way of limitation. It is to be understood in embodiments that the electronic device 10 may be capable of communicating with the network 212 in accordance with an air interface standard of an applicable cellular system, and/or any number of different wireline or wireless networking techniques, comprising but not limited to Wi-Fi, wireless local access network (WLAN) techniques such as BluetoothTM (BT), Ultra-wideband (UWB), Institute of Electrical and Electronics Engineers (IEEE) 802.11, 802.16, and/or the like.
More particularly, the electronic device 10 may be capable of operating in accordance with various first generation (1G), second generation (2G), 2.5G, third-generation (3G) communication protocols, fourth-generation (4G) communication protocols, Internet Protocol Multimedia Subsystem (IMS) communication protocols (for example, session initiation protocol (SIP)), and/or the like. For example, the mobile terminal may be capable of operating in accordance with 2G wireless communication protocols IS-136 (Time Division Multiple Access (TDMA)), Global System for Mobile communications (GSM), IS-95 (Code Division Multiple Access (CDMA)), and/or the like. Also, for example, the mobile terminal may be capable of operating in accordance with 2.5G wireless communication protocols General Packet Radio Service (GPRS), Enhanced Data GSM Environment (EDGE), and/or the like. Further, for example, the mobile terminal may be capable of operating in accordance with 3G wireless communication protocols such as Universal Mobile Telecommunications System (UMTS), Code Division Multiple Access 2000 (CDMA2000), Wideband Code Division Multiple Access (WCDMA), Time Division-Synchronous Code Division Multiple Access (TD-SCDMA), and/or the like. The mobile terminal may be additionally capable of operating in accordance with 4G wireless communication protocols such as Long Term Evolution (LTE) or Evolved Universal Terrestrial Radio Access Network (E-UTRAN) and/or the like. Additionally, for example, the mobile terminal may be capable of operating in accordance with fourth-generation (4G) wireless communication protocols such as LTE Advanced and/or the like as well as similar wireless communication protocols that may be developed in the future.
There is also depicted in Figure 2 a web server 213 which may be arranged to operatively communicate with the electronic device 10 via the network 212.
The web server 213 may be further arranged to provide the web application 2113 for execution in the electronic device 10. It is to be appreciated that an OpenCL kernel can be part of the web application 2113. The kernel is hosted on the electronic device 10 by the OpenCL host. The function of the OpenCL host may include the functionalities of: creating a context for the kernel; assigning a compute device for the kernel; instructing the kernel to execute, whereby the kernel may use the assigned compute device to perform specific calculations; and managing the flow of data to and from the compute device.
Both OpenCL and WebCL standards can enable application developers to determine a particular processing unit or compute device to execute a kernel of code. According to the current OpenCL standard a particular processing unit or compute device for the execution of a kernel can be selected when creating an OpenCL context.
For example, an OpenCL context can be created with the clCreateContextFromType function. This function can enable the application developer to pre-set the type of compute device (or processing unit) for execution of the kernel associated with the OpenCL context. For instance, the clCreateContextFromType function may allow a particular compute device to be explicitly selected from the following device type directive list: CL_DEVICE_TYPE_GPU, CL_DEVICE_TYPE_CPU, CL_DEVICE_TYPE_ACCELERATOR. Alternatively the application developer can specify that the choice of compute device for the execution of the kernel associated with the OpenCL context is to be determined by the OpenCL environment. This can be instigated by selecting one of the following device type directives: CL_DEVICE_TYPE_DEFAULT or CL_DEVICE_TYPE_ALL.
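The effect of these device type directives can be illustrated with a short sketch (Python pseudocode rather than the OpenCL C API itself; the device list and names are invented for illustration): an explicit directive narrows the candidates to one device type, while CL_DEVICE_TYPE_DEFAULT and CL_DEVICE_TYPE_ALL leave every device as a candidate for the environment to choose from.

```python
# Illustrative sketch only -- not the OpenCL C API. Device names are invented.
EXPLICIT_TYPES = {"CL_DEVICE_TYPE_GPU", "CL_DEVICE_TYPE_CPU",
                  "CL_DEVICE_TYPE_ACCELERATOR"}

def candidate_devices(available, directive):
    """Return the compute devices matching a create-context directive.

    `available` is a list of (name, type) pairs. An explicit directive
    filters to matching devices; DEFAULT/ALL defer the choice to the
    environment, so every device remains a candidate.
    """
    if directive in EXPLICIT_TYPES:
        return [name for name, dev_type in available if dev_type == directive]
    if directive in ("CL_DEVICE_TYPE_DEFAULT", "CL_DEVICE_TYPE_ALL"):
        return [name for name, _dev_type in available]
    raise ValueError("unknown device type directive: " + directive)

devices = [("cpu0", "CL_DEVICE_TYPE_CPU"),
           ("gpu0", "CL_DEVICE_TYPE_GPU"),
           ("dsp0", "CL_DEVICE_TYPE_ACCELERATOR")]
print(candidate_devices(devices, "CL_DEVICE_TYPE_GPU"))  # explicit selection
print(candidate_devices(devices, "CL_DEVICE_TYPE_ALL"))  # deferred to environment
```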
In some embodiments compute device selection in an open computing language framework such as OpenCL and WebCL may be deferred until after the application is running. In other words compute device selection is no longer predetermined by the application developer at the point at which the code is laid down.
This can have the advantage of allowing additional dynamic information to be gathered about the current computational load and memory usage for each compute device before a particular device is selected to execute the OpenCL or WebCL application code.
In some embodiments the selection of the particular device to execute a kernel may be further enhanced by adding additional criteria which can influence the choice of compute device. For instance in some embodiments the additional criteria may be pre-determined for the create context function, and as such may be used to bias any dynamic decision in favour of a particular compute device. For example, the additional criteria may include directives such as "always use GPU", "execute with minimum energy", or "execute with high priority."
In these embodiments the open computing language framework may incorporate a create context function in which an algorithm is deployed in order to determine which compute device should be used to execute the kernel. In other words, the create context function in an OpenCL or WebCL environment may be dynamic and responsive to variations in operating conditions for each compute device, and this may be achieved by means of an algorithm. For instance, in some embodiments the algorithm which may be used to determine a particular compute device may take as input parameters: pre-determined biasing factors, such as the directives "always use GPU", "execute with minimum energy", or "execute with high priority", and dynamic performance information from each of the compute devices. In other words the algorithm in the OpenCL or WebCL framework can assist in the selection of the compute device in a runtime execution environment. This has the advantage of allowing the choice of compute device to be deferred, and to take into account information which can affect the selection of a particular processor to be a compute device at the instant in time the compute device is required. For example, such information can be of a dynamic nature (in other words dynamic compute device data) and may include the computational loading on each processor which can be selected as a compute device. Environmental factors may also be taken into account, such as the amount of available battery capacity, and whether the available battery capacity is sufficient to change the state of a processor from a "sleep state" to an "awake state". Further examples of compute device dynamic data may include the current computational loading on a compute device, which may incorporate factors such as activity time, memory consumption and register consumption.
In addition static related information may also be factored into the choice of processor as a compute device. For example, developer preferences as to which type of processor is used to execute a particular kernel may be taken into account at the time the compute device is selected. Further examples of static related information may include compute device properties such as performance, memory capacity, number of computational registers, the level of parallelism, hardware pipeline depth, compute device clock frequency, power consumption, memory organisation of the compute device, latencies in the computing system of the compute device, and the direct memory access capability of the compute device. Furthermore, static related information may incorporate data relating to preferred values for an application, such as thread count, vector width and register consumption per thread. For example, OpenCL allows the preferred vector width to be queried from each compute device.
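One possible shape for a selection algorithm combining static data, dynamic data and pre-determined biasing directives is sketched below. The cost weightings, field names and device records are assumptions made purely for illustration; an actual implementation would tune the cost model to the platform in question.

```python
# Illustrative sketch of a compute device selection algorithm.
# All weightings and field names are invented for this example.

def score_device(static, dynamic, directives):
    """Lower score is better. Combines static device properties, current
    dynamic load, and pre-determined developer/platform directives."""
    score = dynamic["load"] * 10.0  # prefer lightly loaded devices
    # "execute with minimum energy" biases heavily toward low-power devices
    energy_weight = 5.0 if "execute with minimum energy" in directives else 1.0
    score += static["power_per_op"] * energy_weight
    if dynamic["state"] == "sleep":
        score += static["wakeup_cost"]  # cost of leaving the sleep state
    if "always use GPU" in directives and static["type"] != "GPU":
        score += 1000.0  # effectively exclude non-GPU devices
    return score

def select_device(devices, directives):
    return min(devices,
               key=lambda d: score_device(d["static"], d["dynamic"], directives))

devices = [
    {"name": "cpu0",
     "static": {"type": "CPU", "power_per_op": 2.0, "wakeup_cost": 0.0},
     "dynamic": {"load": 0.7, "state": "active"}},
    {"name": "gpu0",
     "static": {"type": "GPU", "power_per_op": 1.0, "wakeup_cost": 3.0},
     "dynamic": {"load": 0.1, "state": "sleep"}},
]
print(select_device(devices, {"always use GPU"})["name"])
```

The same scoring function accepts any combination of directives, so the decision remains deferred: the dynamic fields are read only at the moment the selection is triggered.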
Figure 3 illustrates a flowchart for an initial phase of a deferred selection of a compute device in an open computing language framework environment. In particular Figure 3 shows the initial phase of a deferred compute device selection in an OpenCL or WebCL framework environment.
In some embodiments the processor 21 may be arranged, as part of the initial phase for deferred compute device selection, to collate static system information regarding the processing system upon which the kernel is to be run. The static system information may comprise details regarding the configuration and type of the various processing units in the processor 21. For example, details may be included relating to how many CPU processing units are available, or whether the processor 21 comprises a GPU or DSP or other function specific computational devices. Further examples of static system information may include information on memory configurations or cache memory size available to each processing unit. Static system information may also include details of clock speeds and the like for each compute device in the processor 21.
In other words there may be provided means for collating processor static data for at least one compute device of a plurality of compute devices. The processor static data may comprise information relating to the hardware configuration of the compute devices. For example, hardware configuration information may comprise one or more of: memory capacity of a processor (or compute device); operating clock frequency of a processor; pipeline depth count of a processor; configuration of computational registers of a processor; performance of the at least one processor; level of parallelism of the at least one processor; expected power consumption of the at least one processor; memory organization of the at least one processor; latencies in the computing system of the at least one processor; and direct memory access capability of the at least one processor. In some embodiments information relating to the hardware configuration may also comprise information relating to the type of a particular compute device. For example the type of a compute device may comprise one of: central processing unit; digital signal processor; graphics processing unit; image signal processors; and image video engines.
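A minimal sketch, with assumed field names, of how the processor static data listed above might be collated into one record per compute device (the probe values are invented for illustration):

```python
# Illustrative record for processor static data; field names are assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class StaticDeviceData:
    name: str
    device_type: str        # e.g. "CPU", "GPU", "DSP"
    memory_capacity_mb: int
    clock_mhz: int
    pipeline_depth: int
    register_count: int
    parallelism: int        # e.g. number of hardware lanes/threads
    dma_capable: bool

def collate_static_data(probe_results):
    """Turn raw per-device probe dictionaries into immutable records,
    keyed by device name, for later use by the selection algorithm."""
    return {r["name"]: StaticDeviceData(**r) for r in probe_results}

probes = [{"name": "gpu0", "device_type": "GPU", "memory_capacity_mb": 512,
           "clock_mhz": 450, "pipeline_depth": 16, "register_count": 2048,
           "parallelism": 64, "dma_capable": True}]
table = collate_static_data(probes)
print(table["gpu0"].device_type)
```

Because these properties do not change at runtime, the collation needs to run only once, during the initial phase of Figure 3.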
The collation of static system information for the deferred compute device selection is shown as the processing step 301 in Figure 3.
Further, in some embodiments the processor 21 may also be arranged, as part of the initial phase for deferred compute device selection, to collate information relating to platform criteria. The platform criteria may provide a set of default optimization criteria for the execution of a kernel by a web application. Such criteria may include information such as max performance, min power, min latency required by a processor (or compute device) for the execution of the kernel.
Additionally, platform criteria may include information as to which type of compute device is preferred for execution with the kernel. For example, the platform criteria may include information such as whether a CPU, GPU, or DSP is the preferred processor for execution of the kernel code.
The collation of platform criteria for the deferred compute device selection is shown as the processing step 302 in Figure 3.
The static system information and platform criteria collated at processing steps 301 and 302 may then be used as inputs to an algorithm for the selection of compute devices in an open computing language framework such as OpenCL and WebCL.
In some embodiments the algorithm may be used as an indicator as to which compute device may be used to execute a kernel of OpenCL and WebCL code.
For example, the algorithm may be used to pre-select a preferred set of compute devices to be used for execution of the OpenCL or WebCL kernels. The compute device used to execute the OpenCL or WebCL kernel may then be selected from the preferred set at a time instance just before the kernel is executed. Delaying the final compute device selection until just before an OpenCL or WebCL kernel is executed can have the advantage of allowing dynamic conditions relating to the set of compute devices to be taken into account just before the point of execution of the kernel.
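The two-phase approach described above might be sketched as follows, with invented thresholds and device records: phase one filters on static properties once and early, while phase two consults fresh dynamic data only at the point the kernel is about to run.

```python
# Illustrative two-phase selection; records and threshold are invented.

def preselect(devices, min_memory_mb):
    """Phase 1: build the preferred set from static properties only."""
    return [d for d in devices if d["memory_mb"] >= min_memory_mb]

def final_pick(preferred, current_load):
    """Phase 2: just before kernel launch, pick the least loaded device
    from the preferred set using freshly collected dynamic data."""
    return min(preferred, key=lambda d: current_load[d["name"]])

devices = [{"name": "cpu0", "memory_mb": 1024},
           {"name": "gpu0", "memory_mb": 512},
           {"name": "dsp0", "memory_mb": 64}]
preferred = preselect(devices, min_memory_mb=256)  # runs once, early
load_now = {"cpu0": 0.9, "gpu0": 0.2}              # sampled at launch time
print(final_pick(preferred, load_now)["name"])
```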
It is to be understood that the advantage of the compute device selection algorithm is that the choice of compute device can be made according to the availability of compute devices in the processor 21. It is to be further understood therefore that the choice of compute device for the execution of an OpenCL or WebCL kernel may not have to be made by the application developer when the application code is written.
Furthermore, the compute device selection algorithm has the effect of enabling the selection of the most suitable compute device for the execution of a particular kernel.
The step of performing the compute device selection algorithm is shown as the processing step 303 in Figure 3.

There is shown in Figure 4 a flow chart for the collection of dynamic information relating to the deferred selection of a compute device.
The collection of dynamic information may be run as a background task, and comprise the collection of information metrics relating to current operating conditions of the various compute devices.
In some embodiments information metrics relating to current operating conditions may comprise information relating to the current battery status of the electronic device 10. The information gathered may also comprise various processing load statistics of each compute device such as processing activity time, memory consumption, and register consumption.
Furthermore the information gathered by the processing step may include data on whether a particular device is in an idle state, or whether the processor is in a powered down or sleep state.
The operating state of the compute device may affect the performance of said device in terms of processing time and power consumption. For example, if a compute device is in a sleep state then the amount of time and power consumption required to return to an active state may need to be taken into account when selecting a particular compute device.
In other words there may be provided means for collating processor dynamic data for at least one compute device of a plurality of compute devices. The processor dynamic data may comprise information relating to the dynamic operating conditions of a compute device. For example such information may comprise at least one of: an indication of the computing load of the at least one processor for a period of time whilst the at least one processor is in an active state; an indication of the power consumption of the at least one processor for the period of time whilst the at least one processor is in an active state; an indication of the operating mode of the at least one processor for the period of time, wherein the operating mode may comprise at least one of: an active state; a sleeping state; an idle state; and a powered down state; and an indication of the current battery power level.
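One illustrative check of this kind, under assumed field names: a sleeping or powered down compute device remains a candidate only if the remaining battery capacity can cover the energy needed to wake it.

```python
# Illustrative viability filter over processor dynamic data.
# Field names and energy units are invented for this sketch.

def viable_devices(dynamic_data, battery_level):
    """Return the names of devices that are currently viable candidates.

    `dynamic_data` maps device name to a dict with an operating "state"
    and the "wake_energy" needed to bring the device to an active state.
    """
    viable = []
    for name, d in dynamic_data.items():
        sleeping = d["state"] in ("sleep", "powered_down")
        if sleeping and d["wake_energy"] > battery_level:
            continue  # not enough battery capacity to wake the device
        viable.append(name)
    return viable

status = {"cpu0": {"state": "active", "wake_energy": 0.0},
          "gpu0": {"state": "sleep", "wake_energy": 5.0}}
print(viable_devices(status, battery_level=3.0))   # gpu0 excluded
print(viable_devices(status, battery_level=10.0))  # gpu0 can be woken
```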
The step of collating information relating to the current (dynamic) operating conditions of the various compute devices in the processor 21 is shown as the processing step 401 in Figure 4.

With further reference to Figure 4, there is also shown a delay processing step 403.
The processing step 403 may perform a delay before the background process repeats the processing step 401 of collating information relating to the current operating conditions.
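The background collection loop of Figure 4 might be sketched as below; the sampling function, iteration count and delay value are illustrative stand-ins for a long-running background task.

```python
# Illustrative background collection loop (steps 401 and 403 of Figure 4).
import time

def collect_dynamic_info(sample_fn, iterations, delay_s):
    """Call `sample_fn` once per iteration (step 401: gather current
    operating conditions), pausing `delay_s` between samples (step 403)."""
    history = []
    for _ in range(iterations):
        history.append(sample_fn())
        time.sleep(delay_s)
    return history

# A real sampler would query load/memory/register counters per device;
# here two canned samples stand in for those measurements.
samples = iter([{"gpu0": 0.2}, {"gpu0": 0.5}])
history = collect_dynamic_info(lambda: next(samples), iterations=2, delay_s=0.0)
print(history[-1])
```

In a deployed system the loop would run indefinitely as a background task, with the most recent sample being the one consulted at device selection time.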
It is to be understood in embodiments that a compute device may be selected from a number of compute devices for the execution of a kernel. The compute device may be selected dependent on the collated static data for each compute device and the collated dynamic data for each compute device.
In Figure 5 there is shown a flow diagram of the execution phase of the host application code for a deferred selection of a compute device for an open language computing frame application, such as an OpenCL/WebCL host application.
With reference to Figure 5 there is shown the processing step 501 which signifies the start of the execution of the WebCL/OpenCL host application code. The particular processing unit or compute device executing the host application code may then perform a check to determine if the code being executed is a call to a create context function. The decision step of determining whether the host application code being executed is a call to a create context function is depicted as processing decision step 503 in Figure 5.
With further reference to Figure 5, the decision step 503 has two outputs. The first output branch 503a indicates that the host application code being executed is not a call to a create context routine. In this first decision instance the host application execution cycle is passed back to the processing step 501, where the host application code continues to be executed. The second output branch 503b may be taken if the executed host application code is a call to a create context routine. In this second decision instance the execution phase may proceed to the next phase of the host application execution cycle. The next phase of the host application execution cycle may incorporate a processing step whereby additional application developer criteria may be incorporated into the compute device decision process.
For example, at the point of writing the open language framework application code the application developer may specify a preference as to what type of compute device may be used to execute a particular kernel. This information may be stored within the host application code in the form of a data structure which may then be incorporated into the compute device decision step. For example, application developer criteria may include such indicators as "always use GPU", "execute with minimum energy", or "execute with high priority". The step of incorporating application developer criteria into the compute device selection is shown as processing step 505 in Figure 5.
In other words there may be provided means for collating application specific data for the selection of a compute device.
The host application execution cycle may incorporate a code analysis processing step whereby the application (host and kernel) code can be analysed to obtain information about the code. In some embodiments information about the code may include details about the code structure and variables set by the application.
The step of analysing the application code is shown as processing step 507 in Figure 5.

The host application execution cycle may now be in a condition whereby all criteria have been analysed in order that a particular compute device may be selected for the execution of a particular kernel.
The host application execution cycle may now continue to execute the WebCL/OpenCL host code. This phase of the execution cycle may be depicted as the processing step 509 in Figure 5.
Whilst executing the WebCL/OpenCL host application code, the host application execution cycle may be required to determine when to trigger the process of selecting the compute device for the execution of a kernel.
In some embodiments the processing step of selecting the compute device may be triggered to start on at least one of the following conditions: start of input data transfers to a compute device; whether there has been sufficient information gathered to uniquely select a single compute device; and the actual start of a kernel execution. The step of determining whether sufficient conditions have been met in order to trigger the start of the compute device selection is depicted as the decision step 511 in Figure 5.
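The trigger check of decision step 511 can be sketched as a simple disjunction of the three listed conditions; the flag names below are invented for illustration.

```python
# Illustrative trigger check for starting compute device selection.
def should_select_device(conditions):
    """Selection starts when any one of the trigger conditions holds."""
    return (conditions["input_transfer_started"]
            or conditions["info_sufficient"]
            or conditions["kernel_execution_started"])

state = {"input_transfer_started": False,
         "info_sufficient": False,
         "kernel_execution_started": False}
print(should_select_device(state))   # branch 511a: keep executing host code
state["input_transfer_started"] = True
print(should_select_device(state))   # branch 511b: proceed to selection
```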
From Figure 5 it can be seen that the decision step 511 has two outputs. The first output branch 511a indicates that conditions for triggering the compute device selection have not been met. In this first decision instance the host application execution cycle is passed back to the processing step 509, where the host application code continues to be executed.
The second output branch 511b may be taken if the conditions for triggering the compute device selection have been met. In this second decision instance the execution phase may proceed to the next phase of the application execution cycle, processing step 513.
The next phase of the host application execution cycle may be the selection of the compute device for processing of a kernel. In embodiments the compute device selection can be based at least on the information gathered during processing steps 301 and 302, in other words the static system information and platform criteria.
It is to be understood that the information gathered during processing steps 301 and 302 may not be directly used by the compute device selection processing step. Instead, the output of the compute device selection algorithm 303, of which the static system information and platform criteria are inputs, may be used as a selection input to the compute device selection processing step. Furthermore, compute device selection may also be made in light of the dynamic data relating to the various compute devices collated during processing step 401. Therefore, the dynamic information may also be used as an input to the process of selecting the compute device.
In some embodiments the compute device selection processing step may also use the developer criteria gathered during processing step 505, and the code analysis step 507, as additional inputs on which to base the compute device selection.
The step of selecting the compute device is shown as processing step 513 in Figure 5.
The processing step of executing the kernel on the selected compute device is shown as processing step 515 in Figure 5.
In other words the selection of the compute device may be deferred until the kernel is instantiated for execution on a compute device.
Furthermore, in some embodiments the execution of the kernel on a compute device may be delayed until there is a transfer of data to the compute device.

Figure 6 shows one possible implementation with a flow chart for the deferred selection of a compute device for embodiments deploying the WebCL open language computing framework.
With reference to Figure 6 there is shown the processing step 601 which signifies the start of the execution of the application code.
In some embodiments the WebCL implementation code 2112 may collate static system information regarding the processing system upon which the WebCL kernel is to be run as in processing step 301. Furthermore the WebCL implementation code may also collate information relating to platform specific criteria as in processing step 302. In some embodiments the platform specific criteria and the static system information may form inputs to a compute device selection algorithm, as in processing step 303. As before, the output of the compute device selection algorithm may be used to assist in a further stage of the compute device selection processing step.
The step of collating static system information and platform criteria is shown in Figure 6 as the processing step 601.
The processing step 603 signifies the execution of the WebCL host application code.
The processing unit executing the WebCL API implementation 2112 may as before perform a check in order to determine if the code being executed is a call to a create context routine.
The decision step of determining whether the host application code being executed is a call to a create context routine is depicted as processing step 605 in Figure 6.
With reference to Figure 6, it can be seen that as before the decision step 605 has two outputs. The first output branch 605a may be followed if the WebCL host application code being executed is not a call to a create context routine. For this first decision instance the WebCL host application cycle returns to a state whereby the host application code continues to be executed, whilst monitoring for the execution of a call to a create context routine.
The second output branch 605b may be taken if the host application code is a call to a create context routine. The second decision instance may lead to the next phase of the WebCL application execution cycle. As before the next phase of the WebCL host application execution cycle may incorporate a processing step whereby application developer criteria may be incorporated into the decision as to which compute device to select.
The step of incorporating application developer criteria into the compute device selection is shown as processing step 607 in Figure 6.
The WebCL host application execution cycle may then enter into a further decision stage in which the static system information and platform criteria collated in processing step 601, together with the developer criteria of processing step 607, are used to select a compute device for the execution of a kernel.
It is to be appreciated in some embodiments that the selection of the compute device may be made in light of the output from the compute device selection algorithm of which static system information and platform criteria are inputs.
Upon selection of a potential compute device for execution of a kernel the cache memory of the parallel execution framework may be checked in order to ascertain if the selected device is in cache memory.
It is to be appreciated that the cache memory may have program data stored as a result of an earlier kernel being executed on the selected device. The cache memory may comprise the previous device selection if the same web application has been executed previously. In this instance the context for the device may be directly created without any further analysis.
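The cache check described above might be sketched as follows, with an assumed application identifier as the cache key: a hit allows the context for the previously selected device to be created directly (processing step 610), while a miss falls through to the full analysis path. The class and key format are illustrative assumptions.

```python
# Illustrative cache of previous device selections, keyed by web application.
class DeviceSelectionCache:
    def __init__(self):
        self._cache = {}

    def lookup(self, app_id):
        """Return the previously selected device for this application,
        or None if the application has not been executed before."""
        return self._cache.get(app_id)

    def store(self, app_id, device):
        self._cache[app_id] = device

cache = DeviceSelectionCache()
print(cache.lookup("app.example/kernel"))      # first run: cache miss
cache.store("app.example/kernel", "gpu0")      # remember the selection
print(cache.lookup("app.example/kernel"))      # later run: reuse "gpu0"
```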
The creation of the context for the selected device already resident in cache memory is depicted as processing step 610 in Figure 6. The decision stage of determining if the selected device is already resident in cache is depicted as 609 in Figure 6.
The decision stage 609 has two output branches. The first output branch 609a corresponds to the positive decision of the selected compute device being in cache memory. As a result of this decision instance the WebCL host application execution cycle proceeds to the processing step 610 where the call to the create context routine for the selected compute device is executed. Upon creating the context for the compute device the kernel can be executed on the selected compute device.
The step of executing the kernel on the selected compute device is shown as processing step 614 in Figure 6.
The second output branch 609b of the decision step 609 may be taken if the selected compute device is not in cache memory. Taking the second output branch 609b may result in the WebCL host application execution cycle proceeding to the processing step 611.
The processing step 611 can result in the WebCL host application execution cycle performing the create OpenCL context for at least one of the possible compute devices in the processor 21. In some embodiments the processing step 611 may result in the create OpenCL context for all compute devices in the processor 21.
In other embodiments the processing step 611 may result in the create OpenCL context for a subset of the compute devices available on the processor 21. The WebCL host application execution cycle may then execute application code for each CL context generated for each compute device in the previous processing step 611. Executing application code for each CL context is shown as processing step 613 in Figure 6.
In embodiments the code executed for each CL context may then be checked in decision step 615 in order to ascertain if the compileKernel instruction is to be executed.
As a result of the above check, if it is determined that the code is not a compileKernel function call, the WebCL host application execution cycle returns to executing the application code for each CL context along the decision branch 615a, and accordingly the following step of determining whether the application code being executed is the compileKernel function call is repeated.
However, if as a result of the above check it is determined that the code is a compileKernel function call, the WebCL host application execution cycle proceeds to the next processing step 616 via the decision branch 615b.
At processing step 616 the WebCL kernel code is analyzed and compiled for each CL context corresponding to each device of the set of devices. In processing step 617, WebCL host application code for each CL context is executed, during which it is checked whether the code is the execution of a kernel.
In embodiments the WebCL host application code for each CL context (corresponding to each compute device) may be checked for execution of a kernel by determining if the kernel execution function is called. In other embodiments the WebCL host application code for each CL context may be checked for execution of a kernel by monitoring the movement of data from a host to a compute device. The step of determining whether the WebCL host application code for a context is an execution of a kernel is depicted as the decision step 618 in Figure 6.
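Decision step 618, under the two detection assumptions just described, could be sketched as a simple predicate. The event dictionaries are hypothetical; a real implementation would hook the runtime's API entry points rather than inspect dicts.

```python
# Sketch of decision step 618, assuming a kernel launch can be recognised
# either by an explicit executeKernel call or by observing data moving from
# the host to a compute device. The event dictionaries are hypothetical.

def is_kernel_execution(event):
    """True when the host-code event marks a kernel launch."""
    return (event.get("call") == "executeKernel"
            or event.get("type") == "host_to_device_transfer")
```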
From Figure 6, it can be seen that the decision step 618 has two output branches. The first output branch 618a may be taken if it has been determined by the decision step 618 that the WebCL host application code being executed is not an executeKernel function call. This first output branch (or decision instance) may result in the WebCL host application execution cycle being returned to the processing step 617 and the WebCL host application code for each CL context continuing to be executed. As before, the following decision step 618 of checking whether the WebCL host application code is an execution of a kernel is repeated.
The second output branch 618b may be taken if it has been determined by the decision step 618 that the host code being executed is an executeKernel function call. The second output branch 618b results in the WebCL host application execution cycle moving to the processing step 619 in which dynamic information relating to the selection of a compute device may be collated. As before the dynamic information may comprise information relating to the varying operating conditions of the various compute devices, and such information may comprise processing load statistics, memory consumption and register consumption for each compute device (or processing unit).
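Processing step 619 — collating the dynamic information per compute device — might look like the following hedged sketch. The three sampler callables are hypothetical platform monitors supplying the processing load, memory consumption and register consumption figures mentioned above.

```python
# A hedged sketch of processing step 619: collate dynamic information for
# each compute device. The sampler callables are hypothetical platform
# monitors, not part of any real WebCL/OpenCL API.

def collate_dynamic_info(devices, sample_load, sample_memory, sample_registers):
    return {
        dev: {
            "load": sample_load(dev),           # processing load statistics
            "memory": sample_memory(dev),       # memory consumption
            "registers": sample_registers(dev), # register consumption
        }
        for dev in devices
    }

# Usage with fixed sample values standing in for live measurements:
info = collate_dynamic_info(
    ["cpu0", "gpu0"],
    sample_load=lambda d: {"cpu0": 0.7, "gpu0": 0.2}[d],
    sample_memory=lambda d: {"cpu0": 512, "gpu0": 128}[d],
    sample_registers=lambda d: 32,
)
```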
Once the dynamic information relating to each CL context and consequently each compute device has been collated, the WebCL host application execution cycle enters a processing step whereby a particular CL context and compute device is selected based on the previously collated static and dynamic information relating to each compute device together with the platform criteria as well as developer criteria.
It is to be appreciated in embodiments that the particular CL context selected is a CL context selected from the set of CL contexts opened in the processing step 611.
It is to be further appreciated in embodiments that the particular CL context is the CL context which best matches the criteria derived from the static and dynamic information together with the platform and developer criteria gathered during earlier processing steps.
The processing step of selecting a particular CL context from the set of CL contexts created in processing step 611 is shown as processing step 620 in Figure 6.
Once the particular context and the corresponding compute device have been selected, all other CL contexts created in processing step 611 may be deleted. The processing step of deleting the CL contexts which were not selected in processing step 620 is shown as processing step 621 in Figure 6.
Processing step 622 may update the device selection cache according to the selected CL context and compute device of processing step 620.
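Steps 620 to 622 taken together — scoring each candidate from the collated data, keeping the best-matching context, deleting the rest, and updating the device selection cache — can be sketched as below. The scoring rule (favour high static performance, penalise current load) is an assumed example only; a real implementation would combine the platform and developer criteria as well.

```python
# Hypothetical sketch of processing steps 620-622: select the best context
# from the collated static and dynamic data, prune the unselected contexts,
# and record the choice in the device selection cache. The weighting below
# is an assumed example, not the method prescribed by the text.

def select_and_prune(contexts, static, dynamic, cache, kernel_id):
    def score(dev):
        # Assumed criterion: high static performance, low current load.
        return static[dev]["performance"] * (1.0 - dynamic[dev]["load"])
    best = max(contexts, key=score)         # step 620: select best context
    for dev in list(contexts):              # step 621: delete the others
        if dev != best:
            del contexts[dev]
    cache[kernel_id] = best                 # step 622: update selection cache
    return best

# Usage with illustrative data:
ctxs = {"cpu0": {}, "gpu0": {}}
selection_cache = {}
best = select_and_prune(
    ctxs,
    static={"cpu0": {"performance": 10}, "gpu0": {"performance": 40}},
    dynamic={"cpu0": {"load": 0.1}, "gpu0": {"load": 0.5}},
    cache=selection_cache,
    kernel_id="blur",
)
```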
Execution of the kernel on the selected compute device is shown as processing step 623 in Figure 6.
It is to be understood that the open computing language framework environment as described herein is not limited to implementations in OpenCL and WebCL. Other embodiments may implement the features of the open computing language framework described herein using other systems of implementation such as the Open Graphics Library (OpenGL) and its derivatives such as OpenGL for Embedded Systems (OpenGL ES), and WebGL. Some other embodiments may implement the features of the open computing language framework described herein using proprietary systems such as DirectX, Silverlight3D and Macromedia Flash Stage 3D.
Although the above examples describe embodiments of the invention operating within an electronic device 10 or apparatus, it would be appreciated that the invention as described herein may be implemented as part of any computer system executing instructions on one or more computational processing units.
It shall be appreciated that the term user equipment is intended to cover any suitable type of wireless user equipment, such as mobile telephones, portable data processing devices or portable web browsers.
In general, the various embodiments described above may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof. The embodiments of the application may be implemented by computer software executable by a data processor, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disks or floppy disks, and optical media such as, for example, digital versatile discs (DVD), compact discs (CD), and the data variants thereof.
The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
Programs, such as those provided by Synopsys, Inc. of Mountain View, California, and Cadence Design, of San Jose, California, automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.

The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.
As used in this application, the term circuitry may refer to all of the following: (a) hardware-only circuit implementations (such as implementations in only analogue and/or digital circuitry); (b) combinations of circuits and software (and/or firmware), such as and where applicable: (i) a combination of processor(s) or (ii) portions of processor(s)/software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions; and (c) circuits, such as a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation, even if the software or firmware is not physically present.
This definition of circuitry applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term circuitry would also cover an implementation of merely a processor (or multiple processors) or portion of a processor and its (or their) accompanying software and/or firmware. The term circuitry would also cover, for example and if applicable to the particular claim element, a baseband integrated circuit or applications processor integrated circuit for a mobile phone or a similar integrated circuit in a server, a cellular network device, or other network device. The terms processor and memory may comprise but are not limited to in this application: (1) one or more microprocessors, (2) one or more processor(s) with accompanying digital signal processor(s), (3) one or more processor(s) without accompanying digital signal processor(s), (4) one or more special-purpose computer chips, (5) one or more field-programmable gate arrays (FPGAs), (6) one or more controllers, (7) one or more application-specific integrated circuits (ASICs), or detector(s), processor(s) (including dual-core and multiple-core processors), digital signal processor(s), controller(s), receiver, transmitter, encoder, decoder, memory (and memories), software, firmware, RAM, ROM, display, user interface, display circuitry, user interface circuitry, user interface software, display software, circuit(s), antenna, antenna circuitry, and circuitry.

CLAIMS:
1. A method comprising:
collating processor static data for at least one processor of a plurality of processors;
collating processor dynamic data for the at least one processor of the plurality of processors; and
selecting at least one processor from the plurality of processors for the execution of a kernel based at least in part on the collated processor static data for the at least one processor and the collated processor dynamic data for the at least one processor.
2. The method as claimed in Claim 1, wherein the processor dynamic data comprises information relating to the dynamic operating conditions of the at least one processor.
3. The method as claimed in Claim 2, wherein the dynamic operating conditions of the at least one processor comprises at least one of:
an indication of the computing load of the at least one processor for a period of time whilst the at least one processor is in an active state;
an indication of the power consumption of the at least one processor for the period of time whilst the at least one processor is in an active state;
an indication of the operating mode of the at least one processor for the period of time, wherein the operating mode may comprise at least one of: an active state; a sleeping state; an idle state; and a powered down state; and
an indication of current battery power level.
4. The method as claimed in Claims 1 to 3, wherein the processor static data comprises information relating to the hardware configuration of the at least one processor.
5. The method as claimed in Claim 4, wherein the information relating to the hardware configuration of the at least one processor comprises at least one of:
memory capacity of the at least one processor;
operating clock frequency of the at least one processor;
pipeline depth count of the at least one processor;
configuration of computational registers of the at least one processor;
performance of the at least one processor;
level of parallelism of the at least one processor;
expected power consumption of the at least one processor;
memory organization of the at least one processor;
latencies in the computing system of the at least one processor; and
direct memory access capability of the at least one processor.
6. The method as claimed in Claims 4 and 5, wherein the information relating to the hardware configuration of the at least one processor further comprises information relating to the type of the at least one processor.
7. The method as claimed in Claim 6, wherein the type of the at least one processor comprises at least one of:
central processing unit;
digital signal processor;
graphics processing unit;
image signal processors; and
image video engines.
8. The method as claimed in Claims 1 to 7, wherein the processor static data further comprises application related data, wherein application related data comprises at least one of:
maximum thread count;
vector width; and
register consumption for a thread.
9. The method as claimed in Claims 1 to 8, wherein the method further comprises: collating application developer specific data for the selection of the at least one processor from the plurality of processors, and wherein selecting the at least one processor from the plurality of processors for the execution of a kernel is further based at least in part on the collated application developer specific data.
10. The method as claimed in Claims 1 to 9, wherein selecting the at least one processor from the plurality of processors for the execution of a kernel is performed by an algorithm which combines at least two of:
the collated processor static data for the at least one processor;
the collated processor dynamic data for the at least one processor; and
the collated application developer specific data.
11. The method as claimed in Claims 1 to 10, wherein the method further comprises determining a context for the execution of the kernel on the selected at least one processor.
12. The method as claimed in Claim 11, wherein determining a context for the execution of the kernel on the selected at least one processor further comprises: checking a cache memory in order to determine if a previous instance of the kernel has been executed on the selected at least one processor.
13. The method as claimed in Claims 11 and 12, wherein the context is part of an open computing language framework host application.
14. The method as claimed in Claim 13, wherein the open computing language framework host application is an OpenCL host application.
15. The method as claimed in Claim 13, wherein the open computing language framework host application is a WebCL host application.
16. The method as claimed in Claims 1 to 15, wherein the at least one processor is a compute device, and the plurality of processors are a plurality of compute devices.
17. The method as claimed in Claims 1 to 16, wherein the selection of the at least one processor is deferred until the kernel is instantiated for execution on the at least one processor.
18. The method as claimed in Claim 17, wherein the selection of the at least one processor is further deferred until there is a transfer of data to the selected at least one processor.
19. An apparatus comprising at least one processor of a plurality of processors and at least one memory including computer code for one or more programs, the at least one memory and the computer code configured with the at least one processor of the plurality of processors to cause the apparatus at least to:
collate processor static data for the at least one processor of the plurality of processors;
collate processor dynamic data for the at least one processor of the plurality of processors; and
select at least one processor from the plurality of processors for the execution of a kernel based at least in part on the collated processor static data for the at least one processor and the collated processor dynamic data for the at least one processor.
20. The apparatus as claimed in Claim 19, wherein the processor dynamic data comprises information relating to the dynamic operating conditions of the at least one processor.
21. The apparatus as claimed in Claim 20, wherein the dynamic operating conditions of the at least one processor comprises at least one of:
an indication of the computing load of the at least one processor for a period of time whilst the at least one processor is in an active state;
an indication of the power consumption of the at least one processor for the period of time whilst the at least one processor is in an active state;
an indication of the operating mode of the at least one processor for the period of time, wherein the operating mode may comprise at least one of: an active state; a sleeping state; an idle state; and a powered down state; and
an indication of current battery power level.
22. The apparatus as claimed in Claims 19 to 21, wherein the processor static data comprises information relating to the hardware configuration of the at least one processor.
23. The apparatus as claimed in Claim 22, wherein the information relating to the hardware configuration of the at least one processor comprises at least one of:
memory capacity of the at least one processor;
operating clock frequency of the at least one processor;
pipeline depth count of the at least one processor;
configuration of computational registers of the at least one processor;
performance of the at least one processor;
level of parallelism of the at least one processor;
expected power consumption of the at least one processor;
memory organization of the at least one processor;
latencies in the computing system of the at least one processor; and
direct memory access capability of the at least one processor.
24. The apparatus as claimed in Claims 22 and 23, wherein the information relating to the hardware configuration of the at least one processor further comprises information relating to the type of the at least one processor.
25. The apparatus as claimed in Claim 24, wherein the type of the at least one processor comprises at least one of:
central processing unit;
digital signal processor;
graphics processing unit;
image signal processors; and
image video engines.
26. The apparatus as claimed in Claims 19 to 25, wherein the processor static data further comprises application related data, wherein application related data comprises at least one of:
maximum thread count;
vector width; and
register consumption for a thread.
27. The apparatus as claimed in Claims 19 to 26, wherein the apparatus is further caused to:
collate application developer specific data for the selection of the at least one processor from the plurality of processors, and wherein the apparatus caused to select the at least one processor from the plurality of processors for the execution of a kernel is further caused to select the at least one processor from the plurality of processors further based at least in part on the collated application developer specific data.
28. The apparatus as claimed in Claims 19 to 27, wherein selecting the at least one processor from the plurality of processors for the execution of a kernel is performed by an algorithm which combines at least two of:
the collated processor static data for the at least one processor;
the collated processor dynamic data for the at least one processor; and
the collated application developer specific data.
29. The apparatus as claimed in Claims 19 to 28, wherein the apparatus is further caused to determine a context for the execution of the kernel on the selected at least one processor.
30. The apparatus as claimed in Claim 29, wherein the apparatus caused to determine the context for the execution of the kernel on the selected at least one processor is further caused to:
check a cache memory in order to determine if a previous instance of the kernel has been executed on the selected at least one processor.
31. The apparatus as claimed in Claims 29 and 30, wherein the context is part of an open computing language framework host application.
32. The apparatus as claimed in Claim 31 , wherein the open computing language framework host application is an OpenCL host application.
33. The apparatus as claimed in Claim 31, wherein the open computing language framework host application is a WebCL host application.
34. The apparatus as claimed in Claims 19 to 33, wherein the at least one processor is a compute device, and the plurality of processors are a plurality of compute devices.
35. The apparatus as claimed in Claims 19 to 34, wherein the selection of the at least one processor is deferred until the kernel is instantiated for execution on the at least one processor.
36. The apparatus as claimed in Claim 35, wherein the selection of the at least one processor is further deferred until there is a transfer of data to the selected at least one processor.
37. An apparatus configured to:
collate processor static data for at least one processor of a plurality of processors;
collate processor dynamic data for the at least one processor of the plurality of processors; and
select at least one processor from the plurality of processors for the execution of a kernel based at least in part on the collated processor static data for the at least one processor and the collated processor dynamic data for the at least one processor.
38. The apparatus as claimed in Claim 37, wherein the processor dynamic data comprises information relating to the dynamic operating conditions of the at least one processor.
39. The apparatus as claimed in Claim 38, wherein the dynamic operating conditions of the at least one processor comprises at least one of:
an indication of the computing load of the at least one processor for a period of time whilst the at least one processor is in an active state;
an indication of the power consumption of the at least one processor for the period of time whilst the at least one processor is in an active state;
an indication of the operating mode of the at least one processor for the period of time, wherein the operating mode may comprise at least one of: an active state; a sleeping state; an idle state; and a powered down state; and
an indication of current battery power level.
40. The apparatus as claimed in Claims 37 to 39, wherein the processor static data comprises information relating to the hardware configuration of the at least one processor.
41. The apparatus as claimed in Claim 40, wherein the information relating to the hardware configuration of the at least one processor comprises at least one of:
memory capacity of the at least one processor;
operating clock frequency of the at least one processor;
pipeline depth count of the at least one processor;
configuration of computational registers of the at least one processor;
performance of the at least one processor;
level of parallelism of the at least one processor;
expected power consumption of the at least one processor;
memory organization of the at least one processor;
latencies in the computing system of the at least one processor; and
direct memory access capability of the at least one processor.
42. The apparatus as claimed in Claims 40 and 41, wherein the information relating to the hardware configuration of the at least one processor further comprises information relating to the type of the at least one processor.
43. The apparatus as claimed in Claim 42, wherein the type of the at least one processor comprises at least one of:
central processing unit;
digital signal processor;
graphics processing unit;
image signal processors; and
image video engines.
44. The apparatus as claimed in Claims 37 to 43, wherein the processor static data further comprises application related data, wherein application related data comprises at least one of:
maximum thread count;
vector width; and
register consumption for a thread.
45. The apparatus as claimed in Claims 37 to 44, wherein the apparatus is further configured to:
collate application developer specific data for the selection of the at least one processor from the plurality of processors, and wherein the apparatus configured to select the at least one processor from the plurality of processors for the execution of a kernel is further configured to select the at least one processor from the plurality of processors further based at least in part on the collated application developer specific data.
46. The apparatus as claimed in Claims 37 to 45, wherein selecting the at least one processor from the plurality of processors for the execution of a kernel is performed by an algorithm which combines at least two of:
the collated processor static data for the at least one processor;
the collated processor dynamic data for the at least one processor; and
the collated application developer specific data.
47. The apparatus as claimed in Claims 37 to 46, wherein the apparatus is further configured to determine a context for the execution of the kernel on the selected at least one processor.
48. The apparatus as claimed in Claim 47, wherein the apparatus configured to determine the context for the execution of the kernel on the selected at least one processor is further configured to: check a cache memory in order to determine if a previous instance of the kernel has been executed on the selected at least one processor.
49. The apparatus as claimed in Claims 47 and 48, wherein the context is part of an open computing language framework host application.
50. The apparatus as claimed in Claim 49, wherein the open computing language framework host application is an OpenCL host application.
51. The apparatus as claimed in Claim 49, wherein the open computing language framework host application is a WebCL host application.
52. The apparatus as claimed in Claims 37 to 51 , wherein the at least one processor is a compute device, and the plurality of processors are a plurality of compute devices.
53. The apparatus as claimed in Claims 37 to 52, wherein the selection of the at least one processor is deferred until the kernel is instantiated for execution on the at least one processor.
54. The apparatus as claimed in Claim 53, wherein the selection of the at least one processor is further deferred until there is a transfer of data to the selected at least one processor.
55. An apparatus comprising:
means for collating processor static data for at least one processor of a plurality of processors;
means for collating processor dynamic data for the at least one processor of the plurality of processors; and
means for selecting at least one processor from the plurality of processors for the execution of a kernel based at least in part on the collated processor static data for the at least one processor and the collated processor dynamic data for the at least one processor.
56. An apparatus as claimed in Claim 55 comprising means for performing the method as claimed in any of Claims 2 to 18.
57. A computer readable medium comprising a computer program code thereon, the computer program code configured to:
collate processor static data for at least one processor of a plurality of processors;
collate processor dynamic data for the at least one processor of the plurality of processors; and
select at least one processor from the plurality of processors for the execution of a kernel based at least in part on the collated processor static data for the at least one processor and the collated processor dynamic data for the at least one processor.
58. A computer readable medium comprising a computer program code thereon, the computer program code configured to realize the actions of the method of any of Claims 1 to 18.

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/FI2012/050519 WO2013178864A1 (en) 2012-05-29 2012-05-29 A method and apparatus for deferring processor selection


Publications (1)

Publication Number Publication Date
WO2013178864A1 true WO2013178864A1 (en) 2013-12-05


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060190942A1 (en) * 2004-02-20 2006-08-24 Sony Computer Entertainment Inc. Processor task migration over a network in a multi-processor system
US20110145616A1 (en) * 2009-12-16 2011-06-16 Bohuslav Rychlik System and method for controlling central processing unit power in a virtualized system
US8086828B2 (en) * 2005-12-16 2011-12-27 Nvidia Corporation Multiprocessor computing systems with heterogeneous processors
WO2012040684A2 (en) * 2010-09-25 2012-03-29 Intel Corporation Application scheduling in heterogeneous multiprocessor computing platforms


