
GB2564863A - Containerized application platform - Google Patents

Containerized application platform

Info

Publication number
GB2564863A
GB2564863A GB1711868.8A GB201711868A GB2564863A GB 2564863 A GB2564863 A GB 2564863A GB 201711868 A GB201711868 A GB 201711868A GB 2564863 A GB2564863 A GB 2564863A
Authority
GB
United Kingdom
Prior art keywords
containers
container
application
central controller
resource usage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1711868.8A
Other versions
GB201711868D0 (en)
Inventor
Drozdov Ignat
Drozdov Aleksandr
Drozdova Natalja
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bering Ltd
Original Assignee
Bering Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bering Ltd filed Critical Bering Ltd
Priority to GB1711868.8A priority Critical patent/GB2564863A/en
Publication of GB201711868D0 publication Critical patent/GB201711868D0/en
Publication of GB2564863A publication Critical patent/GB2564863A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1479Generic software techniques for error detection or fault masking
    • G06F11/1482Generic software techniques for error detection or fault masking by means of middleware or OS functionality
    • G06F11/1484Generic software techniques for error detection or fault masking by means of middleware or OS functionality involving virtual machines
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/3003Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/301Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is a virtual computing platform, e.g. logically partitioned systems
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3442Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for planning or managing the needed capacity
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5077Logical partitioning of resources; Management or configuration of virtualized resources
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/3003Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45591Monitoring or debugging support

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Debugging And Monitoring (AREA)

Abstract

One or more host computer devices 104, 106 execute applications. The host devices may be connected by a network 108. The host devices create resource constrained containers 112, 114, 116 and execute the applications in the containers. A central controller 110 monitors the resource usage of the containers and the resource availability of the host devices. The controller may execute in a central container 111, which may be on a different host device 102 to the containers in which the applications execute. The resources of the host devices are allocated to the containers based on the monitored resources. Each container may have a dedicated module to monitor its resource usage. Future resource usage may be predicted based on past usage. Containers may be added or removed based on the monitored resources. The applications may process extremely large datasets.

Description

Containerized application platform
Field of invention
The present invention relates to the implementation of applications across one or more devices, in particular to the use of containers to execute an application across one or more devices in a networked computing system.
Background
The ability to process extremely large data sets (sometimes referred to as big data) is a growing concern. Traditional data processing software executing on a single central processing unit (CPU) is unable to process such large data sets effectively. In order to process such large data sets, it is known to use distributed CPUs spread across a device or, more often, a network to handle such processing tasks (for example generating statistical models based on the data). However, efficiently coordinating processing tasks across multiple CPUs/CPU cores is difficult.
In order to allow for coordination of processing tasks, it is known to use virtualization and containerization in networked computing systems to provide environments in which applications can execute. These techniques provide an isolated environment within a device/network, which is assigned a pre-allocated portion of the resources available on the device/network. An application can then be executed within the isolated environment using the pre-allocated resources. For example, such techniques are used in data centres for the purpose of enabling scalable cloud infrastructures.
Various virtualization techniques are known. Full virtualisation techniques typically provide a simulation of substantially the whole hardware of the device on which they are implemented (e.g. a virtual machine), and allow, for example, a guest operating system to be implemented on the same hardware that is running the full virtualisation processes. Para-virtualisation does not simulate an entire hardware environment, but executes programs in isolated domains on a device.
However, virtualization techniques are typically resource intensive, since they involve the emulation of hardware, and thus lead to large performance overheads. Moreover, they must be deployed on pre-allocated resources, meaning that the applications running in the virtual machine are only able to utilise those allocated resources, and the allocated resources are not available for programs running on the device/network outside the virtual machine. In addition, such techniques typically have a large impact on a variety of operations including cloning, backing up, snapshots, start-up and shutdown.
An alternative to traditional virtualization is to use containerization. Containerization is an operating-system-level virtualisation technique, which implements containers (sometimes referred to as user-space instances) that provide isolated operating system environments, in which processes can run. Containers can thus provide an isolated portion of the operating system running on the device/network. Because containerization does not involve the complete emulation of hardware components, it is less resource intensive than traditional virtualization techniques. Examples of containerization include Linux containers (LXC) wherein each container provides an isolated Linux system, and Docker by Docker Inc. LXC can be implemented using Linux Container Daemon (LXD).
Typically, containerization is implemented such that each container provides a dedicated environment for one particular application component, in other words, each container has a different concern. For example, a web application stack might be separated into a web application, a database and an in-memory cache, each executed in its own container. In practice, this separation of application components into different containers often means executing one process per container. The different application components executing in the containers are then managed centrally. This central management involves managing tasks including basic application configuration, deployment, networking, storage and orchestration. Accordingly, because these tasks are all dependent on the central management, the portability of the containers is limited, and additional overhead is involved in a production setting.
Traditionally containerization has been seen as less secure than full virtualization techniques.
Accordingly there is a need to provide a robust and flexible technique for executing an application across one or more computing devices, particularly in the field of processing big data.
Summary of invention
In order to mitigate at least some of the issues above, there is provided a method for implementing an application platform across one or more devices, the method comprising, by a central controller: creating a plurality of containers on the one or more host devices, each container providing an isolated execution environment on a respective device; executing one or more applications across the plurality of containers; while executing the one or more applications, dynamically monitoring resource usage for each of the plurality of containers; dynamically monitoring resource availability of the one or more host devices; and based on the monitored resource usage and resource availability, dynamically allocating resources of the one or more host devices to the plurality of containers.
There is also provided a system for implementing an application platform, the system comprising: one or more host devices; a central controller; a plurality of containers provided on the one or more host devices; one or more applications executing across the plurality of containers; wherein the system is configured to, while executing the one or more applications: dynamically monitor resource usage for each of the plurality of containers; dynamically monitor resource availability of the one or more host devices; and based on the monitored resource usage and resource availability, dynamically allocate resources of the one or more host devices to the plurality of containers.
Advantageously the present invention allows applications, for example applications for processing extremely large datasets, to be run on existing computing systems without the need to install the application at the host system, whilst allowing the resources consumed by the application to be adjusted automatically in real time. Thus the application is able, for example, to scale up rapidly in response to rising demand, without tying up unused resources when demand is low. The invention also allows the application to be deployed anywhere and managed remotely.
In a preferred example, each of the plurality of containers is provided with a monitoring module, the monitoring module configured to monitor the resource usage for an associated container and send resource usage information to the central controller; the resource usage information being received, for example, at the central controller. Advantageously this further contributes to improving resource management by providing a particularly effective means for monitoring resource usage.
In one example, future container resource usage is predicted, for example using the central controller, based on historical resource usage. Beneficially this allows resources to be allocated pre-emptively, further contributing to the efficient management of computing resources.
Preferably each of the plurality of containers is dynamically monitored for errors; and if an error is detected for a first container of the plurality of containers, the first container is removed and a second container created to replace the first container. Advantageously this provides a robust, rapidly self-healing system, which allows the processing (for example of extremely large datasets) to continue even in the event that a container fails.
Preferably additional containers are added or removed based on the monitored resource usage and resource availability. Beneficially this allows containers to be added/removed as additional computers or hardware resources are added/removed from the network. This further enhances effective resource management, and helps the platform to be scaled up or down as required rapidly and effectively.
Optionally the central controller is executed in a central container. Beneficially this avoids the need to install any components on a host device(s), providing a highly portable platform that can be implemented in any networked computer system.
Optionally each container is assigned a static IP address. This advantageously enhances the security and portability of the platform, by ensuring that containers can communicate with one another over an internal network without the need to use external networking provisions.
In the preferred embodiment, the one or more applications includes an application for processing extremely large datasets (for example an application for performing statistical modelling and/or analysis on an extremely large dataset). In this embodiment, the applications preferably also include: an application programming interface module, configured to provide programs running outside the plurality of containers with access to functionality provided by the application for processing extremely large datasets (advantageously providing a standard interface that can be consumed by any client program operating outside the containers); a web user interface module configured to implement a web-based user interface for interacting with the application for processing extremely large datasets (advantageously providing an effective means for implementing a user interface for the platform without the need to install components on the one or more host devices, and allowing user access from remote locations); and/or a request execution and scheduling module configured to queue and schedule requests sent to the application for processing extremely large datasets (which advantageously enhances system stability, consistency and responsiveness).
Brief description of the drawings
Embodiments of the present invention will now be described, by way of example only, with reference to the following figures in which:
Figure 1 shows a schematic of a networked computing system implementing the present invention.
Figure 2 shows a schematic of a container configured to execute an application in accordance with the present invention.
Figure 3 shows a method for implementing an application across one or more devices in accordance with the present invention.
Figure 4 shows a schematic of a specific example of a networked computing system.
Figure 5 shows a schematic of a specific example of a container configured to execute an artificial intelligence application.
Detailed description
Figure 1 shows a schematic of a networked computing system 100 on which the present invention is implemented. The system 100 includes one or more computing devices 102, 104, 106 linked over a network 108 (for example a wired or wireless local area network).
The system 100 includes a central controller 110. The central controller 110 is preferably a software component (however the skilled person would appreciate that the controller could be implemented as a hardware component). Optionally, the central controller 110 is executed in a container 111 - advantageously this allows the present invention to be employed without the need to install any specific software on the one or more devices 102, 104, 106. Preferably the central controller 110 is replicated for redundancy, for example on a separate device 102, 104, 106. The controller 110 facilitates manipulation of containers (such as the containers 112, 114, 116), and performs actions on the containers (such as allocating hardware resources to the containers, executing software modules in the containers, monitoring resource requirements for the containers). Preferably the controller 110 performs such manipulation/actions via a simple application programming interface (API). Advantageously, the central controller 110 allows for effective and coordinated management of the components of the system 100.
The central controller 110 is configured to implement a plurality of containers 112, 114, 116 across the one or more computing devices 102, 104, 106, as will be explained in further detail below. In one example each device 102, 104, 106 is provided with (i.e. hosts) at least one container 112, 114, 116, for example one container 112, 114, 116 can be provided per device 102, 104, 106, or per CPU core present in the device 102, 104, 106. Each container 112, 114, 116 provides an isolated execution environment for an application to run in. The central controller 110 is configured to manage the containers 112, 114, 116, for instance by dynamically allocating hardware resources to the containers 112, 114, 116, monitoring the containers 112, 114, 116 for errors, and adding/removing containers, as will be discussed in more detail below. The central controller 110 advantageously provides a flexible tool for managing the lifecycle of all the containers 112, 114, 116 in the system 100 from one central location (for example a host device or a dedicated server set up to manage remote containers) regardless of the location of the containers themselves.
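As a purely illustrative sketch, the following Go program shows the general shape of the monitor-and-allocate loop that such a central controller might run; the metric fields, thresholds and helper functions are assumptions made for the example rather than details of the invention.

```go
package main

import (
	"fmt"
	"time"
)

// Usage captures per-container metrics of the kind reported to the controller.
type Usage struct {
	Container  string
	CPUPercent float64
	UsedRAMMB  int
}

// Availability captures spare capacity on a host device.
type Availability struct {
	Host      string
	FreeCores float64
	FreeRAMMB int
}

// rebalance is a placeholder for the controller's allocation decision.
func rebalance(usage []Usage, hosts []Availability) {
	for _, u := range usage {
		switch {
		case u.CPUPercent > 80 && len(hosts) > 0:
			fmt.Printf("grant more resources to %s\n", u.Container)
		case u.CPUPercent < 20:
			fmt.Printf("reclaim resources from %s\n", u.Container)
		}
	}
}

func main() {
	// A real controller would loop for the lifetime of the platform.
	for i := 0; i < 3; i++ {
		usage := []Usage{{"ai-application", 85, 2048}} // reported by monitoring modules
		hosts := []Availability{{"node-1", 3.5, 8192}} // reported by host devices
		rebalance(usage, hosts)                        // dynamic allocation decision
		time.Sleep(time.Second)
	}
}
```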
The containers 112, 114, 116 are kept on an internal network within the one or more devices 102, 104, 106. Advantageously, by separating the containers from the network 108, the containers 112, 114, 116 are more secure and are shielded from any potential conflicts with the existing network set-up. The containers 112, 114, 116 are assigned static IP addresses, and configured for constant up-time: as a result, when a container needs to contact another container, it knows where to find it, and it will always have something which can respond to it. Advantageously this set-up improves the portability of the system, as well as its security. In this example the containers 112, 114, 116 are kept unprivileged. Advantageously this helps to isolate the containers 112, 114, 116 from the devices 102, 104, 106, further enhancing security.
Figure 2 shows a container 200, for example one of the containers 112, 114, 116 shown in figure 1. The container 200 has access to hardware resources 202 of the device 102, 104, 106 on which it is implemented, the hardware resources 202 having been assigned to the container 200 by the central controller 110 as described below. The hardware resources 202 include a portion of memory resources of the device 102, 104, 106 on which the container 200 is implemented, and a portion of the CPU resources of the device 102, 104, 106 on which the container 200 is implemented. In one example, the hardware resources also include input/output resources of the device 102, 104, 106 on which the container 200 is implemented. The container 200 provides an isolated execution environment for the execution of a software module, the software module running on the hardware resources 202 assigned to the container 200. In one example, the software module is one or more of an application 204 and additional modules 208, 210, 212, 214 (see below). The container 200 is of a type suitable for use on the operating system of the one or more devices 102, 104, 106. In one example, in which the one or more devices 102, 104, 106 are running a Linux-based operating system, the container 200 is an LXC container.
Preferably in addition to the software module, the container 200 includes a monitoring module 206 configured to monitor the resources used by the container 200, for example resources from the assigned hardware resources 202 that are used by the software module (for example the application 204) and the monitoring module 206 itself. For example, the monitoring module 206 monitors load data (for example a number of data processing requests sent to the application 204 by external programs) and/or metrics (such as CPU load, core temperature, memory utilization and swap memory utilization) for the container 200. The monitoring module 206 is configured to send resource usage information describing the monitored resources used by the container 200 to the central controller 110, as described in more detail below. Advantageously the resource usage information can be used by the central controller 110 when dynamically allocating resources.
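The following Go sketch illustrates, under stated assumptions, what a monitoring module of this kind might look like: it reads a load figure from the host (here the Linux /proc/loadavg file) and posts it to the central controller as JSON. The controller address, reporting interval and payload fields are placeholders, not details taken from the invention.

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"os"
	"strconv"
	"strings"
	"time"
)

// Metrics mirrors the kind of load data the monitoring module reports.
type Metrics struct {
	Container string  `json:"container"`
	LoadAvg1m float64 `json:"load_avg_1m"`
	Timestamp int64   `json:"timestamp"`
}

// loadAvg reads the 1-minute load average from /proc/loadavg on a Linux host.
func loadAvg() float64 {
	raw, err := os.ReadFile("/proc/loadavg")
	if err != nil {
		return 0
	}
	fields := strings.Fields(string(raw))
	if len(fields) == 0 {
		return 0
	}
	v, _ := strconv.ParseFloat(fields[0], 64)
	return v
}

func main() {
	controller := "http://10.0.3.2:8080/metrics" // placeholder address for the central controller
	for {
		m := Metrics{Container: "ai-application", LoadAvg1m: loadAvg(), Timestamp: time.Now().Unix()}
		body, _ := json.Marshal(m)
		resp, err := http.Post(controller, "application/json", bytes.NewReader(body))
		if err != nil {
			log.Println("failed to send resource usage information:", err)
		} else {
			resp.Body.Close()
		}
		time.Sleep(30 * time.Second)
	}
}
```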
Optionally the container 200 includes a software module selected from additional modules including one or more of an application programming interface module 208 configured to provide an interface that can be consumed by external applications to interact with the application 204, for example by sending requests for data processing to the application 204 (the external applications either executing on the one or more devices 102, 104, 106, or executing on a remote device and communicating with the container 200 via a network connection); a web user interface module 210 configured to provide a user interface for interacting with the application 204 or other software modules; a data store module 212 configured to store data used by and/or generated by the application 204; and a requests execution and scheduling module 214 configured to manage the requests sent to the application 204 from other programs via the application programming interface module. Advantageously the inclusion of such additional modules enables a user/programs to interact with the application 204 (for example assign the application 204 tasks to perform, and receive data generated by the application 204) from outside the container 200.
In one embodiment, a system of containers 112, 114, 116, 200 is implemented, each container running one of the software modules mentioned above and thereby providing a platform for executing the application 204, for example: a first container including an application 204, a second container including an application programming interface module 208, a third container including a web user interface module 210, a fourth container including a data store module 212, and a fifth container including a requests execution and scheduling module 214. Preferably each of the containers also includes a monitoring module 206 as described above. In this embodiment the containers are configured to be in communication with each other, and together form a platform for executing the application 204 over the one or more devices 102, 104, 106. Advantageously the hardware resources 202 assigned to each part of the platform (i.e. to the software modules included in each container) can be dynamically and independently changed as required, providing a platform with improved flexibility. In this embodiment, the containers are preferably each assigned a static IP address. Advantageously, this allows the containers to communicate with one another within an internal network without conflicting with any existing network set-up deployed on the network 108. This enhances both the security and portability of the system.
Figure 3 shows a method 300 for implementing an application platform (for example for the application 204 discussed above in relation to figure 2) across one or more devices 102, 104, 106 in accordance with the preferred embodiment.
The method begins in step S302 in which the central controller 110 is implemented and a plurality of containers 112, 114, 116, 200 are created across one or more host computing devices 102, 104, 106.
In the preferred embodiment, the central controller 110 is first implemented (for example in its own central container 111), the central controller 110 having access to data that describes the software modules discussed above that are to be executed in each of the plurality of containers 112, 114, 116, and configuration information, such as the resources to which the programs require access. This data constitutes a predefined specification for the containers, and may be stored as a manifest file. The controller 110 then creates the plurality of containers 112, 114, 116 across the one or more host computing devices 102, 104, 106 and implements the relevant software modules in each container 112, 114, 116 based on this data. Advantageously, this deployment approach allows the allocation of resources for different containers to be adjusted at the deployment stage more efficiently (in addition to the dynamic allocation discussed in more detail below). For example, instead of allocating the same resources to all containers as has traditionally been the case, increased hardware capacity is given preferentially to computationally intensive containers (e.g. containers that manipulate large datasets, and/or containers that carry out CPU/GPU-intensive calculations).
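A minimal deployment sketch is given below, assuming LXD's command-line client as one possible way to realise such a predefined specification; the container names, image and resource limits are illustrative only.

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
)

// ContainerSpec is an assumed, simplified form of the predefined specification.
type ContainerSpec struct {
	Name   string
	Image  string
	CPUs   int
	Memory string // e.g. "4GB"
}

// create launches a container and applies its initial resource limits.
func create(spec ContainerSpec) error {
	cmds := [][]string{
		{"lxc", "launch", spec.Image, spec.Name},
		{"lxc", "config", "set", spec.Name, "limits.cpu", fmt.Sprint(spec.CPUs)},
		{"lxc", "config", "set", spec.Name, "limits.memory", spec.Memory},
	}
	for _, c := range cmds {
		if out, err := exec.Command(c[0], c[1:]...).CombinedOutput(); err != nil {
			return fmt.Errorf("%v: %s", err, out)
		}
	}
	return nil
}

func main() {
	// Computationally intensive containers get more capacity at deployment time.
	specs := []ContainerSpec{
		{Name: "ai-application", Image: "ubuntu:22.04", CPUs: 4, Memory: "8GB"},
		{Name: "web-ui", Image: "ubuntu:22.04", CPUs: 1, Memory: "1GB"},
	}
	for _, s := range specs {
		if err := create(s); err != nil {
			log.Fatal(err)
		}
	}
}
```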
Each container 112, 114, 116, 200 includes one or more software modules as described above. At least one container 112, 114, 116, 200 includes an application 204 (such as an artificial intelligence application for processing extremely large datasets). Optionally one or more containers 112, 114, 116, 200 includes one or more additional modules 208, 210, 212, 214 (for example one or more of an application programming interface module 208, a web user interface module 210, a data store module 212 and a requests execution and scheduling module 214, as described above).
Preferably a monitoring module 206 (as described above) is provided in each of the containers 112, 114, 116, 200.
Preferably the containers 112, 114, 116, 200 are also assigned a static IP address during creation as described above. This enhances both the security and portability of the system.
In step S306 the resource requirements of the application 204 running in each of the containers 112, 114, 116, 200 are monitored. Monitoring resource requirements preferably includes monitoring the assigned hardware resources 202 used by the container 200 (for example the hardware resources 202 used by the application 204, the monitoring module 206 and any additional modules 208, 210, 212, 214). Monitoring resource usage preferably includes monitoring one or more of a CPU load average representing the CPU load over a predetermined period of time; a current CPU load; a percentage CPU usage for every CPU core used by the application 204; an amount of used RAM; an amount of free RAM; an amount of used SWAP memory; an amount of free SWAP memory; and a physical CPU core temperature. Preferably step S306 is performed continually or periodically for all containers 112, 114, 116, 200 whilst the application 204 is running. Step S306 is performed by the central controller 110, and may optionally include receiving application resource information gathered by the monitoring module 206.
Optionally future resource usage is predicted in step S308, based on historical resource usage. In this embodiment the central controller 110 stores data describing resource usage for each container 112, 114, 116, 200 over time (for example the resource usage information received from monitoring modules 206 associated with each container 112, 114, 116, 200). The central controller 110 then analyses the stored data to predict future resource usage for each container 112, 114, 116, 200 for a given set of conditions. The analysis preferably takes the form of implementing a machine learning algorithm, for example a random forest, support vector machine, linear model or deep belief network algorithm. In one example, the central controller 110 analyses historical data to determine trends in percentage CPU and/or RAM usage, and predicts future percentage CPU and/or RAM usage based on the determined trends.
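The invention contemplates machine learning models for this prediction; purely as a simplified illustration, the sketch below extrapolates the next percentage CPU usage from recent samples with a least-squares linear trend.

```go
package main

import "fmt"

// predictNext fits y = a + b*x to the samples (x = 0..n-1) and returns the
// extrapolated value at x = n.
func predictNext(samples []float64) float64 {
	n := float64(len(samples))
	if n < 2 {
		if n == 1 {
			return samples[0]
		}
		return 0
	}
	var sumX, sumY, sumXY, sumXX float64
	for i, y := range samples {
		x := float64(i)
		sumX += x
		sumY += y
		sumXY += x * y
		sumXX += x * x
	}
	b := (n*sumXY - sumX*sumY) / (n*sumXX - sumX*sumX)
	a := (sumY - b*sumX) / n
	return a + b*n
}

func main() {
	cpu := []float64{41, 48, 55, 63, 70} // historical percentage CPU usage
	fmt.Printf("predicted next CPU usage: %.1f%%\n", predictNext(cpu))
}
```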
In a further example of step S308, the central controller 110 analyses historical instances of container failures and anticipates possible future failures. For example the central controller 110 analyses the data describing resource usage for a particular container that is known to have failed, the data corresponding to a period of time preceding and/or including the time at which the container failed. The controller 110 is configured to determine a set of conditions that led to the failure of the particular container. The controller 110 then compares the data describing resource usage for each running container 112, 114, 116, 200 to the determined set of conditions, thereby determining whether any of the running containers 112, 114, 116, 200 are at risk of failure. Advantageously, the central controller 110 preferably creates an additional container if it is determined that one of the running containers 112, 114, 116, 200 is at risk of failure. For example the central controller 110 creates a duplicate of the container determined to be at risk of failure. Beneficially this allows the system to continue uninterrupted (by using the duplicated container) in the case that the container determined to be at risk of failure actually fails.
In step S310 the central controller 110 is configured to monitor resource availability at the one or more host devices, including hardware resources not currently assigned to any of the containers 112, 114, 116, 200.
In step S312, the central controller 110 dynamically allocates resources of the one or more devices to the plurality of containers 112, 114, 116, 200 based on the monitored resource usage (and optionally predicted future resource usage) and resource availability. For example, if a particular container 112, 114, 116, 200 is placing a high demand (or is predicted to place a high demand) on the hardware resources 202 assigned to that container 112, 114, 116, 200, the central controller 110 assigns further hardware resources from the one or more devices 102, 104, 106 to that container 112, 114, 116, 200, as long as the one or more devices 102, 104, 106 have further hardware resources available. Similarly, if a particular container 112, 114, 116, 200 is placing a low demand (or is predicted to place a low demand) on the hardware resources 202 assigned to that container 112, 114, 116, 200, the central controller 110 reduces the hardware resources 202 assigned to that container 112, 114, 116, 200.
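By way of illustration only, the following Go sketch shows one possible form of this decision logic, with thresholds chosen arbitrarily; the run-time adjustment uses LXD's "lxc config set ... limits.cpu" command as one assumed mechanism for changing a running container's allocation without restarting it.

```go
package main

import (
	"fmt"
	"os/exec"
)

// applyCPULimit adjusts a running container's CPU allocation via LXD; the
// change takes effect without restarting the container.
func applyCPULimit(container string, cores int) error {
	return exec.Command("lxc", "config", "set", container, "limits.cpu", fmt.Sprint(cores)).Run()
}

// decide returns a new core allocation from current usage: scale up under
// high demand (if the host has spare cores), scale down under low demand.
func decide(current int, cpuPercent float64, hostFreeCores int) int {
	switch {
	case cpuPercent > 80 && hostFreeCores > 0:
		return current + 1
	case cpuPercent < 20 && current > 1:
		return current - 1
	}
	return current
}

func main() {
	current, usage, free := 2, 91.0, 3 // illustrative values
	if next := decide(current, usage, free); next != current {
		if err := applyCPULimit("ai-application", next); err != nil {
			fmt.Println("adjustment failed:", err)
		}
	}
}
```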
In some embodiments, the central controller 110 also creates additional containers in step S312 based on the monitored resource usage and the resources available at the one or more devices 102, 104, 106. For example if a particular container 112, 114, 116, 200 is placing a high demand (or is predicted to place a high demand) on the hardware resources 202 assigned to that container 112, 114, 116, 200, the central controller 110 creates one or more new containers (to spread the load of the application across more containers). For example the new containers may replicate the functionality of existing containers, such as creating an additional container running the application 204. In a further example, if the central controller 110 detects that there has been an increase in the amount of resources available across the one or more devices 102, 104, 106 (for example if a new device is connected to the network 108), the central controller 110 creates one or more additional containers. During the creation of a new container, the controller 110 is configured to run any scripts as required to ensure that the new container integrates into the network 108 and system 100 without conflicting with any other containers 112, 114, 116 already running. If, after creating one or more new containers, the central controller 110 detects that there has been a reduction in the amount of resources available across the one or more devices 102, 104, 106 (for example if one of the devices is powered off or disconnected from the network 108), the central controller 110 deletes one or more of the new containers in order to free up hardware resources for other purposes.
Advantageously, the method 300 provides improved resource management in the context of executing an application across one or more devices. In contrast to known methods for executing an application across one or more devices, the method 300 in combination with the central controller 110 allows hardware resources 202 available to each container to be changed, dynamically, after the container has been created. Thus the resources available to the application 204 can be maximised wherever possible, whilst still allowing the one or more devices 102, 104, 106 to perform other tasks. Moreover, the invention is able to increase or decrease required hardware resources depending on real-time demand without terminating processes running in the containers 112, 114, 116, 200. Thus the present invention provides a better computing system, in that hardware resources are more effectively managed. It also allows for enhanced system planning.
Preferably, in addition to monitoring resource usage, the central controller is configured to monitor each container 112, 114, 116, 200 for errors in step S314. The containers 112, 114, 116, 200 preferably have their own mechanisms for attempting to detect and solve issues related to the key services they offer; however, the error monitoring provided by the central controller 110 further enhances system stability, for example in a situation in which a container may become non-responsive. In one example, the central controller 110 determines that a container is in an error condition based on: the receipt of a signal from one of the components of the container (for example the application 204, monitoring module 206, or additional module 208, 210, 212, 214) indicating that the component is in an error condition; a failure of one or more of the components of the container to respond to a request; a failure of the monitoring module 206 to send resource usage information; and/or the monitored resource usage exceeding a threshold (for example a CPU temperature exceeding a threshold value). For example the central controller 110 can monitor whether a container is in an error condition based on the resource usage information sent by the monitoring module 206. In this way, the central controller 110 advantageously provides real-time error monitoring.
If an error is detected for a container 112, 114, 116, 200, that container is deleted by the central controller 110. Preferably the central controller 110 maintains a database of all running containers and their state. Accordingly in the event that a container is deleted on discovery of an error with that container, the central controller 110 creates a new container to replace the deleted container. Advantageously, this allows the system to be self-healing. The system can detect and repair any problems encountered easily, by replacing entire containers. Such repair is also fast - for example LXC containers spin up very quickly, allowing for near instant repair in the event of a problem. Furthermore, the present technique does not require the application platform to stop running entirely. Thus repairs can be made without interrupting the execution of the application platform (for example there is no need to restart the entire system of containers 112, 114, 116, 200 in the event that an error is detected in one of the containers). Beneficially these provisions maximise the availability of the system 100.
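A hypothetical sketch of this replace-on-error behaviour is shown below; the health-check endpoint, registry structure and use of the LXD command-line client are assumptions for the example rather than details of the invention.

```go
package main

import (
	"log"
	"net/http"
	"os/exec"
	"time"
)

// entry is one record in the controller's database of running containers.
type entry struct {
	Name   string
	Image  string
	Health string // health-check URL assumed to be exposed by the container's modules
}

// healthy reports whether the container responds to its health endpoint.
func healthy(url string) bool {
	client := http.Client{Timeout: 3 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return false
	}
	resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// replace removes the failed container and spins up a fresh copy.
func replace(e entry) {
	exec.Command("lxc", "delete", e.Name, "--force").Run()
	if err := exec.Command("lxc", "launch", e.Image, e.Name).Run(); err != nil {
		log.Printf("recreate %s failed: %v", e.Name, err)
	}
}

func main() {
	registry := []entry{{Name: "ai-application", Image: "ubuntu:22.04", Health: "http://10.0.3.10:9000/health"}}
	for _, e := range registry {
		if !healthy(e.Health) {
			log.Printf("%s unresponsive, replacing", e.Name)
			replace(e)
		}
	}
}
```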
Similarly, in step S314 the central controller 110 is preferably configured to use the database of all running containers to create new, replacement containers after a power loss experienced by one or more of the devices 102, 104, 106 on which the containers 112, 114, 116, 200 are implemented. For example, if the central controller 110 determines that power has been lost at one of the devices 102, 104, 106 (for example if communication is lost with the device and/or the container 112, 114, 116, 200 implemented on the device), then once the central controller 110 determines that power has been returned to the device 102, 104, 106, the central controller 110 creates new container(s) (including the application 204, monitoring module 206 and any additional modules 208, 210, 212, 214) on the device 102, 104, 106 to replace the containers 112, 114, 116, 200 that were lost during the power loss. Again, this advantageously provides a rapidly self-healing system.
Accordingly the performance of optional step S314 improves system stability. At any point, a running container 112, 114, 116, 200 can be duplicated, taken off-line, replaced with another one, or copied into another device on the network 108 without interrupting normal system function. If a container 112, 114, 116, 200 fails, the device 102, 104, 106 that was hosting the container 112, 114, 116, 200 remains stable and a failed container can be brought back online without jeopardizing user experience.
Whilst optional step S314 is shown as being subsequent to the dynamic monitoring of resource usage and dynamic assignment of resources of steps S306 to S312, it will be appreciated that step S314 can be performed before or in parallel to these steps.
The method optionally comprises the additional step of using the central controller 110 to mount folders or disks directly into a particular container 112, 114, 116, 200. Advantageously this allows for efficient sharing of any data that might be needed by the containers remotely.
In one embodiment there is provided a computer readable medium (for example a transitory or non-transitory computer readable medium) storing instructions, which when executed by one or more computing devices, cause the one or more devices to perform the method of figure 3 as described above.
Example of the invention
A specific example of an embodiment of the invention will now be described with reference to figures 4 and 5. The specific example is configured to be implemented in Linux environments, and uses Linux Containers virtualization (LXC) in combination with a suitable daemon or controller that allows manipulation of LXC containers, for example a Linux Container Daemon (LXD). Both LXC virtualization and LXD are known in the art. Whilst this example utilises LXC, it is noted that the present invention can be implemented on any system that supports containers, and is not limited to systems supporting LXC in particular.
Figure 4 shows a schematic of a specific example of a networked computing system 400. The system 400 corresponds to the system 100 of figure 1, and is Linux-based. The system 400 includes a master device 404 running a Linux-based operating system 405 and at least one node device 402, also running a Linux-based operating system 403. A central container 410 (such as container 111 of figure 1) is implemented on the master device 404 using LXC functionality. The central container 410 comprises a central controller 411, such as the central controller 110 of figure 1 as described above. The central container 410 is replicated for redundancy on the master device 404 and/or a node device. One or more containers 406, 408 (such as containers 112, 114, 116, 200 of figures 1 and 2) are implemented on the at least one node device 402 using LXC functionality. The containers 406, 408 are provided access to hardware resources 420 of the node device 402 using LXC functionality. Similarly, the central container 410 is provided access to hardware resources 422 of the master device 404 using LXC functionality.
The central controller 411 can use LXD provisions to manipulate the LXC containers 406, 408 as shown in figure 4. As an alternative, different provisions can be used that allow manipulation of LXC containers.
System configuration is preferably described using a manifest file. The manifest file is written using a structured schema (for example utilising a markup language such as XML or YAML), which describes software and hardware specifications for each container, the environment in which the containers are running, and the ways in which they communicate. Hardware resources 420, 422 (for example allocated RAM, CPU, GPU resources), networking options (for example an IP address for each container, cross-container communications channels, and names of other containers that are expected to work in unison during deployment), and contents of each container are preferably described in the manifest file. The central controller 411 is configured to load the manifest file and build the containerized system according to the specifications given in the manifest file. Beneficially this manifest-driven deployment approach creates an opportunity to fine-tune hardware resources at the system's deployment stage more efficiently.
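One possible shape for such a manifest is sketched below as Go structures with an embedded YAML example; the field names and values are assumptions rather than the schema used by the invention.

```go
package main

import (
	"fmt"
	"log"

	"gopkg.in/yaml.v3"
)

// ContainerSpec is an assumed per-container entry: software contents,
// hardware allocation and networking options.
type ContainerSpec struct {
	Name      string   `yaml:"name"`
	IP        string   `yaml:"ip"`
	CPUs      int      `yaml:"cpus"`
	MemoryMB  int      `yaml:"memory_mb"`
	Modules   []string `yaml:"modules"`
	DependsOn []string `yaml:"depends_on"`
}

// Manifest describes the whole containerized system.
type Manifest struct {
	Containers []ContainerSpec `yaml:"containers"`
}

const example = `
containers:
  - name: ai-application
    ip: 10.0.3.10
    cpus: 4
    memory_mb: 8192
    modules: [ai-engine, monitoring]
  - name: rest-api
    ip: 10.0.3.11
    cpus: 1
    memory_mb: 1024
    modules: [rest-api, monitoring]
    depends_on: [ai-application]
`

func main() {
	var m Manifest
	if err := yaml.Unmarshal([]byte(example), &m); err != nil {
		log.Fatal(err)
	}
	for _, c := range m.Containers {
		fmt.Printf("%s -> %d CPUs, %d MB, ip %s\n", c.Name, c.CPUs, c.MemoryMB, c.IP)
	}
}
```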
The central controller 411 receives data from the containers 406, 408 on the node device 402 describing their resource usage and status (for example the percentage CPU usage, CPU core temperature, whether one of the components within the containers 406, 408 is in an error condition, etc., as described above). In response, the central controller 411 assigns more hardware resources 420 to the containers 406, 408, reduces the amount of hardware resources 420 assigned to the containers 406, 408, deletes a container 406, 408 and/or creates a new container as appropriate (as described above).
Additionally, the central controller 411 comprises a set of machine learning algorithms (e.g. random forest, support vector machines, linear models, deep belief networks) that are trained on historical metrics collected by the monitoring modules 206 (for example the Smart Resource Allocation module 514 discussed below). All models are trained to predict either a categorical outcome (e.g. container fault) or a numerical outcome (e.g. core temperature in 30 minutes under the same system load). Predictions are collated and a majority vote or a value average is used to calculate a consensus prediction for a categorical or numerical outcome respectively. To identify how the predictive models made their decision, and which specific system component needs to be adjusted to prevent failure, reduce core temperature, or increase processing speed, the Local Interpretable Model-Agnostic Explanations (LIME) method can be used (as described in Ribeiro, M. T., S. Singh and C. Guestrin, Why Should I Trust You?: Explaining the Predictions of Any Classifier. arXiv:1602.04938), as can Explanation Vectors (as described in Baehrens, D., T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen and K.-R. Muller (2010). How to Explain Individual Classification Decisions. Journal of Machine Learning Research 11: 1803-1831). LIME is an algorithm that can explain the predictions of any classifier or regressor in a faithful way, by approximating it locally with an interpretable model such as a linear model. Similarly, an Explanation Vector is a vector that has the same dimension as the data point itself (the number of features) and points toward the direction of maximum “probability flow” away from the class in question. The entries in this vector that have large absolute values correspond to features that have a large local influence on a prediction. The central controller 411 thus identifies which system parameters need to be adjusted by additional allocation or decoupling of resources. The central controller 411 then sends API requests to all containers through the LXD API and performs run-time adjustment of the system's resources. Run-time adjustment executes LXC limit commands to adjust running container limits for optimal performance. LXC configuration changes (such as the number of available CPUs) are applied and become effective immediately. If the system does not have adequate resources to satisfy the optimal configuration settings, an alert record is made in the log file.
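The consensus step described above can be illustrated with a short sketch: categorical predictions are combined by majority vote and numerical predictions by averaging. The model outputs shown are invented examples.

```go
package main

import "fmt"

// majorityVote returns the most frequent categorical prediction.
func majorityVote(labels []string) string {
	counts := map[string]int{}
	best, bestCount := "", 0
	for _, l := range labels {
		counts[l]++
		if counts[l] > bestCount {
			best, bestCount = l, counts[l]
		}
	}
	return best
}

// average returns the mean of the numerical predictions.
func average(values []float64) float64 {
	if len(values) == 0 {
		return 0
	}
	var sum float64
	for _, v := range values {
		sum += v
	}
	return sum / float64(len(values))
}

func main() {
	// e.g. four models voting on whether a container is heading for a fault
	fmt.Println(majorityVote([]string{"fault", "ok", "fault", "fault"})) // "fault"
	// e.g. predicted core temperature in 30 minutes from three models
	fmt.Printf("%.1f\n", average([]float64{61.2, 63.8, 62.5}))
}
```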
The LXC containers 406, 408 are kept on an internal network within the one or more node devices 402. Advantageously, by separating the containers from existing networks over the one or more devices 402, 404, the containers 406, 408 are more secure and are shielded from any potential conflicts with the existing network set-up. The containers 406, 408 are preferably assigned static IP addresses, and configured for constant up-time: as a result, when a container needs to contact another container, it knows where to find it, and it will always have something which can respond to it. This set-up enhances the portability of the system, as well as its security. Alternatively, it is possible to assign dynamic IP addresses, depending on the requirements of the particular system being implemented. In the present example, the containers 406, 408 are kept unprivileged. Advantageously this helps to isolate the containers 406, 408 from the devices 402, 404, further enhancing security. If desired, the containers can access the Internet, and even use DNS if configured to do so, for example, using Google PublicDNS. The containers themselves, though, will not be accessible on the Internet or on whatever network they are deployed.
In this example, preferably the device 402, 404 on which each container 406, 408, 410 is implemented is configured to act as a firewall which only opens the required ports to access the services offered by the container to external networks/devices. Advantageously this allows services offered by the containers 406, 408, 410 to be accessed from outside the network 108. The system optionally uses mDNS for service discovery allowing for access via a hostname. This set-up advantageously allows the system to be deployed on the web without issues.
Figure 5 shows a schematic of an artificial intelligence platform 500 implemented on the system 400 of figure 4. The platform 500 is configured to execute an artificial intelligence (AI) application 509. The artificial intelligence application 509 is configured to process extremely large datasets, for example by generating statistical models to describe the datasets.
The platform 500 includes a number of modules 502, 504, 506, 508, 509, each of which is executed in a separate container 501, 503, 505, 507, 511 (for example containers such as containers 406, 408 shown in figure 4). The containers 501, 503, 505, 507, 511 are implemented over one or more node devices 402.
The platform 500 includes a container 501 running a Web-based User Interface (UI) module 502 (such as the web user interface module 210 described above). In this example the UI module 502 is a web application built on the AngularJS framework and provides a user interface to the system. The web UI module 502 is deployed into the LXC container 501, which hosts an instance of an NGINX web server with the Web UI web application. To reduce the amount of traffic between client and server, the entire application is downloaded into a web browser. It receives required data from the server by executing an HTTP request through the system's API. Data exchange between the client application and the server uses the JSON format. This approach provides full decoupling between client and server and allows for flexible client implementation using a wide range of available technologies.
The platform 500 also includes a container 503 running an API module (such as the API module 208 shown in figure 2 above), in this case a REpresentational State Transfer (REST) Application Programming Interface (API) 504. The REST API 504 provides a standard interface for sending requests to and receiving data from the AI application 509. The REST API 504 can be consumed by a variety of external applications wishing to access the functionality provided by the platform 500 (for example access to the AI application 509). The REST API 504 comprises a lightweight web server, which provides endpoints allowing external applications to access the functionality provided in the platform 500 (for example access to the AI application 509). For example the REST API 504 allows external applications to send requests to the AI application 509 (for example a request to process a certain set of data, for example by generating a statistical model based on the data set). The REST API 504 also comprises a database access module, which enables unified access to a database 508 for external applications through the relevant endpoints. The database access module implements the following endpoints: /exists (check for the existence of a specific filename in the database 508); /upload (push an array of bytes into the database 508); and /pull (download the requested file from the database 508). The REST API 504 also includes a user management module configured to provide user management, authentication and authorization functionality, and provides a mechanism to keep separate workspaces for individual users within the platform 500. The REST API also comprises a workflow engine module, which handles requests to the analytics and prediction engine. In one example the REST API 504 is implemented in the Go programming language and is executed as a process controlled by supervisorctl, a client/server system that allows its users to monitor and control a number of processes on UNIX-like operating systems. Container parameters are controlled externally through a configuration file and can be easily adjusted as required.
Overall, the REST API 504 module is a gateway between the processing functionality of the AI application 509 and the outside world. It abstracts the functions provided by the AI application 509 in terms that could be interpreted and used by a variety of front-end applications.
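As an illustrative sketch only, the /exists, /upload and /pull endpoints described above could be served with Go's standard net/http package as follows; the in-memory map stands in for the MongoDB-backed database 508, and the handler details are assumptions.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"sync"
)

var (
	mu    sync.Mutex
	files = map[string][]byte{} // stand-in for the MongoDB-backed database 508
)

// existsHandler checks for the existence of a specific filename.
func existsHandler(w http.ResponseWriter, r *http.Request) {
	name := r.URL.Query().Get("filename")
	mu.Lock()
	_, ok := files[name]
	mu.Unlock()
	if !ok {
		http.Error(w, "not found", http.StatusNotFound)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// uploadHandler pushes an array of bytes into the store.
func uploadHandler(w http.ResponseWriter, r *http.Request) {
	name := r.URL.Query().Get("filename")
	data, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	mu.Lock()
	files[name] = data
	mu.Unlock()
	w.WriteHeader(http.StatusCreated)
}

// pullHandler downloads the requested file.
func pullHandler(w http.ResponseWriter, r *http.Request) {
	name := r.URL.Query().Get("filename")
	mu.Lock()
	data, ok := files[name]
	mu.Unlock()
	if !ok {
		http.Error(w, "not found", http.StatusNotFound)
		return
	}
	w.Write(data)
}

func main() {
	http.HandleFunc("/exists", existsHandler)
	http.HandleFunc("/upload", uploadHandler)
	http.HandleFunc("/pull", pullHandler)
	log.Fatal(http.ListenAndServe(":8081", nil))
}
```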
The platform 500 also comprises a container 505 running a request execution and scheduling module 214, in this case a request execution and scheduling engine 506, which is responsible for handling the load of the Artificial Intelligence (AI) application 509. It queues and schedules requests from external applications, the requests being received by the REST API 504 and directed to the AI application 509. The request execution and scheduling engine 506 introduces another level of abstraction between the REST API module 504 and the Artificial Intelligence application 509. Advantageously the request execution and scheduling engine enhances system stability, consistency, and responsiveness. The request execution and scheduling engine 506 is written in the Go programming language and uses its goroutines and channels to enable concurrent execution of multiple requests sent to the Artificial Intelligence application 509. The request execution and scheduling engine 506 receives requests from the REST API 504 through a single RESTful access point, and creates a worker that then sends an HTTP request to the AI application 509 (for example through an Rserve R package - see below). The worker provides an environment for any request as long as the AI application 509 container exposes a valid HTTP endpoint that can handle that request. The number of concurrent workers is set when the request execution and scheduling engine 506 starts and can be changed according to available hardware resources, by restarting the module with a different set of parameters. In one example the requests execution and scheduling engine runs as a process controlled and monitored by supervisorctl, a client/server system that allows its users to monitor and control a number of processes on UNIX-like operating systems. Required run-time parameters are specified in an external configuration file and can be easily adjusted for different requirements and environments.
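A minimal sketch of this worker-pool pattern, using goroutines and channels as the text describes, is given below; the endpoint URL, payload and worker count are illustrative assumptions.

```go
package main

import (
	"bytes"
	"log"
	"net/http"
	"sync"
)

// request is one queued job destined for the AI application container.
type request struct {
	Endpoint string // HTTP endpoint assumed to be exposed by the AI application container
	Payload  []byte
}

// worker consumes queued requests and forwards each one over HTTP.
func worker(id int, jobs <-chan request, wg *sync.WaitGroup) {
	defer wg.Done()
	for job := range jobs {
		resp, err := http.Post(job.Endpoint, "application/json", bytes.NewReader(job.Payload))
		if err != nil {
			log.Printf("worker %d: request failed: %v", id, err)
			continue
		}
		resp.Body.Close()
		log.Printf("worker %d: %s -> %s", id, job.Endpoint, resp.Status)
	}
}

func main() {
	const numWorkers = 4 // set from available hardware resources at start-up
	jobs := make(chan request, 100)

	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go worker(i, jobs, &wg)
	}

	// Requests arriving via the REST API would be queued here.
	jobs <- request{Endpoint: "http://10.0.3.10:6311/train", Payload: []byte(`{"dataset":"patients.csv"}`)}
	close(jobs)
	wg.Wait()
}
```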
The platform 500 also comprises a container 511 running an application 204, in this case an artificial intelligence (AI) application 509 comprising an AI server 510 and an AI engine 512. The AI engine 512 handles all machine learning tasks involved in processing extremely large datasets, for example including training, cross-validation, informative feature selection, and prediction. In one example, the AI engine takes electronic hospital patient data as an input, and generates statistical models based on the data that can be used to predict, for example, the likelihood that a patient will be readmitted to hospital within a certain period of time.
In one example, the AI engine 512 consists of Internal Data Representation, Model Training, Feature Selection, and Prediction. The learning algorithm is implemented in C++ and is accessible via an Rserve R statistical package (Rserve being known in the art). The AI engine can handle datasets comprising text, images or sound presented in delimited format. There are no restrictions on the size or dimensions of the dataset. Datasets can be imported sequentially or in parallel using standard Input/Output streams. The AI engine 512 constructs predictive models using a Random Ensemble Decision Tree Learning (REDTL) method. The algorithm, also known as Random Forest (RF), has advantages when applied to large datasets. Firstly, due to the construction of multiple decision trees and the subsequent voting procedure, the algorithm reduces the risk of overfitting. Secondly, the algorithm naturally calculates feature importance scores using the mean decrease in Gini coefficient. Thirdly, Random Forests can deal with high-order interactions and correlated predictor variables. Finally, Random Forest is easily executable on a distributed computing cluster, given its “embarrassingly parallel” nature. The learning process of a Random Forest algorithm involves growing a large number of randomised decision trees, as described in Breiman, L. (2001). Random Forests. Machine Learning 45(1): 5-32. During model training, the AI engine 512 is configured to split a dataset into training and validation sets by randomly allocating 75% of all records to the training set and 25% of all records to the validation set. This procedure is repeated ten times to generate a collection of training and validation sets. The Random Forest algorithm is applied to each of the ten training sets to learn the structure of the underlying data and to predict a response variable, which can be either continuous or categorical. During model construction, two training parameters are tuned - the number of trees (ntrees) and the number of variables randomly sampled as candidates at each split (mtry). To select the best parameters, a grid search is carried out for the combination of ntrees and mtry that yields the lowest misclassification error on the validation set. The AI engine 512 is configured to perform feature selection by computing the Gini Impurity measure for each variable during the training of a Random Forest algorithm. In this example, the AI engine makes predictions based on the datasets as follows. To classify a new object from an input vector, the input vector is propagated down each of the trees in the forest. Each tree gives a classification vote for that class. The forest chooses the classification having the most votes (over all the trees in the forest). Therefore, each validation set is used to assess the accuracy of a model trained on the respective training set. Samples that were previously not seen by the algorithm, but with known outcomes, are propagated down the decision forest and classifications are matched against the known outcome. Model performance is calculated using Specificity, Sensitivity, Positive Predictive Value, Negative Predictive Value, and Area under the Receiver Operating Characteristic curve (AUC) methods. The Random Forest algorithm assigns every unknown sample a probability of group membership. Beyond a simple statement that a patient is in one group or another, more information can be gained by having an estimated probability for belonging to one of the two groups (Emergency vs. Control).
This can be accomplished in two ways: 1) counting the total number of votes per class, normalised by the total number of trees, or 2) growing probability machines, as described in Malley, J. D., J. Kruppa, A. Dasgupta, K. G. Malley and A. Ziegler (2012). Probability machines: consistent probability estimation using nonparametric learning machines. Methods Inf Med 51(1): 74-81.
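By way of illustration only, the following is a minimal sketch of the workflow described above (repeated 75%/25% splits, a grid search over ntrees and mtry, Gini-based feature importance, vote-based prediction, performance metrics, and vote-normalised class probabilities), using scikit-learn's RandomForestClassifier in Python as a stand-in for the C++ REDTL engine; the synthetic dataset, parameter grids and variable names are illustrative and do not form part of the implementation described.

```python
# Illustrative stand-in for the REDTL/Random Forest workflow described above.
# Uses scikit-learn; the dataset and parameter grids are hypothetical.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in for a de-identified patient dataset (class 1 = Emergency, 0 = Control).
X, y = make_classification(n_samples=1000, n_features=25, random_state=0)

best = None
for repeat in range(10):                                   # ten random 75%/25% splits
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, train_size=0.75, random_state=repeat)
    for ntrees in (100, 300):                              # grid search over ntrees and mtry
        for mtry in (3, 5, 10):
            model = RandomForestClassifier(
                n_estimators=ntrees, max_features=mtry, random_state=0)
            model.fit(X_tr, y_tr)
            err = (model.predict(X_val) != y_val).mean()   # misclassification error
            if best is None or err < best[0]:
                best = (err, model, X_val, y_val)

err, model, X_val, y_val = best

# Feature selection: Gini-impurity-based importance scores.
top_features = np.argsort(model.feature_importances_)[::-1][:10]

# Performance of the selected model on its held-out validation set.
pred = model.predict(X_val)
tn, fp, fn, tp = confusion_matrix(y_val, pred).ravel()
sensitivity, specificity = tp / (tp + fn), tn / (tn + fp)
ppv, npv = tp / (tp + fp), tn / (tn + fn)
auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

# Probability of group membership, method 1: votes per class normalised by
# the total number of trees in the forest.
votes = np.stack([tree.predict(X_val) for tree in model.estimators_])
p_emergency = (votes == 1).mean(axis=0)

print(f"error={err:.3f} sens={sensitivity:.3f} spec={specificity:.3f} "
      f"PPV={ppv:.3f} NPV={npv:.3f} AUC={auc:.3f}")
```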
In this example, the AI server 510 is provided by an Rserve package acting as a socket server (TCP/IP or local sockets), which allows binary requests to be sent to R from external LXC containers. Every connection has a separate workspace and working directory. The AI engine is packaged into a container, which in this example preferably comprises: R; a locally configured Comprehensive R Archive Network (CRAN); and the source code of the R packages required for operations within the AI engine 512.
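As a non-limiting sketch, an external container could send a request to the Rserve-based AI server 510 along the following lines; this assumes the third-party pyRserve client library and Rserve's default TCP port 6311, and the host name and R function run_prediction are hypothetical.

```python
# Hypothetical client-side sketch: sending a request to the Rserve socket
# server (AI server 510) from another container.  Assumes the third-party
# pyRserve package; host name and R function name are illustrative.
import pyRserve

conn = pyRserve.connect(host="ai-server", port=6311)       # Rserve's default port
try:
    # Each connection gets its own R workspace and working directory.
    conn.r.dataset_path = "/shared/datasets/admissions.csv"
    result = conn.eval("run_prediction(dataset_path)")      # hypothetical R function
    print(result)
finally:
    conn.close()
```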
The platform 500 also comprises a container 507 including a data store 508 (such as data store 212 of figure 2). In this example the data store 508 is configured to store data in the non-relational object database MongoDB. MongoDB can store JSON as native objects or as binary objects. The platform 500 uses the MongoDB database to store user information (JSON objects), datasets (binary objects), and predictive models (binary objects). The MongoDB engine is accessible by other components via the standard MongoDB ports. To ensure data persistence when containers are rebuilt or destroyed, the MongoDB data folder is located on the host and presented to the container as a shared resource. When the container hosting the database engine is not available (for example if the container is stopped, rebuilt, or unavailable), the actual data is still available and the directory can be attached to another container which hosts the database engine. This approach also simplifies database backup, which can be performed as a simple snapshot of the data folder to any removable media.
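For illustration, storing and retrieving the three kinds of objects mentioned above might look as follows; this is a sketch assuming the pymongo driver and GridFS for binary objects, with hypothetical host, database, collection and file names.

```python
# Hypothetical sketch of how other components could use the MongoDB data
# store 508.  Assumes the pymongo driver; the host name "datastore" and the
# database and collection names are illustrative only.
import gridfs
from pymongo import MongoClient

client = MongoClient("mongodb://datastore:27017")           # standard MongoDB port
db = client["platform"]

# User information stored as a native JSON-like document.
db.users.insert_one({"username": "analyst1", "role": "clinician"})

# Datasets and predictive models stored as binary objects via GridFS.
fs = gridfs.GridFS(db)
with open("admissions.csv", "rb") as f:
    dataset_id = fs.put(f, filename="admissions.csv", kind="dataset")
model_id = fs.put(b"<serialised model bytes>", filename="readmission.model",
                  kind="model")

# Retrieval by object id.
model_bytes = fs.get(model_id).read()
```

The bind mount of the host data folder into the container could, for example, be declared with an lxc.mount.entry line in the container configuration, although the exact mechanism is not prescribed here.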
Together, the containers 501, 503, 505, 507, 511, including the web UI 502, the REST API 504, the request execution and scheduling engine 506, the data store 508 and the AI application 509, provide a platform for executing the AI application 509 to perform statistical analysis on extremely large datasets.
The platform 500 also implements a monitoring module 206, in this case a Smart Resource Allocation Module 514, which is run as an agent in each container 501, 503, 505, 507, 511. System resources 202 assigned to each container 406, 408 are adjusted in real time via the Smart Resource Allocation Module 514. The Smart Resource Allocation Module 514 is a lightweight agent that is hosted on every container and collects the running system's parameters (e.g. CPU and memory utilization, core temperature, swap memory utilization), sending them to the central controller 411 for actionable analytics. The central controller 411 hosts an analytical model that adjusts container resources (e.g. allocated CPU and RAM) in real time and in response to the collected system parameters. The state of the running LXC containers can be described by a series of metrics available through a set of standard Unix/Linux packages. While a module inside an LXC container performs a set of standard tasks (e.g. data upload, data exploration, building a predictive model, calculating a prediction), its load and performance are reflected in that container's metrics. Preferably the Smart Resource Allocation Module 514 collects the following metrics: the CPU load average, which represents the system load over a period of time (for example the standard 1-, 5- and 15-minute load averages); the CPU usage for every core on the node device 402 hosting the container 406, 408 in question; the amount of used/free RAM on the node device 402 hosting the container 406, 408 in question; the amount of used/free SWAP memory on the node device 402 hosting the container 406, 408 in question; and the physical core temperature (in °C) of the node device 402 hosting the container 406, 408 in question. Additionally, metrics for each running component are preferably monitored (these are defined and implemented during component development, are specific to different components, and could include, for example, the number of processed requests per unit of time, the number of tasks in the queue, model building time, the number of execution failures, etc.). All measurements are sent to the central controller 411 and saved in a database for run-time and historical analysis as described above.
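A minimal sketch of such an agent is given below; it assumes the psutil and requests Python packages and a hypothetical HTTP endpoint on the central controller 411, whereas the actual module may gather the same metrics through any standard Unix/Linux tooling.

```python
# Hypothetical sketch of a Smart Resource Allocation Module agent collecting
# the metrics listed above and pushing them to the central controller.
# Assumes the psutil and requests packages; the controller URL is illustrative.
import socket
import time

import psutil
import requests

CONTROLLER_URL = "http://central-controller:8080/metrics"   # hypothetical endpoint

def collect_metrics() -> dict:
    temps = psutil.sensors_temperatures()                   # Linux only
    core_temps = [t.current for t in temps.get("coretemp", [])]
    return {
        "container": socket.gethostname(),
        "loadavg_1_5_15": psutil.getloadavg(),              # 1-, 5- and 15-minute load averages
        "cpu_per_core": psutil.cpu_percent(percpu=True),
        "ram": psutil.virtual_memory()._asdict(),
        "swap": psutil.swap_memory()._asdict(),
        "core_temp_c": max(core_temps) if core_temps else None,
        "timestamp": time.time(),
    }

if __name__ == "__main__":
    while True:
        requests.post(CONTROLLER_URL, json=collect_metrics(), timeout=5)
        time.sleep(60)                                       # sampling interval is a deployment choice
```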
Providing the Smart Resource Allocation Module 514 in each container 501, 503, 505, 507, 511 in combination with the central controller 411 provides a number of benefits. The platform provides a portable and reproducible environment: container provisioning and system deployment workflows can run on a wide range of hardware that provides an LXC environment. In addition, the platform allows for better hardware resource management, precise system planning, and calculation of running costs. By providing need-based resource allocation, the platform is able to increase or decrease the hardware resources assigned to a container depending on real-time demand, without terminating running processes. The platform is also highly scalable: identical workflow scripts can be run on different types of devices, from mini-computer boards (e.g. Raspberry Pi TM boards) to data centres. The platform can also be deployed into any cloud that provides LXC capabilities. The platform also provides benefits in terms of system stability: at any point, a running container can be duplicated, taken offline, replaced with another one, or copied onto another host on the network without interrupting normal system function. If a container fails, the host system remains stable and the failed container can be brought back online without jeopardizing the user experience.
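By way of example only, the controller-side adjustment of a running container's resources could use the LXC cgroup interface along the following lines; the lxc-cgroup keys shown assume a cgroup v1 host, and the threshold policy and limits are purely illustrative.

```python
# Hypothetical sketch of the central controller 411 re-allocating resources
# to a running LXC container without restarting it, using the lxc-cgroup
# tool.  Cgroup key names assume cgroup v1; the policy itself is illustrative.
import subprocess

def set_container_limits(name: str, memory_bytes: int, cpu_shares: int) -> None:
    subprocess.run(["lxc-cgroup", "-n", name,
                    "memory.limit_in_bytes", str(memory_bytes)], check=True)
    subprocess.run(["lxc-cgroup", "-n", name,
                    "cpu.shares", str(cpu_shares)], check=True)

def rebalance(metrics: dict) -> None:
    """Small example policy: give a heavily loaded container more RAM and CPU."""
    for container, m in metrics.items():
        if m["ram"]["percent"] > 90:
            set_container_limits(container,
                                 memory_bytes=4 * 1024**3,   # e.g. raise limit to 4 GiB
                                 cpu_shares=2048)
        elif m["ram"]["percent"] < 20:
            set_container_limits(container,
                                 memory_bytes=1 * 1024**3,
                                 cpu_shares=512)
```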
The present platform is also advantageous in that it allows updates to the software modules to be managed efficiently. When an update is ready to go live, an updated container is built, which in some examples is then tested to ensure that everything is working correctly. A hot-swap can then be performed, wherein a container running the outdated version of the software module is replaced with the new container, which is already running. There is barely any downtime involved: the present platform allows one container simply to be switched for another one, which can pick up from the earlier container instantly. Advantageously, this allows software updates to be released to the end-user regularly, with minimal downtime.
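Purely as an illustration, such a hot-swap could be scripted with the standard LXC command-line tools roughly as follows; the container names and the health check are hypothetical, and the way traffic is redirected to the new container (for example by re-pointing a proxy at its static IP address) is left abstract.

```python
# Hypothetical sketch of swapping an outdated container for an updated one
# using standard LXC commands via subprocess.  Names are illustrative, and
# how traffic is redirected to the new container is deployment-specific.
import subprocess

def lxc(*args: str) -> None:
    subprocess.run(["lxc-" + args[0], *args[1:]], check=True)

def hot_swap(old: str, new: str) -> None:
    lxc("start", "-n", new)                     # the updated container is built in advance
    lxc("wait", "-n", new, "-s", "RUNNING")     # wait until it is running
    # ... health-check the new container and re-point the proxy / static IP
    # mapping at it here (not shown) ...
    lxc("stop", "-n", old)                      # only then retire the outdated container

hot_swap(old="rest-api-v1", new="rest-api-v2")
```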
The above embodiments are provided as examples only. Further aspects of the invention will be understood from the appended claims.

Claims (27)

1. A method for implementing an application platform across one or more host devices, the method comprising, by a central controller:
creating a plurality of containers on the one or more host devices, each container providing an isolated execution environment on a respective device;
executing one or more applications across the plurality of containers; while executing the one or more applications:
dynamically monitoring resource usage for each of the plurality of containers;
dynamically monitoring resource availability of the one or more host devices; and based on the monitored resource usage and resource availability, dynamically allocating resources of the one or more host devices to the plurality of containers.
2. The method of claim 1 comprising:
providing each of the plurality of containers with a monitoring module, the monitoring module configured to monitor the resource usage for an associated container and send resource usage information to the central controller; and receiving the resource usage information at the central controller.
3. The method of claim 1 or claim 2, comprising predicting future container resource usage based on historical resource usage.
4. The method of any preceding claim, comprising dynamically monitoring each of the plurality of containers for errors; and if an error is detected for a first container of the plurality of containers, removing the first container and creating a second container to replace the first container.
5. The method of any preceding claim, comprising, based on the monitored resource usage and resource availability, dynamically adding or removing additional containers.
6. The method of any preceding claim wherein the central controller is executing in a central container.
7. The method of claim 6, wherein the central container in which the central controller is executing is implemented on a host device and each of the plurality of containers is implemented on a different device in a networked computing system.
8. The method of any preceding claim, comprising assigning each container a static IP address.
9. The method of any preceding claim, wherein the one or more applications includes an application for processing extremely large datasets.
10. The method of claim 9, wherein the one or more applications includes an application programming interface module, configured to provide programs running outside the plurality of containers with access to functionality provided by the application for processing extremely large datasets.
11. The method of claim 9 or claim 10, wherein the one or more applications includes a web user interface module configured to implement a web-based user interface for interacting with the application for processing extremely large datasets.
12. The method of any of claims 9 to 11, wherein the one or more applications includes a request execution and scheduling module configured to queue and schedule requests sent to the application for processing extremely large datasets.
13. The method of any of claims 1 to 12, wherein creating the plurality of containers on the one or more host devices is based on a predefined specification.
14. A computer readable medium storing instructions, which when executed by one or more computing devices, cause the one or more computing devices to perform the method of any of claims 1 to 13.
15. A system for implementing an application platform, the system comprising:
one or more host devices;
a central controller;
a plurality of containers provided on the one or more host devices; one or more applications executing across the plurality of containers; wherein the central controller is configured to, while executing the one or more applications:
dynamically monitor resource usage for each of the plurality of containers;
dynamically monitor resource availability of the one or more host devices; and based on the monitored resource usage and resource availability, dynamically allocate resources of the one or more host devices to the plurality of containers.
16. The system of claim 15, wherein each of the plurality of containers comprises a monitoring module configured to monitor the resource usage for an associated container and send resource usage information to the central controller.
17. The system of claim 15 or claim 16, wherein the central controller is configured to predict future container resource usage based on historical resource usage.
18. The system of any of claims 15 to 17, wherein the central controller is configured to dynamically monitor each of the plurality of containers for errors and, if an error is detected for a first container of the plurality of containers, to remove the first container and create a second container to replace the first container.
19. The system of any of claims 15 to 18 configured to, based on the monitored resource usage and resource availability, dynamically add or remove additional containers.
20. The system of any of claims 15 to 19 comprising a central container comprising the central controller.
21. The system of claim 20, wherein the central container comprising the central controller is implemented on a host device and each of the plurality of containers is implemented on a different device in a networked computing system.
22. The system of any of claims 15 to 21, wherein each container has an associated static IP address.
23. The system of any of claims 15 to 22, wherein the one or more applications includes an application for processing extremely large datasets.
24. The system of claim 23, wherein the one or more applications includes an application programming interface module, configured to provide programs running outside the plurality of containers with access to functionality provided by the application for processing extremely large datasets.
25. The system of claim 23 or claim 24, wherein the one or more applications includes a web user interface module configured to implement a web-based user interface for interacting with the application for processing extremely large datasets.
26. The system of any of claims 23 to 25, wherein the one or more applications includes a request execution and scheduling module configured to queue and schedule requests sent to the application for processing extremely large datasets.
27. The system of any of claims 15 to 26 wherein the central controller is configured to create the plurality of containers on the one or more host devices based on a predefined specification.
GB1711868.8A 2017-07-24 2017-07-24 Containerized application platform Withdrawn GB2564863A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1711868.8A GB2564863A (en) 2017-07-24 2017-07-24 Containerized application platform

Publications (2)

Publication Number Publication Date
GB201711868D0 GB201711868D0 (en) 2017-09-06
GB2564863A true GB2564863A (en) 2019-01-30

Family

ID=59771700

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1711868.8A Withdrawn GB2564863A (en) 2017-07-24 2017-07-24 Containerized application platform

Country Status (1)

Country Link
GB (1) GB2564863A (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111045901B (en) * 2019-12-11 2024-03-22 东软集团股份有限公司 Container monitoring method and device, storage medium and electronic equipment
CN114003462A (en) * 2021-10-22 2022-02-01 上海洪朴信息科技有限公司 Visual iterative production system of photovoltaic module defect detection model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120246638A1 (en) * 2011-03-22 2012-09-27 International Business Machines Corporation Forecasting based service assignment in cloud computing
US20140082614A1 (en) * 2012-09-20 2014-03-20 Matthew D. Klein Automated profiling of resource usage
US20140258446A1 (en) * 2013-03-07 2014-09-11 Citrix Systems, Inc. Dynamic configuration in cloud computing environments
CN105677485A (en) * 2016-01-08 2016-06-15 中电科华云信息技术有限公司 Memory dynamic adjusting method for cloud desktop host and virtual machine

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020232157A1 (en) * 2019-05-14 2020-11-19 Pricewaterhousecoopers Llp System and methods for generating secure ephemeral cloud-based computing resources for data operations
US12141621B2 (en) 2019-05-14 2024-11-12 PwC Product Sales LLC System and methods for generating secure ephemeral cloud-based computing resources for data operations
WO2021074489A1 (en) * 2019-10-16 2021-04-22 Kemira Oyj A controller
AU2020368021B2 (en) * 2019-10-16 2025-07-10 Kemira Oyj A controller
US12481275B2 (en) 2019-10-16 2025-11-25 Kemira Oyj Controller
US11983570B2 (en) 2021-08-26 2024-05-14 International Business Machines Corporation Conditionally deploying a reusable group of containers for a job based on available system resources

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)