WO2022029443A1 - Method and apparatus for reducing the risk of successful side channel and fault injection attacks - Google Patents
- Publication number
- WO2022029443A1 (PCT/GB2021/052034)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- task
- queue
- tasks
- pipeline
- executable tasks
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/70—Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer
- G06F21/71—Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer to assure secure computing or processing of information
- G06F21/75—Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer to assure secure computing or processing of information by inhibiting the analysis of circuitry or operation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/002—Countermeasures against attacks on cryptographic mechanisms
- H04L9/003—Countermeasures against attacks on cryptographic mechanisms for power analysis, e.g. differential power analysis [DPA] or simple power analysis [SPA]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/002—Countermeasures against attacks on cryptographic mechanisms
- H04L9/004—Countermeasures against attacks on cryptographic mechanisms for fault attacks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L2209/00—Additional information or applications relating to cryptographic mechanisms or cryptographic arrangements for secret or secure communication H04L9/00
- H04L2209/08—Randomization, e.g. dummy operations or using noise
Definitions
- This invention relates to a method and apparatus for reducing the risk of successful side channel and fault injection attacks when executing a computer algorithm.
- the execution of the computer algorithm may be immune to side channel and fault injection attacks.
BACKGROUND
- There are many situations where computer algorithms perform tasks which one might wish to keep secret but which are open to Side Channel Attacks [SCAs]. One such situation arises in cryptography.
- Traditionally, cryptography has sought to improve the security of encrypted data by increasing the complexity of the mathematical function, the cipher, that turns plaintext into ciphertext.
- SCA Side Channel Attack
- DPA Differential Power Analysis
- Differential Power Analysis was described by Kocher et al in a paper published at the Annual International Cryptology Conference in 1999.
- US2017099134 (A1) [ Kocher et al] describes a method and apparatus for conducting DPAs on devices to establish how vulnerable they are to DPAs.
- EM emissions can also be used for successful SCAs.
- Electric current flowing through a conductor induces electromagnetic emanations, and these emanations are a source of side-channel information.
- As the power consumption in a device varies while data is being processed, so does the electromagnetic field, and one can expect to extract secret information from a relevant analysis.
- EMA ElectroMagnetic Analysis
- Both DPA and EMA can be improved by using various wavelet transform denoising methods, typically based on singular spectral analysis (SSA) and detrended fluctuation analysis (DFA), as described for example in “Improved wavelet transform for noise reduction in power analysis attacks” [Ai et al, 2016, IEEE International Conference on Signal and Image Processing (ICSIP)] .
- SSA singular spectral analysis
- DFA detrended fluctuation analysis
- The principal signal component in SSA can be selected adaptively by DFA, and the residual part can be denoised by wavelet transform to retrieve important information.
- Using wavelet transforms in such a way improves the SCA success rate whilst significantly decreasing the necessary number of power consumption traces.
- DPA and EMA rely on two facts.
- Examples include “A Network-based Asynchronous Architecture for Cryptographic Devices”, a PhD thesis by Spadavecchia published by Edinburgh University in 2005; “Improving DPA Security by Using Globally-Asynchronous Locally-Synchronous Systems” by Gürkaynak et al, published in the Proceedings of the 31st European Solid-State Circuits Conference 2005; and “Countering power analysis attacks by exploiting characteristics of multicore processors” by Yang et al, published in IEICE Electronics Express in 2018, Volume 15, Issue 7, 20180084. Most of these approaches have been published as modelling exercises to establish chip architectures which can function with some randomness in core use. However, implementing these approaches typically requires new multicore hardware for every user, which is a considerable expense for banks and the like issuing cards with new chips.
- FIAs fault injection attacks
- FIAs can be generalised and relatively easy to mount, such as glitching the voltage up or down out of the working range, or generating sudden out-of-working-range temperature changes. More specialised and costly FIAs involving lasers or ion beams can be focussed on much smaller parts of a processor. FIAs may sometimes lead to a processor dumping the task it was performing, and the dump can be harvested and analysed. Sometimes an FIA can lead to a specific bit flip which changes the processing. Comparing “normal” running with the bit flip can reveal useful information.
- the invention relates to apparatus and methods for enhancing security when executing a computer algorithm as defined in the independent claims. Further features are defined in the dependent claims.
- apparatus for enhancing security when executing a computer algorithm comprising separately executable tasks, each of which produces an electrical signal when executed, the apparatus comprising: memory in which at least part of the computer algorithm is stored as at least one pipeline, wherein the at least one pipeline comprises a plurality of non-commutative separately executable tasks; and a processor which is configured to: receive a plurality of inputs to be processed by the computer algorithm; randomise the order of execution of the plurality of non-commutative separately executable tasks within the pipeline; attempt to execute each of the plurality of non-commutative separately executable tasks from the randomised order of execution once by determining whether preconditions for execution are met, when the preconditions are met, executing the task and storing an output of the executed task; and repeat the randomising and attempting steps until the computer algorithm has processed the plurality of inputs; whereby, at each repetition, the electrical signals produced when executing the tasks are randomised to enhance security.
- the plurality of separately executable tasks may be considered to be randomised at a pipeline level.
- apparatus for enhancing security when executing a computer algorithm comprising separately executable tasks, each of which produces an electrical signal when executed, the apparatus comprising: memory in which at least part of the computer algorithm is stored as at least one pipeline, wherein the at least one pipeline comprises at least one task which comprises a set of commutative operations; and a processor which is configured to: receive a plurality of inputs to be processed by the computer algorithm; randomise an order of execution for the commutative operations of the at least one task within the pipeline; execute each of the at least one tasks within the pipeline once; and repeat the randomising and executing steps until the computer algorithm has processed all inputs; whereby, at each repetition, the electrical signals produced when executing the at least one task are randomised to enhance security.
- the plurality of separately executable tasks may be considered to be randomised at a task level.
- apparatus for enhancing security when executing a computer algorithm comprising memory in which at least part of the computer algorithm is stored as at least one pipeline comprising a plurality of separately executable tasks; and a processor which is configured to randomise the plurality of separately executable tasks at a pipeline level and/or at a task level; and execute the randomised plurality of separately executable tasks whereby electrical signals produced when executing each task are randomised to enhance security.
- a method for enhancing security when executing a computer algorithm comprising storing at least part of the computer algorithm as at least one pipeline comprising a plurality of separately executable tasks; randomising the plurality of separately executable tasks at a pipeline level and/or at a task level; and executing the randomised plurality of separately executable tasks whereby electrical signals produced when executing each task are randomised to enhance security.
- a method for enhancing security when executing a computer algorithm comprising separately executable tasks, each of which produces an electrical signal when executed, the method comprising: storing at least part of the computer algorithm as at least one pipeline, wherein the at least one pipeline comprises a plurality of non-commutative separately executable tasks; receiving a plurality of inputs to be processed by the computer algorithm; randomising the order of execution of the plurality of separately executable tasks within the pipeline; attempting to execute each of the plurality of separately executable tasks from the randomised order of execution once by determining whether preconditions for execution are met, when the preconditions are met, executing the task and storing an output of the executed task; and repeating the randomising and attempting steps until the computer algorithm has processed the plurality of inputs; whereby, at each repeating step, the electrical signals produced when executing the tasks are randomised to enhance security.
- a method for enhancing security when executing a computer algorithm comprising separately executable tasks, each of which produces an electrical signal when executed, the method comprising: storing at least part of the computer algorithm as at least one pipeline, wherein the at least one pipeline comprises at least one task which comprises a set of commutative operations; receiving a plurality of inputs to be processed by the computer algorithm; randomising an order of execution for the commutative operations of the at least one task within the pipeline; executing each of the at least one tasks within the pipeline once; and repeating the randomising and executing steps until the computer algorithm has processed all inputs; whereby, at each repetition, the electrical signals produced when executing the at least one task are randomised to enhance security.
- the randomising at a pipeline level and/or a task level may be considered to be a Random Out Of Order Execution [ROOOX] approach. Both commutative and non-commutative portions of the algorithm can be randomised separately or together. Moreover, there may be random mixing both within individual runs and between different runs of the algorithm.
- an algorithm may be considered to be unrolled, de-nested, and split into portions of sufficiently fine granularity that they can all be randomised and presented in a wait-free asynchronous manner.
- when randomising at a pipeline level, each task may be presented for execution in a random order but is executed in the correct order for each input.
- all tasks in the random order may meet the preconditions for execution and thus there may appear to be a random out of order execution of the pipeline. However, each task is still being executed in the correct order for the associated input.
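The randomise-and-attempt cycle described above can be sketched as follows. This is a minimal, single-threaded Python illustration; all function and variable names are assumptions for exposition, not taken from the claims. Non-commutative tasks are presented in a random order each pass, only tasks whose precondition (a waiting input) is met actually fire, and the cycle repeats until every input has flowed through the pipeline:

```python
# Illustrative sketch of pipeline-level Random Out Of Order eXecution
# (ROOOX). Names (roox_run, stage_in) are hypothetical.
import random

def roox_run(tasks, inputs):
    """tasks: ordered list of non-commutative stages; stage i consumes
    the output of stage i-1. Returns the fully processed inputs."""
    # Per-stage input queues: stage 0 is fed by the raw inputs.
    stage_in = [list(inputs)] + [[] for _ in tasks[1:]]
    results = []
    remaining = len(inputs)
    while remaining:
        order = list(range(len(tasks)))
        random.shuffle(order)                # randomise presentation order
        for i in order:                      # attempt every task exactly once
            if stage_in[i]:                  # precondition: an input is waiting
                out = tasks[i](stage_in[i].pop(0))
                if i + 1 < len(tasks):
                    stage_in[i + 1].append(out)   # store intermediate output
                else:
                    results.append(out)
                    remaining -= 1
    return results
```

Although the presentation order changes on every pass, each input still traverses the stages in the correct order, because a stage can only fire once its predecessor has stored an intermediate output.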
- An executable task may be considered to be a portion of program code residing in memory that can be run by processing hardware, e.g. the processor.
- Such a task may be a suitably intercommunicating, wait-free, asynchronously executing sub step of a larger computer algorithm or algorithms.
- By asynchronous it is meant that the tasks do not operate synchronously: they are not in lock-step, and they operate in a non-blocking manner so that the algorithm is guaranteed to complete in a finite number of steps regardless of the speed of operation of the tasks.
- By wait-free it is meant that all tasks will complete in finite time; in other words, there is no blocking wait-state and the processor cannot be indefinitely blocked or starved.
- the separately executable tasks may be of sufficiently fine granularity; what counts as sufficient depends on several factors.
- the size of an executable task is not merely the physical size that the task occupies in memory but also the amount of time that is spent executing the program instructions that define that particular task.
- the executable task may, for example, be as small as a call or write function, or as large as all the steps necessary to expand an encryption key.
- Each task may be without repetitions occurring during a single execution episode, for example to assist in reducing any information that will be leaked at the point of execution.
- the algorithm may comprise a plurality of rounds of component actions.
- the granularity of the task may be considered to be coarse when the algorithm is split into tasks, each of which constitutes a round within the algorithm.
- the granularity of the task may be considered to be medium when the algorithm is split into tasks, each of which constitutes a component action within the algorithm.
- the granularity of the task may be considered to be fine when the algorithm is split into tasks, each of which constitutes an operation within a component action.
- the apparatus may further comprise at least one buffer comprising a first queue and a second queue.
- the at least one buffer may be termed a double buffer.
- Randomising the plurality of separately executable tasks at a pipeline level may be achieved by generating a pointer which is associated with each executable task; storing the plurality of pointers in temporal order in the first queue; randomising the order of the plurality of pointers; and transferring the randomised plurality of pointers to the second queue from which each pointer is retrievable to execute the associated task.
- By randomising the order of the pointers the order in which the plurality of tasks are presented for execution is randomised.
- a pointer is a programming language object that stores the memory address of another value located in computer memory. A pointer references a location in memory, and obtaining the value stored at that location is known as dereferencing the pointer.
- the plurality of separately executable tasks may be executed by retrieving a pointer at the head of the second queue (for example, in response to an instruction issued by the processor); attempting to execute the task associated with the retrieved pointer; returning the pointer to the first queue; and repeating the retrieving, attempting and returning steps until the second queue is empty.
- the processor may be configured to determine whether the second queue is empty and when it is determined that the second queue is empty, randomise the order of the plurality of pointers; and transfer the randomised plurality of pointers to the second queue.
- the randomisation of the pointers in the first queue and transfer to the second queue may be done repeatedly and indefinitely or until the computer algorithm has processed all inputs.
- Attempting to execute the task may comprise determining whether preconditions for execution are met and when the preconditions are met, executing the task or when the preconditions are not met, proceeding straight to the returning step.
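The double-buffer mechanism of the preceding paragraphs might be sketched as below; the class and method names are illustrative assumptions only. Task "pointers" (here, plain Python callables) are shuffled from the first queue into the second, retrieved head-first, attempted once, and returned to the first queue for the next pass:

```python
# Hypothetical sketch of the temporal double buffer described above.
import random
from collections import deque

class DoubleBuffer:
    def __init__(self, task_ptrs):
        self.first = deque(task_ptrs)    # pointers held in temporal order
        self.second = deque()

    def refill(self):
        """Randomise the pointers and transfer them to the second queue."""
        ptrs = list(self.first)
        self.first.clear()
        random.shuffle(ptrs)
        self.second.extend(ptrs)

    def run_pass(self, ready):
        """ready(task) is the precondition test; when it fails, the
        pointer is simply returned (no blocking wait)."""
        self.refill()
        while self.second:               # until the second queue is empty
            ptr = self.second.popleft()  # pointer at the head
            if ready(ptr):
                ptr()                    # execute the associated task
            self.first.append(ptr)       # return pointer for the next pass
```

Each `run_pass` presents every task exactly once in a fresh random order, matching the retrieve/attempt/return cycle described above.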
- the preconditions may comprise determining whether there is an intermediate output associated with the task, i.e. whether the task has already executed and its output has been stored.
- the preconditions may comprise determining whether there is an intermediate input associated with the task, i.e. whether the previous task has already been executed and its output stored so that the task can be executed.
- the repetitive cycling of unmet tasks adds time
- the wait-free action reduces time.
- the asynchronous nature of the process according to the invention reduces processing time.
- the pointers may be retrieved from (i.e. exit) the second queue in an asynchronous, wait-free manner.
- a non-uniform execution is achieved and any semblance of a pattern is obliterated which reduces the chance of a successful side channel attack.
- although the plurality of executable tasks are presented in a random order, the continual testing to see if each task can be executed ensures that the tasks are executed in order which, it will be appreciated, is important for non-commutative tasks.
- the apparatus may comprise a first buffer and a second buffer each comprising a first and second queue.
- Each of the first and second buffer may be a temporal double buffer.
- the memory may store a first pipeline comprising a plurality of separately executable tasks and a second pipeline comprising a plurality of separately executable tasks.
- the first pipeline may be an encryption pipeline for encrypting plaintext and the second pipeline may be a decryption pipeline for decrypting cipher text.
- Randomisation may be achieved by generating a pointer which is associated with each executable task in the first and second pipelines; storing the plurality of pointers for the plurality of separately executable tasks in the first pipeline in temporal order in the first queue of the first buffer; storing the plurality of pointers for the plurality of separately executable tasks in the second pipeline in temporal order in the first queue of the second buffer; randomising the order of the plurality of pointers in each of the first queues; transferring the randomised plurality of pointers from the first queue of the first buffer to the second queue of the first buffer; and transferring the randomised plurality of pointers from the first queue of the second buffer to the second queue of the second buffer.
- the pointers in each of the second buffers are retrievable for wait-free execution as described above.
- the apparatus may comprise at least one circular buffer for storing a plurality of pointers with each pointer being associated with one of the plurality of tasks and a plurality of queues connected to the at least one circular buffer with each queue storing the output from each of the plurality of tasks when it is executed.
- a circular buffer of this kind may be termed a single producer, single consumer (SPSC) buffer.
- SPSC single producer, single consumer
- the stored output may be termed an intermediate output.
- the processor may be configured to randomise the order in which the plurality of separately executable tasks within the pipeline (or at a pipeline level) are presented for execution by randomising the order of the plurality of pointers within the at least one circular buffer.
- the processor is configured to attempt to execute each of the plurality of separately executable tasks from the randomised order by: retrieving a pointer at the head of the circular buffer; attempting to execute the task associated with the retrieved pointer; returning the pointer to the circular buffer; and repeating the retrieving, attempting and returning steps until all pointers within the circular buffer have been attempted. Once all the pointers within the circular buffer have been attempted, the order of the plurality of pointers may be randomised again and the process repeats.
- There may be a plurality of processing cores for executing the plurality of separately executable tasks. The plurality of processing cores may be spread across a network, or may be on the same chip or in the same device.
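A rough sketch of one revolution of this circular-buffer cycle, using a `deque` as a stand-in ring (the names are illustrative, not the patent's): each pointer at the head is attempted once, any resulting intermediate output is stored in the per-task output queue, and the pointer is rotated back into the buffer.

```python
# Sketch of the circular-buffer (SPSC-style) variant, with a deque
# standing in for the ring. All names are hypothetical.
import random
from collections import deque

def circular_pass(ring, outputs, attempt):
    """ring: deque of (task_id, task) 'pointers';
    outputs: dict of per-task queues for intermediate outputs;
    attempt(task): returns a result, or None if preconditions are unmet."""
    random.shuffle(ring)                     # randomise pointer order in the ring
    for _ in range(len(ring)):
        task_id, task = ring[0]              # pointer at the head of the buffer
        result = attempt(task)
        if result is not None:
            outputs[task_id].append(result)  # store the intermediate output
        ring.rotate(-1)                      # return the pointer to the buffer
```

After every pointer has been attempted once, the caller simply invokes `circular_pass` again, which reshuffles and repeats, as described above.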
- each core may involve several logical threads and a pointer may be associated with each of the logical threads. Randomisation of the location in which a task is executed may mean that it is not possible to rely on tracking the emanations from a particular core for an SCA.
- a first processing core may be configured to retrieve the pointers from the second queue of the first buffer (and hence execute the associated tasks of the first pipeline) and a second processing core may be configured to retrieve the pointers from the second queue of the second buffer (and hence execute the associated tasks of the second pipeline). It will be appreciated that two is illustrative in this example and the number of processing cores may be selected to match the number of pipelines.
- the feature of utilising a plurality of cores may be used for parallel processing; e.g. for encryption and decryption or other similar substantially embarrassingly parallel problems.
- an apparatus with a plurality of cores generally follows Gustafson-Barsis’ Law and benefits from linear scaling.
- the apparatus may be adapted to allow for recruiting and retiring processing cores from a pool of processing cores. When a core is recruited to the pool, a pointer to the core, or pointers to its logical cores, may be generated and added to the plurality of pointers.
- the pointers to the processing cores may then be randomised, leading to randomness in location of execution of a program (i.e. topological randomisation).
- When a core is retired, its respective pointer, or the pointers to its logical cores, may be removed from the plurality of pointers.
- randomisation may be introduced by generating a pointer which is associated with each of the plurality of processing cores/threads.
- the pointers may be randomised using the double buffer and/or the SPSC described above.
- the randomisation may be introduced by storing the plurality of pointers in topological order in the first queue; randomising the order of the plurality of pointers; and transferring the randomised plurality of pointers to the second queue from which each pointer is retrievable to determine which of the plurality of processing cores/threads is to execute a task.
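A minimal sketch of such a topological double buffer follows, assuming a hypothetical pool of four cores identified by name. The same two-queue shuffle is applied, but to "pointers" to execution locations rather than to tasks, so where a task runs is randomised:

```python
# Illustrative sketch of topological randomisation; core names and the
# next_location helper are assumptions for exposition.
import random
from collections import deque

def next_location(first, second):
    """Pop the next randomised execution location from a topological
    double buffer (two deques of core/thread 'pointers')."""
    if not second:                 # refill when the second queue is empty
        ptrs = list(first)
        first.clear()
        random.shuffle(ptrs)       # randomise *where*, not *when*
        second.extend(ptrs)
    loc = second.popleft()         # core/thread chosen for the next task
    first.append(loc)              # keep the pointer for later passes
    return loc

# Hypothetical pool of four cores, stored in topological order.
first, second = deque(["core0", "core1", "core2", "core3"]), deque()
picks = [next_location(first, second) for _ in range(4)]
```

Over one refill cycle every core is chosen exactly once, but in an unpredictable order, so emanations cannot be tracked to a fixed core.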
- a double buffer may be termed a topological buffer because it randomises the location in which the task is executed as opposed to the order in which a task is executed as in a temporal double buffer described above.
- the randomness may be enhanced by combining temporal and location randomisation.
- the apparatus may comprise a first buffer having a first temporal queue and a second temporal queue; a second buffer having a first topological queue and a second topological queue; and a scheduler which is configured to cooperate with the processor to randomise the plurality of separately executable tasks at a pipeline level both temporally and topologically.
- a functional change in temporal and topological ordering of execution may thus be achieved by the mixing of pointers to both the program code of the executable tasks, and location of execution, such that, when and where the tasks are physically executed is entirely random.
- the at least one buffer may be located within a register of the processor, particularly if pointers are embodied as small enough pieces of memory that can readily fit inside a processor’s register.
- the at least one buffer may be separate from the processor.
- the processor may thus be configured to randomise the plurality of separately executable tasks at a pipeline level by communicating or cooperating with the at least one buffer.
- a plurality of shuffled double buffers may be used as explained above to shuffle pointers to executable tasks and/or a plurality of shuffled double buffers may be used as explained above to shuffle pointers to locations of execution.
- Randomisation may be introduced within the pipeline (at a pipeline level) or within the task itself (at a task level). Both types of randomisation may be used separately or combined. Randomisation within the task may be used when the task comprises a set of commutative operations.
- the processor may then be configured to randomise an order of execution of the commutative operations within the task.
- the commutative operations may have a set of indices and randomising the order of execution of commutative operations may comprise randomising the set of indices for the commutative operations before execution of the selected task.
- the location of execution of each commutative operation may also be randomised.
- Randomisation may be introduced into the plurality of separately executable tasks at a task level by selecting at least one task of the plurality of separately executable tasks, wherein the selected task comprises a set of discrete actions, wherein each set of actions has a set of indices indicating the order in which each action is executed; and randomising the set of indices for the discrete actions before execution of the selected task.
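The index-shuffling idea above can be illustrated with a short sketch (hypothetical names; XOR accumulation stands in for any set of commutative operations). Because the operations commute, shuffling the index set changes the execution order, and hence the emitted signals, without changing the result:

```python
# Sketch of task-level randomisation of a set of commutative operations.
import random

def run_commutative(ops, state):
    """Execute a set of commutative operations in a random order."""
    indices = list(range(len(ops)))
    random.shuffle(indices)          # randomise the set of indices
    for i in indices:                # order of execution is no longer fixed
        state = ops[i](state)
    return state

# XOR accumulations commute, so any order yields the same result.
key_bytes = [0x2B, 0x7E, 0x15, 0x16]
ops = [lambda s, k=k: s ^ k for k in key_bytes]
```

For example, `run_commutative(ops, 0)` always equals the XOR of all four key bytes, whatever order was chosen.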
- the at least one task may be selected by determining the length of time required to execute each task in the pipeline and selecting one or more tasks which take longer to execute than other tasks. The tasks which take longer to execute may be more vulnerable to successful SCAs.
- the at least one task may be selected based on the number of commutative actions within the plurality of separately executable tasks.
- the at least one task may contain the most commutative actions, or a greater proportion of commutative actions than non-commutative actions.
- the randomisation at a task level may also comprise separating the or each selected task into the set of discrete actions.
- the set of discrete actions may be obtained by unrolling one or more loops within the task into the discrete actions, wherein each discrete action is commutative.
- the set of indices may be obtained from the loop indices.
- the at least one task may be selected based on the number of times the at least one task repeats in the pipeline. For example, the selected task may have more repeats.
- randomisation at a task level may be introduced for each occurrence of the same task, for example by identifying each occurrence of a repeating task and randomising the set of indices for each occurrence.
- randomisation at a task level may be introduced for each repeat of at least one task or each repeat of several tasks within the pipelines. For example, where there are four tasks which repeat the same number of times within the algorithm, there may be considered to be 25% randomisation at a task level when randomisation is introduced for each repeat of one of the four tasks.
- the computer algorithm may be an encryption algorithm encrypting plaintext or decrypting to plaintext.
- the plurality of separately executable tasks may comprise AddRoundKey, SubBytes [or inverse], ShiftRows [or inverse], and MixColumns [or inverse].
- the AddRoundKey may be the selected task because this is rich in commutative actions and is repeated multiple times in the pipeline.
- any computer algorithm where there are one or more parts that one wishes to remain secret and immune from SCAs can benefit from the invention.
- algorithms involved in blockchain processing would benefit from having the relevant part or parts of blockchain processing being kept secret.
- the invention may, for example, be applied to Secure Multiparty Computation [SMC].
- SMC Secure Multiparty Computation
- Machine learning algorithms have many repeated processes that can be converted to execute in accordance with the invention. Moreover, in-house and custom banking algorithms which evaluate stock and foreign exchange for purchase would also benefit from having parts immune from SCAs.
- the computer algorithm may be a hashing algorithm, e.g. SHA256.
- apparatus for controlling encryption (or decryption) which comprises memory in which the encryption (or decryption) algorithm is stored as at least one pipeline comprising a plurality of separately executable tasks; and a processor which is configured to receive at least one plaintext input (ciphertext input); randomise the plurality of separately executable tasks at a pipeline level and/or at a task level; and execute the randomised plurality of separately executable tasks to encrypt plaintext (or decipher ciphertext).
- the encryption algorithm can be any one from the many known to the skilled person. In an earlier competition run by the National Institute of Standards and Technology (NIST), there were five algorithms referred to as the “AES finalists” namely, Rijndael, Serpent, Twofish, RC6 and MARS.
- Rijndael’s algorithm has multiple stages, each of which comprises one or more steps selected from: AddRoundKey, SubBytes or its inverse, ShiftRows or its inverse, and MixColumns or its inverse.
- Each of these steps may be considered to be an executable task which is of sufficiently small granularity for the method and system in which randomisation may be introduced at the pipeline level.
- the granularity may be considered to be fine.
- the executable task for AddRoundKey comprises two nested loops; an outer loop executes four iterations of the inner loop, which itself has four iterations - resulting in sixteen iterations of the EXclusive Or (XOR) operation which is a set of commutative operations.
- Each outer loop may be considered to comprise a set of four discrete actions with each set of actions having a set of indices (in this example, one to four).
- each inner loop may be considered to comprise a set of four discrete actions with each set of actions having a set of indices (in this example, also one to four).
- the randomisation at a task level may thus be introduced by randomising the set of indices for at least one of the inner and outer loops.
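As a hedged illustration of the unrolled AddRoundKey described above, the sixteen XOR actions can be indexed by (row, column) pairs and the whole index set shuffled before each run; because each cell's XOR is independent and commutative, the resulting state is unchanged. The function name and 4x4 list-of-lists representation are assumptions, not the patent's:

```python
# Sketch of AddRoundKey with the two nested 4-iteration loops unrolled
# into sixteen discrete XOR actions whose indices are shuffled per run.
import random

def add_round_key(state, round_key):
    """state, round_key: 4x4 lists of byte values; XORs in random order."""
    cells = [(r, c) for r in range(4) for c in range(4)]  # 16 discrete actions
    random.shuffle(cells)                 # randomise the set of indices
    for r, c in cells:
        state[r][c] ^= round_key[r][c]    # each cell's XOR is independent
    return state
```

Every run emits the sixteen XORs in a fresh order, yet the output state is identical to that of the conventional nested loops.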
- shuffling of indices can also be applied to the 16 discrete actions of the SubBytes task, the four discrete actions (cycles) of the MixColumns task and the four discrete actions (stages) of ShiftRows task. There may also be shuffling within each of the cycles of the MixColumns and ShiftRows tasks.
- the AES finalist, Rijndael, can be run with different key sizes, e.g. 128, 192 and 256 bits, and different numbers of rounds: 10 rounds for AES-128, 12 rounds for AES-192 and 14 rounds for AES-256.
- in the NIST post-quantum cryptography competition, the finalists are Classic McEliece, Crystals-Kyber, NTRU and Saber in the first group, and Crystals-Dilithium, Falcon and Rainbow in the second group. Three of the first group admit that they are vulnerable to side channel attack without modification, and NTRU is silent on the matter. All these finalists are computer algorithms which can be modified to execute in accordance with the invention. Many of the digital signature candidates are based on the Keccak family, which also underpins SHA3; this group comprises non-prime based algorithms, so they are safe from quantum computers running Shor’s algorithm.
- the Rijndael AES finalist is also non-prime based.
- several pipelines may be processed.
- several encryption algorithms may be decomposed into executable tasks in different pipelines running on different plaintext feeds or decryption modes.
- the apparatus typically receives a plurality of inputs, for example in the context of encryption, there may be several sources of plaintext or ciphertext.
- the plurality of inputs may be processed at the same time, for example a video stream together with an audio stream. In this regard, the temporal mixing of chunks of each source further adds to the randomness and thereby the resistance to SCAs.
- Randomisation may be introduced at the task level, e.g.
- true randomness is the quality or state of lacking any pattern or principle of organization, being entirely unpredictable and, therefore, unable to convey any information whatsoever. It will be appreciated that true randomness is, in fact, an abstract concept that can only be achieved when a system is at the theoretical limit of disorganization or maximum entropy. However, in cryptographic theory, the term “true randomness” describes a quality of the order of events such that they are not derived from any deterministic logic. Physical phenomena that are expected to be random, such as atmospheric noise, thermal noise, other external electromagnetic phenomena and quantum phenomena, must be sampled and measured. [53] True randomness may be slow.
- the quality of “Cryptographically Secure Pseudo-Randomness” depends upon the mathematical technique used to generate the Pseudo-Random Numbers (PRN).
- PRN Pseudo-Random Numbers
- depending on the size and “quality” of the sequence-initiating seed, the output will have a greater or lesser degree of randomness.
- that sequences of numbers can possess varying “amounts” of randomness may seem odd, but using various statistical tests it is possible to assess the amount of information that such sequences reveal and to assign degrees of entropy to them.
- the apparatus may comprise a random number generator to assist the processor when randomising the plurality of executable tasks at a pipeline or task level.
- the processor may be configured to randomise the plurality of separately executable tasks at a pipeline level and/or a task level by communicating with the random number generator.
- Random number generators fall into two main categories: fast but imperfect pseudo-random number generators (PRNGs) and slow but perfect-entropy True Random Number Generators (TRNGs), which for all practical purposes cannot be guessed or predetermined and typically use external physical phenomena to generate true randomness.
- PRNG pseudo-random number generators
- the temporal and topological scrambling and/or the index shuffling may use a random number generator.
- the choices are implementation dependent but at the very least a fast PRNG is required to drive the shuffling permutations of indices and pointers required by the methods described below.
- a combination of both PRNG and TRNG may be preferred where a slow TRNG is used, at intervals (implementation dependent), to seed the PRNG.
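A combination of a slow TRNG periodically re-seeding a fast PRNG might be sketched as follows. This is an illustrative Python sketch only: `os.urandom` stands in for the hardware entropy source, the class name is invented, and the reseed interval of 4096 draws is an arbitrary placeholder for the implementation-dependent interval mentioned above:

```python
import os
import random

class ReseededPRNG:
    """Fast PRNG periodically re-seeded from the OS entropy pool, which here
    stands in for a slow hardware TRNG. The interval is illustrative only."""

    def __init__(self, reseed_interval=4096):
        self.interval = reseed_interval
        self.draws = 0
        self.rng = random.Random(os.urandom(32))   # slow true-entropy seed

    def randbelow(self, n):
        if self.draws >= self.interval:            # demand-led reseed
            self.rng.seed(os.urandom(32))
            self.draws = 0
        self.draws += 1
        return self.rng.randrange(n)               # fast pseudo-random draw
```

In a hardened implementation the Mersenne Twister behind `random.Random` would be replaced by a NIST-approved CSPRNG; the structure (fast generator, periodically re-seeded from true entropy) is the point being illustrated.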
- both a hardware generated source of true randomness such as the on-chip entropy source present in most modern processors
- a demand-led fall back to periodically re-seeded software-based, NIST-approved, cryptographically secure pseudorandom number generators (CSPRNGs)
- CSPRNGs cryptographically secure pseudorandom number generators
- CSPRNGs and on-chip true randomness may add further to the entropy thwarting SCAs, since the attacker cannot establish a correlation with a known CSPRNG.
- the invention also thwarts FIAs.
- FIAs have a high dependency on each task being executed at exactly the same time and in the same order in each iteration of an algorithm.
- the random out of order execution of each task for each iteration which is key to the inventive techniques described above, means that there is no consistency between one round and another.
- a fault injection attacker has no reliable event to focus upon because they can never expect task X to occur at time Y in a repeatable fashion.
- present techniques may be embodied as a system, method or computer program product. Accordingly, present techniques may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. For example, the automatic encryption of data as it is stored on a memory chip is commonplace and the invention may be integrated into such memory chips as part of their encrypting process.
- present techniques may take the form of a computer program product embodied in a computer readable medium having computer readable program code embodied thereon.
- the computer readable medium may be a computer readable signal medium or a computer readable storage medium.
- a computer readable medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
- Computer program code for carrying out operations of the present techniques may be written in any combination of one or more programming languages, including object oriented programming languages and conventional procedural programming languages. Code components may be embodied as procedures, methods or the like, and may comprise sub-components which may take the form of instructions or sequences of instructions at any of the levels of abstraction, from the direct machine instructions of a native instruction set to high-level compiled or interpreted language constructs.
- Embodiments of the present techniques also provide a non-transitory data carrier carrying code which, when implemented on a processor, causes the processor to carry out any of the methods described herein.
- the techniques further provide processor control code to implement the above-described methods, for example on a general purpose computer system or on a digital signal processor (DSP).
- DSP digital signal processor
- the techniques also provide a carrier carrying processor control code to, when running, implement any of the above methods, in particular on a non-transitory data carrier.
- the code may be provided on a carrier such as a disk, a microprocessor, CD- or DVD-ROM, programmed memory such as non-volatile memory (e.g. Flash) or read-only memory (firmware), or on a data carrier such as an optical or electrical signal carrier.
- Code (and/or data) to implement embodiments of the techniques described herein may comprise source, object or executable code in a conventional programming language (interpreted or compiled) such as Python, C, or assembly code, code for setting up or controlling an ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array), or code for a hardware description language such as Verilog (RTM) or VHDL (Very high speed integrated circuit Hardware Description Language).
- a controller which includes a microprocessor, working memory and program memory coupled to one or more of the components of the system.
- a logical method may suitably be embodied in a logic apparatus comprising logic elements to perform the steps of the above-described methods, and such logic elements may comprise components such as logic gates in, for example, a programmable logic array or application-specific integrated circuit.
- Such a logic arrangement may further be embodied in enabling elements for temporarily or permanently establishing logic structures in such an array or circuit using, for example, a virtual hardware descriptor language, which may be stored and transmitted using fixed or transmittable carrier media.
- the present techniques may be realised in the form of a data carrier having functional data thereon, said functional data comprising functional computer data structures to, when loaded into a computer system or network and operated upon thereby, enable said computer system to perform all the steps of the above-described method.
- Fig. 3 is a flowchart illustrating the method of randomising at a pipeline level which may be carried out on the apparatus of Fig. 1;
- Fig. 4 shows the double buffered random queue of Fig. 2 together with the steps in the process;
- Fig. 5a is a flowchart for AES encryption and decryption which is an example of a computer program which can be processed using the method and apparatus described above;
- Fig. 5b is a schematic diagram illustrating different levels of task granularity applied to the AES algorithm
- Fig.6 is a diagram of an AES dataflow communicating task pipeline of ten rounds for a 128-bit key relating to the encryption process illustrated in Fig.5
- Fig.7a is a key for each of the steps in Fig.7b and 7c
- Fig.7b and 7c are alternate representations of the pipeline in Fig.6 using the key in Fig.7a
- Figs.8a and 8b are representations of two alternative arrangements having two pipelines using the key in Fig.7a
- Fig.9a is a variation of the apparatus shown in Fig.1
- Fig.9b is a representation of the pipeline for Fig.9a using the key in Fig.7a
- Figs.10a and 10b illustrate a further arrangement of the apparatus in two configurations
- Fig.10c is a representation of
- Fig. 11c is a variation of the apparatus shown in Fig. 9a with six processing cores and related buffered random queues;
- Fig. 12a is a schematic illustration of a circular buffer which is used to implement an interoperation communication queue, e.g. SPSC;
- Figs. 12b and 12c are schematic illustrations of pipelines using SPSC queues at the coarse and medium granularity levels respectively;
- Fig.12d is a schematic flowchart for implementing random out of order execution of the pipeline of Fig.12c;
- Figs. 13a and 13b are schematic representations of the commutative operations within ShiftRows and MixColumns of the AES algorithm, respectively;
- Fig.13c is a flowchart of the method of randomising at a task level which may be carried out on the apparatus of Fig.13d;
- Fig.13d is a schematic block diagram of the apparatus for carrying out the method of Fig.13c;
- Figs. 13e and 13f are schematic representations of tuneable shuffling of the commutative operations within AddRoundKey and SubBytes of the AES algorithm, respectively;
- Figs.14a and 14b are screen shots from an oscilloscope measuring the test apparatus carrying out standard AES encryption and AES encryption using the method of Fig.13c, respectively;
- Figs. 15a to 15e are test results indicating the performance of AES encryption without any mixing and with different levels of mixing as implemented using the method of Fig.13c;
- Fig. 16 is a schematic diagram illustrating different levels of task granularity applied to the Keccak-f algorithm;
- SCAs Side channel attacks
- FIAs fault injection attacks
- an algorithm can be adapted so that it is amenable to randomised out of order execution (ROOOX) and therefore resistant to SCAs and FIAs.
- Adapting the algorithm may comprise examining the parts (for example tasks and/or operations) within the algorithm and the order in which they occur. Commuting steps are those which can be executed in any order and non-commuting steps must be executed in a specific order. An element of the adaptation may thus include identifying the commuting and non-commuting steps. Different techniques for random execution may then be applied to commuting and non-commuting tasks (or operations) as explained in more detail below. [98] The process of modifying the part(s) or whole of the algorithm to enable ROOOX may be termed “pugging”.
- the dictionary definitions of pugging include the act or process of working and tempering clay to make it plastic and of uniform consistency, or deadening sound, e.g. by laying mortar or the like between the joists under the boards of a floor or within a partition.
- the goal of pugging an algorithm is to enable flexibility of execution to the point where the randomised out of order execution of its parts prevents an attacker gaining any useful information from the analysis of its side-channel emissions during execution.
- the act of pugging may be described as an approach that breaks apart, or decomposes, an algorithm into its smaller parts in order to understand the relationship between its execution and the side-channel information which is leaked during execution.
- the decomposed parts may then be randomised in time (temporally) or in both time and place (temporally and topologically). As described below, this is achieved by using a system which is able to change the order in which parts are executed by one or more processors and may also change the location (i.e. the processor/processing core) which is executing the part of the algorithm.
- Commuting and non-commuting operations need to be randomised differently and thus a successful implementation requires one or both of:
a) a means of elastically executing commuting operations temporally, or temporally and topologically;
b) a means of elastically executing non-commuting operations temporally, or temporally and topologically.
Non-Commutative Operations [100]
- FIG. 1 shows a schematic diagram of one system for executing a computer algorithm so that the risk of successful side channel attacks is reduced.
- this is the simplest arrangement with a single core microprocessor 10 which accesses pointers from a double buffered random queue (DBRQ) 12.
- DBRQ double buffered random queue
- the DBRQ is long enough to store a plurality of pointers each of which points to a task within the computer algorithm.
- the DBRQ may thus be considered to be a pipeline comprising a plurality of separately executable tasks.
- a pipeline is particularly suitable when tasks are non-commuting and thus need to be executed in a specific order.
- the connections in the pipeline retain the logical order of the processing but physical execution of the algorithm’s order is random.
- Non-commuting tasks are critically dependent on the immediate outcome of the neighbouring tasks within the pipeline.
- non-commuting tasks may also be termed operations where an operation is defined in mathematics as a function which takes zero or more input values (called operands) to a well-defined output value.
- operands input values
- A pointer is a programming language object that stores the memory address of another value located in computer memory.
- a pointer references a location in memory, and obtaining the value stored at that location is known as dereferencing the pointer.
- the actual format and content of a pointer variable is dependent on the underlying computer architecture. However, as schematically shown in Fig.1, the processor 10 accesses a memory 14 based on the pointer from the DBRQ 12.
- Each pointer (ptr) is represented with a single symbol, e.g.
- the memory may comprise a volatile memory, such as random access memory (RAM), for use as temporary memory, and/or non-volatile memory such as Flash, read only memory (ROM), or electrically erasable programmable ROM (EEPROM), for storing data, programs, or instructions, for example.
- RAM random access memory
- ROM read only memory
- EEPROM electrically erasable programmable ROM
- each pointer points to a piece of executable code.
- Each piece of executable code may be part of a larger computer algorithm.
- a computer algorithm is a sequence of steps that take an input and convert it into an output via a series of intermediate input and output stages. Traditionally, a single hardware execution resource steps and loops through each of the computational stages in the form of commands.
- a task may be defined as a unit of execution or a unit of work. Alternative terms include process, light-weight process, thread (for execution), step, request or query (for work).
- a generic task is one where the details of purpose are not relevant, such as in the scheduling of tasks, and may be represented by a single symbol, e.g. a black box. This black box can be a placeholder for any of the tasks, e.g. the serial fractions, within an algorithm.
- a memory pointer may also be represented with a single symbol, e.g. an empty box.
- such a DBRQ as shown in Fig. 1 is suitable for use with a decomposed 10 round 128bit key AES encryption pipeline. Accordingly, in this example, there are 38 pointers but it will be appreciated that the length may be adjusted to suit the computer algorithm which is being executed on the system.
- the purpose of a task is of no concern to the DBRQ and so the DBRQ may be conveniently represented as 2 short groups of either vertical or horizontal (as shown in Fig. 1) task pointer symbols. Either side of the task pointer symbols, Fig.1 shows a PUSHable input and a POPable queried output to complete the symbolic representation of the DBRQ.
- the apparatus may comprise a random number generator 16 to assist the processor when randomising the plurality of executable tasks at a pipeline level using the DBRQ.
- the temporal scrambling may use both a hardware generated source of true randomness, such as the on- chip entropy source present in most modern processors, and a demand led fall back to periodically re- seeded software-based, NIST approved, cryptographically secure pseudorandom number generators (CSPRNGs).
- CSPRNGs cryptographically secure pseudorandom number generators
- Fig.3 is a schematic flowchart illustrating the steps which may be carried out by the system of Fig.1 with the DBRQ of Fig.2.
- a first step S300 the processor issues a first POP instruction to the DBRQ which results in an internal POP request to the pending task queue in the DBRQ.
- the processor determines whether the pending task queue is empty in step S302. When it is determined that the pending task queue is empty, the pointers in the full future task queue are randomly shuffled in step S304. The randomised pointers in the future task queue are then swapped into the empty pending task queue in step S306. In other words, the failed POP request effectively triggers a request to swap the empty pending task queue and the full future task queue which exchanges their roles. [110] As shown in Fig.3, a new POP instruction may then be issued to the DBRQ (step S300).
- on this occasion, the determination of whether the pending task queue is empty (step S302) will find that the pending task queue is not empty.
- the now full, randomly ordered, pending queue can fulfil the POP request as set out at step S308.
- the DBRQ (particularly the pending task queue) releases a task to the processor for execution.
- There is an attempt to execute the task at step S310.
- intermediate input data may be needed for the task to be executed, thus it is not always possible to execute a task.
- the processor issues a DBRQ PUSH instruction to recycle the completed task pointer at step S312.
- the DBRQ stores this recycled pointer in the future task queue which is initially empty.
- steps S300, S302, S308, S310 and S312 are then repeated for each task pointer in the pending task queue so that the future task queue is gradually filled.
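The POP/swap/PUSH cycle described above can be sketched as a small Python class. This is a minimal illustration only (the class and method names are invented, and a Python list stands in for each hardware queue): a POP against an empty pending queue shuffles the future queue and swaps the roles of the two queues, while PUSH recycles a pointer into the future queue:

```python
import random

class DoubleBufferedRandomQueue:
    """Minimal DBRQ sketch following the flow of Fig. 3: a POP on an empty
    pending queue shuffles the full future queue and swaps the two roles."""

    def __init__(self, task_pointers):
        self.pending = []                    # initially empty
        self.future = list(task_pointers)    # initially holds all pointers

    def pop(self):
        if not self.pending:                                   # S302: pending empty?
            random.shuffle(self.future)                        # S304: shuffle future queue
            self.pending, self.future = self.future, self.pending  # S306: swap roles
        return self.pending.pop(0)                             # S308: fulfil the POP

    def push(self, ptr):
        self.future.append(ptr)                                # S312: recycle pointer
</```

Because the shuffle happens only at the swap, popping all N pointers and pushing each one back guarantees that every pointer is presented exactly once per cycle, which is the fairness property relied upon below.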
- Fig.4 is a schematic system and flowchart illustrating a snapshot of the process of Fig.3.
- the reference numbers for features in common with Fig.2 and 3 are used in Fig.4.
- the future task queue 22 comprises a number (in this example, three) of task pointers which have been recycled following their execution, if possible, by the processor 10.
- a task pointer 30 (which has the sequential order number 10) is shown as being fetched by the POP instruction (S300) to be executed by the processor 10.
- Another task pointer 32 (which has the sequential order number 13) is shown within the processor 10 for execution.
- FIG. 4 illustrates the process of attempting to execute each task in more detail.
- the processor determines whether the output is full, i.e. has the task already been executed in a previous cycle through the task pointers. If the processor determines that the output is full, the task pointer is recycled (step S312) and returned to the future task queue.
- the processor then determines whether the input is empty (step S402). Some of the tasks will be dependent on other tasks having already been performed. If the input is empty (i.e. the task cannot be completed until at least one other task is completed), the task pointer is recycled (step S312) and returned to the future task queue. If the input is not empty, the task is ready for execution.
- the checks to determine whether the input is empty and/or the output is full may be considered to be preconditions which need to be met before executing the task. [114] As explained above, the pointer references a location in memory. Once the preconditions are met, the next step S404 is to read the input from the memory 14. The appropriate task is then carried out in step S406.
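The precondition-checked attempt at a task (steps S400 to S406) can be sketched as a single function. The names and the single-slot output capacity are illustrative choices, not part of the specification; the point is the fail-fast shape, where either failing precondition returns immediately so the pointer can be recycled:

```python
def attempt_task(task, inputs, outputs, capacity=1):
    """One attempt at a pipeline task. Fail fast if the output buffer is full
    (S400) or the input buffer is empty (S402); otherwise read the input
    (S404), execute the task and write the output (S406).
    Returns True only when the task actually ran."""
    if len(outputs) >= capacity:   # S400: output full -> recycle pointer
        return False
    if not inputs:                 # S402: input empty -> recycle pointer
        return False
    value = inputs.pop(0)          # S404: read input from memory
    outputs.append(task(value))    # S406: execute and write output
    return True
```

The failure paths do no work beyond two cheap checks, which reflects the failure-fast-track requirement discussed later in this section.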
- the mechanism of keeping two separate queues, one for pending task pointers and one for future task pointers, and only applying a random transformation to the temporal ordering when the queues are swapped means that the double buffered random queue ensures two critical fairness guarantees, namely:
1. Although each task will be presented for execution in a random order, it is guaranteed to be executed at least once within the execution time of the total number of tasks in the pipeline.
2. Although the tasks in the pipeline will be presented for execution randomly, the whole pipeline is guaranteed to be executed.
[117] The importance of the critical fairness guarantees may be illustrated by considering a short algorithm which comprises several tasks.
- for example, the function f(x) = 3 sin(x²) + 1 is composed of four tasks (or operations) performed in the sequence square, sine, multiplication and addition.
- the ordered sequence of the operations can be broken into steps with the intermediate (or intermediary) outputs represented by a, b, c and d.
- the DBRQ may allow the algorithm to execute more than one input, which may be represented as x0, x1... xn. These inputs may be represented in a queue as [xn... x0].
- the pipeline process may be represented as: [xn... x0] → square → [] → sine → [] → multiply 3 → [] → add 1 → f(x)
- a pointer to each of these tasks is initially included in the future task queue and then the pointers are randomly shuffled into the pending task queue. If the order of the task pointers is randomly shuffled to be 3, 4, 1, 2 in the pending task pointer queue, following the process above will mean that the task pointer to task 3 is selected first and there is an attempt to execute step 3. The precondition checks will show that there is no intermediate input b0, so the step will fail and thus there is no change to the representation above.
- there will then be an attempt to execute step 4 which will also fail and return the same result as above. Only when the third task pointer is taken from the pending task queue will the attempt on the task be successful because the required input to step 1, i.e. x0, is available and the intermediate output a0 can be output. Thus, the result below is returned and it is noted that the input queue length is reduced: [xn... x1] → square → [a0] → sine → [] → multiply 3 → [] → add 1 → f(x) Step 2 “sine” can also now be implemented because the intermediate input a0 is available.
- step 4 “add” fails because there is no intermediate input c0 and the result above is returned again.
- Step 3 receives the intermediate input b0 and can return intermediate output c0 and so the result below is returned: [xn... x2] → square → [a1] → sine → [] → multiply 3 → [c0] → add 1 → f(x)
- step 2 “sine” is processed and the intermediate input a1 is available so the result below is returned: [xn... x2] → square → [] → sine → [b1] → multiply 3 → [c0] → add 1 → f(x) [119]
- the task pointers are randomly shuffled again and moved to the pending task queue.
- step 3 “multiply” receives the intermediate input b1 and can return intermediate output c1 and so the result below is returned: [xn... x2] → square → [] → sine → [] → multiply 3 → [c1, c0] → add 1 → f(x)
- step 4 “add” receives the intermediate input c0 (there is only a single processor) and can return intermediate output d0 and so the result below is returned: [xn... x2] → square → [] → sine → [] → multiply 3 → [c1] → add 1 → [d0]
- Step 2 “sine” tries to execute but there is no intermediate input so the result above is unchanged.
- step 1 “square” is processed and the intermediate input x2 is available so the result below is returned: [xn... x3] → square → [a2] → sine → [] → multiply 3 → [c1] → add 1 → [d0]
- the pipeline has now processed its first element to completion but as shown there are additional inputs to process. Given that the pipeline is not empty, processing continues using randomly permuted out of order execution (or randomised out of order execution – the terms may be used interchangeably) until all inputs have been processed. In other words, the tasks in the pipeline continue to be presented for execution in a random order but are only executed if the associated intermediate input is available. Thus, for each input, the tasks are executed in the correct order.
- a first intermediate output a0 will be generated, i.e. only one of the tasks can be executed.
- the intermediate output a0 can then be used in the next pass through the pipeline to generate first intermediate output b0 and if there is more than one input, a second intermediate output a1 will also be generated, i.e. two of the tasks can be executed.
- Intermediate output c0 is generated in the third pass together with a second intermediate output b1 and a third intermediate output a2. In other words, three of the tasks execute when there are multiple inputs.
- the first output d0 is generated on the last pass and thus the algorithm has produced an output after four executions.
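The worked trace above can be reproduced with a short Python sketch. The function name and the buffer representation are illustrative, not taken from the specification: each pass shuffles the four task pointers, a task fires only when its input buffer is non-empty (the precondition check), and every input nevertheless emerges correctly as f(x) = 3 sin(x²) + 1:

```python
import math
import random

def run_rooox_pipeline(inputs, seed=None):
    """Randomised out-of-order execution of the worked example pipeline
    [x...] -> square -> sine -> multiply 3 -> add 1. Tasks are presented in a
    random order each pass but only execute when their input is available,
    so per input the logical order is always preserved."""
    rng = random.Random(seed)
    buffers = [list(inputs), [], [], [], []]       # x, a, b, c, d
    stages = [(0, 1, lambda v: v * v),             # step 1: square
              (1, 2, math.sin),                    # step 2: sine
              (2, 3, lambda v: 3 * v),             # step 3: multiply 3
              (3, 4, lambda v: v + 1)]             # step 4: add 1
    while len(buffers[4]) < len(inputs):
        order = stages[:]
        rng.shuffle(order)                         # shuffle the task pointers
        for src, dst, fn in order:
            if buffers[src]:                       # precondition: input not empty
                buffers[dst].append(fn(buffers[src].pop(0)))
    return buffers[4]
```

Because each pass fires every stage whose input buffer is non-empty, at least one value advances per pass, so the loop always terminates, and the FIFO buffers guarantee the outputs appear in input order.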
- the algorithm which is executed as described above may be any algorithm with any number of tasks arranged in a pipeline. Regardless of the action of a pipeline task there are defining features common to all tasks. For example, failing any of the preconditions required for execution should happen in the shortest amount of time possible so that the task pointer can be recycled almost immediately. Such a failure-fast-track mechanism is key in order to minimise the overheads inherent in decomposing an algorithm into communicating sequential tasks.
- the task may comprise a minimal number of repeats and, as explained below, may further have the order of its indices randomly shuffled every time.
- all repeating sections in a task are essential, short, discrete and randomly shuffled - otherwise they should be broken out into separate tasks.
- a task with looping sections could have each of the loop stages broken out into separate smaller communicating tasks. However, doing so rapidly balloons the unavoidable overheads of inter-task communication and randomised scheduling affecting performance and scalability. Therefore, a balance must be struck between breaking up an algorithm into ever smaller tasks - the task granularity - and the overheads of the system (inter-task communication, temporal randomisation, topological randomisation and scheduling).
- Examples of algorithms which may be separated into tasks and processed as described above include encryption algorithms and these can be any one from the many known to the skilled person.
- Examples of encryption algorithms are the five algorithms referred to as the “AES finalists” in the NIST selection process namely, Rijndael, Serpent, Twofish, RC6 and MARS.
- the Rijndael AES encryption algorithm will be used as an example algorithm.
- the algorithm may be decomposed into a task pipeline with several discrete tasks.
- Fig.5a illustrates the steps or tasks involved in encrypting plaintext [on the left] and decrypting cipher text [on the right].
- Each stage consists of one or more steps selected from: AddRoundKey, SubBytes or its inverse, ShiftRows or its inverse, and MixColumns or its inverse. These steps occur in certain orders at certain times and some are grouped in loops. It would seem that each named step is a suitable place to start decomposing the AES algorithm into tasks that communicate with each other via inputs and outputs to form a communicating task pipeline.
- the AES algorithm can be decomposed into smaller constituent parts, all the way down to the microcode of the CPU if needed. How far an algorithm is broken up into smaller parts is termed the granularity of decomposition and examples of levels of granularity are set out below:
* Coarse grained – such as splitting the AES algorithm into its constituent rounds
* Medium grained – such as splitting a round into its component actions, e.g. SubBytes
* Fine grained – such as splitting SubBytes into its program language operations, e.g. C++
* Very fine grained – such as splitting the program language operations into their assembly language components
* Ultra fine grained – such as splitting the assembly language opcodes into their on-chip microcode.
- Fig.5b illustrates the coarse, medium and fine grained composition.
- the coarse grained level of decomposition comprises 10 rounds. The order of execution of these rounds is strictly non-commutative.
- the medium grained level of decomposition comprises 4 sub-steps in each of rounds 1 to n-1 (AddRoundKey, SubBytes, ShiftRows and MixColumns) and 3 sub-steps in the last round (AddRoundKey, SubBytes and ShiftRows). The sub-steps in each round are also strictly non-commutative.
- the encryption algorithm of Fig.5a has been decomposed in Fig.6 to show in detail the tasks which communicate with each other via intermediate inputs and outputs [shown by arrows which represent the data flows] to form a communicating task pipeline.
- N 10 rounds
- the resulting AES decomposed task pipeline (medium granularity) consists of 36 separate task duplicates of the four basic tasks, an input stage (plain block, p) and an output stage (encrypted block, e) making 38 task based stages in total having 40 separate communication channels.
- the key size used for an AES cipher specifies the number of transformation rounds that convert the input, called the plaintext, into the final output called the ciphertext.
- the full algorithm will be executed after at most 38 runs of the pipeline.
- a data block will pass through the pipeline and be fully encrypted after at most 38 runs of the pipeline.
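The at-most-38-runs bound can be checked with a small simulation. This is an illustrative sketch only (the function name is invented, and a strictly ordered chain of stages stands in for the AES pipeline): in each shuffled sweep only the frontier task's precondition can be met, so at least one stage completes per sweep and a single block never needs more than one sweep per stage:

```python
import random

def sweeps_for_one_block(n_stages=38, seed=None):
    """Count shuffled sweeps needed to push one data block through an n-stage
    strictly ordered pipeline. Only the frontier task can fire, so at least
    one stage completes per sweep, giving at most n_stages sweeps."""
    rng = random.Random(seed)
    frontier = 0        # index of the next stage the block must pass through
    sweeps = 0
    order = list(range(n_stages))
    while frontier < n_stages:
        rng.shuffle(order)              # one randomised run of the queue
        for task in order:
            if task == frontier:        # precondition met only at the frontier
                frontier += 1           # the block advances one stage
        sweeps += 1
    return sweeps
```

Within one sweep several consecutive stages can fire if they happen to be shuffled into the right relative order, so the observed count is usually well below the worst case of 38.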
- the processor would confirm that the output connection is not full and the input connection is not empty and then read a 128-bit data block from the input connection.
- the next step in the AES pipeline would then be applied to the read data block to transform the data block.
- the transformed (or modified) data block would then be written to the output connection.
- the pointer would be recycled to the back of the queue with a PUSH instruction and the whole process would be repeated for the next pointer in the queue until the queue is completed.
- Completing the steps in sequence means that the process is vulnerable to a SCA.
- a simple, naïve randomisation may be applied to the single queue of pointers, for example by selecting a pointer within the queue at random rather than selecting a pointer at the head of the queue.
- a disadvantage of such a process is that a random transformation must take place each time a task pointer is selected rather than each time the queues are swapped in the process described above.
- a naïve randomisation is likely to lead to a pipeline that almost never executes from beginning to end.
- the probability P0 of selecting the first stage of the pipeline is 1/38; the probability P1 of selecting the next stage of the pipeline is also 1/38, and so on up to P37.
- the probability of the data block progressing from the first stage of the pipeline to the second stage is P0 × P1, i.e. 0.026 × 0.026 = 0.000676 (using 1/38 ≈ 0.026).
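This probability can be checked directly. Note that the figure of 0.000676 quoted in the text follows from squaring the rounded value 1/38 ≈ 0.026; the unrounded product (1/38)² is approximately 0.000693, the same order of magnitude:

```python
n_stages = 38

# probability of picking stage 0 then stage 1 on two consecutive random draws
p_exact = (1 / n_stages) ** 2

# the text's figure, obtained by rounding 1/38 to 0.026 before squaring
p_rounded = round(1 / n_stages, 3) ** 2
```

Either way, a roughly one-in-1,500 chance of two consecutive useful selections explains why the naïve scheme almost never drives a block through all 38 stages.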
- Fig.8a illustrates two abbreviated multi-stage AES encryption pipelines p1 and p2 consisting of the four basic task types identified above and repeated, with input and output tasks, to a total of 38 tasks.
- Fig. 8b shows a mechanism for splitting the incoming data stream into plain data blocks p1 and p2, known as a demultiplexer (demux), and a mechanism for recombining the encrypted data blocks p1 and p2, known as a multiplexer (mux).
- the data blocks are split, encrypted by each pipeline and then recombined for onwards transmission.
- the two pipelines may be buffered together.
- Each pipeline could have a different plaintext or ciphertext feed.
- pipeline A might be involved with encrypting a video feed while pipeline B may be involved with encrypting an audio stream from a different source.
- the pointers to the 76 tasks of both the pipelines can be randomised in two possible ways.
- a first way is shown in Fig. 9a and comprises a single processor 10 and a single DBRQ 92 with an enlarged capacity when compared with Fig.1.
- in Figs. 10a and 10b there may be two DBRQs 112A and 112B (each having a capacity for 38 task pointers) and a scheduler 100.
- the memory and each of the pending and future task queues of the DBRQs are omitted for ease of display.
- Fig.9b shows the workings of the embodiment of Fig.9a.
- the single DBRQ contains all of the 76 tasks from the first and second AES pipelines p1 and p2.
- the single processor processes each task pointer (ptr) randomly presented from the pending task queue of the single DBRQ as described above and then returns the processed task pointer to the future task queue of the single DBRQ.
- the first DBRQ 112A contains the 38 tasks from the first AES pipeline p 1 and the second DBRQ 112B contains the 38 tasks from the second pipeline p 2 .
- the single processor 10 processes task pointers (ptr) in a round-robin fashion as determined by the scheduler 100.
- Fig.10b shows the single processor processing each task pointer from the pending task queue of the first DBRQ 112A as described above and then returning the processed task pointer to the future task queue of the first DBRQ 112A.
- 10c shows the single processor processing a task pointer from the pending task queue of the second DBRQ 112B as described above and then returning the processed task pointer to the future task queue of the second DBRQ 112B.
- the round robin fashion means that the arrangements shown in Figs. 10b and 10c are alternated, with a task from the first DBRQ 112A being completed followed by a task from the second DBRQ 112B, followed by another task from the first DBRQ 112A and so on.
- Fig 11a shows a variation of the arrangement shown in Fig.9a with a dual processing core 110. As in Fig.9a, there is a single DBRQ 120 of double capacity.
- the single DBRQ 120 randomises the temporal order in which the tasks are executed and may be termed a temporal DBRQ.
- the temporal DBRQ 120 is connected to a scheduler 200 which issues two POP instructions to the temporal DBRQ 120. In response to the POP instructions, when there are pointers in the pending task queue, two pointers are pushed to the scheduler 200. The number of pointers corresponds to the number of processing cores and it will be appreciated that two is merely indicative.
- a second DBRQ which may be termed a topological DBRQ 122 is also connected to the scheduler 200 to determine which core is to attempt to execute each of the two pointers.
- the topological DBRQ 122 is implemented in the same manner as the temporal DBRQ 120 but comprises a pool of pointers to the physical means of executing the task. In this case, there are two cores and hence two pointers, one for each core. The pair of pointers are initially in the future task queue of the topological DBRQ 122. A POP instruction to the empty pending task queue will trigger a shuffle of the task pointers before they are swapped to the pending task queue. The shuffle randomises which core will be selected. The scheduler uses the combined results from the temporal and topological DBRQs to determine which core is to attempt to execute which task at each moment in time.
- Fig. 11b is a variant of the arrangement of Figs. 10a and 10b and follows on from the embodiment in Fig. 8, in that there are now two temporal DBRQs 112A, 112B combined with a single topological DBRQ 122.
- a pointer from each of the temporal DBRQs 112A, 112B is fed to a scheduler 300, which also receives pointers regarding which core to use from topological DBRQ 122.
- the individual cores on the dual processing core 110 thus randomly receive tasks to execute.
- the randomisation is both temporal and in location.
- Figs.11a and 11b are scalable to multiple cores or any combination of different processing elements (e.g. processors, network machines, logical cores, physical cores) by using a topological DBRQ of appropriate capacity.
- Fig.11c shows a processor 210 having six cores, A, B, C, D, E, F.
- a temporal DBRQ 120 comprising a plurality of task pointers, for example 38 as described above.
- a topological buffer 122 comprising a plurality of core pointers, for example six to match the number of cores.
- Figs. 12a to 12d illustrate an alternative implementation of random out of order execution for non-commutative tasks. As set out above, this can be achieved when various conditions are met, including a wait-free communication queue and asynchronous operation.
- a mechanism that enables elasticity without breaking the logical, algorithmic order can typically be implemented using two fundamental components: a wait-free communication queue and an asynchronous execution shuffler.
- the wait-free communication queue must provide queued buffering of intermediate results between non-commuting operations.
- the shuffler must enable shuffling of the physical order of execution of the non-commuting operations that maintains the logical order of the underlying algorithm and meets the various conditions.
- a simple circular buffer can be used to implement the interoperation communication queue.
- Fig. 12a shows an example of this implementation.
- the buffering queue must be wait-free to enable asynchronous operation and thus the Abstract Data Type (ADT) operations of the queue must be conditional. This is typically achieved by first testing whether the operation can take place, returning a Boolean value of true on success and false on failure. For example, the operation to “push” an item into the back (tail) of the queue must first test whether the queue is full, returning false if so; otherwise, it returns true after the item has been added to the tail of the queue and the size of the queue has been increased by one. Similarly, the operation to remove an item from the front (head) of the queue must first check whether the queue is empty, returning false if so; otherwise, it returns true after the item has been removed and the size of the queue has been reduced by one.
- the performance of the circular buffer implementation may be improved by restricting the capacity of the queue to a power of 2.
- Such a “2^n” queue has the advantage of being able to replace the slower modulo division step, used to ensure circular progression of the indices, with a faster bit mask operation.
- the communication queue between non-commuting operations must be synchronised to prevent race conditions. This can be achieved by using the data structure known as a single-producer, single consumer (SPSC) queue.
- the wait-free feature can be implemented using light-weight concurrency data types known as “atomic variables” for the head and tail indices.
- an implementation in C++ is shown below:
/**
 * Concurrent wait free, lock free, single producer, single consumer, bounded FIFO queue
 * Capacity must be an unsigned power of 2 to enable performance advantage of bitwise masking.
 */
- a longer pipeline could be implemented consisting of 40 stages connected by 39 wait-free queues. This is schematically illustrated in Fig.12c.
- the permutations of the pointers in the pipeline are 10! and 40! respectively. Both the arrangements in Figs. 12b and 12c can be executed using wait-free single-producer, single-consumer queues.
- each pipeline can be executed out of order and asynchronously as follows:
* The function pointer sequence is shuffled.
* Each function pointer from the shuffled sequence is taken and execution is attempted.
* The function checks if it has a non-empty input queue and a non-full output queue and, if so, executes.
* Once all the functions in the sequence have attempted execution, the sequence is shuffled again and the process repeats until there are no more input bytes.
[150] In this way, at least one function will be able to make progress at each run of the shuffled sequence of pointers (in a similar manner to the simpler four stage example pipeline described above).
- Fig. 12d schematically illustrates a flowchart for implementing the pipeline of Fig. 12c.
- the pipeline requires 39 connecting queues of bytes, an input stream of plain text bytes, an output stream of cipher text bytes and a vector of function wrappers 20 to the pipeline stages.
- f1 is AddRoundKey-1
- f2 is SubBytes-1
- f3 is ShiftRows-1
- f4 is MixColumns-1
- f5 is AddRoundKey-2
- f6 is SubBytes-2 and so on.
- the shuffle module 24 shuffles the array of indices and hence the function pointer sequence is shuffled.
- a CPU 10 and a memory 14 are also part of the system.
- the array and the vector of function wrappers show the sequence without any shuffling, but it will be appreciated that any order may be generated by the RNG and shuffle module.
- each function pointer from the shuffled sequence is taken in turn and execution is attempted.
- before execution of the vector of function wrappers is attempted, there are first checks to see whether the output queue is full and whether the input queue is empty. These checks may be performed by the CPU 10 or any other suitable processing unit or processor (not shown). [154] When both the output queue is not full and the input queue is not empty, the function executes. The execution may be done by the CPU 10. The output of the execution is sent to the appropriate output queue which may be stored in memory 14.
- the function is not executed.
- the next function is then selected according to the next function pointer in the sequence and the pre-condition checks are repeated. Once execution of all the functions in the sequence has been attempted, the sequence is shuffled again, and the process repeats until there are no more input bytes. Despite the random out of order physical execution of each stage f in the pipeline, the logical order is maintained as a directed graph of function nodes connected by wait-free queues, i.e. preserving the algorithm despite the random order of execution of its parts.
- a multi-core and/or multi-processor implementation could use any scheduling approach but perhaps the simplest is to use the “grand central dispatch” approach.
- each DBRQ needs to have a switch to turn on the random transformation once all the buffers are loaded; this applies to the single-core, single-pipeline case as well.
- each of the tasks is a non-commutative operation within the overall AES computer algorithm. In other words, each task must be carried out in the specified order.
- Each task may thus be considered to be a communicating sequential process (CSP) “bubble”. All of the CSP bubbles are connected together into a pipeline using asynchronous wait-free queues as communication channels between the bubbles.
- the bubbles may be executed randomly out of order provided that at each turn, execution of all bubbles is attempted exactly once before the order is shuffled and the next turn proceeds. This is achieved by use of the temporal DBRQ as explained above. It will be clear that each queue in the double buffer is at least as long as the number of bubbles in the pipeline to avoid random stalls. [158] Initially, the process means that no final output will appear unless the random order generated is the correct order – a 1 in 38! chance for the example AES algorithm. However, the process is guaranteed to produce output after a number of runs (i.e. shuffles and execution (or attempt to execute) cycles) with the number of runs being less than or equal to the number of bubbles (38 in the case of the example AES algorithm).
- the normal distribution means that typically the pipeline will produce output after about 16 runs for the example AES algorithm. Additionally, in practice the pipeline has at least one data item in each of its communication queues after a short while (as shown above) and thus at every turn of executing the pipeline bubbles, each and every bubble will execute successfully and the pipeline no longer stalls.
Commutative Operations
- [159] As explained above, one of the design features for all tasks is that any repeated section of code within the task must itself be short and discrete. Nevertheless, a task may comprise a minimal number of repeats and these repeating tasks may be commutative, i.e. may be carried out in any order.
- AES operates on a 4x4 column major order array of bytes which may be termed the state.
- Each XOR operation may be considered to be a discrete action and the indices i, j indicate the order in which the action is executed. Given that those mounting SCAs are aware of these loops and their rhythmical and predictable computations, they can seek out and exploit the power consumption and EMF signals arising during their execution. There is no requirement to execute the loop(s) in an ordered fashion if the action being looped over is discrete, i.e. not dependent on previous action(s), and as demonstrated in the code below, the 16 iterations of the XOR operation for each AddRoundKey task are commutative, i.e. may be carried out in any order.
- the AddRoundKey task may be re-designed to exploit the commutative nature and introduce loop index shuffling - an example is shown below:
Code Block 2 - Loop Index Shuffled AddRoundKey
// This function adds the round key to state using Loop Index Shuffling.
// The round key is added to the state by an XOR function.
void rijndael::add_round_key(uint8_t* rkey, uint8_t* state) {
    if (blend_add[add_counter++]) {   // if the next item in the blend sequence is true(1)...
                                      // (NB add_counter is 8 bit so will overflow back to zero)
        shuffled_add.shuffle();       // shuffle the block array index sequence
        for (auto& j : shuffled_add)  // use the shuffled indices...
            state[j] ^= rkey[j];      // ...to XOR the round key into the state
    }
    ...
- Loop index shuffling thus results in 20,922,789,888,000 possible random permutations in the order in which the memory locations are addressed by the XOR action inside the AddRoundKey loops.
- Loop index shuffling provides temporal randomisation of the execution of the operations within the AddRoundKey task. Additionally, each of the 16 steps is a tiny embarrassingly parallel problem which may also be subject to topological randomisation. Given a multi-core processor which is amenable to parallel execution, each step could randomly be assigned to one of the cores. A Gustafson-Barsis linear scaling (up to 16 cores or hyperthreads) suggests that the performance penalty of mixing may be overcome.
- the s-box is a 16x16 matrix of byte values and consists of all the possible combinations of an 8-bit sequence. However, the box is not just a random permutation of these values and there is a well-defined method for creating the s-box tables. As before, each byte is separately replaced and this may be expressed as follows: [168] Each of these 16 replacements may be termed a discrete action and each is commutative and can be carried out in any order. Loop index shuffling can be applied in a similar manner to that in the AddRoundKey task and there are the same number of possible permutations, i.e. 16!.
- one format for the loop index shuffled arrangement may be expressed as follows: [169]
- in the ShiftRows task, the bytes in each row are cyclically shifted by a certain offset.
- the first row is left unchanged; each byte of the second row is shifted one to the left.
- the third and fourth rows are shifted by offsets of two and three respectively.
- the ShiftRows task changes the order of bytes in each row of the state for all rows except the first. This may be represented as follows: [170]
- Fig. 13a schematically illustrates the nested commutative steps within ShiftRows which may be shuffled, i.e. randomised for out of order execution.
- the MixColumns task is a 32-bit operation that transforms four bytes of each column in the state.
- the new bytes of the column are obtained by the given constant matrix multiplication in the Galois Field GF(2^8).
- each column is transformed using a fixed matrix and may be expressed as: [173]
- a finer grained mixing could occur inside each of these steps too.
- Fig.13c illustrates the main steps in the operation and Fig.13d is a schematic representation of a hardware implementation which is adapted from that shown in Fig.1 and thus the same reference numbers are used where appropriate.
- the first step is to determine the task to be executed (step S400). This may be determined by using a buffer as shown in Fig.13d. In this arrangement the buffer has all 38 tasks in temporal order but it will be appreciated that the DBRQs or SPSCs described above can be used if randomisation at both pipeline and task level is required. It will also be appreciated that the determining step may be derived from the algorithm itself depending on its implementation.
- the next step may be to read an input from memory 14 (step S402), e.g. the input may be original plaintext (or ciphertext) or an intermediate input generated by a previous task in the algorithm. If the DBRQ or SPSC is used as described above, the pre-condition checks may also be carried out before attempting to read an input to make sure that the task can be executed.
- the system comprises a random number generator 16 which may be the same as the ones described above in relation to Fig.1.
- the random number generator 16 provides the random number to a shuffle module 24.
- the shuffle module is shown as a separate module but it will be appreciated that the functionality may be incorporated in the CPU 10.
- the shuffle module 24 introduces a random shuffling permutation of the loop indices as shown in the array of indices 26.
- the array of indices 26 contains the 16 indices for the AddRoundKey task but this is merely exemplary.
- the trade-offs between speed of execution and SCA resistance may be balanced by using a blend of fully shuffled operations together with standard ordered execution.
- Fig.13e shows a similar schematic representation of the fork-join parallelisation method for the SubBytes task.
- the tuneable approach of selecting the execution pathway as shown in Figs.13e or 13f can be applied to any of the four stages of the AES algorithm (or to any other algorithm). If mixing is applied to all occurrences of one task, the mixing levels may be considered to be 25% mixing when one of the four tasks is mixed, 50% when two of the four tasks are mixed, 75% when three of the four tasks are mixed, and 100% when all tasks are mixed as described above. It will be appreciated that different levels of mixing may be achieved by tuning one or more of the tasks. In other words, mixing may be applied to one task or any number of the tasks or some but not all of the repeating tasks in the pipeline. [181] Shuffling the indices requires a random number to be generated.
- the task may request a new random number for each loop, e.g. 16 times, or may more optimally request the number of random numbers which are required at the same time.
- the random number may be requested before the task is executed or during the execution of the task.
- the order in which the random number is obtained during the method may also be optional. Where there are 16 cycles to be randomised, this may, for example, be achieved by requesting a single 64-bit random number at the start of the 16 shuffles and breaking the single random number into sixteen 4-bit numbers. This could be done quickly, for example by masking the lower nibble of the 64-bit random number and then shifting it right by 4 bits for the next round.
- the task may now be executed (step S406).
- the task may be executed as shown on a single CPU 10.
- each of the commutative cycles, e.g. 16 or 4 cycles, is suitable for parallel implementation on multiple cores.
- the output is written (step S408) for this task and the process repeats again, e.g. by issuing a POP instruction to obtain the next task in the pipeline from the buffer.
- the pipeline may even be full of intermediate results, one for each of the different inputs.
- the number of tasks that are executable with each execution of the pipeline thus may also vary.
- the number of permutations is also dependent on the number of runs of the algorithm which are needed to fully process all the inputs. It will be appreciated that, for a given plurality of sequential blocks of plaintext, the ones earlier in the sequence will have been enciphered or will be having later tasks executed, while ones later in the sequence will be having earlier tasks executed or will be waiting to join. This “overlapping” of runs of the algorithm applies to any algorithm after the pipeline has been executed once and before the final execution of the pipeline. [185] Loop Index Shuffling can be employed for commutative sections of a task.
- the scheduler either truly randomly or cryptographically securely randomly selects a pointer from the pools of pointers and asynchronously executes a small serial fraction of an algorithm (e.g. a cipher algorithm) if the pre-conditions are met.
- the pointers are then returned to their respective pool to be randomly used again.
- the execution can be on a randomly selected hardware resource.
- the computational steps of an algorithm are randomised in at least time and also optionally location.
- loop index shuffling may be used.
- the randomisation means that the electrical signals released by each small computational step are also random. Such random electrical noise can still be harvested externally at a distance but it can no longer be interpreted in time or possibly also location.
- a testing method was developed to perform a side channel attack on a particular hardware platform to show whether the encryption key can be found using a correlation power analysis (CPA).
- CPA uses changes in the power consumption of the microprocessor and these can be measured by monitoring the changes in the current consumed by the device.
- the testbed comprises a processor for executing the encryption algorithm, e.g. an Arduino Uno™ chip which is based on the ATmega328P microcontroller.
- Figs. 14a and 14b show two screen shots from an oscilloscope, e.g. a 1GSa/s storage oscilloscope, which is monitoring the voltage drop across the resistor in the test set up described above.
- Fig. 14a is a screen shot of the power trace when the Arduino Uno™ chip is executing standard AES encryption using a 128-bit key.
- the varying (lower jagged) line is the voltage drop across the monitoring resistor and the straight (upper) line is a timing pulse to identify when the encryption is taking place.
- the ten rounds of the AES encryption process are clearly shown in the varying trace.
- Fig.14b is a screen shot of the power trace for the same AES encryption using a 128 bit key but with the mixing (shuffling) process applied to all four sub sections of the AES encryption. In other words, 100% mixing is applied.
- the ten rounds of the AES encryption process are still clearly shown in the varying trace but the signal is significantly noisier.
- CPA uses traces such as those illustrated in Figs.14a and 14b and applies a statistical analysis to the results of multiple encryptions to yield the encryption key.
- One method for performing such a statistical analysis uses a ChipWhisperer-Lite, which is an open-source hardware and software product optimised for SCA. In this test bed arrangement, it was also necessary to include a bidirectional level shifter between the Arduino Uno™ chip, which uses 5V logic levels, and the ChipWhisperer hardware, which uses TTL 3.3V logic levels.
- Fig.15a shows the data from a power spectrum side channel attack on standard AES using 25 different 16-byte encryption keys. After 170 traces, 24 of the 25 keys are correctly returned by the SCA and for the final key, 15 of the 16 sub-keys are returned.
- standard AES is clearly vulnerable to a successful SCA with only a relatively small number of traces. It is noted that not all the keys are returned with fewer traces, e.g.
- Figs. 15b to 15e show that the same AES algorithm which has been subject to varying amounts of the mixing (shuffling) process described above are much less vulnerable to a successful SCA.
- Fig. 15b shows that if mixing is applied to one of the four tasks (for example AddRoundKey), none of the 25 encryption keys are completely returned after 170 traces of the SCA. There are three keys for which 9 and 10 subkeys are respectively returned and 10 subkeys is the highest number of subkeys returned.
- Fig.15c shows that if mixing is applied to two of the four tasks (for example AddRoundKey and SubBytes), the maximum number of subkeys which is returned for any one of the 25 encryption keys is reduced to just 5. Similarly, increasing the mixing to three of the four tasks (for example, mixing within each of AddRoundKey, SubBytes and ShiftRows), reduces the maximum number of subkeys which is returned for any one of the 25 encryption keys still further to just 1. Finally, when all of the tasks (e.g. AddRoundKey, SubBytes, ShiftRows and MixColumns) are subject to the mixing process described above, none of the sub keys are determined by the SCA.
- the 128-bit AES encryption process has been
2. randomised at a task level by truncated randomising the AddRoundKey stage (mix1) as described above
3. randomised at a task level by truncated randomising the SubBytes stage (mix2) as described above
4. randomised at a task level by truncated randomising both the AddRoundKey stage and the SubBytes stage (mix1 & mix2)
5. initial single step randomisation at the start of the algorithm randomising the AddRoundKey stage (mix1)
6. initial single step randomisation at the start of the algorithm randomising the SubBytes stage (mix2)
7. initial single step randomisation at the start of the algorithm randomising both the AddRoundKey stage and the SubBytes stage (mix1 & mix2)
8. randomisation (mix1) combined with the randomisation introduced at a pipeline level (parallel)
9. randomisation (mix2) combined with the randomisation introduced at a pipeline level (parallel)
10. randomisation (mix1 & mix2) combined with the randomisation introduced at a pipeline level (parallel).
- this first type of randomisation (mix_1) is sufficient to foil 170 traces of a CPA side channel attack but the trade-off is an increase in time to 0.193028 seconds/100,000 encrypts compared to 0.0768093 seconds/100,000 encrypts.
- the third test results show an increase in time to 0.212729 seconds/100,000 encrypts for this second type of randomisation compared to 0.0768093 seconds/100,000 encrypts for the normal AES encryption.
- both options for 25% mixing result in a similar increase in processing time.
- the 50% mixing results in an additional increase in processing time when compared to 25% mixing as shown in the results for test four.
- the results of tests two to four indicate that there is a performance penalty for introducing the randomisation. This may be an acceptable penalty given the improvement in reducing the risk of a successful side channel attack as demonstrated in Figs.15a to 15e.
- the performance may be improved by introducing the randomisation in a single step at the start of the algorithm rather than introducing a truncated randomisation at each step as explained above. Tests five to seven indicate the improved performance.
- Test five shows that the first type of randomisation (mix_1) only results in an increase in time to 0.110798 seconds/100,000 encrypts compared to 0.0768093 seconds/100,000 encrypts for the unmixed, normal AES encryption.
- the second type of randomisation (mix_2) only results in an increase in time to 0.116044 seconds/100,000 encrypts compared to 0.0768093 seconds/100,000 encrypts for the unmixed, normal AES encryption.
- the second type of randomisation (mix_2) combined with the parallel randomisation (parallel) results in an increase in time to 2.7305 seconds/100,000 encrypts compared to 0.0768093 seconds/100,000 encrypts for the unmixed, normal AES encryption.
- the combination of the first and second type of randomisation (mix_1,mix_2) and parallel mixing results in an increase in time to 5.67733 seconds/100,000 encrypts compared to 0.0768093 seconds/100,000 encrypts for the unmixed, normal AES encryption.
- Keccak is a versatile cryptographic function. Best known as a hash function, it nevertheless can be used for authentication, (authenticated) encryption and pseudo-random number generation.
- Keccak-f cryptographic permutation algorithm can be decomposed into smaller constituent parts, all the way down to the microcode of the CPU if needed.
- Figure 16 illustrates examples of coarse, medium and fine grained decomposition for the Keccak-f cryptographic permutation algorithm: * Coarse grained – such as splitting into its constituent rounds (in this case 24) * Medium grained – such as splitting a round into its component actions, e.g.
- Fine grained – such as splitting θ (theta) into its program language, e.g. C++
* Very fine grained – such as splitting the program language operations into their assembly language components
* Ultra fine grained – such as splitting the assembly language opcodes into their on-chip microcode.
- a canonical implementation of the init/update/finalise (IUF) paradigm can be considered as the basic block permutation function consisting of 24 rounds of the five classical Keccak-f steps:
* θ (theta)
* ρ (rho)
* π (pi)
* χ (chi)
* ι (iota)
- the main SHA-3 submission uses 64-bit words and so implementing the basic block permutation function consists of 24 rounds of the five steps. Each of the five steps and each of the rounds must be implemented in a specified order and thus are non-commuting operations. The opportunities for shuffling the commuting operations become apparent after examining the loops within the theta, rho, pi, chi and iota stages.
- the theta (θ) task is schematically illustrated in the right-hand side of Figure 16 which shows that the first and second steps each have an array of indices {0, 1, 2, 3, 4} and the final step has an array of indices {0, 5, 10, 15, 20}.
- the theta (θ) task within the Keccak-f algorithm has a set of nested shuffled sequences and randomisation (or mixing) may thus be introduced by randomising the set of indices.
- the theta (θ) task can be randomly executed at a task level by shuffling the indices.
- unaltered code for the theta (θ) stage could be blended randomly with the altered code.
- the blending could be tuned to deliver SCA protection at nominal throughput.
- the same fork-join parallelisation described above could be used to randomise the location of the code execution.
- the ρ (rho) stage comprises a bitwise rotation of each of the 25 words by a different triangular number (0, 1, 3, 6, 10, 15, ...) and the π (pi) stage permutes the 25 words in a fixed pattern.
- sequence_t rho_pi_indices { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23 };
- the chi (χ) stage is schematically illustrated in Fig. 18 and like the theta (θ) stage, there are nested sequences which can be shuffled to add a deeper level of mixing than for the combined ρ (rho) and π (pi) stages.
- the operation is to exclusive-or a round constant into one word of the state.
- the ι (iota) stage is one of the non-commuting tasks which is amenable to alteration as described below.
- the ρ (rho) and π (pi) stages may be combined into a single subroutine.
- the 24 round loop may be unrolled and expressed (for example using C++) as:

      static void keccakf(uint64_t s[25])
      {
          int i, j, round;
          uint64_t t, bc[5];
          // round 1
          theta(s); rho_pi(s); chi(s); iota(s);
          // round 2
          theta(s); rho_pi(s); chi(s); iota(s);
          ...
          // round 23
          theta(s); rho_pi(s); chi(s); iota(s);
          // round 24
          theta(s); rho_pi(s); chi(s); iota(s);
      } [215]
- This sequence of non-commuting functions (theta, rho_pi, chi and iota) can now be randomly executed out of order using one of the techniques described above.
- each of the functions may be interconnected with wait-free single-producer single-consumer (SPSC) queues to result in a pipeline.
- This pipeline is schematically illustrated in Fig.19a.
- the pipeline has 96 functions with 95 connecting queues of bytes.
- Fig.19b schematically illustrates the system to implement the pipeline of Fig.19a.
- the pipeline can be executed asynchronously and out of order by virtue of the wait free connections carrying intermediate results.
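A wait-free SPSC queue of the kind described can be sketched in C++ as follows. This is a minimal single-producer/single-consumer ring buffer; the class name, the power-of-two capacity handling and the memory-order choices are illustrative assumptions, not the implementation itself.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Minimal wait-free single-producer/single-consumer (SPSC) ring
// buffer connecting two pipeline stages.  push() and pop() never
// block: each returns false instead of waiting when the queue is
// full or empty, so a stage that cannot make progress simply hands
// control to the next stage in the shuffled sequence.
template <typename T, size_t N>
class SpscQueue {
    static_assert((N & (N - 1)) == 0, "capacity must be a power of two");
    T buf_[N];
    std::atomic<size_t> head_{0};  // next slot to read (consumer side)
    std::atomic<size_t> tail_{0};  // next slot to write (producer side)

public:
    bool push(const T &v) {        // called by the producer only
        size_t t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == N)
            return false;          // full: caller moves on, no wait
        buf_[t & (N - 1)] = v;
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }
    bool pop(T &out) {             // called by the consumer only
        size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return false;          // empty: caller moves on, no wait
        out = buf_[h & (N - 1)];
        head_.store(h + 1, std::memory_order_release);
        return true;
    }
    bool empty() const {
        return head_.load(std::memory_order_acquire) ==
               tail_.load(std::memory_order_acquire);
    }
};
```

The "queues of bytes" mentioned above would correspond to something like SpscQueue<uint8_t, N>, with one instance between each adjacent pair of pipeline functions.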
- the system comprises an array 122 of indices for each of the function pointers and a vector 120 of function wrappers to the pipeline stages.
- each function pointer from the shuffled sequence is taken in turn and execution is attempted.
- if the preconditions are met, the function executes. The execution may be done by the CPU 110. The output of the execution is sent to the appropriate output queue, which may be stored in memory 114. If one or both of the preconditions is not met, the function is not executed. The next function is then selected according to the next function pointer in the sequence and the precondition checks are repeated.
- each function will be able to make progress at each turn of the shuffled sequence and after at most 96 turns of the shuffled sequence, each queue will have at least one intermediate result.
- the probability of having to wait 96 turns before the pipeline is fully primed, and no longer stalls, is vanishingly small.
- loading is likely to follow a Gaussian distribution and the pipeline is likely to be full in about 48 turns.
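The turn mechanism described above, in which each function pointer from the shuffled sequence is attempted exactly once per turn subject to its preconditions, can be sketched as follows. Plain bounded std::deque queues stand in for the wait-free SPSC connections, and the stage count, queue bound and toy transforms are illustrative assumptions rather than the 96-stage Keccak pipeline itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <random>
#include <vector>

// A linear pipeline of stages connected by bounded queues.  Each
// "turn" the stage order is shuffled and every stage is attempted
// exactly once; a stage runs only when its preconditions hold: an
// input is available and its output queue has room.
struct Pipeline {
    static constexpr size_t kBound = 4;
    std::vector<std::function<uint8_t(uint8_t)>> stages;
    std::vector<std::deque<uint8_t>> queues;  // queues[i] feeds stages[i]

    explicit Pipeline(std::vector<std::function<uint8_t(uint8_t)>> fs)
        : stages(std::move(fs)), queues(stages.size() + 1) {}

    void feed(uint8_t v) { queues.front().push_back(v); }

    // One turn: attempt each stage once, in a freshly shuffled order.
    void turn(std::mt19937 &rng) {
        std::vector<size_t> idx(stages.size());
        for (size_t i = 0; i < idx.size(); ++i) idx[i] = i;
        std::shuffle(idx.begin(), idx.end(), rng);
        for (size_t i : idx) {
            // Preconditions: an input waits and the output has room.
            if (queues[i].empty() || queues[i + 1].size() >= kBound)
                continue;  // not runnable this turn: skip, never wait
            uint8_t v = queues[i].front();
            queues[i].pop_front();
            queues[i + 1].push_back(stages[i](v));
        }
    }

    bool done() const {  // all non-final queues drained
        for (size_t i = 0; i + 1 < queues.size(); ++i)
            if (!queues[i].empty()) return false;
        return true;
    }
};
```

Because each queue is FIFO, the inputs still emerge in order even though the stage attempts are randomised from turn to turn.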
- FIG. 20 is a flowchart illustrating the key steps in the method (or carried out by the processor). As shown, in a first step S2000, at least part of the computer algorithm is stored as at least one pipeline, for example in memory. The at least one pipeline comprises a plurality of separately executable tasks. In a next step S2002, a plurality of inputs to be processed by the computer algorithm is received.
- FIG. 20 shows two routes for achieving randomisation. Although these are shown as separate options, they can be combined as described above.
- in the first route, the plurality of separately executable tasks is randomised at a pipeline level. Typically this route is used for non-commutative tasks which must be carried out in a specific order.
- the first step is to randomise the order in which tasks are presented for execution (step S2010), for example using the DBRQ buffer or circular buffers with pointers as described above. Once the order is randomised, there is an attempt to execute each task once from the randomised order (step S2012).
- in the second route, the plurality of separately executable tasks are randomised at a task level.
- Each task which is randomised comprises multiple operations. These operations may be commutative operations and may thus be executed in any order.
- the first step may be randomising an order of execution of the set of operations within the at least one task (step S2020), for example using loop index shuffling as described above. For commutative operations, there is no need to check if the operation can be executed because they may be carried out in any order.
- in step S2022, the operations may be executed in the order in which they appear in the randomised sequence and hence the task is executed.
- the next step is to check whether all the inputs which were received in step S2002 have been processed. In other words, there is a check to see if all the outputs have been generated (step S2030). For the encryption example described above, this may involve checking whether the input plain text has been processed by the algorithm into a cipher text. If the outputs are not ready, as indicated by the arrows, the method loops back to the randomising steps.
Abstract
There are described methods and apparatus for enhancing security when executing a computer algorithm comprising separately executable tasks, each of which produce an electrical signal when executed. The apparatus comprises memory in which at least part of the computer algorithm is stored as at least one pipeline, wherein the at least one pipeline comprises a plurality of separately executable tasks; and a processor. The processor is configured to receive a plurality of inputs to be processed by the computer algorithm; randomise the plurality of separately executable tasks at a pipeline level and/or at a task level; execute the randomised plurality of separately executable tasks, and repeat the randomising and executing steps until the computer algorithm has processed the plurality of inputs. In this way, at each repetition, the electrical signals produced when executing the plurality of separately executable tasks are randomised to enhance security.
Description
METHOD AND APPARATUS FOR REDUCING THE RISK OF SUCCESSFUL SIDE CHANNEL AND FAULT INJECTION ATTACKS FIELD [01] This invention relates to a method and apparatus for reducing the risk of successful side channel and fault injection attacks when executing a computer algorithm. In particular, the execution of the computer algorithm may be immune to side channel and fault injection attacks. BACKGROUND [02] There are many situations where computer algorithms perform tasks which one might wish to keep secret but they are open to Side Channel Attacks [SCAs]. One such situation arises in cryptography. [03] Traditionally, cryptography has sought to improve the security of encrypted data by increasing the complexity of the mathematical function, the cipher, that turns plaintext into ciphertext. However, regardless of the complexity of the cipher, its implementation in the real world is not secure from SCAs. Computational activity is electrically noisy. When a device performs a task, energy has to be lost irreversibly from that device. This necessity of thermodynamics still applies when a device is performing encryption where repeated and regular computational activity releases repeated and regular electrical signals. Such electrical signals leak sensitive information from side channels. Types of information include the timing of operations, power consumption, sound and electromagnetic [EM] emissions. The emissions are rhythmical and predictable, and computation of a cipher algorithm generates repeated patterns of electrical signals. The availability of such information is utilised in SCAs. The electrical signal can be harvested, at a distance, and analysed to reveal the secret numeric keys used by the cipher algorithm. This is of particular concern in constrained devices such as smart cards and phones where cryptographic algorithms can be broken with minimal work, a cryptographic key obtained and the ciphertext can be read. 
Moreover, SCAs are of concern whenever plaintext is being encrypted or decrypted. [04] One form of SCA is Differential Power Analysis [DPA] which is described, for example in “Differential Power Analysis” by Kocher et al published at the Annual International Cryptology Conference in December 1999. Moreover, US2017099134 (A1) [ Kocher et al] describes a method and apparatus for conducting DPAs on devices to establish how vulnerable they are to DPAs. [05] EM emissions can also be used for successful SCAs. Electric current flowing through a conductor induces electromagnetic emanations, and these emanations are a source of side-channel information. As the power consumption in a device varies while the data is being processed, so does the electromagnetic field and one can expect to extract secret information from a relevant analysis. For example, the paper “ElectroMagnetic Analysis (EMA): Measures and Countermeasures for Smart Cards” by Quisquater et al in The Proceedings of the International Conference on Research in Smart
Cards (E-smart 2001), volume 2140-LNCS, pages 200–210. Springer-Verlag, 2001, introduced electromagnetic analysis (EMA) as a more general type of analysis than timing and power analysis. [06] Both DPA and EMA can be improved by using various wavelet transform denoising methods, typically based on singular spectral analysis (SSA) and detrended fluctuation analysis (DFA), as described for example in “Improved wavelet transform for noise reduction in power analysis attacks” [Ai et al, 2016, IEEE International Conference on Signal and Image Processing (ICSIP)]. The principal signal component in SSA can be selected adaptively by DFA, and the residual part can be denoised by wavelet transform to retrieve important information. Using wavelet transforms in such a way improves the SCA success rate whilst significantly decreasing the necessary number of power consumption traces. [07] DPA and EMA rely on two facts. Firstly, whatever encryption process is being used, it consists of repeated identical processing loops or rounds of transformation of plaintext in a synchronised, timed, manner. Secondly, those rounds of transformation have distinct power consumption signatures and EM emissions. With regard to EM emissions, these can be harvested by either near field devices, which can more easily localise signal emissions to, say, one core, or at longer range. [08] Attempts thus far to address these problems broadly fall into 4 groups. In a first group, there is an attempt to obscure the synchrony with the addition of cryptographic or non-cryptographic steps. Examples include GB2494731 to NDS Limited, US2014013425 to Samson, US2008019507 to Incard SA, and US2010166177 to Incard SA. However, this approach does not solve the problem of SCAs because the introduced steps are easily filtered out. [09] In a second group, there is typically an attempt to introduce some sort of asynchronous operation. Examples include US201608143 to Kindarji, and US2016171252 to Leiserson et al.
Whilst there is a degree of asynchrony introduced to the system, the inherent imperative execution of the cryptographic programme means that asynchronous steps are, in effect, brought into synchrony, thereby making the process vulnerable to SCAs. [10] A third group of solutions attempts to add randomisation to which core operates on which bit of plaintext. Examples include “A Network-based Asynchronous Architecture for Cryptographic Devices”, a PhD thesis by Spadavecchia published by Edinburgh University in 2005, “Improving DPA Security by Using Globally-Asynchronous Locally-Synchronous Systems” by Gürkaynak et al published in the Proceedings of the 31st European Solid-State Circuits Conference 2005, and “Countering power analysis attacks by exploiting characteristics of multicore processors” by Yang et al published in the IEICE Electronics Express in 2018, Volume 15, Issue 7, Pages 20180084. Most of these approaches have been published as modelling exercises to establish chip architecture which can function with some randomness in core use. However, implementing these approaches typically requires new multicore hardware for every user, which is a considerable expense for banks and the like issuing cards with new chips. It is also a significant problem when considering updating users of, say, social media, when they would be asked to replace handsets, tablets, and other devices. Even if these attempted solutions could be enabled by firmware updates, there is still considerable cost to develop updates for each type of chip.
[11] It should be noted that this last group attempts to randomise core use but, like a biased roulette wheel or a Plinko machine, such an asynchronous network will have non-uniform distribution of execution time and/or location that will be vulnerable to side channel attacks. There will be inherent biases within each chip that will assist SCAs to differentiate between cores. [12] The final group describes attempts to vary power consumption or EM emissions and thereby reduce the signal to noise ratio. Examples include GB2524335 to Primary Key Associates Limited, US2006288239 to Pessolano, “Evaluation of Dynamic Voltage and Frequency Scaling as a Differential Power Analysis Countermeasure” by Baddam et al published at the 20th International Conference on VLSI Design in January 2007, and “Resistance Against Implementation Attacks: A Comparative Study of the AES Proposals” by J. Daemen and V. Rijmen published in the Second Advanced Encryption Standard (AES) Candidate Conference in 1999. However, these approaches have the failing that the sensitivity of equipment can be adjusted to extract a signal from noise. In addition, they have the same inherent problem as the third group in that implementation relies on replacing hardware. [13] Various papers have described how combinations of shuffling and masking can be used to improve the security against side-channel attacks. Examples include “Shuffling against Side-channel attacks: A Comprehensive study with Cautionary Note” by Veyrat-Charvillon et al published in AsiaCrypt 2012, “Higher-order Masking and Shuffling for Software Implementations of Block Ciphers” by Rivain et al published in CHES 2009 and “An AES Smart Card Implementation Resistant to Power Analysis Attacks” by Herbst et al published in ACNS 2006. In each paper, a combination of shuffling and masking is required. Moreover, in the paper by Herbst, the randomisation is only done at the beginning and end of the AES encryption.
[14] If an attacker has access to the hardware, they also have the possibility of mounting fault injection attacks (FIAs). FIAs can be generalised and relatively easy to mount, such as by glitching the voltage up or down out of the working range, or by generating sudden out of working range temperature changes. More specialised and costly FIAs involving lasers or ion beams can be focussed on much smaller parts of a processor. FIAs may sometimes lead to a processor dumping the task it was performing and the dump can be harvested and analysed. Sometimes an FIA can lead to a specific bit flip which changes the processing. Comparing “normal” running with the bit flip can reveal useful information. Whichever approach is being used, the attacker is reliant on consistent timing with each task being executed at the same position relative to the rest of the tasks in the algorithm. [15] In summary, methods such as designing special leakage-resilient hardware and leakage-resilient cryptography may be effective. However, they are problematic in many circumstances. Redesigning hardware is expensive and its integration into existing devices and systems is impractical. Moreover, existing leakage-resilient cryptographic programmes run inefficiently and are still open to SCAs. [16] There is a need for SCA immunity that can be simply installed on existing hardware, that is not reliant on there being more than one core but is also inherently scalable. Moreover, there is a need for a process and apparatus with SCA immunity which works regardless of the cryptographic algorithm being used.
SUMMARY [17] The invention relates to apparatus and methods for enhancing security when executing a computer algorithm as defined in the independent claims. Further features are defined in the dependent claims. [18] We describe apparatus for enhancing security when executing a computer algorithm comprising separately executable tasks, each of which produce an electrical signal when executed, the apparatus comprising memory in which at least part of the computer algorithm is stored as at least one pipeline, wherein the at least one pipeline comprises a plurality of non-commutative separately executable tasks; and a processor which is configured to receive a plurality of inputs to be processed by the computer algorithm; randomise the order of execution of the plurality of non-commutative separately executable tasks within the pipeline; attempt to execute each of the plurality of non-commutative separately executable tasks from the randomised order of execution once by determining whether preconditions for execution are met, when the preconditions are met, executing the task and storing an output of the executed task; and repeat the randomising and attempting steps until the computer algorithm has processed the plurality of inputs; whereby, at each repetition, the electrical signals produced when executing the tasks are randomised to enhance security. In this aspect, the plurality of separately executable tasks may be considered to be randomised at a pipeline level. 
[19] We describe apparatus for enhancing security when executing a computer algorithm comprising separately executable tasks, each of which produce an electrical signal when executed, the apparatus comprising memory in which at least part of the computer algorithm is stored as at least one pipeline, wherein the at least one pipeline comprises at least one task which comprises a set of commutative operations; and a processor which is configured to: receive a plurality of inputs to be processed by the computer algorithm; randomise an order of execution for the commutative operations of the at least one task within the pipeline; execute each of the at least one tasks within the pipeline once and repeat the randomising and executing steps until the computer algorithm has processed all inputs; whereby, at each repetition, the electrical signals produced when executing the at least one task are randomised to enhance security. In this aspect, the plurality of separately executable tasks may be considered to be randomised at a task level. [20] We also describe apparatus for enhancing security when executing a computer algorithm, the apparatus comprising memory in which at least part of the computer algorithm is stored as at least one pipeline comprising a plurality of separately executable tasks; and a processor which is configured to randomise the plurality of separately executable tasks at a pipeline level and/or at a task level; and execute the randomised plurality of separately executable tasks whereby electrical signals produced when executing each task are randomised to enhance security. 
[21] Similarly, we describe a method for enhancing security when executing a computer algorithm, the method comprising storing at least part of the computer algorithm as at least one pipeline comprising a plurality of separately executable tasks; randomising the plurality of separately executable tasks at a pipeline level and/or at a task level; and executing the randomised plurality of separately
executable tasks whereby electrical signals produced when executing each task are randomised to enhance security. [22] Similarly, we describe a method for enhancing security when executing a computer algorithm comprising separately executable tasks, each of which produce an electrical signal when executed, the method comprising: storing at least part of the computer algorithm as at least one pipeline, wherein the at least one pipeline comprises a plurality of non-commutative separately executable tasks; receiving a plurality of inputs to be processed by the computer algorithm; randomising the order of execution of the plurality of separately executable tasks within the pipeline; attempting to execute each of the plurality of separately executable tasks from the randomised order of execution once by determining whether preconditions for execution are met, when the preconditions are met, executing the task and storing an output of the executed task; and repeating the randomising and attempting steps until the computer algorithm has processed the plurality of inputs; whereby, at each repeating step, the electrical signals produced when executing the tasks are randomised to enhance security. 
[23] Similarly, we describe a method for enhancing security when executing a computer algorithm comprising separately executable tasks, each of which produce an electrical signal when executed, the method comprising: storing at least part of the computer algorithm as at least one pipeline, wherein the at least one pipeline comprises at least one task which comprises a set of commutative operations; receiving a plurality of inputs to be processed by the computer algorithm; randomising an order of execution for the commutative operations of the at least one task within the pipeline; executing each of the at least one tasks within the pipeline once and repeating the randomising and executing steps until the computer algorithm has processed all inputs; whereby, at each repetition, the electrical signals produced when executing the at least one task are randomised to enhance security. [24] By randomising the occurrence of the electrical signals when executing the tasks, sensitive information associated with the execution of the tasks is also randomised. The sensitive information may include the timing of operations, power consumption, sound and electromagnetic [EM] emissions. When the electrical signals and hence emissions are rhythmical and predictable, the risk of side channel attacks or fault injection attacks is higher than when they are randomised. [25] As explained in more detail below, the randomising at a pipeline level and/or a task level may be considered to be a Random Out Of Order Execution [ROOOX] approach. Both commutative and non-commutative portions of the algorithm can be randomised separately or together. Moreover, there may be random mixing both within individual runs and between different runs of the algorithm. To allow the operation, an algorithm may be considered to be unrolled, de-nested, and split into portions of sufficiently fine granularity that they can all be randomised and presented in a wait-free asynchronous manner. 
For non-commutative tasks, each task may be presented for execution in a random order when randomising at a pipeline level, but is executed in the correct order for each input. When there are multiple inputs, it is possible that all tasks in the random order may meet the preconditions for execution and thus there may appear to be a random out of order execution of the pipeline. However, each task is still being executed in the correct order for the associated input. [26] The following features are applicable to the methods and apparatuses described above.
[27] An executable task may be considered to be a portion of program code residing in memory that can be run by processing hardware, e.g. the processor. Such a task may be a suitably intercommunicating, wait-free, asynchronously executing sub-step of a larger computer algorithm or algorithms. By asynchronous, it is meant that the tasks do not operate synchronously; they are not in lock-step, and they operate in a non-blocking manner so that the algorithm is guaranteed to complete in a finite number of steps regardless of the speed of operation of the tasks. By wait-free, it is meant that all tasks will complete in finite time; in other words, there is no blocking wait-state and the processor cannot be indefinitely blocked or starved. The separately executable tasks may be of a sufficiently fine granularity, which depends on several factors. The size of an executable task is not merely the physical size that the task occupies in memory but also the amount of time that is spent executing the program instructions that define that particular task. The executable task may, for example, be as small as a call or write function, or as large as all the steps necessary to expand an encryption key. Each task may be without repetitions occurring during a single execution episode, for example to assist in reducing any information that will be leaked at the point of execution. Merely as an example, the algorithm may comprise a plurality of rounds of component actions. The granularity of the task may be considered to be coarse when the algorithm is split into tasks, each of which constitutes a round within the algorithm. The granularity of the task may be considered to be medium when the algorithm is split into tasks, each of which constitutes a component action within the algorithm. The granularity of the task may be considered to be fine when the algorithm is split into tasks, each of which constitutes an operation within a component action. 
[28] The apparatus may further comprise at least one buffer comprising a first queue and a second queue. The at least one buffer may be termed a double buffer. [29] Randomising the plurality of separately executable tasks at a pipeline level (or within the pipeline) may be achieved by generating a pointer which is associated with each executable task; storing the plurality of pointers in temporal order in the first queue; randomising the order of the plurality of pointers; and transferring the randomised plurality of pointers to the second queue from which each pointer is retrievable to execute the associated task. By randomising the order of the pointers, the order in which the plurality of tasks are presented for execution is randomised. The term pointer, as used herein, is a programming language object that stores the memory address of another value located in computer memory. A pointer references a location in memory, and obtaining the value stored at that location is known as dereferencing the pointer. As an analogy, a page number in a book's index could be considered a pointer to the corresponding page; dereferencing such a pointer would be done by flipping to the page with the given page number and reading the text found on that page. The actual format and content of a pointer variable is dependent on the underlying computer architecture. [30] The plurality of separately executable tasks may be executed by retrieving a pointer at the head of the second queue (for example, in response to an instruction issued by the processor); attempting to execute the task associated with the retrieved pointer; returning the pointer to the first queue; and repeating the retrieving, attempting and returning steps until the second queue is empty. Having the pointers queued ready in the second queue enables the wait-free execution of the associated task, if it can be executed. The processor may be configured to determine whether the second queue is empty
and when it is determined that the second queue is empty, randomise the order of the plurality of pointers; and transfer the randomised plurality of pointers to the second queue. The randomisation of the pointers in the first queue and transfer to the second queue may be done repeatedly and indefinitely or until the computer algorithm has processed all inputs. [31] Attempting to execute the task may comprise determining whether preconditions for execution are met and when the preconditions are met, executing the task or when the preconditions are not met, proceeding straight to the returning step. For example, the preconditions may comprise determining whether there is an intermediate output associated with the task, i.e. whether the task has already executed and its output stored. Alternatively, or additionally, the preconditions may comprise determining whether there is an intermediate input associated with the task, i.e. whether the previous task has already been executed and its output stored so the task can be executed. [32] By continually testing whether each task can be executed and returning pointers to unexecuted tasks to the pointer pool to be randomised in a queue, there is no temporal pattern. In other words, stepwise and synchronous serial execution of commands for the whole algorithm is circumvented and the rhythmic and predictable computations for the whole algorithm are also circumvented. By being wait-free, dwell times for the processor waiting for conditions to be met are avoided. On the one hand, the repetitive cycling of unmet tasks adds time; on the other, the wait-free action reduces it. Moreover, the asynchronous nature of the process according to the invention reduces processing time. In summary, the pointers may be retrieved from (i.e. exit) the second queue in an asynchronous, wait-free manner. A non-uniform execution is achieved and any semblance of a pattern is obliterated, which reduces the chance of a successful side channel attack. 
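The temporal double buffer described here, with its first queue of pointers in order and second queue of shuffled pointers, can be sketched as follows. Task "pointers" are modelled as indices into a task table, and the class name, std::deque representation and try_run interface are illustrative assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <random>
#include <vector>

// Temporal double buffer ("first queue"/"second queue") used to
// randomise the order in which task pointers are presented for
// execution.  try_run() returns true only when a task's
// preconditions were met and it actually executed.
class DoubleBufferQueue {
    std::deque<size_t> first_;   // pointers returned after each attempt
    std::deque<size_t> second_;  // randomised pointers awaiting attempts
    std::mt19937 rng_;

public:
    explicit DoubleBufferQueue(size_t n_tasks, unsigned seed)
        : rng_(seed) {
        for (size_t i = 0; i < n_tasks; ++i) first_.push_back(i);
    }

    // When the second queue is empty, shuffle the first queue's
    // pointers and transfer them across.
    void refill() {
        std::vector<size_t> tmp(first_.begin(), first_.end());
        first_.clear();
        std::shuffle(tmp.begin(), tmp.end(), rng_);
        second_.assign(tmp.begin(), tmp.end());
    }

    // One full pass: take each pointer from the head of the second
    // queue in turn, attempt its task once, and return the pointer
    // to the first queue regardless of the outcome.
    void one_pass(const std::function<bool(size_t)> &try_run) {
        if (second_.empty()) refill();
        while (!second_.empty()) {
            size_t p = second_.front();
            second_.pop_front();
            try_run(p);            // may or may not execute
            first_.push_back(p);   // pointer always goes back
        }
    }
};
```

Despite the shuffled presentation, precondition checks force the tasks to execute in their correct dependency order, as the test below illustrates with a simple "task i needs task i-1" chain.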
Furthermore, although the plurality of executable tasks are presented in a random order, the continual testing to see if the task can be executed ensures that they are executed in order which, it will be appreciated, is important for non-commutative tasks. Thus, the random order may be considered to be a random order of presenting the tasks for execution rather than a random order of executing the tasks. [33] The apparatus may comprise a first buffer and a second buffer each comprising a first and second queue. Each of the first and second buffer may be a temporal double buffer. The memory may store a first pipeline comprising a plurality of separately executable tasks and a second pipeline comprising a plurality of separately executable tasks. For example, the first pipeline may be an encryption pipeline for encrypting plaintext and the second pipeline may be a decryption pipeline for decrypting cipher text. Randomisation may be achieved by generating a pointer which is associated with each executable task in the first and second pipelines; storing the plurality of pointers for the plurality of separately executable tasks in the first pipeline in temporal order in the first queue of the first buffer; storing the plurality of pointers for the plurality of separately executable tasks in the second pipeline in temporal order in the first queue of the second buffer; randomising the order of the plurality of pointers in each of the first queues; transferring the randomised plurality of pointers from the first queue of the first buffer to the second queue of the first buffer; and transferring the randomised plurality of pointers from the first queue of the second buffer to the second queue of the second buffer. The pointers in each of the second buffers are retrievable for wait-free execution as described above. It will
be appreciated that two buffers is merely exemplary and the number of buffers may be selected to match the number of pipelines. [34] The apparatus may comprise at least one circular buffer for storing a plurality of pointers with each pointer being associated with one of the plurality of tasks and a plurality of queues connected to the at least one circular buffer with each queue storing the output from each of the plurality of tasks when it is executed. Such a circular buffer may be termed a single producer, single consumer (SPSC) buffer. The stored output may be termed an intermediate output. The processor may be configured to randomise the order in which the plurality of separately executable tasks within the pipeline (or at a pipeline level) are presented for execution by randomising the order of the plurality of pointers within the at least one circular buffer. The processor is configured to attempt to execute each of the plurality of separately executable tasks from the randomised order by: retrieving a pointer at the head of the circular buffer; attempting to execute the task associated with the retrieved pointer; returning the pointer to the circular buffer; and repeating the retrieving, attempting and returning steps until all pointers within the circular buffer have been attempted. Once all the pointers within the circular buffer have been attempted, the order of the plurality of pointers may be randomised again and the process repeats. [35] There may be a plurality of processing cores for executing the plurality of separately executable tasks. The plurality of processing cores may be spread across a network or may be on the same chip, or in the same device. Furthermore, each core may involve several logical threads and a pointer may be associated with each of the logical threads. Randomisation of the location in which a task is executed may mean that it is not possible to rely on tracking the emanations from a particular core for an SCA. 
For example, in the arrangement above in which there are two pipelines, a first processing core may be configured to retrieve the pointers from the second queue of the first buffer (and hence execute the associated tasks of the first pipeline) and a second processing core may be configured to retrieve the pointers from the second queue of the second buffer (and hence execute the associated tasks of the second pipeline). It will be appreciated that two is also illustrative in this example and the number of processing cores may be selected to match the number of pipelines. [36] The feature of utilising a plurality of cores may be used for parallel processing, e.g. for encryption and decryption or other similar substantially embarrassingly parallel problems. With executable tasks of sufficiently small granularity, and tasks which may be acted on wait-free and asynchronously as explained above, an apparatus with a plurality of cores generally follows Gustafson-Barsis' Law and benefits from linear scaling. The apparatus may be adapted to allow for recruiting and retiring processing cores from a pool of processing cores. When a core is recruited to the pool, a pointer to the core, or pointers to its logical cores, may be generated and supplied to the plurality of pointers. As explained above, the pointers to the processing cores may then be randomised, leading to randomness in the location of execution of a program (i.e. topological randomisation). When a core is retired, its respective pointer, or pointers to its logical cores, may be removed from the plurality of pointers. [37] Where there are a plurality of processing cores or a processing core with a plurality of logical threads, randomisation may be introduced by generating a pointer which is associated with each of the plurality of processing cores/threads. The pointers may be randomised using the double buffer and/or
the SPSC described above. For example, the randomisation may be introduced by storing the plurality of pointers in topological order in the first queue; randomising the order of the plurality of pointers; and transferring the randomised plurality of pointers to the second queue from which each pointer is retrievable to determine which of the plurality of processing cores/threads is to execute a task. Such a double buffer may be termed a topological buffer because it randomises the location in which the task is executed as opposed to the order in which a task is executed as in a temporal double buffer described above. [38] The randomness may be enhanced by combining temporal and location randomisation. For example, the apparatus may comprise a first buffer having a first temporal queue and a second temporal queue; a second buffer having a first topological queue and a second topological queue; and a scheduler which is configured to cooperate with the processor to randomise the plurality of separately executable tasks at a pipeline level both temporally and topologically. A functional change in temporal and topological ordering of execution may thus be achieved by the mixing of pointers to both the program code of the executable tasks, and location of execution, such that, when and where the tasks are physically executed is entirely random. [39] The at least one buffer may be located within a register of the processor, particularly if pointers are embodied as small enough pieces of memory that can readily fit inside a processor’s register. Alternatively, the at least one buffer may be separate from the processor. The processor may thus be configured to randomise the plurality of separately executable tasks at a pipeline level by communicating or cooperating with the at least one buffer. 
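The combined temporal and topological mixing of paragraphs [37] and [38] might be sketched as below. The class name and the two-queue refill logic are assumptions modelled on the double buffer of paragraphs [33] and [37]; the task names are the AES steps used elsewhere in this description, and the core labels are hypothetical.

```python
import random

class DoubleBufferedRandomQueue:
    """Sketch of a shuffled double buffer: pointers are loaded into a
    first queue in their original order, shuffled, and transferred to a
    second queue from which they are retrieved one by one."""
    def __init__(self, pointers, rng):
        self.pointers, self.rng, self.second = list(pointers), rng, []
    def _refill(self):
        first = list(self.pointers)   # first queue, in original order
        self.rng.shuffle(first)       # randomise the pointer order
        self.second = first           # transfer to the second queue
    def pop(self):
        if not self.second:           # second queue drained: refill
            self._refill()
        return self.second.pop()

rng = random.Random()
temporal = DoubleBufferedRandomQueue(["AddRoundKey", "SubBytes", "ShiftRows"], rng)
topological = DoubleBufferedRandomQueue(["core0", "core1"], rng)

# When *and* where: pair a randomly presented task with a random core.
for _ in range(6):
    print(temporal.pop(), "on", topological.pop())
```

Each full drain of a second queue covers every pointer exactly once, so over a cycle every task is still presented and every core is still used; only the pairing and ordering are unpredictable.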
Depending on hardware availability, a plurality of shuffled double buffers may be used as explained above to shuffle pointers to executable tasks and/or a plurality of shuffled double buffers may be used as explained above to shuffle pointers to locations of execution. [40] Randomisation may be introduced within the pipeline (at a pipeline level) or within the task itself (at a task level). Both types of randomisation may be used separately or combined. Randomisation within the task may be used when the task comprises a set of commutative operations. The processor may then be configured to randomise an order of execution of the commutative operations within the task. For example, the commutative operations may have a set of indices and randomising the order of execution of commutative operations may comprise randomising the set of indices for the commutative operations before execution of the selected task. Where there are a plurality of processing cores, the location of execution of each commutative operation may also be randomised. [41] Randomisation may be introduced into the plurality of separately executable tasks at a task level by selecting at least one task of the plurality of separately executable tasks, wherein the selected task comprises a set of discrete actions, wherein each set of actions has a set of indices indicating the order in which each action is executed; and randomising the set of indices for the discrete actions before execution of the selected task. The at least one task may be selected by determining the length of time required to execute each task in the pipeline and selecting one or more tasks which take longer to execute than other tasks. The tasks which take longer to execute may be more vulnerable to successful SCAs. The at least one task may be selected based on the number of commutative actions within the plurality of separately executable tasks. For example, the at least one task may contain the most
commutative actions or a greater proportion of commutative actions than non-commutative actions. The randomisation at a task level may also comprise separating the or each selected task into the set of discrete actions. For example, the set of discrete actions may be obtained by unrolling one or more loops within the task into the discrete actions, wherein each discrete action is commutative. In this example, the set of indices may be obtained from the loop indices. [42] The at least one task may be selected based on the number of times the at least one task repeats in the pipeline. For example, the selected task may have more repeats. Where the same task repeats multiple times within the algorithm pipeline, randomisation at a task level may be introduced for each occurrence of the same task, for example by identifying each occurrence of a repeating task and randomising the set of indices for each occurrence. Where a plurality of tasks each repeat multiple times within the algorithm pipeline, randomisation at a task level may be introduced for each repeat of at least one task or each repeat of several tasks within the pipelines. For example, where there are four tasks which repeat the same number of times within the algorithm, there may be considered to be 25% randomisation at a task level when randomisation is introduced for each repeat of one of the four tasks. Similarly, there may be considered to be 50% randomisation when all repeats of two of the four tasks are randomised, 75% when all repeats of three of the four tasks are randomised, and 100% when randomisation at the task level is introduced into all tasks. Alternatively, randomisation may be introduced for some but not all of the repeats.  [43] The computer algorithm may be an encryption algorithm encrypting plaintext or decrypting to plaintext.
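The task-level index shuffling of paragraphs [40] and [41] amounts to randomising the index set of a group of commutative actions before each execution. The sketch below uses an invented helper name and a hypothetical byte-XOR task; because XOR assignments to distinct positions commute, the result is identical whichever order the shuffle produces.

```python
import random

def shuffled_commutative(actions, rng=random.Random()):
    """Execute a set of commutative actions in a randomised index order:
    the indices are shuffled before each execution of the selected task,
    so the execution order differs run to run."""
    indices = list(range(len(actions)))
    rng.shuffle(indices)              # randomise the set of indices
    for i in indices:
        actions[i]()                  # order of execution no longer fixed

# Commutative example: XOR-combining independent bytes into a state.
state = bytearray(4)
data, key = b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"
actions = [lambda i=i: state.__setitem__(i, data[i] ^ key[i]) for i in range(4)]
shuffled_commutative(actions)
print(state.hex())   # ba9bfc9d regardless of the shuffled order
```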
For example, when the encryption algorithm is AES, the plurality of separately executable tasks (or component actions) may comprise AddRoundKey, SubBytes [or inverse], ShiftRows [or inverse], and MixColumns [or inverse]. When randomising at a task level, the AddRoundKey may be the selected task because this is rich in commutative actions and is repeated multiple times in the pipeline. However, it should be clear to the skilled person that any computer algorithm where there are one or more parts that one wishes to remain secret and immune from SCAs can benefit from the invention. For example, algorithms involved in blockchain processing would benefit from having the relevant part or parts of blockchain processing being kept secret. Also, the invention may, for example, be applied to Secure Multiparty Computation [SMC]. Third party awareness of the data in SMC arrangements is wholly undesirable and the invention would thwart third party SCAs aimed at gaining such data. Indeed, it has become apparent that machine learning algorithms are vulnerable to SCA and there is a call to reduce their vulnerability. Examples of successful SCAs on ML models are described, for example, in "CSI NN: Reverse Engineering of Neural Network Architectures Through Electromagnetic Side Channel" by Batina et al., published in USENIX Security Symposium, pp. 515-532, 2019; "Reverse-Engineering Deep Neural Networks using Floating-Point Timing Side-Channels" by Gongye et al., published in Li Z. (ed), DAC '20: The 57th Annual Design Automation Conference, article 30, pp. 1-6, 2020; and "Open DNN Box by Power Side-Channel Attack" by Xiang et al., published in IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 67, no. 11, pp. 2717-2721, Nov. 2020, doi: 10.1109/TCSII.2020.2973007. Machine learning algorithms have many repeated processes that can be converted to execute in accordance with the invention.
Moreover, in-house and custom banking algorithms which evaluate stock and foreign exchange for purchase would also benefit from
having parts immune from SCAs. As another example of a computer algorithm which aims to keep information secret, the computer algorithm may be a hashing algorithm, e.g. SHA256. [44] Thus, we also describe apparatus for controlling encryption (or decryption) which comprises memory in which the encryption (or decryption) algorithm is stored as at least one pipeline comprising a plurality of separately executable tasks; and a processor which is configured to receive at least one plaintext input (ciphertext input); randomise the plurality of separately executable tasks at a pipeline level and/or at a task level; and execute the randomised plurality of separately executable tasks to encrypt plaintext (or decipher ciphertext). We also describe a method for encrypting (or decrypting) which comprises storing an encryption (or decryption) algorithm as at least one pipeline comprising a plurality of separately executable tasks; receiving at least one plaintext input (ciphertext input); randomising the plurality of separately executable tasks at a pipeline level and/or at a task level; and executing the randomised plurality of separately executable tasks to encrypt the input plaintext (or decipher the input ciphertext). [45] Reference to plaintext herein refers to any data before encryption or after decryption and includes ordinary text files, delimited text files (e.g. files with the suffix csv), images, videos, sound files, spreadsheet files, web pages (e.g. files with the suffix php) and streaming data. During the process of encryption, the plaintext is converted into cipher in stages. While passing through the encryption process, the plaintext can be said to be partially encrypted text. In the first step, plaintext becomes partially encrypted text and after the last step becomes ciphertext. It should be clear, in terms of this invention, that reference to "text" covers any of the above states of plain, partially encrypted and encrypted text.
[46] The encryption algorithm can be any one from the many known to the skilled person. In an earlier competition run by the National Institute of Standards and Technology (NIST), there were five algorithms referred to as the "AES finalists", namely Rijndael, Serpent, Twofish, RC6 and MARS. As an example, Rijndael's algorithm has multiple stages each of which comprise one or more steps selected from: AddRoundKey, SubBytes or its inverse, ShiftRows or its inverse, and MixColumns or its inverse. Each of these steps may be considered to be an executable task which is of sufficiently small granularity for the method and system in which randomisation may be introduced at the pipeline level. For example, the granularity may be considered to be fine. At a task level, the executable task for AddRoundKey comprises two nested loops: an outer loop executes four iterations of the inner loop, which itself has four iterations, resulting in sixteen iterations of the Exclusive OR (XOR) operation, which is a set of commutative operations. Each outer loop may be considered to comprise a set of four discrete actions with each set of actions having a set of indices (in this example, one to four). Similarly, each inner loop may be considered to comprise a set of four discrete actions with each set of actions having a set of indices (in this example, also one to four). The randomisation at a task level may thus be introduced by randomising the set of indices for at least one of the inner and outer loops. Similarly, shuffling of indices can also be applied to the 16 discrete actions of the SubBytes task, the four discrete actions (cycles) of the MixColumns task and the four discrete actions (stages) of the ShiftRows task. There may also be shuffling within each of the cycles of the MixColumns and ShiftRows tasks.
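The AddRoundKey shuffling just described can be sketched as follows. This is a simplified illustration only (the function name is invented, the state and round key are arbitrary test matrices, and no AES key schedule is involved): the sixteen XORs commute, so both loop index sets can be shuffled without changing the result.

```python
import random

def add_round_key_shuffled(state, round_key, rng=random.Random()):
    """AddRoundKey with randomised loop indices: state and round_key are
    4x4 byte matrices; the sixteen XOR operations are commutative, so the
    outer and inner loop index sets can both be shuffled."""
    outer = list(range(4))
    inner = list(range(4))
    rng.shuffle(outer)                      # randomise outer-loop indices
    rng.shuffle(inner)                      # randomise inner-loop indices
    for r in outer:
        for c in inner:
            state[r][c] ^= round_key[r][c]  # same sixteen XORs, random order
    return state

state = [[i * 4 + j for j in range(4)] for i in range(4)]
key = [[0xFF] * 4 for _ in range(4)]
add_round_key_shuffled(state, key)
print(state[0])   # [255, 254, 253, 252]
```

A finer-grained variant could reshuffle the inner index set on every outer iteration, trading a little extra PRNG work for more trace variation per round.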
[47] The AES finalist, Rijndael, can be run with different key sizes, e.g. 128, 192 and 256 bits, and different numbers of rounds: 10 rounds for AES-128, 12 rounds for AES-192 and 14 rounds for AES-256. "Side channel power analysis of an AES-256 bootloader" by O'Flynn and Chen, published in the 2015 IEEE 28th Canadian Conference on Electrical and Computer Engineering (CCECE), Halifax, NS, Canada, 2015, pp. 750-755, doi: 10.1109/CCECE.2015.7129369, describes how they successfully mounted an SCA on an AES-256 bootloader and it appears that the invention described herein could thwart such an attack. [48] A more recent NIST competition was aimed at generating post quantum cryptographic algorithms (PQCAs). It is believed that quantum computers will be able to run Shor's algorithm (which finds the prime factors) with great speed and so compromise encryption algorithms such as RSA, which is based on products of large primes. The competition called for Public-Key Encryption and Key-establishment Algorithms, as well as Digital Signature Algorithms. The finalists are Classic McEliece, Crystals-Kyber, NTRU and Saber in the first group, and Crystals-Dilithium, Falcon and Rainbow in the second group. Three of the first group admit that they are vulnerable to side channel attack without modification and NTRU is silent on the matter. All these finalists are computer algorithms which can be modified to execute in accordance with the invention. [49] Many of the digital signature candidates are based on the Keccak family, which also underpins SHA3, and this group comprises non-prime based algorithms, so they are safe from quantum computers running Shor's algorithm. The Rijndael AES finalist, mentioned above, is also non-prime based. [50] As explained above, several pipelines may be processed. Thus, several encryption algorithms may be decomposed into executable tasks in different pipelines running on different plaintext feeds or decryption modes.
[51] The apparatus typically receives a plurality of inputs; for example, in the context of encryption, there may be several sources of plaintext or ciphertext. The plurality of inputs may be processed at the same time, for example a video stream together with an audio stream. In this regard, the temporal mixing of chunks of each source further adds to the randomness and thereby the resistance to SCAs. [52] Randomisation may be introduced at the task level, e.g. when shuffling the indices of loops, or at the pipeline level, either temporally or temporally and topologically. Randomness is the quality or state of lacking any pattern or principle of organization: being entirely unpredictable and, therefore, unable to convey any information whatsoever. It will be appreciated that true randomness is, in fact, an abstract concept that can only be achieved when a system is at the theoretical limit of disorganization or maximum entropy. However, in cryptographic theory, the term "true randomness" describes a quality of the order of events such that they are not derived from any deterministic logic. Some physical phenomenon that is expected to be random, such as atmospheric noise, thermal noise, other external electromagnetic phenomena or quantum phenomena, must be sampled and measured. [53] True randomness may be slow. The act of repeatedly measuring a natural source of true randomness and harvesting sufficient entropy to be confident of its quality takes time and it becomes a blocking or rate limiting step that would prevent practical implementation if it were to be used solely for the temporal and topological mixing or the index shuffling.
[54] "Pseudo-Randomness" is a mathematical mechanism that produces long sequences of apparently random results, which are in fact completely determined by a shorter initial value termed a seed. Therefore, the entire seemingly random sequence can be reproduced exactly provided the initiating seed value is known. This entirely deterministic method of producing (pseudo) random numbers is the one used almost exclusively in computer software and, depending on the quality of implementation, is vulnerable to attack. "Cryptographically Secure Pseudo-Randomness" depends upon the mathematical technique used to generate the Pseudo-Random Numbers (PRN) and upon the size and "quality" of the sequence-initiating seed, which together confer a greater or lesser degree of randomness. The idea that sequences of numbers can possess varying "amounts" of randomness may seem odd, but using various statistical tests it is possible to assess the amount of information that such sequences reveal and to assign to them degrees of entropy. It follows, therefore, that whilst PRNs based solely on deterministic logic can never be regarded as truly random, in practice carefully designed and implemented pseudo-random number generators (PRNGs) can demonstrate sufficient entropy to be certified for security-critical cryptographic purposes by authorities such as the National Institute of Standards and Technology (NIST), part of the U.S. Department of Commerce. [55] The apparatus may comprise a random number generator to assist the processor when randomising the plurality of executable tasks at a pipeline or task level. In other words, the processor may be configured to randomise the plurality of separately executable tasks at a pipeline level and/or a task level by communicating with the random number generator.
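The determinism described in paragraph [54], and the seed-from-entropy arrangement of paragraph [55], can be illustrated with Python's standard library. This is a hedged stand-in only: Python's Mersenne Twister is not a certified CSPRNG, and the `secrets` module (backed by the OS entropy pool) merely plays the role of the slow, high-entropy source that periodically seeds the fast PRNG driving the shuffles.

```python
import random
import secrets

# A PRNG sequence is completely determined by its seed: two generators
# seeded identically reproduce exactly the same "random" sequence.
a = random.Random(1234)
b = random.Random(1234)
print([a.randrange(256) for _ in range(4)] ==
      [b.randrange(256) for _ in range(4)])   # True: fully reproducible

# Sketch of the hybrid scheme: a slow entropy source (here the OS
# CSPRNG via `secrets`) seeds a fast PRNG that drives the index and
# pointer shuffles without rate-limited blocking.
fast = random.Random(secrets.randbits(64))    # reseed from entropy source
indices = list(range(16))
fast.shuffle(indices)                          # fast shuffling for ROOOX
print(len(set(indices)))                       # 16: a permutation, not a sample
```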
Random number generators fall into two main categories: fast but imperfect pseudo-random number generators (PRNGs) and slow but perfect-entropy True Random Number Generators (TRNGs), which for all practical purposes cannot be guessed or predetermined and typically use external physical phenomena to generate true randomness. The temporal and topological scrambling and/or the index shuffling may use a random number generator. The choices are implementation dependent but at the very least a fast PRNG is required to drive the shuffling permutations of indices and pointers required by the methods described below. A combination of both PRNG and TRNG may be preferred where a slow TRNG is used, at intervals (implementation dependent), to seed the PRNG. [56] Alternatively, both a hardware-generated source of true randomness, such as the on-chip entropy source present in most modern processors, and a demand-led fallback to periodically re-seeded software-based, NIST approved, cryptographically secure pseudorandom number generators (CSPRNGs) may be used. The fallback occurs when the desired read rate of randomness exceeds the ability of the on-chip entropy source to meet the demand. This approach avoids rate-limited blocking behaviour while waiting for the on-chip entropy source, whilst optimally maximising the randomness of mixing and, therefore, the ability to foil SCAs. CSPRNGs and on-chip true randomness may add further to the entropy thwarting SCAs, since the attacker cannot establish a correlation with a known CSPRNG. [57] In addition to providing successful resistance to SCAs, it should be clear to the skilled person that the invention also thwarts FIAs. FIAs have a high dependency on each task being executed at exactly the same time and in the same order in each iteration of an algorithm. The random out of order execution of each task for each iteration, which is key to the inventive techniques described above, means that there is
no consistency between one round and another. A fault injection attacker has no reliable event to focus upon because they can never expect task X to occur at time Y in a repeatable fashion. [58] As will be appreciated by one skilled in the art, the present techniques may be embodied as a system, method or computer program product. Accordingly, present techniques may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. For example, the automatic encryption of data as it is stored on a memory chip is commonplace and the invention may be integrated into such memory chips as part of their encrypting process. [59] Furthermore, the present techniques may take the form of a computer program product embodied in a computer readable medium having computer readable program code embodied thereon. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. [60] Computer program code for carrying out operations of the present techniques may be written in any combination of one or more programming languages, including object oriented programming languages and conventional procedural programming languages. Code components may be embodied as procedures, methods or the like, and may comprise sub-components which may take the form of instructions or sequences of instructions at any of the levels of abstraction, from the direct machine instructions of a native instruction set to high-level compiled or interpreted language constructs. 
[61] Embodiments of the present techniques also provide a non-transitory data carrier carrying code which, when implemented on a processor, causes the processor to carry out any of the methods described herein. [62] The techniques further provide processor control code to implement the above-described methods, for example on a general purpose computer system or on a digital signal processor (DSP). The techniques also provide a carrier carrying processor control code to, when running, implement any of the above methods, in particular on a non-transitory data carrier. The code may be provided on a carrier such as a disk, a microprocessor, CD- or DVD-ROM, programmed memory such as non-volatile memory (e.g. Flash) or read-only memory (firmware), or on a data carrier such as an optical or electrical signal carrier. Code (and/or data) to implement embodiments of the techniques described herein may comprise source, object or executable code in a conventional programming language (interpreted or compiled) such as Python, C, or assembly code, code for setting up or controlling an ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array), or code for a hardware description language such as Verilog (RTM) or VHDL (Very high speed integrated circuit Hardware Description Language). As the skilled person will appreciate, such code and/or data may be distributed between a plurality of coupled components in communication with one another. The techniques may comprise a controller which includes a microprocessor, working memory and program memory coupled to one or more of the components of the system. [63] It will also be clear to one of skill in the art that all or part of a logical method according to embodiments of the present techniques may suitably be embodied in a logic apparatus comprising logic
elements to perform the steps of the above-described methods, and that such logic elements may comprise components such as logic gates in, for example a programmable logic array or application- specific integrated circuit. Such a logic arrangement may further be embodied in enabling elements for temporarily or permanently establishing logic structures in such an array or circuit using, for example, a virtual hardware descriptor language, which may be stored and transmitted using fixed or transmittable carrier media. [64] In an embodiment, the present techniques may be realised in the form of a data carrier having functional data thereon, said functional data comprising functional computer data structures to, when loaded into a computer system or network and operated upon thereby, enable said computer system to perform all the steps of the above-described method. BRIEF DESCRIPTION OF THE DRAWINGS [65] Embodiments of the present invention will now be described, by way of example only, and with reference to the accompanying drawings in which: [66] Fig.1 is a simplified schematic diagram of an apparatus which may be used; [67] Fig.2 shows a double buffered random queue which may be used in the apparatus of Fig.1; [68] Fig. 3 is a flowchart illustrating the method of randomising at a pipeline level which may be carried out on the apparatus of Fig.1; [69] Fig.4 shows the double buffered random queue of Fig.2 together with the steps in the process; [70] Fig. 5a is a flowchart for AES encryption and decryption which is an example of a computer program which can be processed using the method and apparatus described above; [71] Fig. 
5b is a schematic diagram illustrating different levels of task granularity applied to the AES algorithm; [72] Fig.6 is a diagram of an AES dataflow communicating task pipeline of ten rounds for a 128-bit key relating to the encryption process illustrated in Fig.5; [73] Fig.7a is a key for each of the steps in Fig.7b and 7c; [74] Fig.7b and 7c are alternate representations of the pipeline in Fig.6 using the key in Fig.7a; [75] Figs.8a and 8b are representations of two alternative arrangements having two pipelines using the key in Fig.7a; [76] Fig.9a is a variation of the apparatus shown in Fig.1; [77] Fig.9b is a representation of the pipeline for Fig.9a using the key in Fig.7a; [78] Figs.10a and 10b illustrate a further arrangement of the apparatus in two configurations; [79] Fig.10c is a representation of the pipeline for Figs.10a and 10b using the key in Fig.7a; [80] Fig.11a is a variation of the apparatus shown in Fig.9a with two processing cores and related buffered random queues; [81] Fig.11b is a variation of the apparatus shown in Fig.10a with two processing cores and related buffered random queues; [82] Fig. 11c is a variation of the apparatus shown in Fig. 9a with six processing cores and related buffered random queues;
[83] Fig. 12a is a schematic illustration of a circular buffer which is used to implement an interoperation communication queue, e.g. SPSC; [84] Figs. 12b and 12c are schematic illustrations of pipelines using SPSC queues at the coarse and medium granularity levels respectively; [85] Fig.12d is a schematic flowchart for implementing random out of order execution of the pipeline of Fig.12c; [86] Figs. 13a and 13b are schematic representations of the commutative operations within ShiftRows and MixColumns of the AES algorithm, respectively; [87] Fig.13c is a flowchart of the method of randomising at a task level which may be carried out on the apparatus of Fig.13d; [88] Fig.13d is a schematic block diagram of the apparatus for carrying out the method of Fig.13c; [89] Figs. 13e and 13f are schematic representations of tuneable shuffling of the commutative operations within AddRoundKey and SubBytes of the AES algorithm, respectively; [90] Figs.14a and 14b are screen shots from an oscilloscope measuring the test apparatus carrying out standard AES encryption and AES encryption using the method of Fig.13c, respectively; [91] Figs. 15a to 15e are test results indicating the performance of AES encryption without any mixing and with different levels of mixing as implemented using the method of Fig.13c; [92] Fig. 16 is a schematic diagram illustrating different levels of task granularity applied to the Keccak-f algorithm; [93] Figs. 17 and 18 are schematic representations of the commutative operations within the combined rho_pi stage and the chi stage of the Keccak-f algorithm of Fig.16; [94] Fig.19a is a schematic illustration of the tasks of the Keccak-f algorithm; [95] Fig.19b is a schematic illustration of a system for implementing random out of order execution of the pipeline of Fig.19a; and [96] Fig.20 is a flowchart summarising the methods described in the preceding figures.
DETAILED DESCRIPTION OF THE DRAWINGS [97] Side channel attacks (SCAs) and fault injection attacks (FIAs) on an algorithm are predicated on the orderly and predictable execution of the targeted algorithm. As described below, an algorithm can be adapted so that it is amenable to randomised out of order execution (ROOOX) and therefore resistant to SCAs and FIAs. Adapting the algorithm may comprise examining the parts (for example tasks and/or operations) within the algorithm and the order in which they occur. Commuting steps are those which can be executed in any order, while non-commuting steps must be executed in a specific order. An element of the adaptation may thus include identifying the commuting and non-commuting steps. Different techniques for random execution may then be applied to commuting and non-commuting tasks (or operations) as explained in more detail below. [98] The process of modifying the part(s) or whole of the algorithm to enable ROOOX may be termed "pugging". The dictionary definitions of pugging include the act or process of working and tempering clay to make it plastic and of uniform consistency, or deadening sound, e.g. by laying mortar or the like
between the joists under the boards of a floor or within a partition. The goal of pugging an algorithm is to enable flexibility of execution to the point where the randomised out of order execution of its parts prevents an attacker gaining any useful information from the analysis of its side-channel emissions during execution. The act of pugging may be described as an approach that breaks apart, or decomposes, an algorithm into its smaller parts in order to understand the relationship between its execution and the side-channel information which is leaked during execution. [99] The decomposed parts may then be randomised in time (temporal) or in time (temporal) and place (topological). As described below, this is achieved by using a system which is able to change the order in which parts are executed by one or more processors and may also change the location (i.e. processor/processing core) which is executing the part of the algorithm. Commuting and non- commuting operations need to be randomised differently and thus a successful implementation requires one or both of: a) A means of elastically executing commuting operations temporally or temporally and topologically b) A means of elastically executing non-commuting operations temporally or temporally and topologically Non-Commutative Operations [100] Fig. 1 shows a schematic diagram of one system for executing a computer algorithm so that the risk of successful side channel attacks is reduced. As shown in Fig. 1, this is the simplest arrangement with a single core microprocessor 10 which accesses pointers from a double buffered random queue (DBRQ) 12. In this example, the DBRQ is long enough to store a plurality of pointers each of which points to a task within the computer algorithm. The DBRQ may thus be considered to be a pipeline comprising a plurality of separately executable tasks. A pipeline is particularly suitable when tasks are non-commuting and thus need to be executed in a specific order. 
As explained in more detail below, the connections in the pipeline retain the logical order of the processing but the physical order of execution is random. The DBRQ is one example of a scheduling mechanism which achieves an asynchronous, wait-free pipeline of communicating sequential processes.

[101] Non-commuting tasks are critically dependent on the immediate outcome of the neighbouring tasks within the pipeline. Thus, non-commuting tasks may also be termed operations, where an operation is defined in mathematics as a function which takes zero or more input values (called operands) to a well-defined output value. As explained in more detail below, it is possible to execute the tasks (or operations) out of order, in both time and place, if the following conditions are met:

i) their outputs (i.e. the intermediate results) are held in a queue between operations;

ii) accessing those queues is wait-free;

iii) those queues can be tested for being full or empty;

iv) execution is asynchronous.

[102] The term pointer, as used herein, refers to a programming language object that stores the memory address of another value located in computer memory. A pointer references a location in memory, and obtaining the value stored at that location is known as dereferencing the pointer. The actual format and content of a pointer variable is dependent on the underlying computer architecture. However, as
schematically shown in Fig.1, the processor 10 accesses a memory 14 based on the pointer from the DBRQ 12. Each pointer (ptr) is represented with a single symbol, e.g. an empty box such as those shown either side of the number 38.

[103] The memory may comprise a volatile memory, such as random access memory (RAM), for use as temporary memory, and/or non-volatile memory such as Flash, read only memory (ROM), or electrically erasable programmable ROM (EEPROM), for storing data, programs, or instructions, for example. In the example shown in Fig.1, each pointer points to a piece of executable code. Each piece of executable code may be part of a larger computer algorithm. A computer algorithm is a sequence of steps that take an input and convert it into an output via a series of intermediate input and output stages. Traditionally, a single hardware execution resource steps and loops through each of the computational stages in the form of commands. However, this stepwise and synchronous serial execution of commands for the whole algorithm is a consequence of the engineering decisions and constraints of the hardware and not an inherent consequence of the algorithm itself.

[104] It is generally possible to break up or decompose an algorithm into smaller, separate steps which may be considered to be serial fractions of the algorithm. These serial fractions act upon intermediate input and output data towards the end result. Such serial fractions can be executed asynchronously if the necessary intermediate input data is ready and the intermediate output data can be made available for the next stage. If the required intermediate data is not present, no computation takes place but it may be possible to execute a different serial fraction. Decomposing an algorithm into serial fractions that intercommunicate data may be termed dataflow programming from the perspective of graph theory, and communicating sequential processes from the perspective of process calculus.
[105] The concept of a generic task is important when describing how the DBRQ pipelines are used and combined as described below. In computing, a task may be defined as a unit of execution or a unit of work. Alternative terms include process, light-weight process, thread (for execution), step, request or query (for work). A generic task is one where the details of purpose are not relevant, such as in the scheduling of tasks, and may be represented by a single symbol, e.g. a black box. This black box can be a placeholder for any of the tasks, e.g. the serial fractions, within an algorithm. Similarly, a memory pointer may also be represented with a single symbol, e.g. an empty box.

[106] As explained in more detail in relation to Figs.5a to 7c, such a DBRQ as shown in Fig. 1 is suitable for use with a decomposed 10-round 128-bit key AES encryption pipeline. Accordingly, in this example, there are 38 pointers but it will be appreciated that the length may be adjusted to suit the computer algorithm which is being executed on the system. The purpose of a task is of no concern to the DBRQ and so the DBRQ may be conveniently represented as two short groups of either vertical or horizontal (as shown in Fig. 1) task pointer symbols. Either side of the task pointer symbols, Fig.1 shows a PUSHable input and a POPable queried output to complete the symbolic representation of the DBRQ.

[107] As shown in Fig.1, the apparatus may comprise a random number generator 16 to assist the processor when randomising the plurality of executable tasks at a pipeline level using the DBRQ. The temporal scrambling may use both a hardware-generated source of true randomness, such as the on-
chip entropy source present in most modern processors, and a demand-led fallback to periodically re-seeded software-based, NIST approved, cryptographically secure pseudorandom number generators (CSPRNGs).

[108] Fig.2 shows more detail of the DBRQ of Fig.1. The DBRQ 12 comprises a first queue which may be termed a future task queue 22 and a second queue which may be termed a pending task queue 24. Each of these queues may be considered to be a resource pool, with the resource being a pointer to executable code. The future task queue stores future task pointers and the pending task queue stores pending task pointers. In Fig.2, the future task queue 22 is full, e.g. with the 38 pointers which are used to implement AES encryption, and the pending task queue 24 is empty. All the pointers to each of the tasks in the future task queue 22 are in consecutive order. Thus, the DBRQ 12 is depicted in its start-up mode or, as explained below, at a point in the process when all pointers have been returned to the future queue.

[109] Fig.3 is a schematic flowchart illustrating the steps which may be carried out by the system of Fig.1 with the DBRQ of Fig.2. In a first step S300, the processor issues a first POP instruction to the DBRQ which results in an internal POP request to the pending task queue in the DBRQ. The processor then determines whether the pending task queue is empty in step S302. When it is determined that the pending task queue is empty, the pointers in the full future task queue are randomly shuffled in step S304. The randomised pointers in the future task queue are then swapped into the empty pending task queue in step S306. In other words, the failed POP request effectively triggers a request to swap the empty pending task queue and the full future task queue, which exchanges their roles.

[110] As shown in Fig.3, a new POP instruction may then be issued to the DBRQ (step S300).
This time, the determination of whether the pending task queue is empty (step S302) will find that the pending task queue is not empty. The now full, randomly ordered, pending queue can fulfil the POP request as set out at step S308. Thus, the DBRQ (particularly the pending task queue) releases a task to the processor for execution. There is an attempt to execute the task at step S310. As explained above, intermediate input data may be needed for the task to be executed, thus it is not always possible to execute a task. Regardless of whether the task is executed, the processor issues a DBRQ PUSH instruction to recycle the task pointer at step S312. The DBRQ stores this recycled pointer in the future task queue, which is initially empty. The process of steps S300, S302, S308, S310 and S312 is then repeated for each task pointer in the pending task queue so that the future task queue is gradually filled.

[111] As execution of the tasks is attempted one by one from the pending task queue, the task pointers are stored in the future task queue until such time as the pending task queue is empty. At this time, when a new POP instruction is issued to the DBRQ (step S300), the determining step (S302) will determine that the pending task queue is empty. The steps S304 and S306 will then be repeated. In other words, when the pending task queue is empty, the future task queue can be randomly shuffled, the pending and future task queues can be swapped and the cycle repeated. This continues until the algorithm has completed the processing of all the input data which it has been presented.

[112] Fig.4 is a schematic system and flowchart illustrating a snapshot of the process of Fig.3. The reference numbers for features in common with Figs.2 and 3 are used in Fig.4. As shown in Fig.4, the
future task queue 22 comprises a number (in this example, three) of task pointers which have been recycled following their execution, if possible, by the processor 10. A task pointer 30 (which has the sequential order number 10) is shown as being fetched by the POP instruction (S300) to be executed by the processor 10. Another task pointer 32 (which has the sequential order number 13) is shown within the processor 10 for execution. Another task pointer 34 (which has the sequential order number 7) is shown being recycled by the processor 10 using a recycle instruction (S312) and pushed to the future task queue 22 using the push instruction (S314). The remaining task pointers are stored in the pending task queue 24 in the randomised order generated by the shuffle step (S304).

[113] Fig. 4 illustrates the process of attempting to execute each task in more detail. In a first step S400, the processor determines whether the output is full, i.e. whether the task has already been executed in a previous cycle through the task pointers. If the processor determines that the output is full, the task pointer is recycled (step S312) and returned to the future task queue. If the processor determines that the output is not full, the processor determines whether the input is empty (step S402). Some of the tasks will be dependent on other tasks having already been performed. If the input is empty (i.e. the task cannot be completed until at least one other task is completed), the task pointer is recycled (step S312) and returned to the future task queue. If the input is not empty, the task is ready for execution. The checks to determine whether the input is empty and/or the output is full may be considered to be preconditions which need to be met before executing the task.

[114] As explained above, the pointer references a location in memory. Once the preconditions are met, the next step S404 is to read the input from the memory 14. The appropriate task is then carried out in step S406.
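The POP/shuffle/swap/PUSH cycle of Figs. 2 to 4 may be sketched in Python as follows. This is a minimal, single-threaded illustration only: the class and method names are hypothetical and the sketch omits the wait-free and hardware-entropy aspects of a real implementation.

```python
import random
from collections import deque

class DoubleBufferedRandomQueue:
    """Illustrative sketch of the DBRQ of Figs. 2 to 4 (names hypothetical).

    Task pointers live in two queues: a pending queue from which tasks are
    POPped for execution, and a future queue into which task pointers are
    recycled by PUSH whether or not the task actually executed.
    """

    def __init__(self, task_pointers, rng=None):
        self.future = deque(task_pointers)   # start-up: all pointers in the future queue
        self.pending = deque()               # start-up: pending queue empty
        self.rng = rng or random.Random()

    def pop(self):
        # S300/S302: a POP on an empty pending queue triggers a shuffle and swap
        if not self.pending:
            tasks = list(self.future)
            self.rng.shuffle(tasks)          # S304: randomise the future queue
            self.pending = deque(tasks)      # S306: swap the roles of the queues
            self.future = deque()
        return self.pending.popleft()        # S308: release a task for execution

    def push(self, task_pointer):
        # S312: recycle the pointer into the (initially empty) future queue
        self.future.append(task_pointer)
```

Repeatedly calling `pop()` until the pending queue empties, pushing each pointer back after the execution attempt, reproduces the cycle of steps S300 to S312 and guarantees every task is presented exactly once per pass.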
Merely as an example, this task may be a step from the AES cipher algorithm. The output from the execution of the task is then written into the memory 14 (step S408). Once the task is complete and the output is stored, the task pointer is recycled (step S312) and returned to the future task queue. The process is repeated until all task pointers are moved from the pending task queue to the future task queue. The future task queue may then be shuffled again and the pointers moved to the pending task queue.

[115] As shown in Figs.3 and 4, a key step in the process is to randomise or shuffle the task pointers before moving all the task pointers from the future task queue to the pending task queue. If the task pointers were simply moved from the future task queue to the pending task queue without shuffling, the predictable sequential ordering of the tasks would leave the operation vulnerable to an SCA. This randomisation helps to defeat higher level SCAs such as DPA, High-Order Differential Power Analysis (HO-DPA) and Wavelet Denoised High-Order Differential Power Analysis (WD-HO-DPA).

[116] The randomisation is not naively implemented, because a naive implementation may become terminally stalled virtually from the outset. The implementation described above may be considered to provide a “Fair Secure Random Ordering Task Pipeline Execution”. This is achieved by separating out the random execution of all of the pipeline tasks in the present (i.e. those in the pending task queue) from the random execution of all of the pipeline tasks in the future (i.e. those in the future task queue). The mechanism of keeping two separate queues, one for pending task pointers and one for future task pointers, and only applying a random transformation to the temporal ordering when the queues are
swapped, means that the double buffered random queue ensures two critical fairness guarantees, namely:

1. Although each task will be presented for execution in a random order, it is guaranteed to be executed at least once within the execution time of the total number of tasks in the pipeline.

2. Although the tasks in the pipeline will be presented for execution randomly, the whole pipeline is guaranteed to be executed.

[117] The importance of the critical fairness guarantees may be illustrated by considering a short algorithm which comprises several tasks, e.g. the function:

f(x) = 3(sin(x²)) + 1

By virtue of the rules of operator precedence, this function is composed of four tasks (or operations) performed in the sequence square, sine, multiplication and addition. The ordered sequence of the operations can be broken into steps with the intermediate (or intermediary) outputs represented by a, b, c and d.
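A minimal Python sketch of this four-task pipeline, driven to completion by the double buffered random queue mechanism described above, might look as follows. This is an illustration only (queue capacities are unbounded here for simplicity, and all names are illustrative rather than part of the described apparatus):

```python
import math
import random
from collections import deque

# Illustrative sketch of randomised out of order execution of
# f(x) = 3(sin(x^2)) + 1. Task i reads its intermediate input from
# queues[i] and writes its output to queues[i + 1], mirroring the
# directed-graph representation used in the text.
def run_pipeline(inputs, rng=None):
    rng = rng or random.Random()
    queues = [deque(inputs), deque(), deque(), deque(), deque()]
    ops = [
        lambda x: x * x,        # task 1: square   -> a
        math.sin,               # task 2: sine     -> b
        lambda b: 3.0 * b,      # task 3: multiply -> c
        lambda c: c + 1.0,      # task 4: add      -> d
    ]
    future, pending = list(range(4)), []
    n_outputs = len(inputs)
    while len(queues[4]) < n_outputs:
        if not pending:                   # empty pending queue:
            rng.shuffle(future)           # shuffle the future queue and
            pending, future = future, []  # swap the roles of the two queues
        task = pending.pop(0)             # POP a randomly ordered task pointer
        if queues[task]:                  # precondition: intermediate input available?
            queues[task + 1].append(ops[task](queues[task].popleft()))
        future.append(task)               # recycle the pointer in every case
    return list(queues[4])
```

Because each task only fires when its intermediate input is present, the outputs equal f(x) for every input, in input order, regardless of the random order in which the task pointers are presented.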
[118] The DBRQ may allow the algorithm to execute more than one input, which may be represented as x0, x1… xn. These inputs may be represented in a queue as [xn… x0]. Using a directed graph representation and a queue of n inputs, the pipeline process may be represented as:

[xn… x0]→square→[]→sine→[]→multiply 3→[]→add 1→f(x)

Using the DBRQ above, a pointer to each of these tasks is initially included in the future task queue and then the pointers are randomly shuffled into the pending task queue. If the order of the task pointers is randomly shuffled to be 3, 4, 1, 2 in the pending task pointer queue, following the process above will mean that the task pointer to task 3 is selected first and there is an attempt to execute step 3. The precondition checks will show that there is no intermediate input b0, so the step will fail and thus there is no change to the representation above. Similarly, there will then be an attempt to execute step 4 which will also fail and return the same result as above. Only when the third task pointer is taken from the pending task queue will the attempt on the task be successful because the required input to step 1, i.e. x0, is available and the intermediate output a0 can be output. Thus, the result below is returned and it is noted that the input queue length is reduced:

[xn… x1]→square→[a0]→sine→[]→multiply 3→[]→add 1→f(x)

Step 2 “sine” can also now be implemented because the intermediate input a0 is available. Thus, the result below is returned:
[xn… x1]→square→[]→sine→[b0]→multiply 3→[]→add 1→f(x)

Although all the task pointers have now been returned to the future task queue, the full algorithm has not yet executed. Accordingly, the task pointers are randomly shuffled again and moved to the pending task queue. This time the random shuffling results in the order 1, 4, 3, 2. Working through this example, the square process has the necessary input for the next piece of data x1, and the result below is returned:

[xn… x2]→square→[a1]→sine→[b0]→multiply 3→[]→add 1→f(x)

As before, step 4 “add” fails because there is no intermediate input c0 and the result above is returned again. Step 3 “multiply” receives the intermediate input b0 and can return intermediate output c0 and so the result below is returned:

[xn… x2]→square→[a1]→sine→[]→multiply 3→[c0]→add 1→f(x)

Finally, step 2 “sine” is processed and the intermediate input a1 is available so the result below is returned:

[xn… x2]→square→[]→sine→[b1]→multiply 3→[c0]→add 1→f(x)

[119] Although all the task pointers have now been returned to the future task queue, the full algorithm has still not yet executed for the first input x0. Accordingly, the task pointers are randomly shuffled again and moved to the pending task queue. This time the random shuffling results in the order 3, 4, 2, 1. Working through as above, step 3 “multiply” receives the intermediate input b1 and can return intermediate output c1 and so the result below is returned:

[xn… x2]→square→[]→sine→[]→multiply 3→[c1, c0]→add 1→f(x)

Step 4 “add” receives the intermediate input c0 (there is only a single processor) and can return intermediate output d0 and so the result below is returned:

[xn… x2]→square→[]→sine→[]→multiply 3→[c1]→add 1→[d0]

Step 2 “sine” tries to execute but there is no intermediate input so the result above is unchanged.
Finally, step 1 “square” is processed and the intermediate input x2 is available so the result below is returned:

[xn… x3]→square→[a2]→sine→[]→multiply 3→[c1]→add 1→[d0]
The pipeline has now processed its first element to completion but, as shown, there are additional inputs to process. Given that the pipeline is not empty, processing continues using randomly permuted out of order execution (or randomised out of order execution – the terms may be used interchangeably) until all inputs have been processed. In other words, the tasks in the pipeline continue to be presented for execution in a random order but are only executed if the associated intermediate input is available. Thus, for each input, the tasks are executed in the correct order. However, if each task has an associated intermediate input, all tasks in the pipeline may be executed in the randomised out of order execution. As can be seen above, the processing of inputs x0, x1 and x2 is temporally mixed together in the final step. This may be considered to add further to the randomness and thereby to the resistance to SCAs.

[120] Using the simple example above, we can see that stalling will not occur. Suppose that, by chance, the worst case permutation for the order of execution transpires, i.e. 4, 3, 2, 1 four times in a row. The probability of a particular shuffle of four elements is 1 in 24 and the probability that the same sequence results from the next three shuffles is 1 in 13824. Nevertheless, the complete algorithm will operate after the four executions of the pipeline. In the first pass, a first intermediate output a0 will be generated, i.e. only one of the tasks can be executed. The intermediate output a0 can then be used in the next pass through the pipeline to generate the first intermediate output b0 and, if there is more than one input, a second intermediate output a1 will also be generated, i.e. two of the tasks can be executed. Intermediate output c0 is generated in the third pass together with a second intermediate output b1 and a third intermediate output a2. In other words, three of the tasks execute when there are multiple inputs.
Finally, the first output d0 is generated on the last pass and thus the algorithm has produced an output after four executions. If there are multiple inputs, the pipeline is now full and will also generate a second intermediate output c1, a third intermediate output b2 and a fourth intermediate output a3. In other words, all of the tasks execute. Subsequent passes will then always generate an output until all the inputs are processed.

[121] The algorithm which is executed as described above may be any algorithm with any number of tasks arranged in a pipeline. Regardless of the action of a pipeline task, there are defining features common to all tasks. For example, failing any of the preconditions required for execution should happen in the shortest amount of time possible so that the task pointer can be recycled almost immediately. Such a failure-fast-track mechanism is key in order to minimise the overheads inherent in decomposing an algorithm into communicating sequential tasks. The importance of reducing all of the overhead costs associated with a pipeline model lies not simply in the outright performance of a pipeline(s) but also, as will be explored later, in the scalability of pipelines in a parallel execution environment.

[122] The defence against SCAs which is provided by executing the algorithm as described above is predicated on the fact that each task is a short section of serial code that executes in a consistently minimal timescale and, when executed randomly in time, has an EMF signature that is indistinguishable from white noise. In order to achieve this, one of the design features for all tasks is that any repeated section of code within the task must itself be short and discrete (such that any repetition is not dependent on the order of preceding repetitions). The task may comprise a minimal number of repeats and, as explained below, may further have the order of their indices randomly shuffled every time. Preferably,
all repeating sections in a task are essential, short, discrete and randomly shuffled - otherwise they should be broken out into separate tasks.

[123] It is clear that a task with looping sections could have each of the loop stages broken out into separate smaller communicating tasks. However, doing so rapidly balloons the unavoidable overheads of inter-task communication and randomised scheduling, affecting performance and scalability. Therefore, a balance must be struck between breaking up an algorithm into ever smaller tasks - the task granularity - and the overheads of the system (inter-task communication, temporal randomisation, topological randomisation and scheduling).

[124] At each level of task granularity, there will be opportunities to reorder and relocate the execution of the algorithm’s programmatic steps. When and how to perform this reordering of execution (which may be termed random out of order execution (ROOOX)) will depend on the trade-off between performance and SCA resistance. The random out of order execution of non-commuting operations carries a slightly higher overhead than the random out of order execution of commuting operations because of the need to ensure that non-commuting operations are carried out in the correct order for each input. Ultimately, separating or decomposing the code of a particular algorithm is dependent on many factors, such as the algorithm itself, the mode of execution and the trade-off between performance and SCA resistance that the use case calls for.

[125] Examples of algorithms which may be separated into tasks and processed as described above include encryption algorithms and these can be any one from the many known to the skilled person. Examples of encryption algorithms are the five algorithms referred to as the “AES finalists” in the NIST selection process, namely Rijndael, Serpent, Twofish, RC6 and MARS.
For the purposes of describing the invention, the Rijndael AES encryption algorithm will be used as an example algorithm. As shown in Fig.5a, the algorithm may be decomposed into a task pipeline with several discrete tasks.

[126] Fig.5a illustrates the steps or tasks involved in encrypting plaintext [on the left] and decrypting cipher text [on the right]. For both pathways there are three stages: a first AddRoundKey, a Round Key process repeated multiple times, and a last AddRoundKey. These three stages convert plaintext to cipher text using the encryption algorithm or convert cipher text to plaintext using the decryption algorithm. Each stage consists of one or more steps selected from: AddRoundKey, SubBytes or its inverse, ShiftRows or its inverse, and MixColumns or its inverse. These steps occur in certain orders at certain times and some are grouped in loops. It would seem that each named step is a suitable place to start decomposing the AES algorithm into tasks that communicate with each other via inputs and outputs to form a communicating task pipeline.

[127] The AES algorithm can be decomposed into smaller constituent parts, all the way down to the microcode of the CPU if needed. How far an algorithm is broken up into smaller parts is termed the granularity of decomposition and examples of levels of granularity are set out below:

* Coarse grained – such as splitting the AES algorithm into its constituent rounds
* Medium grained – such as splitting a round into its component actions, e.g. SubBytes
* Fine grained – such as splitting SubBytes into its programming language operations, e.g. in C++
* Very fine grained – such as splitting the programming language operations into their assembly language components
* Ultra fine grained – such as splitting the assembly language opcodes into their on-chip microcode.

Fig.5b illustrates the coarse, medium and fine grained decomposition. In this example, the coarse grained level of decomposition comprises 10 rounds. The order of execution of these rounds is strictly non-commutative. The medium grained level of decomposition comprises 4 sub-steps in each of rounds 1 to n-1 (AddRoundKey, SubBytes, ShiftRows and MixColumns) and 3 sub-steps in the last round (AddRoundKey, SubBytes and ShiftRows). Each of the sub-steps in each round is also strictly non-commutative. However, at the fine-grained level of decomposition, some commutative operations can be identified in each of the AddRoundKey, SubBytes, ShiftRows and MixColumns tasks. At each of these stages of granularity, there will be opportunities to reorder and relocate the execution of the algorithm’s programmatic steps, taking into account whether the operations are non-commutative or commutative, as described in more detail below.

[128] In a standard operation of the algorithm, the above steps occur as rhythmical and predictable computations and some are grouped in loops. This makes them vulnerable to side channel attacks. The encryption algorithm of Fig.5a has been decomposed in Fig.6 to show in detail the tasks which communicate with each other via intermediate inputs and outputs [shown by arrows which represent the data flows] to form a communicating task pipeline. When 10 rounds are applied, e.g. N=10, it should be noted that the resulting AES decomposed task pipeline (medium granularity) consists of 36 separate task duplicates of the four basic tasks, an input stage (plain block, p) and an output stage (encrypted block, e), making 38 task based stages in total having 40 separate communication channels. The key size used for an AES cipher specifies the number of transformation rounds that convert the input, called the plaintext, into the final output, called the ciphertext.
Thus, there are typically 10 rounds for 128-bit keys, 12 rounds for 192-bit keys and 14 rounds for 256-bit keys. It will be appreciated that the task pipeline will thus be longer for 192- and 256-bit keys and thus a longer DBRQ will be required.

[129] Using the DBRQ above and the “Fair Secure Random Ordering Task Pipeline Execution”, the first fairness guarantee is met by ensuring that no task pointer to an already attempted task can be returned to the pending task queue until an attempt to execute all of the pending tasks has been made. Thus, for a pipeline having 38 task based stages, using the DBRQ described above means that each task in the pipeline has an increasing chance of being executed: 1/38, 1/37, 1/36, … 1/3, 1/2, 1. Furthermore, by virtue of the second guarantee, the full algorithm will be executed after at most 38 runs of the pipeline. Thus, a data block will pass through the pipeline and be fully encrypted after at most 38 runs of the pipeline.

[130] It is useful to contrast the process of the invention with standard AES adapted to a task pipeline approach in which pointers to each of the 38 tasks are stored in a single queue and presented to a single processing core for orderly execution. The data stream of 128-bit blocks could progress through the pipeline undergoing all of the 38 steps of encryption in temporal order. Initially, the pointer at the front of the queue may be removed by a POP instruction. The processor may then use the pointer to start the task in memory. The processor would confirm that the output connection is not full and the input connection is not empty and then read a 128-bit data block from the input connection. The next step in the AES pipeline would then be applied to the read data block to transform the data block. The
transformed (or modified) data block would then be written to the output connection. The pointer would be recycled to the back of the queue with a PUSH instruction and the whole process would be repeated for the next pointer in the queue until the queue is completed.

[131] Completing the steps in sequence means that the process is vulnerable to an SCA. A simple, naïve randomisation may be applied to the single queue of pointers, for example by selecting a pointer within the queue at random rather than selecting the pointer at the head of the queue. A disadvantage of such a process is that a random transformation must take place each time a task pointer is selected, rather than each time the queues are swapped as in the process described above.

[132] Furthermore, such a naïve randomisation is likely to lead to a pipeline that almost never executes from beginning to end. For example, the probability P0 of starting the pipeline at any point by randomly selecting the first task stage of the pipeline is 1/38, i.e. P0 = 0.026. The probability P1 of selecting the next stage of the pipeline is also 1/38 and so on up to P37. The probability of the data block progressing from the first stage of the pipeline to the second stage is P0 × P1, i.e. 0.000676. Therefore, using Naïve Random Ordering Task Pipeline Execution, the probability of a single data block making it all the way through the 38-stage AES pipeline and becoming fully encrypted is P0 × P1 × … × P37, i.e. 5.87 × 10⁻⁶¹, which is supremely improbable. When a task pipeline stops working at an execution point it is termed “stalled” and if it is unable to start again it is “terminally stalled”. The Naïve Random Ordering Task Pipeline Execution of a single pipeline will be stalled just under 100% of the time and thus is useless. By contrast, the present invention executes the complete algorithm in at most 38 executions of the pipeline.

[133] The AES encryption pipeline shown in Fig. 6 may be represented by the symbols shown in Fig. 7a.
These effectively give shorthand descriptions of the components presented earlier. Use of the symbols leads to the symbolic representation of the 128-bit key AES encryption communicating task pipeline shown in Fig. 7b. This can be further condensed by omission of the communication arrows, which leads to the representation shown in Fig.7c. Such condensed representations may be helpful in explaining how the arrangement of Fig.1 may be adapted to multiple pipelines. For example, given a single input stream, such as audiovisual data captured from a camera, and a single execution core able to perform fast enough to handle the workload from two or more pipelines, the effectiveness of SCA countermeasures can be considerably increased. It is also within the scope of the invention to have several sources of plaintext or ciphertext being processed at the same time, for example a video stream together with an audio stream.

[134] Fig.8a illustrates two abbreviated multi-stage AES encryption pipelines p1 and p2, each consisting of the four basic task types identified above, repeated with input and output tasks to a total of 38 tasks. Fig. 8b shows a mechanism for splitting the incoming data stream into plain data blocks p1 and p2, known as a demultiplexer (demux), and a mechanism for recombining the encrypted data blocks p1 and p2, known as a multiplexer (mux). As shown in Fig.8b, the data blocks are split, encrypted by each pipeline and then recombined for onwards transmission. The two pipelines may be buffered together. Each pipeline could have a different plaintext or ciphertext feed. For example, pipeline A might be involved with encrypting a video feed while pipeline B may be involved with encrypting an audio stream from a different source.
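The demux/mux arrangement of Fig. 8b may be sketched as follows. This is a Python illustration under the assumption of a simple alternating split of blocks between the two pipelines; the function names are hypothetical and form no part of the described apparatus:

```python
# Hypothetical sketch of the demux/mux of Fig. 8b: incoming blocks are
# dealt alternately to two pipelines and the outputs are recombined in
# the original order.
def demux(blocks):
    # feeds for pipelines p1 and p2 (even- and odd-indexed blocks)
    return blocks[0::2], blocks[1::2]

def mux(out1, out2):
    # interleave the two pipeline outputs back into the original order
    recombined = []
    for i in range(max(len(out1), len(out2))):
        if i < len(out1):
            recombined.append(out1[i])
        if i < len(out2):
            recombined.append(out2[i])
    return recombined
```

With this alternating split, `mux` restores the original block order provided each pipeline preserves the order of its own feed, which the fairness guarantees of the DBRQ ensure.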
[135] The pointers to the 76 tasks of both the pipelines can be randomised in two possible ways. A first way is shown in Fig. 9a and comprises a single processor 10 and a single DBRQ 92 with an enlarged capacity when compared with Fig. 1. Alternatively, as shown in Figs. 10a and 10b, there may be two DBRQs 112A and 112B (each having a capacity for 38 task pointers) and a scheduler 100. In each arrangement, the memory and each of the pending and future task queues of the DBRQs are omitted for ease of display.
[136] Fig. 9b shows the workings of the embodiment of Fig. 9a. The single DBRQ contains all of the 76 tasks from the first and second AES pipelines p1 and p2. The single processor processes each task pointer (ptr) randomly presented from the pending task queue of the single DBRQ as described above and then returns the processed task pointer to the future task queue of the single DBRQ.
[137] Fig. 10c shows the workings of the embodiment of Figs. 10a and 10b. The first DBRQ 112A contains the 38 tasks from the first AES pipeline p1 and the second DBRQ 112B contains the 38 tasks from the second pipeline p2. The single processor 10 processes task pointers (ptr) in a round-robin fashion as determined by the scheduler 100. Fig. 10b shows the single processor processing each task pointer from the pending task queue of the first DBRQ 112A as described above and then returning the processed task pointer to the future task queue of the first DBRQ 112A. Fig. 10c shows the single processor processing a task pointer from the pending task queue of the second DBRQ 112B as described above and then returning the processed task pointer to the future task queue of the second DBRQ 112B. The round-robin fashion means that the arrangements shown in Figs. 10b and 10c are alternated, with a task from the first DBRQ 112A being completed followed by a task from the second DBRQ 112B, followed by another task from the first DBRQ 112A and so on.
Once the pending task queue in the first or second DBRQ is empty, the task pointers in the future task queue are shuffled and moved to the pending task queue as described above. It is, of course, not necessary for DBRQs A and B to be coherently task-aligned with separate AES pipelines p1 and p2, and it will be appreciated that there are various possible permutations of distributing tasks between multiple DBRQs.
[138] When only a single processing core is used, as in the last two embodiments, there is no real difference in start-up times and data encryption throughput - though this is not the case when more than one processing core is available, as will be discussed below. However, with these two embodiments, the "amount" of randomness [entropy] associated with EMF signal leakage is doubled, over and above the single AES pipeline. It follows that the resistance to SCAs is significantly increased - but at the cost of increased processing requirements. Indeed, one could increase the entropy of the leaked EMF signals further by adding multiple pipelines. However, although more pipelines imply more EMF entropy, this comes at a cost of either slower throughput or the need for a faster processor. It is, as ever, a trade-off - more security comes at a greater cost in memory and processing resources.
[139] In summary, using multiple AES pipelines to encrypt a single input stream magnifies the entropy of leaked EMF and therefore increases resistance to Side Channel Attacks.
[140] A further improvement may be to use multiple cores. Fig. 11a shows a variation of the arrangement shown in Fig. 9a with a dual processing core 110. As in Fig. 9a, there is a single DBRQ 120 of double capacity. As explained above, the single DBRQ 120 randomises the temporal order in which the tasks are executed and may be termed a temporal DBRQ. The temporal DBRQ 120 is
connected to a scheduler 200 which issues two POP instructions to the temporal DBRQ 120. In response to the POP instructions, when there are pointers in the pending task queue, two pointers are pushed to the scheduler 200. The number of pointers corresponds to the number of processing cores and it will be appreciated that two is merely indicative.
[141] A second DBRQ, which may be termed a topological DBRQ 122, is also connected to the scheduler 200 to determine which core is to attempt to execute each of the two pointers. The topological DBRQ 122 is implemented in the same manner as the temporal DBRQ 120 but comprises a pool of pointers to the physical means of executing the task. In this case, there are two cores and hence two pointers, one for each core. The pair of pointers are initially in the future task queue of the topological DBRQ 122. A POP instruction to the empty pending task queue will trigger a shuffle of the task pointers before they are swapped to the pending task queue. The shuffle randomises which core will be selected. The scheduler uses the combined results from the temporal and topological DBRQs to determine which core is to attempt to execute which task at each moment in time. The combination of a temporal and a topological DBRQ randomises both time and location, thus increasing resistance to SCA.
[142] Fig. 11b shows a variant of the arrangement of Figs. 10a and 10b, following on from the embodiment of Fig. 8, in that there are now two temporal DBRQs 112A, 112B combined with a single topological DBRQ 122. A pointer from each of the temporal DBRQs 112A, 112B is fed to a scheduler 300, which also receives pointers regarding which core to use from the topological DBRQ 122. The individual cores on the dual processing core 110 thus randomly receive tasks to execute. The randomisation is both temporal and in location. Those computations do not generate any rhythmical or predictable power usage or EMF on which SCAs can be mounted.
[143] It will be appreciated that the arrangements of Figs. 11a and 11b are scalable to multiple cores or any combination of different processing elements (e.g. processors, network machines, logical cores, physical cores) by using a topological DBRQ of appropriate capacity. This is schematically illustrated in Fig. 11c, which shows a processor 210 having six cores, A, B, C, D, E, F. There is a temporal DBRQ 120 comprising a plurality of task pointers, for example 38 as described above. There is also a topological DBRQ 122 comprising a plurality of core pointers, for example six to match the number of cores. Six task pointers, e.g. 4, 3, 5, 8, 0 and 2, may be called from the temporal DBRQ 120 to the scheduler 200 and a random ordering of the six processing cores is called from the topological DBRQ 122. The result is that processing core A attempts to execute the task represented by task pointer 4, processing core B attempts to execute the task represented by task pointer 8 and so on. The cores may work on each task in parallel.
[144] Figs. 12a to 12d illustrate an alternative implementation of random out of order execution for non-commutative tasks. As set out above, this can be achieved when various conditions are met, including provision of a wait-free communication queue and asynchronous operation. A mechanism that enables elasticity without breaking the logical, algorithmic order can typically be implemented using two fundamental components: a wait-free communication queue and an asynchronous execution shuffler. The wait-free communication queue must provide queued buffering of intermediate results between non-commuting operations. The shuffler must enable shuffling of the physical order of execution of the non-commuting operations that maintains the logical order of the underlying algorithm and meets the various conditions.
[145] In the context of a single core processor environment, as an alternative to the DBRQ described above, a simple circular buffer can be used to implement the interoperation communication queue. Fig. 12a shows an example of this implementation. The buffering queue must be wait-free to enable asynchronous operation and thus the Abstract Data Type (ADT) operations of the queue must be conditional. This is typically achieved by first testing if the operation can take place and returning a Boolean value true on success and false if the operation failed. For example, the operation to "push" an item into the back (tail) of the queue must first test if the queue is full, returning false if so and otherwise returning true after the item has been added to the tail of the queue and the size of the queue has been increased by one. Similarly, the operation to remove an item from the front (head) of the queue must first check to see if the queue is empty, returning false if so and otherwise returning true after the item has been removed and the size of the queue has been reduced by one.
[146] The performance of the circular buffer implementation may be improved by restricting the capacity of the queue to a power of 2. Such a "2^n" queue has the advantage of being able to replace the slower modulo division step used to ensure circular progression of the indices with a faster bit mask operation. Furthermore, from a practical standpoint, it makes sense to implement the queue as a generic data structure.
[147] In the context of a multi-core and/or multiprocessor environment, the communication queue between non-commuting operations must be synchronised to prevent race conditions. This can be achieved by using the data structure known as a single-producer, single-consumer (SPSC) queue. Such an SPSC queue restricts queues between non-commuting operations to that of a single producer and a single consumer.
The performance value of this approach is that the wait-free feature can be implemented using light-weight concurrency data types known as "atomic variables" for the head and tail indices. Merely as an example, an implementation in C++ is shown below:

/**
 * Concurrent wait free, lock free, single producer, single consumer, bounded FIFO queue
 * Capacity must be an unsigned power of 2 to enable performance advantage of bitwise masking.
 */
template<typename T, size_t N>
class spsc_queue final {
    static_assert((N & (N - 1)) == 0, "capacity + 1 must be a power of 2");
public:
    typedef std::array<T, N> container_type;
    typedef T value_type;
    typedef T& reference;
    typedef const T& const_reference;

    explicit spsc_queue();
    spsc_queue(const spsc_queue& other);

    bool empty(); // always check before pop
    bool full();  // always check before push
    void push(const_reference item); // push to full is UB
    void pop(reference item);        // pop from empty is UB

private:
    std::atomic<size_t> head;
    std::atomic<size_t> tail;
    size_t mask;
    container_type q;
};

[148] From a functional perspective, in highly memory constrained environments (such as Atmel microcontrollers with only a few kilobytes of SRAM), a queue capacity of only one is workable. However, this will lead to regular stalls of the pipeline where no progress on intermediate values can be made. The rate of pipeline stalling drops away as the size of the queue increases, culminating in the pipeline not stalling at all once the capacity of each communication queue is the same as or more than the number of operations in the pipeline.
[149] Considering the application of an SPSC queue to the AES algorithm, looking back at Fig. 5b, the coarse granularity level has ten non-commuting stages (Encryption round 1, … Encryption round 10) and each of these stages could be connected with wait-free queues. This is schematically illustrated in Fig. 12b. Similarly, considering the medium granularity level shown in Fig. 5b having 40 stages (i.e. including repeated AddRoundKey, SubBytes, ShiftRows and MixColumns as described above), a longer pipeline could be implemented consisting of 40 stages connected by 39 wait-free queues. This is schematically illustrated in Fig. 12c. The permutations of the pointers in the pipeline are 10! and 40! respectively. Both the arrangements in Figs. 12b and 12c can be executed using wait-free single-producer single-consumer queues.
Thus, each pipeline can be executed out of order and asynchronously as follows:
* The function pointer sequence is shuffled
* Each function pointer from the shuffled sequence is taken and execution is attempted
* The function checks if it has a non-empty input queue and a non-full output queue and, if so, executes
* Once all the functions in the sequence have attempted execution, the sequence is shuffled again and the process repeats until there are no more input bytes.
[150] In this way, at least one function will be able to make progress at each run of the shuffled sequence of pointers (in a similar manner to the simpler four stage example pipeline described above). After at most 40 runs of the shuffled sequence of pointers in the pipeline, each queue has at least one intermediate result and, therefore, each function in the pipeline can make progress. In reality, the probability of having to wait 40 turns before the pipeline is fully primed, and no longer stalls, is vanishingly small.
[151] Fig. 12d schematically illustrates a flowchart for implementing the pipeline of Fig. 12c. The pipeline requires 39 connecting queues of bytes, an input stream of plain text bytes, an output stream of cipher text bytes and a vector of function wrappers 20 to the pipeline stages. In this example, f1 is AddRoundKey-1, f2 is SubBytes-1, f3 is ShiftRows-1, f4 is MixColumns-1, f5 is AddRoundKey-2, f6 is
SubBytes-2 and so on. There is also an array 22 of indices for each of the function pointers (in this example, there are 39 indices) and a random number generator 16 which is connected to a shuffle module 24. The shuffle module 24 shuffles the array of indices and hence the function pointer sequence is shuffled. A CPU 10 and a memory 14 are also part of the system.
[152] Using C++ as the example language (other languages are available), one can take advantage of a language feature by embedding C++ lambda functions to check preconditions before invoking the AES subroutine. This may be expressed as:

// queue data structure
using queue_t = spsc_queue<uint8_t, 8>;
// function sequence data structure
using function_sequence_t = std::vector<std::function<void()>>;
// 39 connecting queues
queue_t q0, q1, q2, q3, q4, q5, q6, q7, q8, q9, q10, q11, q12, q13, q14, q15,
        q16, q17, q18, q19, q20, q21, q22, q23, q24, q25, q26, q27, q28, q29,
        q30, q31, q32, q33, q34, q35, q36, q37, q38;
// input and output streams
stream_t in_plain, out_cypher;
// 40 non-commuting functions with pre and post condition checks
// Initialising the function sequence can be done in the natural order
function_sequence_t aes_subroutines {
    // round 1
    [&]() { if (!q0.full() && in_plain.ready()) { q0.push(add_round_key(rkey, in_plain.read())); }},
    [&]() { if (!q0.empty() && !q1.full()) { q1.push(sub_bytes(q0.pop())); }},
    [&]() { if (!q1.empty() && !q2.full()) { q2.push(shift_rows(q1.pop())); }},
    [&]() { if (!q2.empty() && !q3.full()) { q3.push(mix_columns(q2.pop())); }},
    [&]() { if (!q3.empty() && !q4.full()) { q4.push(add_round_key(rkey, q3.pop())); }},
    …
    // round 10 - last round, no mix columns
    [&]() { if (!q36.empty() && !q37.full()) { q37.push(sub_bytes(q36.pop())); }},
    [&]() { if (!q37.empty() && !q38.full()) { q38.push(shift_rows(q37.pop())); }},
    [&]() { if (!q38.empty() && out_cypher.ready()) { out_cypher.write(add_round_key(rkey, q38.pop())); }}
};

[153] For simplicity in Fig. 12d, the array and the vector of function wrappers show the sequence without any shuffling, but it will be appreciated that any order may be generated by the RNG and shuffle module. Once the shuffling step is completed, each function pointer from the shuffled sequence is taken in turn and execution is attempted. As shown in the vector of function wrappers, before execution is attempted, there are first checks to see whether the output queue is full or the input queue is empty. These checks may be performed by the CPU 10 or any other suitable processing unit or processor (not shown).
[154] When the output queue is not full and the input queue is not empty, the function executes. The execution may be done by the CPU 10. The output of the execution is sent to the appropriate output queue, which may be stored in memory 14. If one or both of the pre-conditions is not met, the function is not executed. The next function is then selected according to the next function pointer in the sequence and the pre-condition checks are repeated. Once execution of all the functions in the sequence has been attempted, the sequence is shuffled again, and the process repeats until there are no more input bytes. Despite the random out of order physical execution of each stage f in the pipeline, the logical order is maintained as a directed graph of function nodes connected by wait-free queues, i.e. preserving the algorithm, despite the random order of execution of its parts.
The resulting C++ example code would be as follows:

// prepare a PRNG
auto rng = std::default_random_engine{};
// process all input plain text bytes
while (in_plain.ready()) {
    // mix up the aes subroutines
    std::shuffle(std::begin(aes_subroutines), std::end(aes_subroutines), rng);
    // attempt to execute the subroutines
    for (const auto& f : aes_subroutines) {
        f();
    }
}

[155] A multi-core and/or multi-processor implementation could use any scheduling approach but perhaps the simplest is to use the "grand central dispatch" approach. In this approach, a single AES subroutine data structure would be wrapped by concurrency primitives that feed a pool of executor threads that randomly jostle to execute the next subroutine.
[156] In each of the embodiments described above, it has not been necessary to examine task connections in detail as a simple shared memory location interface between them would suffice. However, the potential problem of pipeline stalling arises when fair secure random temporal reordering of the stages of the AES pipeline is implemented. Slower start-up times are inevitable and will be randomly dispersed between upper and lower bounds while saturation of all the buffers occurs, but then the process runs quicker. A possible mitigation to this problem is structured sequential buffer loading at start-up. From a communication perspective therefore, each DBRQ needs to have a switch to turn on the random transformation once all the buffers are loaded; this applies to the single core - single pipeline case as well.
[157] In the AES example described above, each of the tasks (AddRoundKey etc.) is a non-commutative operation within the overall AES computer algorithm. In other words, each task must be carried out in the specified order. Each task may thus be considered to be a communicating sequential process (CSP) "bubble". All of the CSP bubbles are connected together into a pipeline using asynchronous wait-free queues as communication channels between the bubbles. It will be appreciated that other non-commutative algorithms, e.g. RSA, can be structured in a similar way. The bubbles may be executed randomly out of order provided that at each turn, execution of all bubbles is attempted exactly once before the order is shuffled and the next turn proceeds. This is achieved by use of the temporal DBRQ as explained above. It will be clear that each queue in the double buffer must be at least as long as the number of bubbles in the pipeline to avoid random stalls.
[158] Initially, the process means that no final output will appear unless the random order generated is the correct order – a 1 in 38! chance for the example AES algorithm. However, the process is guaranteed to produce output after a number of runs (i.e. shuffle and execution (or attempted execution) cycles), with the number of runs being less than or equal to the number of bubbles (38 in the case of the example AES algorithm). Further, the normal distribution means that typically the pipeline will produce output after about 16 runs for the example AES algorithm. Additionally, in practice the pipeline has at least one data item in each of its communication queues after a short while (as shown above) and thus at every turn of executing the pipeline bubbles, each and every bubble will execute successfully and the pipeline no longer stalls.

Commutative Operations

[159] As explained above, one of the design features for all tasks is that any repeated section of code within the task must itself be short and discrete.
Nevertheless, a task may comprise a minimal number of repeats and these repeating steps may be commutative, i.e. may be carried out in any order. Where an algorithm is rich in commutative steps, it may be possible to avoid using the communicating channels between each of the "bubbles" so that the bubbles may be considered to be disjoint from one another. A randomised out of order execution may be achieved by shuffling the order of the indices used for each bubble's operators.
[160] Returning to the four fundamental building blocks for the AES algorithm in Fig. 5a and the resultant 4 processes in Fig. 6, i.e. AddRoundKey, SubBytes [or inverse], ShiftRows [or inverse], and MixColumns [or inverse], we can illustrate a process for introducing randomisation at the task level, i.e. within each of the tasks. This process may also be considered to introduce randomisation to the commuting operations. As explained above in relation to Fig. 5b, this is also a fine-grained level of granularity.
[161] AES operates on a 4x4 column major order array of bytes which may be termed the state. For instance, 16 bytes b0, b1, … b15 are represented by the two dimensional array:

    b0  b4  b8  b12
    b1  b5  b9  b13
    b2  b6  b10 b14
    b3  b7  b11 b15
[162] Considering each task in turn, in the AddRoundKey step, each byte ai,j of the state is combined with a byte ki,j of the round subkey to generate each byte bi,j of the result, i.e. bi,j = ai,j ⊕ ki,j (where ⊕ denotes bitwise XOR).
This may be expressed in a C++ language implementation as:

static void AddRoundKey(uint8_t round, state_t* state, const uint8_t* RoundKey)
{
    uint8_t i, j;
    for (i = 0; i < 4; ++i) {      // outer loop
        for (j = 0; j < 4; ++j) {  // inner loop
            (*state)[i][j] ^= RoundKey[(round * Nb * 4) + (i * Nb) + j];
        }
    }
}

[163] The above implementation reveals two nested loops; the outer loop (i values – row values) executes four iterations of the inner loop (j values – column values), which itself has four iterations - resulting in sixteen iterations of the EXclusive Or (XOR) operation. Each XOR operation may be considered to be a discrete action and the indices i, j indicate the order in which the action is executed. Given that those mounting SCAs are aware of these loops and the rhythmical and predictable computations, they can seek and exploit the power consumption and EMF signals arising during their execution. There is no requirement to execute loop(s) in an ordered fashion if the action being looped over is discrete, i.e. not dependent on previous action(s), and, as demonstrated in the code below, the 16 iterations of the XOR operation for each AddRoundKey task are commutative, i.e. may be carried out in any order. Thus, instead of having the index variable of a loop progress sequentially, e.g. 1, 2, 3, 4, the indices can be randomly shuffled, e.g. 2, 4, 1, 3. If every time a loop is executed the order of the index variable is randomly shuffled, then the memory locations addressed by the looping action will never occur in a predictable order and leaked EMF signals will not be vulnerable to Simple Power Analysis.
[164] The AddRoundKey task may be re-designed to exploit the commutative nature and introduce loop index shuffling - an example is shown below:

Code Block 2 - Loop Index Shuffled AddRoundKey
// This function adds the round key to state using Loop Index Shuffling.
// The round key is added to the state by an XOR function.
rijndael::add_round_key(uint8_t &rkey, uint8_t* i) {
    // if the next item in the blend sequence is true(1)...
    // (NB add_counter is 8 bit so will overflow back to zero)
    if (blend_add[add_counter++]) {
        shuffled_add.shuffle();          // shuffle the block array index sequence
        for (auto& j : shuffled_add) {   // use the shuffled indices...
            // this is where to put the parallel execution
            *(i + j) ^= xkey[rkey + j];  // to 'add' (xor) the round key
        }
        rkey += 16; // advance the overall round key index
    } else {
        // otherwise default to 'add' the round keys in order
        // unrolling the loop shows that all of these 16 'add' operations are commutative
        *(i + 0) ^= xkey[rkey++];
…
    }
}

[165] Applying loop index shuffling to the 16 cycles of the AddRoundKey task results in a number of permutations equal to 16! = 20,922,789,888,000. Loop index shuffling thus results in 20,922,789,888,000 possible random permutations in the order in which the memory locations are addressed by the XOR action inside the AddRoundKey loops.
[166] Loop index shuffling provides temporal randomisation of the execution of the operations within the AddRoundKey task. Additionally, each of the 16 steps is a tiny embarrassingly parallel problem which may also be subject to topological randomisation. Given a multi-core processor which is amenable to parallel execution, each step could randomly be assigned to one of the cores. A Gustafson-Barsis linear scaling (up to 16 cores or hyperthreads) suggests that the performance penalty of mixing may be overcome. Indeed, the performance of the task when it is temporally and topologically randomised in execution is likely to be increased relative to the normal algorithm. In theory, any algorithm which has been subjected to temporal randomisation (for one or both of the commutative or non-commutative operations) is likely to run faster than the standard implementation once all the pipelines are full.
[167] There are also some commutative steps in each of the SubBytes, ShiftRows and MixColumns tasks and their inverses. For example, in the SubBytes task, each byte ai,j in the state array is replaced with a SubByte S(ai,j) using an 8-bit substitution box. The box is a 16x16 matrix of byte values and consists of all the possible combinations of an 8-bit sequence. However, the box is not just a random permutation of these values and there is a well-defined method for creating the s-box tables. As before, each byte is separately replaced and this may be expressed as follows:
[168] Each of these 16 replacements may be termed a discrete action and each is commutative and can be carried out in any order. Loop index shuffling can be applied in a similar manner to that in the AddRoundKey task and there are the same number of possible permutations, i.e. 16!. Merely as an example, one format for the loop index shuffled arrangement may be expressed as follows:
[169] In the ShiftRows task, the bytes in each row are cyclically shifted by a certain offset. For AES, the first row is left unchanged and each byte of the second row is shifted one to the left. Similarly, the third and fourth rows are shifted by offsets of two and three respectively. In other words, the ShiftRows task
changes the order of bytes in each row of the state for all rows except the first. This may be represented as follows:

    b0  b4  b8  b12        b0  b4  b8  b12
    b1  b5  b9  b13   →    b5  b9  b13 b1
    b2  b6  b10 b14        b10 b14 b2  b6
    b3  b7  b11 b15        b15 b3  b7  b11
[170] Clearly, there are not 16 commutative steps in this task. However, there are three commutative discrete actions or cycles – one for each rotate (or shift) step. Thus, the number of permutations is 3! = 6. Furthermore, within each of the rotate steps, another small shuffle sequence, e.g. [1, 5, 9], may be used to mix within each rotate step. This gives a nested set of permutations and thus the overall number of permutations is 3!(3!) = 36. Fig. 13a schematically illustrates the nested commutative steps within ShiftRows which may be pugged, i.e. randomised for out of order execution.
[171] Merely as an example, one format for the loop index shuffled arrangement may be expressed as follows:

void rijndael::shift_rows(uint8_t* i) {
    if (blend_rows[rows_counter++]) {
        shuffled_rows.shuffle();
        for (auto& j : shuffled_rows) {
            switch (j) {
            case 0: { // Rotate first row 1 column to left
                uint8_t rol{*(i + 1)};
                *(i + 1 ) = *(i + 5 );
                *(i + 5 ) = *(i + 9 );
                *(i + 9 ) = *(i + 13);
                *(i + 13) = rol;
                break;
            }
            case 1: { // Rotate second row 2 columns to left
                uint8_t rol = *(i + 2 );
                *(i + 2 ) = *(i + 10);
                *(i + 10) = rol;
                rol = *(i + 6 );
                *(i + 6 ) = *(i + 14);
                *(i + 14) = rol;
                break;
            }
            case 2: { // Rotate third row 3 columns to left
                uint8_t rol = *(i + 3 );
                *(i + 3 ) = *(i + 15);
                *(i + 15) = *(i + 11);
                *(i + 11) = *(i + 7 );
                *(i + 7 ) = rol;
                break;
            }
            }
        }
    } else {
        // each of these 3 rotate stages is commutative but still contains several regular steps
        // Rotate first row 1 column to left
        uint8_t rol{*(i + 1 )}; // *but there are 4 steps in this rotate stage that are commutative and could be shuffled
        *(i + 1 ) = *(i + 5 );
        *(i + 5 ) = *(i + 9 );
        *(i + 9 ) = *(i + 13);
        *(i + 13) = rol;
        // Rotate second row 2 columns to left
        rol = *(i + 2 );
        *(i + 2 ) = *(i + 10);
        *(i + 10) = rol;
        rol = *(i + 6 );
        *(i + 6 ) = *(i + 14);
        *(i + 14) = rol;
        // Rotate third row 3 columns to left
        rol = *(i + 3 );
        *(i + 3 ) = *(i + 15);
        *(i + 15) = *(i + 11);
        *(i + 11) = *(i + 7 );
        *(i + 7 ) = rol;
    }
}

[172] In the MixColumns task, the bytes in each column are combined using an invertible linear transformation. The MixColumns task is a 32-bit operation that transforms four bytes of each column in the state. The new bytes of the column are obtained by the given constant matrix multiplication in the Galois Field GF(2^8). During this operation, each column is transformed using a fixed matrix and may be expressed as:

    [b0']   [2 3 1 1]   [a0]
    [b1'] = [1 2 3 1] × [a1]
    [b2']   [1 1 2 3]   [a2]
    [b3']   [3 1 1 2]   [a3]
[173] Again, there are not 16 commutative steps in this task. However, there are at least four commutative steps – one for each transformation step. However, a finer grained mixing could occur inside each of these steps too. Thus, like ShiftRows, there is a set of nested permutations, which is schematically illustrated in Fig. 13b. In this case, the number of permutations is 4!(4!) = 576.
[174] Merely as an example, one format for part of the loop index shuffled arrangement may be expressed as follows:

void rijndael::mix_columns(uint8_t* i) {
    if (blend_cols[cols_counter++]) {
        shuffled_cols.shuffle();
        for (auto& j : shuffled_cols) {
            switch (j) {
            case 0: {
                uint8_t a, b, c; // first column
                // the order of calculation of a, b, and c is also commutative
                a = *(i + 0);
                b = *(i + 0) ^ *(i + 1); // these indices could be shuffled
                c = *(i + 0) ^ *(i + 1) ^ *(i + 2) ^ *(i + 3); // these indices could be shuffled
                b = GF2(b); // GF2 multiplies its argument by 2 in GF(2^8)
                *(i + 0) ^= b ^ c;
                b = *(i + 1) ^ *(i + 2); // these indices could be shuffled
                b = GF2(b);
                *(i + 1) ^= b ^ c;
                b = *(i + 2) ^ *(i + 3); // these indices could be shuffled
                b = GF2(b);
                *(i + 2) ^= b ^ c;
                b = *(i + 3) ^ a;
                b = GF2(b);
                *(i + 3) ^= b ^ c;
                break;
            }
            case 1: {
                uint8_t a, b, c; // second column
                // the order of calculation of a, b, and c is also commutative
                … (the other column calculations have been omitted but it will be appreciated that they follow a similar pattern)
    } else {
        uint8_t a, b, c; // first column
        … (the other column calculations have been omitted but follow a similar pattern)

[175] The programming idiom presented above for loop index shuffling is generic and can also be applied to any looping section with commutative operations. Fig. 13c illustrates the main steps in the operation and Fig. 13d is a schematic representation of a hardware implementation which is adapted from that shown in Fig. 1, and thus the same reference numbers are used where appropriate. As shown in Fig. 13c, the first step is to determine the task to be executed (step S400). This may be determined by using a buffer as shown in Fig. 13d. In this arrangement, the buffer has all 38 tasks in temporal order, but it will be appreciated that the DBRQs or SPSCs described above can be used if randomisation at both pipeline and task level is required. It will also be appreciated that the determining step may be derived from the algorithm itself depending on its implementation.
[176] The next step may be to read an input from memory 14 (step S402), e.g. the input may be original plaintext (or ciphertext) or an intermediate input generated by a previous task in the algorithm. If the DBRQ or SPSC is used as described above, the pre-condition checks may also be carried out before attempting to read an input to make sure that the task can be executed. There may be an optional step of shuffling the indices of the commutative operations (step S404), depending on whether or not mixing is to be applied within the task, e.g. using loop index shuffling. Shuffling the indices may include obtaining a random number.
[177] As shown in Fig. 13d, the system comprises a random number generator 16 which may be the same as the ones described above in relation to Fig. 1. The random number generator 16 provides the random number to a shuffle module 24.
In this arrangement the shuffle module is shown as a separate module but it will be appreciated that the functionality may be incorporated in the CPU 10. The shuffle module 24 introduces a random shuffling permutation of the loop indices as shown in the array of indices
26. In this example, the array of indices 26 contains the 16 indices for the AddRoundKey task but this is merely exemplary.
[178] As illustrated in Figs.13e and 13f, it is possible to choose whether mixing is to be used at any moment for each of the tasks. The trade-offs between speed of execution and SCA resistance may be balanced by using a blend of fully shuffled operations together with standard ordered execution. The standard ordered execution is typically faster but is SCA vulnerable, whereas the permuted or randomised order is slightly slower but SCA resistant. Each of the shuffled commutative stages can be finely tuned to create a maximally performant implementation that achieves SCA resistance without needlessly overshooting the goal. In the example of Fig.13e, the aim is to shuffle 75% of the operations and leave 25% executing in a normal order for the AddRoundKey task. As indicated, the decision as to which of the multiple occurrences of the AddRoundKey task is shuffled is a random decision. This may be termed a fork-join parallelisation method. Fig.13f shows a similar schematic representation of the fork-join parallelisation method for the SubBytes task.
[179] Consider an implementation of these steps for the AddRoundKey task in C++ (other languages are available) as an example:

void rijndael::add_round_key(uint8_t &rkey, uint8_t* i)
{
    // if the next item in the blend sequence is true(1)...
    // NB add_counter is 8 bit so will overflow back to zero
    if(blend_add[add_counter++]) {
        // shuffle the block array index sequence
        std::shuffle(std::begin(add_indices), std::end(add_indices), rng);
        for(const auto& j : add_indices) {
            // use the shuffled indices...
            *(i + j ) ^= xkey[rkey + j];
        }
        rkey += 16; // advance the overall round key index
    }
    else {
        // otherwise default to 'add' the round keys in order
        *(i + 0 ) ^= xkey[rkey++];
        *(i + 1 ) ^= xkey[rkey++];
        *(i + 2 ) ^= xkey[rkey++];
        *(i + 3 ) ^= xkey[rkey++];
        *(i + 4 ) ^= xkey[rkey++];
        *(i + 5 ) ^= xkey[rkey++];
        *(i + 6 ) ^= xkey[rkey++];
        *(i + 7 ) ^= xkey[rkey++];
        *(i + 8 ) ^= xkey[rkey++];
        *(i + 9 ) ^= xkey[rkey++];
        *(i + 10) ^= xkey[rkey++];
        *(i + 11) ^= xkey[rkey++];
        *(i + 12) ^= xkey[rkey++];
        *(i + 13) ^= xkey[rkey++];
        *(i + 14) ^= xkey[rkey++];
        *(i + 15) ^= xkey[rkey++];
    }
}

[180] This method of randomising the indices used in discrete operations can be applied separately to the 16 cycles of the AddRoundKey task, the 16 cycles of the SubBytes task, the four cycles of the MixColumns task and the four stages of the ShiftRows task. Similarly, the tuneable approach of selecting
the execution pathway as shown in Figs.13e or 13f can be applied to any of the four stages of the AES algorithm (or to any other algorithm). If mixing is applied to all occurrences of one task, the mixing levels may be considered to be 25% mixing when one of the four tasks is mixed, 50% when two of the four tasks are mixed, 75% when three of the four tasks are mixed, and 100% when all tasks are mixed as described above. It will be appreciated that different levels of mixing may be achieved by tuning one or more of the tasks. In other words, mixing may be applied to one task, to any number of the tasks, or to some but not all of the repeating tasks in the pipeline.
[181] Shuffling the indices requires a random number to be generated. The task may request a new random number for each loop, e.g. 16 times, or may more optimally request all of the random numbers which are required at the same time. Thus, the random number may be requested before the task is executed or during the execution of the task. Accordingly, the order in which the random number is obtained during the method may also be optional. Where there are 16 cycles to be randomised, this may, for example, be achieved by requesting a single 64 bit random number at the start of the 16 shuffles and breaking the single random number into 16 four bit numbers. This could be done quickly, for example by masking the lower nibble of the 64 bit random number and then shifting it right by 4 bits for the next round. This saves 15 calls to the random number generator if this is used to generate the random number and reduces the algorithm’s time complexity to O(n) compared to O(n²). It may also provide 64 bit register optimisation as a bonus.
This improved method for the shuffling may be implemented as follows:

void shuffle()
{
    uint64_t w = r(); // load up a 64 bit, i.e. 16 nibble, word
    uint8_t j = w & 0xF; // mask off the lower nibble
    for (uint8_t i{ 0 }; i < 15; ++i) {
        std::swap(v[i], v[j]);
        w >>= 4; // shift right to the next nibble
        j = w & 0xF; // set j to the new lower nibble
    }
}

[182] Returning to Fig.13c, the task may now be executed (step S406). In this arrangement, the task may be executed as shown on a single CPU 10. However, it will be appreciated that each of the commutative cycles, e.g. 16 or 4 cycles, is suitable for parallel implementation on multiple cores. Finally, the output is written (step S408) for this task and the process repeats again, e.g. by issuing a POP instruction to obtain the next task in the pipeline from the buffer.
[183] Reflecting on the 36 repeating tasks in the 128bit key AES example, the possible permutations are given in the table below:
[184] In other words, using Loop Index Shuffling for the 128bit key AES task pipeline example results in 397,533,007,872,408 permutations for the temporal ordering of memory access during a single run of the algorithm (i.e. when encrypting/decrypting a single piece of input data). There are typically multiple inputs and each input must be processed through a complete run of the algorithm to be encrypted/decrypted. However, as illustrated with the short algorithm example above, after each randomisation of the ordering of the pipeline, it may be possible to execute more than one step within the pipeline because the pipeline contains multiple intermediate results. The pipeline may even be full of intermediate results, one for each of the different inputs. The number of tasks that are executable with each execution of the pipeline thus may also vary. Thus, it will be appreciated that the number of permutations is also dependent on the number of runs of the algorithm which are needed to fully process all the inputs. It will be appreciated that, for a given plurality of sequential blocks of plaintext, the blocks earlier in the sequence will have been enciphered (or will be having their later tasks executed), while the blocks later in the sequence will be having their earlier tasks executed (or will be waiting to join). This “overlapping” of runs of the algorithm applies to any algorithm after the pipeline has been executed once and before the final execution of the pipeline.
[185] Loop Index Shuffling can be employed for commutative sections of a task. When the commutative sections of code are not discrete or short, and/or require a large number of repetitions, it may be necessary to break them up into smaller communicating sequential tasks. The resulting increase in overheads may thus be an unavoidable cost of increased security.
[186] Because at least Simple Power Analysis [SPA] relies on the ordering of the fine-grained machine code instructions and the predictable ordering of the task pipeline, SPA can be defeated by the use of Loop Index Shuffling as shown in Figures 15a to 15e below. Additionally, second order DPA and CPA can be defeated. However, both MixColumns and ShiftRows have larger discrete ordered sections of program code. Moreover, even the single discrete actions of AddRoundKey and SubBytes, when the C++ program code is compiled into machine code, result in a large number of sequential machine code instructions that must necessarily be executed in order, and these therefore leak ordered power consumption and attendant EMF. One might be able to mitigate such leakage by further decomposing the executable tasks.
[187] If shuffling the loop indices is still vulnerable to more aggressive SCAs, a combination of Loop Index Shuffling at the task level and Double Buffered Random Queue or SPSC Scheduling at the pipeline level provides a stronger solution which may defeat more aggressive higher order SCAs such as
HO-DPA and WD-HO-DPA attacks. Further security can be implemented by introducing multiple DBRQs and multiple cores to defeat all types of SCAs. [188] The arrangements described above may be considered to be cryptographically secure randomising schedulers. The arrangements maintain one or more pools of pointers to executable program code (temporal DBRQs or SPSCs) and optionally one or more pools of places to execute program code (topological DBRQs or SPSCs). The scheduler either truly randomly or cryptographically securely randomly selects a pointer from the pools of pointers and asynchronously executes a small serial fraction of an algorithm (e.g. a cipher algorithm) if the pre-conditions are met. The pointers are then returned to their respective pool to be randomly used again. Where there are multiple cores or processors, the execution can be on a randomly selected hardware resource. In this way, the computational steps of an algorithm are randomised in at least time and also optionally location. As an additional or alternative way to achieve randomisation, loop index shuffling may be used. [189] The randomisation means that the electrical signals released by each small computational step are also random. Such random electrical noise can still be harvested externally at a distance but it can no longer be interpreted in time or possibly also location. Signals may thus no longer contain information about the computational steps which produced them. This randomisation may defeat a side channel attack. [190] A testing method was developed to perform a side channel attack on a particular hardware platform to show whether the encryption key can be found using a correlation power analysis (CPA). CPA uses changes in the power consumption of the microprocessor and these can be measured by monitoring the changes in the current consumed by the device. In this example, the testbed comprises a processor for executing the encryption algorithm, e.g. 
an Arduino Uno™ chip which is based on the ATmega328P microcontroller. The measurement should be performed as close to the microprocessor as possible and in this example, this was achieved by detaching the power pin of the Arduino Uno™ chip and placing a low value (100 Ω) shunt resistor between the power supply and the power input to the chip. Changes in current to the device now produce a proportional change in the voltage drop across this resistor. It will be appreciated that this is just one example of a suitable test bed.
[191] Figs. 14a and 14b show two screen shots from an oscilloscope, e.g. a 1GSa/s storage oscilloscope, which is monitoring the voltage drop across the resistor in the test set up described above. Fig.14a is a screen shot of the power trace when the Arduino Uno™ chip is executing standard AES encryption using a 128 bit key. The varying (lower jagged) line is the voltage drop across the monitoring resistor and the straight (upper) line is a timing pulse to identify when the encryption is taking place. The ten rounds of the AES encryption process are clearly shown in the varying trace. Fig.14b is a screen shot of the power trace for the same AES encryption using a 128 bit key but with the mixing (shuffling) process applied to all four sub-sections of the AES encryption. In other words, 100% mixing is applied. The ten rounds of the AES encryption process are still clearly shown in the varying trace but the signal is significantly noisier. It is noted that the time base for Figs.14a and 14b is not the same because the encryption process takes around 2.5 times longer using 100% mixing.
[192] CPA uses traces such as those illustrated in Figs.14a and 14b and applies a statistical analysis to the results of multiple encryptions to yield the encryption key. One method for performing such a
statistical analysis uses a ChipWhisperer-Lite, which is an open-source hardware and software product optimised for SCA. In this test bed arrangement, it was also necessary to include a bidirectional level shifter to sit between the Arduino Uno™ chip, which uses 5v logic levels, and the ChipWhisperer hardware, which uses TTL 3.3v logic levels.
[193] Figs. 15a to 15e are test results indicating the performance of AES encryption without any mixing and with different levels of mixing (shuffling) tested with this example test bed. Fig.15a shows the data from a power spectrum side channel attack on standard AES using 25 different 16-byte encryption keys. After 170 traces, 24 of the 25 keys are correctly returned by the SCA and, for the final key, 15 of the 16 subkeys are returned. Thus, standard AES is clearly vulnerable to a successful SCA with only a relatively small number of traces. It is noted that not all the keys are returned with fewer traces, e.g. for 50 traces roughly half of the subkeys are returned but no complete key is returned, and for 100 traces 6 keys are correctly returned and all of the attacks yield more than 12 of the 16 subkeys for each key.
[194] By contrast, Figs. 15b to 15e show that the same AES algorithm which has been subject to varying amounts of the mixing (shuffling) process described above is much less vulnerable to a successful SCA. Fig. 15b shows that if mixing is applied to one of the four tasks (for example AddRoundKey), none of the 25 encryption keys are completely returned after 170 traces of the SCA. There are three keys for which 9 or 10 subkeys are returned, and 10 subkeys is the highest number of subkeys returned. Fig.15c shows that if mixing is applied to two of the four tasks (for example AddRoundKey and SubBytes), the maximum number of subkeys which is returned for any one of the 25 encryption keys is reduced to just 5.
Similarly, increasing the mixing to three of the four tasks (for example, mixing within each of AddRoundKey, SubBytes and ShiftRows) reduces the maximum number of subkeys which is returned for any one of the 25 encryption keys still further, to just 1. Finally, when all of the tasks (e.g. AddRoundKey, SubBytes, ShiftRows and MixColumns) are subject to the mixing process described above, none of the subkeys are determined by the SCA.
[195] Additional tests indicating the performance of the mixing (shuffling) process described above were performed. The results were generated using an Intel Core i7-850H CPU @2.3GHz with 6 cores and 12 logical processors. A high performance clock was used to time the process of encrypting 100,000 pieces of data using AES. All results are expressed as seconds/100,000 encrypts. All test code was compiled in release mode with all speed optimizations enabled.
[196] Ten tests were performed and the results are given in the table below. The tests were:
1. the time required to process 100,000 encryptions using 128bit AES encryption without randomisation, i.e. standard AES encryption (no mix)
2. the 128bit AES encryption process randomised at a task level by truncated randomising of the AddRoundKey stage (mix1), as described above
3. randomised at a task level by truncated randomising of the SubBytes stage (mix2), as described above
4. randomised at a task level by truncated randomising of both the AddRoundKey stage and the SubBytes stage (mix1 & mix2).
5. initial single step randomisation at the start of the algorithm, randomising the AddRoundKey stage (mix1)
6. initial single step randomisation at the start of the algorithm, randomising the SubBytes stage (mix2)
7. initial single step randomisation at the start of the algorithm, randomising both the AddRoundKey stage and the SubBytes stage (mix1 & mix2)
8. randomisation (mix1) combined with the randomisation introduced at a pipeline level (parallel)
9. randomisation (mix2) combined with the randomisation introduced at a pipeline level (parallel)
10. randomisation (mix1 & mix2) combined with the randomisation introduced at a pipeline level (parallel).
[197] In other words, for tests 8 to 10 there is randomisation at a task level (i.e. temporal randomisation) together with topological randomisation, introduced by randomising which processing core within the processor executes each part of the task. The results are shown below:
Tests on multicore processor - 100,000 encryptions
[198] The second test compares the time required to process 100,000 encryptions using AES encryption without randomisation, i.e. standard AES encryption (no_mix), and the time required to process 100,000 encryptions when the AES encryption process has been randomised at a task level by randomising the add_round_key stage (mix_1) as described above. As shown in Fig.15b, this first type of randomisation (mix_1) is sufficient to foil 170 traces of a CPA side channel attack but the trade-off is an increase in time to 0.193028 seconds/100,000 encrypts compared to 0.0768093 seconds/100,000 encrypts. The third test results show an increase in time to 0.212729 seconds/100,000 encrypts for this second type of randomisation compared to 0.0768093
seconds/100,000 encrypts for the normal AES encryption. Thus, both options for 25% mixing result in a similar increase in processing time.
[199] As expected, the 50% mixing results in an additional increase in processing time when compared to 25% mixing, as shown in the results for test four.
[200] The results of tests two to four indicate that there is a performance penalty for introducing the randomisation. This may be an acceptable penalty given the improvement in reducing the risk of a successful side channel attack as demonstrated in Figs.15a to 15e. However, the performance may be improved by introducing the randomisation in a single step at the start of the algorithm rather than introducing a truncated randomisation at each step as explained above. Tests five to seven indicate the improved performance. Test five shows that the first type of randomisation (mix_1) only results in an increase in time to 0.110798 seconds/100,000 encrypts compared to 0.0768093 seconds/100,000 encrypts for the unmixed, normal AES encryption. Similarly, as shown in the results for test six, the second type of randomisation (mix_2) only results in an increase in time to 0.116044 seconds/100,000 encrypts compared to 0.0768093 seconds/100,000 encrypts for the unmixed, normal AES encryption. Finally, as shown in the results for test seven, the combination of the first and second type of randomisation (mix_1,mix_2) only results in an increase in time to 0.161795 seconds/100,000 encrypts compared to 0.0768093 seconds/100,000 encrypts for the unmixed, normal AES encryption.
[201] Tests eight to ten compare the time required to process 100,000 encryptions using AES encryption without randomisation, i.e. standard AES encryption (no_mix), and the time required to process 100,000 encryptions when the AES encryption process has been randomised both at a task level and by introducing randomisation as to which processing core within the processor executes each part of the task.
It is clear that whilst the AES algorithm is now randomised by both order and location, the price for this significantly increased resistance to SCAs is a large reduction in performance.
[202] As shown in the results for test eight, the first type of randomisation (mix_1) combined with the randomisation introduced at a pipeline level (parallel) results in an increase in time to 2.82678 seconds/100,000 encrypts compared to 0.0768093 seconds/100,000 encrypts for the unmixed, normal AES encryption. Similarly, as shown in the results for test nine, the second type of randomisation (mix_2) combined with the parallel randomisation (parallel) results in an increase in time to 2.7305 seconds/100,000 encrypts compared to 0.0768093 seconds/100,000 encrypts for the unmixed, normal AES encryption. Finally, as shown in the results for test ten, the combination of the first and second type of randomisation (mix_1,mix_2) and parallel mixing results in an increase in time to 5.67733 seconds/100,000 encrypts compared to 0.0768093 seconds/100,000 encrypts for the unmixed, normal AES encryption. It is clear from tests 8, 9 and 10 that randomising by both order and location carries a significant price in performance for the increased resistance to SCAs.
[203] As explained above, the methods described above to dissect an algorithm into parts which can be randomly executed are applicable to all algorithms. The application of the methods is described in detail with reference to the Rijndael (AES) block cipher algorithm. As an alternative example, the methods are applied to the Keccak (SHA-3) algorithm and its innovative Keccak-f cryptographic permutation, as illustrated in Figure 16 onwards. Keccak is a versatile cryptographic function. Best known as a hash function, it nevertheless can be used for authentication, (authenticated) encryption and pseudo-random number
generation. Its structure is the extremely simple sponge construction and internally it uses the innovative Keccak-f cryptographic permutation. It is the Keccak-f cryptographic permutation primitive that is the focus of the methods.
[204] As with the AES algorithm, the Keccak-f cryptographic permutation algorithm can be decomposed into smaller constituent parts, all the way down to the microcode of the CPU if needed. Figure 16 illustrates examples of coarse, medium and fine grained decomposition for the Keccak-f cryptographic permutation algorithm:
* Coarse grained – such as splitting into its constituent rounds (in this case 24)
* Medium grained – such as splitting a round into its component actions, e.g. θ (theta)
* Fine grained – such as splitting θ (theta) into its program language, e.g. C++
* Very fine grained – such as splitting the program language operations into their assembly language components
* Ultra fine grained – such as splitting the assembly language opcodes into their on chip microcode.
[205] A canonical implementation of the init/update/finalise (IUF) paradigm can be considered as the basic block permutation function consisting of 24 rounds of the five classical Keccak-f steps:
* θ (theta)
* ρ (rho)
* π (pi)
* χ (chi)
* ι (iota)
The main SHA-3 submission uses 64-bit words and so implementing the basic block permutation function consists of 24 rounds of the five steps. Each of the five steps and each of the rounds must be implemented in a specified order and thus they are non-commuting operations. The opportunities for shuffling the commuting operations become apparent after examining the loops within the theta, rho, pi, chi and iota stages. Randomisation of the execution of the commuting operations is explained first, followed by an implementation for randomising execution of the non-commuting operations.
[206] For the Keccak-f permutation, the state can be considered to be a 5 × 5 × w array of bits.
It is the job of theta (θ) to compute the parity of each of the 5w (320, when w = 64) 5-bit columns and to exclusive-or that parity into two nearby columns in a regular pattern. The theta (θ) task is schematically illustrated in the right-hand side of Figure 16, which shows that the first and second steps each have an array of indices {0, 1, 2, 3, 4} and the final step has an array of indices {0, 5, 10, 15, 20}. The theta (θ) task within the Keccak-f algorithm has a set of nested shuffled sequences and randomisation (or mixing) may thus be introduced by randomising the set of indices. In other words, the theta (θ) task can be randomly executed at a task level by shuffling the indices. At this fine grained level, the number of permutations is: nPr + nPr(nPr) = 5! + 5!(5!) = 14520
[207] Consider an implementation of the theta (θ) stage in the C++ programming language (other languages are available) as an example. As previously described, first prepare two shuffle sequences at startup:

sequence_t theta_indices { 0, 1, 2, 3, 4 };
sequence_t inner { 0, 5, 10, 15, 20 };

Applying the method of shuffling indices described above results in:

/* Theta */
for(const auto& i : theta_indices) {
    bc[i] = s[i] ^ s[i + 5] ^ s[i + 10] ^ s[i + 15] ^ s[i + 20];
}
std::shuffle(std::begin(theta_indices), std::end(theta_indices), rng);
for(const auto& i : theta_indices) {
    t = bc[(i + 4) % 5] ^ SHA3_ROTL64(bc[(i + 1) % 5], 1);
    std::shuffle(std::begin(inner), std::end(inner), rng);
    for(const auto& j : inner) {
        s[j + i] ^= t;
    }
}

[208] Similar to the AES implementation, the normal (i.e. unaltered) code for the theta (θ) stage could be blended randomly with the altered code. The blending could be tuned to deliver SCA protection at nominal throughput. Similarly, the same fork-join parallelisation described above could be used to randomise the location of the code execution.
[209] It is programmatically expedient to combine the ρ (rho) and π (pi) stages and this combination is schematically illustrated in Figure 17. The ρ (rho) stage comprises a bitwise rotation of each of the 25 words by a different triangular number (0, 1, 3, 6, 10, 15, …) and the π (pi) stage permutes the 25 words in a fixed pattern. No nesting of the ρ (rho) and π (pi) stages is required to implement the randomisation, and shuffling the indices at this fine grained level gives nPr = 24! ≈ 6.204484017 × 10²³ permutations.
[210] Consider an implementation of the ρ (rho) and π (pi) combined stages in the C++ programming language (other languages are available) as an example.
First a shuffle sequence is prepared:

sequence_t rho_pi_indices { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23 };

Applying the method of shuffling indices described above results in:

/* Rho Pi */
t = s[1];
std::shuffle(std::begin(rho_pi_indices), std::end(rho_pi_indices), rng);
for(const auto& i : rho_pi_indices) {
    j = keccakf_piln[i];
    bc[0] = s[j];
    s[j] = SHA3_ROTL64(t, keccakf_rotc[i]);
    t = bc[0];
}
[211] It is the job of the chi (χ) stage to bitwise combine along rows, such that x ← x ⊕ (¬y & z). The chi (χ) stage is schematically illustrated in Fig. 18 and, like the theta (θ) stage, there are nested sequences which can be shuffled to add a deeper level of mixing than for the combined ρ (rho) and π (pi) stages. Implementing the randomisation and shuffling the indices at this fine grained level gives nPr(nPr + nPr) = 5!(5! + 5!) = 28800 permutations.
[212] Consider an implementation of the chi (χ) stage in the C++ programming language (other languages are available) as an example. Prepare two shuffle sequences at startup:

sequence_t chi_indices { 0, 1, 2, 3, 4 };
sequence_t outer { 0, 5, 10, 15, 20 };

Applying the method to adapt the code described above results in:

/* Chi */
std::shuffle(std::begin(outer), std::end(outer), rng);
for(const auto& j : outer) {
    std::shuffle(std::begin(chi_indices), std::end(chi_indices), rng);
    for(const auto& i : chi_indices) {
        bc[i] = s[j + i];
    }
    std::shuffle(std::begin(chi_indices), std::end(chi_indices), rng);
    for(const auto& i : chi_indices) {
        s[j + i] ^= (~bc[(i + 1) % 5]) & bc[(i + 2) % 5];
    }
}

[213] The ι (iota) stage is a single round-based operation and does not contain any commuting steps. The operation is to exclusive-or a round constant into one word of the state. At this granularity, no changes can be introduced to the code. However, the ι (iota) stage is one of the non-commuting tasks which is amenable to alteration as described below.
[214] As shown in Figure 16, at a coarse grained level of granularity there are 24 rounds, with each round containing the five classical steps. As explained above, the ρ (rho) and π (pi) stages may be combined into a single subroutine.
The 24 round loop may be unrolled and expressed (for example using C++) as:

static void keccakf(uint64_t s[25])
{
    int i, j, round;
    uint64_t t, bc[5];
    // round 1
    theta(s);
    rho_pi(s);
    chi(s);
    iota(s);
    // round 2
    theta(s);
    rho_pi(s);
    chi(s);
    iota(s);
    …
    // round 23
    theta(s);
    rho_pi(s);
    chi(s);
    iota(s);
    // round 24
    theta(s);
    rho_pi(s);
    chi(s);
    iota(s);
}

[215] This sequence of non-commuting functions, theta, rho_pi, chi and iota, can now be randomly executed out of order using one of the techniques described above. For example, each of the functions may be interconnected with wait free single producer single consumer queues (SPSC) to result in a pipeline. This pipeline is schematically illustrated in Fig.19a. As shown, the pipeline has 96 functions with 95 connecting queues of bytes. There is also an input of plain text 64-bit words in chunks of 25 and an output of cipher 64-bit words in chunks of 25.
[216] Fig.19b schematically illustrates the system to implement the pipeline of Fig.19a. The pipeline can be executed asynchronously and out of order by virtue of the wait free connections carrying intermediate results. As shown, the system comprises an array 122 of indices for each of the function pointers and a vector 120 of function wrappers to the pipeline stages. In this example, there are 96 indices and thus 96! permutations in each complete run of the algorithm. As explained above in relation to the AES example, there may be multiple runs of the algorithm to process multiple inputs and thus the number of permutations may be still higher. There is a random number generator 116 which is connected to a shuffle module 124. The shuffle module 124 shuffles the array of indices and hence the function pointer sequence is shuffled. A CPU 110 and a memory 114 are also part of the system.
[217] For simplicity in Fig. 19b, the array and the vector of function wrappers show the sequence without any shuffling, but it will be appreciated that any order may be generated by the RNG and shuffle module. Once the shuffling step is completed, each function pointer from the shuffled sequence is taken in turn and execution is attempted.
As shown in the vector of function wrappers, before execution is attempted there are first checks to see whether the output queue is full and whether the input queue is empty. These checks may be performed by the CPU 110 or any other suitable processing unit or processor (not shown).
[218] When the output queue is not full and the input queue is not empty, the function executes. The execution may be done by the CPU 110. The output of the execution is sent to the appropriate output queue, which may be stored in memory 114. If one or both of the pre-conditions is not met, the function is not executed. The next function is then selected according to the next function pointer in the sequence and the pre-condition checks are repeated. Once execution of all the functions in the sequence has been attempted, the sequence is shuffled again, and the process repeats until there are no more input bytes.
[219] In this way, at least one function will be able to make progress at each turn of the shuffled sequence and after at most 96 turns of the shuffled sequence, each queue will have at least one intermediate result. Thus, each function can make progress after at most 96 turns. In reality, the probability of having to wait 96 turns before the pipeline is fully primed, and no longer stalls, is
vanishingly small. In this regard, loading is likely to follow a Gaussian distribution and the pipeline is likely to be full in about 48 turns.
[220] Merely as an example, the preconditions may be specified as embedded C++ lambda functions:

// queue data structure
using queue_t = spsc_queue<uint64_t[25], 8>;
// function sequence data structure
using function_sequence_t = std::vector<std::function<void()>>;
// 95 connecting queues
queue_t q0, q1, q2, q3, q4, q5, q6, q7, q8, q9, q10, q11, q12, q13, q14, . . . q90, q91, q92, q93, q94;
// input and output structure
uint64_t in_plain[25], out_cipher[25];
// 96 non-commuting functions with pre and post condition checks
// Initialising the function sequence can be done in the natural order
function_sequence_t keccak_subroutines {
    // round 1
    [&]() { if (!q0.full() && in_plain) { q0.push(theta(in_plain)); }},
    [&]() { if (!q0.empty() && !q1.full()) { q1.push(rho_pi(q0.pop())); }},
    [&]() { if (!q1.empty() && !q2.full()) { q2.push(chi(q1.pop())); }},
    [&]() { if (!q2.empty() && !q3.full()) { q3.push(iota(q2.pop())); }},
    …
    // round 24
    [&]() { if (!q91.empty() && !q92.full()) { q92.push(theta(q91.pop())); }},
    [&]() { if (!q92.empty() && !q93.full()) { q93.push(rho_pi(q92.pop())); }},
    [&]() { if (!q93.empty() && !q94.full()) { q94.push(chi(q93.pop())); }},
    [&]() { if (!q94.empty() && out_cipher) { out_cipher = iota(q94.pop()); }}
};

[221] With the communication and code redirecting data structures established, the shuffling and scheduling in a single core scenario becomes trivial, e.g.

// prepare a PRNG
auto rng = std::default_random_engine{};
// hash the inputs
while(inputs) {
    // mix up the Keccak-f subroutines
        std::shuffle(std::begin(keccak_subroutines), std::end(keccak_subroutines), rng);
        // attempt to execute the subroutines
        for (const auto& f : keccak_subroutines) { f(); }
    }

[222] A multi-core and/or multi-processor implementation could use any scheduling approach, but perhaps the simplest is to use the "Grand Central Dispatch" model. In this model, a single subroutine data structure could be wrapped by concurrency primitives that feed a pool of executor threads which randomly jostle to execute the next subroutine. The double-buffer arrangement described above may also be used to implement the random execution of the different tasks of the Keccak-f algorithm. [223] Thus, in summary, methods and apparatus for enhancing security when executing a computer algorithm comprising separately executable tasks, each of which produces an electrical signal when executed, are described above. Figure 20 is a flowchart illustrating the key steps in the method (or carried out by the processor). As shown in a first step S2000, at least part of the computer algorithm is stored as at least one pipeline, for example in memory. The at least one pipeline comprises a plurality of separately executable tasks. A plurality of inputs to be processed by the computer algorithm are received at step S2002. [224] Figure 20 shows two routes for achieving randomisation. Although these are shown as separate options, they can be combined as described above. In the first route, the plurality of separately executable tasks are randomised at a pipeline level. Typically this is used for non-commutative tasks which must be carried out in a specific order. As shown, the first step is to randomise the order in which tasks are presented for execution (step S2010), for example using the DBRQ buffer or circular buffers with pointers as described above. Once the order is randomised, there is an attempt to execute each task once from the randomised order (step S2012).
For non-commutative tasks, there is an additional step of checking whether the preconditions are met and only executing the task if they are (step S2014). [225] In the second route, the plurality of separately executable tasks are randomised at a task level. Each task which is randomised comprises multiple operations. These operations may be commutative operations and may thus be executed in any order. Accordingly, the first step may be randomising an order of execution of the set of operations within the at least one task (step S2020), for example using loop index shuffling as described above. For commutative operations, there is no need to check whether an operation can be executed, because the operations may be carried out in any order. Accordingly, as shown in step S2022, the operations may be executed in the order in which they appear in the randomised order and hence the task is executed. [226] Whether there is randomisation at task level or pipeline level or both, the next step is to check whether all the inputs which were received in step S2002 have been processed. In other words, there is a check to see whether all the outputs have been generated (step S2030). For the encryption example described above, this may involve checking whether the input plain text has been processed by the algorithm into a cipher text. If the outputs are not ready, as indicated by the arrows, the method loops back to the randomising steps. The randomising and executing steps (including attempting where required) are then repeated until the computer algorithm has processed the plurality of inputs and the
algorithm can end. As described above, at each repetition, the electrical signals produced when executing the plurality of separately executable tasks are randomised to enhance security. [227] Those skilled in the art will appreciate that while the foregoing has described what is considered to be the best mode and, where appropriate, other modes of performing the present techniques, the present techniques should not be limited to the specific configurations and methods disclosed in this description of the preferred embodiment. Those skilled in the art will recognise that the present techniques have a broad range of applications, and that the embodiments may be modified in a wide range of ways without departing from any inventive concept as defined in the appended claims.
Claims
1. Apparatus for enhancing security when executing a computer algorithm comprising separately executable tasks, each of which produces an electrical signal when executed, the apparatus comprising: memory in which at least part of the computer algorithm is stored as at least one pipeline, wherein the at least one pipeline comprises a plurality of separately executable tasks; and a processor which is configured to: receive a plurality of inputs to be processed by the computer algorithm; randomise the plurality of separately executable tasks at a pipeline level and/or at a task level; execute the randomised plurality of separately executable tasks; and repeat the randomising and executing steps until the computer algorithm has processed the plurality of inputs, whereby, at each repetition, the electrical signals produced when executing the plurality of separately executable tasks are randomised to enhance security.
2. The apparatus of claim 1, wherein the at least one pipeline comprises a plurality of non-commutative separately executable tasks and wherein randomising the plurality of separately executable tasks at a pipeline level comprises randomising the order in which the plurality of non-commutative separately executable tasks are presented for execution to the processor.
3. The apparatus of claim 2, wherein executing the randomised plurality of separately executable tasks comprises attempting to execute each of the plurality of non-commutative separately executable tasks once in turn from the randomised order by: determining whether preconditions for execution are met; and, when the preconditions are met, executing the task and storing an output of the executed task.
4. The apparatus of claim 3, wherein determining whether preconditions for execution are met comprises: determining whether there is an intermediate input associated with the task; determining whether there is an intermediate output associated with the task; and, when it is determined that there is an intermediate input and that there is no intermediate output, determining that the preconditions are met.
5. The apparatus of any one of claims 2 to 4, further comprising a wait-free queue for storing each output of the executed tasks as an intermediate output.
6. The apparatus of any one of claims 2 to 5, wherein the processor is configured to execute each task asynchronously.
7. The apparatus of any one of claims 2 to 6, further comprising: at least one circular buffer for storing a plurality of pointers with each pointer being associated with one of the plurality of non-commutative separately executable tasks and a plurality of queues connected to the at least one circular buffer with each queue storing the output from each of the plurality of non-commutative separately executable tasks when it is executed.
8. The apparatus of claim 7, wherein the processor is configured to randomise the order in which the plurality of non-commutative separately executable tasks are presented for execution to the processor by randomising the order of the plurality of pointers within the at least one circular buffer.
9. The apparatus of claim 8, wherein the processor is configured to attempt to execute each of the plurality of separately executable tasks from the randomised order by: retrieving a pointer at the head of the circular buffer; attempting to execute the task associated with the retrieved pointer; returning the pointer to the circular buffer; and repeating the retrieving, attempting and returning steps until all pointers within the circular buffer have been attempted.
10. The apparatus of any one of claims 2 to 6, further comprising at least one buffer comprising a first queue and a second queue.
11. The apparatus of claim 10, wherein the processor is configured to randomise the order in which the plurality of non-commutative separately executable tasks are presented for execution to the processor by: generating a plurality of pointers, each of which is associated with one of the plurality of non-commutative separately executable tasks; storing the plurality of pointers in temporal order in the first queue; randomising the order of the plurality of pointers; and transferring the randomised plurality of pointers to the second queue from which each pointer is retrievable to attempt to execute the associated task.
12. The apparatus of claim 11, wherein the processor is configured to attempt to execute each of the plurality of non-commutative separately executable tasks from the randomised order by: retrieving a pointer at the head of the second queue; attempting to execute the task associated with the retrieved pointer; returning the pointer to the first queue; and repeating the retrieving, attempting and returning steps until the second queue is empty.
13. The apparatus of claim 12, wherein the processor is configured to repeat the randomising step by: determining whether the second queue is empty, and when it is determined that the second queue is empty, randomising the order of the plurality of pointers; and transferring the randomised plurality of pointers in the first queue to the second queue from which each pointer is retrievable to execute the associated task.
14. The apparatus of any one of the preceding claims, wherein the processor comprises a plurality of processing cores for executing the plurality of separately executable tasks.
15. The apparatus of claim 14, comprising a first buffer having a first temporal queue and a second temporal queue; a second buffer having a first topological queue and a second topological queue; and a scheduler which is configured to cooperate with the processor to randomise the order in which the plurality of non-commutative separately executable tasks are presented for execution both temporally and topologically.
16. The apparatus of any one of the preceding claims, wherein at least one task of the plurality of separately executable tasks comprises a set of commutative operations and wherein randomising the plurality of separately executable tasks at a task level comprises randomising an order of execution of the set of commutative operations within the at least one task.
17. The apparatus of claim 16, wherein the commutative operations have a set of indices and randomising the order of execution of commutative operations comprises: randomising the set of indices for the commutative operations before execution of the selected task.
18. The apparatus of claim 16 or claim 17, wherein the processor is configured to: identify each repeat of the at least one task for which the order of execution of commutative operations is to be randomised, and randomise the set of indices for the commutative operations before execution of at least some repeats of the task.
19. The apparatus of any one of claims 15 to 18, wherein the processor comprises a plurality of processing cores for executing the plurality of separately executable tasks and wherein the processor is configured to randomise locations of execution of the commutative operations of the at least one task.
20. The apparatus of any one of the preceding claims, wherein the computer algorithm is an encryption algorithm encrypting plaintext or decrypting to plaintext.
21. The apparatus of any one of the preceding claims, comprising a random number generator to assist the processor when randomising the plurality of executable tasks at a pipeline or task level.
22. A method for enhancing security when executing a computer algorithm comprising separately executable tasks, each of which produces an electrical signal when executed, the method comprising: storing at least part of the computer algorithm as at least one pipeline comprising a plurality of separately executable tasks; receiving a plurality of inputs to be processed by the computer algorithm; randomising the plurality of separately executable tasks at a pipeline level and/or at a task level; executing the randomised plurality of separately executable tasks; and repeating the randomising and executing steps until the computer algorithm has processed the plurality of inputs, whereby, at each repetition, the electrical signals produced when executing the plurality of separately executable tasks are randomised to enhance security.
23. The method of claim 22, wherein the at least one pipeline comprises a plurality of non-commutative separately executable tasks; wherein randomising the plurality of separately executable tasks at a pipeline level comprises randomising the order in which the plurality of non-commutative separately executable tasks are presented for execution; and wherein executing the randomised plurality of separately executable tasks comprises attempting to execute each of the plurality of non-commutative separately executable tasks from the randomised order once by: determining whether preconditions for execution are met; and, when the preconditions are met, executing the task and storing an output of the executed task.
24. The method of claim 22 or claim 23, wherein at least one task of the plurality of separately executable tasks comprises a set of commutative operations and wherein randomising the plurality of separately executable tasks at a task level comprises randomising an order of execution of the set of commutative operations within the at least one task.
25. A method for encrypting plaintext comprising receiving at least one input plaintext; and executing the computer algorithm as set out in any one of claims 22 to 24, wherein the computer algorithm is an encryption algorithm.
26. A method for decrypting cyphertext comprising receiving at least one input cyphertext; and executing the computer algorithm as set out in any one of claims 22 to 24, wherein the computer algorithm is a decryption algorithm.
27. Processor control code which when executed on apparatus causes the apparatus to carry out the method of any one of claims 22 to 26.
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GBGB2012352.7A GB202012352D0 (en) | 2020-08-07 | 2020-08-07 | Method and apparatus for reducing the risk of successful side channel attacks |
| GB2012352.7 | 2020-08-07 | ||
| GB2105109.9 | 2021-04-09 | ||
| GBGB2105109.9A GB202105109D0 (en) | 2021-04-09 | 2021-04-09 | Method and apparatus for reducing the risk of successful side channel and fault injection attacks |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2022029443A1 true WO2022029443A1 (en) | 2022-02-10 |
Family
ID=77358300
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/GB2021/052034 Ceased WO2022029443A1 (en) | 2020-08-07 | 2021-08-05 | Method and apparatus for reducing the risk of successful side channel and fault injection attacks |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2022029443A1 (en) |
Citations (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8143A (en) | 1851-06-10 | dtjtcher | ||
| US20060288239A1 (en) | 2003-04-22 | 2006-12-21 | Francesco Pessolano | Electronic circuit device for cryptographic applications |
| US20080019507A1 (en) | 2006-06-29 | 2008-01-24 | Incard S.A. | Method for Protecting IC Cards Against Power Analysis Attacks |
| US20100166177A1 (en) | 2008-12-31 | 2010-07-01 | Incard S.A. | Method for protecting a cryptographic device against spa, dpa and time attacks |
| GB2494731A (en) | 2011-09-06 | 2013-03-20 | Nds Ltd | Dummy and secret control signals for a circuit |
| US20140013425A1 (en) | 2012-07-03 | 2014-01-09 | Honeywell International Inc. | Method and apparatus for differential power analysis protection |
| GB2524335A (en) | 2014-03-22 | 2015-09-23 | Primary Key Associates Ltd | Methods and apparatus for resisting side channel attack |
| US20160171252A1 (en) | 2014-12-16 | 2016-06-16 | Cryptography Research, Inc | Buffer access for side-channel attack resistance |
| US20170099134A1 (en) | 1998-01-02 | 2017-04-06 | Cryptography Research, Inc. | Differential power analysis-resistant cryptographic processing |
| EP3624390A1 (en) * | 2018-09-17 | 2020-03-18 | Secure-IC SAS | Devices and methods for protecting cryptographic programs |
Non-Patent Citations (16)
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114328001A (en) * | 2022-03-11 | 2022-04-12 | 紫光同芯微电子有限公司 | Method and device for detecting fault injection attack on RAM and storage medium |
| EP4386588A1 (en) * | 2022-12-14 | 2024-06-19 | Thales Dis France Sas | Method to secure a software code against attacks requiring a knowledge of the location of data in the memory stack |
| WO2024126305A1 (en) * | 2022-12-14 | 2024-06-20 | Thales Dis France Sas | Method to secure a software code against attacks requiring a knowledge of the location of data in the memory stack |
| WO2024253588A1 (en) * | 2023-06-09 | 2024-12-12 | National University Of Singapore | Method and system for counteracting side channel attacks |
| WO2025005925A1 (en) * | 2023-06-30 | 2025-01-02 | Pqsecure Technologies, Llc | Hardware-based and software-based method for enhancing resistance against side-channel attacks in a cryposystem |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21755549; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 21755549; Country of ref document: EP; Kind code of ref document: A1 |