Disclosure of Invention
The embodiment of the application provides an NTT hardware implementation method, which can solve the problems of high data transmission overhead and long transmission time when the NTT hardware is implemented.
The embodiment of the application provides an NTT hardware implementation method which is applied to an NTT hardware accelerator, wherein the NTT hardware accelerator is coupled to an LSU of a processor, the NTT hardware accelerator comprises an address generation module and a butterfly operation module, and the NTT hardware implementation method comprises the following steps:
the address generation module generates addresses of two input coefficients, addresses of two output results and an address of a twiddle factor;
the butterfly operation module reads the two input coefficients and the twiddle factor from the main memory through the LSU at one time based on the addresses of the two input coefficients and the address of the twiddle factor, performs butterfly operation on the two input coefficients and the twiddle factor, and writes butterfly operation results into the main memory through the LSU based on the addresses of the two output results.
Optionally, before the step of generating the addresses of the two input coefficients, the addresses of the two output results, and the address of one twiddle factor by the address generating module, the NTT hardware implementation method further includes:
The order n and the modulus q of the NTT hardware accelerator are configured through a custom instruction.
Optionally, before the step of generating the addresses of the two input coefficients, the addresses of the two output results, and the address of one twiddle factor by the address generating module, the NTT hardware implementation method further includes:
and transmitting the parameters of the source register of the processor to the NTT hardware accelerator for configuration, wherein if the parameters of the source register are 1, the NTT operation is started.
Optionally, the NTT hardware implementation method further includes:
If the source register parameter is 0, then INTT operations are performed.
Optionally, the address of each of the two input coefficients and the address of the twiddle factor are each 32 bits long.
Optionally, after the step of generating the addresses of the two input coefficients, the addresses of the two output results, and the address of one twiddle factor by the address generating module, the NTT hardware implementation method further includes:
The address generation module splices the addresses of two 32-bit input coefficients and the address of one 32-bit twiddle factor into a 96-bit address which is sent to a main memory read address port.
Optionally, the NTT hardware implementation method further includes:
When an NTT operation is performed, the ports of the main memory and the processor are connected to the interaction interface of the NTT hardware accelerator, and a 96-bit data path is enabled.
The scheme of the application has the following beneficial effects:
In the embodiment of the application, the NTT hardware accelerator is coupled into the LSU pipeline of the processor, so that it is coupled to the memory-access stage, and the data interface of the NTT hardware accelerator is directly connected to the interaction interface between the processor and the main memory. This reduces the communication overhead of moving data through the processor pipeline and shortens the data transmission time.
Other advantageous effects of the present application will be described in detail in the detailed description section which follows.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in the present description and the appended claims, the term "if" may be interpreted as "when" or "once" or "in response to determining" or "in response to detecting", depending on the context. Similarly, the phrase "if it is determined" or "if a [described condition or event] is detected" may be interpreted, depending on the context, as meaning "upon determining" or "in response to determining" or "upon detecting the [described condition or event]" or "in response to detecting the [described condition or event]".
Furthermore, the terms "first," "second," "third," and the like in the description of the present specification and in the appended claims, are used for distinguishing between descriptions and not necessarily for indicating or implying a relative importance.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
To address the high data transmission overhead and long transmission time of current NTT hardware implementations, the embodiment of the application provides an NTT hardware implementation method. The NTT hardware accelerator is coupled into the LSU pipeline of the processor, so that it is coupled to the memory-access stage, and the data interface of the NTT hardware accelerator is directly connected to the interaction interface between the processor and the main memory. This reduces the communication overhead of moving data through the processor pipeline and shortens the data transmission time.
The method for implementing NTT hardware provided by the present application is described below with reference to specific embodiments.
The NTT hardware implementation method provided by the application is applied to an NTT hardware accelerator. As shown in fig. 1, the NTT hardware accelerator is coupled to a load-store unit (LSU) of a processor, and the NTT hardware accelerator comprises an address generation module and a butterfly operation module. As shown in fig. 2, the NTT hardware implementation method provided by the present application includes the following steps:
In step 21, the address generating module generates addresses of two input coefficients, addresses of two output results, and an address of one twiddle factor.
In step 22, the butterfly operation module reads the two input coefficients and the twiddle factor from the main memory through the LSU at one time based on the addresses of the two input coefficients and the address of the twiddle factor, performs a butterfly operation on the two input coefficients and the twiddle factor, and writes the butterfly operation results into the main memory through the LSU based on the addresses of the two output results.
The address generation module is mainly used for generating the addresses for accessing the main memory (RAM). Its five output ports carry, respectively, the addresses of the two input coefficients, the addresses of the two output results and the address of the twiddle factor. As an alternative example, the address generation module may be implemented by a controller, and the processor may be the RI5CY processor, which is suitable for low-power application scenarios.
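The following is a minimal behavioural sketch of what such an address generation module could produce. The text does not specify the traversal order, so the sketch assumes an in-place radix-2 decimation-in-time schedule in which the output addresses equal the input addresses and the twiddle table stores the powers of the root of unity in order. The function name, the base addresses and the 4-byte word size are hypothetical.

```python
from typing import Iterator, Tuple

def generate_addresses(n: int, coeff_base: int, twiddle_base: int,
                       word_bytes: int = 4) -> Iterator[Tuple[int, int, int, int, int]]:
    """Yield (in0, in1, out0, out1, twiddle) byte addresses, one tuple per butterfly.

    Assumptions (not stated in the text): n is a power of two, the transform
    is computed in place, and twiddle_base[i] holds w**i for i < n/2.
    """
    length = 1
    while length < n:
        stride = n // (2 * length)               # step through the twiddle table
        for start in range(0, n, 2 * length):
            for j in range(length):
                in0 = coeff_base + (start + j) * word_bytes
                in1 = in0 + length * word_bytes
                tw = twiddle_base + j * stride * word_bytes
                # In-place schedule: results overwrite the two inputs.
                yield in0, in1, in0, in1, tw
        length *= 2

# Example: a 4-point transform with coefficients at byte 0 and twiddles at byte 16.
for addresses in generate_addresses(4, coeff_base=0, twiddle_base=16):
    print(addresses)
```

Each tuple corresponds to the five output ports described above. A different butterfly variant or memory layout would change the pattern but not the five-address interface.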
The butterfly operation module comprises three input ports and two output ports. The input ports receive the two input coefficients and the twiddle factor read from the RAM, and the output ports carry the two butterfly operation results. The butterfly operation module is mainly used for performing the butterfly operation on the two input coefficients and the twiddle factor, and the two butterfly operation results are written into the main memory through the LSU according to the addresses of the two output results.
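As an illustration of the three-input, two-output structure, the sketch below models one butterfly operation in software. The text does not state which butterfly variant is used, so a Cooley-Tukey form is assumed here, and the function name is hypothetical.

```python
from typing import Tuple

def butterfly(a: int, b: int, w: int, q: int) -> Tuple[int, int]:
    """Assumed Cooley-Tukey butterfly: returns (a + w*b mod q, a - w*b mod q).

    a, b are the two input coefficients, w is the twiddle factor and q is
    the configured modulus.
    """
    t = (w * b) % q
    return (a + t) % q, (a - t) % q

# Example with small illustrative values: w*b = 6, so the results are 7 and 12.
print(butterfly(1, 2, 3, 17))  # prints (7, 12)
```

In hardware the modular reduction would typically use a dedicated reduction circuit rather than a general `%` operator; that choice is outside what the text describes.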
In some embodiments of the application, the order n and the modulus q of the NTT hardware accelerator may be configured by instructions before the NTT operation is started. Specifically, the order n and the modulus q can be configured through custom instructions. This makes the design considerably more flexible: the scheme of the application can support a plurality of different post-quantum cryptography (PQC) algorithms and polynomial operations at different security levels, thereby adapting to diversified encryption requirements.
In practical application, the NTT hardware accelerator may be configured according to the algorithm in use, and the instructions for different configuration parameters carry different parameter information. For example, for the instruction that configures the modulus q, the value of q held in the source register rs1 of the processor is transmitted to the NTT hardware accelerator for configuration; for the instruction that configures the order n, the value of n held in the source register rs1 of the processor is transmitted to the NTT hardware accelerator for configuration.
In some embodiments of the application, the NTT operation may be started by transmitting the parameter of the source register rs1 of the processor to the NTT hardware accelerator. Specifically, in the same way as the modulus q is configured, the parameter of the source register is transmitted to the NTT hardware accelerator: if the parameter is 1, the NTT operation is started, and if the parameter is 0, the INTT operation is performed. That is, the value of rs1 is 1 for NTT and 0 otherwise. After the NTT operation is started, instruction processing in the earlier stages of the pipeline is suspended until the NTT operation ends.
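The configuration and start sequence described above can be summarised with a small behavioural model. This is only a sketch: the class and method names are hypothetical, the values 256 and 3329 are illustrative, and the handling of rs1 values other than 0 and 1 is an assumption, since the text does not define it.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NttAcceleratorConfig:
    """Hypothetical model of the accelerator's configuration state."""
    n: Optional[int] = None      # order, written by the "configure n" instruction
    q: Optional[int] = None      # modulus, written by the "configure q" instruction
    mode: Optional[str] = None   # "NTT" or "INTT" once started

    def configure_order(self, rs1_value: int) -> None:
        self.n = rs1_value

    def configure_modulus(self, rs1_value: int) -> None:
        self.q = rs1_value

    def start(self, rs1_value: int) -> str:
        # Assumption: starting without a prior configuration is an error.
        if self.n is None or self.q is None:
            raise RuntimeError("order n and modulus q must be configured first")
        if rs1_value == 1:
            self.mode = "NTT"
        elif rs1_value == 0:
            self.mode = "INTT"
        else:
            # Assumption: the text only defines the values 0 and 1.
            raise ValueError("rs1 must be 0 (INTT) or 1 (NTT)")
        return self.mode

cfg = NttAcceleratorConfig()
cfg.configure_order(256)      # illustrative value
cfg.configure_modulus(3329)   # illustrative value
print(cfg.start(1))           # prints NTT
```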
It should be noted that the input data interface of the NTT hardware accelerator is directly connected to the main memory interface, and that the designed control logic module (i.e., the address generation module) controls the data input, the output results of the butterfly operation and the memory addresses, thereby ensuring the correctness of the data flow. When NTT/INTT is performed (INTT being the inverse operation of NTT), the address of each of the two input coefficients and the address of the twiddle factor may each be 32 bits long. After generating these addresses, the address generation module may splice the two 32-bit input coefficient addresses and the 32-bit twiddle factor address into one 96-bit address, which is sent to the read address port of the main memory (RAM). In the next cycle, the RAM sends the three read data items directly to the butterfly operation module through the storage interface. Because the butterfly operation module is purely combinational logic, the butterfly operation is performed as soon as the data are input, yielding two output results, which are then written back to the RAM at the write addresses in the following cycle.
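The address splicing step can be illustrated as follows. The text states only that three 32-bit addresses are concatenated into one 96-bit address; the ordering of the three fields inside the 96-bit word is an assumption of this sketch, and the function names are hypothetical.

```python
from typing import Tuple

MASK32 = 0xFFFFFFFF

def splice_addresses(addr_in0: int, addr_in1: int, addr_twiddle: int) -> int:
    """Concatenate three 32-bit addresses into one 96-bit read address.

    Assumed layout: bits 31..0 = first coefficient, bits 63..32 = second
    coefficient, bits 95..64 = twiddle factor.
    """
    for addr in (addr_in0, addr_in1, addr_twiddle):
        if not 0 <= addr <= MASK32:
            raise ValueError("each address must fit in 32 bits")
    return (addr_twiddle << 64) | (addr_in1 << 32) | addr_in0

def split_addresses(wide: int) -> Tuple[int, int, int]:
    """Recover the three 32-bit addresses from the 96-bit word."""
    return wide & MASK32, (wide >> 32) & MASK32, (wide >> 64) & MASK32

wide = splice_addresses(0x1000, 0x1004, 0x2000)
print(hex(wide))
print([hex(a) for a in split_addresses(wide)])
```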
Notably, the read-write data port of the present application is dynamically configurable. When an NTT operation is performed, the ports of the main memory and the processor are connected to the interaction interface of the NTT hardware accelerator, and a 96-bit data path is enabled. In practical application, when an NTT/INTT operation is executed, the data path can be 96 bits and 64 bits wide, so that up to three 32-bit data items in the RAM can be accessed at one time. This significantly improves data processing throughput and allows the NTT hardware accelerator to process large amounts of data more quickly. For most other instructions, the original 32-bit data width is kept. This dynamic configuration capability enables the NTT hardware accelerator to flexibly accommodate different data processing requirements while maintaining high efficiency and low power consumption.
In practical application, after the NTT operation is started, the NTT hardware accelerator begins to access the main memory. The access address is 96 bits wide and can address three data items, namely two input coefficients and one twiddle factor. In the next cycle, the three data items are transmitted to the NTT hardware accelerator, which immediately performs the calculation and outputs the results. In the following cycle, the two output results and their two addresses are presented to the main memory, and the results are written back to the corresponding locations. These steps are repeated for the subsequent data to be processed (i.e., the next two input coefficients and twiddle factor).
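The read, compute and write-back sequence described above can be sketched as a simple software model of the RAM and the accelerator. The model assumes a word-addressed RAM held in a Python list, a Cooley-Tukey butterfly, an in-place schedule supplied by the caller, and input coefficients stored in bit-reversed order; none of these details are fixed by the text, and the function name is hypothetical.

```python
from typing import Iterable, List, Tuple

def run_schedule(ram: List[int],
                 schedule: Iterable[Tuple[int, int, int]],
                 q: int) -> List[int]:
    """Process butterflies given as (addr_in0, addr_in1, addr_twiddle) word addresses."""
    for addr_a, addr_b, addr_w in schedule:
        # Read phase: one wide access returns both coefficients and the twiddle.
        a, b, w = ram[addr_a], ram[addr_b], ram[addr_w]
        # Combinational butterfly (assumed Cooley-Tukey form).
        t = (w * b) % q
        out0, out1 = (a + t) % q, (a - t) % q
        # Write-back phase: results overwrite the inputs (in-place assumption).
        ram[addr_a], ram[addr_b] = out0, out1
    return ram

# 4-point example modulo 17 with root of unity 4 (4**4 % 17 == 1).
# Words 0..3: coefficients [1, 2, 3, 4] in bit-reversed order; words 4..5: 4**0, 4**1.
ram = [1, 3, 2, 4, 1, 4]
schedule = [(0, 1, 4), (2, 3, 4), (0, 2, 4), (1, 3, 5)]
print(run_schedule(ram, schedule, q=17)[:4])  # prints [10, 7, 15, 6]
```

The printed values agree with directly evaluating the sum of a[i] * 4**(i*k) modulo 17 for k = 0..3, which is a useful cross-check when experimenting with other schedules.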
Therefore, by tightly integrating the NTT hardware accelerator into the LSU and by providing dynamic data path configuration and flexible parameter configuration, the application achieves high performance, high throughput and high flexibility, and provides an efficient and extensible platform for the hardware implementation of post-quantum cryptography algorithms. Compared with other design schemes, it also significantly reduces the overhead of transferring data to registers and avoids the drop in clock frequency caused by integration into the decode stage, so that the data flow is optimized while the performance of the processor is maintained.
Specifically, the application skillfully fuses the expansion of RISC-V instruction set, the adjustment of processor architecture and the realization of NTT hardware accelerator, and forms a unified whole.
Firstly, an NTT hardware accelerator with an NTT/INTT start signal is designed and implemented according to the data access timing of the memory-access module of the processor. Structurally, the NTT hardware accelerator comprises three input ports and seven output ports, so as to perform efficient data access and interaction with the RAM.
Next, the instruction set is extended with new instructions for activating the NTT hardware accelerator and configuring its parameters. The instruction format is designed to avoid encodings already occupied in the original instruction set. For example, in the present case an R-type instruction is employed, and 0000011 is selected as the value of a funct field (the funct fields are part of the RISC-V instruction format), because this value is not used in the original instruction set. Meanwhile, a funct field is also used to distinguish the NTT start instruction from the parameter configuration instructions. After the instruction format is designed, the RISC-V tool chain needs to be recompiled, or inline assembly is used in combination with the word format, so that the extended instructions can be recognized and compiled. The intended function can be implemented as long as the tool chain can handle these extended instructions.
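To make the encoding concrete, the sketch below assembles a 32-bit R-type instruction word using the standard RISC-V R-type field layout. Several details are assumptions, because the text does not give them: the 7-bit value 0000011 is placed in funct7, funct3 is used to distinguish the start instruction from the two configuration instructions, the funct3 values 0, 1 and 2 are made up for illustration, and the opcode 0110011 (the standard register-register opcode) is chosen only as a placeholder. The final line shows how such a word could be emitted as raw data in assembly.

```python
FUNCT7_NTT = 0b0000011      # value quoted in the text, assumed to be funct7
OPCODE = 0b0110011          # placeholder opcode, not given in the text

# Hypothetical funct3 assignments for the three custom instructions.
FUNCT3_START = 0b000
FUNCT3_CONFIG_Q = 0b001
FUNCT3_CONFIG_N = 0b010

def encode_r_type(funct7: int, rs2: int, rs1: int, funct3: int,
                  rd: int, opcode: int) -> int:
    """Pack the fields of a RISC-V R-type instruction into a 32-bit word."""
    if not (0 <= funct7 < 128 and 0 <= opcode < 128 and 0 <= funct3 < 8):
        raise ValueError("funct7/opcode must fit in 7 bits, funct3 in 3 bits")
    if not all(0 <= r < 32 for r in (rs1, rs2, rd)):
        raise ValueError("register indices must be in 0..31")
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15)
            | (funct3 << 12) | (rd << 7) | opcode)

# Start instruction reading its parameter from x10 (a0); rs2 and rd unused here.
word = encode_r_type(FUNCT7_NTT, rs2=0, rs1=10, funct3=FUNCT3_START, rd=0, opcode=OPCODE)
print(f".word 0x{word:08x}")  # prints .word 0x06050033
```

Emitting the raw word avoids modifying the assembler, at the cost of readability; recompiling the tool chain with the new mnemonics is the alternative the text mentions.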
Finally, integrating the NTT hardware accelerator into the memory-access stage of the processor pipeline is also a crucial step. A multiplexer selects between the interaction signals of the NTT execution stage (such as the access address, the write-back data and the data input) and the signals of the original memory-access instructions. When an NTT operation is performed, the corresponding selection signal connects the ports of the RAM and the processor to the interaction interface of the NTT hardware accelerator and enables the 96-bit data path. For other instructions, the ports are connected back to the original LSU interface, and the 64 bits of the data path that are not needed are masked to save power.
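The multiplexing described above can be pictured with the following behavioural sketch. The signal grouping, the class and function names, and the choice to model masking as zeroing the upper 64 bits are all assumptions made for illustration; the text describes the behaviour only at the level of selecting between two sets of signals and masking the unused part of the data path.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PortSignals:
    """Hypothetical bundle of signals presented to the RAM port."""
    address: int
    write_data: int
    path_bits: int   # width of the active data path

def select_memory_port(ntt_active: bool, ntt: PortSignals, lsu: PortSignals) -> PortSignals:
    """2-to-1 multiplexer: NTT signals while an NTT operation runs, LSU signals otherwise."""
    return ntt if ntt_active else lsu

def mask_read_data(ntt_active: bool, read_data: int) -> int:
    """Keep all 96 bits for NTT accesses; keep only the original 32 bits otherwise."""
    if ntt_active:
        return read_data & ((1 << 96) - 1)
    return read_data & 0xFFFFFFFF

ntt_port = PortSignals(address=0x2000, write_data=0, path_bits=96)
lsu_port = PortSignals(address=0x0100, write_data=0, path_bits=32)
print(select_memory_port(True, ntt_port, lsu_port).path_bits)   # prints 96
print(select_memory_port(False, ntt_port, lsu_port).path_bits)  # prints 32
```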
Through the series of carefully designed steps, the NTT hardware implementation method not only improves the efficiency of NTT operation, but also maintains the flexibility and energy efficiency of a processor, and provides a high-efficiency and extensible solution for the hardware implementation of a post quantum cryptography algorithm.
While the foregoing is directed to the preferred embodiments of the present application, it will be appreciated by those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present application, and such modifications and adaptations are intended to be comprehended within the scope of the present application.