[go: up one dir, main page]

CN120144090A - A NTT hardware implementation method - Google Patents

A NTT hardware implementation method Download PDF

Info

Publication number
CN120144090A
CN120144090A CN202510204528.9A CN202510204528A CN120144090A CN 120144090 A CN120144090 A CN 120144090A CN 202510204528 A CN202510204528 A CN 202510204528A CN 120144090 A CN120144090 A CN 120144090A
Authority
CN
China
Prior art keywords
ntt
addresses
hardware
address
input coefficients
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510204528.9A
Other languages
Chinese (zh)
Inventor
蔡述庭
谢子龙
赵尹侨
陈聪玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202510204528.9A priority Critical patent/CN120144090A/en
Publication of CN120144090A publication Critical patent/CN120144090A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/57Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
    • G06F7/575Basic arithmetic logic units, i.e. devices selectable to perform either addition, subtraction or one of several logical operations, using, at least partially, the same circuitry

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Advance Control (AREA)

Abstract

本申请适用于信息通讯技术领域,提供了一种NTT硬件实现方法,应用于NTT硬件加速器,NTT硬件加速器耦合于处理器的LSU中,NTT硬件加速器包括地址生成模块和蝶式运算模块,NTT硬件实现方法包括:地址生成模块生成两个输入系数的地址、两个输出结果的地址以及一个旋转因子的地址;蝶式运算模块基于两个输入系数的地址和一个旋转因子的地址,通过LSU一次性从主存读取两个输入系数和旋转因子,并对两个输入系数和旋转因子进行蝶式运算,基于两个输出结果的地址将蝶式运算结果通过LSU写入主存。本申请能减小NTT硬件实现时数据的传输开销,同时缩短数据的传输时间。

The present application is applicable to the field of information and communication technology, and provides an NTT hardware implementation method, which is applied to an NTT hardware accelerator, wherein the NTT hardware accelerator is coupled to the LSU of the processor, and the NTT hardware accelerator includes an address generation module and a butterfly operation module, and the NTT hardware implementation method includes: the address generation module generates the addresses of two input coefficients, the addresses of two output results, and the address of a rotation factor; the butterfly operation module reads the two input coefficients and the rotation factor from the main memory at one time through the LSU based on the addresses of the two input coefficients and the address of the rotation factor, and performs a butterfly operation on the two input coefficients and the rotation factor, and writes the butterfly operation result into the main memory through the LSU based on the addresses of the two output results. The present application can reduce the data transmission overhead when the NTT hardware is implemented, and shorten the data transmission time.

Description

NTT hardware implementation method
Technical Field
The application belongs to the technical field of information communication, and particularly relates to an NTT hardware implementation method.
Background
With the rapid development of quantum algorithms and quantum computing technologies, conventional public key encryption algorithms are facing unprecedented challenges. To address these challenges, post quantum cryptography algorithms have evolved, with hardware integrated processor implementations of the core operations-number theory transformation (NTT, numberTheoretic Transform) for these algorithms.
The current NTT coupling scheme has a close coupling to the decoding stage and the execution stage. The close coupling to the decode stage tends to cause critical paths to appear in the decode stage that affect the main frequency of the processor, while the coupled NTT structure also requires registering and pipelining after the data processing is completed, i.e., additional transmission overhead remains unavoidable. Similarly, coupling to the execution phase also requires the beat of data to be registered, and its transmission overhead is unavoidable. Moreover, the two above-mentioned two have a common disadvantage that the input data to be processed needs to be registered in a register to process the data, and after the data is processed, the result is transferred to the memory module again for writing back into the main memory, which requires two additional cycles, and in some cases, the overall processing performance of the NTT processor is affected.
Disclosure of Invention
The embodiment of the application provides an NTT hardware implementation method, which can solve the problems of high data transmission overhead and long transmission time when the NTT hardware is implemented.
The embodiment of the application provides an NTT hardware implementation method which is applied to an NTT hardware accelerator, wherein the NTT hardware accelerator is coupled to an LSU of a processor, the NTT hardware accelerator comprises an address generation module and a butterfly operation module, and the NTT hardware implementation method comprises the following steps:
the address generation module generates addresses of two input coefficients, addresses of two output results and an address of a twiddle factor;
the butterfly operation module reads the two input coefficients and the twiddle factor from the main memory through the LSU at one time based on the addresses of the two input coefficients and the address of the twiddle factor, performs butterfly operation on the two input coefficients and the twiddle factor, and writes butterfly operation results into the main memory through the LSU based on the addresses of the two output results.
Optionally, before the step of generating the addresses of the two input coefficients, the addresses of the two output results, and the address of one twiddle factor by the address generating module, the NTT hardware implementation method further includes:
The order n and the modulus q of the NTT hardware accelerator are configured through a custom instruction.
Optionally, before the step of generating the addresses of the two input coefficients, the addresses of the two output results, and the address of one twiddle factor by the address generating module, the NTT hardware implementation method further includes:
and transmitting the parameters of the source register of the processor to the NTT hardware accelerator for configuration, wherein if the parameters of the source register are 1, the NTT operation is started.
Optionally, the NTT hardware implementation method further includes:
If the source register parameter is 0, then INTT operations are performed.
Alternatively, the length of the addresses of the two input coefficients and the length of the address of one twiddle factor are 32 bits.
Optionally, after the step of generating the addresses of the two input coefficients, the addresses of the two output results, and the address of one twiddle factor by the address generating module, the NTT hardware implementation method further includes:
The address generation module splices the addresses of two 32-bit input coefficients and the address of one 32-bit twiddle factor into a 96-bit address which is sent to a main memory read address port.
Optionally, the NTT hardware implementation method further includes:
In performing NTT operations, the port of the host and processor is connected to the interactive interface of the NTT hardware accelerator and enables a 96-bit data path.
The scheme of the application has the following beneficial effects:
in the embodiment of the application, the NTT hardware accelerator is coupled to the access phase by being coupled to the LSU pipeline of the processor, and the data interface of the NTT hardware accelerator is directly connected to the interaction interface between the processor and the main memory, so that the transmission communication overhead of data in the processor pipeline is reduced, and the time of data transmission is shortened.
Other advantageous effects of the present application will be described in detail in the detailed description section which follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments or the description of the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an NTT hardware accelerator according to an embodiment of the present application;
fig. 2 is a flowchart of an NTT hardware implementation method according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in the present description and the appended claims, the term "if" may be interpreted as "when..once" or "in response to a determination" or "in response to detection" depending on the context. Similarly, the phrase "if a determination" or "if a [ described condition or event ] is detected" may be interpreted in the context of meaning "upon determination" or "in response to determination" or "upon detection of a [ described condition or event ]" or "in response to detection of a [ described condition or event ]".
Furthermore, the terms "first," "second," "third," and the like in the description of the present specification and in the appended claims, are used for distinguishing between descriptions and not necessarily for indicating or implying a relative importance.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
Aiming at the problems of large data transmission cost and long transmission time when the current NTT hardware is realized, the embodiment of the application provides a NTT hardware realization method, which realizes that an NTT hardware accelerator is coupled to a memory stage by coupling the NTT hardware accelerator in an LSU pipeline of a processor, and a data interface of the NTT hardware accelerator is directly connected to an interaction interface of the processor and a main memory, thereby reducing the transmission communication cost of data in the processor pipeline and shortening the time of data transmission.
The method for implementing NTT hardware provided by the present application is described below with reference to specific embodiments.
The method for realizing NTT hardware provided by the application is applied to an NTT hardware accelerator, as shown in figure 1, wherein the NTT hardware accelerator is coupled to a Loading Storage Unit (LSU) of a processor, and the NTT hardware accelerator comprises an address generation module and a butterfly operation module. As shown in fig. 2, the method for implementing NTT hardware provided by the present application includes the following steps:
In step 21, the address generating module generates addresses of two input coefficients, addresses of two output results, and an address of one twiddle factor.
In step 22, the butterfly operation module reads the two input coefficients and the twiddle factor from the main memory through the LSU at one time based on the addresses of the two input coefficients and the addresses of the twiddle factor, performs butterfly operation on the two input coefficients and the twiddle factor, and writes the butterfly operation result into the main memory through the LSU based on the addresses of the two output results.
The address generation module is mainly used for generating addresses for accessing a main memory (RAM), and five output ports of the address generation module are respectively addresses of two input coefficients, addresses of two output results and addresses of one twiddle factor. As an alternative example, the address generation module may be implemented by a controller, and the processor may be a processor RI5CY suitable for a low power consumption application scenario.
The butterfly operation module comprises three input ports and two output ports, wherein the input ports receive two input coefficients and twiddle factors read from the RAM, the two output data are two butterfly operation results, the butterfly operation module is mainly used for performing butterfly operation on the two input coefficients and the twiddle factors, and the two butterfly operation results are written into a main memory through an LSU according to addresses of the two output results.
In some embodiments of the application, the order n and modulus q of the NTT hardware accelerator may be configured by instructions prior to starting the NTT operation. Specifically, the order n and the modulus q of the NTT hardware accelerator can be configured through a custom instruction, so that the flexibility of design is greatly enhanced, and the design scheme of the application can support a plurality of different post quantum cryptography algorithms (PQCs) and polynomial operations with different security levels, thereby adapting to diversified encryption requirements.
In practical application, the NTT hardware accelerator may be configured correspondingly according to different algorithms, and the instructions of different configuration parameters include different parameter information, such as a configuration module q, where the parameter q of the source register rs1 of the processor is transmitted to the NTT hardware accelerator for configuration, and the configuration order n, where the order n of the source register rs1 of the processor is transmitted to the NTT hardware accelerator for configuration.
In some embodiments of the application, the NTT operation may be initiated by transmitting parameters of the processor's source register rs1 to the NTT hardware accelerator for configuration. Specifically, as in the mode of configuring the modulus q, the parameters of the source register are transmitted to the NTT hardware accelerator for configuration, if the parameters of the source register are 1, the NTT operation is started, and if the parameters of the source register are 0, the operation is performed INTT. That is, if NTT is performed, the value of rs1 is 1, otherwise, 0. After the NTT operation is started, the former instruction operation of the pipeline is suspended until the NTT operation ends.
It should be noted that, the input data interface of the NTT hardware accelerator is directly connected to the main memory interface, and the correctness of the data stream is ensured by controlling the input of the data, the output result of the butterfly operation and the address of the memory through the designed control logic module (i.e. the address generating module). When the NTT/INTT (INTT is the inverse operation of NTT) is performed, the length of the addresses of the two input coefficients and the length of the address of one twiddle factor may be 32 bits, and after the address generating module generates the addresses of the two input coefficients and the address of one twiddle factor, the address generating module may splice the addresses of the two 32-bit input coefficients and the address of one 32-bit twiddle factor into a 96-bit address to be sent to a read address port of a main memory (RAM). And the RAM directly sends the three read data to the butterfly operation module for processing through the storage interface in the next period. Because the butterfly operation module is a pure combination logic, the butterfly operation is immediately carried out after the data is input to obtain two output results, and then the results are written back into the RAM according to the write address in the next period.
Notably, the read-write data port of the present application is dynamically configurable. In performing NTT operations, the port of the host and processor is connected to the interactive interface of the NTT hardware accelerator and enables a 96-bit data path. In practical application, when NTT/INTT operation is executed, the data path can be 96 bits and 64 bits, so that up to three 32 bits of data in RAM can be accessed at one time, the throughput of data processing is remarkably improved, the efficient data access mechanism enables an NTT hardware accelerator to process a large amount of data more rapidly and efficiently, and the original 32 bits of data bit width is kept under most other instruction operation conditions. This dynamic configuration capability enables the NTT hardware accelerator to flexibly accommodate different data processing requirements while maintaining high efficiency and low power consumption.
In practical application, after starting NTT operation, NTT hardware accelerator starts to access main memory, access address is 96 bits (bit) width, three bits of data can be addressed, two input coefficients and one twiddle factor are contained respectively, three data are transmitted to NTT hardware accelerator in next period to immediately execute calculation output result, two output results and addresses of two results are interacted with main memory in next period, and the result is written back to corresponding position. The subsequent data to be processed (i.e., two input coefficients and a twiddle factor) is repeated.
Therefore, the NTT hardware accelerator realizes the triple advantages of high performance, high throughput and high flexibility by tightly integrating the NTT hardware accelerator into the LSU, dynamic data path configuration and flexible parameter configuration, and provides an efficient and extensible platform for the hardware implementation of the post quantum cryptography algorithm. Meanwhile, compared with other design schemes, the method remarkably reduces the overhead of data transmission to registers, avoids the reduction of main frequency caused by integration to a decoding stage, and not only optimizes the data flow, but also maintains the performance of a processor.
Specifically, the application skillfully fuses the expansion of RISC-V instruction set, the adjustment of processor architecture and the realization of NTT hardware accelerator, and forms a unified whole.
Firstly, according to the data access time sequence of the processor access module, an NTT hardware accelerator with an NTT/INTT starting signal is designed and realized. The NTT hardware accelerator comprises three input ports and seven output ports according to the structure so as to perform efficient data access and interaction with the RAM.
Next, a new instruction set is extended for activating the NTT hardware accelerator and configuring its parameters. When the instruction format is designed, the occupied instruction format in the original instruction set is skillfully avoided. For example, in the present case, an R-type instruction is employed and 0000011 is selected as the value of the funct field (funct field is part of the RISC-V instruction set) because this value was not used in the previous instruction set. Meanwhile, the funct field (funct field is part of the RISC-V instruction set) is utilized to distinguish between NTT-on instructions and parameter configuration instructions. After the instruction format is designed, the RISC-V tool chain needs to be recompiled, or an inline assembly is used in combination with the word format, to ensure that the tool chain can recognize and compile the extended instructions. The intended function can be implemented as long as the tool chain is able to recognize and compile these extended instructions.
Finally, it is also a crucial step to integrate the NTT hardware accelerator into the memory stage of the processor pipeline. By using a multiplexer, the interaction signals (such as access address, write-back data and data input and the like) of the NTT execution stage are selectively connected with the signals of the original access instruction. When NTT operation is carried out, the ports of the RAM and the processor are connected to the interaction interface of the NTT by the corresponding selection signals, 96-bit data paths are started, and other instructions are connected back to the original LSU interface, and 64-bit data paths which are not needed to be used are shielded, so that power consumption is saved.
Through the series of carefully designed steps, the NTT hardware implementation method not only improves the efficiency of NTT operation, but also maintains the flexibility and energy efficiency of a processor, and provides a high-efficiency and extensible solution for the hardware implementation of a post quantum cryptography algorithm.
While the foregoing is directed to the preferred embodiments of the present application, it will be appreciated by those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present application, and such modifications and adaptations are intended to be comprehended within the scope of the present application.

Claims (7)

1. An NTT hardware implementation method, applied to an NTT hardware accelerator, where the NTT hardware accelerator is coupled to an LSU of a processor, where the NTT hardware accelerator includes an address generation module and a butterfly operation module, where the NTT hardware implementation method includes:
The address generation module generates addresses of two input coefficients, addresses of two output results and an address of a twiddle factor;
The butterfly operation module reads the two input coefficients and the twiddle factor from the main memory through the LSU at one time based on the addresses of the two input coefficients and the address of the twiddle factor, performs butterfly operation on the two input coefficients and the twiddle factor, and writes butterfly operation results into the main memory through the LSU based on the addresses of the two output results.
2. The NTT hardware implemented method according to claim 1, further comprising, prior to the step of the address generation module generating addresses for two input coefficients, addresses for two output results, and addresses for one twiddle factor:
and configuring the order n and the modulus q of the NTT hardware accelerator through a custom instruction.
3. The NTT hardware implemented method according to claim 1, further comprising, prior to the step of the address generation module generating addresses for two input coefficients, addresses for two output results, and addresses for one twiddle factor:
And transmitting the parameters of the source register of the processor to the NTT hardware accelerator for configuration, wherein if the parameters of the source register are 1, the NTT operation is started.
4. The NTT hardware-implemented method of claim 3, further comprising:
If the source register parameter is 0, then INTT operations are performed.
5. The NTT hardware implemented method of claim 1, wherein the length of the addresses of the two input coefficients and the length of the address of the one twiddle factor are 32 bits.
6. The NTT hardware implemented method according to claim 5, further comprising, after the step of the address generation module generating addresses of two input coefficients, addresses of two output results, and addresses of one twiddle factor:
The address generation module splices the addresses of two 32-bit input coefficients and the address of one 32-bit twiddle factor into a 96-bit address which is sent to a main memory read address port.
7. The NTT hardware-implemented method of claim 1, further comprising:
when NTT operation is carried out, a port of a main memory and a processor is connected to an interactive interface of the NTT hardware accelerator, and a 96-bit data path is enabled.
CN202510204528.9A 2025-02-24 2025-02-24 A NTT hardware implementation method Pending CN120144090A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510204528.9A CN120144090A (en) 2025-02-24 2025-02-24 A NTT hardware implementation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202510204528.9A CN120144090A (en) 2025-02-24 2025-02-24 A NTT hardware implementation method

Publications (1)

Publication Number Publication Date
CN120144090A true CN120144090A (en) 2025-06-13

Family

ID=95961363

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510204528.9A Pending CN120144090A (en) 2025-02-24 2025-02-24 A NTT hardware implementation method

Country Status (1)

Country Link
CN (1) CN120144090A (en)

Similar Documents

Publication Publication Date Title
US6968445B2 (en) Multithreaded processor with efficient processing for convergence device applications
US7308320B2 (en) Processor core for using external extended arithmetic unit efficiently and processor incorporating the same
RU2636669C2 (en) Device and method of reversing and swapping bits in mask register
US5617553A (en) Computer system which switches bus protocols and controls the writing of a dirty page bit of an address translation buffer
CN110806899B (en) Assembly line tight coupling accelerator interface structure based on instruction extension
US20030093648A1 (en) Method and apparatus for interfacing a processor to a coprocessor
US10387072B2 (en) Systems and method for dynamic address based mirroring
US10877509B2 (en) Communicating signals between divided and undivided clock domains
CN104050415B (en) The sane and high performance instruction called for system
US8949575B2 (en) Reversing processing order in half-pumped SIMD execution units to achieve K cycle issue-to-issue latency
CN114924793A (en) Processing unit, computing device and instruction processing method
CN114428638A (en) Instruction issuing unit, instruction execution unit, related apparatus and method
US5367648A (en) General purpose memory access scheme using register-indirect mode
US8583897B2 (en) Register file with circuitry for setting register entries to a predetermined value
JPH05143323A (en) Method and apparatus for executing type-1 diadic instruction
CN120144090A (en) A NTT hardware implementation method
CN114924792A (en) Instruction decoding unit, instruction execution unit, and related devices and methods
CN115269011A (en) Instruction execution unit, processing unit and related device and method
US5596717A (en) Four state token passing alignment fault state circuit for microprocessor address misalignment fault generation
JP4264622B2 (en) Processor
CN118897822B (en) RISC-V extended instruction set construction method and device supporting dynamic reconstruction
US20020112145A1 (en) Method and apparatus for providing software compatibility in a processor architecture
US6442671B1 (en) System for priming a latch between two memories and transferring data via the latch in successive clock cycle thereafter
US6289439B1 (en) Method, device and microprocessor for performing an XOR clear without executing an XOR instruction
US6772318B1 (en) Bypass control circuit

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination