TWI900865B

TWI900865B - A processor and a method for processing data in the processor

Info

Publication number: TWI900865B
Application number: TW112124024A
Authority: TW
Inventors: 曼諾庫瑪; 希薇亞梅莉塔穆勒; 笛巴普里雅洽特傑; 尼爾斯弗里克; 馬丁迪德柏克斯
Original assignee: International Business Machines Corporation (美商萬國商業機器公司)
Priority date: 2022-07-05
Filing date: 2023-06-28
Publication date: 2025-10-11
Also published as: TW202422506A

Abstract

A processor includes an execution unit for executing a message padding instruction including an operand field indicating a register buffering a message block segment of a message block to be padded and a mode field indicating which hash function is to be applied to the message block. The execution unit includes a padding circuit configured to receive a message block segment from a register indicated by the operand field, where the message block spans multiple registers in a register file. Based on which hash function is indicated by the mode field, the padding circuit selects a byte location in the message block segment at which to insert at least one padding byte and inserts the at least one padding byte at the byte location within the message block segment. The message block segment as padded by the at least one padding byte is written back to the register file.

Description

A processor and a method for processing data in the processor

The present invention relates generally to data processing and, in particular, to efficiently padding message blocks for hash algorithms.

An important aspect of data security is the protection of data at rest (e.g., when stored on a data storage device) or in transit (e.g., during transmission) through encryption. Generally speaking, encryption involves converting unencrypted data (referred to as plaintext) into encrypted data (referred to as ciphertext) by combining the plaintext with one or more encryption keys using an encryption function. To recover the plaintext from the ciphertext, the ciphertext is processed by a decryption function using one or more decryption keys. Thus, encryption provides data security by requiring that a party know an additional secret (i.e., the decryption key) before that party can access the protected plaintext.

In many implementations, data encryption is performed by software executing on a general-purpose processor. While implementing encryption in software offers the advantages of being able to select among different encryption algorithms and to easily adapt the selected algorithm to various data lengths, performing encryption in software has the concomitant disadvantage of relatively poor performance. As the volume of data sets continues to increase dramatically in the era of "big data," the performance achieved by software-implemented encryption can be unacceptable when encrypting large messages and/or data sets. Concern over encryption performance also arises from the growing need to run enterprise applications on encrypted data in order to mitigate the consequences of hacking and other cyberattacks and to ensure regulatory compliance. Consequently, it is often desirable to provide support for encryption in hardware to achieve improved performance.

The present invention recognizes that one class of cryptographic algorithms for which hardware support is desirable is hash functions, including but not limited to those belonging to the Secure Hash Algorithm (SHA) family of standards. As is known in the art, the SHA family of standards defines hash algorithms approved by the National Institute of Standards and Technology (NIST) for generating a condensed representation of a message (i.e., a message digest). The SHA family of standards is specified in two Federal Information Processing Standards (FIPS): FIPS 180-4, "Secure Hash Standard," and FIPS 202, "SHA-3 Standard: Permutation-Based Hash and Extendable-Output Functions," which are incorporated herein by reference. FIPS 180-4 specifies seven hash algorithms, namely Secure Hash Algorithm-1 (SHA-1) and the SHA-2 family of hash algorithms, which includes SHA-224, SHA-256, SHA-384, SHA-512, SHA-512/224, and SHA-512/256. FIPS 202 additionally specifies four SHA-3 hash algorithms having fixed-length outputs (i.e., SHA3-224, SHA3-256, SHA3-384, and SHA3-512) and two closely related extendable-output functions (XOFs) named SHAKE128 and SHAKE256 (where SHAKE is an abbreviation of Secure Hash Algorithm and Keccak). Additional uses of the SHA family of standards (e.g., as a stream cipher, an authenticated encryption system, or a tree hashing scheme) have not yet been adopted as NIST standards.

Given the wide variety of hash functions and of the data sizes they handle (even within the SHA family of standards), broad hardware support for hash functions can result in a large area of the processor layout being consumed by the circuitry implementing the hash functions. As a result, some hardware solutions choose to implement this circuitry separately from the processor core, for example, in a bus-attached application-specific integrated circuit (ASIC) or accelerator. While offering the potential for better performance than some software solutions, such auxiliary circuitry is still subject to bus and memory access latency and to messaging overhead, which again limits performance compared to what is achievable within a high-performance processor core. This performance penalty is particularly severe for relatively small messages (e.g., messages that fit within a single message block), which constitute the majority of SHA messages handled in enterprise servers. The present invention addresses these and other design considerations by efficiently implementing hash functions, including the associated message block padding, in the processor.

In one embodiment, a processor includes an instruction fetch unit that fetches instructions to be executed, a register file that includes a plurality of registers for storing source and destination operands, and an execution unit for executing a message padding instruction. The message padding instruction includes an operand field and a mode field. The operand field indicates which one of the plurality of registers buffers a message block segment of a message block to be padded, and the mode field indicates which of a plurality of different hash functions is to be applied to the message block. The execution unit includes a padding circuit configured, based on the message padding instruction, to receive a message block segment from the one of the plurality of registers indicated by the operand field of the message padding instruction, where the message block spans multiple registers in the register file. Based on which of the plurality of different hash functions is indicated by the mode field of the message padding instruction, the padding circuit selects a byte position in the message block segment at which to insert at least one padding byte and inserts the at least one padding byte at that byte position within the message block segment. The message block segment as padded by the at least one padding byte is then written back to the register file.

In one embodiment, a method of data processing includes fetching, by an instruction fetch unit of a processor, instructions to be executed by the processor, where the instructions include a message padding instruction including an operand field and a mode field. The operand field indicates which one of a plurality of registers buffers a message block segment of a message block to be padded, and the mode field indicates which of a plurality of different hash functions is to be applied to the message block. Based on receiving the message padding instruction, an execution unit of the processor executes the message padding instruction. Executing the message padding instruction includes receiving, from a register file, a message block segment from the one of the plurality of registers indicated by the operand field of the message padding instruction, where the message block spans multiple registers in the register file. Based on which of the plurality of different hash functions is indicated by the mode field of the message padding instruction, a byte position at which to insert at least one padding byte is selected in the message block segment, and the at least one padding byte is inserted at that byte position within the message block segment. The message block segment as padded by the at least one padding byte is written back to the register file.

In one embodiment, the message block includes multiple message block segments, and the execution unit is configured to detect, based on an indication in the mode field, which of the multiple message block segments the message block segment is.

In one embodiment, the plurality of different hash functions includes a first hash function and a second hash function. The execution unit is configured to insert both end-of-message (EOM) padding and end-of-block (EOB) padding in the message block segment based on the mode field indicating the first hash function, and to insert EOM padding but not EOB padding in the message block segment based on the mode field indicating the second hash function.
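The difference between the two padding behaviors can be illustrated with a minimal byte-level sketch. It assumes, purely for illustration, that the first hash function follows the byte-oriented SHA-3 padding convention of FIPS 202 (an EOM byte of 0x06 after the last message byte and an EOB byte of 0x80 in the last byte of the block) and that the second follows the SHA-256 convention of FIPS 180-4 (an EOM byte of 0x80 followed by zeros and a 64-bit message length). The embodiment above does not name the two functions, and the function names and parameters below are hypothetical. The sketch models whole blocks in software, not register-sized segments in hardware.

```python
def pad_sha3_style(tail: bytes, rate_bytes: int, eom_byte: int = 0x06) -> bytes:
    """Pad a final partial block with both EOM and EOB padding (assumed SHA-3 style)."""
    if len(tail) >= rate_bytes:
        raise ValueError("tail must be shorter than one block")
    block = bytearray(rate_bytes)   # unused byte positions start out as zero
    block[:len(tail)] = tail
    block[len(tail)] |= eom_byte    # EOM padding directly after the message
    block[-1] |= 0x80               # EOB padding in the last byte of the block
    return bytes(block)


def pad_sha256_style(message: bytes) -> bytes:
    """Pad a message with EOM padding only (assumed SHA-256 style, 64-byte blocks)."""
    zeros = (55 - len(message)) % 64            # zero bytes up to the length field
    length = (8 * len(message)).to_bytes(8, "big")
    return message + b"\x80" + bytes(zeros) + length
```

In the first function the two padding bytes are OR-ed into a zeroed block, so a message that ends one byte short of the block boundary receives a single byte carrying both markers.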

In one embodiment, the operand field of the message padding instruction indicates one of the plurality of registers that buffers a length parameter of the message segment, and the execution unit is configured to interpret the length parameter differently based on which of the plurality of hash functions is indicated by the mode field.

In one embodiment, the at least one padding byte includes an end-of-message (EOM) padding byte, selecting the byte position includes generating an EOM enable vector having a length corresponding to that of the message block segment, and inserting the at least one padding byte includes inserting the EOM padding byte at the byte position in the message block segment based on the EOM enable vector.
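One way to picture an enable vector is as a per-byte select signal for the message block segment. The sketch below assumes a one-hot encoding with one entry per byte of the segment; the source states only that the vector's length corresponds to that of the segment, so the encoding, the function names, and the 16-byte segment used in the usage note are illustrative assumptions.

```python
from typing import List


def eom_enable_vector(segment_bytes: int, eom_position: int) -> List[bool]:
    """Return one enable flag per byte; only the EOM byte position is enabled.

    If eom_position lies outside the segment, no flag is set, which models an
    EOM byte that belongs to a different segment of the message block.
    """
    return [i == eom_position for i in range(segment_bytes)]


def insert_eom(segment: bytes, enable: List[bool], eom_byte: int) -> bytes:
    """Per-byte multiplexer: take the EOM byte where enabled, else the data byte."""
    if len(enable) != len(segment):
        raise ValueError("enable vector must match the segment length")
    return bytes(eom_byte if en else b for b, en in zip(segment, enable))
```

For a 16-byte segment holding a 3-byte message tail, `eom_enable_vector(16, 3)` enables only byte position 3, and `insert_eom` replaces that byte while leaving the others unchanged.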

In one embodiment, the padding circuit includes a selection circuit configured to select a value of the EOM padding byte based on which of the plurality of different hash functions is indicated by the mode field of the message padding instruction.

In one embodiment, the register file is a first register file including a first plurality of registers; the processor includes a second register file including a second plurality of registers, each having a length less than that of the first plurality of registers; and the execution unit is further configured to assemble multiple chunks of the message block segment in multiple registers among the second plurality of registers and to transfer all of the multiple chunks into one of the first plurality of registers to form the message block segment. In one embodiment, the processor is further configured to insert end-of-block (EOB) padding into one of the multiple registers among the second plurality of registers before transferring the multiple chunks to the one of the first plurality of registers.

In one embodiment, the at least one padding byte includes an end-of-block (EOB) padding byte, selecting the byte position includes generating an EOB enable vector having a length corresponding to that of the message block segment, and inserting the at least one padding byte includes inserting the EOB padding byte at the byte position in the message block segment based on the EOB enable vector.

In one embodiment, inserting the at least one padding byte includes logically combining the EOB padding byte with an end-of-message (EOM) padding byte.
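A case where such a combination matters is a message that ends exactly one byte before the block boundary, so that the EOM and EOB padding bytes fall on the same byte position. The sketch below assumes the byte-oriented SHA-3 padding values of FIPS 202 (0x06 for SHA3, 0x1F for SHAKE, 0x80 for the end of the block) and assumes a bitwise OR as the logical operation; the source says only that the bytes are "logically combined." The function name is hypothetical.

```python
EOB_BYTE = 0x80  # assumed end-of-block marker (final bit of the SHA-3 pad10*1 rule)


def final_pad_bytes(free_bytes: int, eom_byte: int = 0x06) -> bytes:
    """Return the padding for a block that has free_bytes unused byte positions."""
    if free_bytes < 1:
        raise ValueError("at least one free byte is needed for padding")
    pad = bytearray(free_bytes)
    pad[0] |= eom_byte    # EOM byte directly after the message
    pad[-1] |= EOB_BYTE   # EOB byte at the end of the block; ORs with EOM if same byte
    return bytes(pad)
```

With a single free byte the two markers merge into one byte (0x06 | 0x80 = 0x86 for the assumed SHA3 value), while with more free bytes they remain separate and are separated by zero bytes.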

In one embodiment, the execution unit includes a hash circuit configured, based on a hash instruction, to apply one of the SHA family of hash functions to a padded message block that includes the padded message block segment.

In one embodiment, the message block segment is a portion of a message including a plurality of message blocks having a same length of r bits; the padded message block includes r bits; and each of the plurality of registers has a length less than r bits.
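Because each register is narrower than a message block, a block of r bits must be split into several register-sized segments. The arithmetic can be sketched as a ceiling division. The register widths used in the usage note (128 and 1024 bits) are illustrative assumptions, and the function name is hypothetical; the block sizes are the SHA-3 rates given in FIPS 202.

```python
def registers_per_block(r_bits: int, register_bits: int) -> int:
    """Number of registers needed to hold one r-bit message block."""
    if register_bits >= r_bits:
        raise ValueError("this embodiment assumes registers narrower than a block")
    return -(-r_bits // register_bits)  # ceiling division on integers
```

For example, a 1088-bit block (the SHA3-256 rate) would span two registers of an assumed 1024-bit width, or nine registers of an assumed 128-bit width, with the last register only partly filled in both cases.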

Referring now to the figures, and in particular to FIG. 1, a high-level block diagram of a data processing system 100 according to one embodiment is illustrated. In some implementations, data processing system 100 can be, for example, a server computer system (such as one of the POWER series of servers available from International Business Machines Corporation), a mainframe computer system, a mobile computing device (such as a smartphone or tablet), a laptop or desktop personal computer system, or an embedded processor system.

As shown, data processing system 100 includes one or more processors 102 that process instructions and data. As is known in the art, each processor 102 can be realized as a respective integrated circuit having a semiconductor substrate in which integrated circuitry is formed. In at least some embodiments, processors 102 can generally implement any of a number of commercially available processor architectures, for example, POWER, ARM, Intel x86, NVidia, Apple silicon, etc. In the depicted example, each processor 102 includes one or more processor cores 104 and cache memory 106, which provides low-latency access to instructions and operands likely to be read and/or written by processor cores 104. Processors 102 are coupled for communication by a system interconnect 110, which in various implementations may include one or more buses, switches, bridges, and/or hybrid interconnects.

Data processing system 100 may additionally include a number of other components coupled to system interconnect 110. For example, these components may include a memory controller 112 that controls access by processors 102 and other components of data processing system 100 to a system memory 114. In addition, data processing system 100 may include an input/output (I/O) adapter 116 for coupling one or more I/O devices to system interconnect 110, a non-volatile storage system 118, and a network adapter 120 for coupling data processing system 100 to a communication network (e.g., a wired or wireless local area network and/or the Internet).

Those skilled in the art will additionally appreciate that data processing system 100 shown in FIG. 1 can include many additional components that are not illustrated. Because such additional components are not necessary for an understanding of the described embodiments, they are not illustrated in FIG. 1 or discussed further herein. It should also be understood, however, that the enhancements described herein are applicable to data processing systems and processors of diverse architectures and are in no way limited to the generalized data processing system architecture illustrated in FIG. 1.

Referring now to FIG. 2, a high-level block diagram of an exemplary processor core 200 according to one embodiment is depicted. Processor core 200 may be used to implement any of processor cores 104 of FIG. 1.

In the depicted example, processor core 200 includes an instruction fetch unit 202 for fetching instructions within one or more instruction streams from storage 230 (which may include, for example, cache memory 106 and/or system memory 114 of FIG. 1). In a typical implementation, each instruction has a format defined by the instruction set architecture of processor core 200 and includes at least an operation code (opcode) field specifying an operation (e.g., fixed-point or floating-point arithmetic operation, vector operation, matrix operation, logical operation, branch operation, memory access operation, cryptographic operation, etc.) to be performed by processor core 200. Certain instructions may additionally include one or more operand fields that directly specify operands or that implicitly or explicitly reference one or more registers storing source operands to be used in execution of the instruction and one or more registers for storing destination operands generated by execution of the instruction. Instruction decode unit 204, which in some embodiments may be merged with instruction fetch unit 202, decodes the instructions retrieved from storage 230 by instruction fetch unit 202 and forwards branch instructions, which control the flow of execution, to branch processing unit 206. In some embodiments, the processing of branch instructions performed by branch processing unit 206 may include speculating the outcome of conditional branch instructions. The results of branch processing (both speculative and non-speculative) by branch processing unit 206 may, in turn, be used to redirect one or more streams of instruction fetching by instruction fetch unit 202.

Instruction decode unit 204 forwards instructions that are not branch instructions (often referred to as "sequential instructions") to mapper circuit 210. Mapper circuit 210 is responsible for assigning physical registers within the register files of processor core 200 to instructions as needed to support instruction execution. Mapper circuit 210 preferably implements register renaming. Thus, for at least some classes of instructions, mapper circuit 210 establishes transient mappings between a set of logical (or architected) registers referenced by the instructions and a larger set of physical registers within the register files of processor core 200. As a result, processor core 200 can avoid unnecessary serialization of instructions that are not data dependent, as might otherwise occur due to the reuse of the limited set of architected registers by instructions proximate in program order.
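Register renaming of this general kind can be illustrated with a toy software model. The sketch below is not the mapper circuit of the described embodiment; it is a generic rename table with a free list, and the class name, method names, and register counts are hypothetical. Reclaiming physical registers when instructions retire is deliberately not modeled.

```python
from typing import List, Tuple


class RenameMap:
    """Toy logical-to-physical register map with a free list."""

    def __init__(self, n_logical: int, n_physical: int) -> None:
        if n_physical <= n_logical:
            raise ValueError("need more physical than logical registers")
        self.table = list(range(n_logical))             # current mapping
        self.free = list(range(n_logical, n_physical))  # unassigned physical regs

    def rename(self, dst: int, srcs: List[int]) -> Tuple[int, List[int]]:
        """Map sources through the table, then give dst a fresh physical register."""
        phys_srcs = [self.table[s] for s in srcs]
        phys_dst = self.free.pop(0)   # raises IndexError if the free list is empty
        self.table[dst] = phys_dst
        return phys_dst, phys_srcs
```

Two nearby instructions that both write logical register 1 but do not depend on each other receive different physical destination registers in this model, so neither has to wait for the other merely because they share an architected register name.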

Still referring to FIG. 2, processor core 200 additionally includes a dispatch circuit 216 configured to ensure that any data dependencies between instructions are observed and to dispatch sequential instructions as they become ready for execution. Instructions dispatched by dispatch circuit 216 are temporarily buffered in an issue queue 218 until the execution units of processor core 200 have resources available to execute the dispatched instructions. As the appropriate execution resources become available, issue queue 218 issues instructions to the execution units of processor core 200 opportunistically and possibly out of order with respect to the original program order of the instructions.

In the depicted example, processor core 200 includes several different types of execution units for executing respective different classes of instructions. In this example, the execution units include one or more fixed-point units 220 for executing instructions that access fixed-point operands; one or more floating-point units 222 for executing instructions that access floating-point operands; one or more load-store units 224 for loading data from and storing data to storage 230; and one or more vector-scalar units 226 for executing instructions that access vector and/or scalar operands. In a typical embodiment, each execution unit is implemented as a multi-stage pipeline in which multiple instructions can be processed simultaneously at different stages of execution. Each execution unit preferably includes, or is coupled to access, at least one register file including a plurality of physical registers for temporarily buffering operands accessed in or generated by instruction execution.

Those skilled in the art will appreciate that processor core 200 may include additional components that are not illustrated, such as logic configured to manage the completion and retirement of instructions for which execution by execution units 220 through 226 has finished. Because these additional components are not necessary for an understanding of the described embodiments, they are not illustrated in FIG. 2 or discussed further herein.

Referring now to FIG. 3, a high-level block diagram of an exemplary execution unit of a processor 102 according to one embodiment is illustrated. In this example, vector-scalar unit 226 of processor core 200 is shown in greater detail. In the embodiment of FIG. 3, vector-scalar unit 226 is configured to execute multiple different classes of instructions that operate on and generate different types of operands. For example, vector-scalar unit 226 is configured to execute a first class of instructions that operate on vector and scalar source operands and that generate vector and scalar destination operands. Vector-scalar unit 226 executes instructions in this first class in functional units 302 through 312, which in the depicted embodiment include an arithmetic logic unit/rotate unit 302 for performing addition, subtraction, and rotate operations, a multiply unit 304 for performing binary multiplication, a divide unit 306 for performing binary division, an encryption unit 308 for performing encryption functions, a permute unit 310 for performing operand permutations, and a binary-coded decimal (BCD) unit 312 for performing decimal arithmetic. The vector and scalar source operands on which these operations are performed, and the vector and scalar destination operands generated by these operations, are buffered in the physical registers of an architected register file 300.

In this example, vector-scalar unit 226 is additionally configured to execute a second class of instructions that cause hash functions to be performed. Vector-scalar unit 226 executes instructions in this second class in an accelerator unit 314. The operands on which these hash functions are performed, and the operands generated by these hash functions, are buffered and accumulated in a wide vector register file 316, which may include, for example, physical registers that are 1024 bits wide.

In operation, vector-scalar unit 226 receives instructions from issue queue 218. If an instruction is in the first class of instructions (e.g., vector-scalar instructions), the relevant source operands for the instruction are accessed in architected register file 300 using the mapping between logical and physical registers established by mapper circuit 210 and are then forwarded with the instruction to the relevant one of functional units 302 through 312 for execution. The destination operands generated by that execution are then stored back to the physical registers of architected register file 300 determined by the mapping established by mapper circuit 210. If, on the other hand, the instruction is in the second class of instructions (e.g., hash instructions), the instruction is forwarded to accelerator unit 314 for execution with respect to operands buffered in specified registers of wide vector register file 316.

Referring now to FIG. 4, a more detailed block diagram of accelerator unit 314 of FIG. 3 according to one embodiment is depicted. Accelerator unit 314 includes circuitry for performing a variety of hash functions in hardware, including, for example, one or more of the hash functions defined by the SHA family of standards. In the depicted example, the hash circuitry of accelerator unit 314 includes at least a SHA3/SHAKE hash circuit 400, described in greater detail below with reference to FIG. 11, and a SHA2 hash circuit 402, described in greater detail below with reference to FIG. 17. Accelerator unit 314 additionally includes a single-instruction, multiple-data (SIMD) exclusive OR (XOR) circuit 404 employed in performing SHA3/SHAKE hashing of messages, as discussed further below. Finally, accelerator unit 314 includes a data transfer circuit 406 that transfers data (e.g., messages to be hashed and message digests) between the memory system (e.g., cache memory 106 and system memory 114) and wide vector register file 316.

Referring now to FIG. 5, a time-space diagram of a process 500 of message hashing in accordance with the SHA-3 standard is depicted. As is known in the art, the SHA-3 standard (i.e., FIPS 202) employs a sponge construction based on a wide random function or random permutation. In accordance with this sponge construction, a message 502 of any arbitrary length (possibly many megabytes) is first processed in an input phase, referred to in sponge terminology as the SHA3 absorb phase 504. SHA3 absorb phase 504, which is described in greater detail below with reference to FIG. 6, is the same for both SHA3 hash functions and SHAKE hash functions. SHA3 absorb phase 504 produces a 1600-bit final absorb state 610, which is then processed in an output phase, referred to in sponge terminology as the SHA3/SHAKE squeeze phase 506, to generate a message digest 508. SHA3/SHAKE squeeze phase 506, which is described in detail below with reference to FIG. 8, operates differently for SHA3 hash functions and SHAKE hash functions. In particular, SHA3/SHAKE squeeze phase 506 generates a fixed-length message digest 508 for the various SHA3 hash functions, but generates a variable-length message digest 508 for the SHAKE hash functions.
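The difference between fixed-length and variable-length digests can be observed with a software implementation. The following sketch uses Python's standard `hashlib` module, not the hardware described here; the message contents and the requested output lengths are arbitrary choices for illustration.

```python
import hashlib

msg = b"example message"

# SHA3 hash functions: the digest length is fixed by the function itself.
fixed = hashlib.sha3_256(msg).digest()

# SHAKE functions: the caller chooses the digest length at squeeze time.
short = hashlib.shake_128(msg).digest(16)
longer = hashlib.shake_128(msg).digest(64)

print(len(fixed), len(short), len(longer))
```

Because both SHAKE outputs are squeezed from the same absorbed state, the shorter digest is a prefix of the longer one.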

Table I below summarizes the properties of the four SHA3 hash functions and the two SHAKE hash functions defined by the SHA-3 standard, which are listed in the first column. In Table I, the second column gives the size (r), in bits, of the message blocks into which the SHA3 absorption phase 504 subdivides the variable-length message 502. The message block size r is an integer multiple of the byte length, and the first message block of each message is byte-aligned. The third column of Table I gives the size (d), in bits, of the message digest 508 output by the SHA3/SHAKE squeeze phase 506. It should again be noted that, unlike the SHA3 hash functions, SHAKE-128 and SHAKE-256 generate variable-length digests of length d'. As indicated in the fourth column of Table I, the final absorbed state 610 is 1600 bits long for every hash function specified by the SHA-3 standard. The fifth column of Table I specifies the different values of c, that is, the number of lower-order bits passed between iterations of the SHA3 state permutation function during the SHA3/SHAKE squeeze phase 506 (see, e.g., FIG. 8). Finally, the sixth column of Table I specifies that each iteration of the SHA3 state permutation function employs 24 rounds of permutation per message block (see, e.g., FIG. 7A). In updates to the SHA-3 standard or in non-standard implementations, the number of rounds may be varied, for example, by reducing the number of rounds required (e.g., to 12).

Table I

Hash function   Message block size r (bits)   Digest size d (bits)   State (bits)   c = 1600 - r (bits)   Permutation rounds per message block
SHA3-224        1152                          224                    1600           448                   24
SHA3-256        1088                          256                    1600           512                   24
SHA3-384        832                           384                    1600           768                   24
SHA3-512        576                           512                    1600           1024                  24
SHAKE-128       1344                          d'                     1600           256                   24
SHAKE-256       1088                          d'                     1600           512                   24
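The relationships in Table I can be checked mechanically. The following is a minimal sketch in Python, not part of the patent; the dictionary layout and the function names `capacity_bits` and `block_bytes` are illustrative assumptions. It encodes the r and d columns and derives c = 1600 - r.

```python
# Illustrative encoding of Table I; names and layout are assumptions.
STATE_BITS = 1600          # state size shared by all six functions
ROUNDS_PER_BLOCK = 24      # permutation rounds per message block

# name -> (message block size r in bits, digest size d in bits or None for d')
SHA3_PARAMS = {
    "SHA3-224": (1152, 224),
    "SHA3-256": (1088, 256),
    "SHA3-384": (832, 384),
    "SHA3-512": (576, 512),
    "SHAKE-128": (1344, None),
    "SHAKE-256": (1088, None),
}


def capacity_bits(name):
    """Return c = 1600 - r for the named hash function."""
    r, _ = SHA3_PARAMS[name]
    return STATE_BITS - r


def block_bytes(name):
    """Return the message block size in bytes (r is a whole number of bytes)."""
    r, _ = SHA3_PARAMS[name]
    if r % 8:
        raise ValueError("r must be a multiple of 8 bits")
    return r // 8
```

For the four fixed-length SHA3 functions the derived capacity also equals twice the digest size, which is a quick sanity check on the table values.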

Referring now to FIG. 6, a time-space diagram of the SHA3 absorption phase 504 depicted in FIG. 5 is depicted. As shown, the SHA3 absorption phase 504 receives as input a message 502 of any arbitrary length. As shown at block 600, the message 502 is padded to obtain a length that is an integer multiple of r bits. In many prior art implementations, this padding is accomplished by a high-latency, computationally expensive memory-to-memory move of the entire message 502. In some other prior art implementations, a SHA hash software routine pads a message block after loading the message block into SIMD registers, using a sequence of conventional SIMD instructions. Although these prior art techniques can be used herein to perform the padding, the padding can alternatively be performed efficiently in hardware, within processor registers (e.g., the wide vector register file 316), by executing a padding instruction in accordance with the disclosed invention, as described in detail below with reference to FIGS. 21A to 27. Padding the message 502 by executing a padding instruction also allows the padding to be applied to the end of the message 502 in a manner that overlaps in time with the processing of message blocks in the SHA3 absorption phase 504.
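As a point of reference for what the padding at block 600 produces, the following Python sketch pads the final partial block of a byte-oriented message in the way FIPS 202 specifies (domain suffix plus pad10*1, which for whole-byte messages reduces to XORing 0x06 or 0x1F at the first free byte and 0x80 at the last byte). It is a software model only; the function name and arguments are assumptions and do not describe the patent's padding instruction, whose byte selection is covered with FIGS. 21A to 27.

```python
def sha3_pad_last_block(tail, rate_bytes, shake=False):
    """Pad the final partial block to exactly rate_bytes bytes.

    tail       -- the 0 <= len(tail) < rate_bytes message bytes left over
    rate_bytes -- message block size r expressed in bytes (e.g. 136)
    shake      -- True for SHAKE-128/256, False for the SHA3 functions
    """
    if not 0 <= len(tail) < rate_bytes:
        raise ValueError("tail must be shorter than one message block")
    block = bytearray(rate_bytes)          # starts as all zero bytes
    block[:len(tail)] = tail
    # First padding byte: domain suffix bits followed by the leading 1 of pad10*1.
    block[len(tail)] ^= 0x1F if shake else 0x06
    # Final 1 bit of pad10*1 lands in the last byte of the block.
    block[-1] ^= 0x80
    return bytes(block)
```

When only one byte is free, both XORs hit the same byte, so a SHA3 block ends in 0x86 and a SHAKE block ends in 0x9F.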

In the SHA3 absorption phase 504, each of the n message blocks of length r that make up the padded message (where n is a positive integer) is extracted and then zero-extended in its trailing low-order bits to form one of n 1600-bit extended message blocks 602. The first message block, i.e., message block 1 602, forms the input to the SHA3 state permutation function 604 defined by the SHA-3 standard. As described below with reference to FIGS. 9 and 11, according to one aspect of the disclosed invention, the SHA3 state permutation function 604 is performed in hardware by executing a SHA3 hash instruction. The 1600-bit state output by the SHA3 state permutation function 604 forms the first input of a 1600-bit bitwise XOR function 606, which takes the next 1600-bit extended message block 602 of the padded message as its second input. The result of the bitwise XOR function 606 forms the input to the next iteration of the SHA3 state permutation function 604. As shown, this process continues iteratively for each of the message blocks 602 until the final iteration of the SHA3 state permutation function 604 generates and outputs the 1600-bit final absorbed state 610, as previously mentioned in the description of FIG. 5.

Referring now to FIG. 7A, a time-space diagram of the SHA3 state permutation function 604 illustrated in FIG. 6 is shown. The SHA3 state permutation function 604 accepts a 1600-bit input and then processes that input, together with the SHA-3 standard-specified round index 0 702, in the first of 24 rounds of the SHA3 round function 704. This process continues iteratively, with each subsequent round of the SHA3 round function 704 receiving as inputs the 1600-bit output of the preceding SHA3 round function 704 and the associated SHA-3 standard-specified round index 702 (which is a constant). After the 24 rounds of processing within the SHA3 state permutation function 604 are complete, the SHA3 state permutation function 604 outputs a 1600-bit state. This state either serves as an input to the bitwise XOR function 606 or, in the case of the final iteration of the SHA3 state permutation function 604 within the SHA3 absorption phase 504, constitutes the final absorbed state 610 that serves as the input to the SHA3/SHAKE squeeze phase 506.

Referring now to FIG. 7B, a time-space diagram of the SHA3 round function 704 depicted in FIG. 7A is depicted. As shown, the SHA3 round function 704 comprises a sequence of functions specified by the SHA-3 standard, namely, in order, the five functions denoted in the SHA-3 standard by the Greek letters θ (theta), ρ (rho), π (pi), χ (chi), and ι (iota). The θ function receives and processes the 1600-bit input to the round function 704, and the output of every function other than the ι function feeds the next function in the sequence. Finally, the ι function processes the output of the χ function and the associated round index 702 to produce the 1600-bit output of the given iteration of the SHA3 round function 704. In the prior art, executing the round function 704 using two single-instruction, multiple-data (SIMD) vector pipelines can take up to 80 cycles. According to one aspect of the invention disclosed herein, the round function 704 can be completed in a single cycle of the processor core 104 using the SHA3/SHAKE hash circuit 400 of FIG. 11, described below.
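For readers who want to see the five step functions concretely, the following Python sketch models one round and the 24-round permutation in software. It follows the lane-oriented formulation of FIPS 202 (a 5×5 array of 64-bit lanes, little-endian byte order), which is an assumption of this sketch and not a description of the hardware data path of the SHA3/SHAKE hash circuit 400; all function names are illustrative.

```python
import hashlib

MASK64 = (1 << 64) - 1


def rol64(value, shift):
    """Rotate a 64-bit lane left by shift bits."""
    shift %= 64
    return ((value << shift) | (value >> (64 - shift))) & MASK64


def round_constants():
    """Derive the 24 iota constants (one per round index) from the FIPS 202 LFSR."""
    constants, lfsr = [], 1
    for _ in range(24):
        rc = 0
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                rc ^= 1 << ((1 << j) - 1)
        constants.append(rc)
    return constants


def keccak_round(a, rc):
    """Apply theta, rho, pi, chi and iota once; a[x][y] are 64-bit lanes."""
    # theta: XOR each lane with the parities of two neighbouring columns
    c = [a[x][0] ^ a[x][1] ^ a[x][2] ^ a[x][3] ^ a[x][4] for x in range(5)]
    d = [c[(x + 4) % 5] ^ rol64(c[(x + 1) % 5], 1) for x in range(5)]
    a = [[a[x][y] ^ d[x] for y in range(5)] for x in range(5)]
    # rho and pi: rotate each lane and move it to its new position
    x, y = 1, 0
    current = a[x][y]
    for t in range(24):
        x, y = y, (2 * x + 3 * y) % 5
        current, a[x][y] = a[x][y], rol64(current, (t + 1) * (t + 2) // 2)
    # chi: non-linear mixing along each row
    for y in range(5):
        row = [a[x][y] for x in range(5)]
        for x in range(5):
            a[x][y] = row[x] ^ (~row[(x + 1) % 5] & row[(x + 2) % 5])
    # iota: XOR the round constant into lane (0, 0)
    a[0][0] ^= rc
    return a


def keccak_f(lanes):
    """Run all 24 rounds of the state permutation."""
    for rc in round_constants():
        lanes = keccak_round(lanes, rc)
    return lanes


def bytes_to_lanes(state):
    """Split a 200-byte state into a[x][y] lanes (little-endian)."""
    return [[int.from_bytes(state[8 * (x + 5 * y):8 * (x + 5 * y) + 8], "little")
             for y in range(5)] for x in range(5)]


def lanes_to_bytes(lanes):
    """Inverse of bytes_to_lanes."""
    return b"".join(lanes[x][y].to_bytes(8, "little")
                    for y in range(5) for x in range(5))


def sha3_256_of_empty_message():
    """Absorb the single padded block of the empty message and squeeze 32 bytes."""
    block = bytearray(200)
    block[0] = 0x06      # SHA3 domain suffix and first padding bit
    block[135] = 0x80    # last padding bit at the end of the 136-byte block
    return lanes_to_bytes(keccak_f(bytes_to_lanes(bytes(block))))[:32]
```

Comparing `sha3_256_of_empty_message()` with `hashlib.sha3_256(b"").digest()` is a convenient way to confirm that the round model is wired correctly.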

Referring now to FIG. 8, a time-space diagram of the SHA3/SHAKE squeeze phase 506 illustrated in FIG. 5 is shown. As previously described, the SHA3/SHAKE squeeze phase 506 receives as input the 1600-bit final absorbed state 610 produced by the SHA3 absorption phase 504. To produce the message digest 508 for any of the SHA3 functions defined by the SHA-3 standard, the SHA3/SHAKE squeeze phase 506 first extracts the first r high-order bits of the final absorbed state 610 to form result block 1 800. A truncation function 802 then truncates the r bits of result block 1 800 to retain the high-order d bits, which form the message digest 508.

To produce the message digest for one of the SHAKE functions defined by the SHA-3 standard, the r bits of result block 1 800 form the r high-order bits of the input to a truncation function 804. These r high-order bits are concatenated with n-1 additional r-bit result blocks 800, each of which is formed from the r high-order bits of the output of an iteration of the SHA3 state permutation function 604, as previously described with respect to FIG. 7A. Each SHA3 state permutation function 604 of the SHA3/SHAKE squeeze phase 506 receives a 1600-bit input (i.e., r + c = 1600) and generates a 1600-bit output, which, except in the case of the last iteration of the SHA3 state permutation function 604, feeds the subsequent iteration of the SHA3 state permutation function 604. The truncation function 804 truncates the r × n input bits to obtain a message digest 508 having the user-specified length of d' bits.
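The truncation behavior of the squeeze phase can be observed with an existing software implementation. The sketch below uses Python's standard `hashlib` module (not the patent's hardware); the helper names `result_blocks_needed` and `truncate` are assumptions introduced for illustration, and digest lengths are assumed to be whole bytes.

```python
import hashlib


def result_blocks_needed(digest_bits, rate_bits):
    """Number n of r-bit result blocks that must be squeezed for a digest."""
    return -(-digest_bits // rate_bits)  # ceiling division


def truncate(result_bytes, digest_bits):
    """Keep the leading digest_bits of the concatenated result blocks."""
    return result_bytes[:digest_bits // 8]


# SHA3-256: d = 256 fits inside one r = 1088-bit result block.
sha3_blocks = result_blocks_needed(256, 1088)

# SHAKE-128 with d' = 4096 bits needs several r = 1344-bit result blocks.
shake_blocks = result_blocks_needed(4096, 1344)

# A shorter SHAKE digest is a prefix of a longer one, which is what
# truncating the concatenated result blocks implies.
long_output = hashlib.shake_128(b"abc").digest(512)
short_output = truncate(long_output, 256)
```

Because every fixed-length SHA3 digest is shorter than its message block, a single result block always suffices for those functions, while SHAKE may require additional iterations of the state permutation.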

Referring now to FIGS. 9 and 10, exemplary formats for a SHA3 hash instruction 900 and a bitwise exclusive OR (XOR) instruction 1000, respectively, are illustrated according to one embodiment. In an exemplary embodiment, the accelerator unit 314 is configured to perform the SHA3/SHAKE state permutation function in hardware using the SHA3/SHAKE hash circuit 400 in response to receiving the SHA3 hash instruction 900, and to perform a 1024-bit bitwise XOR of the specified operands using the SIMD XOR circuit 404 in response to receiving the bitwise XOR instruction 1000.

In the illustrated embodiment, the SHA3 hash instruction 900 includes an opcode field 902 that specifies a particular architecture-specific opcode for the SHA3/SHAKE permutation function. The SHA3 hash instruction 900 additionally includes one or more register fields 904, 906 that specify the registers within the wide vector register file 316 that hold the source and destination operands of the SHA3/SHAKE state permutation function. For example, in one implementation, the SHA3 hash instruction 900 includes a single register field 904 that specifies the first of a pair of adjacent 1024-bit registers that buffer the 1600-bit source operand and, after the SHA3/SHAKE permutation function completes, buffer the 1600-bit destination operand (which overwrites the source operand). In an alternative implementation, the SHA3 hash instruction 900 includes two register fields 904, 906 that specify separate pairs of 1024-bit source and destination registers (in which case the destination operand does not overwrite the source operand).

As mentioned above, in future updates to the SHA-3 standard or in non-standard implementations, it may be desirable to control the number of rounds of permutation applied by the SHA3 state permutation function 604. In such embodiments, the SHA3 hash instruction 900 may include a field that either directly sets the number of rounds of permutation or references a register that specifies the number of rounds of permutation.

FIG. 10 depicts an exemplary embodiment in which the bitwise XOR instruction 1000 includes an opcode field 1002 that specifies a particular architecture-specific opcode for the 1024-bit bitwise XOR function. The bitwise XOR instruction 1000 additionally includes three register fields 1004, 1006, and 1008 that separately specify the 1024-bit registers within the wide vector register file 316 that buffer the two 1024-bit source operands and the one 1024-bit destination operand.
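Because the state is 1600 bits wide while each register and each XOR operation is 1024 bits wide, a state occupies a register pair and is combined with a message block by two XOR operations. The Python sketch below models this with integers. It assumes that the first register holds the high-order 1024 bits and that the remaining 576 bits sit in the high-order end of the second register with zeros below them; the function names are illustrative and not taken from the patent.

```python
REG_BITS = 1024
STATE_BITS = 1600
LOW_BITS = STATE_BITS - REG_BITS      # 576 bits held by the second register
REG_MASK = (1 << REG_BITS) - 1


def split_state(state):
    """Split a 1600-bit integer into a (first, second) pair of 1024-bit values."""
    first = state >> LOW_BITS
    # Assumed placement: left-justified in the second register, zero-extended below.
    second = (state & ((1 << LOW_BITS) - 1)) << (REG_BITS - LOW_BITS)
    return first, second


def join_state(first, second):
    """Rebuild the 1600-bit integer from a register pair."""
    return (first << LOW_BITS) | (second >> (REG_BITS - LOW_BITS))


def wide_xor(src_a, src_b):
    """Model of one 1024-bit bitwise XOR: two sources, one destination."""
    return (src_a ^ src_b) & REG_MASK


def xor_register_pairs(xs, xm):
    """Combine a state pair and a message-block pair with two wide XORs."""
    return wide_xor(xs[0], xm[0]), wide_xor(xs[1], xm[1])
```

Under these assumptions, XORing the pairs register by register gives the same result as XORing the two 1600-bit values directly, since the zero-extended low-order bits of the second registers stay zero.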

Now that the SHA3 and SHAKE hash functions and exemplary instructions for implementing portions of these hash functions have been explained, pseudocode for executing an exemplary SHA3 hash function in hardware is presented. The following pseudocode references these registers:

    Rr ← block length in bytes
    RL ← message length in bytes      // assume RL ≥ Rr and the first block is unpadded
    Ra ← starting address of the message
    Rb ← address of the message digest produced by the hash function
    Rd ← message digest length in bytes
    Xs ← SHA3 state                   // wide vector register pair
    Xm ← message block                // wide vector register pair

Given these registers, the pseudocode for any of the SHA3 (non-SHAKE) hash functions can be expressed as follows:

    Xs = loadlength(Ra, Rr)           // load the first message block of the message and initialize the state
    Xs = sha3hash(Xs)                 // execute the SHA3 hash instruction to permute the first message block
    RL -= Rr                          // decrement the length of the unprocessed portion of the message
    Ra += Rr                          // advance the pointer to the next message block in the message
    While (RL >= Rr)                  // loop over each remaining message block except the last message block
    {
        Xm = loadlength(Ra, Rr)       // load the next message block
        Xs = wide_xor(Xs, Xm)         // execute the bitwise XOR instruction to combine the state and the current message block
        Xs = sha3hash(Xs)             // execute the SHA3 hash instruction to permute the current message block
        RL -= Rr                      // decrement the length of the unprocessed portion of the message
        Ra += Rr                      // advance the pointer to the next message block
    }
    Xm = loadlength(Ra, RL)           // load the last message block, if present (RL may be zero)
    Xm = sha3_padding(Xm, RL, sha3-type)  // execute the padding instruction to pad the message based on the remaining message length and the SHA3 function
    Xs = wide_xor(Xs, Xm)             // execute the bitwise XOR instruction to combine the state and the last message block
    Xs = sha3hash(Xs)                 // execute the SHA3 hash instruction to permute the last message block and produce the final absorbed state
    Store_length(Xs, Rb, Rd)          // in the SHA3 squeeze phase, truncate the final absorbed state to form the message digest
                                      // by storing the leading Rd bytes of Xs to memory at address Rb
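To make the control flow of the pseudocode easier to follow, the Python sketch below replays it symbolically and records which instruction would be issued at each step. It does not hash anything; the function name `sha3_instruction_trace` and the use of strings as stand-ins for instructions are assumptions for illustration only.

```python
def sha3_instruction_trace(message_len, block_len):
    """Return the sequence of instruction names issued by the pseudocode.

    message_len -- RL, the message length in bytes (assumed >= block_len)
    block_len   -- Rr, the block length in bytes
    """
    if message_len < block_len:
        raise ValueError("the pseudocode assumes RL >= Rr")
    # First block: loaded directly into the state and permuted.
    trace = ["loadlength", "sha3hash"]
    remaining = message_len - block_len
    # Every further full block is XORed into the state and permuted.
    while remaining >= block_len:
        trace += ["loadlength", "wide_xor", "sha3hash"]
        remaining -= block_len
    # The last (possibly empty) partial block is padded before being absorbed.
    trace += ["loadlength", "sha3_padding", "wide_xor", "sha3hash", "store_length"]
    return trace
```

One consequence visible in the trace is that a message whose length is an exact multiple of the block length still incurs one extra permutation, because the padding then occupies a block of its own.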

Referring now to FIG. 11, a high-level block diagram of an exemplary SHA3/SHAKE hash circuit 400 suitable for executing the SHA3 hash instruction 900 is illustrated, according to one embodiment. As shown, the SHA3/SHAKE hash circuit 400 includes two 1024-bit two-input multiplexers 1100a, 1100b, two 1024-bit state registers 1102a, 1102b, a SHA3 round circuit 1106, and a control circuit 1110 that controls the operation of the SHA3/SHAKE hash circuit 400 in response to the SHA3 hash instruction 900.

Input multiplexer 1100a has a first input coupled to receive the high-order 1024 bits of the 1600-bit input state from the first register of the register pair in the wide vector register file 316 identified by the SHA3 hash instruction 900, and a second input coupled to receive the high-order 1024 bits of the 1600-bit round feedback from the SHA3 round circuit 1106. Input multiplexer 1100b is similarly structured, having a first input coupled to receive, from the second register of the instruction-specified register pair in the wide vector register file 316, a 1024-bit value that includes the low-order 576 bits of the 1600-bit input state, and a second input coupled to the SHA3 round circuit 1106 to receive a 1024-bit value that includes the low-order 576 bits of the 1600-bit round feedback. Control logic 1110 within the SHA3/SHAKE hash circuit 400 provides select signals (not illustrated) to the input multiplexers 1100a, 1100b that cause the input multiplexers 1100a, 1100b to select the values present at their first inputs prior to SHA3 round 0 and to select the values present at their second inputs after each of SHA3 round 0 through SHA3 round 23. The values output by the input multiplexers 1100a, 1100b and buffered in the state registers 1102a, 1102b, respectively, together form the 1600-bit round input value of the SHA3 round circuit 1106, which is configured to perform the SHA3 round function 704 on the round input value, as previously described with reference to FIGS. 7A to 7B.

The control circuit 1110 is further configured to sequence the SHA3 round circuit 1106 through each of the 24 rounds required by the SHA-3 standard, using the correct round index specified by the SHA-3 standard. After round 23 completes, the state registers 1102a, 1102b hold the high-order 1024 bits and the low-order 576 bits, respectively, of the 1600-bit output state. The control circuit 1110 is further configured, once the output state has been obtained, to assert select signals (not illustrated) that cause the output multiplexer 1108 to write the high-order bits and the low-order bits of the 1600-bit output state from the state registers 1102a, 1102b, respectively, into the instruction-specified register pair in the wide vector register file 316 in two consecutive cycles (assuming the wide vector register file 316 has a single write port).

Referring now to FIG. 12, a high-level logical flowchart of an exemplary process for executing the SHA3 hash instruction 900 is depicted, according to one embodiment. For ease of understanding, the process of FIG. 12 is described with reference to the exemplary SHA3/SHAKE hash circuit 400 of FIG. 11.

The process of FIG. 12 begins at block 1200 and then proceeds to block 1202, which illustrates the SHA3/SHAKE hash circuit 400 receiving a SHA3 hash instruction 900 that specifies an operand register pair within the wide vector register file 316. In response to receiving the SHA3 hash instruction 900, the control circuit 1110 causes the contents of the operand register pair to be read from the wide vector register file 316 and loaded into the state registers 1102a, 1102b via the input multiplexers 1100a, 1100b (block 1204). The control circuit 1110 additionally initializes an internal round counter to 0 (block 1206).

The process then proceeds from block 1206 to block 1208, which illustrates the control circuit 1110 directing the SHA3 round circuit 1106 to perform an iteration of the SHA3 round function 704 using the round input buffered in the state registers 1102a, 1102b and the appropriate round index specified by the SHA-3 standard. The control circuit 1110 additionally increments the round counter (block 1208). The result of the processing by the SHA3 round circuit 1106 is returned to the state registers 1102a, 1102b via the input multiplexers 1100a, 1100b. As indicated at block 1210, the control logic 1110 causes the SHA3 round circuit 1106 to perform the 24 rounds of processing specified by the SHA-3 standard using the appropriate round indices. When the 24 rounds of processing are complete, the control circuit 1110 asserts the appropriate select signals to cause the output multiplexer 1108 to store the 1600-bit state buffered in the state registers 1102a, 1102b (zero-extended in the low-order bits to form two 1024-bit values) into the operand register pair within the wide vector register file 316 specified by the SHA3 hash instruction 900 (block 1214). Thereafter, the process of FIG. 12 ends at block 1216.

Referring now to FIG. 13, a time-space diagram of message hashing according to the SHA-2 standard (FIPS 180-4) is illustrated; in the embodiment of FIG. 4, this message hashing is performed by the SHA2 hash circuit 402. Table II below summarizes the properties of the six SHA2 hash functions defined by the SHA-2 standard, which are listed in the first column. In Table II, the second column gives the message block size (r) in bits. The message block size r is an integer multiple of the byte length, and the first message block of a message is byte-aligned. The third column of Table II gives the fixed size (d), in bits, of the message digest produced by each SHA2 hash function. The fourth column of Table II specifies the size, in bits, of the state of each SHA2 hash function, and the fifth column of Table II indicates the number of rounds of processing employed in each SHA2 hash function (i.e., 64 or 80) (see, e.g., FIG. 14). Finally, the sixth column of Table II specifies the word size, in bits, used by each SHA2 hash function. It should be noted that, for all variants, the state size is 8 times the word size (i.e., the state comprises 8 words) and the message block size is 16 times the word size (i.e., a message block comprises 16 words). According to one aspect of the disclosed invention, SHA2 hash functions employing a 32-bit word size and SHA2 hash functions employing a 64-bit word size are processed along the same data flow by virtue of a message expansion applied to the words of the SHA2-224 and SHA2-256 hash functions, as described below with reference to FIG. 15.

Table II

Hash function   Message block size r (bits)   Digest size d (bits)   State (bits)   Rounds   Word size w (bits)
SHA2-224        512                           224                    256            64       32
SHA2-256        512                           256                    256            64       32
SHA2-384        1024                          384                    512            80       64
SHA2-512        1024                          512                    512            80       64
SHA2-512/224    1024                          224                    512            80       64
SHA2-512/256    1024                          256                    512            80       64
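The invariants stated for Table II (state = 8 × w, message block = 16 × w) can be checked with a few lines of code. The following Python sketch is illustrative only; the dictionary layout and the function name `check_sha2_params` are assumptions and not part of the patent.

```python
# Illustrative encoding of Table II; names and layout are assumptions.
SHA2_PARAMS = {
    # name: block size r, digest size d, state size, rounds, word size w
    "SHA2-224":     dict(r=512,  d=224, state=256, rounds=64, w=32),
    "SHA2-256":     dict(r=512,  d=256, state=256, rounds=64, w=32),
    "SHA2-384":     dict(r=1024, d=384, state=512, rounds=80, w=64),
    "SHA2-512":     dict(r=1024, d=512, state=512, rounds=80, w=64),
    "SHA2-512/224": dict(r=1024, d=224, state=512, rounds=80, w=64),
    "SHA2-512/256": dict(r=1024, d=256, state=512, rounds=80, w=64),
}


def check_sha2_params(params=SHA2_PARAMS):
    """Return True if every row satisfies the invariants stated in the text."""
    for row in params.values():
        if row["state"] != 8 * row["w"]:       # state comprises 8 words
            return False
        if row["r"] != 16 * row["w"]:          # message block comprises 16 words
            return False
        if row["rounds"] != (64 if row["w"] == 32 else 80):
            return False
        if row["d"] > row["state"]:            # digest is a truncation of the state
            return False
    return True
```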

As shown in FIG. 13, the SHA2 hash function 1300 receives as one input a message 1302 of any arbitrary length (e.g., possibly megabytes in length). As shown at block 1304, the message 1302 is padded to obtain a length that is an integer multiple of r bits. As discussed above with reference to FIG. 6, this padding can be performed efficiently in hardware within processor registers (e.g., the wide vector register file 316) by executing a padding instruction, rather than by means of memory moves. Padding the message 1302, and in particular the last message block of the message 1302, by executing a padding instruction also allows the padding to be applied to the end of the message 1302 in a manner that overlaps in time with the processing of message blocks by the SHA2 hash function 1300. Each of the n message blocks of length r (where r = 16×w and n is a positive integer) that make up the padded message produced by block 1304 is extracted to form one of n 16×w-bit message blocks 1306.
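For comparison with the SHA-3 case, the following Python sketch shows the padding that FIPS 180-4 prescribes for SHA-2 on byte-oriented messages: a 0x80 byte, zero bytes, and the message length in bits as a big-endian field of 2 × w bits. It is a software model of the result at block 1304 under those assumptions; the function name `sha2_pad` is illustrative and the code does not describe the patent's padding instruction.

```python
def sha2_pad(message, word_bits=64):
    """Pad a byte string to a whole number of 16*word_bits-bit message blocks.

    word_bits -- 32 for SHA2-224/256, 64 for the SHA2-384/512 family
    """
    if word_bits not in (32, 64):
        raise ValueError("word size must be 32 or 64 bits")
    block_bytes = 16 * word_bits // 8       # 64 or 128 bytes per message block
    length_bytes = 2 * word_bits // 8       # 8- or 16-byte length field
    padded = message + b"\x80"              # mandatory first padding byte
    # Zero bytes so that the length field ends exactly on a block boundary.
    zeros = -(len(padded) + length_bytes) % block_bytes
    padded += b"\x00" * zeros
    padded += (8 * len(message)).to_bytes(length_bytes, "big")
    return padded
```

As with SHA-3, a message that nearly fills its last block spills the padding into an additional block; for a 32-bit word size this happens once the final partial block exceeds 55 bytes.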

In addition to the message 1302, the SHA2 hash function 1300 also receives as an input an 8×w-bit constant value specified by SHA-2. As is known in the art, this constant value, which can be accessed from the architected register file 300, varies between SHA2 hash functions and forms an 8×w-bit initial state 1308. The initial state 1308 and the first message block (i.e., message block 1 1306) form the two inputs of SHA2 block hash function 1 1310 defined by the SHA-2 standard. As described below with reference to FIGS. 16 and 17, according to one aspect of the disclosed invention, the SHA2 block hash function 1310 is performed in hardware by executing a SHA2 hash instruction. The 8×w-bit state output by SHA2 block hash function 1 1310 forms the first input of SHA2 block hash function 2 1310, which takes the next 16×w-bit message block 2 1306 as its second input. The result of SHA2 block hash function 2 1310 forms an input to the next iteration of the SHA2 block hash function 1310. As shown, this process continues iteratively for each of the message blocks 1306 until the final, n-th iteration of the SHA2 block hash function 1310 generates and outputs an 8×w-bit final state, which is truncated by a truncation function 1312 to produce a message digest 1314 having d bits.

Referring now to FIG. 14, a time-space diagram of the SHA2 block hash function 1310 illustrated in FIG. 13 is depicted. The SHA2 block hash function 1310 accepts a 16×w-bit message block 1306 and, as shown at block 1420, initializes a 16×w-bit message schedule for the message block 1306. The SHA2 block hash function 1310 then processes the 16×w-bit message schedule through n rounds of processing in the message schedule round function 1400 (where n here denotes the number of rounds, i.e., 64 or 80 per Table II), with the 16×w-bit output of each of rounds 1 through n-2 serving as the input to the next round of message schedule processing.

As shown, the SHA2 block hash function 1310 also receives as an input the 8×w-bit current hash state (i.e., the initial state 1308 or the output of the preceding SHA2 block hash function 1310). As indicated at block 1406, the SHA2 block hash function 1310 partitions this 8×w-bit current hash state into eight w-bit variables a through h. The SHA2 block hash function 1310 then processes the current hash state through n rounds of processing by the update round function 1404. The initial update round 0 1404 takes as additional inputs the SHA-2-specified w-bit round key 0 1402 and the w high-order bits of the 16×w-bit initialization 1420 of the message schedule. Each successive iteration of the update round function 1404 takes as inputs the state generated by the preceding iteration of the update round function 1404, the w high-order bits of the 16×w-bit output of the corresponding iteration of the message schedule round function 1400, and a SHA-2-specified w-bit round key 1402. The hash state output by update round function n-1 1404 is added to the input hash state by an 8×w-bit carry-propagate addition function 1410 to generate the next hash state.
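The structure just described (message schedule, update rounds on the variables a through h, and a final addition into the input hash state) can be modeled in software. The Python sketch below does so for the 32-bit-word case as specified in FIPS 180-4 for SHA-256. It is an assumption-laden illustration: the names are hypothetical, the message schedule is kept as a growing list and not as the sliding 16×w-bit window of FIG. 14, and the round keys and initial state are derived from prime roots in place of a hard-coded table.

```python
import hashlib
from math import isqrt

M32 = 0xFFFFFFFF


def _first_primes(count):
    primes, candidate = [], 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def _icbrt(n):
    """Integer cube root (floor) by binary search."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


# Round keys: fractional bits of the cube roots of the first 64 primes.
K = [_icbrt(p << 96) & M32 for p in _first_primes(64)]
# Initial state: fractional bits of the square roots of the first 8 primes.
H0 = [isqrt(p << 64) & M32 for p in _first_primes(8)]


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & M32


def sha256_compress(state, block):
    """One block hash: 8-word state and 64-byte block in, new 8-word state out."""
    w = [int.from_bytes(block[4 * i:4 * i + 4], "big") for i in range(16)]
    a, b, c, d, e, f, g, h = state
    for t in range(64):
        if t >= 16:  # message schedule round: derive the next schedule word
            x15, x2 = w[t - 15], w[t - 2]
            s0 = rotr(x15, 7) ^ rotr(x15, 18) ^ (x15 >> 3)
            s1 = rotr(x2, 17) ^ rotr(x2, 19) ^ (x2 >> 10)
            w.append((w[t - 16] + s0 + w[t - 7] + s1) & M32)
        # update round: mix the schedule word and round key into a..h
        big_s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (h + big_s1 + ch + K[t] + w[t]) & M32
        big_s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (big_s0 + maj) & M32
        a, b, c, d, e, f, g, h = (t1 + t2) & M32, a, b, c, (d + t1) & M32, e, f, g
    # word-wise addition of the round output to the input hash state
    return [(s + v) & M32 for s, v in zip(state, (a, b, c, d, e, f, g, h))]


def sha256_single_block(message):
    """Hash a message short enough (<= 55 bytes) to fit one padded block."""
    if len(message) > 55:
        raise ValueError("message does not fit in a single padded block")
    block = (message + b"\x80" + b"\x00" * (55 - len(message))
             + (8 * len(message)).to_bytes(8, "big"))
    return b"".join(v.to_bytes(4, "big") for v in sha256_compress(H0, block))
```

Checking `sha256_single_block(b"abc")` against `hashlib.sha256(b"abc").digest()` confirms that the derived constants and the round structure agree with the standard.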

Referring now to FIG. 15, message expansion for SHA2 hash functions is illustrated according to an exemplary embodiment. As mentioned above with reference to Table II and FIG. 13, embodiments of the present invention preferably support processing of SHA2 hash functions of different word sizes w along a common data path by expanding the message words and initial hash state of those SHA2 hash functions that employ the smaller word size. This expansion can be performed, for example, at blocks 1304 and 1308 of FIG. 13. FIG. 15 illustrates a particular example in which each of the sixteen 32-bit words 1502 of a SHA2-224 or SHA2-256 input message 1500 is expanded to form a corresponding one of the sixteen 64-bit doublewords 1506 of an output message 1504. In this example, each 64-bit doubleword 1506 is formed by concatenating a 32-bit word of the input message 1500, placed in the high-order half of the 64-bit doubleword 1506, with a 32-bit zero word 1508, placed in the low-order half of the doubleword 1506. The resulting output message 1504 can then be processed by the SHA2 hash circuit in the same manner as a message that employs 64-bit words.
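The expansion of FIG. 15 amounts to interleaving each 32-bit word with a 32-bit zero word. A minimal Python sketch follows; it assumes a big-endian byte layout in which the high-order half of each doubleword comes first, and the function name `expand_words` is illustrative and not taken from the patent.

```python
def expand_words(block32):
    """Expand sixteen 32-bit words (64 bytes) into sixteen 64-bit doublewords.

    Each input word lands in the high-order half of its doubleword and the
    low-order half is filled with a 32-bit zero word.
    """
    if len(block32) != 64:
        raise ValueError("expected a 512-bit (64-byte) message block")
    out = bytearray()
    for offset in range(0, 64, 4):
        out += block32[offset:offset + 4]   # original 32-bit word (high half)
        out += b"\x00\x00\x00\x00"          # zero word (low half)
    return bytes(out)
```

Interpreted as integers, each doubleword equals the original word shifted left by 32 bits, so a 512-bit block becomes a 1024-bit block with the same layout as a block of a 64-bit-word SHA2 hash function.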

Referring now to FIG. 16, there is depicted an exemplary format for a SHA2 hash instruction 1600 in accordance with one embodiment. In one exemplary embodiment, accelerator unit 314 is configured, in response to receiving SHA2 hash instruction 1600, to execute SHA2 block hash function 1310 in hardware utilizing SHA2 hash circuit 402.

In the illustrated embodiment, SHA2 hash instruction 1600 includes an opcode field 1602 that specifies a particular architecture-specific operation code for the SHA2 block hash function. SHA2 hash instruction 1600 additionally includes one or more operand register fields 1604, 1606 for specifying the operand registers within wide vector register file 316 that hold the source and destination operands of the SHA2 block hash function. For example, in one implementation, SHA2 hash instruction 1600 includes a register field 1604 that specifies the 1024-bit register that buffers the input current hash state and that, after completion of the SHA2 block hash function, buffers the output current hash state (which overwrites the input current hash state). In addition, SHA2 hash instruction 1600 includes a register field 1606 that specifies the register buffering the current message block to be processed. SHA2 hash instruction 1600 further includes a mode field 1608 that indicates whether the SHA2 hash function to be performed employs 32-bit words or 64-bit words.

Now that the SHA2 hash functions and an exemplary instruction for implementing a portion of a SHA2 hash function have been explained, pseudocode is presented for performing an exemplary SHA2 hash function (i.e., SHA2-512) in hardware. In the SHA2-512 hash function, each message block is 1024 bits long, and the hash state and the message digest are each 512 bits long. The following pseudocode refers to the following registers:

    Rl ← message length in bits
    RL ← message length in bytes; assumed ≥ 128 bytes, so the first message block is unpadded
    Ra ← starting address of the message
    Ri ← address of the initial state
    Rb ← address of the message digest produced by the hash function
    Rd ← message digest length in bytes
    Xs ← SHA2 state             // wide vector register
    Xm ← current message block  // wide vector register

Given these registers, the pseudocode for performing the SHA2-512 hash function can be expressed as follows:

    Xs = load(Ri, 64)                    // load the 64-byte initial state
    Xm = load(Ra, 128)                   // load the first (full) message block
    Xs = sha2hash(Xs, Xm, 64-bit)        // execute the SHA2 hash instruction to perform the block hash function
    RL -= 128                            // decrement the message length remaining to be processed
    Ra += 128                            // advance the pointer to the next message block
    While (RL >= 128)                    // loop through the remaining message blocks, except the last
    {
        Xm = load(Ra, 128)               // load the next (full-size) message block
        Xs = sha2hash(Xs, Xm, 64-bit)    // execute the SHA2 hash instruction to perform the block hash function
        RL -= 128                        // decrement the message length remaining to be processed
        Ra += 128                        // advance the pointer to the next message block
    }
    Xm = loadlength(Ra, RL)              // load the last message block, if any (RL may be zero)
    Xm = sha2_EOM_pad(Xm, RL)            // append the SHA2 EOM byte to the end of the message block
    If (RL > 111) then                   // if the padding spans two message blocks, then
    {
        Xs = sha2hash(Xs, Xm, 64-bit)    // execute the SHA2 hash instruction to perform the block hash function
        Xm = force-to-zero               // and zero the last message block
    }
    Xm = sha2_EOB_pad(Xm, Rl)            // insert the EOB in the last block of the padded message
    Xs = sha2hash(Xs, Xm, 64-bit)        // execute the SHA2 hash instruction to perform the block hash function on the last message block
    Store(Xs, Rb, 64)                    // truncate the state to the leading 64 bytes of Xs to obtain the message digest and store it to memory at address Rb
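As a software cross-check of the flow above, the following Python sketch mirrors the pseudocode step by step and compares the result with `hashlib`. It is a minimal model under stated assumptions, not the patented hardware: the round constants, initial state, EOM byte (0x80), and EOB contents (the 128-bit message length in bits) are taken from FIPS 180-4 rather than from the text, the helper names are hypothetical, and, unlike the pseudocode, messages shorter than one block are also accepted.

```python
import hashlib  # used only to cross-check the sketch
import math

MASK = (1 << 64) - 1


def first_primes(count):
    primes, candidate = [], 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def icbrt(x):
    """Integer cube root by binary search."""
    lo, hi = 0, 1 << (x.bit_length() // 3 + 2)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= x:
            lo = mid
        else:
            hi = mid - 1
    return lo


# FIPS 180-4: first 64 fractional bits of cube/square roots of the first primes.
K = [icbrt(p << 192) & MASK for p in first_primes(80)]
H0 = [math.isqrt(p << 128) & MASK for p in first_primes(8)]


def rotr(x, n):
    return ((x >> n) | (x << (64 - n))) & MASK


def sha2hash(state, block):
    """Software stand-in for the SHA2 hash instruction in 64-bit mode."""
    w = [int.from_bytes(block[i:i + 8], "big") for i in range(0, 128, 8)]
    for t in range(16, 80):  # message schedule
        s0 = rotr(w[t - 15], 1) ^ rotr(w[t - 15], 8) ^ (w[t - 15] >> 7)
        s1 = rotr(w[t - 2], 19) ^ rotr(w[t - 2], 61) ^ (w[t - 2] >> 6)
        w.append((w[t - 16] + s0 + w[t - 7] + s1) & MASK)
    a, b, c, d, e, f, g, h = state
    for t in range(80):  # update rounds
        t1 = (h + (rotr(e, 14) ^ rotr(e, 18) ^ rotr(e, 41))
              + ((e & f) ^ (~e & g)) + K[t] + w[t]) & MASK
        t2 = ((rotr(a, 28) ^ rotr(a, 34) ^ rotr(a, 39))
              + ((a & b) ^ (a & c) ^ (b & c))) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    # Lane-wise addition of the round output to the input state.
    return [(x + y) & MASK for x, y in zip(state, (a, b, c, d, e, f, g, h))]


def sha512_digest(message: bytes) -> bytes:
    Rl, RL, Ra = 8 * len(message), len(message), 0
    Xs = list(H0)                                # Xs = load(Ri, 64)
    while RL >= 128:                             # all full message blocks
        Xs = sha2hash(Xs, message[Ra:Ra + 128])
        RL -= 128
        Ra += 128
    Xm = bytearray(128)                          # loadlength: zero-filled tail
    Xm[:RL] = message[Ra:Ra + RL]
    Xm[RL] = 0x80                                # sha2_EOM_pad
    if RL > 111:                                 # padding spans two blocks
        Xs = sha2hash(Xs, bytes(Xm))
        Xm = bytearray(128)                      # force-to-zero
    Xm[112:128] = Rl.to_bytes(16, "big")         # sha2_EOB_pad
    Xs = sha2hash(Xs, bytes(Xm))
    return b"".join(x.to_bytes(8, "big") for x in Xs)


assert sha512_digest(b"abc") == hashlib.sha512(b"abc").digest()
```

The threshold of 111 bytes follows from the block layout: one EOM byte plus a 16-byte length field must fit after the message bytes in a 128-byte block.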

Referring now to FIG. 17, there is illustrated a high-level block diagram of an exemplary embodiment of SHA2 hash circuit 402 of FIG. 4 that is suitable for executing SHA2 hash instruction 1600. As shown, SHA2 hash circuit 402 includes a 512-bit two-input state multiplexer 1702a, a 1024-bit two-input message multiplexer 1702b, a 512-bit state register 1704a, a 1024-bit message block register 1704b, an update working state circuit 1708, a message schedule round circuit 1710, and a control circuit 1720 that controls operation of SHA2 hash circuit 402 in response to SHA2 hash instruction 1600.

In this example, a first input of state multiplexer 1702a is coupled to receive, from the register in wide vector register file 316 specified by register field 1604 of SHA2 hash instruction 1600, the current hash state held in the 512 high-order bits of that register. A second input of state multiplexer 1702a is coupled to the output of update working state circuit 1708. Message multiplexer 1702b is similarly configured, having a first input coupled to receive a message block from the register in wide vector register file 316 specified by register field 1606 of SHA2 hash instruction 1600 and a second input coupled to receive 1024-bit round feedback from message schedule round circuit 1710. Control circuit 1720 within SHA2 hash circuit 402 provides unillustrated select signals to multiplexers 1702a, 1702b to cause multiplexers 1702a, 1702b to select the values present at their first inputs prior to update round 0 function 1404 and to select the values present at their second inputs following each of the update rounds, beginning with update round 0. The values output by multiplexers 1702a, 1702b are temporarily buffered in state register 1704a and message block register 1704b, respectively. The message block buffered in message block register 1704b forms the input of message schedule round circuit 1710, which implements message schedule round function 1400 of FIG. 14. The 64 high-order bits from message block register 1704b and the 512-bit state in state register 1704a form the two inputs of update working state circuit 1708, which is configured to perform update round function 1404 as previously described with reference to FIG. 14.

Control circuit 1720 is further configured to sequence update working state circuit 1708 through each of the n rounds utilizing the correct round index specified by the SHA-2 standard. After the last round n-1 completes, state register 1704a will hold a 512-bit hash state. Control circuit 1720 is further configured, once this output hash state is obtained, to cause single-instruction multiple-data (SIMD) adder 1712 to add the hash state from state register 1704a to the input hash state read from wide vector register file 316 and to store the result, as the next hash state, back into wide vector register file 316, as described above with respect to addition function 1410 of FIG. 14. Those skilled in the art will appreciate that, in different implementations, SIMD adder 1712 can be implemented as a dedicated component of SHA2 hash circuit 402 or as a separate pipeline that may be shared, for example, by multiple hash circuits.

Referring now to FIG. 18, there is depicted a more detailed block diagram of the exemplary update working state circuit 1708 of FIG. 17 in accordance with one embodiment. In this embodiment, the 512-bit state buffered in state register 1704a, which is received as one input of update working state circuit 1708, is partitioned into eight 64-bit variables, referred to in the SHA-2 standard as variables a through h, as shown at block 1800. Update working state circuit 1708 includes two sigma function circuits, namely, SHA2 sigma 0 circuit 1802 and SHA2 sigma 1 circuit 1806, as well as SHA2 MA circuit 1804 and SHA2 CH circuit 1808, each of which performs a respective function defined by the SHA-2 standard. Update working state circuit 1708 additionally includes three 64-bit adders 1810, 1812, and 1814. SHA2 sigma 0 circuit 1802 applies to variable a a sigma function having n (n1, n2, n3) = (28, 34, 39) and m (m1, m2, m3) = (2, 13, 22), where the n values apply to 64-bit words and the m values to 32-bit words, to produce a first input of adder 1812. Variables a, b, and c are processed by SHA2 MA circuit 1804 to produce a second input of adder 1812. SHA2 sigma 1 circuit 1806 applies to variable e a sigma function having n (n1, n2, n3) = (14, 18, 41) and m (m1, m2, m3) = (6, 11, 25) to produce a first of the five inputs of adder 1810. Variables e, f, and g are processed by SHA2 CH circuit 1808 to produce a second input of adder 1810. Adder 1810 adds the relevant round key, the round message word, and variable h to these two inputs to produce a sum that forms the first input of adder 1814 and the third input of adder 1812.

Update working state circuit 1708 generates a 512-bit result state 1816 composed of eight 64-bit variables a' through h'. Variable a' of result state 1816 is formed by the output of adder 1812; variables b', c', and d' are formed by variables a, b, and c of input state 1800, respectively; and variables f', g', and h' are formed by variables e, f, and g of input state 1800, respectively. The remaining variable e' is formed by the sum of the output of adder 1810 and variable d of input state 1800.
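One update round, as mapped onto the adders of FIG. 18, can be sketched as follows for 64-bit words. This is an illustrative model only: the function name `update_round` is hypothetical, the rotation amounts are those of FIPS 180-4, and, following FIPS 180-4, the fifth input of the five-input adder is taken to be variable h.

```python
MASK64 = (1 << 64) - 1


def rotr64(x, n):
    """Rotate a 64-bit value right by n bits."""
    return ((x >> n) | (x << (64 - n))) & MASK64


def update_round(state, round_key, round_word):
    """Apply one SHA2-384/512 update round to the state (a..h)."""
    a, b, c, d, e, f, g, h = state
    sigma0 = rotr64(a, 28) ^ rotr64(a, 34) ^ rotr64(a, 39)  # sigma 0 circuit 1802
    maj = (a & b) ^ (a & c) ^ (b & c)                       # MA circuit 1804
    sigma1 = rotr64(e, 14) ^ rotr64(e, 18) ^ rotr64(e, 41)  # sigma 1 circuit 1806
    ch = (e & f) ^ (~e & g & MASK64)                        # CH circuit 1808
    # Five-input adder 1810.
    sum_1810 = (sigma1 + ch + round_key + round_word + h) & MASK64
    a_new = (sigma0 + maj + sum_1810) & MASK64              # three-input adder 1812
    e_new = (sum_1810 + d) & MASK64                         # adder 1814
    # b', c', d' and f', g', h' are shifted copies of the input variables.
    return (a_new, a, b, c, e_new, e, f, g)
```

Only a' and e' require arithmetic; the other six output variables are pure wiring, which is why the round can be evaluated in a single pass through the three adders.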

It should be noted that the 32-bit to 64-bit expansion of the words of a SHA-2 message described above with reference to FIG. 15 does not affect (i.e., is transparent to) the design of SHA2 MA circuit 1804, SHA2 CH circuit 1808, and modular adders 1812, 1814. The trailing-zero expansion of SHA2 messages employing 32-bit words affects only SHA2 sigma circuits 1802, 1806, as described in greater detail below with reference to FIG. 19.
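The transparency claim can be checked with a short Python sketch: because an expanded word has zeros in its low-order half, 64-bit modular addition and the bitwise MA and CH functions produce exactly the expanded form of the 32-bit result. The helper names below are hypothetical and the sketch assumes the expansion layout of FIG. 15.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def expand(x):
    """Place a 32-bit word in the high half of a 64-bit doubleword."""
    return (x & MASK32) << 32


def add_mod(x, y, mask):
    return (x + y) & mask


def maj(a, b, c):
    return (a & b) ^ (a & c) ^ (b & c)


def ch(e, f, g):
    # ~e is masked by g, so the result is non-negative and fits g's width.
    return (e & f) ^ (~e & g)


def is_transparent(x, y, z):
    """True if the 64-bit operations on expanded words match the 32-bit results."""
    return (
        add_mod(expand(x), expand(y), MASK64) == expand(add_mod(x, y, MASK32))
        and maj(expand(x), expand(y), expand(z)) == expand(maj(x, y, z))
        and ch(expand(x), expand(y), expand(z)) == expand(ch(x, y, z))
    )
```

The addition case holds because a carry out of the 32-bit sum falls off the top of the 64-bit adder, just as it would in a 32-bit adder, while the zero low halves never generate a carry into the high half.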

FIG. 19 is a more detailed block diagram of an exemplary embodiment of a SHA2 sigma circuit 1900 that may be utilized to implement SHA2 sigma 0 circuit 1802 and SHA2 sigma 1 circuit 1806 of FIG. 18. SHA2 sigma circuit 1900 receives a 64-bit input variable 1902 including 32 high-order bits (bits 0 to 31) and 32 low-order bits (bits 32 to 63).

SHA2 sigma circuit 1900 includes a 64-bit rotate circuit 1904a that rotates 64-bit input variable 1902 by n1 bits (i.e., 28 bits for SHA2 sigma 0 circuit 1802 and 14 bits for SHA2 sigma 1 circuit 1806) to obtain a first 64-bit input of multiplexer 1910a. SHA2 sigma circuit 1900 additionally includes a 32-bit rotate circuit 1906a that rotates the 32 high-order bits of input variable 1902 by m1 bits (i.e., 2 bits for SHA2 sigma 0 circuit 1802 and 6 bits for SHA2 sigma 1 circuit 1806) to obtain, when concatenated with the 32 low-order bits of input variable 1902, a second 64-bit input of multiplexer 1910a. Multiplexer 1910a selects between its first and second inputs based on a mode signal determined by mode field 1608 of the relevant SHA2 hash instruction 1600. That is, multiplexer 1910a selects the first input if the mode signal indicates that mode field 1608 is set to indicate a SHA2 hash function utilizing 64-bit words and selects the second input if the mode signal indicates that mode field 1608 is set to indicate a SHA2 hash function utilizing 32-bit words.

SHA2 sigma circuit 1900 additionally includes a 64-bit rotate circuit 1904b that rotates 64-bit input variable 1902 by n2 bits (i.e., 34 bits for SHA2 sigma 0 circuit 1802 and 18 bits for SHA2 sigma 1 circuit 1806) to obtain a first 64-bit input of multiplexer 1910b. SHA2 sigma circuit 1900 also includes a 32-bit rotate circuit 1906b that rotates the 32 high-order bits of input variable 1902 by m2 bits (i.e., 13 bits for SHA2 sigma 0 circuit 1802 and 11 bits for SHA2 sigma 1 circuit 1806) to obtain, when concatenated with the 32 low-order bits of input variable 1902, a second 64-bit input of multiplexer 1910b. Multiplexer 1910b selects between its first and second inputs based on the mode signal. In particular, multiplexer 1910b selects the first input if the mode signal indicates that mode field 1608 is set to indicate a SHA2 hash function utilizing 64-bit words and selects the second input if the mode signal indicates that mode field 1608 is set to indicate a SHA2 hash function utilizing 32-bit words.

SHA2 sigma circuit 1900 also includes a 64-bit rotate/shift circuit 1908a that rotates and shifts the 64-bit input variable by n3 bits (i.e., 39 bits for SHA2 sigma 0 circuit 1802 and 41 bits for SHA2 sigma 1 circuit 1806) to obtain a first 64-bit input of multiplexer 1910c. SHA2 sigma circuit 1900 additionally includes a 32-bit rotate/shift circuit 1908b that rotates and shifts the 32 high-order bits of input variable 1902 by m3 bits (i.e., 22 bits for SHA2 sigma 0 circuit 1802 and 25 bits for SHA2 sigma 1 circuit 1806) to obtain, when concatenated with the 32 low-order bits of input variable 1902, a second 64-bit input of multiplexer 1910c. Multiplexer 1910c selects between its first and second inputs based on the mode signal. Like multiplexers 1910a, 1910b, multiplexer 1910c selects the first input if the mode signal indicates that mode field 1608 is set to indicate a SHA2 hash function utilizing 64-bit words and selects the second input if the mode signal indicates that mode field 1608 is set to indicate a SHA2 hash function utilizing 32-bit words.

The 64-bit outputs of multiplexers 1910a, 1910b, and 1910c form the inputs of a three-input 64-bit bitwise XOR circuit 1912, which performs a bitwise XOR of its three inputs to generate a 64-bit output 1914. Those skilled in the art will appreciate that, in some embodiments of SHA2 sigma circuit 1900, the functions of rotate circuits 1904a-1904b and 1906a-1906b and of rotate/shift circuits 1908a-1908b can be realized through appropriate wiring, allowing SHA2 sigma circuit 1900 to be implemented with the three multiplexers 1910a-1910c and the three-way bitwise XOR circuit 1912 and without explicit rotate and shift circuitry.
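The mode-selected sigma circuit of FIG. 19 can be modeled in Python as three multiplexers feeding an XOR. This is a sketch under assumptions: the names `sha2_sigma`, `SIGMA0`, and `SIGMA1` are hypothetical, the rotation amounts are those of FIPS 180-4, and the rotate/shift circuits are modeled as pure rotations, as FIPS 180-4 defines for the sigma functions used in the update round.

```python
MASK32 = (1 << 32) - 1

# (64-bit rotation amounts n, 32-bit rotation amounts m) per FIPS 180-4.
SIGMA0 = ((28, 34, 39), (2, 13, 22))
SIGMA1 = ((14, 18, 41), (6, 11, 25))


def rotr(x, n, width):
    """Rotate a width-bit value right by n bits."""
    mask = (1 << width) - 1
    return ((x >> n) | (x << (width - n))) & mask


def sha2_sigma(x, n, m, mode64):
    """Model of the sigma circuit for a 64-bit input variable x.

    In 64-bit mode each multiplexer passes the 64-bit rotation; in 32-bit
    mode it passes the rotated high half concatenated with the low half.
    """
    high, low = x >> 32, x & MASK32
    out = 0
    for n_i, m_i in zip(n, m):
        first = rotr(x, n_i, 64)                    # first multiplexer input
        second = (rotr(high, m_i, 32) << 32) | low  # second multiplexer input
        out ^= first if mode64 else second          # multiplexer, then XOR
    return out
```

For an expanded 32-bit word the low half is zero in all three multiplexer outputs, so the XOR leaves the low half zero and the high half equal to the native 32-bit sigma result.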

Referring now to FIG. 20, there is depicted a high-level logical flowchart of an exemplary process for executing SHA2 hash instruction 1600 in accordance with one embodiment. For ease of understanding, the process of FIG. 20 is described with reference to the exemplary embodiment of SHA2 hash circuit 402 illustrated in FIGS. 17-19.

The process of FIG. 20 begins at block 2000 and then proceeds to block 2002, which illustrates SHA2 hash circuit 402 receiving a SHA2 hash instruction 1600 that specifies a particular SHA2 mode (i.e., 32-bit or 64-bit word size) as well as a state register and a message block register within wide vector register file 316. In response to receiving SHA2 hash instruction 1600, control circuit 1720 causes the 512-bit state and the 1024-bit message block to be read out of wide vector register file 316 and loaded into state register 1704a and message block register 1704b via multiplexers 1702a, 1702b, respectively (block 2002). Control circuit 1720 additionally initializes an internal round counter to 0 (block 2004).

The process then proceeds from block 2004 to block 2006, which illustrates control circuit 1720 directing message schedule round circuit 1710 to perform an iteration of message schedule round function 1400 utilizing the message block buffered in message block register 1704b. In addition, control circuit 1720 directs update working state circuit 1708 to perform an iteration of update round function 1404 based on the appropriate round index, the 64 high-order bits of message block register 1704b, and the input hash state from state register 1704a. The results of the processing by update working state circuit 1708 and message schedule round circuit 1710 are returned to registers 1704a, 1704b by multiplexers 1702a, 1702b, respectively. Control circuit 1720 additionally advances the round counter. At block 2010, control circuit 1720 determines, by reference to the round counter, whether SHA2 hash circuit 402 has performed the last round of processing specified by the SHA-2 standard. As noted in Table II, SHA2 hash circuit 402 performs 64 rounds of processing for SHA2 hash functions employing 32-bit words and 80 rounds of processing for SHA2 hash functions employing 64-bit words. If control circuit 1720 determines at block 2010 that at least one additional round of processing remains to be performed, the process returns to block 2006, which has been described. However, in response to determining at block 2010 that all rounds of processing are complete, control circuit 1720 causes the previous state to again be read from wide vector register file 316 and added by SIMD adder 1712 to the final state buffered in state register 1704a (block 2012). Control circuit 1720 then directs storage of the resulting next state back into wide vector register file 316 (block 2014). Thereafter, the process of FIG. 20 ends at block 2016.

As discussed above with reference to block 600 of FIG. 6 and block 1304 of FIG. 13, messages processed by the SHA2 and SHA3 hash functions are padded to produce a message whose length is an even multiple (i.e., an integer multiple) of the block length of r bits. FIG. 21A depicts an exemplary unpadded message 2100 having a total length of L bits and including n message blocks. Of these, the first n-1 message blocks include r bits, but the final message block n includes k bits, where k < r. As shown in FIG. 21B, in the general case, message 2100 is padded by appending r-k padding bits to the end of message block n, yielding n message blocks that are all r bits in length.

The content of the padding bits appended to obtain the padded message can vary depending on the hash function under consideration. For example, in the SHA2 and SHA3/SHAKE hash algorithms discussed herein, the padding bits will include bytes marking both the end of the unpadded portion of the message (i.e., an end-of-message (EOM) marker) and the end of the last block of the padded message (i.e., an end-of-block (EOB) marker). As explained further below, in some cases the padding bits, including the EOM and EOB markers, can all be included within the message block containing the final message bytes; in other cases, the addition of the padding bits may require an additional message block to be appended to the message. In either case, the disclosed invention preferably performs message padding in processor registers through execution of one or more instructions rather than through high-latency memory move operations that transfer the message between two locations in memory.

In at least some architectures, load-store unit 224, memory controller 112, and/or system interconnect 110 are not constructed to support data transfers of lengthy data objects (e.g., full r-bit SHA3/SHAKE and SHA2 message blocks) between system memory 114 and wide vector register file 316. In such architectures, a message block is transferred in multiple smaller chunks into a narrower register file and is then transferred from the narrower register file into one or more wide vector registers of wide vector register file 316. For example, FIG. 22A illustrates an example of assembling a SHA3/SHAKE message block n in architected register file 300, which includes 256-bit registers r0 through rS 301. In this example, a load length instruction is executed, for example, by load-store unit 224 of FIG. 2, to load the 1152-bit SHA3-224 message block n as five 256-bit chunks into registers r0 through r7 and to zero any register bytes that do not contain message data. Given the 1152-bit length of a message block in SHA3-224, the message bytes within message block n will, at most, completely fill registers r0 through r3 plus the leading 128 bits of register r4 (of course, the final message block of the unpadded message may contain fewer than r bits). At least the remaining 128 bits of register r4 and all of registers r5 through r7 can be zeroed, either automatically by execution of the load length instruction or by execution of standard load instructions. (Zero-filling registers r6 and r7 is required only for a generic SHA3 function applicable to any of the supported message block lengths.) Additional data transfer instructions can then be executed by data transfer circuit 406 or transfer unit 320 to transfer the contents of registers r0 through r7 into registers R0 and R1 317 of wide vector register file 316, which includes registers R0 through RT, each having the exemplary width of 1024 bits discussed above. In an alternative implementation, the same result can be achieved by loading four registers 301 within architected register file 300 to buffer chunks n1 through n4 and then reusing the same registers 301 on a subsequent cycle to buffer chunks n5 through n8.

FIG. 22B depicts a similar example, which shows a 1024-bit SHA2 message block n being transferred to a wide vector register 317 in wide vector register file 316 after the message block has been assembled in registers 301 of architected register file 300. In this example, a load length instruction is executed, for example, by load-store unit 224 of FIG. 2, to load four 256-bit chunks of SHA2 message block n into registers r2 through r5 of architected register file 300 and to zero any register bytes that do not contain message data. Additional data transfer instructions can then be executed by data transfer circuit 406 to transfer the contents of registers r2 through r5 into register R0 317 of wide vector register file 316. In an alternative implementation, the same result can be achieved by loading two registers 301 within architected register file 300 to buffer chunks n1 and n2 and then reusing the same registers 301 on a subsequent cycle to buffer chunks n3 and n4.
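The two-stage staging of FIGS. 22A and 22B can be modeled with byte strings standing in for registers. The sketch below is illustrative only: `load_length` and `transfer_to_wide` are hypothetical names for the load length and transfer steps, and registers are modeled as 32-byte (256-bit) and 128-byte (1024-bit) values.

```python
NARROW_BYTES = 32   # 256-bit architected register
WIDE_BYTES = 128    # 1024-bit wide vector register


def load_length(message_block: bytes, num_regs: int):
    """Fill num_regs narrow registers, zeroing bytes that hold no message data."""
    capacity = num_regs * NARROW_BYTES
    if len(message_block) > capacity:
        raise ValueError("message block does not fit in the given registers")
    padded = message_block.ljust(capacity, b"\x00")
    return [padded[i:i + NARROW_BYTES] for i in range(0, capacity, NARROW_BYTES)]


def transfer_to_wide(narrow_regs):
    """Concatenate each group of four narrow registers into one wide register."""
    per_wide = WIDE_BYTES // NARROW_BYTES
    if len(narrow_regs) % per_wide:
        raise ValueError("need a multiple of four narrow registers")
    return [
        b"".join(narrow_regs[i:i + per_wide])
        for i in range(0, len(narrow_regs), per_wide)
    ]
```

For a 144-byte (1152-bit) SHA3-224 block staged in eight narrow registers, the result is two wide registers, the second of which holds 16 message bytes followed by zeros; a 128-byte SHA2 block staged in four narrow registers fills exactly one wide register.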

In at least some preferred embodiments, the process given in FIGS. 22A-22B for loading message blocks into wide vector register file 316 is performed for all message blocks of a SHA3/SHAKE or SHA2 message, including message block n, which is the last message block of the unpadded message. As explained below, the end of the message can then be padded, at least partially within wide vector register file 316, through execution of one or more instructions.

FIGS. 23A-23D depict various padding cases for SHA3/SHAKE messages of various lengths. According to the SHA-3 standard, each message must include EOM padding marking the EOM. Under the SHA-3 standard, the EOM padding has a fixed value of x06 for the SHA3 hash functions and a fixed value of x1F for the SHAKE hash functions. The location of the EOM padding within the padded message varies with the message length, which is often unknown at compile time. The SHA-3 standard further mandates that the last byte of each padded message be a fixed-value EOB padding byte.

As shown in FIG. 23A, if the last message block 2300 of a SHA3/SHAKE message includes more than two bytes that do not contain message data, EOM padding byte 2302 is inserted into the zeroed bytes of the relevant wide vector register 317 immediately following the last message byte 2306, and EOB padding byte 2304 is inserted into the zeroed bytes of wide vector register 317 as the last byte of the padded message block.

FIG. 23B illustrates a similar second case in which the last message block 2300' of the SHA3/SHAKE message includes exactly two zeroed bytes that do not contain message data. In this case, the last two zeroed bytes of last message block 2300' are replaced with EOM padding byte 2302 followed by EOB padding byte 2304.

FIG. 23C depicts a third case in which the last message block 2300'' of the SHA3/SHAKE message includes only a single zeroed byte following the last message byte 2306. In this case, execution of the padding instruction described below causes the EOM and EOB padding values to be ORed together and inserted into the final byte of padded message block 2300'' as EOM/EOB padding byte 2308.

FIG. 23D illustrates the final case, in which last message byte 2306 of the SHA3/SHAKE message is the last byte of message block 2310. Because message block 2310 has no capacity for the required EOM and EOB padding in this case, an additional zeroed message block 2312 is appended to the message (e.g., by executing a load length instruction). EOM padding byte 2302 is inserted into zeroed message block 2312 as its first byte, and EOB padding byte 2304 is inserted into zeroed message block 2312 as its last byte. It should be noted that in each of the four cases depicted in FIGS. 23A-23D, both the EOM padding and the EOB padding can advantageously be applied by a single padding instruction, because the EOM padding and the EOB padding always fall within the same message block. It should also be understood that although FIGS. 23A-23D depict padding applied to messages that include an integer number of message bytes, padding can similarly be applied to bit messages that do not include an integer number of bytes.
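The four cases of FIGS. 23A-23D collapse into one rule in software, which may help when checking the behavior of the padding instruction. The following Python sketch is illustrative only and is not the circuit described herein. The function name `pad_sha3_tail` and its parameters are hypothetical, and the EOB value 0x80 is an assumption taken from the byte-level padding rule of the SHA-3 standard, since the text above calls it only a fixed value.

```python
EOM_SHA3 = 0x06   # EOM value for SHA3, as stated in the text
EOM_SHAKE = 0x1F  # EOM value for SHAKE, as stated in the text
EOB = 0x80        # assumption: last-byte value from the SHA-3 byte-level padding rule


def pad_sha3_tail(tail: bytes, block_len: int, eom: int) -> bytes:
    """Pad the trailing message bytes (fewer than block_len) to one full block.

    An empty tail models FIG. 23D: the message ended on a block boundary,
    so the padding lands in an additional, otherwise zeroed block.
    """
    if not 0 <= len(tail) < block_len:
        raise ValueError("tail must be shorter than one block")
    block = bytearray(block_len)   # zeroed block, as left by the load length step
    block[:len(tail)] = tail
    block[len(tail)] |= eom        # EOM immediately after the last message byte
    block[-1] |= EOB               # EOB in the last byte; ORs with EOM if they coincide
    return bytes(block)
```

With a 136-byte block and 135 message bytes (the FIG. 23C case), the last byte of the result is the OR of the two padding values, for example 0x1F | 0x80 for SHAKE.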

In one embodiment, padding of SHA3/SHAKE messages of arbitrary length, as shown in FIGS. 23A-23D, can be implemented with three instructions: (1) a load length instruction that stages the final message block of the padded message in a specified register 301 in architected register file 300; (2) a transfer instruction that transfers the message block from register 301 in architected register file 300 to one or more wide vector registers 317 in wide vector register file 316, as shown in FIG. 22A; and (3) a padding instruction that inserts the EOM and EOB padding at the appropriate byte positions in the final message block of the padded SHA3/SHAKE message held in wide vector register 317. Of course, in an alternative implementation, two different instructions could be used to insert the EOM padding and the EOB padding into the final message block. However, for single-block messages, such as those commonly used in post-quantum cryptographic schemes, adding an additional padding instruction increases latency and undesirably reduces hashing performance.

FIGS. 24A-24D depict various padding cases for SHA2 messages of various lengths. Under the SHA-2 standard, every message must include one EOM padding byte with the value x80 in the byte immediately following the last message byte. The position of the EOM padding byte within the padded message therefore varies with the message length. The SHA-2 standard further mandates that the last two words (i.e., two 32-bit words or two 64-bit words, depending on the SHA2 hash function in question (see Table II)) contain EOB padding that specifies the length of the unpadded message in bits.

In the first case, illustrated in FIG. 24A, the last message block 2400 of the SHA2 message includes more than two words plus one byte that do not contain message data. In this case, last message block 2400 is padded by inserting EOM padding byte 2402 into the zeroed byte of the associated wide vector register 317 immediately following last message byte 2406 and by inserting two EOB padding words 2404 as the last two words of last message block 2400.

FIG. 24B illustrates a similar second case, in which the last message block 2400′ of the SHA2 message includes exactly two words plus one byte that do not contain message data. In this case, last message block 2400′ is padded by inserting EOM padding byte 2402 into the zeroed byte of the associated wide vector register 317 immediately following last message byte 2406 and by inserting two EOB padding words 2404 as the last two words of last message block 2400′.

FIG. 24C depicts a third case, in which the last message block 2400″ of the unpadded SHA2 message includes too few bytes free of message data to accommodate both EOM padding byte 2402 and the two EOB padding words 2404. In this case, the SHA2 message is padded by inserting EOM padding byte 2402 into the zeroed byte of the associated wide vector register 317 immediately following last message byte 2406. Because EOB padding words 2404 do not fit within message block 2400″, an additional zeroed message block 2408 is appended to the message (e.g., by executing a load length instruction). EOB padding words 2404 are then inserted as the last two words of message block 2408.

FIG. 24D illustrates a fourth case, in which last message byte 2406 of the SHA2 message forms the last byte of a complete message block 2410. Because message block 2410 has no capacity for EOM or EOB padding, an additional zeroed message block 2412 is appended to the SHA2 message. Additional message block 2412 includes EOM padding byte 2402 as its first byte, followed by a number of zeroed bytes, and finally the two EOB padding words 2404 at the end of message block 2412.
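The SHA2 cases of FIGS. 24A-24D can likewise be expressed as a short software model. The Python sketch below is illustrative only; `pad_sha2_tail` and its parameters are hypothetical names, and the block size of sixteen words and the big-endian encoding of the bit length are taken from the SHA-2 standard rather than from the text above.

```python
def pad_sha2_tail(tail: bytes, total_len: int, word_bytes: int) -> bytes:
    """Return the final one or two padded blocks of a SHA-2 message.

    tail       -- message bytes left over after the last complete block
    total_len  -- length of the whole unpadded message in bytes
    word_bytes -- 4 for the 32-bit-word variants, 8 for the 64-bit-word variants
    """
    block_len = 16 * word_bytes   # 64-byte or 128-byte message blocks
    eob_len = 2 * word_bytes      # the two EOB words that carry the bit length
    if not 0 <= len(tail) < block_len:
        raise ValueError("tail must be shorter than one block")
    # FIG. 24C: if EOM plus the EOB words do not fit, append a zeroed block.
    n_blocks = 1 if len(tail) + 1 + eob_len <= block_len else 2
    out = bytearray(n_blocks * block_len)
    out[:len(tail)] = tail
    out[len(tail)] = 0x80                                      # EOM padding byte
    out[-eob_len:] = (8 * total_len).to_bytes(eob_len, "big")  # EOB padding words
    return bytes(out)
```

An empty tail corresponds to FIG. 24D, where the EOM byte becomes the first byte of the additional block; a tail that leaves exactly two words plus one byte free corresponds to FIG. 24B and still fits in a single block.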

In one embodiment, padding of SHA2 messages of arbitrary length can be implemented with as few as four instructions: (1) a load length instruction that places the final message block of the SHA2 message in a specified register 301 in architected register file 300 and zeroes any register bytes that do not contain message bytes; (2) an insert word instruction that places the two EOB padding words 2404 in the appropriate bytes of register 301 in architected register file 300 to mark the end of the padded message; (3) a transfer instruction that transfers the contents of the register 301 buffering the message block from architected register file 300 to a wide vector register 317 in wide vector register file 316; and (4) a padding instruction that inserts EOM padding byte 2402 at the appropriate location in wide vector register 317. In this embodiment, execution of the padding instruction inserts EOM padding byte 2402 but not EOB padding words 2404 because (1) EOM padding byte 2402 and EOB padding words 2404 may be located in different message blocks, and (2) EOB padding words 2404 can be efficiently placed in the appropriate register 301 within architected register file 300 using an existing insert word instruction. Of course, in an alternative embodiment, both EOM padding byte 2402 and EOB padding words 2404 can be applied to the SHA2 message block in register 301 of architected register file 300.

Referring now to FIG. 25, an exemplary padding instruction 2500 in accordance with one embodiment is illustrated. In at least one embodiment, exemplary padding instruction 2500 can be executed by accelerator unit 314 within data transfer circuit 406 to perform padding for both SHA3/SHAKE message blocks and SHA2 message blocks.

In the illustrated example, padding instruction 2500 includes an opcode field 2502 that specifies the architecture-specific operation code for the message padding instruction. The padding instruction additionally includes two register fields 2504, 2506 that specify the storage locations of the source and destination operands of the padding operation. For example, register 1 field 2504 may identify the target wide vector register 317 within wide vector register file 316 that buffers the message block to be padded, and register 2 field 2506 may specify the register 301 within architected register file 300 that holds the remaining message length in bytes.

Padding instruction 2500 further includes a mode field 2508 that provides information used to pad the message. In one exemplary embodiment, mode field 2508 includes at least three subfields: a hash identifier (HID) subfield 2510, a block length (BL) subfield 2512, and an extension (E) subfield 2514. HID subfield 2510 indicates the type of hash function to be applied to the message block. For example, in one implementation, HID subfield 2510 may include two bits that specify one of the following hash types: SHA3, SHAKE, SHA2 (64-bit words), and SHA2 (32-bit words). BL subfield 2512 indicates (possibly when interpreted together with HID subfield 2510) the length of the message block in bytes. E subfield 2514 indicates whether the wide vector register 317 specified by register 1 field 2504 holds the leading segment S0 or the trailing segment S1 of the message block. For example, in an embodiment in which wide vector registers 317 are 1024 bits wide, E subfield 2514 may have the value b0 if the wide vector register 317 specified by register 1 field 2504 does not hold the trailing segment of the message block and the value b1 if the specified wide vector register 317 holds the trailing segment of the message block. Of course, in other embodiments in which wide vector registers 317 have a different width (e.g., 512 bits), E subfield 2514 may include additional bits to specify additional register segments.

Referring now to FIG. 26, an exemplary padding circuit 2600 in accordance with one embodiment is illustrated. Padding circuit 2600, which may be implemented, for example, as part of data transfer circuit 406 of accelerator unit 314, pads the message segment S1 held in a target wide vector register in response to execution of a padding instruction 2500 as shown in FIG. 25. The illustrated example assumes that wide vector register file 316 has 1024-bit wide vector registers 317.

In this exemplary embodiment, padding circuit 2600 includes a select EOM circuit 2602 that selects the value of EOM padding byte 2302 or 2402 (i.e., eom_byte) based on the hash function specified by HID subfield 2510 of padding instruction 2500. Padding circuit 2600 also includes a select EOB circuit 2604 that, in a similar manner, selects the value of the EOB padding byte (i.e., eob_byte) to be inserted by padding instruction 2500 based on HID subfield 2510. In the described embodiment, for the SHA3/SHAKE hash functions, select EOB circuit 2604 selects the fixed eob_byte value specified by the SHA-3 standard, which value is contained in the register indicated by register 2 field 2506. For the SHA2 hash functions, select EOB circuit 2604 selects a zero eob_byte because, in this embodiment, EOB padding words 2404 are inserted by a separate instruction. Padding circuit 2600 further includes a select BL size circuit 2606 that selects and outputs an 8-bit block length value based on HID subfield 2510 and BL subfield 2512 of padding instruction 2500.

The 8-bit block length value output by select BL size circuit 2606 is received by an EOB enable circuit 2608, which includes a comparator 2610, a decoder 2612, and a bitwise AND circuit 2614. The high-order bit of the 8-bit block length value indicates whether the length of the message block exceeds the width of a 1024-bit wide vector register 317 (as is the case, for example, for SHA3-224, SHAKE-128, and SHAKE-256). The low-order seven bits of the block length form a block length size (bl_size), which indicates the number of bytes in the segment of the message block buffered in the target wide vector register 317 identified by register 1 field 2504. Decoder 2612 decodes the 7-bit bl_size value to obtain a 128-bit representation of the location of the end of the message block within target wide vector register 317. Comparator 2610 compares the high-order bit of the 8-bit block length with E subfield 2514 of padding instruction 2500 to form a 1-bit indication of whether EOB padding is to be added to the segment of the message block buffered in the target wide vector register (i.e., whether target wide vector register 317 buffers the trailing segment S1 of the message block). Bitwise AND circuit 2614 then logically combines this 1-bit indication with the 128-bit output of decoder 2612 to produce a 128-bit EOB enable signal (eob_en(0:127)) that identifies the byte (if any) of the message segment buffered in target wide vector register 317 into which EOB padding is to be inserted.

Still referring to FIG. 26, padding circuit 2600 further includes an EOM enable circuit 2620, which includes a selection circuit 2620, a comparator 2622, a decoder 2624, and a bitwise AND circuit 2626. In the depicted example, selection circuit 2620 is implemented by a two-input multiplexer having a first input coupled to receive an 8-bit indication of the message length and a second input coupled to receive an expanded message length applicable to the SHA-2 hash functions that employ 32-bit words. The expanded message length value at the second input doubles the original length of the message by inserting b0 between bits 5 and 6 of the original length, in accordance with the equation EX_LEN = 4*(LEN/4) + LEN, where the division is an integer division. This technique preserves the bit positions of original bits 6:7, which indicate the byte position of the final message byte (if any) within the 32 high-order bits of a word of the expanded message block. Selection circuit 2620 selects the first of its two 8-bit inputs if HID subfield 2510 indicates that the hash function is a SHA3/SHAKE hash function or a SHA2 hash function employing 64-bit words, and instead selects the second of its two inputs if HID subfield 2510 indicates that the hash function is a SHA2 hash function employing 32-bit words.
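The equivalence between the equation EX_LEN = 4*(LEN/4) + LEN and the insertion of a zero bit can be checked with a few lines of Python. This is a sketch under the assumption that the division in the equation is an integer division; the function names are hypothetical. Bits 6:7 in the text's numbering (bit 0 being the most significant of eight bits) are the two least significant bits here.

```python
def expanded_len(length: int) -> int:
    """EX_LEN = 4*(LEN/4) + LEN, with integer division."""
    return 4 * (length // 4) + length


def insert_zero_bit(length: int) -> int:
    """Same value, built by inserting a 0 bit just above the two low-order bits."""
    word_index = length >> 2     # which 32-bit word the byte offset falls in
    byte_in_word = length & 0b11 # offset within that word, left unchanged
    return (word_index << 3) | byte_in_word
```

Writing LEN as 4q + r with r in 0..3, both functions return 8q + r: the word index is doubled in scale while the byte offset inside the word is preserved, which is why the two low-order bits keep their positions.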

The 8-bit length value output by selection circuit 2620 includes a high-order bit indicating whether the length exceeds the width of a 1024-bit wide vector register 317 of wide vector register file 316, and seven low-order bits indicating the number of bytes in the segment of the message block buffered in the target wide vector register 317 identified by register 1 field 2504. Decoder 2624 decodes the seven low-order bits to obtain a 128-bit representation of the byte position (if any) within target wide vector register 317 at which the end-of-message byte is to be inserted. Comparator 2622 compares the high-order bit of the length value output by selection circuit 2620 with E subfield 2514 of padding instruction 2500 to form a 1-bit indication of whether EOM padding is to be added to the segment of the message block buffered in target wide vector register 317. Bitwise AND circuit 2626 then logically combines this 1-bit indication with the 128-bit output of decoder 2624 to produce a 128-bit EOM enable signal (eom_en(0:127)) that identifies the byte (if any) of the message segment buffered in target wide vector register 317 into which EOM padding is to be inserted.
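The comparator, decoder, and bitwise AND stages can be modeled behaviorally as follows. This Python sketch is one reading of the description and not the circuit itself: `enable_vector` is a hypothetical name, and the decoder is assumed to produce a one-hot vector whose set position equals the seven low-order bits, which suits the EOM case. The EOB decoder may instead point at the last byte of the block, one position lower.

```python
SEG_BYTES = 128  # bytes in one 1024-bit wide vector register


def enable_vector(value8: int, e: int) -> list[int]:
    """Return a 128-entry enable vector for an 8-bit length value and E bit."""
    high = (value8 >> 7) & 1         # which segment the byte position falls in
    pos = value8 & 0x7F              # byte position inside that segment
    match = 1 if high == e else 0    # comparator: is this the segment being padded?
    one_hot = [1 if i == pos else 0 for i in range(SEG_BYTES)]  # decoder
    return [bit & match for bit in one_hot]                     # bitwise AND
```

For a remaining message length of 130 bytes, for example, the vector is all zeros when E is b0 and has a single one at position 2 when E is b1.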

The EOB enable signal eob_en(0:127), the EOM enable signal eom_en(0:127), eom_byte, eob_byte, and the message segment from target wide vector register 317 are all passed to a conditional OR circuit 2630, which conditionally inserts the EOM and/or EOB padding into the message segment to obtain a padded message segment Sp. The padded message segment Sp is then stored back to the target wide vector register 317 specified in register 1 field 2504.

Referring now to FIG. 27, an exemplary embodiment of conditional OR circuit 2630 of FIG. 26 is illustrated. In this example, each of the 128 bytes of the message segment has a respective associated OR gate 2700 having three 8-bit inputs. The first input of OR gate 2700 is coupled to receive the respective byte of message segment S. The second input of OR gate 2700 is coupled to the output of a two-input AND gate 2702, which qualifies eom_byte with eom_en() for the given byte of message segment S. The third input of OR gate 2700 is coupled to the output of a two-input AND gate 2704, which qualifies eob_byte with eob_en() for the given byte of message segment S. OR gate 2700 performs a logical OR of these three inputs and writes the resulting byte of padded message segment Sp to the target wide vector register 317 in wide vector register file 316. Thus, if neither eom_en() nor eob_en() is asserted for a given byte of message segment S, the associated OR gate 2700 simply writes the byte of input message segment S to the corresponding byte of padded message segment Sp. If, however, one or both of eom_en() and eob_en() are asserted for a given byte of message segment S, the associated OR gate 2700 writes eom_byte, eob_byte, or their logical combination into the corresponding byte of padded message segment Sp, as indicated by enable signals eom_en() and eob_en().
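The per-byte AND/OR structure of FIG. 27 maps directly onto a short loop. The following Python sketch models it in software; the function name `conditional_or` and the use of lists of 0/1 values for the enable signals are assumptions made for illustration.

```python
def conditional_or(segment: bytes, eom_en, eob_en, eom_byte: int, eob_byte: int) -> bytes:
    """Model of the conditional OR stage: one OR gate per byte of the segment."""
    padded = bytearray(len(segment))
    for i, s_byte in enumerate(segment):
        eom_term = eom_byte if eom_en[i] else 0   # AND gate qualifying eom_byte
        eob_term = eob_byte if eob_en[i] else 0   # AND gate qualifying eob_byte
        padded[i] = s_byte | eom_term | eob_term  # three-input OR gate
    return bytes(padded)
```

Because the padding is ORed into the segment rather than written over it, the result is only correct when the enabled byte positions were zeroed beforehand, which the load length instruction described above provides. When both enables are asserted for the same byte, the output is the OR of the two padding values, as in FIG. 23C.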

Referring now to FIG. 28, a high-level logical flowchart of an exemplary process for padding a message block in accordance with one embodiment is depicted. The illustrated process can be performed by accelerator unit 314 in response to receipt of a padding instruction 2500. For ease of understanding, the process is described below with reference to the exemplary padding circuit depicted in FIGS. 26-27.

The process of FIG. 28 begins at block 2800 and then proceeds to block 2802, which illustrates accelerator unit 314 receiving a padding instruction 2500 for execution. In response to receipt of padding instruction 2500, accelerator unit 314 first accesses the source operands specified by register fields 2504, 2506 of padding instruction 2500 (block 2804). In particular, accelerator unit 314 reads message segment S from the target wide vector register 317 in wide vector register file 316 specified by register 1 field 2504, reads the unpadded message length from the register 301 in architected register file 300 specified by register 2 field 2506, and passes these operands to padding circuit 2600 of FIG. 26, which, as noted above, may be implemented within data transfer circuit 406. At block 2806, padding circuit 2600 uses mode field 2508 of padding instruction 2500 to select the parameters of the padding operation. In particular, select EOM circuit 2602 selects the value of EOM padding byte (eom_byte) 2302 or 2402 based on the hash function specified by mode field 2508; select EOB circuit 2604 selects the value of the EOB padding byte (eob_byte) to be inserted by padding instruction 2500 (i.e., the fixed value for SHA3/SHAKE and a zero byte for SHA2, since for SHA2 the EOB padding words 2404 are applied by a separate instruction); and select BL size circuit 2606 selects the block length based on HID subfield 2510 and BL subfield 2512. The eom_byte selected by select EOM circuit 2602 and the eob_byte selected by select EOB circuit 2604 form inputs to conditional OR circuit 2630.

At block 2808, selection circuit 2620 determines, based on the HID subfield of mode field 2508, whether the hash function to be applied to the message is one of the SHA2-224 or SHA2-256 hash functions, which employ 32-bit words. If not, selection circuit 2620 selects and outputs, as the length of the message, the message length read from the register 301 identified by register 2 field 2506, and the process of FIG. 28 proceeds to block 2812, which is described below. If, however, selection circuit 2620 determines at block 2808 that HID subfield 2510 of padding instruction 2500 indicates a SHA2 hash function employing 32-bit words, selection circuit 2620 selects and outputs a doubled length for the SHA2 message to account for the message expansion described above with reference to FIG. 15 (block 2810). In one implementation, the expanded SHA2 message length can conveniently be computed as 4*(LEN/4) + LEN. The process then proceeds from block 2810 to block 2812.

Block 2812 illustrates EOM enable circuit 2620 determining whether EOM padding is to be placed in the current message segment. If not, the EOM enable vector eom_en(0:127) generated by EOM enable circuit 2620 is all zeros, and no EOM padding is inserted into message segment S. The process therefore passes to block 2816, which is described below. If, however, EOM enable circuit 2620 determines at block 2812 that EOM padding is to be inserted into message segment S, EOM enable circuit 2620 generates an EOM enable vector eom_en(0:127) that identifies the byte of message segment S at which the EOM padding byte is to be inserted, and conditional OR circuit 2630 inserts the EOM padding byte into the specified byte of padded message segment Sp (block 2814). The process proceeds from block 2814 to block 2816.

At block 2816, select BL size circuit 2606 and EOB enable circuit 2608 determine whether the hash function specified by mode field 2508 of padding instruction 2500 is a SHA3 or SHAKE hash function and an EOB padding byte is to be inserted into message segment S. If not, the EOB enable vector eob_en(0:127) generated by EOB enable circuit 2608 is all zeros, and no EOB padding is inserted into message segment S. The process therefore passes from block 2816 to block 2820, which is described below. If, however, select BL size circuit 2606 and EOB enable circuit 2608 determine at block 2816 that the hash function specified by mode field 2508 is a SHA3 or SHAKE hash function and that EOB padding is to be inserted into message segment S, EOB enable circuit 2608 generates an EOB enable vector eob_en(0:127) that identifies the byte of message segment S at which the EOB padding byte is to be inserted, and conditional OR circuit 2630 inserts the EOB padding byte into the specified byte of padded message segment Sp (block 2818). The process then passes to block 2820.

Block 2820 illustrates data transfer circuit 406 writing the resulting padded message segment Sp into the target wide vector register 317 specified by register 1 field 2504. Thereafter, the process of FIG. 28 ends at block 2822.

Referring now to FIG. 29, a block diagram of an exemplary design flow 2900 used, for example, in semiconductor IC logic design, simulation, test, layout, and manufacture is illustrated. Design flow 2900 includes processes, machines, and/or mechanisms for processing design structures or devices to generate logically or otherwise functionally equivalent representations of the design structures and/or devices described above and shown herein. The design structures processed and/or generated by design flow 2900 may be encoded on machine-readable transmission or storage media to include data and/or instructions that, when executed or otherwise processed on a data processing system, generate a logically, structurally, mechanically, or otherwise functionally equivalent representation of hardware components, circuits, devices, or systems. Machines include, but are not limited to, any machine used in an IC design process, such as designing, manufacturing, or simulating a circuit, component, device, or system. For example, machines may include: lithography machines, machines and/or equipment for generating masks (e.g., e-beam writers), computers or equipment for simulating design structures, any apparatus used in the manufacturing or test process, or any machines for programming functionally equivalent representations of the design structures into any medium (e.g., a machine for programming a programmable gate array).

Design flow 2900 may vary depending on the type of representation being designed. For example, a design flow 2900 for building an application-specific IC (ASIC) may differ from a design flow 2900 for designing a standard component or from a design flow 2900 for instantiating the design into a programmable array, for example, a programmable gate array (PGA) or a field programmable gate array (FPGA) offered by Altera® Inc. or Xilinx® Inc.

FIG. 29 illustrates multiple such design structures, including an input design structure 2920 that is preferably processed by a design process 2910. Design structure 2920 may be a logical simulation design structure generated and processed by design process 2910 to produce a logically equivalent functional representation of a hardware device. Design structure 2920 may also or alternatively comprise data and/or program instructions that, when processed by design process 2910, generate a functional representation of the physical structure of a hardware device. Whether representing functional and/or structural design features, design structure 2920 may be generated using electronic computer-aided design (ECAD), such as that implemented by a core developer/designer. When encoded on a machine-readable data transmission, gate array, or storage medium, design structure 2920 may be accessed and processed by one or more hardware and/or software modules within design process 2910 to simulate or otherwise functionally represent an electronic component, circuit, electronic or logic module, apparatus, device, or system such as those shown herein. As such, design structure 2920 may comprise files or other data structures, including human and/or machine-readable source code, compiled structures, and computer-executable code structures, that, when processed by a design or simulation data processing system, functionally simulate or otherwise represent circuits or other levels of hardware logic design. Such data structures may include hardware description language (HDL) design entities or other data structures conforming to and/or compatible with lower-level HDL design languages such as Verilog and VHDL and/or higher-level design languages such as C or C++.

Design process 2910 preferably employs and incorporates hardware and/or software modules for synthesizing, translating, or otherwise processing a design/simulation functional equivalent of the components, circuits, devices, or logic structures shown herein to generate a netlist 2980, which may contain design structures such as design structure 2920. Netlist 2980 may comprise, for example, compiled or otherwise processed data structures representing a list of wires, discrete components, logic gates, control circuits, I/O devices, models, etc. that describes the connections to other elements and circuits in an integrated circuit design. Netlist 2980 may be synthesized using an iterative process in which netlist 2980 is resynthesized one or more times depending on design specifications and parameters for the device. As with other design structure types described herein, netlist 2980 may be recorded on a machine-readable storage medium or programmed into a programmable gate array. The medium may be a non-volatile storage medium such as a magnetic or optical disk drive, a programmable gate array, a compact flash card, or other flash memory. Additionally, or in the alternative, the medium may be a system or cache memory, or buffer space.

Design process 2910 may include hardware and software modules for processing a variety of input data structure types, including netlist 2980. Such data structure types may reside, for example, within library elements 2930 and include a set of commonly used elements, circuits, and devices, including models, layouts, and symbolic representations, for a given manufacturing technology (e.g., different technology nodes: 32 nm, 45 nm, 290 nm, etc.). The data structure types may further include design specifications 2940, characterization data 2950, verification data 2960, design rules 2990, and test data files 2985, which may include input test patterns, output test results, and other testing information. Design process 2910 may further include, for example, standard mechanical design processes such as stress analysis, thermal analysis, mechanical event simulation, and process simulation for operations such as casting, molding, and die press forming. One of ordinary skill in the art of mechanical design can appreciate the extent of possible mechanical design tools and applications used in design process 2910 without deviating from the scope and spirit of the invention. Design process 2910 may also include modules for performing standard circuit design processes such as timing analysis, verification, design rule checking, and place and route operations.

Design process 2910 employs and incorporates logic and physical design tools, such as HDL compilers and simulation model build tools, to process design structure 2920 together with some or all of the depicted supporting data structures, along with any additional mechanical design or data (if applicable), to generate a second design structure 2990. Design structure 2990 resides on a storage medium or programmable gate array in a data format used for the exchange of data of mechanical devices and structures (e.g., information stored in IGES, DXF, Parasolid XT, JT, DRG, or any other suitable format for storing or rendering such mechanical design structures). Similar to design structure 2920, design structure 2990 preferably comprises one or more files, data structures, or other computer-encoded data or instructions that reside on transmission or data storage media and that, when processed by an ECAD system, generate a logically or otherwise functionally equivalent form of one or more of the embodiments of the invention shown herein. In one embodiment, design structure 2990 may comprise a compiled, executable HDL simulation model that functionally simulates the devices shown herein.

Design structure 2990 may also employ a data format used for the exchange of layout data of integrated circuits and/or a symbolic data format (e.g., information stored in GDSII (GDS2), GL1, OASIS, map files, or any other suitable format for storing such design data structures). Design structure 2990 may comprise information such as, for example, symbolic data, map files, test data files, design content files, manufacturing data, layout parameters, wires, levels of metal, vias, shapes, data for routing through the manufacturing line, and any other data required by a manufacturer or other designer/developer to produce a device or structure as described above and shown herein. Design structure 2990 may then proceed to a stage 2995 where, for example, design structure 2990 proceeds to tape-out, is released to manufacturing, is released to a mask house, is sent to another design house, is sent back to the customer, etc.

As has been described, in at least one embodiment, a processor includes an instruction fetch unit that fetches instructions to be executed, a register file including a plurality of registers for storing source and destination operands, and an execution unit for executing a message padding instruction. The message padding instruction includes an operand field and a mode field. The operand field indicates one of the plurality of registers that buffers a message block segment of a message block to be padded, and the mode field indicates which of a plurality of different hash functions is to be applied to the message block. The execution unit includes a padding circuit configured, based on the message padding instruction, to receive a message block segment from the one of the plurality of registers indicated by the operand field of the message padding instruction, where the message block spans multiple registers in the register file. Based on which of the plurality of different hash functions is indicated by the mode field of the message padding instruction, the padding circuit selects a byte position in the message block segment at which to insert at least one padding byte and inserts the at least one padding byte at that byte position within the message block segment. The message block segment as padded by the at least one padding byte is then written back to the register file.
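The padding behavior summarized above can be sketched in software. The following is a minimal illustrative model, not the disclosed circuit: the function names `enable_vector` and `pad_last_block` are hypothetical, the model pads a whole final block rather than a register-sized segment, and it assumes a byte-aligned message. The padding byte values are assumptions taken from the byte-aligned conventions of FIPS 202 (0x06 for SHA3, 0x1F for SHAKE, 0x80 as the closing byte) and FIPS 180-4 (0x80 for SHA2); the SHA2 message length field is deliberately not modeled.

```python
# Illustrative software model only; byte values follow the byte-aligned
# padding conventions of FIPS 202 (SHA3/SHAKE) and FIPS 180-4 (SHA2).
EOM_BYTE = {"sha3": 0x06, "shake": 0x1F, "sha2": 0x80}
EOB_BYTE = {"sha3": 0x80, "shake": 0x80}  # no EOB byte modeled for SHA2


def enable_vector(length, position):
    """Return a one-hot vector with a 1 at the selected byte position."""
    return [1 if i == position else 0 for i in range(length)]


def pad_last_block(tail, block_len, mode):
    """Pad the final, partially filled block of a byte-aligned message."""
    if len(tail) >= block_len:
        raise ValueError("tail must be shorter than one block")
    block = tail + bytes(block_len - len(tail))  # zero-fill the remainder
    # EOM padding goes directly after the last message byte.
    pad = [en * EOM_BYTE[mode] for en in enable_vector(block_len, len(tail))]
    if mode in EOB_BYTE:
        # EOB padding goes in the last byte of the block; OR combines it
        # with the EOM byte when both select the same position.
        eob = enable_vector(block_len, block_len - 1)
        pad = [p | (en * EOB_BYTE[mode]) for p, en in zip(pad, eob)]
    # Model the OR step: merge padding bytes into the zero-filled block.
    return bytes(b | p for b, p in zip(block, pad))


sha3_block = pad_last_block(b"abc", 136, "sha3")        # SHA3-256 rate: 136 bytes
shake_block = pad_last_block(bytes(167), 168, "shake")  # one free byte left
sha2_block = pad_last_block(b"abc", 64, "sha2")         # length field not modeled
```

In this sketch the mode selects both the padding byte values and whether an EOB byte is inserted at all, and the one-hot enable vectors select the byte positions. When only one byte remains free in a SHAKE block, the EOM and EOB bytes land on the same position and are combined by the OR.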

While various embodiments have been particularly shown and described, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the appended claims, and that these alternate implementations all fall within the scope of the appended claims. For example, although the invention has been described with specific reference to the SHA family of standards, those skilled in the art will appreciate that the disclosed inventions are also applicable to other hashing algorithms (e.g., general Keccak functions, among others). Further, although illustrative numbers of bits and bytes have been discussed herein for ease of understanding, it should be appreciated that the specific numbers of bits and bytes employed in hashing algorithms can and do change over time, and that the principles of the disclosed inventions apply to cryptographic algorithms regardless of the specific numbers of bits and bytes in a given implementation.
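As a small illustration of how block lengths already differ across standardized hash functions, the sketch below queries the block sizes that Python's standard `hashlib` module reports. The selection of algorithms and the name `BLOCK_BYTES` are choices made for this example only and are not drawn from the disclosure.

```python
import hashlib

# Block (rate) sizes in bytes as reported by Python's hashlib.
NAMES = ("sha256", "sha512", "sha3_256", "sha3_512", "shake_128")
BLOCK_BYTES = {name: hashlib.new(name).block_size for name in NAMES}

for name, size in BLOCK_BYTES.items():
    print(f"{name}: {size} bytes ({size * 8} bits)")
```

A padding mechanism parameterized by hash function and block length, rather than hard-wired to one size, accommodates this variation.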

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowcharts or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special-purpose hardware and computer instructions.

Further, although aspects have been described with respect to a computer system executing program code that directs the functions of the present invention, it should be understood that the present invention may alternatively be implemented as a program product including a computer-readable storage device storing program code that can be processed by a data processing system. The computer-readable storage device can include volatile or non-volatile memory, an optical or magnetic disk, or the like. However, as employed herein, a "storage device" is specifically defined to include only statutory articles of manufacture and to exclude signal media per se, transitory propagating signals per se, and energy per se.

The program product may include data and/or instructions that, when executed or otherwise processed on a data processing system, generate a logically, structurally, or otherwise functionally equivalent representation (including a simulation model) of the hardware components, circuits, devices, or systems disclosed herein. Such data and/or instructions may include hardware description language (HDL) design entities or other data structures conforming to and/or compatible with lower-level HDL design languages such as Verilog and VHDL, and/or higher-level design languages such as C or C++. Furthermore, the data and/or instructions may also employ a data format used for the exchange of layout data of integrated circuits and/or a symbolic data format (e.g., information stored in GDSII (GDS2), GL1, OASIS, map files, or any other suitable format for storing such design data structures).

100: data processing system
102: processor
104: processor core
106: cache memory
110: system interconnect
112: memory controller
114: system memory
116: input/output (I/O) adapter
118: non-volatile storage system
120: network adapter
200: processor core
202: instruction fetch unit
204: instruction decode unit
206: branch processing unit
210: mapper circuit
216: dispatch circuit
218: issue queue
220: fixed-point unit
222: floating-point unit
224: load-store unit
226: vector-scalar unit
230: storage
300: architected register file
301: 256-bit registers r0 to rS
302: functional unit/arithmetic logic unit/rotate unit
304: functional unit/multiply unit
306: functional unit/divide unit
308: functional unit/cryptographic unit
310: functional unit/permute unit
312: functional unit/binary-coded decimal (BCD) unit
314: accelerator unit
316: wide vector register file
317: registers R0 and R1
320: transfer unit
400: SHA3/SHAKE hash circuit
402: SHA2 hash circuit
404: single-instruction, multiple-data (SIMD) exclusive OR (XOR) circuit
406: data transfer circuit
500: process
502: message
504: SHA3 absorb phase
506: SHA3/SHAKE squeeze phase
508: message digest
600: block
602: 1600-bit extended message block
604: SHA3 state permutation function
606: 1600-bit bitwise XOR function
610: 1600-bit last absorb state
702: round index 0 specified by the SHA-3 standard
704: SHA3 round function
800: result block 1
802: truncation function
804: truncation function
900: SHA3 hash instruction
902: opcode field
904: register field
906: register field
1000: bitwise XOR instruction
1002: opcode field
1004: register field
1006: register field
1008: register field
1100a: 1024-bit two-input multiplexer
1100b: 1024-bit two-input multiplexer
1102a: 1024-bit state register
1102b: 1024-bit state register
1106: SHA3 round circuit
1108: output multiplexer
1110: control circuit/control logic
1200: block
1202: block
1204: block
1206: block
1208: block
1210: block
1214: block
1216: block
1300: SHA2 hash function
1302: message
1304: block
1306: 16×w-bit message block/message block 1
1308: 8×w-bit initial state
1310: SHA2 block hash function
1312: truncation function
1314: message digest
1400: message schedule round function
1402: w-bit round key
1404: round function/initial update round
1406: block
1410: 8×w-bit carry-propagate addition function
1420: block/16×w-bit initialization
1500: SHA2-224 or SHA2-256 input message
1502: 32-bit word
1504: output message
1506: 64-bit doubleword
1508: 32-bit zero word
1600: SHA2 hash instruction
1602: opcode field
1604: operand register field
1606: operand register field
1608: mode field
1702a: 512-bit two-input state multiplexer
1702b: 1024-bit two-input message multiplexer
1704a: 512-bit state register
1704b: 1024-bit message block register
1708: update working state circuit
1710: message schedule round circuit
1712: single-instruction, multiple-data (SIMD) adder
1720: control circuit
1800: block/input state
1802: SHA2 Sigma 0 circuit
1804: SHA2 MA circuit
1806: SHA2 Sigma 1 circuit
1808: SHA2 CH circuit
1810: 64-bit adder
1812: 64-bit adder/modular adder
1814: 64-bit adder/modular adder
1816: result state
1900: SHA2 sigma circuit
1902: 64-bit input variable
1904a: 64-bit rotate circuit
1904b: 64-bit rotate circuit
1906a: 32-bit rotate circuit
1906b: 32-bit rotate circuit
1908a: 64-bit rotate/shift circuit
1908b: 32-bit rotate/shift circuit
1910a: multiplexer
1910b: multiplexer
1910c: multiplexer
1912: three-input 64-bit bitwise XOR circuit/3-way bitwise XOR circuit
1914: 64-bit output
2000: block
2002: block
2004: block
2006: block
2010: block
2012: block
2014: block
2016: block
2100: unpadded message
2300: last message block
2300': last message block
2300'': last message block
2302: EOM padding byte
2304: EOB padding byte
2306: last message byte
2308: EOM/EOB padding byte
2310: message block
2312: additional zeroed message block
2400: last message block
2400': last message block
2400'': last message block
2402: EOM padding byte
2404: EOB padding word
2406: last message byte
2408: message block
2410: complete message block
2412: message block
2500: padding instruction
2502: opcode field
2504: register 1 field
2506: register 2 field
2508: mode field
2510: hash identifier (HID) subfield
2512: block length (BL) subfield
2514: extension (E) subfield
2600: padding circuit
2602: select EOM circuit
2604: select EOB circuit
2606: select BL size circuit
2608: EOB enable circuit
2610: comparator
2612: decoder
2614: bitwise AND circuit
2620: EOM enable circuit/selection circuit
2622: comparator
2624: decoder
2626: bitwise AND circuit
2630: conditional OR circuit
2700: OR gate
2702: two-input AND gate
2704: two-input AND gate
2800: block
2802: block
2804: block
2806: block
2808: block
2810: block
2812: block
2814: block
2816: block
2818: block
2820: block
2822: block
2900: design flow
2910: design process
2920: design structure
2930: library elements
2940: design specifications
2950: characterization data
2960: verification data
2980: netlist
2985: test data files
2990: design rules/second design structure
2995: stage

FIG. 1 is a high-level block diagram of a data processing system including a processor in accordance with one embodiment;

FIG. 2 is a high-level block diagram of a processor core in accordance with one embodiment;

FIG. 3 is a high-level block diagram of an exemplary execution unit of a processor core in accordance with one embodiment;

FIG. 4 is a more detailed block diagram of an accelerator unit within a processor core in accordance with one embodiment;

FIG. 5 is a time-space diagram of message hashing in accordance with the SHA-3 standard;

FIG. 6 is a time-space diagram of the absorb phase depicted in FIG. 5;

FIG. 7A is a time-space diagram of the SHA3 permutation function illustrated in FIG. 6;

FIG. 7B is a time-space diagram of the SHA3 round function depicted in FIG. 7A;

FIG. 8 is a time-space diagram of the SHA3/SHAKE squeeze phase illustrated in FIG. 5;

FIGS. 9-10 illustrate exemplary formats for a SHA3 hash instruction and a bitwise exclusive OR (XOR) instruction, respectively, in accordance with one embodiment;

FIG. 11 is a high-level block diagram of an exemplary SHA3/SHAKE hash circuit in accordance with one embodiment;

FIG. 12 is a high-level logical flowchart of an exemplary process by which a processor executes a SHA3 hash instruction in accordance with one embodiment;

FIG. 13 depicts a time-space diagram of message hashing in accordance with the SHA-2 standard;

FIG. 14 is a time-space diagram of the SHA2 block hash function illustrated in FIG. 13;

FIG. 15 illustrates message expansion for a SHA2 hash function employing 32-bit words in accordance with an exemplary embodiment;

FIG. 16 depicts an exemplary format for a SHA2 hash instruction in accordance with one embodiment;

FIG. 17 is a high-level block diagram of an exemplary SHA2 hash circuit in accordance with one embodiment;

FIG. 18 is a high-level block diagram of the exemplary update working state circuit from FIG. 17 in accordance with one embodiment;

FIG. 19 is a high-level block diagram of an exemplary embodiment of the SHA2 sigma circuit shown in FIG. 18;

FIG. 20 is a high-level logical flowchart of an exemplary process by which a processor executes a SHA2 hash instruction in accordance with one embodiment;

FIG. 21A depicts an exemplary unpadded message;

FIG. 21B illustrates an exemplary padded message;

FIGS. 22A-22B depict assembling chunks of a message block in a narrower first register file and transferring the message block to a wider second register file;

FIGS. 23A-23D illustrate various padding scenarios for SHA3/SHAKE messages;

FIGS. 24A-24D depict various padding scenarios for SHA2 messages;

FIG. 25 illustrates an exemplary padding instruction in accordance with one embodiment;

FIG. 26 depicts an exemplary padding circuit in accordance with one embodiment;

FIG. 27 illustrates an exemplary circuit for combining end-of-block (EOB) and end-of-message (EOM) bytes with a message in accordance with one embodiment;

FIG. 28 is a high-level logical flowchart of an exemplary process for padding a message block in accordance with one embodiment; and

FIG. 29 depicts an exemplary design process in accordance with one embodiment.


Claims

A processor, comprising: an instruction fetch unit that fetches instructions to be executed; a register file including a plurality of registers for storing source and destination operands; and an execution unit for executing a message padding instruction, wherein the message padding instruction includes an operand field and a mode field, the operand field indicating one of the plurality of registers that buffers a message block segment of a message block to be padded, and the mode field indicating which of a plurality of different hash functions is to be applied to the message block, wherein the execution unit includes a padding circuit configured to perform the following operations based on the message padding instruction: receiving a message block segment from the one of the plurality of registers indicated by the operand field of the message padding instruction, wherein the message block spans multiple registers in the register file; determining whether to insert at least one padding byte into the message block segment based on which of the plurality of different hash functions is indicated by the mode field of the message padding instruction; based on determining to insert the at least one padding byte into the message block segment, selecting a byte position in the message block segment at which to insert the at least one padding byte based on which of the plurality of different hash functions is indicated by the mode field, and inserting the at least one padding byte at the selected byte position within the message block segment; based on determining not to insert the at least one padding byte into the message block segment, refraining from inserting the at least one padding byte into the message block segment; and writing the message block segment as padded by the at least one padding byte back to the register file, wherein the padding circuit includes: at least one enable circuit configured to generate an enable vector that selects the byte position; and an OR circuit coupled to receive the enable vector and the message block segment and configured to insert, based on the enable vector, the at least one padding byte at the selected byte position within the message block segment.

The processor of claim 1, wherein the plurality of different hash functions include SHA3, SHAKE, and SHA2 hash functions.

The processor of claim 1, wherein: the message block includes a plurality of message block segments; and the execution unit is configured to detect which of the plurality of message block segments the message block segment is based on an indication in the mode field.

The processor of claim 1, wherein: the plurality of different hash functions include a first hash function and a second hash function; and the execution unit is configured to insert both end-of-message (EOM) padding and end-of-block (EOB) padding in the message block segment based on the mode field indicating the first hash function, and is configured to insert EOM padding but not EOB padding in the message block segment based on the mode field indicating the second hash function.

The processor of claim 1, wherein: the plurality of different hash functions include a first hash function and a second hash function; the selecting includes selecting the byte position based on a length parameter indicated by the operand field of the message padding instruction; and the execution unit is configured to insert the at least one padding byte at a first byte position based on the length parameter when the mode field indicates the first hash function, and to insert the at least one padding byte at a different second byte position based on the length parameter when the mode field indicates the second hash function.

The processor of claim 1, wherein: the padding circuit includes a selection circuit configured to select a value of the at least one padding byte based on which of the plurality of different hash functions is indicated by the mode field of the message padding instruction.

The processor of claim 1, wherein: the register file is a first register file; the plurality of registers are a first plurality of registers; the processor includes a second register file including a second plurality of registers, each having a length less than a length of the first plurality of registers; and the execution unit is further configured to assemble multiple chunks of the message block segment in multiple registers among the second plurality of registers and to transfer all of the multiple chunks to one of the first plurality of registers to form the message block segment.

The processor of claim 7, wherein the processor is further configured to insert end-of-block (EOB) padding into one of the multiple registers among the second plurality of registers before transferring the multiple chunks to the one of the first plurality of registers.

The processor of claim 1, wherein: selecting the byte position comprises generating an enable vector having a length corresponding to the message block segment; and inserting the at least one padding byte comprises inserting the at least one padding byte at the byte position in the message block segment based on the enable vector.

The processor of claim 1, wherein: inserting the at least one padding byte comprises logically combining an end-of-block (EOB) padding byte with an end-of-message (EOM) padding byte and inserting a resulting combined padding byte.

The processor of claim 1, wherein: the execution unit includes a hash circuit configured to apply a hash function from the SHA family of hash functions to a padded message block including the padded message block segment based on a hash instruction.

The processor of claim 1, wherein: the message block segment is part of a message comprising a plurality of message blocks having a same length of r bits; the padded message block includes r bits; and each of the plurality of registers has a length less than r bits.

A method for performing data processing in a processor including a register file, the method comprising: fetching, by an instruction fetch unit of the processor, instructions to be executed by the processor, wherein the instructions include a message padding instruction, the message padding instruction including an operand field and a mode field, the operand field indicating one of a plurality of registers that buffers a message block segment of a message block to be padded, and the mode field indicating which of a plurality of different hash functions is to be applied to the message block; and executing, by an execution unit of the processor, the message padding instruction upon receiving the message padding instruction, wherein executing the message padding instruction comprises: receiving, from the register file, the message block segment from the one of the plurality of registers indicated by the operand field of the message padding instruction, wherein the message block spans a plurality of registers in the register file; determining whether to insert at least one padding byte into the message block segment based on which of the plurality of different hash functions is indicated by the mode field of the message padding instruction; based on determining that the at least one padding byte is to be inserted into the message block segment, selecting a byte position in the message block segment at which to insert the at least one padding byte based on which of the plurality of different hash functions is indicated by the mode field, and inserting the at least one padding byte at the selected byte position in the message block segment, wherein the selecting comprises generating, by an enable circuit, an enable vector specifying the byte position, and the inserting comprises inserting the at least one padding byte at the selected byte position based on the enable vector by one or more circuits coupled to receive the enable vector and the message block segment; based on determining that the at least one padding byte is not to be inserted into the message block segment, refraining from inserting the at least one padding byte into the message block segment; and writing the message block segment, as padded by the at least one padding byte, back to the register file.
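
The selecting and inserting steps recited above can be pictured with a short software model. This is only an illustrative sketch, not the claimed circuit: the function names, the one-hot list representation of the enable vector, the 16-byte segment in the usage note, and the default padding value 0x80 are all assumptions introduced here.

```python
from typing import List, Optional


def make_enable_vector(seg_len: int, byte_pos: Optional[int]) -> List[int]:
    """Model of the enable circuit (hypothetical): one flag per segment byte.

    A flag of 1 marks the byte position selected for padding. An all-zero
    vector models the case in which no padding byte is to be inserted.
    """
    if byte_pos is None:
        return [0] * seg_len
    if not 0 <= byte_pos < seg_len:
        raise ValueError("byte position lies outside the segment")
    return [1 if i == byte_pos else 0 for i in range(seg_len)]


def apply_padding(segment: bytes, enable: List[int], pad_byte: int) -> bytes:
    """Model of the insertion circuits: a per-byte 2:1 multiplexer.

    Where the enable flag is set, the padding byte is selected;
    elsewhere the original segment byte passes through unchanged.
    """
    return bytes(pad_byte if en else b for b, en in zip(segment, enable))


def pad_segment(segment: bytes, byte_pos: Optional[int],
                pad_byte: int = 0x80) -> bytes:
    """Generate the enable vector, then insert the padding byte with it."""
    enable = make_enable_vector(len(segment), byte_pos)
    return apply_padding(segment, enable, pad_byte)
```

For example, `pad_segment(b"abc" + bytes(13), 3)` replaces only the byte at position 3 of a 16-byte segment, and passing `None` as the position returns the segment unchanged, which corresponds to the refraining step.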

The method of claim 13, wherein: the message block includes multiple message block segments; and the method further includes detecting, based on an indication in the mode field, which of the multiple message block segments the message block segment is.

The method of claim 13, wherein: the plurality of different hash functions include a first hash function and a second hash function; and the executing includes inserting both end-of-message (EOM) padding and end-of-block (EOB) padding in the message block segment based on the mode field indicating the first hash function, and inserting EOM padding but not EOB padding in the message block segment based on the mode field indicating the second hash function.
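
The mode-dependent difference between EOM and EOB padding can be illustrated as follows. This is a hedged sketch under stated assumptions: it takes the first hash function to be SHA-3 style and the second to be SHA-2 style, which the claim itself does not specify; the mode strings, the function name, and the whole-block (rather than per-segment) view are hypothetical. The byte values follow the published standards (0x06 and 0x80 for SHA-3 at byte granularity, 0x80 for SHA-2), and the SHA-2 message-length field is deliberately left out.

```python
def pad_final_block(msg_tail: bytes, block_len: int, mode: str) -> bytes:
    """Pad the last, partially filled block of a message.

    msg_tail  -- message bytes that fall into the final block
    block_len -- block (rate) size in bytes
    mode      -- "sha3" or "sha2" (hypothetical stand-ins for the mode field)
    """
    if len(msg_tail) >= block_len:
        raise ValueError("msg_tail must leave room for at least one padding byte")

    block = bytearray(block_len)          # zero-filled block
    block[:len(msg_tail)] = msg_tail
    eom = len(msg_tail)                   # first byte after the message

    if mode == "sha3":
        block[eom] |= 0x06                # EOM padding
        block[-1] |= 0x80                 # EOB padding (same byte if eom is last)
    elif mode == "sha2":
        block[eom] = 0x80                 # EOM padding only, no EOB padding
        # The SHA-2 length field is not modeled in this sketch.
    else:
        raise ValueError("unknown mode")
    return bytes(block)
```

When the message ends one byte short of the block boundary in the SHA-3 style mode, the EOM and EOB values land in the same byte and combine to 0x86, which is why the sketch uses bitwise OR rather than plain assignment.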

The method of claim 13, wherein: the plurality of different hash functions include a first hash function and a second hash function; the selecting includes selecting the byte position based on a length parameter indicated by the operand field of the message padding instruction; and the inserting includes inserting the at least one padding byte at a first byte position based on the length parameter when the mode field indicates the first hash function, and inserting the at least one padding byte at a different second byte position based on the length parameter when the mode field indicates the second hash function.

The method of claim 13, further comprising: selecting, by the execution unit, a value of the at least one padding byte based on which of the plurality of different hash functions is indicated by the mode field of the message padding instruction.

The method of claim 13, wherein: the register file of the processor is a first register file; the plurality of registers are a first plurality of registers; the processor includes a second register file, the second register file including a second plurality of registers, each shorter in length than the registers of the first plurality of registers; and the method further includes combining multiple chunks of the message block segment in multiple registers among the second plurality of registers, and transferring all of the multiple chunks to one of the first plurality of registers to form the message block segment.

The method of claim 18, further comprising inserting end-of-block (EOB) padding into one of the multiple registers among the second plurality of registers before transferring the multiple chunks to the one of the first plurality of registers.
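
The chunk combining of claim 18 and the early EOB insertion of claim 19 can be sketched in software as below. The register widths (8-byte chunks forming a 16-byte segment), the big-endian byte order, the EOB value 0x80, and the function name are assumptions made for illustration only; the claims do not fix any of them.

```python
from typing import List


def combine_chunks(chunks: List[int], chunk_bytes: int = 8,
                   eob: bool = False) -> bytes:
    """Concatenate narrow-register chunks into one wide-register segment.

    chunks      -- integer contents of the narrower registers, in order
    chunk_bytes -- assumed width of each narrower register in bytes
    eob         -- if True, OR the assumed EOB value 0x80 into the last
                   byte of the last chunk before the transfer
    """
    chunks = list(chunks)                 # do not modify the caller's list
    if eob:
        # Under big-endian order, the lowest 8 bits of the last chunk
        # become the final byte of the combined segment.
        chunks[-1] |= 0x80
    return b"".join(c.to_bytes(chunk_bytes, "big") for c in chunks)
```

For instance, `combine_chunks([0x0102030405060708, 0], eob=True)` yields a 16-byte segment whose final byte already carries the EOB padding when it reaches the wider register.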

The method of claim 13, further comprising: based on a hash instruction, applying, by a hash circuit of the processor, a hash function from the SHA family of hash functions to a padded message block including the padded message block segment.