WO2025014480A1 - Compressed instruction set architecture for vector digital signal processors - Google Patents
- Publication number
- WO2025014480A1 PCT/US2023/027399
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- instruction
- vector
- compressed
- instructions
- compressed vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/3017—Runtime instruction translation, e.g. macros
- G06F9/30178—Runtime instruction translation, e.g. macros of compressed or encrypted instructions
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/30036—Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
Definitions
- Embodiments of the present disclosure relate to the instruction set architecture (ISA) of vector digital signal processors (vDSP).
- ISA instruction set architecture
- vDSP vector digital signal processors
- Vector digital signal processors are used in many modern electronic products, for example, mobile communication products such as smartphones. Because mobile communication products commonly rely on batteries for power, it is important to reduce power consumption in these products to extend battery life.
- a processor includes an instruction fetch unit configured to fetch vector instructions, wherein the vector instructions include compressed vector instructions and non-compressed vector instructions, an instruction decoder unit, coupled to the instruction fetch unit, configured to at least determine whether a fetched vector instruction is a compressed vector instruction or a non-compressed vector instruction, an instruction converter, coupled to the instruction decoder unit, configured to receive one or more compressed vector instructions and convert the one or more compressed vector instructions to a corresponding one or more non-compressed vector instructions, an instruction converter bypass path, coupled to the instruction decoder unit; and an instruction queue coupled to the instruction decoder unit, and further coupled to the instruction converter bypass path.
- the compressed vector instructions comprise fewer bits than the non-compressed vector instructions.
- a method of executing vector instructions includes fetching one or more vector instructions from an instruction memory, determining whether a first one of the one or more fetched vector instructions is a compressed vector instruction or a non-compressed vector instruction, converting, if the first one of the one or more fetched vector instructions is a compressed vector instruction, the first one of the one or more fetched vector instructions to a corresponding non-compressed vector instruction, and transferring the corresponding non-compressed vector instruction to an instruction queue, and transferring, if the first one of the one or more fetched vector instructions is a non-compressed vector instruction, the first one of the one or more fetched vector instructions to the instruction queue via an instruction converter bypass path.
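The claimed method can be sketched in software as follows. This is an illustrative model only (the disclosure implements the flow in hardware), and the compressed-instruction predicate assumes the RISC-V-style convention, discussed later in this disclosure, in which bit field [1:0] equal to 11 marks a full-width, non-compressed instruction; all names are hypothetical:

```python
# Illustrative model of the claimed flow: fetched vector instructions are
# either converted (compressed) or sent to the instruction queue via the
# bypass path (non-compressed).

def is_compressed(insn_bytes):
    # RISC-V-style length convention: low two bits == 0b11 means a
    # full-width (non-compressed) instruction.
    return (insn_bytes[0] & 0b11) != 0b11

def convert(insn_bytes):
    # Placeholder for the 16/24-bit -> 32-bit instruction converter.
    return b"\x00\x00\x00\x00"   # stand-in for the expanded encoding

def route(fetched, instruction_queue):
    for insn in fetched:
        if is_compressed(insn):
            instruction_queue.append(convert(insn))   # through the converter
        else:
            instruction_queue.append(insn)            # bypass path

queue = []
route([b"\x12\x34", b"\x03\x00\x00\x00"], queue)   # one compressed, one not
```

In this model the predicate plays the role of the decoder's decision hardware, and the two branches correspond to the instruction converter and the instruction converter bypass path of the claims.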
- a system includes an instruction memory, a processor coupled to the instruction memory, wherein the processor includes an instruction fetch unit configured to fetch vector instructions, wherein the vector instructions include compressed vector instructions and non-compressed vector instructions, an instruction decoder unit, coupled to the instruction fetch unit, configured to at least determine whether a fetched vector instruction is a compressed vector instruction or a non-compressed vector instruction, an instruction converter, coupled to the instruction decoder unit, configured to receive one or more compressed vector instructions and convert the one or more compressed vector instructions to a corresponding one or more non-compressed vector instructions, an instruction converter bypass path, coupled to the instruction decoder unit, and an instruction queue coupled to the instruction decoder unit, and further coupled to the instruction converter bypass path.
- the instruction memory is a non-transitory memory.
- FIG. 1 is a high-level block diagram of a generalized vDSP architecture.
- FIG. 2 shows a pair of 32-bit vector instructions and an illustrative encoding that provides a corresponding compressed pair of 16-bit vector instructions.
- FIG. 3 shows a pair of 32-bit vector instructions and an illustrative encoding that provides a corresponding compressed pair of 16-bit vector instructions.
- FIG. 4 shows a pair of 32-bit vector instructions and an illustrative encoding that provides a corresponding compressed pair of 24-bit vector instructions.
- FIG. 5 shows a pair of non-compressed 32-bit vector instructions and an illustrative encoding that provides a corresponding compressed pair of 24-bit vector instructions.
- FIG. 6 is a flow chart showing an illustrative method of converting, i.e., transforming, compressed vector instructions into corresponding non-compressed 32-bit vector instructions.
- FIG. 7 is a high-level block diagram of a portion of a processor illustrating the datapath for vector instructions between an instruction fetch unit and an instruction queue including decision hardware in the instruction decoder, and an instruction converter disposed in the data path between the instruction decoder and the instruction queue.
- FIG. 8A shows a segment of non-compressed assembly code using 32-bit vector instructions that requires 20 bytes of instruction memory.
- FIG. 8B shows an illustrative segment of compressed assembly code, in accordance with this disclosure, that directs a vDSP to perform the same function as the assembly code of FIG. 8A, but only requires 13 bytes of instruction memory.
- FIG. 9 is a high-level block diagram of a system including an instruction memory, and a portion of a processor illustrating the datapath for vector instructions between an instruction fetch unit and an instruction queue, an instruction converter disposed in the datapath between the instruction decoder and the instruction queue, an instruction converter bypass path, issue queues, a vector computation unit, and a vector register file.
- FIG. 10 is a flow diagram of a method of executing vector instructions.
- non-compressed vector instructions are 32 bits (4 bytes) in length, and at least some of those have functionally equivalent counterpart vector instructions that are either 16 bits (2 bytes) or 24 bits (3 bytes) in length.
- the number of bits used for non-compressed and compressed vector instructions in this disclosure is not a limitation on the design of a vector instruction set, but rather for the purpose of illustration.
- the amount of memory required to store the vector instructions of a program may be reduced.
- the reduced memory requirements reduce the number of memory accesses needed to fetch the vector instructions from memory.
- the reduced number of memory accesses reduces power consumption by an amount that would have been needed to perform the now unneeded memory accesses.
- the reduction of the amount of instruction memory due to the use of compressed vector instructions may result in the benefit of reducing the amount of physical memory that is required for storing a program, thus further reducing power consumption by having less memory to remain powered.
- Various embodiments in accordance with this disclosure provide a processor, such as but not limited to a vector digital signal processor, that can fetch both non-compressed and compressed vector instructions and convert the fetched compressed vector instructions to equivalent counterpart non-compressed vector instructions prior to execution.
- At least some vector instructions of an original 32-bit vector instruction set used in a vDSP may be compressed into either 16-bit or 24-bit vector instructions. It is noted that, depending on the original vector instruction set, compression is not limited to 32-bit to 16-bit or 24-bit. Rather, compression may be from 64-bit or even longer vector instructions to 8-bit, 16-bit, 32-bit, 40-bit, 48-bit, and so on, with appropriate byte alignment.
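Because every instruction width named above is a whole number of bytes, a mixed stream of compressed and non-compressed instructions packs into memory with no padding, and code size is simply the sum of per-instruction byte lengths. The following sketch illustrates the accounting; the instruction mixes are hypothetical:

```python
# Sketch: byte-aligned instruction widths make code size additive, so the
# savings from compression are the difference in total byte counts.
def code_size_bytes(widths_in_bits):
    assert all(w % 8 == 0 for w in widths_in_bits), "byte alignment required"
    return sum(w // 8 for w in widths_in_bits)

all_32_bit = code_size_bytes([32] * 5)              # five full-width instructions
mixed      = code_size_bytes([16, 24, 32, 16, 16])  # same count, compressed mix
```

With these hypothetical mixes, five full-width instructions occupy 20 bytes while the mixed stream occupies 13 bytes, mirroring the FIG. 8A/8B comparison later in this disclosure.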
- the 16-bit/24-bit to 32-bit instruction conversion logic disclosed herein may be moved from in between the vector control unit (VCU) instruction decoder (ID) and instruction queue, before each individual issue queue, or even into each individual ID in a vector datapath unit (VDU).
- VCU vector control unit
- ID instruction decoder
- VDU vector datapath unit
- references in the specification to “one embodiment,” “an embodiment,” “an example embodiment,” “some embodiments,” etc. indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases do not necessarily refer to the same embodiment. Further, when a particular feature, structure or characteristic is described in connection with an embodiment, it would be within the knowledge of a person skilled in the pertinent art to effect such feature, structure or characteristic in connection with other embodiments whether or not explicitly described.
- terminology may be understood at least in part from usage in context.
- the term “one or more” as used herein, depending at least in part upon context may be used to describe any feature, structure, or characteristic in a singular sense or may be used to describe combinations of features, structures or characteristics in a plural sense.
- terms, such as “a,” “an,” or “the,” again, may be understood to convey a singular usage or to convey a plural usage, depending at least in part upon context.
- the term “based on” may be understood as not necessarily intended to convey an exclusive set of factors and may, instead, allow for existence of additional factors not necessarily expressly described, again, depending at least in part on context.
- the term “nominal/nominally” refers to a desired, or target, value of a characteristic or parameter for a component or a process operation, set during the design phase of a product or a process, together with a range of values above and/or below the desired value.
- the range of values can be due to slight variations in manufacturing processes or tolerances.
- SoC System-on-a-Chip
- An SoC is an integrated circuit (IC) with the circuits and subsystems to implement a particular system, such as, for example, a smartphone, smart watch, or other product of similar complexity.
- Integrated circuits in accordance with this disclosure may be implemented in any suitable process technology including, but not limited to, complementary metal oxide semiconductor (CMOS). Such integrated circuits may be custom designed, synthesized, or a combination of both. Embodiments of the architecture disclosed herein may be implemented on a single chip, partitioned over one or more chips, one or more die in a single package, or one or more chiplets in a single package, or any combination of these or similar integrated circuit and packaging configurations.
- CMOS complementary metal oxide semiconductor
- embodiments of the architecture disclosed herein may be implemented with one or more chiplets.
- chiplets may be disposed on an interposer in a single package.
- the interposer may be a silicon interposer.
- Various embodiments in accordance with this disclosure may be implemented with any suitable packaging technology.
- FIG. 1 introduces a general architecture of a vDSP system without the capability to execute both non-compressed and compressed vector instructions. More particularly, FIG. 1 is a high-level block diagram of a general vDSP system architecture 100, for illustrative purposes. While FIG. 1 is a general illustration showing a floating-point scalar computation unit and a fixed-point vector computation unit, it is noted that vDSP architectures may also include fixed-point scalar computation units and floating-point vector computation units. As described later herein, various embodiments in accordance with this disclosure may be implemented in processors such as those having a vector digital signal processor architecture, thereby providing such processors with the capability to execute both compressed and non-compressed vector instructions.
- vDSP system architecture 100 executes non-compressed vector instructions and non-compressed scalar instructions.
- a vDSP 102 is coupled to a memory 104.
- vDSP 102 includes a floating point scalar computation unit 106 coupled to a floating point scalar register file 108, and a fixed point vector computation unit 110 is coupled to a fixed point vector register file 112.
- vDSP 102 further includes a control unit 114 that is coupled to floating point scalar computation unit 106, floating point scalar register file 108, fixed point vector computation unit 110, and fixed point vector register file 112.
- Memory 104 may be an external memory, i.e., implemented as a chip or chiplet separate from the chip or chiplet(s) in which vDSP 102 is implemented. Alternatively, memory 104 may be implemented on a single chip with vDSP 102. Memory 104 may be implemented with any suitable circuitry or manufacturing process technology that meets the cost, performance, and capacity requirements of vDSP system architecture 100.
- RISC-V refers to an instruction set architecture specification.
- RISC-V represents the fifth major reduced instruction set computer (RISC) instruction set architecture (ISA) from the University of California at Berkeley.
- FIGs. 2-5 show illustrative mappings, i.e., encodings, for transforming 32-bit non-compressed vector instructions to compressed 16-bit or 24-bit vector instructions.
- a processor such as but not limited to a vDSP, has a set of 32 X registers, which are the registers in a vector control unit of a vDSP.
- one of those operands is an X register.
- VG2X VG0, X1 (5-bit addressing), which moves the value of X1 to VG0, can be compressed
- FIG. 2 shows a pair of 32-bit vector instructions and an illustrative encoding that provides a corresponding compressed pair of 16-bit vector instructions.
- the bit field [1:0] of the instruction represents the major code of RISC-V.
- major code 00 is used
- function code 100 is used at the bit field [15:13] of the instruction.
- VG2X moving from VG registers to X registers
- X2VG moving from X registers to VG registers
- C16.VG2X and C16.X2VG are compressed alternatives for VG2X and X2VG, respectively.
- the instruction size for both VG2X and X2VG is reduced from 32-bit to 16-bit, as shown in the illustrative example of FIG. 2. Similar to the example above, 32-bit instructions for moving to/from vector control registers (VCR) from/to X registers may also be compressed with just a 2-bit register address field for the X registers, i.e., X0-X3.
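Per the FIG. 2 description, the 16-bit forms carry RISC-V major code 00 in bit field [1:0] and function code 100 in bit field [15:13], with bits [10], [9], and [8] acting as selectors in the FIG. 6 decode tree. A sketch of assembling such a word follows; the register-field positions and widths are purely hypothetical, chosen only to avoid the selector bits:

```python
# Hypothetical assembly of a 16-bit C16.VG2X word. Only the major code
# ([1:0] = 00), function code ([15:13] = 100), and the zero selector bits
# ([10], [9], [8]) come from the text; the register-field placements are
# illustrative assumptions.
def encode_c16_vg2x(vg, x):
    word = 0b00                  # major code, bits [1:0]
    word |= 0b100 << 13          # function code, bits [15:13]
    # bits [10], [9], [8] remain 0, selecting VG2X in the decode tree
    word |= (x & 0x1F) << 2      # assumed 5-bit X register field, bits [6:2]
    word |= (vg & 0x3) << 11     # assumed 2-bit VG register field, bits [12:11]
    return word
```

A word built this way satisfies exactly the field tests that the FIG. 6 flow applies before converting a C16.VG2X instruction back to its 32-bit counterpart.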
- FIG. 3 shows a pair of 32-bit vector instructions and an illustrative encoding that provides a corresponding compressed pair of 16-bit vector instructions.
- the details of conversion from the original 32-bit instructions X2VCR (moving from X registers to VCR registers) and VCR2X (moving from VCR registers to X registers) to compressed vector instructions are shown in FIG. 3.
- the compressed vector instructions are C16.X2VCR and C16.VCR2X, and as shown, the vector instruction size has been reduced from 32-bits to 16-bits.
- Some vector instructions include a field in which an immediate value is specified. However, when the immediate value is small, not all the bits reserved for the immediate value field are needed to represent that small value. Consider, for example, assigning an immediate value of 5 to the VCOP MODE register (vector co-processor mode register) with the original instruction VCRWI.L, which upon execution writes a 16-bit immediate value to the lower 16 bits of the vector control register. The VCOP MODE register, however, only has 8 bits. Thus, only an 8-bit immediate field, rather than a 16-bit immediate field, is needed for writing to the VCOP MODE register, and a compressed version of this instruction can be created that is eight bits (one byte) shorter. For the example above, the non-compressed 32-bit vector instruction and its compressed 24-bit counterpart vector instruction may be written as follows:
- VCRWI.L VCOP MODE, 5 (#IMM8) // VCOP MODE register only has 8 bits
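An assembler could apply this size reduction automatically: emit the shorter C24.VCRWI.L form whenever the immediate fits the 8-bit VCOP MODE register, and fall back to the full 32-bit VCRWI.L with its 16-bit immediate field otherwise. The mnemonics and sizes follow the text; the selection logic itself is an illustrative sketch:

```python
# Sketch: select the compressed 24-bit form when the immediate value fits
# an 8-bit field, otherwise use the non-compressed 32-bit form.
def select_form(imm):
    if 0 <= imm < (1 << 8):          # value fits an 8-bit immediate field
        return ("C24.VCRWI.L", 24)   # compressed form, one byte shorter
    return ("VCRWI.L", 32)           # non-compressed, 16-bit immediate field

form, size = select_form(5)          # the text's example immediate of 5
```

For the example immediate of 5, the compressed 24-bit form is selected, saving one byte per occurrence.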
- FIG. 4 shows a pair of 32-bit vector instructions (VCRWI.L and VCRWI.H) and an illustrative encoding that provides a corresponding compressed pair of 24-bit vector instructions (C24. VCRWI.L and C24. VCRWI.H).
- FIG. 5 shows a pair of non-compressed 32-bit vector instructions (VMAC and VMSUB) and an illustrative encoding that provides a corresponding compressed pair of 24-bit vector instructions (C24.VMAC and C24.VMSUB). It will be appreciated that any given vector instruction set may have other opportunities to compress vector instructions in accordance with the principles disclosed herein.
- FIG. 6 is a flow chart showing an illustrative method 600 of converting, i.e., transforming, compressed vector instructions into corresponding non-compressed counterpart 32-bit vector instructions, and of identifying vector instructions that are already in non-compressed format and can therefore bypass the instruction converter process and be transferred directly to an instruction queue.
- method 600 is illustrated in a multi-step flowchart, it will be appreciated that this logic may be implemented in hardware logic circuitry so that the decoding of the various bit fields can take place in one step.
- At block 602, it is determined whether bit field [1:0] of a vector instruction is equal to 11. If equal to 11, then the vector instruction is not a compressed vector instruction, and at block 604, it can be routed to the instruction queue by way of the instruction converter bypass path. If bit field [1:0] is not equal to 11, then the vector instruction is a compressed vector instruction, and at block 606, it is determined whether bit field [1:0] equals 00. If bit field [1:0] is equal to 00, then at block 608, it is determined whether bit field [15:13] is equal to 100. If bit field [15:13] is not equal to 100, then at block 610, further processing in accordance with this disclosure is terminated.
- If bit field [15:13] is equal to 100, then at block 612, it is determined whether bit field [10] of the vector instruction is equal to 0. If not equal to 0, then at block 614, it is determined whether bit field [9] of the compressed vector instruction equals 0. If bit field [9] does equal 0, then the compressed vector instruction is C16.X2VCR, and at block 616, the compressed vector instruction is converted back to a non-compressed X2VCR instruction, which is then transferred to the instruction queue. If bit field [9] does not equal 0, then the compressed vector instruction is C16.VCR2X, and at block 618, the compressed vector instruction is converted back to a non-compressed VCR2X instruction, which is then transferred to the instruction queue.
- If bit field [10] is equal to 0, it is then determined whether bit field [8] of the compressed vector instruction is equal to 0. If bit field [8] is not equal to 0, then at block 622, it is determined whether bit field [9] is equal to 0. If bit field [9] is equal to 0, then at block 624, it is determined whether bit field [21] equals 0. If bit field [21] does equal 0, then the compressed vector instruction is C24.VMAC, and at block 626, the compressed vector instruction is converted back to a non-compressed VMAC instruction, which is then transferred to the instruction queue. However, if bit field [21] does not equal 0, then the compressed vector instruction is C24.VMSUB, and at block 628, the compressed vector instruction is converted back to a non-compressed VMSUB instruction, which is then transferred to the instruction queue.
- If bit field [9] does not equal 0, it is then determined whether bit field [22] is equal to 0. If bit field [22] is equal to 0, then the compressed vector instruction is C24.VCRWI.L, and at block 632, the compressed vector instruction is converted back to a non-compressed VCRWI.L instruction, which is then transferred to the instruction queue. However, if bit field [22] is not equal to 0, then the compressed vector instruction is C24.VCRWI.H, and at block 634, the compressed vector instruction is converted back to a non-compressed VCRWI.H instruction, which is then transferred to the instruction queue.
- If bit field [8] is equal to 0, it is then determined whether bit field [9] is equal to 0. If bit field [9] is equal to 0, then the compressed vector instruction is C16.VG2X, and at block 638, the compressed vector instruction is converted back to a non-compressed VG2X instruction, which is then transferred to the instruction queue. However, if bit field [9] does not equal 0, then the compressed vector instruction is C16.X2VG, and at block 640, the compressed vector instruction is converted back to a non-compressed X2VG instruction, which is then transferred to the instruction queue.
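The decision tree of blocks 602 through 640 can be modeled in software as follows. The bit positions and mnemonics come from the text; the bit-extraction helper, the return strings, and the treatment of bit field [1:0] equal to 11 as non-compressed (the usual RISC-V convention) are reconstructions, and as noted above the real logic decodes all fields in a single hardware step:

```python
# Software model of the FIG. 6 decode tree.

def bits(word, hi, lo=None):
    """Extract bit field [hi:lo] (or the single bit hi) from an integer."""
    lo = hi if lo is None else lo
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

def classify(insn):
    if bits(insn, 1, 0) == 0b11:
        return "non-compressed (bypass path)"                       # block 604
    if bits(insn, 1, 0) != 0b00 or bits(insn, 15, 13) != 0b100:
        return "unhandled"                                          # block 610
    if bits(insn, 10) != 0:                                         # block 612
        return "C16.VCR2X" if bits(insn, 9) else "C16.X2VCR"        # 614-618
    if bits(insn, 8) != 0:
        if bits(insn, 9) == 0:
            return "C24.VMSUB" if bits(insn, 21) else "C24.VMAC"    # 624-628
        return "C24.VCRWI.H" if bits(insn, 22) else "C24.VCRWI.L"   # 632-634
    return "C16.X2VG" if bits(insn, 9) else "C16.VG2X"              # 638-640
```

In hardware, each leaf of this tree would additionally drive the corresponding conversion back to the 32-bit counterpart instruction before the result is transferred to the instruction queue.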
- FIG. 7 is a high-level block diagram of a portion of a processor 700.
- FIG. 7 shows a vector control unit (VCU) 702 having an instruction fetch unit 704 and an instruction decode unit 706, and illustrates the datapath for vector instructions between an instruction fetch unit 704, an instruction decoder 706 including decision hardware for determining whether a fetched instruction is compressed, and an instruction queue 710.
- VCU vector control unit
- the datapath between instruction decoder 706 and instruction queue 710 includes an instruction converter 708.
- the datapath between instruction decoder 706 and instruction queue 710 includes an instruction converter bypass path 712.
- compressed vector instructions are transferred to instruction converter 708 for conversion to an equivalent non-compressed vector instruction.
- instruction converter 708 is configured to convert 16-bit compressed vector instructions to 32-bit non-compressed equivalent vector instructions; and to convert 24-bit compressed vector instructions to 32-bit non-compressed equivalent vector instructions.
- the mapping of certain non-compressed vector instructions to equivalent compressed vector instructions is shown in FIGs. 2-5; and the logic for converting compressed vector instructions to equivalent non-compressed vector instructions is described above and shown in FIG. 6.
- FIG. 7 also illustrates a bus 713 that couples instruction queue 710 to arithmetic logic unit (ALU) issue queue 714, load (LD) issue queue 716, store (ST) issue queue 718, coprocessor (COP) issue queue 720, and scalar issue queue 722.
- ALU arithmetic logic unit
- LD load
- ST store
- COP coprocessor
- vector instructions from instruction queue 710 are transferred to the desired issue queue over bus 713.
- the vector instructions in instruction queue 710 are all non-compressed instructions. It is noted that alternative embodiments may have different numbers of issue queues; by way of example and not limitation, a processor in accordance with this disclosure may have two load issue queues to support two load units that load data concurrently.
- Although not shown in FIG. 7, it will be recognized that the various issue queues are coupled to their respective functional units, e.g., computational units.
- ALU issue queue 714 is coupled to a vector ALU
- LD issue queue 716 is coupled to a vector load/store unit
- ST issue queue 718 is coupled to the vector load/store unit
- COP issue queue 720 is coupled to a vector co-processor unit
- scalar issue queue 722 is coupled to a scalar computation unit.
- FIGs. 8A and 8B illustrate certain advantages that various embodiments of this disclosure provide.
- FIG. 8A illustrates a code segment using non-compressed vector instructions
- FIG. 8B illustrates a functionally equivalent code segment using compressed vector instructions.
- FIG. 8A shows a segment of non-compressed assembly code using 32-bit vector instructions that requires 20 bytes of instruction memory
- FIG. 8B shows an illustrative segment of compressed assembly code, in accordance with this disclosure, that directs a vDSP to perform the same function as the assembly code of FIG. 8A, but only requires 13 bytes of instruction memory.
- the amount of instruction memory required by the compressed vector instruction version is reduced by 35% compared to the non-compressed vector instruction version.
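The 35% figure follows directly from the byte counts of FIGs. 8A and 8B:

```python
# The FIG. 8A/8B comparison in numbers: 20 bytes of 32-bit-only code
# versus 13 bytes of mixed compressed code.
non_compressed_bytes = 20
compressed_bytes = 13
savings = (non_compressed_bytes - compressed_bytes) / non_compressed_bytes
# (20 - 13) / 20 = 0.35, i.e., a 35% reduction in instruction memory
```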
- FIG. 9 is a high-level block diagram of a system 900 including a memory 901, and a portion of a processor 902 illustrating the datapath for vector instructions between a vector instruction fetch/vector instruction decode unit 904, and an instruction queue 910, an instruction converter 906 disposed in the datapath between the vector instruction fetch/vector instruction decode unit 904 and the instruction queue 910, an instruction converter bypass path 908, issue queues 914, 916, 918, 920, 922, a vector computation unit 924, and a vector register file 926.
- Processor 902 may be a vDSP.
- processor 902 may execute both vector and scalar instructions.
- processor 902 may be a vDSP that executes both vector and scalar instructions.
- vector instruction fetch/vector instruction decode unit 904 is coupled to instruction converter 906, and instruction converter bypass path 908.
- Instruction converter 906 is coupled to instruction queue 910.
- Instruction converter bypass path 908 is also coupled to instruction queue 910.
- Instruction queue 910 is coupled to bus 912, and bus 912 is coupled to issue queues 914, 916, 918, 920, 922.
- Issue queues 914, 916, 918, 920, 922 are coupled to vector computation unit 924.
- Vector computation unit 924 is coupled to vector register file 926. It is noted that alternative embodiments in accordance with this disclosure may have more or fewer issue queues than the number of issue queues illustrated in FIG. 9.
- memory 901 may be an external memory.
- Memory 901 in accordance with this disclosure, is configured to store at least non-compressed vector instructions and compressed vector instructions.
- each compressed vector instruction fetched from memory 901 by the instruction fetch logic of processor 902 may be converted to a corresponding non-compressed vector instruction prior to being transferred to instruction queue 910.
- each compressed vector instruction fetched from memory 901 by the instruction fetch logic of processor 902 may be converted to a corresponding non-compressed vector instruction prior to being issued to a vector computational unit 924 of processor 902.
- memory 901 is an instruction memory, i.e., a memory dedicated exclusively to storing program instructions.
- memory 901 may store program instructions in a first portion thereof, and store data in a second portion thereof.
- a portion of memory available to store data may be referred to as a data memory.
- the data memory may be accessible to a load/store unit of processor 902.
- FIG. 10 is a flow diagram of a method 1000 of executing vector instructions.
- Method 1000 includes, at a block 1002, fetching one or more vector instructions from an instruction memory.
- fetching of vector instructions is performed by an instruction fetch unit of a processor such as, but not limited to, a vDSP.
- the instruction memory may be an external memory, i.e., external to the processor. In alternative embodiments, the instruction memory may be integrated on the same substrate as the processor.
- Method 1000 continues, at a block 1004, by determining whether a first one of the one or more fetched vector instructions is a compressed vector instruction or a non-compressed vector instruction.
- Method 1000 continues, at a block 1006, by converting, if the first one of the one or more fetched vector instructions is a compressed vector instruction, the first one of the one or more fetched vector instructions to a corresponding non-compressed vector instruction, and transferring the corresponding non-compressed vector instruction to an instruction queue.
- Method 1000 further includes, at a block 1008, transferring, if the first one of the one or more fetched vector instructions is a non-compressed vector instruction, the first one of the one or more fetched vector instructions to the instruction queue via an instruction converter bypass path.
- the instruction converter bypass path is coupled between an instruction decoder of the processor and the instruction queue.
- the external memory referred to above may be implemented with any suitable non-transitory memory technology that meets the speed, capacity, and cost requirements of any particular design in accordance with this disclosure.
- the external memories for embodiments in accordance with this disclosure may be, but are not limited to, volatile memories such as but not limited to dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), and static random access memory (SRAM).
- the external memories for embodiments in accordance with this disclosure may be, but are not limited to, non-volatile memories that meet the speed, capacity, and cost requirements of the design.
- the external memories may be chiplets coupled to another integrated circuit, or the external memories may actually be embedded on the same die as the logic circuits that comprise an associated processor.
- Various integrated circuit embodiments of this disclosure may further include “low-power,” or “power-down” circuitry to reduce leakage currents that result in unwanted power consumption.
- one or more circuit blocks or subsystems of an integrated circuit in accordance with this disclosure may be configured to enter a low-power state responsive to one or more signals indicating that those circuit blocks or subsystems do not have work to do at the moment.
- a low-power state refers to an integrated circuit operating in a manner that reduces or eliminates unneeded power consumption.
- the low-power state may be achieved by any suitable circuit technique. In one illustrative circuit technique, the low-power state may be achieved by stopping the clock signal or signals to a circuit block or subsystem of the integrated circuit. Stopping the clock signal or signals may be appropriate if the circuit block is implemented with a static logic design in CMOS process technology.
- an appropriate approach may be to ensure that the period for which the clocks are stopped is less than the period of time for the charge on one or more dynamic nodes in the circuit block or subsystem to decay to a point that the logic state on the one or more dynamic nodes cannot be reliably maintained.
- Such charge decay may occur through mechanisms such as, e.g., leakage, or capacitive coupling to nearby circuit nodes.
- the low-power state may be achieved by decoupling the circuit block or subsystem from one or more power supply nodes.
- the circuit block or subsystem may have a transistor coupled in series between itself and a positive power supply node such that the circuit block or subsystem is decoupled from the positive power supply node when the transistor is turned off.
- the circuit block or subsystem may have a transistor coupled in series between itself and a ground supply node such that the circuit block or subsystem is decoupled from the ground supply node when the transistor is turned off.
- the circuit block or subsystem may have a first transistor coupled in series between itself and a positive power supply node, and a second transistor coupled in series between itself and a ground supply node such that the circuit block or subsystem is decoupled from the positive power supply node and from the ground supply node when both the first and second transistors are turned off.
- Power consumption is related to the magnitude of a circuit’s power supply voltage, and is also related to its operating frequency.
- Many integrated circuits, including but not limited to those designed for use in wireless communication, are designed to be used in products that rely, at least in part, on batteries for power. Since extending battery life is important to customers of such products, it is also important for designers to reduce the amount of power needed for the integrated circuits to perform their intended functions, and to provide hardware support for integrated circuits to enter one of one or more states in which unneeded power consumption is reduced or eliminated. Such states may be referred to as “low-power states,” “low-power modes,” “low-power operation,” or similarly descriptive terms. In further examples of low-power operation, it is possible to reduce power consumption in circuits that are unused or unneeded for a period of time, by reducing the magnitude of the circuit’s power supply voltage, reducing its operating frequency, or both.
- the processor further includes a plurality of issue queues, each issue queue coupled to the instruction queue, wherein the instruction queue is configured to receive non-compressed vector instructions.
- a processor includes an instruction fetch unit configured to fetch vector instructions, wherein the vector instructions include compressed vector instructions and non-compressed vector instructions, an instruction decoder unit coupled to the instruction fetch unit, configured to at least determine whether a fetched vector instruction is a compressed vector instruction or a non-compressed vector instruction, an instruction converter coupled to the instruction decoder unit, configured to receive one or more compressed vector instructions and convert the one or more compressed vector instructions to a corresponding one or more non-compressed vector instructions, an instruction converter bypass path, coupled to the instruction decoder unit, and an instruction queue coupled to the instruction decoder unit, and further coupled to the instruction converter bypass path.
- the compressed vector instructions comprise fewer bits than the non-compressed vector instructions.
- the processor further includes a plurality of issue queues, each issue queue coupled to the instruction queue, wherein the instruction queue is configured to receive non-compressed vector instructions.
- the instruction queue is configured to receive and to store a plurality of non-compressed vector instructions.
- the non-compressed vector instructions each have a fixed length.
- the compressed vector instructions are variable length vector instructions.
- the compressed vector instructions have fewer bits than the non-compressed vector instructions.
- the instruction converter is configured to determine whether a compressed vector instruction is a 2-byte compressed vector instruction or a 3-byte compressed vector instruction.
- the instruction converter is configured to convert a 2-byte compressed vector instruction to a corresponding 4-byte non-compressed vector instruction, and to convert a 3-byte compressed vector instruction to a corresponding 4-byte non-compressed vector instruction.
- the processor comprises a vector digital signal processor.
- the processor includes one or more chiplets.
- a method of executing vector instructions includes fetching one or more vector instructions from an instruction memory, determining whether a first one of the one or more fetched vector instructions is a compressed vector instruction or a non-compressed vector instruction, converting, if the first one of the one or more fetched vector instructions is a compressed vector instruction, the first one of the one or more fetched vector instructions to a corresponding non-compressed vector instruction, and transferring the corresponding non-compressed vector instruction to an instruction queue, and transferring, if the first one of the one or more fetched vector instructions is a non-compressed vector instruction, the first one of the one or more fetched vector instructions to the instruction queue via an instruction converter bypass path.
- converting the compressed vector instruction to a non-compressed vector instruction includes determining whether the compressed vector instruction comprises a first number of bits or a second number of bits. Because converting, for example, a 2-byte compressed vector instruction to a 4-byte non-compressed vector instruction is different from converting a 3-byte compressed vector instruction to a 4-byte non-compressed vector instruction, knowing the size of the compressed vector instruction may be used to inform the logic used for converting a compressed vector instruction to a non-compressed vector instruction.
- the method further includes transferring a non-compressed vector instruction from the instruction queue to an issue queue of a plurality of issue queues based, at least in part, on an op-code of the non-compressed vector instruction.
- the non-compressed vector instruction comprises 32 bits, a first compressed vector instruction comprises 16 bits, and a second compressed vector instruction comprises 24 bits.
- the plurality of issue queues comprises at least one ALU issue queue, at least one Load issue queue, and at least one Store issue queue.
- the plurality of issue queues may further include at least one Permutation issue queue, and/or at least one Coprocessor issue queue.
- a Scalar instruction queue may be included in the plurality of issue queues.
- a system includes an instruction memory, and a processor coupled to the instruction memory, wherein the processor includes an instruction fetch unit configured to fetch vector instructions, wherein the vector instructions include compressed vector instructions and non-compressed vector instructions, an instruction decoder unit, coupled to the instruction fetch unit, configured to at least determine whether a fetched vector instruction is a compressed vector instruction or a non-compressed vector instruction, an instruction converter, coupled to the instruction decoder unit, configured to receive one or more compressed vector instructions and convert the one or more compressed vector instructions to a corresponding one or more non-compressed vector instructions, an instruction converter bypass path, coupled to the instruction decoder unit, and an instruction queue coupled to the instruction decoder unit, and further coupled to the instruction converter bypass path.
- the instruction memory is a non-transitory memory.
- the processor is a vector digital signal processor.
- the system further includes a non-transitory data memory.
- the processor comprises one or more chiplets.
Abstract
A processor includes an instruction fetch unit configured to fetch compressed vector instructions and non-compressed vector instructions, an instruction decoder unit, coupled to the instruction fetch unit, configured to at least determine whether a fetched vector instruction is a compressed vector instruction or a non-compressed vector instruction, an instruction converter, coupled to the instruction decoder unit, configured to receive one or more compressed vector instructions and convert the one or more compressed vector instructions to a corresponding one or more non-compressed vector instructions, an instruction converter bypass path, coupled to the instruction decoder unit, and an instruction queue coupled to the instruction decoder unit, and further coupled to the instruction converter bypass path. The processor further includes a plurality of issue queues, each issue queue coupled to the instruction queue, wherein the instruction queue is configured to receive non-compressed vector instructions.
Description
COMPRESSED INSTRUCTION SET ARCHITECTURE FOR VECTOR DIGITAL SIGNAL PROCESSORS
BACKGROUND
[0001] Embodiments of the present disclosure relate to the instruction set architecture (ISA) of vector digital signal processors (vDSP).
[0002] Vector digital signal processors are used in many modern electronic products, for example, mobile communication products such as smartphones. Because mobile communication products commonly rely on batteries for power, it is important to reduce power consumption in these products to extend battery life.
SUMMARY
[0003] According to one aspect of this disclosure, a processor includes an instruction fetch unit configured to fetch vector instructions, wherein the vector instructions include compressed vector instructions and non-compressed vector instructions, an instruction decoder unit, coupled to the instruction fetch unit, configured to at least determine whether a fetched vector instruction is a compressed vector instruction or a non-compressed vector instruction, an instruction converter, coupled to the instruction decoder unit, configured to receive one or more compressed vector instructions and convert the one or more compressed vector instructions to a corresponding one or more non-compressed vector instructions, an instruction converter bypass path, coupled to the instruction decoder unit; and an instruction queue coupled to the instruction decoder unit, and further coupled to the instruction converter bypass path. According to this aspect, the compressed vector instructions comprise fewer bits than the non-compressed vector instructions.
[0004] According to another aspect, a method of executing vector instructions includes fetching one or more vector instructions from an instruction memory, determining whether a first one of the one or more fetched vector instructions is a compressed vector instruction or a non-compressed vector instruction, converting, if the first one of the one or more fetched vector instructions is a compressed vector instruction, the first one of the one or more fetched vector instructions to a corresponding non-compressed vector instruction, and transferring the corresponding non-compressed vector instruction to an instruction queue, and transferring, if the first one of the one or more fetched vector instructions is a non-compressed vector instruction, the first one of the one or more fetched vector instructions to the instruction queue via an instruction converter bypass path.
[0005] According to yet another aspect, a system includes an instruction memory, a processor coupled to the instruction memory, wherein the processor includes an instruction fetch unit configured to fetch vector instructions, wherein the vector instructions include compressed vector instructions and non-compressed vector instructions, an instruction decoder unit, coupled to the instruction fetch unit, configured to at least determine whether a fetched vector instruction is a compressed vector instruction or a non-compressed vector instruction, an instruction converter, coupled to the instruction decoder unit, configured to receive one or more compressed vector instructions and convert the one or more compressed vector instructions to a corresponding one or more non-compressed vector instructions, an instruction converter bypass path, coupled to the instruction decoder unit, and an instruction queue coupled to the instruction decoder unit, and further coupled to the instruction converter bypass path. According to this aspect, the instruction memory is a non-transitory memory.
[0006] These illustrative embodiments are mentioned not to limit or define the present disclosure, but to provide examples to aid understanding thereof. Additional embodiments are discussed in the Detailed Description, and further description is provided there.
BRIEF DESCRIPTION OF THE DRAWINGS
[0007] The accompanying drawings, which are incorporated herein and form a part of the specification, illustrate embodiments of the present disclosure and, together with the description, further serve to explain the principles of the present disclosure and to enable a person skilled in the pertinent art to make and use the present disclosure.
[0008] FIG. 1 is a high-level block diagram of a generalized vDSP architecture.
[0009] FIG. 2 shows a pair of 32-bit vector instructions and an illustrative encoding that provides a corresponding compressed pair of 16-bit vector instructions.
[0010] FIG. 3 shows a pair of 32-bit vector instructions and an illustrative encoding that provides a corresponding compressed pair of 16-bit vector instructions.
[0011] FIG. 4 shows a pair of 32-bit vector instructions and an illustrative encoding that provides a corresponding compressed pair of 24-bit vector instructions.
[0012] FIG. 5 shows a pair of non-compressed 32-bit vector instructions and an illustrative encoding that provides a corresponding compressed pair of 24-bit vector instructions.
[0013] FIG. 6 is a flow chart showing an illustrative method of converting, i.e., transforming, compressed vector instructions into corresponding non-compressed 32-bit vector
instructions.
[0014] FIG. 7 is a high-level block diagram of a portion of a processor illustrating the datapath for vector instructions between an instruction fetch unit and an instruction queue including decision hardware in the instruction decoder, and an instruction converter disposed in the data path between the instruction decoder and the instruction queue.
[0015] FIG. 8A shows a segment of non-compressed assembly code using 32-bit vector instructions, that requires 20 bytes of instruction memory.
[0016] FIG. 8B shows an illustrative segment of compressed assembly code, in accordance with this disclosure, that directs a vDSP to perform the same function as the assembly code of FIG. 8A, but only requires 13 bytes of instruction memory.
[0017] FIG. 9 is a high-level block diagram of a system including an instruction memory, and a portion of a processor illustrating the datapath for vector instructions between an instruction fetch unit and an instruction queue, an instruction converter disposed in the datapath between the instruction decoder and the instruction queue, an instruction converter bypass path, issue queues, a vector computation unit, and a vector register file.
[0018] FIG. 10 is a flow diagram of a method of executing vector instructions.
[0019] Various illustrative embodiments of this disclosure will be described with reference to the accompanying drawings.
DETAILED DESCRIPTION
[0020] Some embodiments, in accordance with this disclosure, provide an instruction set including non-compressed vector instructions, and compressed vector instructions that produce the same results as their non-compressed counterparts when fetched and executed by a processor in accordance with this disclosure. In some illustrative embodiments, non-compressed vector instructions are 32 bits (4 bytes) in length, and at least some of those have functionally equivalent counterpart vector instructions that are either 16 bits (2 bytes) or 24 bits (3 bytes) in length. The number of bits used for non-compressed and compressed vector instructions in this disclosure is not a limitation on the design of a vector instruction set, but is chosen for the purpose of illustration.
[0021] By providing counterpart compressed vector instructions for non-compressed vector instructions, the amount of memory required to store the vector instructions of a program may be reduced. In turn, the reduced memory requirements reduce the number of memory accesses needed to fetch the vector instructions from memory. And, consequently, the reduced
number of memory accesses reduces power consumption by an amount that would have been needed to perform the now unneeded memory accesses. In some embodiments, depending on the size of the program to be executed, the reduction of the amount of instruction memory due to the use of compressed vector instructions may result in the benefit of reducing the amount of physical memory that is required for storing a program, thus further reducing power consumption by having less memory to remain powered.
[0022] Various embodiments in accordance with this disclosure provide a processor, such as but not limited to a vector digital signal processor, that can fetch both non-compressed and compressed vector instructions and convert the fetched compressed vector instructions to equivalent counterpart non-compressed vector instructions prior to execution.
[0023] In accordance with this disclosure, at least some vector instructions of an original 32-bit vector instruction set used in a vDSP may be compressed into either 16-bit or 24-bit vector instructions. It is noted that, depending on the original vector instruction set, compression is not limited to 32-bit to 16-bit or 24-bit. Rather, compression may be from 64-bit or even longer vector instructions to 8-bit, 16-bit, 32-bit, 40-bit, 48-bit, and so on, with appropriate byte alignment.
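As a simple illustration of the byte-alignment constraint described above, a candidate compressed width can be screened with a check like the following. The helper name is hypothetical and not part of any disclosed embodiment:

```python
def valid_compressed_width(orig_bits: int, comp_bits: int) -> bool:
    """A compressed encoding should be byte-aligned and strictly shorter
    than the original instruction width."""
    return comp_bits % 8 == 0 and 0 < comp_bits < orig_bits

# 32-bit instructions may compress to 16 or 24 bits, but not to 12 bits
# (not byte-aligned) or 32 bits (no savings); 64-bit instructions may
# compress to 40 or 48 bits, among other byte-aligned widths.
```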
[0024] Although specific configurations and arrangements are discussed, it should be understood that this is done for illustrative purposes only. Various instruction formats/encodings disclosed herein are examples for illustrative purposes, and it will be understood that alternative compressed instruction formats/encodings in accordance with this disclosure may be used.
[0025] In alternative embodiments, the 16-bit/24-bit to 32-bit instruction conversion logic disclosed herein may be moved from between the vector control unit (VCU) instruction decoder (ID) and the instruction queue to before each individual issue queue, or even into each individual ID in a vector datapath unit (VDU).
[0026] A person skilled in the pertinent art will recognize that other configurations and arrangements can be used without departing from the spirit and scope of this disclosure. It will be apparent to a person skilled in the pertinent art that this disclosure can also be employed in a variety of other applications.
[0027] It is noted that references in the specification to “one embodiment,” “an embodiment,” “an example embodiment,” “some embodiments,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic.
Moreover, such phrases do not necessarily refer to the same embodiment. Further, when a particular feature, structure or characteristic is described in connection with an embodiment, it would be within the knowledge of a person skilled in the pertinent art to effect such feature, structure or characteristic in connection with other embodiments whether or not explicitly described.
[0028] In general, terminology may be understood at least in part from usage in context. For example, the term “one or more” as used herein, depending at least in part upon context, may be used to describe any feature, structure, or characteristic in a singular sense or may be used to describe combinations of features, structures or characteristics in a plural sense. Similarly, terms, such as “a,” “an,” or “the,” again, may be understood to convey a singular usage or to convey a plural usage, depending at least in part upon context. In addition, the term “based on” may be understood as not necessarily intended to convey an exclusive set of factors and may, instead, allow for existence of additional factors not necessarily expressly described, again, depending at least in part on context.
[0029] As used herein, the term “nominal/nominally” refers to a desired, or target, value of a characteristic or parameter for a component or a process operation, set during the design phase of a product or a process, together with a range of values above and/or below the desired value. The range of values can be due to slight variations in manufacturing processes or tolerances.
[0030] As used herein, the acronym “SoC” refers to System-on-a-Chip. An SoC is an integrated circuit (IC) with the circuits and subsystems to implement a particular system, such as, for example, a smartphone, smart watch, or other product of similar complexity.
[0031] Integrated circuits in accordance with this disclosure may be implemented in any suitable process technology including, but not limited to, complementary metal oxide semiconductor (CMOS). Such integrated circuits may be custom designed, synthesized, or a combination of both. And, embodiments of the architecture disclosed herein may be implemented on a single chip, or partitioned over one or more chips, or one or more die in a single package, or one or more chiplets in a single package, or any combination of these or similar integrated circuit and packaging configurations.
[0032] Further, embodiments of the architecture disclosed herein may be implemented with one or more chiplets. In some embodiments, chiplets may be disposed on an interposer in a single package. The interposer may be a silicon interposer. Various embodiments in accordance
with this disclosure may be implemented with any suitable packaging technology.
[0033] FIG. 1 introduces a general architecture of a vDSP system without the capability to execute both non-compressed and compressed vector instructions. More particularly, FIG. 1 is a high-level block diagram of a general vDSP system architecture 100, for illustrative purposes. While FIG. 1 is a general illustration showing a floating-point scalar computation unit and a fixed-point vector computation unit, it is noted that vDSP architectures may also include fixed-point scalar computation units, and floating-point vector computation units. As described later herein, various embodiments in accordance with this disclosure may be implemented in processors such as those having a vector digital signal processor architecture, thereby providing such processors with the capability to execute both compressed and non-compressed vector instructions.
[0034] Still referring to FIG. 1, vDSP system architecture 100 executes non-compressed vector instructions and non-compressed scalar instructions. A vDSP 102 is coupled to a memory 104. vDSP 102 includes a floating point scalar computation unit 106 coupled to a floating point scalar register file 108, and a fixed point vector computation unit 110 is coupled to a fixed point vector register file 112. vDSP 102 further includes a control unit 114 that is coupled to floating point scalar computation unit 106, floating point scalar register file 108, fixed point vector computation unit 110, and fixed point vector register file 112. Memory 104 may be an external memory, i.e., implemented as a chip or chiplet separate from the chip or chiplet(s) in which vDSP 102 is implemented. Alternatively, memory 104 may be implemented on a single chip with vDSP 102. Memory 104 may be implemented with any suitable circuitry or manufacturing process technology that meets the cost, performance, and capacity requirements of vDSP system architecture 100.
[0035] For convenience, illustrative examples in accordance with this disclosure are described in a format compatible with a RISC-V scalar instruction format. It is noted that “RISC-V” refers to an instruction set architecture specification. The name “RISC-V” represents the fifth major reduced instruction set computer (RISC) instruction set architecture (ISA) from the University of California at Berkeley.
[0036] FIGs. 2-5 show illustrative mappings, i.e., encodings, for transforming 32-bit non-compressed vector instructions to compressed 16-bit or 24-bit vector instructions.
[0037] In some illustrative embodiments, a processor, such as but not limited to a vDSP, has a set of 32 X registers, which are the registers in a vector control unit of a vDSP. A 5-bit
field is required in a non-compressed vector instruction to specify one of the thirty-two X registers (2^5 = 32). For some vector instructions having only two operands, one of those operands is an X register. By limiting the usage of X registers to X0-X7 instead of X0-X31, only a 3-bit field is required to specify one of eight X registers (2^3 = 8). It can be seen that by enforcing certain programming constraints, such as using less than all available X register addresses, fewer bits are needed to specify the operands in a vector instruction. A code example is shown below:
//move the value of X1 in VCU to VG0 in the scalar unit of VDU
VG2X VG0, X1 [5-bit addressing]
//can be compressed as: move the value of X1 to VG0
C16.VG2X VG0, X1 [3-bit addressing]
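The register-field constraint above reduces to a simple predicate: a VG2X-style instruction qualifies for the 16-bit form only when its X-register operand falls in X0-X7, so the index fits a 3-bit field. The helper names below are hypothetical, for illustration only:

```python
def fits_in_field(reg_index: int, field_bits: int) -> bool:
    """True if a register index can be encoded in a field of the given width."""
    return 0 <= reg_index < (1 << field_bits)

def can_compress_vg2x(x_reg_index: int) -> bool:
    # The 16-bit C16.VG2X form carries only a 3-bit X-register field (X0-X7),
    # versus the 5-bit field (X0-X31) of the full 32-bit encoding.
    return fits_in_field(x_reg_index, 3)
```

A compiler or assembler observing this constraint can emit the compressed form whenever the predicate holds, and fall back to the 32-bit encoding otherwise.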
[0038] The details of conversion from the original 32-bit vector instructions to the 16-bit compressed vector instructions can be seen in the table of FIG. 2. FIG. 2 shows a pair of 32-bit vector instructions and an illustrative encoding that provides a corresponding compressed pair of 16-bit vector instructions. For compressed vector instructions, bit field [1:0] of the instruction represents the major code of RISC-V. In this example, major code = 00 is used, and function code = 100 is used at bit field [15:13] of the instruction. As shown in FIG. 2, VG2X (moving from VG registers to X registers) and X2VG (moving from X registers to VG registers) are instructions that are widely used in many vDSPs, while C16.VG2X and C16.X2VG are compressed alternatives for VG2X and X2VG, respectively. The instruction size for both VG2X and X2VG is reduced from 32 bits to 16 bits, as shown in the illustrative example of FIG. 2.
[0039] Similar to the example above, 32-bit instructions for moving to/from vector control registers (VCR) from/to X registers may also be compressed with just a 2-bit register address field for the X registers, i.e., X0-X3. FIG. 3 shows a pair of 32-bit vector instructions and an illustrative encoding that provides a corresponding compressed pair of 16-bit vector instructions. The details of conversion from the original 32-bit instructions X2VCR (moving from X registers to VCR registers) and VCR2X (moving from VCR registers to X registers) to compressed vector instructions are shown in FIG. 3. The compressed vector instructions are C16.X2VCR and C16.VCR2X, and as shown, the vector instruction size has been reduced from 32 bits to 16 bits.
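The fixed fields of the 16-bit encodings described above (major code 00 in bits [1:0], function code 100 in bits [15:13]) can be sketched as bit manipulation. The payload layout below is a simplifying assumption; the figures define the actual field assignments:

```python
MAJOR_CODE = 0b00   # bits [1:0] of the compressed instruction
FUNC_CODE = 0b100   # bits [15:13] of the compressed instruction

def pack_c16(payload: int) -> int:
    """Place an 11-bit payload in bits [12:2] alongside the fixed codes."""
    assert 0 <= payload < (1 << 11)
    return (FUNC_CODE << 13) | (payload << 2) | MAJOR_CODE

def looks_like_c16(insn: int) -> bool:
    """Check the two fixed fields that mark this family of 16-bit encodings."""
    return (insn & 0b11) == MAJOR_CODE and ((insn >> 13) & 0b111) == FUNC_CODE
```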
[0040] Some vector instructions include a field in which an immediate value is specified. However, when the immediate value is small, not all the bits reserved for the immediate value field are needed to represent that small value. For example, assigning an immediate value of 5 to
the VCOP MODE register (vector co-processor mode register) may be performed by the original instruction VCRWI.L, which upon execution writes the 16-bit immediate value to the lower 16 bits of the vector control register. However, the VCOP MODE register only has 8 bits. Thus, only an 8-bit immediate field, rather than a 16-bit immediate field, is needed for writing to the VCOP MODE register, and a compressed version of this instruction can be created that is eight bits (one byte) shorter. For the example above, the non-compressed 32-bit vector instruction and its compressed 24-bit counterpart vector instruction may be written as follows:
VCRWI.L VCOP MODE, 5 (#IMM16) //VCOP MODE register only has 8 bits
C24.VCRWI.L VCOP MODE, 5 (#IMM8) //VCOP MODE register only has 8 bits
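The choice between the two forms above reduces to a range check on the immediate. The sketch below assumes an unsigned 8-bit immediate field in the 24-bit form, per the example; the helper names are hypothetical:

```python
def can_use_c24_vcrwi(imm: int) -> bool:
    """True if the immediate fits the 8-bit field of the 24-bit form
    (the VCOP MODE register is only 8 bits wide)."""
    return 0 <= imm < (1 << 8)

def bytes_saved(imm: int) -> int:
    # Choosing the 24-bit form over the 32-bit form saves one byte.
    return 1 if can_use_c24_vcrwi(imm) else 0
```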
[0041] In connection with the above example, FIG. 4 shows a pair of 32-bit vector instructions (VCRWI.L and VCRWI.H) and an illustrative encoding that provides a corresponding compressed pair of 24-bit vector instructions (C24. VCRWI.L and C24. VCRWI.H).
[0042] FIG. 5 shows a pair of non-compressed 32-bit vector instructions (VMAC and VMSUB) and an illustrative encoding that provides a corresponding compressed pair of 24-bit vector instructions (C24.VMAC and C24.VMSUB). It will be appreciated that any given vector instruction set may have other opportunities to compress vector instructions in accordance with the principles disclosed herein.
[0043] FIG. 6 is a flow chart showing an illustrative method 600 of converting, i.e., transforming, compressed vector instructions into corresponding non-compressed counterpart 32-bit vector instructions; and for identifying vector instructions that are already in non-compressed format and can therefore bypass the instruction conversion process and be transferred directly to an instruction queue. Although the logic of method 600 is illustrated in a multi-step flowchart, it will be appreciated that this logic may be implemented in hardware logic circuitry so that the decoding of the various bit fields can take place in one step.
[0044] Referring to FIG. 6, at a block 602 it is determined whether bit field [1:0] of a vector instruction is equal to 11. If equal to 11, then the vector instruction is not a compressed vector instruction, and at block 604, it can be routed to the instruction queue by way of the instruction converter bypass path. If bit field [1:0] is not equal to 11, then the vector instruction is a compressed vector instruction, and at block 606, it is determined whether bit field [1:0] equals 00. If bit field [1:0] is equal to 00, then at block 608, it is determined whether bit field [15:13] is equal to 100. If bit field [15:13] is not equal to 100, then at block 610, further processing in accordance with this disclosure is terminated. If bit field [15:13] is equal to 100, then at block 612, it is determined whether bit field [10] of the vector instruction is equal to 0. If not equal to 0, then at block 614, it is determined whether bit field [9] of the compressed vector instruction equals 0. If bit field [9] does equal 0, then the compressed vector instruction is C16.X2VCR, and at block 616, the compressed vector instruction is converted back to a non-compressed X2VCR instruction, which is then transferred to the instruction queue. If bit field [9] does not equal 0, then the compressed vector instruction is C16.VCR2X, and at block 618, the compressed vector instruction is converted back to a non-compressed VCR2X instruction, which is then transferred to the instruction queue.
[0045] Still referring to FIG. 6, at block 620, it is determined whether bit field [8] of the compressed vector instruction is equal to 0. If bit field [8] is not equal to 0, then at block 622, it is determined whether bit field [9] is equal to 0. If bit field [9] is equal to 0, then at block 624, it is determined whether bit field [21] equals 0. If bit field [21] does equal 0, then the compressed vector instruction is C24.VMAC, and at block 626, the compressed vector instruction is converted back to a non-compressed VMAC instruction, which is then transferred to the instruction queue. However, if bit field [21] does not equal 0, then the compressed vector instruction is C24.VMSUB, and at block 628, the compressed vector instruction is converted back to a non-compressed VMSUB instruction, which is then transferred to the instruction queue.
[0046] Still referring to FIG. 6, at block 622, if it is determined that bit field [9] does not equal 0, then at block 630, it is determined whether bit field [22] is equal to 0. If bit field [22] is equal to 0, then the compressed vector instruction is C24.VCRWI.L, and at block 632, the compressed vector instruction is converted back to a non-compressed VCRWI.L, which is then transferred to the instruction queue. However, if bit field [22] is not equal to 0, then the compressed vector instruction is C24.VCRWI.H, and at block 634, the compressed vector instruction is converted back to a non-compressed VCRWI.H instruction, which is then transferred to the instruction queue.
[0047] Still referring to FIG. 6, at block 620, if it is determined that bit field [8] is equal to 0, then at block 636, it is determined whether bit field [9] is equal to 0. If bit field [9] is equal to 0, then the compressed vector instruction is C16.VG2X, and at block 638, the compressed vector instruction is converted back to a non-compressed VG2X instruction, which is then transferred to the instruction queue. However, if bit field [9] does not equal 0, then the compressed vector instruction is C16.X2VG, and at block 640, the compressed vector instruction
is converted back to a non-compressed X2VG instruction, which is then transferred to the instruction queue.
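The bit-field dispatch described in paragraphs [0045]-[0047] (the subtree of FIG. 6 rooted at block 620) can be sketched as a small decode routine. This is an illustrative reconstruction rather than the actual decoder hardware: the helper `bit()` and the function name are assumptions, and only the tested bit positions and the mnemonics come from the text above.

```c
#include <stdint.h>

/* Extract bit n of an instruction word (illustrative helper). */
static int bit(uint32_t insn, int n) { return (int)((insn >> n) & 1u); }

/* FIG. 6, blocks 620-640: map bit fields [8], [9], [21], and [22] of a
 * compressed vector instruction to its mnemonic, following the text of
 * paragraphs [0045]-[0047]. */
static const char *classify_from_block_620(uint32_t insn)
{
    if (bit(insn, 8) == 0)                              /* block 636 */
        return bit(insn, 9) == 0 ? "C16.VG2X" : "C16.X2VG";
    if (bit(insn, 9) == 0)                              /* blocks 622, 624 */
        return bit(insn, 21) == 0 ? "C24.VMAC" : "C24.VMSUB";
    return bit(insn, 22) == 0 ? "C24.VCRWI.L" : "C24.VCRWI.H"; /* block 630 */
}
```

For example, a word with bits [8] and [9] set and bit [22] clear classifies as C24.VCRWI.L, matching paragraph [0046].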
[0048] FIG. 7 is a high-level block diagram of a portion of a processor 700. FIG. 7 shows a vector control unit (VCU) 702 having an instruction fetch unit 704 and an instruction decode unit 706, and illustrates the datapath for vector instructions between instruction fetch unit 704, instruction decoder 706, which includes decision hardware for determining whether a fetched instruction is compressed, and an instruction queue 710. For fetched vector instructions that are determined to be compressed vector instructions, the datapath between instruction decoder 706 and instruction queue 710 includes an instruction converter 708. For fetched vector instructions that are determined to be non-compressed vector instructions, the datapath between instruction decoder 706 and instruction queue 710 includes an instruction converter bypass path 712.
[0049] Still referring to FIG. 7, compressed vector instructions are transferred to instruction converter 708 for conversion to an equivalent non-compressed vector instruction. In this illustrative embodiment, and as indicated in FIG. 7, instruction converter 708 is configured to convert 16-bit compressed vector instructions to 32-bit non-compressed equivalent vector instructions; and to convert 24-bit compressed vector instructions to 32-bit non-compressed equivalent vector instructions. In this illustrative embodiment, the mapping of certain non-compressed vector instructions to equivalent compressed vector instructions is shown in FIGs. 2-5; and the logic for converting compressed vector instructions to equivalent non-compressed vector instructions is described above and shown in FIG. 6.
[0050] FIG. 7 also illustrates a bus 713 that couples instruction queue 710 to arithmetic logic unit (ALU) issue queue 714, load (LD) issue queue 716, store (ST) issue queue 718, coprocessor (COP) issue queue 720, and scalar issue queue 722. In this illustrative example, vector instructions from instruction queue 710 are transferred to the desired issue queue over bus 713. In this illustrative example, the vector instructions in instruction queue 710 are all non-compressed instructions. It is noted that alternative embodiments may have different numbers of issue queues; by way of example and not limitation, a processor in accordance with this disclosure may have two load issue queues to support two load units to load data concurrently.
[0051] Although not shown in FIG. 7, it will be recognized that the various issue queues are coupled to their respective functional units, e.g., computational units. For example, ALU issue queue 714 is coupled to a vector ALU, LD issue queue 716 is coupled to a vector load/store
unit, ST issue queue 718 is coupled to the vector load/store unit, COP issue queue 720 is coupled to a vector co-processor unit, and scalar issue queue 722 is coupled to a scalar computation unit. [0052] FIGs. 8A and 8B illustrate certain advantages that various embodiments of this disclosure provide. FIG. 8A illustrates a code segment using non-compressed vector instructions, and FIG. 8B illustrates a functionally equivalent code segment using compressed vector instructions. More particularly, FIG. 8A shows a segment of non-compressed assembly code using 32-bit vector instructions that requires 20 bytes of instruction memory; and FIG. 8B shows an illustrative segment of compressed assembly code, in accordance with this disclosure, that directs a vDSP to perform the same function as the assembly code of FIG. 8A, but only requires 13 bytes of instruction memory. In the illustrative example of FIGs. 8A and 8B, the amount of instruction memory required by the compressed vector instruction version is reduced by 35% compared to the non-compressed vector instruction version.
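The 35% figure can be checked directly. The trivial helper below (ours, not the disclosure's) computes the reduction from the two code-segment sizes given above.

```c
/* Percentage of instruction memory saved by the compressed encoding,
 * computed in integer arithmetic from the two code-segment sizes. */
static int memory_saving_pct(int noncompressed_bytes, int compressed_bytes)
{
    return 100 * (noncompressed_bytes - compressed_bytes) / noncompressed_bytes;
}
```

With the FIG. 8A/8B sizes, `memory_saving_pct(20, 13)` yields 35, i.e., (20 - 13) / 20 = 35%.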
[0053] FIG. 9 is a high-level block diagram of a system 900 including a memory 901 and a portion of a processor 902, illustrating the datapath for vector instructions between a vector instruction fetch/vector instruction decode unit 904 and an instruction queue 910, an instruction converter 906 disposed in the datapath between the vector instruction fetch/vector instruction decode unit 904 and the instruction queue 910, an instruction converter bypass path 908, issue queues 914, 916, 918, 920, 922, a vector computation unit 924, and a vector register file 926. Processor 902 may be a vDSP. In some embodiments, processor 902 may execute both vector and scalar instructions. In some embodiments, processor 902 may be a vDSP that executes both vector and scalar instructions.
[0054] Still referring to FIG. 9, vector instruction fetch/vector instruction decode unit 904 is coupled to instruction converter 906, and instruction converter bypass path 908. Instruction converter 906 is coupled to instruction queue 910. Instruction converter bypass path 908 is also coupled to instruction queue 910. Instruction queue 910 is coupled to bus 912, and bus 912 is coupled to issue queues 914, 916, 918, 920, 922. Issue queues 914, 916, 918, 920, 922 are coupled to vector computation unit 924. Vector computation unit 924 is coupled to vector register file 926. It is noted that alternative embodiments in accordance with this disclosure may have more or fewer issue queues than the number of issue queues illustrated in FIG. 9.
[0055] Still referring to FIG. 9, memory 901 may be an external memory. Memory 901, in accordance with this disclosure, is configured to store at least non-compressed vector instructions and compressed vector instructions. In some embodiments, each compressed vector
instruction fetched from memory 901 by the instruction fetch logic of processor 902 may be converted to a corresponding non-compressed vector instruction prior to being transferred to instruction queue 910. In some alternative embodiments, each compressed vector instruction fetched from memory 901 by the instruction fetch logic of processor 902 may be converted to a corresponding non-compressed vector instruction prior to being issued to the vector computation unit 924 of processor 902.
[0056] In some embodiments, memory 901 is an instruction memory, i.e., a memory dedicated exclusively to storing program instructions. In other embodiments, memory 901 may store program instructions in a first portion thereof, and store data in a second portion thereof. A portion of memory available to store data may be referred to as a data memory. The data memory may be accessible to a Load/Store unit of processor 902.
[0057] FIG. 10 is a flow diagram of a method 1000 of executing vector instructions. Method 1000 includes, at a block 1002, fetching one or more vector instructions from an instruction memory. In this illustrative embodiment, fetching of vector instructions is performed by an instruction fetch unit of a processor such as, but not limited to, a vDSP. The instruction memory may be an external memory, i.e., external to the processor. In alternative embodiments, the instruction memory may be integrated on the same substrate as the processor. Method 1000 continues, at a block 1004, by determining whether a first one of the one or more fetched vector instructions is a compressed vector instruction or a non-compressed vector instruction. Method 1000 continues, at a block 1006, by converting, if the first one of the one or more fetched vector instructions is a compressed vector instruction, the first one of the one or more fetched vector instructions to a corresponding non-compressed vector instruction, and transferring the corresponding non-compressed vector instruction to an instruction queue. Method 1000 further includes, at a block 1008, transferring, if the first one of the one or more fetched vector instructions is a non-compressed vector instruction, the first one of the one or more fetched vector instructions to the instruction queue via an instruction converter bypass path. In various embodiments, the instruction converter bypass path is coupled between an instruction decoder of the processor and the instruction queue.
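The routing performed by method 1000 can be sketched as a simple loop. All names and types here are illustrative assumptions, and the conversion step is a placeholder, since the real widening is instruction-specific (the mappings of FIGs. 2-5).

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical fetched-instruction record: the raw word plus the
 * decoder's compressed/non-compressed determination (block 1004). */
typedef struct { uint32_t word; int is_compressed; } fetched_insn_t;

enum { QUEUE_CAP = 64 };
typedef struct { uint32_t insn[QUEUE_CAP]; size_t count; } insn_queue_t;

/* Placeholder converter: the actual widening is instruction-specific;
 * here we only mark the word so the two paths are distinguishable. */
static uint32_t convert_to_noncompressed(uint32_t compressed)
{
    return compressed | 0x80000000u;
}

/* Blocks 1002-1008: route each instruction either through the converter
 * (block 1006) or along the bypass path (block 1008) into the queue. */
static void route_to_queue(insn_queue_t *q,
                           const fetched_insn_t *fetched, size_t n)
{
    for (size_t i = 0; i < n && q->count < QUEUE_CAP; i++) {
        q->insn[q->count++] = fetched[i].is_compressed
            ? convert_to_noncompressed(fetched[i].word)
            : fetched[i].word;
    }
}
```

Note that in both branches the queue only ever receives non-compressed (converted or pass-through) words, matching paragraph [0050].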
[0058] The external memory referred to above may be implemented with any suitable non-transitory memory technology that meets the speed, capacity, and cost requirements of any particular design in accordance with this disclosure. Thus the external memories for embodiments in accordance with this disclosure may be, but are not limited to, volatile memories such as dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), and static random access memory (SRAM). Likewise, the external memories for embodiments in accordance with this disclosure may be, but are not limited to, non-volatile memories that meet the speed, capacity, and cost requirements of the design. The external memories may be chiplets coupled to another integrated circuit, or they may be embedded on the same die as the logic circuits that comprise an associated processor.
[0059] Various integrated circuit embodiments of this disclosure may further include “low-power,” or “power-down” circuitry to reduce leakage currents that result in unwanted power consumption.
[0060] In some embodiments, one or more circuit blocks or subsystems of an integrated circuit in accordance with this disclosure may be configured to enter a low-power state responsive to one or more signals indicating that those circuit blocks or subsystems do not have work to do at the moment. A low-power state refers to an integrated circuit operating in a manner that reduces or eliminates unneeded power consumption. The low-power state may be achieved by any suitable circuit technique. In one illustrative circuit technique, the low-power state may be achieved by stopping the clock signal or signals to a circuit block or subsystem of the integrated circuit. Stopping the clock signal or signals may be appropriate if the circuit block is implemented with a static logic design in CMOS process technology. However, since static logic design may consume more chip area than dynamic logic design, stopping the clocks may still be an appropriate approach for a dynamic logic design provided that the period for which the clocks are stopped is less than the time required for the charge on one or more dynamic nodes in the circuit block or subsystem to decay to a point at which the logic state on the one or more dynamic nodes cannot be reliably maintained. Such charge decay may occur through mechanisms such as, e.g., leakage, or capacitive coupling to nearby circuit nodes. Alternatively, the low-power state may be achieved by decoupling the circuit block or subsystem from one or more power supply nodes. For example, the circuit block or subsystem may have a transistor coupled in series between itself and a positive power supply node such that the circuit block or subsystem is decoupled from the positive power supply node when the transistor is turned off.
In another example, the circuit block or subsystem may have a transistor coupled in series between itself and a ground supply node such that the circuit block or subsystem is decoupled from the ground supply node when the transistor is turned off. In still another example, the circuit block
or subsystem may have a first transistor coupled in series between itself and a positive power supply node, and a second transistor coupled in series between itself and a ground supply node such that the circuit block or subsystem is decoupled from the positive power supply node and from the ground supply node when both the first and second transistors are turned off.
[0061] Power consumption is related to the magnitude of a circuit’s power supply voltage, and is also related to its operating frequency. Many integrated circuits, including but not limited to those designed for use in wireless communication, are designed to be used in products that rely, at least in part, on batteries for power. Since extending battery life is important to customers of such products, it is also important for designers to reduce the amount of power needed for the integrated circuits to perform their intended functions, and to provide hardware support for integrated circuits to enter one of one or more states in which unneeded power consumption is reduced or eliminated. Such states may be referred to as “low-power states,” “low-power modes,” “low-power operation,” or similarly descriptive terms. In further examples of low-power operation, it is possible to reduce power consumption in circuits that are unused or unneeded for a period of time, by reducing the magnitude of the circuit’s power supply voltage, reducing its operating frequency, or both.
[0063] In one embodiment, a processor includes an instruction fetch unit configured to fetch vector instructions, wherein the vector instructions include compressed vector instructions and non-compressed vector instructions, an instruction decoder unit coupled to the instruction fetch unit, configured to at least determine whether a fetched vector instruction is a compressed vector instruction or a non-compressed vector instruction, an instruction converter coupled to the instruction decoder unit, configured to receive one or more compressed vector instructions and convert the one or more compressed vector instructions to a corresponding one or more non-compressed vector instructions, an instruction converter bypass path, coupled to the instruction decoder unit, and an instruction queue coupled to the instruction decoder unit, and further coupled to the instruction converter bypass path. In this embodiment, the compressed vector instructions comprise fewer bits than the non-compressed vector instructions.
[0064] In some embodiments, the processor further includes a plurality of issue queues, each issue queue coupled to the instruction queue, wherein the instruction queue is configured to
receive non-compressed vector instructions.
[0065] In some embodiments, the instruction queue is configured to receive and to store a plurality of non-compressed vector instructions.
[0066] In some embodiments, the non-compressed vector instructions each have a fixed length.
[0067] In some embodiments, the compressed vector instructions are variable length vector instructions.
[0068] In some embodiments, the compressed vector instructions have fewer bits than the non-compressed vector instructions.
[0069] In some embodiments, the instruction converter is configured to determine whether a compressed vector instruction is a 2-byte compressed vector instruction or a 3-byte compressed vector instruction.
[0070] In some embodiments, the instruction converter is configured to convert a 2-byte compressed vector instruction to a corresponding 4-byte non-compressed vector instruction, and to convert a 3-byte compressed vector instruction to a corresponding 4-byte non-compressed vector instruction.
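The 2-byte and 3-byte conversion paths differ in how many fetched bytes feed the expansion. The sketch below shows only that size-dependent gathering step; the subsequent field rearrangement (defined by the mapping tables of FIGs. 2-5) is omitted, and the little-endian byte order is an assumption.

```c
#include <stdint.h>

/* Gather a 2-byte or 3-byte compressed instruction into a single
 * 32-bit word, ready for expansion into the 4-byte format.
 * (Little-endian byte order assumed; a real converter would next
 * rearrange fields per the instruction's mapping table.) */
static uint32_t gather_compressed(const uint8_t *bytes, int size_bytes)
{
    uint32_t packed = 0;
    for (int i = 0; i < size_bytes; i++)
        packed |= (uint32_t)bytes[i] << (8 * i);
    return packed;
}
```
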
[0071] In some embodiments, the processor comprises a vector digital signal processor.
[0072] In some embodiments, the processor includes one or more chiplets.
[0073] In another embodiment, a method of executing vector instructions includes fetching one or more vector instructions from an instruction memory, determining whether a first one of the one or more fetched vector instructions is a compressed vector instruction or a non-compressed vector instruction, converting, if the first one of the one or more fetched vector instructions is a compressed vector instruction, the first one of the one or more fetched vector instructions to a corresponding non-compressed vector instruction, and transferring the corresponding non-compressed vector instruction to an instruction queue, and transferring, if the first one of the one or more fetched vector instructions is a non-compressed vector instruction, the first one of the one or more fetched vector instructions to the instruction queue via an instruction converter bypass path.
[0074] In some embodiments, converting the compressed vector instruction to a non-compressed vector instruction includes determining whether the compressed vector instruction comprises a first number of bits or a second number of bits. Because converting, for example, a 2-byte compressed vector instruction to a 4-byte non-compressed vector instruction is different
from converting a 3-byte compressed vector instruction to a 4-byte non-compressed vector instruction, knowing the size of the compressed vector instruction may be used to inform the logic used for converting a compressed vector instruction to a non-compressed vector instruction. [0075] In some embodiments, the method further includes transferring a non-compressed vector instruction from the instruction queue to an issue queue of a plurality of issue queues based, at least in part, on an op-code of the non-compressed vector instruction.
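The op-code-based routing of paragraph [0075] can be sketched as a selector over the issue queues of FIG. 7. Which op-codes map to which queue is not specified by the disclosure, so the numeric class values below are purely illustrative assumptions.

```c
/* Issue queues from FIG. 7; the op-code class codes are invented
 * for illustration only. */
typedef enum { Q_ALU, Q_LOAD, Q_STORE, Q_COP, Q_SCALAR } issue_queue_t;

static issue_queue_t select_issue_queue(unsigned opcode_class)
{
    switch (opcode_class) {
    case 0u:  return Q_ALU;     /* vector arithmetic   */
    case 1u:  return Q_LOAD;    /* vector loads        */
    case 2u:  return Q_STORE;   /* vector stores       */
    case 3u:  return Q_COP;     /* co-processor ops    */
    default:  return Q_SCALAR;  /* scalar instructions */
    }
}
```
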
[0076] In some embodiments, the non-compressed vector instruction comprises 32 bits, a first compressed vector instruction comprises 16 bits, and a second compressed vector instruction comprises 24 bits.
[0077] In some embodiments, the plurality of issue queues comprises at least one ALU issue queue, at least one Load issue queue, and at least one Store issue queue. In some embodiments the plurality of issue queues may further include at least one Permutation issue queue, and/or at least one Coprocessor issue queue. In various embodiments that include executing scalar instructions, a Scalar instruction queue may be included in the plurality of issue queues.
[0078] In a further embodiment, a system includes an instruction memory, and a processor coupled to the instruction memory, wherein the processor includes an instruction fetch unit configured to fetch vector instructions, wherein the vector instructions include compressed vector instructions and non-compressed vector instructions, an instruction decoder unit, coupled to the instruction fetch unit, configured to at least determine whether a fetched vector instruction is a compressed vector instruction or a non-compressed vector instruction, an instruction converter, coupled to the instruction decoder unit, configured to receive one or more compressed vector instructions and convert the one or more compressed vector instructions to a corresponding one or more non-compressed vector instructions, an instruction converter bypass path, coupled to the instruction decoder unit, and an instruction queue coupled to the instruction decoder unit, and further coupled to the instruction converter bypass path. In such an embodiment, the instruction memory is a non-transitory memory.
[0079] In some embodiments of the system, the processor is a vector digital signal processor.
[0080] In some embodiments, the system further includes a non-transitory data memory.
[0081] In some embodiments, the processor comprises one or more chiplets.
[0082] The foregoing description of the specific embodiments will so reveal the general
nature of the present disclosure that others can, by applying knowledge within the skill of the art, readily modify and/or adapt for various applications of such specific embodiments, without undue experimentation, and without departing from the general concept of the present disclosure. Therefore, such adaptations and modifications are intended to be within the meaning and range of equivalents of the disclosed embodiments, based on the teaching and guidance presented herein. It is to be understood that the phraseology or terminology herein is for the purpose of description and not of limitation, such that the terminology or phraseology of the present specification is to be interpreted by the skilled artisan in light of the teachings and guidance.
[0083] Embodiments of the present disclosure have been described above with the aid of functional building blocks illustrating the embodiment of specified functions and relationships thereof. The boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed.
[0084] The Summary and Abstract sections may set forth one or more but not all embodiments of the present disclosure as contemplated by the inventor(s), and thus, are not intended to limit the present disclosure and the appended claims in any way.
[0085] The breadth and scope of the present disclosure should not be limited by any of the above-described illustrative embodiments, but should be defined only in accordance with the subjoined claims and their equivalents.
Claims
1. A processor, comprising: an instruction fetch unit configured to fetch vector instructions, wherein the vector instructions include compressed vector instructions and non-compressed vector instructions; an instruction decoder unit, coupled to the instruction fetch unit, configured to at least determine whether a fetched vector instruction is a compressed vector instruction or a non-compressed vector instruction; an instruction converter, coupled to the instruction decoder unit, configured to receive one or more compressed vector instructions and convert the one or more compressed vector instructions to a corresponding one or more non-compressed vector instructions; an instruction converter bypass path, coupled to the instruction decoder unit; and an instruction queue coupled to the instruction decoder unit, and further coupled to the instruction converter bypass path, wherein the compressed vector instructions comprise fewer bits than the non-compressed vector instructions.
2. The processor of claim 1, further comprising: a plurality of issue queues, each issue queue coupled to the instruction queue, wherein the instruction queue is configured to receive non-compressed vector instructions.
3. The processor of claim 1, wherein the instruction queue is configured to receive and to store a plurality of non-compressed vector instructions.
4. The processor of claim 1, wherein the non-compressed vector instructions each have a fixed length.
5. The processor of claim 4, wherein the compressed vector instructions are variable length vector instructions.
6. The processor of claim 1, wherein the compressed vector instructions have fewer
bits than the non-compressed vector instructions.
7. The processor of claim 1, wherein the instruction converter is configured to determine whether a compressed vector instruction is a 2-byte compressed vector instruction or a 3-byte compressed vector instruction.
8. The processor of claim 1, wherein the instruction converter is configured to convert a 2-byte compressed vector instruction to a corresponding 4-byte non-compressed vector instruction, and to convert a 3-byte compressed vector instruction to a corresponding 4-byte non-compressed vector instruction.
9. The processor of claim 1, wherein the processor comprises a vector digital signal processor.
10. The processor of claim 1, wherein the processor comprises: one or more chiplets.
11. A method of executing vector instructions, comprising: fetching one or more vector instructions from an instruction memory; determining whether a first one of the one or more fetched vector instructions is a compressed vector instruction or a non-compressed vector instruction; converting, if the first one of the one or more fetched vector instructions is a compressed vector instruction, the first one of the one or more fetched vector instructions to a corresponding non-compressed vector instruction, and transferring the corresponding non-compressed vector instruction to an instruction queue; and transferring, if the first one of the one or more fetched vector instructions is a non-compressed vector instruction, the first one of the one or more fetched vector instructions to the instruction queue via an instruction converter bypass path.
12. The method of claim 11, wherein converting the compressed vector instruction to a non-compressed vector instruction comprises: determining whether the compressed vector instruction comprises a first number
of bits or a second number of bits.
13. The method of claim 12, further comprising: transferring a non-compressed vector instruction from the instruction queue to an issue queue of a plurality of issue queues based, at least in part, on an op-code of the non-compressed vector instruction.
14. The method of claim 13, wherein the non-compressed vector instruction comprises 32 bits, a first compressed vector instruction comprises 16 bits, and a second compressed vector instruction comprises 24 bits.
15. The method of claim 13, wherein the plurality of issue queues comprises at least one ALU issue queue, at least one Load issue queue, and at least one Store issue queue.
16. A system, comprising: an instruction memory; a processor coupled to the instruction memory, the processor comprising: an instruction fetch unit configured to fetch vector instructions, wherein the vector instructions include compressed vector instructions and non-compressed vector instructions; an instruction decoder unit, coupled to the instruction fetch unit, configured to at least determine whether a fetched vector instruction is a compressed vector instruction or a non-compressed vector instruction; an instruction converter, coupled to the instruction decoder unit, configured to receive one or more compressed vector instructions and convert the one or more compressed vector instructions to a corresponding one or more non-compressed vector instructions; an instruction converter bypass path, coupled to the instruction decoder unit; and an instruction queue coupled to the instruction decoder unit, and further coupled to the instruction converter bypass path, wherein the instruction memory is a non-transitory memory.
17. The system of claim 16, wherein the processor further comprises: a plurality of issue queues, each issue queue coupled to the instruction queue; and
a vector computation unit coupled to one or more issue queues of the plurality of issue queues; and a vector register file coupled to the vector computation unit.
18. The system of claim 17, wherein the processor is a vector digital signal processor.
19. The system of claim 16, further comprising a non-transitory data memory.
20. The system of claim 16, wherein the processor comprises one or more chiplets.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2023/027399 WO2025014480A1 (en) | 2023-07-11 | 2023-07-11 | Compressed instruction set architecture for vector digital signal processors |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025014480A1 true WO2025014480A1 (en) | 2025-01-16 |