CN103235717B

CN103235717B - Processor with polymorphic instruction set architecture

Info

Publication number: CN103235717B
Application number: CN201310139290.3A
Authority: CN
Inventors: 王东琳; 谢少林; 杨勇勇; 尹磊祖; 王磊; 刘子君; 汪涛; 张星
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Shanghai Silam Technology Co., Ltd.
Priority date: 2013-04-19
Filing date: 2013-04-19
Publication date: 2016-04-06
Anticipated expiration: 2033-04-19
Also published as: CN103235717A

Abstract

The present invention proposes a processor with a polymorphic instruction set architecture, comprising a scalar processing unit (101), at least one polymorphic instruction processing unit (100), at least one multi-granularity parallel memory (102), and a DMA controller (103). The polymorphic instruction processing unit (100) includes at least one functional unit (202); it interprets and executes polymorphic instructions, and its functional units (202) perform the specific data manipulation tasks. The scalar processing unit (101) invokes polymorphic instructions and queries their execution status. The DMA controller (103) transfers the configuration information of polymorphic instructions and transfers the data they require to the multi-granularity parallel memory (102). After the processor has been taped out, the programmer can still redefine its instruction set according to the characteristics of the application algorithm.

Description

Processor with Polymorphic Instruction Set Architecture

Technical field

The present invention relates to processor instruction set architectures and is closely tied to the definition of a processor's instruction set, the design of the processor architecture, and the implementation of the microarchitecture. In particular, it concerns a processor with a polymorphic instruction set architecture that can be dynamically reconfigured after tape-out.

Background

In recent years the Internet, cloud computing, and the Internet of Things have developed rapidly. Ubiquitous mobile devices, RFID tags, and wireless sensors generate information every second, and Internet services with hundreds of millions of users produce enormous volumes of information exchange. At the same time, users place high demands on the timeliness and effectiveness of information processing: in an online video-on-demand system, for example, users expect not only high-definition images but also decoding and display at 30 frames per second or more. How to process massive amounts of information efficiently and quickly therefore needs to be studied starting from an analysis of algorithm characteristics.

Broadly speaking, massive information processing exhibits five characteristics. First, the data volume is huge: the data generated by high-definition video, broadband communications, and high-precision sensors grows by a factor of 5 to 10 every year. Second, the computational load is huge: the computational complexity of information processing is usually a power K of the data size n, that is, O(n^K). Bubble sort, for example, has complexity O(n²) and the FFT has complexity O(n log n), so the required computation rises sharply as the data volume grows. Third, the algorithms are relatively regular: core algorithms such as one- and two-dimensional filtering, the FFT, and adaptive filtering can all be expressed with simple mathematical formulas and need no complicated logical decisions. Fourth, the data shows strong locality: there is no dependence between separate local data blocks, but the data inside a block is strongly interdependent. In a filtering algorithm, for example, each result depends only on the data within the filter template, and the data within the template must be processed several times before the final result is obtained; in a video codec, the data of one macroblock or of adjacent macroblocks undergoes complex operations to produce the final result, while macroblocks that are far apart have no data dependence. Fifth, the overall pattern of the processing algorithms stays essentially unchanged while their details keep evolving, as in the evolution of video coding standards from H.263 to H.264 and of communication protocols from 2G to 3G and then to LTE.

Massive information processing has its own performance requirements and application characteristics. Because the data volume and the computational load are huge, and most workloads must run in real time, the computing power of traditional scalar and superscalar processors falls far short of what is required. Power and size constraints also make it impossible to build such systems simply by stacking scalar processors. ASIC chips designed for massive information processing, for their part, have high design and development costs and long cycles; they are updated far more slowly than the algorithms evolve and cannot keep pace with the development of these systems. The current trend in chips for massive information processing is therefore to adapt traditional scalar and superscalar processors to these characteristics, or even to design entirely new domain-specific processors.

An "instruction" is a symbol defined by the designer that the processor can understand. By sending different instruction sequences to the processor, the programmer specifies what the processor does at each point in time. The set of all instructions a processor can understand is that processor's instruction set, and programmers implement algorithms using the instructions in it.

The instruction set of a typical processor is fixed, and the behavior of each instruction corresponds one-to-one with the processor implementation. For example, the arithmetic instruction "ADD R0, R1, R2" in the ARMv4T instruction set adds the values in registers R1 and R2 and writes the result to R0.

Once the instruction set is fixed, the programmer can neither add instructions to it nor redefine the behavior of existing instructions. The instructions of a typical processor are therefore fairly general, to preserve programming flexibility, but a general-purpose instruction set cannot implement certain special applications efficiently. Video coding, for example, frequently computes on 8-bit data; implementing such algorithms with a 32-bit addition instruction like the ARM "ADD R0, R1, R2" is very inefficient. For this reason processors usually extend their instruction sets for particular applications, such as the MMX instructions of the x86 instruction set for video and image processing and the NEON instructions of the ARM instruction set.
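The inefficiency of using a 32-bit add for 8-bit data can be illustrated with a minimal Python sketch. It contrasts a plain 32-bit addition with a lane-wise addition of four packed 8-bit values, which is the general idea behind SIMD extensions. The function name `packed_add_u8x4` and the wrap-around (modulo 256) behavior per lane are assumptions made for illustration; the sketch does not reproduce any specific MMX or NEON instruction.

```python
def packed_add_u8x4(a: int, b: int) -> int:
    """Add four unsigned 8-bit lanes packed into 32-bit words, wrapping per lane."""
    result = 0
    for lane in range(4):
        shift = 8 * lane
        lane_sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        # Drop the carry so it cannot spill into the neighbouring lane.
        result |= (lane_sum & 0xFF) << shift
    return result


pixels_a = 0x10FF2030  # four 8-bit samples packed into one word
pixels_b = 0x01020304

packed = packed_add_u8x4(pixels_a, pixels_b)   # lane-wise result
plain = (pixels_a + pixels_b) & 0xFFFFFFFF     # ordinary 32-bit add
```

With an ordinary 32-bit add, the carry out of the lane holding 0xFF corrupts the next lane, so a scalar processor has to handle the four samples one at a time; a packed add handles them in one operation.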

Extended instructions of this kind execute very efficiently for one class of applications but very inefficiently for others. Once a processor has been designed, the application domain it suits is therefore fixed, and it adapts poorly to other domains. Nor can programmers fine-tune the processor to the algorithm characteristics of other application domains.

Several patents already discuss how to implement reconfigurable computing. US2005/0027970A1 (Reconfigurable Instruction Set Computing) and US2005/0169550A1 (Video Processing System With Reconfigurable Instructions) adopt a structure combining a CPU with FPGA-like logic: the user develops in a single high-level language, and the compiler divides the program into a part that runs on the CPU and a part that runs on the FPGA. This approach can use the flexibility of the FPGA to speed up programs, but the overly flexible configuration of the FPGA results in a low performance-to-cost ratio for the chip. US2004/0019765A1 (Pipelined Reconfigurable Dynamic Instruction Set Processor) discusses a structure consisting of a RISC processor plus configurable array processing units, in which multiple array processing units are logically divided into several pipeline stages and the behavior of each stage is dynamically configured by the RISC processor. US2006/0211387A1 (Multistandard SDR Architecture Using Context-Based Operation Reconfigurable Instruction Set Processor) defines a structure consisting of a configuration unit plus coprocessors, in which each coprocessor is composed of a state control unit and a data path and is responsible for a group of similar processing tasks.

Summary of the invention

(1) Technical problem to be solved

The technical problem to be solved by the present invention is to provide a processor with a polymorphic instruction set architecture, so as to overcome the inability to redefine a processor's instruction set after tape-out.

(2) Technical solution

To solve the above technical problem, the present invention proposes a processor with a polymorphic instruction set architecture, comprising a scalar processing unit, at least one polymorphic instruction processing unit, at least one multi-granularity parallel memory, and a DMA controller. The polymorphic instruction processing unit includes at least one functional unit; it interprets and executes polymorphic instructions, and its functional units perform the specific data manipulation tasks. A polymorphic instruction is a sequence of microcode records executed consecutively, and a microcode record specifies the action each functional unit performs in a given clock cycle. The scalar processing unit invokes polymorphic instructions and queries their execution status. The DMA controller transfers the configuration information of polymorphic instructions and transfers the data they require to the multi-granularity parallel memory.

According to a specific embodiment of the present invention, the polymorphic instruction processing unit passively receives polymorphic instructions from the DMA controller and is invoked by the scalar processing unit.

According to a specific embodiment of the present invention, the scalar processing unit controls the polymorphic instruction processing unit through a first control path and controls the DMA controller through a second control path.

According to a specific embodiment of the present invention, the polymorphic instruction processing unit further includes a microcode memory and a microcode control unit; the microcode memory stores polymorphic instructions, and the microcode control unit receives control requests from the scalar processing unit through the first control path and performs the corresponding actions.

According to a specific embodiment of the present invention, the microcode control unit includes configuration registers that store the parameters and the running state required by the polymorphic instruction processing unit during operation.

According to a specific embodiment of the present invention, the control requests of the scalar processing unit include starting or querying the polymorphic instruction processing unit and reading or writing its configuration registers.

According to a specific embodiment of the present invention, the polymorphic instruction processing unit further includes a transfer control unit; the functional units have multiple data input/output ports and exchange data through the transfer control unit.

According to a specific embodiment of the present invention, the functional unit performs data load/store operations and reads and writes data in the multi-granularity parallel memory through a first internal bus; the microcode memory is connected to this first internal bus as a slave device and passively receives microcode records from outside.

According to a specific embodiment of the present invention, the microcode control unit reads and executes the microcode records of a polymorphic instruction in sequence.

According to a specific embodiment of the present invention, each row of the microcode memory stores one microcode record; when the scalar processing unit invokes a polymorphic instruction, it specifies only the row number in the microcode memory of the first microcode record of that instruction.

(3) Beneficial effects

After the processor with the polymorphic instruction set architecture of the present invention has been taped out and manufactured, the programmer can still redefine the processor's instruction set according to the characteristics of the application algorithm. The redefined instruction set architecture matches the characteristics of the application algorithm more closely, which improves the processor's performance in that class of applications. Redefinition modifies neither the processor hardware nor the software toolchain (assembler, linker, and so on), yet the instruction set architecture takes a different form for each set of instruction definitions.

Brief description of the drawings

Fig. 1 schematically shows the main components and interconnections of the processor with polymorphic instruction set architecture of the present invention;

Fig. 2 schematically shows the main components and interconnections of the polymorphic instruction execution unit of the present invention;

Fig. 3 schematically shows the main components of a microcode record of the present invention;

Fig. 4 schematically shows how the behavior of a polymorphic instruction is defined and how the microcode memory stores the definition of a polymorphic instruction;

Fig. 5 shows an example flow for defining and invoking polymorphic instructions according to the present invention;

Fig. 6 schematically shows the functional units in a processor with polymorphic instruction set architecture of the present invention;

Fig. 7 shows an example of the interface definition and internal structure of the computing unit used by the processor of the present invention;

Fig. 8 shows an example of the interface definition and internal structure of the bus interface unit used by the processor of the present invention;

Fig. 9 shows an example of the interface definition of the register file used by the processor of the present invention;

Fig. 10 shows an example of the definition of the data transfer paths between functional components in the processor of the present invention;

Fig. 11 shows an example implementation structure of the data transfer unit inside a computing unit in the processor of the present invention;

Fig. 12 shows an example implementation structure of the data transfer unit between functional components in the processor of the present invention;

Fig. 13 shows an example encoding of the functional components in the processor of the present invention;

Fig. 14 shows an example of the logical behavior of the multiplexer in the processor of the present invention.

Detailed description

To make the object, technical solution, and advantages of the present invention clearer, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.

The present invention proposes a processor whose polymorphic instruction set architecture can be dynamically reconfigured after tape-out (trial production).

The structure of the processor of the present invention is shown in FIG. 1. It mainly comprises a scalar processing unit 101, at least one polymorphic instruction processing unit 100, at least one multi-granularity parallel memory 102, and a DMA controller 103. The polymorphic instruction processing unit 100 includes at least one functional unit.

A polymorphic instruction is a sequence of microcode records executed consecutively, and the polymorphic instruction set is the collection of polymorphic instructions. A microcode record specifies the action each functional unit performs in a given clock cycle, such as performing an addition, performing a data load, or doing nothing.
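The definition above can be pictured as a small data structure. The following Python sketch is illustrative only: the class name `MicrocodeRecord`, the `NOP` marker, and the action strings are hypothetical, and the unit names BIU0 and IALU are borrowed from the embodiment described later in this document.

```python
from dataclasses import dataclass, field
from typing import Dict, List

NOP = "nop"  # hypothetical marker for "do nothing in this cycle"


@dataclass
class MicrocodeRecord:
    """Actions of all functional units during one clock cycle."""
    actions: Dict[str, str] = field(default_factory=dict)

    def action_for(self, unit: str) -> str:
        # A unit that is not mentioned in the record idles for this cycle.
        return self.actions.get(unit, NOP)


# A polymorphic instruction is an ordered sequence of microcode records.
PolymorphicInstruction = List[MicrocodeRecord]

load_then_add: PolymorphicInstruction = [
    MicrocodeRecord({"BIU0": "load"}),  # cycle 0: load data
    MicrocodeRecord({"IALU": "add"}),   # cycle 1: add
]
```

Because the programmer chooses which records make up a sequence, the same hardware can expose different instructions for different applications.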

The polymorphic instruction processing unit 100 interprets and executes polymorphic instructions, and the functional units it contains perform the specific data manipulation tasks. The scalar processing unit 101 invokes polymorphic instructions and queries their execution status. The DMA controller 103 transfers the configuration information of polymorphic instructions and transfers the data they require to the multi-granularity parallel memory 102.

The scalar processing unit 101 controls the polymorphic instruction processing unit 100 through a first control path 104 and controls the DMA controller 103 through a second control path 105. The DMA controller 103 transfers configuration information to the polymorphic instruction processing unit 100 through the first internal bus 106, transfers data to the multi-granularity parallel memory 102 through the second internal bus 107, and reads and writes external data through the bus 108. The polymorphic instruction processing unit 100 reads and writes data in the multi-granularity parallel memory 102 through the second internal bus 107.

The scalar processing unit 101 may be a RISC processor or a DSP, but it must have the first control path 104, and this control path 104 must provide the following functions:

1. Start the polymorphic instruction processing unit 100;

2. Query the execution status of the polymorphic instruction processing unit 100;

3. Read and write the configuration registers of the polymorphic instruction processing unit 100 (described below).

The multi-granularity parallel memory 102 is the multi-granularity parallel memory disclosed in Chinese patent application No. 201110460585.1 (titled "Multi-granularity parallel storage system and memory"). This memory supports parallel reading and writing of matrix row and column data of different data types.

The master device of the second internal bus 107 is the polymorphic instruction processing unit 100, and the slave device is the multi-granularity parallel memory 102. The DMA controller 103 and the polymorphic instruction processing unit 100 can read and write data in the multi-granularity parallel memory 102 through this second internal bus 107.

The master device of the first internal bus 106 is the DMA controller 103, and the slave device is the polymorphic instruction processing unit 100. The DMA controller 103 can write polymorphic instructions into the polymorphic instruction processing unit 100 through this first internal bus 106. The polymorphic instructions are stored in an external memory connected to the bus 108.

Polymorphic instruction processing unit

The polymorphic instruction processing unit 100 passively receives polymorphic instructions from the DMA controller 103 and is invoked by the scalar processing unit 101. FIG. 2 shows the internal structure of the polymorphic instruction processing unit 100.

The polymorphic instruction processing unit 100 includes a microcode memory 200, a microcode control unit 201, at least one functional unit 202, and a transfer control unit 203. The microcode memory 200 stores polymorphic instructions. The microcode control unit 201 receives the various control requests of the scalar processing unit 101 through the first control path 104 and performs the corresponding actions. The microcode control unit 201 includes configuration registers 207, which store the parameters and the running state required by the polymorphic instruction processing unit 100 during operation, for example which functional unit 202 executes the current polymorphic instruction, the start address and total length of the required data, and whether the polymorphic instruction processing unit 100 is currently idle.

The control requests from the scalar processing unit 101 include:

1. Start the polymorphic instruction processing unit 100: the microcode control unit 201 reads microcode records 300 from the microcode memory 200, generates the corresponding control information, and sends it to the functional units 202 and the transfer control unit 203.

2. Query the polymorphic instruction processing unit 100: the microcode control unit 201 returns the execution status of the current polymorphic instruction, either completed or idle.

3. Read or write a configuration register 207 of the polymorphic instruction processing unit 100: the microcode control unit 201 writes the specified data into the specified configuration register 207, or returns the data held in the specified configuration register 207.
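The three request types can be sketched as a toy software model. The Python below is a minimal sketch under stated assumptions, not a description of the actual hardware: the class and method names, the register name `data_start_address`, and the intermediate "running" state are all hypothetical (the text itself names only the "completed" and "idle" states).

```python
class MicrocodeControlUnitModel:
    """Toy model of the start, query, and configuration-register requests."""

    def __init__(self) -> None:
        self.config_registers = {}   # register name -> value (names are hypothetical)
        self.status = "idle"
        self.current_row = None

    def start(self, start_row: int) -> None:
        """Request 1: begin reading microcode records at start_row."""
        self.current_row = start_row
        self.status = "running"      # assumed intermediate state, not named in the text

    def finish(self) -> None:
        """Model hook: mark the current polymorphic instruction as done."""
        self.status = "completed"

    def query(self) -> str:
        """Request 2: report the execution status."""
        return self.status

    def write_config(self, name: str, value: int) -> None:
        """Request 3 (write): store a value in a configuration register."""
        self.config_registers[name] = value

    def read_config(self, name: str) -> int:
        """Request 3 (read): return the value of a configuration register."""
        return self.config_registers[name]


mcu = MicrocodeControlUnitModel()
mcu.write_config("data_start_address", 0x1000)
mcu.start(start_row=10)
mcu.finish()
```

In this model the scalar side would typically write the parameters first, issue the start request, and then poll `query()` until the instruction has completed.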

The polymorphic instruction processing unit 100 may be designed with one or more different functional units 202, depending on application requirements. A functional unit 202 performs a specific data manipulation task, such as an addition or a data load/store operation. A functional unit 202 generally has multiple data input/output ports and exchanges data through the transfer control unit 203. For example, after an addition unit completes an addition, it passes the sum to the transfer control unit 203, which then delivers it to a multiplication unit for a multiplication.

The transfer control unit 203 is connected to the data input/output ports of all functional units 202. Through the interface 206 it receives from the microcode control unit 201 the source and destination of the data at each point in time, and it delivers the data from the source to the destination.
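The routing role of the transfer control unit can be sketched in a few lines of Python. This is a simplified illustration: the function name `route`, the port naming scheme, and the representation of routing information as (source, destination) pairs are assumptions, not details given in the text.

```python
from typing import Dict, List, Tuple


def route(outputs: Dict[str, int], moves: List[Tuple[str, str]]) -> Dict[str, int]:
    """Deliver values from source output ports to destination input ports for one cycle."""
    inputs = {}
    for source, destination in moves:
        inputs[destination] = outputs[source]
    return inputs


# The adder has produced 7; the routing information for this cycle
# sends the adder's output to the first input of the multiplier.
outputs = {"adder.out": 7}
delivered = route(outputs, [("adder.out", "multiplier.in0")])
```

Since the (source, destination) pairs come from the microcode control unit each cycle, the data path between functional units is itself part of what a polymorphic instruction defines.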

The bus 107 is the first internal bus 107 of FIG. 1. Certain types of functional units 202 perform data load/store operations and read and write data in the multi-granularity parallel memory 102 through the first internal bus 107. The microcode memory 200 is also connected to the first internal bus 107, as a slave device, and passively receives microcode records 300 from outside.

Definition and invocation of polymorphic instructions

FIG. 3 shows the structure of a microcode record 300. The microcode record 300 is divided into multiple fields, and every functional unit has a corresponding field in the record; for example, the functional unit field 301 corresponds to the second functional unit. The microcode record 300 also contains a special microcode control field 302, which indicates which row of microcode records 300 the microcode control unit 201 must read in the next clock cycle.

As described above, a polymorphic instruction of the present invention is a sequence of microcode records 300 that are executed consecutively and together perform a specific function, as shown in FIG. 4. The polymorphic instruction, that is, the sequence of microcode records 300, is stored in the microcode memory 200 and is read and executed in order by the microcode control unit 201. Each row of the microcode memory 200 stores one microcode record 300. When the scalar processing unit 101 invokes a polymorphic instruction, it only needs to specify the row number in the microcode memory 200 of the first record of that instruction.
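The row-based storage and the role of the control field can be sketched as a small interpreter loop. The Python below is a simplified model under stated assumptions: the names `MicrocodeRow` and `execute` are hypothetical, and using `None` in the control field to mark the end of an instruction is an assumed convention, since the text does not say how the last record is marked.

```python
from typing import Dict, List, NamedTuple, Optional


class MicrocodeRow(NamedTuple):
    unit_fields: Dict[str, str]   # one entry per functional unit that acts this cycle
    next_row: Optional[int]       # control field; None is an assumed end marker


def execute(memory: List[MicrocodeRow], start_row: int) -> List[Dict[str, str]]:
    """Follow the control field from start_row and return the per-cycle actions."""
    trace = []
    row_index: Optional[int] = start_row
    while row_index is not None:
        row = memory[row_index]
        trace.append(row.unit_fields)
        row_index = row.next_row
    return trace


microcode_memory = [
    MicrocodeRow({"BIU0": "load"}, 1),       # row 0: first instruction starts here
    MicrocodeRow({"IALU": "add"}, None),     # row 1
    MicrocodeRow({"BIU1": "load"}, 3),       # row 2: second instruction starts here
    MicrocodeRow({"IMAC": "mac"}, 4),        # row 3
    MicrocodeRow({"BIU2": "store"}, None),   # row 4
]

# The caller names only the starting row, here row 2.
second_trace = execute(microcode_memory, start_row=2)
```

Two instructions share one memory in this example, and the caller selects between them purely by starting row, which mirrors how the scalar processing unit invokes a polymorphic instruction.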

Using microcode records 300, the programmer can flexibly define, according to the needs of the algorithm, both the behavior of a polymorphic instruction and its starting row number in the microcode memory. Figure 5 shows an exemplary flow for defining and calling polymorphic instructions. First, the programmer defines the behavior of one or more polymorphic instructions according to the application requirements and converts that behavior into a sequence of microcode records 300. The sequence is generally expressed as text, for example "ALU.T0=T1+T2(U)||Repeat(10)", which indicates that the ALU performs 10 addition operations. At the same time, the programmer writes scalar code that calls the polymorphic instructions so defined. At this point the starting row number of each polymorphic instruction has not yet been determined, so a label such as Instr1 is used in its place. After compiling and linking, the textual polymorphic instruction records become a binary file that the microcode control unit 201 can interpret. The starting address of each polymorphic instruction is also determined during compiling and linking; for example, the value of Instr1 is now fixed at 10. After the scalar code has been compiled and linked, it must additionally be cross-linked with the polymorphic instruction binary file, which replaces the symbolic starting addresses of polymorphic instructions in the original scalar code with their actual values and produces the scalar binary file. Before calling a polymorphic instruction, the scalar code uses the DMA controller 103 to load the contents of the polymorphic instruction binary file into the microcode memory, and then calls the polymorphic instruction.
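The label resolution and cross-linking steps of this flow can be sketched in Python. This is only an illustrative model, not the toolchain of the invention: the function names `link_microcode` and `cross_link`, the tuple form of the scalar code, the `"NOP"` placeholder records and the `Instr0` label are all assumptions introduced here. Only the label Instr1, its resolved value 10, and the textual record are taken from the description.

```python
# Hypothetical model of the define / link / cross-link flow of Figure 5.
# Microcode records are treated as opaque text; no real encoding is implied.

def link_microcode(instructions, base_row=0):
    """Lay out labelled microcode record sequences, one record per row.

    instructions: dict mapping a label to its list of textual records.
    Returns (memory_image, symbol_table), where symbol_table maps each
    label to the starting row of its polymorphic instruction.
    """
    image, symbols = [], {}
    row = base_row
    for label, records in instructions.items():
        symbols[label] = row      # starting row of this polymorphic instruction
        image.extend(records)     # each record occupies one microcode-memory row
        row += len(records)
    return image, symbols


def cross_link(scalar_code, symbols):
    """Replace symbolic start addresses in the scalar code with row numbers."""
    linked = []
    for op, arg in scalar_code:
        if op == "call_poly":     # hypothetical scalar operation name
            arg = symbols[arg]
        linked.append((op, arg))
    return linked


program = {
    "Instr0": ["NOP"] * 10,       # placeholder records (assumption)
    "Instr1": ["ALU.T0=T1+T2(U)||Repeat(10)"],
}
image, symbols = link_microcode(program)

# The scalar code first loads the microcode image by DMA, then calls by label.
scalar = [("dma_load_microcode", None), ("call_poly", "Instr1")]
linked = cross_link(scalar, symbols)
print(symbols["Instr1"], linked[1])
```

In this sketch Instr1 resolves to row 10 only because ten placeholder rows are laid out before it; a real linker would be free to place the instruction anywhere in the microcode memory.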

Embodiment of a Processor with Polymorphic Instruction Set Architecture

An exemplary embodiment of the polymorphic instruction set architecture is given below. This embodiment is only one way of implementing the present invention, and the invention is not limited to this example.

This embodiment is a processor with polymorphic instruction set architecture aimed at data-intensive applications. Figure 6 shows the functional units of the processor. As shown in Figure 6, all functional units have a data width of 512 bits; during data operations, the 512 bits can be treated as 64 8-bit, 32 16-bit, or 16 32-bit data elements. Among the functional units, IALU performs fixed-point logic operations, FALU performs floating-point logic operations, IMAC performs fixed-point multiply-accumulate operations, FMAC performs floating-point multiply-accumulate operations, and SHU0 and SHU1 perform data interleaving operations, that is, exchanging the positions of any two 8-bit elements within the 512-bit data. M is a 512-bit-wide register file. BIU0, BIU1, and BIU2 are bus interface units responsible for loading data from and storing data to the multi-granularity parallel memory 102.
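The three views of a 512-bit word, and the byte exchange performed by SHU0 and SHU1, can be modeled with a short Python sketch. The function names and the lane ordering (lane 0 in the least significant bits) are assumptions made for illustration; the description does not specify them.

```python
# Illustrative model of a 512-bit datum viewed as 8-, 16- or 32-bit lanes.
# Assumption: lane 0 occupies the least significant bits.

WIDTH_BITS = 512


def split_lanes(value, lane_bits):
    """Split a 512-bit integer into 64, 32 or 16 lanes of lane_bits each."""
    if lane_bits not in (8, 16, 32):
        raise ValueError("lane width must be 8, 16 or 32 bits")
    mask = (1 << lane_bits) - 1
    count = WIDTH_BITS // lane_bits
    return [(value >> (i * lane_bits)) & mask for i in range(count)]


def join_lanes(lanes, lane_bits):
    """Pack lanes back into a single 512-bit integer."""
    if len(lanes) * lane_bits != WIDTH_BITS:
        raise ValueError("lanes do not fill 512 bits")
    value = 0
    for i, lane in enumerate(lanes):
        if not 0 <= lane < (1 << lane_bits):
            raise ValueError("lane value out of range")
        value |= lane << (i * lane_bits)
    return value


def swap_bytes(value, i, j):
    """Exchange the positions of two 8-bit elements, as a shuffle unit might."""
    lanes = split_lanes(value, 8)
    lanes[i], lanes[j] = lanes[j], lanes[i]
    return join_lanes(lanes, 8)
```

For example, `len(split_lanes(0, 16))` gives 32 lanes, and `swap_bytes(word, 0, 63)` exchanges the first and last bytes of `word` while leaving the other 62 bytes in place.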

IALU, FALU, IMAC, FMAC, SHU0, and SHU1 have similar interfaces and are collectively referred to in this embodiment as the computing unit 500. The interface of the computing unit 500 is shown in Figure 7. It comprises four data input ports 604 and four corresponding temporary registers 600. The operation logic 601 reads data from the temporary registers and performs the operation; the result is written into the temporary register 602 and then transferred to the transmission control unit 203 through the output port 603.

BIU0, BIU1, and BIU2 are collectively referred to as the bus interface unit 501, whose internal structure is shown in Figure 8. It has a data input port 702, through which it obtains data from the transmission control unit 203 and writes the data into the temporary register 700; a data output port 703, through which the data in the temporary register 701 is transferred to the transmission control unit 203; an internal bus interface 107, through which data in the multi-granularity parallel memory 102 is read and written; and address calculation logic 704, which calculates the addresses sent to the second internal bus 107.

M is a 512-bit-wide register file with 4 write ports 800, 4 read ports 802, and the corresponding storage bank 801. Figure 9 illustrates the interface of this register file.

In the polymorphic instruction set architecture, the result computed by one functional unit can be transferred directly to other functional units, enabling cascaded operations. In this embodiment, direct data transfer paths are not needed between every pair of functional units. For example, FMAC mainly performs floating-point multiply-accumulate operations, so its results need not be transferred directly to the fixed-point computing units IALU or IMAC. Reducing the number of data transfer paths reduces the wiring between functional units, which in turn reduces chip area and chip cost. The data transfer paths between the functional units in this embodiment are shown in Figure 10. In that table, the header of each column indicates a data destination, the header of each row indicates a data source, and a ticked cell indicates that a transfer path exists. In addition, to reduce transfer paths further, certain functional units may share a transfer path according to application needs. Sharing a transfer path reduces on-chip wiring, but the functional units that share it cannot all transfer data at the same time. For example, if SHU0-to-BIU0 and SHU1-to-BIU1 share one transfer path, then while SHU0 is transferring data to BIU0, no data can be transferred between SHU1 and BIU1. The shading in Figure 10 marks partially shared transfer paths.
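The constraint imposed by a shared transfer path can be illustrated with a small Python check. The table below is a hypothetical subset, not the contents of Figure 10: only the SHU0-to-BIU0 and SHU1-to-BIU1 sharing example comes from the description, while the FMAC-to-M entry, the absence of an FMAC-to-IALU entry, and the function name `check_cycle` are assumptions.

```python
# Hypothetical subset of a source -> destination path table (not Figure 10).
PATHS = {
    ("SHU0", "BIU0"),
    ("SHU1", "BIU1"),
    ("FMAC", "M"),      # assumed entry, for illustration only
}

# Each group lists the transfers that are multiplexed onto one physical path.
SHARED_GROUPS = [
    {("SHU0", "BIU0"), ("SHU1", "BIU1")},
]


def check_cycle(transfers):
    """Return True if all transfers can take place in the same clock cycle.

    Raises ValueError if a transfer has no path at all; returns False if
    two or more transfers would need the same shared path.
    """
    for transfer in transfers:
        if transfer not in PATHS:
            raise ValueError(f"no transfer path for {transfer}")
    for group in SHARED_GROUPS:
        if sum(1 for transfer in transfers if transfer in group) > 1:
            return False
    return True


print(check_cycle([("SHU0", "BIU0"), ("FMAC", "M")]))
print(check_cycle([("SHU0", "BIU0"), ("SHU1", "BIU1")]))
```

Under these assumptions the first call succeeds because the two transfers use independent paths, while the second fails because both transfers compete for the one shared path.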

The transmission control unit 203 corresponding to Figure 10 consists of 29 multiplexers. For ease of description, the transmission control unit 203 is decomposed into two levels. The first level consists of IALU, IMAC, FALU, and FMAC and is provisionally called the ACU, as shown in Figure 11. This level exchanges data with the other functional units through three input ports, ACU.I0, ACU.I1, and ACU.I2, and one output port, ACU.O. The ACU contains 16 multiplexers in total, M13 to M28 in Figure 11; the data inputs of each multiplexer are marked in the figure.

The second level consists of the ACU, M, SHU0, SHU1, and BIU0 to BIU2, as shown in Figure 12. It contains 13 multiplexers in total, M0 to M12 in Figure 12; the data inputs of each multiplexer are marked in the figure.

To generate the control signals for the 29 multiplexers in the transmission control unit 203, all functional units are first grouped and encoded as shown in Figure 13, where "x" means don't care, that is, either "0" or "1" is acceptable. In the microcode record 300, each functional unit control field 301 must specify, in addition to the operation the functional unit is to perform, the destination of the operation result. The destination is specified using the encoding in Figure 13. For example, an FALU control field may be expressed in text as "IALU.T0=FALU.T1+T2", where "FALU.T1+T2" on the right of "=" indicates that FALU is to perform an addition, and "IALU" on the left of "=" is the destination of the operation result, whose encoding is "1100".
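The textual form of a control field can be split into its parts with a few lines of Python. This sketch is hypothetical: the function name, the returned dictionary keys, and the parsing rules are assumptions, and the destination table contains only the single code stated in the description, since Figure 13 is not reproduced here.

```python
# Only this code is stated in the description; the rest of Figure 13 is
# not reproduced, so unknown destinations are reported as None.
DEST_CODES = {"IALU": "1100"}


def parse_control_field(text):
    """Split 'DEST.REG=SRC.OPERATION' into its components (assumed syntax)."""
    text = text.replace("\uff1d", "=")            # accept the full-width '='
    dest, expr = text.split("=", 1)
    dest_unit, dest_reg = dest.strip().split(".", 1)
    src_unit, operation = expr.strip().split(".", 1)
    return {
        "source_unit": src_unit,                  # unit that executes the operation
        "operation": operation,                   # e.g. 'T1+T2'
        "dest_unit": dest_unit,                   # where the result is sent
        "dest_reg": dest_reg,                     # temporary register at the destination
        "dest_code": DEST_CODES.get(dest_unit),   # None if not in the table
    }


field = parse_control_field("IALU.T0=FALU.T1+T2")
print(field["source_unit"], field["dest_unit"], field["dest_code"])
```

The parsed destination code is the piece of information that the transmission control unit would use to derive its multiplexer select signals; how those signals are derived is defined by Figures 13 and 14 and is not modeled here.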

The microcode control unit 201 sends the destination information of all functional units in the microcode record 300 to the transmission control unit 203, which generates the control signals for the 29 multiplexers from this destination information. Figure 14 describes the logical behavior of multiplexer M0, where GroupID denotes the group number of the destination in the corresponding functional unit control field 301.

The specific embodiments described above explain the purpose, technical solutions, and beneficial effects of the present invention in further detail. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A processor with polymorphic instruction set architecture, characterized in that it comprises a scalar processing unit (101), at least one polymorphic instruction processing unit (100), at least one multi-granularity parallel memory (102) and a DMA controller (103); the polymorphic instruction processing unit (100) includes at least one functional unit (202);

The polymorphic instruction processing unit (100) is used to interpret and execute polymorphic instructions, and its functional unit (202) is used to perform specific data manipulation tasks, wherein a polymorphic instruction refers to a plurality of consecutively executed microcode records (300), and each microcode record represents the actions that each functional unit (202) needs to perform in a given clock cycle;

The scalar processing unit (101) is used to call the polymorphic instruction and query the execution state of the polymorphic instruction;

The DMA controller (103) is used to transmit the configuration information of the polymorphic instruction and transmit the data required by the polymorphic instruction to the multi-granularity parallel memory (102);

The polymorphic instruction processing unit (100) passively receives polymorphic instructions from the DMA controller (103), and is called by the scalar processing unit (101);

The scalar processing unit (101) controls the polymorphic instruction processing unit (100) through a first control path (104), and the scalar processing unit (101) controls the DMA controller (103).

2. The processor with polymorphic instruction set architecture as claimed in claim 1, characterized in that: the polymorphic instruction processing unit (100) also comprises a microcode memory (200) and a microcode control unit (201);

The microcode memory (200) is used to store polymorphic instructions;

The microcode control unit (201) is configured to receive a control request from the scalar processing unit (101) through the first control path (104) and execute corresponding actions.

3. The processor with polymorphic instruction set architecture as claimed in claim 2, characterized in that: the microcode control unit (201) comprises a configuration register (207), and this configuration register (207) is used for storing the parameters required by, and the running state of, the polymorphic instruction processing unit (100) during operation.

4. The processor with polymorphic instruction set architecture as claimed in claim 3, characterized in that: the control request of the scalar processing unit (101) comprises starting or querying the polymorphic instruction processing unit (100), and reading and writing the configuration register (207) of the polymorphic instruction processing unit (100).

5. The processor with polymorphic instruction set architecture as claimed in claim 3, characterized in that: the polymorphic instruction processing unit (100) also includes a transmission control unit (203), and the functional unit (202) has a plurality of data input/output ports and exchanges data through the transmission control unit (203).

6. The processor with polymorphic instruction set architecture as claimed in claim 3, characterized in that: the functional unit (202) is used to perform data load/store operations, and reads and writes data in the multi-granularity parallel memory (102) through a first internal bus (107); at the same time, the microcode memory (200) is connected to the first internal bus (107) as a slave device, and passively receives microcode records (300) from the outside.

7. The processor with polymorphic instruction set architecture according to claim 2, characterized in that: the microcode control unit (201) sequentially reads and executes the microcode records (300) of polymorphic instructions.

8. The processor with polymorphic instruction set architecture as claimed in claim 7, characterized in that: each row of the microcode memory (200) stores one microcode record (300), and when the scalar processing unit (101) calls a polymorphic instruction, it only specifies the row number, in the microcode memory (200), of the starting microcode record of the called polymorphic instruction.