
CN113407483A - Data intensive application oriented dynamic reconfigurable processor - Google Patents


Info

Publication number
CN113407483A
CN113407483A
Authority
CN
China
Prior art keywords: data, register, time, operator, configuration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110703118.0A
Other languages
Chinese (zh)
Other versions
CN113407483B (en)
Inventor
刘大江
朱蓉
Current Assignee
Chongqing University
Original Assignee
Chongqing University
Priority date
Filing date
Publication date
Application filed by Chongqing University
Priority to CN202110703118.0A
Publication of CN113407483A
Application granted
Publication of CN113407483B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/80 Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8007 Single instruction multiple data [SIMD] multiprocessors
    • G06F15/8023 Two dimensional arrays, e.g. mesh, torus
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7867 Architectures comprising a single central processing unit with reconfigurable architecture
    • G06F15/7871 Reconfiguration support, e.g. configuration loading, configuration switching, or hardware OS
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Multi Processors (AREA)
  • Logic Circuits (AREA)

Abstract



The present invention provides a dynamically reconfigurable processor for data-intensive applications. The processor comprises a processing unit array, an on-chip multi-bank scratchpad memory, and a configuration memory. The processing unit array consists of m × n processing elements (PEs) arranged as a two-dimensional array, where m and n are positive integers; the PEs in each row are connected to the same bus, and each bus accesses m banks of the scratchpad memory through a cross-select matrix unit. The proposed method lets reusable data flow efficiently through the processing unit array, avoids repeated accesses to data at the same storage location, reduces the volume of memory accesses at the source, and thereby greatly improves the loop-pipelining performance of the dynamically reconfigurable processor.


Description

Data intensive application oriented dynamic reconfigurable processor
Technical Field
The invention relates to the technical field of integrated circuits, and in particular to the field of dynamically reconfigurable processors.
Background
With the development of technologies such as cloud computing, big data, and the Internet of Things, and with the popularization of intelligent terminal devices, data traffic is growing ever faster and the demand for high-performance chips is increasingly urgent. A dynamically reconfigurable processor is a new processor architecture whose energy efficiency approaches that of an Application Specific Integrated Circuit (ASIC) without sacrificing much programming flexibility, making it one of the ideal architectures for accelerating data-intensive applications. Unlike a conventional General Purpose Processor (GPP), the dynamically reconfigurable processor incurs no latency or energy overhead for fetch and decode operations. Unlike an ASIC, it can dynamically reconfigure the functions of its circuits at run time and therefore offers better flexibility. Unlike a Field Programmable Gate Array (FPGA), it uses a coarse-grained configuration mode, which reduces the cost of configuration information and yields higher computational energy efficiency.
A typical dynamically reconfigurable processor is generally composed of a processing unit array, data memory, and configuration memory. The processing unit array consists of multiple Processing Elements (PEs), and the function of the entire array is defined by configuring the connectivity and operation pattern of each PE. The configuration is mainly derived from the mapping produced by a specific compilation algorithm. Modulo-scheduled loop software pipelining is one of the most common mapping optimizations in compilation; it improves the parallel execution performance of an application by minimizing the Initiation Interval (II) of loop iterations. However, high computational parallelism requires a large amount of data to be moved in parallel between the data memory and the processing unit array. To cope with this parallel data access pressure, conventional reconfigurable processors generally use an on-chip multi-bank Scratch Pad Memory (SPM) to supply data to the processing unit array in parallel. A 4 × 4 processing element array is typically equipped with a 4-bank SPM, in which each row of the array can access each bank of the SPM in parallel, but different PEs within the same row can only access data serially because they share a data bus. To further improve parallel data access capability, HReA [L. Liu, Z. Li, C. Yang, C. Deng, S. Yin and S. Wei, "HReA: An Energy-Efficient Embedded Dynamically Reconfigurable Fabric for 13-Dwarfs Processing," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 65, no. 3, pp. 381-385, March 2018] equips a 4 × 4 processing unit array with a 16-bank SPM. In that architecture, every PE can access every bank of the SPM in parallel, which greatly enhances parallel data access capability. However, it also increases the difficulty of bank management and increases chip area and power consumption.
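The initiation interval mentioned above is bounded from below by the available resources. As an illustrative sketch (not part of the patent; it ignores the recurrence-constrained bound and assumes each PE executes one operator per cycle), the resource-constrained minimum II can be computed as:

```python
import math

def res_mii(num_ops: int, num_pes: int) -> int:
    """Resource-constrained lower bound on the initiation interval (II),
    assuming each PE can execute at most one operator per clock cycle."""
    return math.ceil(num_ops / num_pes)

# A 4-operator loop body on a 4 x 4 (16-PE) array can start a new
# iteration every cycle; 40 operators on the same array need II >= 3.
print(res_mii(4, 16))   # 1
print(res_mii(40, 16))  # 3
```

This is why removing redundant access operators (as the invention does) can help the compiler reach II = 1.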
To make full use of the limited bandwidth of the SPM, the document [S. Yin, X. Yao, T. Lu, D. Liu, J. Gu, L. Liu, and S. Wei, "Conflict-free loop mapping for coarse-grained reconfigurable architecture with multi-bank memory," IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 9, pp. 2471-2485, 2017] proposes a conflict-free loop mapping method from the compilation perspective. The method schedules the access operators in the DFG to different time steps to reduce the volume of parallel data accesses, then organizes the storage locations of the data in the SPM through a memory partitioning algorithm, and finally reduces data access conflicts. However, as the initiation interval of the loop pipeline becomes smaller and smaller, multiple access operators must execute simultaneously, and conflict-free data access can ultimately be guaranteed only at the expense of performance. The work above enables the application's original access operators to run without conflict from the architecture or compilation perspective, but the number of access operators is not essentially changed, which limits the extent to which access conflicts can be optimized.
From the perspective of data-intensive applications, there are many opportunities for data reuse in an application's loop kernel, for example in stencil computation. Although the loop kernel contains many memory accesses, some of them actually read the same data. If the same data is fetched once and used multiple times, access conflicts can be reduced by reducing the number of access operators. An unavoidable problem, however, is how to route this reusable data between the PEs.
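The reuse opportunity can be quantified with a small sketch. The three-tap stencil below is an assumed example (the patent's pseudocode is not reproduced here); it counts how many loads a loop of the form y[i] = f(x[i], x[i+1], x[i+2]) issues versus how many distinct words it actually touches:

```python
from collections import Counter

def memory_accesses(n: int, taps=(0, 1, 2)):
    """Count, per address, the loads issued by a 1-D stencil loop
    y[i] = f(x[i], x[i+1], x[i+2]) over i = 0..n-1."""
    return Counter(i + t for i in range(n) for t in taps)

loads = memory_accesses(8)
total = sum(loads.values())  # loads issued without any reuse
unique = len(loads)          # loads needed if each word is fetched once
print(total, unique)         # 24 10
```

Fetching each word once and routing it to its consumers would cut the access operators from 24 to 10 in this toy case, which is exactly the saving the value network targets.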
In a conventional processing unit array, the processing units form a single-channel network: data can be transmitted between PEs only by entering through the two multiplexer (MUX) input ports of a PE and leaving through its output register. A PE used this way can perform no operation other than data routing, which is very wasteful of resources. If, instead of routing the data out, the data is retained for multiple cycles in the Local Register File (LRF) of a PE, then the mapping of operators onto the processing unit array becomes very constrained at compile time: an operator must be mapped onto that particular PE in order to use the data in its LRF. Therefore, how to provide an architecture that can route data between PEs efficiently, make full use of PE resources, and keep the operator mapping flexible at compile time is a technical problem that urgently needs to be solved by those skilled in the art. Once the problem of efficient data routing is solved, reusable data can be fully exploited, the number of access operators and hence access conflicts can be reduced, and the execution performance of the dynamically reconfigurable processor can ultimately be improved.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, the first purpose of the present invention is to provide a dynamic reconfigurable processor oriented to data intensive applications, so as to improve the parallel data access and storage capability of the reconfigurable processor.
A second object of the invention is to propose a non-transitory computer-readable storage medium.
To achieve the above object, an embodiment of the first aspect of the present invention provides a dynamically reconfigurable processor for data-intensive applications. The processor includes a processing unit array, an on-chip multi-bank scratchpad memory SPM, and a configuration memory. The processing unit array is composed of m × n processing elements PE arranged as a two-dimensional array, where m and n are positive integers; the PEs in the same row are connected to the same bus, and each bus accesses m banks of the scratchpad memory through a cross-select matrix unit.
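The row-bus-to-bank routing through the cross-select matrix can be sketched as follows. The patent does not specify the bank-mapping policy, so low-order interleaving is an assumption chosen purely for illustration:

```python
def route(addr: int, m: int = 4):
    """Map a word address to (bank, offset) under simple low-order
    interleaving -- one plausible policy the cross-select matrix unit
    could implement for an m-bank scratchpad."""
    return addr % m, addr // m

# Consecutive addresses land in different banks, so the row buses can
# fetch m consecutive words in the same cycle without a bank conflict.
print([route(a) for a in range(4)])  # [(0, 0), (1, 0), (2, 0), (3, 0)]
```

Under such a policy, a conflict only arises when two buses request addresses that are congruent modulo m in the same cycle.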
Optionally, in an embodiment of the application, each PE comprises a functional unit FU, a local register file RF, an output register, and a configuration register. The functional unit FU performs various fixed-point operations and has two multiplexers at its inputs that fetch data from different sources. The local register file RF is divided into r separate registers, where r is a positive integer; each register selects data originating either from the functional unit FU or from the previous register. The information in the configuration register of each processing element PE originates from the configuration memory, which is connected to the components inside the PE and distributes the configuration stream to configure the selection signal of each multiplexer, the function of the functional unit FU, and the read-write enables of the registers.
Optionally, in an embodiment of the application, the processing elements PE are interconnected by a two-channel network comprising a result network for passing the computation results of the functional units FU and a value network for passing the values of the local register files RF.
Optionally, in an embodiment of the application, when a value fetched from memory by an access operator must be referenced multiple times within a short period, the value is distributed through the value network to the other processing elements PE that need to reference it.
Optionally, in an embodiment of the present application, a serial-shift data channel is added to the internal registers of the local register file RF.
Optionally, in an embodiment of the present application, the method includes the following steps:
Step 1: convert the application pseudo code into an original data dependency graph. Because the data x[i+2] is the same data as x[i+1] one clock cycle later and as x[i] two clock cycles later, the two operators L2 and L3 are removed and two new reuse dependence edges (L1, *) and (L1, +) are added, yielding a new data dependency graph. (L1, *) indicates that the data fetched by the access operator L1 is transmitted through the value network to the multiplication operator "*" for consumption, and (L1, +) indicates that the data fetched by L1 is transmitted through the value network to the addition operator "+" for consumption. A compilation scheme is then obtained with a compilation tool, a configuration information stream is generated, and a placement result with II = 1 is obtained: the access operator L1 is placed on PE1, the multiplication operator "*" on PE2, the addition operator "+" on PE3, and the access operator "S1" on PE4;
Step 2: at time t, driven by the configuration stream, after the operator L1 finishes fetching the data of time t, it places the data in the last register of PE1;
Step 3: at time t+1, driven by the configuration stream, the data of time t is passed out through multiplexer Mb of PE1 and through multiplexers Ma and M1 of PE2 into the first register R1 of PE2; after L1 finishes fetching the data of time t+1, it places that data both in the output register of PE1 and in the last register of PE1;
Step 4: at time t+2, driven by the configuration stream, the data of time t is passed out through multiplexer Mb of PE2 and through multiplexers Ma and M1 of PE3 into the first register R1 of PE3; at the same time, the data of time t also reaches the second input port of the FU of PE2 through the multiplexers Md and Mf in front of that FU; meanwhile, the data in the output register of PE1 reaches the first input port through the multiplexer Me in front of the FU of PE2, the FU performs the multiplication "*" operation, and the result is stored in the output register of PE2;
Step 5: at time t+3, driven by the configuration stream, the data of time t is transmitted through the output port of the first register R1 of PE3 and the multiplexer M2 into the second register R2 of PE3; meanwhile, the FU of PE2 performs a multiplication operation and stores the result in the output register of PE2;
Step 6: at time t+4, driven by the configuration stream, the data of time t reaches the second input port of the FU of PE3 through the multiplexers Md and Mf in front of that FU; meanwhile, the data in the output register of PE2 is transmitted to the first input port through the multiplexer Me in front of the FU of PE3, the FU performs the addition "+" operation, and the result is kept in the output register of PE3;
Step 7: at time t+5, the data in the output register of PE3 is transferred through the multiplexer Me to the first input port of the FU of PE4, and the data is stored into a bank through the bus.
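The reuse transformation of step 1 can be sketched in code. The operator names L1-L3, "*" and "+" follow the text; the edge encoding (producer, consumer, iteration distance) and the helper `apply_reuse` are assumptions made purely for illustration:

```python
def apply_reuse(ddg, loads_same_stream):
    """Keep the first load of a reused stream and rewrite the consumer
    edges of the remaining loads as reuse edges from that first load.
    Each rewritten edge carries the iteration distance: x[i+2] equals
    x[i+1] one iteration later and x[i] two iterations later."""
    keep, *drop = loads_same_stream
    new_edges = set()
    for src, dst, dist in ddg:
        if src in drop:
            new_edges.add((keep, dst, 1 + drop.index(src)))
        else:
            new_edges.add((src, dst, dist))
    return new_edges

# Original DDG: L1 feeds "*", L2 feeds "*", L3 feeds "+".
orig = {("L1", "*", 0), ("L2", "*", 0), ("L3", "+", 0)}
print(sorted(apply_reuse(orig, ["L1", "L2", "L3"])))
# [('L1', '*', 0), ('L1', '*', 1), ('L1', '+', 2)]
```

The result matches the text: L2 and L3 disappear, replaced by the reuse edges (L1, *) and (L1, +).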
To achieve the above object, an embodiment of the second aspect of the present application provides a non-transitory computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the method for transferring reusable data described in the embodiment of the first aspect of the present application.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a diagram of a data-intensive application-oriented dynamically reconfigurable processor according to an embodiment of the present invention;
Fig. 2 is a schematic diagram illustrating how the architecture provided by the embodiment of the present invention is used.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
A data intensive application oriented dynamically reconfigurable processor according to an embodiment of the present invention is described below with reference to the accompanying drawings.
Fig. 1 is a schematic structural diagram of a data-intensive-application-oriented dynamically reconfigurable processor according to an embodiment of the present invention.
To achieve the above object, as shown in fig. 1, an embodiment of the first aspect of the present invention provides a dynamically reconfigurable processor for data-intensive applications. The processor includes a processing unit array, an on-chip multi-bank scratchpad memory SPM, and a configuration memory. The processing unit array is composed of m × n processing elements PE arranged as a two-dimensional array, where m and n are positive integers; the PEs in the same row are connected to the same bus, and each bus accesses m banks of the scratchpad memory through a cross-select matrix unit.
In an embodiment of the application, furthermore, each PE comprises a functional unit FU, a local register file RF, an output register, and a configuration register. The functional unit FU performs various fixed-point operations and has two multiplexers at its inputs that fetch data from different sources. The local register file RF is divided into r separate registers, where r is a positive integer; each register selects data originating either from the functional unit FU or from the previous register. The information in the configuration register of each processing element PE originates from the configuration memory, which is connected to the components inside the PE and distributes the configuration stream to configure the selection signal of each multiplexer, the function of the functional unit FU, and the read-write enables of the registers.
In an embodiment of the application, further, the processing elements PE are interconnected by a two-channel network comprising a result network for passing the computation results of the functional units FU and a value network for passing the values of the local register files RF.
In an embodiment of the application, further, when a value fetched from memory by an access operator must be referenced multiple times within a short period, the value is distributed through the value network to the other processing elements PE that need to reference it.
In an embodiment of the application, further, a serial-shift data channel is added to the internal registers of the local register file RF.
In one embodiment of the present application, specifically, the FU can perform various fixed-point operations, including addition, subtraction, multiplication, logical operations, and the like. The input of the FU has two 6-to-1 multiplexers (Me, Mf) that can fetch data from different sources, such as the FUs of neighboring PEs, the local registers of the PE's own register file RF, and the memory. The output of the FU has three destinations: the memory, the output register, and the registers of the register file RF.
The RF is divided into r separate registers (R1, R2, …, Rr, r being a positive integer), each preceded by a 2-to-1 multiplexer (M1, M2, …, Mr) that selects data originating either from the FU or from the previous register (the previous register of the first register R1 is a register of an adjacent RF). The output port of each register is connected to the next register (the last register Rr has no following register) and to three r-to-1 multiplexers (Mb, Mc, Md). Through multiplexer Mc or Md, the data in a register can be selected into the local FU for computation; through multiplexer Mb, the data in a register can be selected into the first register R1 of an adjacent RF. Meanwhile, the first register R1 of the RF selects data from one of the neighboring RFs through the 4-to-1 multiplexer Ma.
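The two operating modes the per-register 2-to-1 multiplexers enable can be captured in a toy behavioral model. This is an illustrative sketch, not the patent's implementation; the class name and the simplified normal mode (R1 latches the FU result while the other registers hold) are assumptions:

```python
class RegisterFile:
    """Toy model of the r-register local RF: each register's 2-to-1 mux
    (M1..Mr) takes either the FU result or the previous register, so the
    same hardware acts as ordinary registers or as a shift chain."""
    def __init__(self, r: int):
        self.regs = [0] * r

    def tick(self, fu_value, shift: bool):
        if shift:
            # Shift-register mode: registers chained end to end,
            # the FU (or neighbor) feeds R1.
            self.regs = [fu_value] + self.regs[:-1]
        else:
            # Normal mode: R1 latches the FU result, others hold.
            self.regs[0] = fu_value

rf = RegisterFile(r=3)
for v in (10, 20, 30):
    rf.tick(v, shift=True)
print(rf.regs)  # [30, 20, 10]
```

In shift mode a value injected at R1 reappears at Rr after r cycles, which is the delay-line behavior the serial-shift data path relies on.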
The configuration register in each PE receives its information from the configuration memory, which is connected to the components inside the PE and distributes the configuration stream to configure the selection signal of each multiplexer, the function of the FU, and the read-write enables of the registers.
In one embodiment of the application, specifically, with the dual-channel interconnection between the processing units, the original single-channel network is replaced by a dual-channel network: one network transfers the computation results of the FUs (the result network), and the other transfers the values of the RFs (the value network). The result network passes the computation result of an FU either through the output register to the downstream PE or through the bus into the memory. The value network flexibly configures, through multiplexers, the output direction of a value held in the RF, transferring it to the first register of an adjacent RF or to another local register.
When a value fetched from memory by an access operator must be referenced multiple times within a short period, it can be distributed through the constructed value network to the other PEs that need it, enhancing the data-reuse capability of the reconfigurable computing array. The dual-channel interconnection network can thus reduce the number of access operators in the data flow graph at the source, thereby reducing access conflicts in the data memory.
In one embodiment of the present application, specifically, although the dual-channel inter-PE interconnection network enhances the flexibility of data transfer, in pipelined execution the register interconnections between processing units can only guarantee functional correctness when the number of clock cycles at which a datum should arrive (Required Time, RT) equals the number of clock cycles at which it actually arrives (Arrival Time, AT). The AT of reused data depends on the Manhattan distance between the producer PE and the consumer PE, and it is difficult during compilation to guarantee that this distance matches the RT. Therefore, we add a serial-shift data path to the RF internal registers: each register inside the RF is connected end to end in sequence, so that reused data can remain inside the RF of the same PE for multiple clock cycles in pipelined mode.
After the serial-shift data path is added, each register can select either data from the FU or data from the previous register, since each register is preceded by a 2-to-1 multiplexer. The RF can therefore operate in either normal mode or shift-register mode. In normal mode, a register temporarily stores data for the next time period. In shift-register mode, the registers of a processing unit form a register chain whose length can be selected through the multiplexers, flexibly configuring the number of clock cycles the data takes to flow through. Assuming the Manhattan distance between the data producer and consumer is M1 and a single RF contains r internal registers, the AT of the data is adjustable over the range [M1+1, r × (M1+1)], which greatly enhances the ability of data to arrive synchronously. The register interconnection network inside the PE thus provides a hardware basis for data synchronization and flexibility guarantees for subsequent compilation and mapping.
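The arrival-time window stated above can be checked with a one-line sketch that implements the formula [M1+1, r × (M1+1)] exactly as given in the text (the function name is an assumption):

```python
def at_range(manhattan: int, r: int):
    """Arrival-time window for a reused value: the shortest path takes
    M1+1 cycles, and routing through up to r RF registers at each hop
    stretches it to r * (M1+1) cycles, per the formula in the text."""
    lo = manhattan + 1
    return lo, r * lo

# A producer 2 hops away with r = 4 registers per RF: AT in [3, 12].
print(at_range(manhattan=2, r=4))  # (3, 12)
```

The wider this window, the easier it is for the compiler to make AT equal RT without inserting extra routing PEs.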
Technical effects of the present application: the embodiments provide a dynamically reconfigurable processor for data-intensive applications. In a conventional inter-PE interconnection network, only the functional units (FUs) are connected, through the output registers, and no direct interconnection channel exists between the register files of the PEs. The architecture of the invention adds interconnection between the register files RF of the PEs in the processing unit array, so that reusable data flows efficiently through the array, repeated accesses to data at the same storage location are avoided, the volume of memory accesses is reduced at the source, and the loop-pipelining performance of the dynamically reconfigurable processor is greatly improved.
As shown in fig. 2, in an embodiment of the present application, the following steps are further included:
Step 1: convert the application pseudo code into an original data dependency graph. Because the data x[i+2] is the same data as x[i+1] one clock cycle later and as x[i] two clock cycles later, the two operators L2 and L3 are removed and two new reuse dependence edges (L1, *) and (L1, +) are added, yielding a new data dependency graph. (L1, *) indicates that the data fetched by the access operator L1 is transmitted through the value network to the multiplication operator "*" for consumption, and (L1, +) indicates that the data fetched by L1 is transmitted through the value network to the addition operator "+" for consumption. A compilation scheme is then obtained with a compilation tool, a configuration information stream is generated, and a placement result with II = 1 is obtained: the access operator L1 is placed on PE1, the multiplication operator "*" on PE2, the addition operator "+" on PE3, and the access operator "S1" on PE4;
Step 2: at time t, driven by the configuration stream, after the operator L1 finishes fetching the data of time t, it places the data in the last register of PE1;
Step 3: at time t+1, driven by the configuration stream, the data of time t is passed out through multiplexer Mb of PE1 and through multiplexers Ma and M1 of PE2 into the first register R1 of PE2; after L1 finishes fetching the data of time t+1, it places that data both in the output register of PE1 and in the last register of PE1;
Step 4: at time t+2, driven by the configuration stream, the data of time t is passed out through multiplexer Mb of PE2 and through multiplexers Ma and M1 of PE3 into the first register R1 of PE3; at the same time, the data of time t also reaches the second input port of the FU of PE2 through the multiplexers Md and Mf in front of that FU; meanwhile, the data in the output register of PE1 reaches the first input port through the multiplexer Me in front of the FU of PE2, the FU performs the multiplication "*" operation, and the result is stored in the output register of PE2;
Step 5: at time t+3, driven by the configuration stream, the data of time t is transmitted through the output port of the first register R1 of PE3 and the multiplexer M2 into the second register R2 of PE3; meanwhile, the FU of PE2 performs a multiplication operation and stores the result in the output register of PE2;
Step 6: at time t+4, driven by the configuration stream, the data of time t reaches the second input port of the FU of PE3 through the multiplexers Md and Mf in front of that FU; meanwhile, the data in the output register of PE2 is transmitted to the first input port through the multiplexer Me in front of the FU of PE3, the FU performs the addition "+" operation, and the result is kept in the output register of PE3;
Step 7: at time t+5, the data in the output register of PE3 is transferred through the multiplexer Me to the first input port of the FU of PE4, and the data is stored into a bank through the bus.
In one embodiment of the present application, specifically, after 6 clock cycles one complete iteration of the loop pipeline has been executed. Since II = 1, the processing unit array executes with the same configuration information in every clock cycle.
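The throughput consequence of II = 1 with the 6-cycle schedule above can be sketched with the standard software-pipelining cycle count (the function name is an assumption; the depth of 6 is taken from the example's t..t+5 schedule):

```python
def cycles(iterations: int, depth: int = 6, ii: int = 1) -> int:
    """Total cycles for a software-pipelined loop: the first iteration
    takes `depth` cycles to flow through, then one iteration completes
    every II cycles."""
    return depth + ii * (iterations - 1)

# One iteration costs 6 cycles, but 100 iterations cost only 105:
# after the pipeline fills, a result is produced every cycle.
print(cycles(1), cycles(100))  # 6 105
```

This is why keeping the same configuration in every cycle (II = 1) is the best case the reuse transformation is trying to preserve.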
In one embodiment of the present application, specifically, fig. 2 shows: (a) an example loop pseudocode; (b) the original DDG obtained from (a); (c) the DDG with data reuse obtained from (b); (d) an example (m = 2, n = 2, r = 2) of the inventive architecture; (e) the transmission of the data fetched by L1 at time t over the example architecture.
In order to implement the above embodiments, the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for transferring reusable data according to the embodiments of the first aspect of the present application.
Although the present application has been disclosed in detail with reference to the accompanying drawings, it is to be understood that such description is merely illustrative and does not restrict the scope of the present application. The scope of the present application is defined by the appended claims and may include various modifications, adaptations, and equivalents of the invention without departing from its scope and spirit.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the schematic or otherwise described herein, e.g., as a sequential list of executable instructions that may be thought of as being useful to implement logical functions, may be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (7)

1. A dynamically reconfigurable processor for data-intensive applications, characterized in that the dynamically reconfigurable processor comprises a processing unit array, an on-chip multi-bank scratchpad memory, and a configuration memory; the processing unit array is composed of m x n processing units PE arranged as a two-dimensional array, where m and n are positive integers; PEs in the same row are connected to the same bus, and each bus accesses m banks of the scratchpad memory through a crossbar selection matrix unit.
2. The dynamically reconfigurable processor according to claim 1, characterized in that each PE comprises a functional unit FU, a local register file RF, an output register, and a configuration register; the functional unit FU is used to perform various fixed-point operations; the input end of the functional unit FU has two multiplexers, which are used to access data from different sources; the local register file RF is divided into r separate registers, where r is a positive integer, and each register selects data coming either from the functional unit FU or from the preceding register; the information in the configuration register of each processing unit PE comes from the configuration memory, which is connected to the individual components inside the processing unit PE and distributes the configuration stream that configures the selection signal of each multiplexer, the function of the functional unit FU, and the read/write enables of the registers.
3. The dynamically reconfigurable processor according to claim 2, characterized in that the processing unit PE lies in a dual-channel network comprising a result network and a value network; the result network is used to transfer the computation results of the functional unit FU, and the value network is used to transfer the values of the local register file RF.
4. The dynamically reconfigurable processor according to claim 3, characterized in that when a value obtained from memory by a load operator needs to be referenced multiple times within a short time, the value is distributed through the value network to the other processing units PE that need to reference it.
5. The dynamically reconfigurable processor according to claim 3 or 4, characterized in that a serial-shift data channel is added between the internal registers of the local register file RF.
6. A method for transferring reusable data using the dynamically reconfigurable processor for data-intensive applications according to any one of claims 1-5, characterized in that it comprises the following steps:
Step 1: convert the application pseudocode into the original data dependency graph. Since the data x[i+2] is the same datum as x[i+1] one clock cycle later and x[i] two clock cycles later, remove the two operators L1 and L2 that re-load it, and add two new reuse dependency edges (L1, *) and (L1, +) to obtain a new data dependency graph; (L1, *) means that the data fetched by operator L1 will be passed through the value network to the multiplication operator "*" for consumption, and (L1, +) means that the data fetched by operator L1 will be passed through the value network to the addition operator "+" for consumption. Using the compilation tool, obtain the compilation scheme, generate the configuration information stream, and obtain a placement result with II = 1: the load operator L1 is placed on PE1, the multiplication operator "*" on PE2, the addition operator "+" on PE3, and the store operator S1 on PE4;
Step 2: at time t, driven by the configuration stream, after the L1 operator at time t fetches its data, the data is placed in the last register of PE1;
Step 3: at time t+1, driven by the configuration stream, the L1 data of time t is passed out through the multiplexer Mb of PE1 and, via the multiplexers Ma and M1 of PE2, reaches the first register R1 of PE2; after the L1 operator at time t+1 fetches its data, the data is placed in the output register of PE1 and, at the same time, in the last register of PE1;
Step 4: at time t+2, driven by the configuration stream, the L1 data of time t is passed out through the multiplexer Mb of PE2 and, via the multiplexers Ma and M1 of PE3, reaches the first register R1 of PE3; at the same time, driven by the configuration stream, the L1 data of time t is also passed to the second input port of the FU through the multiplexers Md and Mf in front of the FU of PE2; meanwhile, driven by the configuration stream, the data in the output register of PE1 reaches the first input port through the multiplexer Me in front of the FU of PE2; driven by the configuration stream, the FU performs the multiplication "*" operation, and the result is saved in the output register of PE2;
Step 5: at time t+3, driven by the configuration stream, the L1 data of time t passes from the output port of the first register R1 of PE3, through the multiplexer M2, into the second register R2 of PE3; at the same time, PE2 performs the multiplication "*" operation and the result is stored in the output register of PE2;
Step 6: at time t+4, driven by the configuration stream, the L1 data of time t reaches the second input port of the FU through the multiplexers Md and Mf in front of the FU of PE3; at the same time, the data in the output register of PE2 is passed to the first input port of the FU through the multiplexer Me in front of the FU of PE3; driven by the configuration stream, the FU performs the addition "+" operation, and the result is saved in the output register of PE3;
Step 7: at time t+5, driven by the configuration stream, the data in the output register of PE3 is passed through the multiplexer Me to the first input port of the FU of PE4, and the data is stored into the bank over the bus.
7. A non-transitory computer-readable storage medium on which a computer program is stored, characterized in that, when executed by a processor, the computer program implements the method for transferring reusable data according to any one of claims 1-6.
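Purely for illustration (all names below are ours, not the patent's), the PE structure of claim 2 and the serial-shift channel of claim 5 can be mocked in software along these lines:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PE:
    """Software mock of one processing unit from claim 2."""
    r: int = 2                                    # registers in the local RF
    rf: List[int] = field(default_factory=list)   # local register file
    out: Optional[int] = None                     # output register

    def shift_in(self, value: int) -> None:
        """Claim 5: serial-shift data channel between the RF registers."""
        self.rf = ([value] + self.rf)[: self.r]

    def execute(self, op: str, a: int, b: int) -> int:
        """FU: fixed-point operation selected by the configuration word;
        the two operands stand for the two input multiplexers."""
        self.out = {"add": a + b, "mul": a * b, "sub": a - b}[op]
        return self.out
```

For example, pe = PE(r=2) followed by two shift_in calls leaves the newer value in R1 and the older in R2, mirroring the register movement in steps 3 and 5 of claim 6.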
CN202110703118.0A 2021-06-24 2021-06-24 Dynamic reconfigurable processor for data intensive application Active CN113407483B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110703118.0A CN113407483B (en) 2021-06-24 2021-06-24 Dynamic reconfigurable processor for data intensive application

Publications (2)

Publication Number Publication Date
CN113407483A (en) 2021-09-17
CN113407483B (en) 2023-12-12

Family

ID=77683003

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110703118.0A Active CN113407483B (en) 2021-06-24 2021-06-24 Dynamic reconfigurable processor for data intensive application

Country Status (1)

Country Link
CN (1) CN113407483B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113838498A (en) * 2021-09-27 2021-12-24 Huazhong University of Science and Technology Data multiplexing operation circuit and method for memory calculation
WO2023151216A1 (en) * 2022-02-14 2023-08-17 华为技术有限公司 Graph data processing method and chip

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050171990A1 (en) * 2001-12-06 2005-08-04 Benjamin Bishop Floating point intensive reconfigurable computing system for iterative applications
US20100211747A1 (en) * 2009-02-13 2010-08-19 Shim Heejun Processor with reconfigurable architecture
CN102253921A (en) * 2011-06-14 2011-11-23 清华大学 Dynamic reconfigurable processor
US20130089102A1 (en) * 2011-10-05 2013-04-11 Woong Seo Coarse-grained reconfigurable array based on a static router
CN103218347A (en) * 2013-04-28 2013-07-24 清华大学 Multiparameter fusion performance modeling method for reconfigurable array
CN105468568A (en) * 2015-11-13 2016-04-06 上海交通大学 High-efficiency coarse granularity reconfigurable computing system
CN112506853A (en) * 2020-12-18 2021-03-16 清华大学 Reconfigurable processing unit array of zero-buffer flow and zero-buffer flow method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jongeun Lee et al.: "Fast shared on-chip memory architecture for efficient hybrid computing with CGRAs", 2013 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 1 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113838498A (en) * 2021-09-27 2021-12-24 Huazhong University of Science and Technology Data multiplexing operation circuit and method for memory calculation
CN113838498B (en) * 2021-09-27 2023-02-28 Huazhong University of Science and Technology Data multiplexing operation circuit and method for in-memory calculation
WO2023151216A1 (en) * 2022-02-14 2023-08-17 华为技术有限公司 Graph data processing method and chip

Also Published As

Publication number Publication date
CN113407483B (en) 2023-12-12

Similar Documents

Publication Publication Date Title
JP6059413B2 (en) Reconfigurable instruction cell array
CN109213523B (en) Processor, method and system for configurable spatial accelerator with memory system performance, power reduction and atomic support features
CN109213723B (en) A processor, method, device, and non-transitory machine-readable medium for data flow graph processing
US10469397B2 (en) Processors and methods with configurable network-based dataflow operator circuits
Mei et al. Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling
US20200310797A1 (en) Apparatuses, methods, and systems for swizzle operations in a configurable spatial accelerator
US20110231616A1 (en) Data processing method and system
CN108268278A (en) Processor, method and system with configurable spatial accelerator
US9727526B2 (en) Apparatus and method of vector unit sharing
KR20100092805A (en) A processor with reconfigurable architecture
JP2014216021A (en) Processor for batch thread processing, code generation apparatus and batch thread processing method
Abnous et al. Pipelining and bypassing in a VLIW processor
CN113407483A (en) Data intensive application oriented dynamic reconfigurable processor
Abdelhamid et al. A highly-efficient and tightly-connected many-core overlay architecture
Akabe et al. Imax: A power-efficient multilevel pipelined cgla and applications
CN112559954B (en) FFT algorithm processing method and device based on software-defined reconfigurable processor
US7260709B2 (en) Processing method and apparatus for implementing systolic arrays
CN114116598B (en) Chip and electronic device with built-in reconfigurable coprocessor
CN101236576B (en) Interconnecting model suitable for heterogeneous reconfigurable processor
Gokhale et al. Malleable architecture generator for FPGA computing
Rettkowski et al. Application-specific processing using high-level synthesis for networks-on-chip
CN112506853A (en) Reconfigurable processing unit array of zero-buffer flow and zero-buffer flow method
CN119441130B (en) A three-dimensional reconfigurable hardware acceleration core chip
McGrath et al. A WE-DSP32 based, low-cost, high-performance, synchronous multiprocessor for cyclo-static implementations
Ferreira et al. Reducing interconnection cost in coarse-grained dynamic computing through multistage network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant