
CN117667210A - Instruction control device, method, processor, chip and board card - Google Patents

Instruction control device, method, processor, chip and board card

Info

Publication number
CN117667210A
Authority
CN
China
Prior art keywords: instruction, memory, access, address, instructions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211067770.9A
Other languages
Chinese (zh)
Inventor
Name withheld upon request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambrian Xi'an Integrated Circuit Co ltd
Original Assignee
Cambrian Xi'an Integrated Circuit Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cambrian Xi'an Integrated Circuit Co ltd filed Critical Cambrian Xi'an Integrated Circuit Co ltd
Priority to CN202211067770.9A
Publication of CN117667210A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3004 Arrangements for executing specific machine instructions to perform operations on memory
    • G06F9/30047 Prefetch instructions; cache control instructions
    • G06F9/30145 Instruction analysis, e.g. decoding, instruction word fields

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The present disclosure provides an instruction control method, an instruction control device, a processor, a chip, and a board card. The processor may be included as a computing device in a combined processing device, which may also include an interface device and other processing devices. The computing device interacts with the other processing devices to jointly complete a computing operation specified by a user. The combined processing device may further comprise a storage device connected to the computing device and the other processing devices, respectively, for storing data of the computing device and the other processing devices. The disclosed scheme provides an instruction control method that can improve instruction-level parallelism and processing efficiency.

Description

Instruction control device, method, processor, chip and board card
Technical Field
The present disclosure relates generally to the field of processors. More particularly, the present disclosure relates to an instruction control apparatus, an instruction control method, a processor, a chip, and a board.
Background
In recent years, artificial intelligence processors have been widely used in cloud, edge, and vehicle applications. The data processed by an artificial intelligence processor is typically high-dimensional tensor data, which differs significantly from the scalar data processed by a scalar processor. As a result, the instruction-dependency maintenance approach of scalar processors cannot meet the needs of artificial intelligence processors for maintaining dependencies on tensor data.
A conventional processor maps physical registers into a free list (FreeList) of available space, using marks to identify whether each physical register is currently free. Tensor data processed by an artificial intelligence processor cannot be held in registers of fixed bit width; instead, a static random access memory (SRAM) of larger capacity is required for temporary data storage. A single SRAM generally stores multiple tensors, so maintaining the dependency of each tensor becomes a technical difficulty.
In view of this, there is a need for a way to efficiently maintain instruction dependencies to meet the needs of artificial intelligence processors.
Disclosure of Invention
To address at least one or more of the technical problems mentioned above, the present disclosure proposes, among other things, an instruction control scheme.
In a first aspect, the present disclosure provides an instruction control apparatus comprising: an instruction cache unit for caching memory access instructions to be issued; an instruction registration unit for registering the memory access instructions to be issued, the registration information comprising address dependency information between a current memory access instruction and historical memory access instructions; a resource recording unit for recording the states of the execution resources and storage resources of the memory access instructions; and an issue control unit for controlling the issuing of instructions in the instruction cache unit based on the information of the instruction registration unit and/or the resource recording unit.
In a second aspect, the present disclosure provides an instruction control method comprising: registering memory access instructions to be issued, the registration information comprising address dependency information between a current memory access instruction and historical memory access instructions; caching the memory access instructions to be issued; recording the resource states of the execution resources and storage resources of the memory access instructions; and controlling the issuing of the cached memory access instructions based on the registration information and/or the resource states.
In a third aspect, the present disclosure provides a processor comprising the instruction control apparatus of the first aspect. In a fourth aspect, the present disclosure provides a chip comprising the processor of the foregoing third aspect. In a fifth aspect, the present disclosure provides a board comprising the chip of the fourth aspect.
With the instruction control apparatus, method, processor, chip, and board described above, embodiments of the present disclosure provide a data-dependency detection scheme for memory access instructions and allow instructions without address dependencies to be issued in parallel, thereby improving instruction-level parallelism and processor performance.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar or corresponding parts and in which:
FIG. 1 illustrates a block diagram of a board of an embodiment of the present disclosure;
FIG. 2 illustrates a block diagram of a combination processing device according to an embodiment of the present disclosure;
FIG. 3 illustrates a schematic internal architecture of a processor core of a single core computing device of an embodiment of the present disclosure;
FIG. 4 illustrates a simplified schematic diagram of the internal structure of a multi-core computing device of an embodiment of the present disclosure;
FIG. 5 illustrates an exemplary internal block diagram of an instruction control device according to some embodiments of the present disclosure;
FIG. 6 illustrates an exemplary internal block diagram of an instruction control device according to further embodiments of the present disclosure;
FIG. 7 illustrates an example diagram of single instruction address range calculation, according to some embodiments of the present disclosure;
FIG. 8 illustrates a schematic diagram of an address dependency determination between two instructions;
FIG. 9 illustrates a schematic diagram of an address dependency determination between two multi-operand instructions in accordance with an embodiment of the present disclosure;
fig. 10 shows an exemplary flowchart of an instruction control method according to an embodiment of the present disclosure.
Detailed Description
The following describes the embodiments of the present disclosure clearly and completely with reference to the accompanying drawings; evidently, the embodiments described are some, but not all, embodiments of the disclosure. Based on the embodiments in this disclosure, all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," and the like, as may appear in the claims, specification and drawings of the present disclosure, are used for distinguishing between different objects and not for describing a particular sequential order. The terms "comprises" and "comprising" when used in the specification and claims of the present disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present disclosure is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the present disclosure and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the claims, the term "if" may be interpreted as "when", "upon", "in response to determining", or "in response to detecting", depending on the context. Similarly, the phrase "if it is determined" or "if [the described condition or event] is detected" may be interpreted, depending on the context, as "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Exemplary hardware Environment
Fig. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present disclosure. As shown in fig. 1, the board 10 includes a chip 101, which is a system-on-chip (SoC) integrated with one or more combined processing devices. The combined processing device is an artificial intelligence computing unit that supports various deep learning and machine learning algorithms, meeting the intelligent processing requirements of complex fields such as computer vision, speech, natural language processing, and data mining. Deep learning technology in particular is applied extensively in the cloud intelligence field; a notable characteristic of cloud intelligence applications is the large volume of input data, which places high demands on the storage and computing capacity of the platform. The board 10 of this embodiment is suitable for cloud intelligence applications, offering huge off-chip storage, huge on-chip storage, and strong computing capacity.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or the like. The data to be processed may be transferred by the external device 103 to the chip 101 through the external interface means 102. The calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface means 102. The external interface device 102 may have different interface forms, such as PCIe interfaces, etc., according to different application scenarios.
The board 10 also includes a memory device 104 for storing data, which includes one or more memory cells 105. The memory device 104 is connected to the control device 106 and the chip 101 via a bus and transmits data. The control device 106 in the board 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may comprise a single chip microcomputer (Micro Controller Unit, MCU).
Fig. 2 is a block diagram showing a combination processing apparatus in the chip 101 of this embodiment. As shown in fig. 2, the combined processing means 20 comprises computing means 201, interface means 202, processing means 203 and storage means 204.
The computing device 201 is configured to perform operations specified by a user. It is mainly implemented as a single-core or multi-core intelligent processor that performs deep learning or machine learning computations, and it may interact with the processing device 203 through the interface device 202 to jointly complete the user-specified operations.
The interface means 202 are used for transmitting data and control instructions between the computing means 201 and the processing means 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202, writing to a storage device on the chip of the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202, and write the control instructions into a control cache on the chip of the computing device 201. Alternatively or in addition, the interface device 202 may also read data in the memory device of the computing device 201 and transmit it to the processing device 203.
The processing device 203 is a general-purpose processing device that performs basic control, including but not limited to data handling and the starting and/or stopping of the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of processor, including but not limited to a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or other general-purpose and/or special-purpose processors, and its number may be determined according to actual needs. As previously mentioned, the computing device 201 of the present disclosure, considered on its own, may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together, they form a heterogeneous multi-core structure.
The storage device 204 is used to store data to be processed. It may be a DRAM, typically DDR memory with a size of 16 GB or larger, for storing data of the computing device 201 and/or the processing device 203.
Fig. 3 shows a schematic diagram of the internal structure of a processing core when the computing device 201 in fig. 2 is a single-core device. The computing device 301 is configured to process input data such as computer vision, voice, natural language, data mining, etc., and the computing device 301 includes three modules: a control module 31 (also referred to as a controller), an arithmetic module 32 (also referred to as an operator), and a storage module 33 (also referred to as a memory).
The control module 31 is used for coordinating and controlling the operation of the operation module 32 and the storage module 33 to complete the task of deep learning, and comprises a fetch unit (instruction fetch unit, IFU) 311 and an instruction decode unit (instruction decode unit, IDU) 312. The instruction fetching unit 311 is configured to fetch an instruction from the processing device 203, and the instruction decoding unit 312 decodes the fetched instruction and sends the decoded result to the operation module 32 and the storage module 33 as control information.
The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 is used for performing vector operations and can support complex operations such as vector multiplication, addition, nonlinear transformation and the like; the matrix operation unit 322 is responsible for the core computation of the deep learning algorithm, i.e., matrix multiplication and convolution.
The storage module 33 is used to store or transfer related data, and includes a neuron storage unit (neuron RAM, NRAM) 331, a weight storage unit (weight RAM, WRAM) 332, and a direct memory access module (DMA) 333. The NRAM 331 stores input neurons, output neurons, and intermediate results of computation; the WRAM 332 stores the convolution kernels, i.e., the weights, of the deep learning network; and the DMA 333 is coupled to the DRAM 204 via the bus 34 and is responsible for data transfer between the computing device 301 and the DRAM 204. It should be noted that the NRAM and WRAM here may be two memory areas formed by dividing the same memory in the logical memory space, or may be two independent memories; this is not specifically limited here.
Fig. 4 shows a simplified schematic diagram of the internal architecture of the computing device 201 of fig. 2 when it is multi-core. The multi-core computing device may be abstracted using a hierarchical hardware model. As shown, the multi-core computing device 400 is a system-on-chip that includes at least one compute cluster, and each compute cluster in turn includes a plurality of processor cores; in other words, the multi-core computing device 400 is organized in a system-on-chip / compute-cluster / processor-core hierarchy.
At the system-on-chip level, as shown, the multi-core computing device 400 includes an external memory controller 41, a peripheral communication module 42, an on-chip interconnect module 43, a global synchronization module 44, and a plurality of computing clusters 45.
There may be a plurality of external memory controllers 41 (2 are shown by way of example) for accessing external memory devices (e.g., DRAM 204 in FIG. 2) in response to access requests issued by the processor cores, so as to read data from or write data to off-chip memory. The peripheral communication module 42 is configured to receive a control signal from the processing device (203 of fig. 2) via the interface device (202 of fig. 2) and to start the computing device (201 of fig. 2) to perform a task. The on-chip interconnect module 43 connects the external memory controllers 41, the peripheral communication module 42, and the plurality of computing clusters 45, and transmits data and control signals between the modules. The global synchronization module 44 is, for example, a global synchronization barrier controller (GBC) for coordinating the working progress of each computing cluster to ensure synchronization of information. The plurality of computing clusters 45 are the computing cores of the multi-core computing device 400; 4 per die are illustratively shown, and as hardware evolves the multi-core computing device 400 of the present disclosure may also include 8, 16, 64, or even more computing clusters 45. The computing clusters 45 are used to efficiently execute deep learning algorithms.
At the level of the compute clusters, each compute cluster 45 includes a plurality of processor cores 406 as control and compute units, and a shared memory core 407 as a memory unit, as shown. Further, each computing cluster may further include a local synchronization module 412, configured to coordinate the working progress of each processor core in the computing cluster, so as to ensure synchronization of information. The processor cores 406 are illustratively shown as 4, and the present disclosure does not limit the number of processor cores 406.
The storage cores 407 are mainly used for storing and communicating, i.e., storing shared data or intermediate results between the processor cores 406, and executing communication between the compute clusters 45 and the DRAM 204, communication between the compute clusters 45, communication between the processor cores 406, and the like. In other embodiments, the memory core 407 has scalar operation capabilities to perform scalar operations.
The memory core 407 includes a shared memory unit (SMEM) 408, a broadcast bus 409, a compute-cluster direct memory access module (CDMA) 410, and a global direct memory access module (GDMA) 411. The SMEM 408 serves as a high-performance data transfer station: data reused between different processor cores 406 within the same computing cluster 45 need not be fetched from the DRAM 204 by each processor core 406 individually, but is instead relayed among the processor cores 406 through the SMEM 408. The memory core 407 only needs to distribute the reused data quickly from the SMEM 408 to the multiple processor cores 406, improving inter-core communication efficiency and greatly reducing on-chip/off-chip input/output accesses. The broadcast bus 409, the CDMA 410, and the GDMA 411 are used respectively for communication among the processor cores 406, communication among the compute clusters 45, and data transfer between the compute clusters 45 and the DRAM 204.
At the level of the processor cores, the structure of a single processor core may be similar to the block diagram of a single core computing device shown in FIG. 3 and will not be described in detail herein.
Instruction control scheme
As mentioned previously, the data processed by an artificial intelligence processor is typically high-dimensional tensor data, so the operands of its instructions are also multi-dimensional. The storage of multi-dimensional tensor data in memory involves complex situations: storage may be contiguous or non-contiguous, sizes are not fixed, and overlap patterns vary. Because of these complexities, a compiler cannot, or cannot easily, determine dependencies at compile time, and the degree of inter-instruction parallelism that static compilation can maintain is greatly reduced.
In this case, recording the storage details of operands, for example down to each dimension of the data with the address range of each dimension calculated to byte granularity, requires great storage overhead and a large memory area. For example, the size of an operand is typically 32 bits, so the data to be moved may be 1 B or 32 B. Precise control requires knowing each dimension and each storage stride, and recording this for almost every instruction; typically one instruction requires more than about 500 bits, so recording this information requires a large area. On the other hand, the operand sizes of instructions vary widely, and such precise tracking is inefficient for small-scale data. Conversely, if maintenance is managed per memory type, for example managing a whole RAM as one unit, efficiency is low: if one instruction accesses a certain RAM, no other instruction can access that RAM. For example, suppose a preceding instruction is accessing an NRAM (e.g., NRAM 331 shown in fig. 3); the entire NRAM is then occupied and other instructions cannot access it. Even though the NRAM may be relatively large, e.g., 0.75 MB, while this preceding instruction accesses only 1 B of data, subsequent instructions still cannot access other, unrelated data in that NRAM. Such a coarse-grained approach is clearly inefficient and reduces instruction parallelism. This is especially true for mass memories such as DRAM, where the parallelism of memory access instructions is low.
Accordingly, embodiments of the present disclosure provide an instruction control scheme that maintains data dependencies among memory access instructions (also referred to as data handling instructions) by recording address dependency information between the current memory access instruction and historical memory access instructions and controlling the issuing of instructions based on that address dependency information, thereby improving instruction parallelism. The memory access instructions may include, but are not limited to, instructions such as load, store, and move that transfer data from one storage location to another.
Fig. 5 illustrates an exemplary internal block diagram of an instruction control apparatus implementing an instruction control scheme according to some embodiments of the present disclosure.
As shown, the instruction control apparatus 500 includes an instruction cache unit 510 for caching memory access instructions to be issued. The instruction cache unit 510 is responsible for caching the memory access instructions received from an upstream unit (e.g., an instruction decode unit (IDU) such as instruction decode unit 312 of fig. 3).
When a single processing core executes an instruction sequence, instructions are typically buffered by type into different instruction queues awaiting issue. Instructions within each instruction queue are issued and executed in order, while instructions in different instruction queues may be issued in parallel, so that instruction issue is out of order overall.
Thus, in some implementations, the instruction cache unit 510 may include several instruction cache queues 511 for caching different types of memory access instructions respectively. There is no dependency between the instruction streams corresponding to different instruction cache queues, and the memory access instructions in the same instruction cache queue are issued in order.
For example, in a software ping-pong implementation of an LCS pipeline comprising loading (L), computing (C), and storing back (S) of data, the memory space is generally configured with at least two buffers, so that data can move between one buffer and the external memory circuit while data moves between the other buffer and the processing circuit. These two buffers may be called the ping buffer space and the pong buffer space, forming ping-pong pipelining. Specifically, while the processing circuit computes on data in the ping buffer space of the memory circuit, the memory circuit loads the next batch of computation data into its pong buffer space. As is clear from the foregoing description of the hardware architecture, the interface between the memory circuit and other memory circuits differs from the interface between the memory circuit and the processing circuit, so this parallelism can be supported, constituting pipelined processing. In such a ping-pong pipeline scenario, the instruction cache queues may include, for example, an io0 queue (for the ping buffer space) and an io1 queue (for the pong buffer space).
In other implementations, the instruction cache queue 511 may be partitioned according to the resources and manner of access. For example, three instruction cache queues are provided, one for memory access instructions (such as a move stream) that do not access off-chip memory (e.g., DRAM), one for memory access instructions (such as an io0 stream) that read off-chip memory, and one for memory access instructions (such as an io1 stream) that write off-chip memory.
The instruction control apparatus 500 further comprises an instruction registration unit 520 for registering the memory access instructions to be issued, where the registration information includes address dependency information between the current memory access instruction and historical memory access instructions.
In some implementations, instruction registration unit 520 may be further to: calculating the address range of the current access instruction; comparing the address range of the current access instruction with the address range of the historical access instruction; and recording the result of the comparison in the registration information as the above-mentioned address dependency information.
The address range may be, for example, a range of storage space on memory for operands to which the instruction relates. The address range may have different manifestations depending on the granularity used.
In some embodiments, the address range may be characterized, for example, by a minimum address value and a maximum address value of memory space occupied by an operand, which may be contiguous or may allow free or bubble portions to exist therein.
In other embodiments, when the operand is multi-dimensional data and storage between dimensions is discontinuous, the address range may be a minimum address value and a maximum address value for each dimension of the multi-dimensional data or a minimum address value and a maximum address value for a plurality of blocks of space divided according to storage continuity for the dimension.
Still further, in still other embodiments, the address range may be accurate to the address range of the data in bytes in the operand.
In still other embodiments, the address range may also be relatively coarse, e.g., defined in terms of the storage resource name of the data block.
Thus, the granularity of the address range may be selected from any of the following:
the minimum address value and the maximum address value of the data block operated by the memory access instruction;
minimum address value and maximum address value of each dimension of data block operated by memory access instruction;
an address range of data taking bytes as a unit in a data block operated by the memory access instruction; and
memory resource names of data blocks operated by the memory access instruction.
It will be appreciated that the more accurate the address range, the more information must be stored, and thus the greater the memory-area overhead. The granularity at which the address range is characterized may therefore be determined based on operational performance requirements and/or storage-area overhead. For example, in the extreme case where memory-area overhead is not a concern, the address range can be accurate to the byte; judging the data dependencies of subsequent instructions on that basis allows instructions to execute in parallel to the greatest extent and improves operational performance. When the memory area is limited, the address range can be characterized using only the minimum and maximum address values of the memory space, so that the memory area is not excessively occupied while certain performance requirements are still met.
In some embodiments, the result of the address range comparison of the current memory instruction and the address range of the history memory instruction may be characterized using a bitmap, where each bit in the bitmap is used to indicate whether there is address overlap between the current memory instruction and the address range of a history memory instruction.
It will be appreciated that, because the memory space of the instruction cache unit and the instruction registration unit is limited, the number of historical memory access instructions is bounded, for example at 48. The size of the bitmap may be determined by this upper limit, e.g., 48 bits. Whenever a memory access instruction (the current memory access instruction) arrives, its address range may be compared against the address range of each historical memory access instruction. For example, if the address ranges of the current memory access instruction and the 1st historical memory access instruction do not overlap, the 1st bit in the bitmap may be set to 1; if the address ranges of the current memory access instruction and the 2nd historical memory access instruction overlap, the 2nd bit in the bitmap may be set to 0; and vice versa.
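As an illustration only, the bitmap bookkeeping described above might be sketched as follows. This is a minimal C++ sketch: the 48-entry capacity, the type and function names, and the treatment of empty entries as 1 are assumptions for this example, not details fixed by the present embodiment.

#include <bitset>
#include <cstdint>

constexpr int kMaxInFlight = 48;   // assumed upper limit on in-flight instructions

struct AddrRange {
  uint64_t min_addr;   // minimum address value of the operand
  uint64_t max_addr;   // maximum address value of the operand
  bool valid;          // false for an empty table entry
};

// Build the address dependency bitmap for a newly registered instruction:
// bit i is 1 when there is no address overlap with historical instruction i
// (empty entries are also set to 1), so all-ones means "no dependency".
std::bitset<kMaxInFlight> buildDependencyBitmap(const AddrRange& cur,
                                                const AddrRange (&table)[kMaxInFlight]) {
  std::bitset<kMaxInFlight> bm;
  for (int i = 0; i < kMaxInFlight; ++i) {
    bool overlap = table[i].valid &&
                   !(cur.max_addr < table[i].min_addr ||
                     table[i].max_addr < cur.min_addr);
    bm[i] = !overlap;
  }
  return bm;
}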
It will further be appreciated that when a new memory access instruction arrives, address dependency information may be generated for the memory access instruction with the historical memory access instructions. Then, as the instruction execution progresses, when an instruction is completed, the address dependency information may be updated according to the completion information.
The instruction control apparatus 500 further comprises a resource recording unit 530 for recording the status of the execution resources and the storage resources of the memory access instruction.
The execution resources of the memory access instructions may include, for example, one or more downstream memory access instruction decode units. The storage resources of the memory access instructions may include various on-chip and/or off-chip memory circuits, such as the DRAM in FIG. 2, an external scratchpad array (not shown), the NRAM and WRAM in FIG. 3, the SRAM in FIG. 4, the register file (Regfile) in other hardware architectures, and so on. The state of a storage resource is recorded per resource type; for example, the number of instructions currently accessing the NRAM is recorded.
In some embodiments, the resource recording unit 530 may be further configured to: mark a corresponding resource as busy (BUSY) in response to a memory access instruction being issued to a downstream resource; and/or mark a corresponding resource as idle (IDLE) in response to the resource finishing its use. For example, when a memory access instruction accessing the DRAM is issued to the memory access instruction decode unit DEC1 for decoding, DEC1 and the DRAM may be marked as BUSY. In this way, the resource recording unit 530 can update the resource status in time for determining whether an instruction can be issued.
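By way of a hedged sketch, this busy/idle bookkeeping could look like the following; the four downstream decode units and all names are assumptions for illustration, not the device's actual structure.

#include <array>

enum class ResourceState { kIdle, kBusy };

struct ResourceRecord {
  std::array<ResourceState, 4> decode_units{};  // e.g., assumed DEC0..DEC3 downstream

  void markBusy(int id) { decode_units[id] = ResourceState::kBusy; }  // on issue
  void markIdle(int id) { decode_units[id] = ResourceState::kIdle; }  // on completion
  bool anyIdle() const {
    for (ResourceState s : decode_units)
      if (s == ResourceState::kIdle) return true;
    return false;
  }
};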
The instruction control apparatus 500 further comprises an issue control unit 540, responsible for monitoring the exit of the instruction cache unit 510 and blocking the exit until an instruction's dependencies are released. In some embodiments, the issue control unit 540 controls the issuing of instructions in the instruction cache unit 510 based on the information of the instruction registration unit 520 and/or the resource recording unit 530.
In particular, the issue control unit 540 may be configured to request issue of the current memory access instruction in response to the registration information in the instruction registration unit 520 indicating that the address dependencies between the current memory access instruction and the historical memory access instructions have been released, while the resource recording unit 530 indicates that an execution resource is free. Since the dependency on storage resources has already been analyzed and judged through the address dependency information between the current and historical memory access instructions, the state information of the storage resources in the resource recording unit need not be considered here. For example, when the address dependency information between the current memory access instruction and the historical memory access instructions indicates no dependency, e.g., when the bits of the address dependency bitmap are all 1, and there is an idle memory access instruction decode unit, issue of the memory access instruction to the idle decode unit may be requested.
According to embodiments of the present disclosure, when an upstream instruction arrives, it is registered in the instruction registration unit 520 and then buffered in the instruction cache unit. For example, when the first instruction arrives, it may be registered, allocated an index, and cached in the corresponding instruction cache queue. Because only one instruction exists at this point, there is no address dependency and the execution resources are idle, so the instruction can be issued immediately. In this case there may be no need to generate an address dependency bitmap for the first instruction; only its address range, such as the minimum and maximum address values, needs to be recorded.
Then, when a second instruction arrives, it may be registered and an address dependency bitmap may be generated for it. When the address ranges of the second instruction and the first instruction do not overlap, the address dependency bitmap is all 1s, with the bits corresponding to empty entries (no instruction) also set to 1. When the second instruction reaches the head (exit) of its instruction cache queue, its address dependency bitmap may be looked up in the instruction registration unit 520 by the second instruction's index. If the address dependency bitmap is all 1s, there is no address dependency between the second instruction and the historical memory access instructions. If the record of the resource recording unit indicates that an execution resource is also free, issue of the second instruction may be requested.
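Putting the two records together, the issue check performed at a queue head might be sketched as below, reusing the illustrative types from the sketches above; this is an assumption-laden simplification, not the device's actual logic.

// Request issue only when the instruction's bitmap (looked up by its
// registered index) is all ones and some execution resource is idle.
bool mayRequestIssue(const std::bitset<kMaxInFlight>& dep_bitmap,
                     const ResourceRecord& resources) {
  return dep_bitmap.all() && resources.anyIdle();
}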
Therefore, by registering the address dependency information between the current memory access instruction and the historical memory access instructions, instructions whose access addresses do not overlap can be allowed to issue in parallel, improving instruction-level parallelism. It can be seen that, at small hardware cost, embodiments of the present disclosure reduce the time software spends judging data dependencies, improve the parallelism of instruction execution, and greatly improve processor performance.
Fig. 6 illustrates an exemplary internal structural diagram of an instruction control device according to further embodiments of the present disclosure. The instruction control apparatus of fig. 6 is similar to that of fig. 5, further showing specific implementations of the instruction cache unit, the instruction registration unit, and the issue control unit; the foregoing description with reference to fig. 5 therefore applies similarly to fig. 6, and the common parts are not repeated.
As shown in fig. 6, the instruction cache unit 610 may include a plurality of instruction cache queues 611 for caching different types of memory access instructions; there is no dependency between the instruction streams corresponding to different instruction cache queues, and the memory access instructions in the same instruction cache queue are ordered. That is, each instruction cache queue 611 is a first-in-first-out (FIFO) queue whose entries are output in order to the execution units for instruction decoding.
As described above, when an instruction arrives, the instruction registration unit 620 registers it; if the instruction registration unit 620 is full, no further instructions can be received. Yet at that point the instruction cache unit may actually be empty, which wastes resources.
Thus, in some embodiments, instruction cache unit 610 may further include a temporary cache queue 612 to temporarily cache subsequent unregistered instructions when instruction registration unit 620 is full, and to output the cached unregistered instructions for registration in instruction registration unit 620 when instruction registration unit 620 has room.
Further, the instruction cache unit 610 may also include a shared queue 613 for caching the information of all memory access instructions that currently enter the instruction cache unit. Each cache queue (including the instruction cache queues 611 and the temporary cache queue 612) then only needs to record, for each memory access instruction it holds, that instruction's index in the shared queue 613, rather than the complete instruction information. By providing the shared queue, the storage layout of the instruction cache unit is optimized, saving storage space.
For example, suppose the original instruction cache unit includes three instruction cache queues, each storing 16 instructions of 512 bits each; the total memory space would then require 512 bits × 16 × 3 = 24,576 bits. According to the above embodiment, the complete information of an instruction is stored in the shared queue 613. Assuming the three instruction cache queues can never all be full at the same time, the shared queue may be sized at 512 bits × 20, i.e., holding 20 instructions. With 20 instructions, each instruction can be identified by a 5-bit index, so the three instruction cache queues need only 5 bits × 16 each, i.e., each instruction cache queue caches the indices of up to 16 instructions in order. The occupied memory space is then 512 bits × 20 + 5 bits × 16 × 3 = 10,480 bits, much smaller than before. When an instruction in an instruction cache queue needs to be issued, the corresponding instruction can be found at the corresponding position of the shared queue 613 according to its index.
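A minimal sketch of this index-sharing layout, assuming the 20-entry/512-bit sizes of the example above; the free-slot management and all names are illustrative assumptions.

#include <array>
#include <cstdint>
#include <optional>

struct Inst512 { std::array<uint64_t, 8> words; };   // 8 x 64 = 512 bits per instruction

class SharedInstQueue {
 public:
  static constexpr int kSlots = 20;                  // 20 entries as in the example

  // Store the full instruction once; return its 5-bit index, or nothing if full.
  std::optional<uint8_t> alloc(const Inst512& inst) {
    for (uint8_t i = 0; i < kSlots; ++i)
      if (!used_[i]) { used_[i] = true; slots_[i] = inst; return i; }
    return std::nullopt;
  }
  const Inst512& get(uint8_t idx) const { return slots_[idx]; }
  void release(uint8_t idx) { used_[idx] = false; }  // entry freed after issue

 private:
  std::array<Inst512, kSlots> slots_{};
  std::array<bool, kSlots> used_{};
};

// Each instruction cache queue (and the temporary cache queue) then only
// holds indices, e.g.: std::deque<uint8_t> io0_queue;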
In this case, adding the temporary cache queue 612 actually requires adding only one queue that stores indices, so the storage overhead is very small. Continuing the example above, the temporary cache queue 612 likewise requires only 5 bits × 16.
With the temporary cache queue 612 in place, when the instruction registration unit 620 is already full, an arriving instruction may first be recorded in the instruction cache unit, with its index stored in the temporary cache queue 612. Later, when the instruction registration unit 620 has room, the unregistered instructions buffered in the temporary cache queue 612 are registered in the instruction registration unit 620; specifically, the temporary cache queue 612 outputs the cached unregistered instructions for registration in the instruction registration unit 620. Registration in the instruction registration unit 620 includes allocating an index to the instruction, generating the address dependency bitmap, and so on, as described above.
Thus, by introducing a temporary cache queue, only a small amount of logic needs to be added to extend the storage capacity of instruction cache unit 610 significantly.
In some implementations, the instruction registration unit 620 may include a shared table 621 and several private commit queues 622. The shared table 621 registers the registration information of all memory access instructions to be issued that enter the instruction registration unit 620 in order; for example, an index number may be allocated to each memory access instruction in sequence. The private commit queues 622 register, in order, the memory access instructions entering each private commit queue. Note that while registration into the shared table and into the private commit queues follows instruction order, the allocation of entries within the shared table may be out of order: an instruction is placed wherever a free entry exists. The private commit queues 622 correspond one-to-one with the instruction cache queues 611 in the instruction cache unit 610.
The instruction registration unit 620 is also responsible for updating the completion status of these instructions. In particular, the instruction registration unit 620 may release the entries in the shared table and the private commit queue for a registered instruction in response to that instruction completing. As mentioned previously, updating also includes updating the address dependency bitmaps created for registered instructions when such a completion occurs.
In some implementations, the issue control unit 640 may include control logic units 641 and a selector unit 642. A control logic unit 641 may be provided at the exit of each instruction cache queue 611 to monitor that exit. The exit of the instruction cache queue is blocked until the instruction's dependencies are released, or while a prior synchronization instruction has not yet committed. Whether an instruction's dependencies are released may be determined by looking up the corresponding address dependency bitmap in the shared table 621 of the instruction registration unit 620, using the index allocated by the instruction registration unit 620: if the address dependency bitmap is all 1s, the dependencies are released; otherwise they are not. The control logic unit 641 further determines whether the execution resources required by the instruction are idle, and if so, can generate an issue request.
Multiple instruction cache queues 611 may generate issue requests at the same time; the selector unit 642 arbitrates among these requests, selects one instruction to issue, and updates the hardware resource status accordingly.
As mentioned above, the processing data of the artificial intelligence processor is tensor data, and the address range of the corresponding instruction may be complex.
FIG. 7 illustrates an example diagram of single instruction address range calculation, according to some embodiments of the present disclosure. In this example, the instruction relates to the handling of tensor data, described by way of example as three-dimensional tensor data handling.
The storage of three-dimensional tensor data in memory can be described using the following parameters: base_addr, the start address (base address) of the tensor data; dim0_size, the size of the lowest-dimension data, which is typically stored contiguously; dim0_stride, the storage step of the lowest-dimension data, i.e., the storage interval between adjacent data in the next-lowest dimension; iter1, the number of data elements in the next-lowest dimension; dim1_stride, the storage step of the next-lowest-dimension data, i.e., the storage interval between adjacent data in the highest dimension; and iter2, the number of data elements in the highest dimension. It will be appreciated that other parameters may also be involved in describing the storage of tensor data; they are not listed here, since they are not relevant to the address range calculation of embodiments of the present disclosure.
The signs of dim0_stride and dim1_stride represent different directions. According to whether dim0_stride and dim1_stride are positive or negative, four cases can be distinguished: 00, 10, 01, and 11, where 0 denotes positive and 1 denotes negative. Examples of these four cases are shown in fig. 7, where the position indicated by the triangle is the start position base_addr, and each horizontal arrow segment represents one dim0_size of data, i.e., contiguously stored lowest-dimension data, with iter1 = 3 and iter2 = 3.
As shown, in case 00 both dim0_stride and dim1_stride are positive, i.e., both addresses increase, which in the figure is a jump to the right. The upper arc arrow represents dim0_stride and the lower arc arrow represents dim1_stride. The data handling order is as follows: the address pointer starts from the triangle position and moves dim0_size of data; the pointer then jumps one dim0_stride from the triangle position, i.e., to base_addr + dim0_stride, and moves dim0_size of data; then it jumps another dim0_stride and moves dim0_size of data. Next, for the following dimension, the address pointer jumps one dim1_stride from the triangle position and the previous operation is repeated, moving 3 × dim0_size of data; finally, the address pointer jumps one more dim1_stride and the previous operation is repeated, moving 3 × dim0_size of data, at which point the whole tensor has been moved.
Similarly, case 10 corresponds to dim0_stride being negative and dim1_stride positive: the arrow representing dim0_stride jumps to the left and the arrow representing dim1_stride jumps to the right. Those skilled in the art can deduce the data handling process from the illustrated order, so it is not elaborated here; a traversal sketch follows below.
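The traversal described above might be sketched as the following strided copy. This is a simplification under stated assumptions: byte granularity is assumed, negative strides cover the 10/01/11 cases, and, following the Min/Max formulas below, iter1/iter2 are treated here as the number of stride jumps rather than element counts.

#include <cstdint>
#include <cstring>

// Copy a three-dimensional tensor whose lowest dimension (dim0_size bytes)
// is stored contiguously; dim0_stride/dim1_stride may be negative.
void copyTensor3D(uint8_t* dst, const uint8_t* base_addr,
                  int64_t dim0_size, int64_t dim0_stride, int64_t iter1,
                  int64_t dim1_stride, int64_t iter2) {
  for (int64_t i2 = 0; i2 <= iter2; ++i2) {      // highest dimension
    for (int64_t i1 = 0; i1 <= iter1; ++i1) {    // next-lowest dimension
      const uint8_t* src = base_addr + i2 * dim1_stride + i1 * dim0_stride;
      std::memcpy(dst, src, static_cast<size_t>(dim0_size));
      dst += dim0_size;
    }
  }
}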
The address minimum and address maximum in different situations can be deduced from the process in the figure. For example, the four cases in fig. 7 may calculate the address minimum and the address maximum as follows, respectively:
00:
Min=base_addr
Max=base_addr+dim0_stride[47:0]*iter1+dim1_stride[31:0]*iter2+dim0_size-1
10:
Min=base_addr+dim0_stride[47:0]*iter1
Max=base_addr+dim1_stride[31:0]*iter2+dim0_size-1
01:
Min=base_addr+dim1_stride[31:0]*iter2
Max=base_addr+dim0_stride[47:0]*iter1+dim0_size-1
11:
Min=base_addr+dim1_stride[31:0]*iter2+dim0_stride[47:0]*iter1
Max=base_addr+dim0_size-1
It will be appreciated by those skilled in the art that the above address range calculation is merely an example; corresponding address range calculation methods may be constructed for different circumstances, such as when the storage of tensor data is represented in a different way. Embodiments of the disclosure are not limited in this respect.
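For illustration only, the four sign cases collapse into one computation if a negative stride contribution is added to the minimum and a positive one to the maximum. The sketch below follows the Min/Max formulas above; the names and signed 64-bit types are assumptions.

#include <cstdint>

struct Range { int64_t min_addr; int64_t max_addr; };

Range addrRange(int64_t base_addr, int64_t dim0_size,
                int64_t dim0_stride, int64_t iter1,
                int64_t dim1_stride, int64_t iter2) {
  Range r{base_addr, base_addr + dim0_size - 1};
  const int64_t d0 = dim0_stride * iter1;   // total lowest-dimension excursion
  const int64_t d1 = dim1_stride * iter2;   // total next-dimension excursion
  if (d0 < 0) r.min_addr += d0; else r.max_addr += d0;   // sign bit for dim0_stride
  if (d1 < 0) r.min_addr += d1; else r.max_addr += d1;   // sign bit for dim1_stride
  return r;
}

For example, with both strides negative (case 11) this yields Min = base_addr + dim1_stride*iter2 + dim0_stride*iter1 and Max = base_addr + dim0_size - 1, matching the formulas above.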
As can be seen from the above description, the address range calculation of some instructions is complex and may take more time. Thus, in some embodiments, when the address computation is complex, registration may be performed in advance, and the computation path may be pipelined by inserting registers to reduce the critical-path latency, with the address written back in a subsequent cycle.
FIG. 8 is a schematic diagram of an address dependency determination between two instructions. Once the address range (address minimum and address maximum) of each instruction is determined, whether a dependency exists between two instructions can be judged from their respective ranges. As shown, the address range of each instruction can be represented by a line segment from the address minimum to the address maximum. Fig. 8 shows the four cases possible between the address ranges of two instructions.
As shown, only in cases (a) and (d) is there no overlap, i.e., no dependency, between the two instructions. The no-dependency determination can therefore be expressed as follows:
Mutex=A_max<B_min||B_max<A_min
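In code form this is a direct transcription, using the Range type sketched earlier:

// True when the two address ranges cannot conflict: one range ends
// strictly before the other begins (cases (a) and (d) in fig. 8).
inline bool mutex(const Range& a, const Range& b) {
  return a.max_addr < b.min_addr || b.max_addr < a.min_addr;
}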
The instruction control method of the embodiments of the disclosure can be applied to instruction systems with a single source operand and a single destination operand, as well as to instruction systems with multiple source operands and multiple destination operands.
When an instruction has multiple operands, an address range may be determined for each operand. For example, an instruction may support 3 reads and 2 writes, i.e., three source operands are read and two destination operands are written. In a preferred embodiment, it is observed that although an atomic operation has both a source operand and a destination operand, its source address and destination address are typically the same, so one fewer address range needs to be stored. That is, in a hardware implementation, the above instruction supporting 3 reads and 2 writes only needs to store the address ranges of 3 reads and 1 write.
FIG. 9 illustrates a schematic diagram of an address dependency determination between two multi-operand instructions in accordance with an embodiment of the present disclosure. In this example, assume that each instruction stores a total of 4 address ranges: 3 reads and 1 write.
As shown, the 3 source address ranges and 1 destination address range of instructions inst1 and inst2 are each shown in dashed lines. Inst2 follows inst1, and it is necessary to determine whether there is an address dependency between inst2 and inst1. According to the possible conflict modes between reads and writes, read-after-write, write-after-write, and write-after-read hazards all need to be checked. Specifically, it is necessary to determine whether the 3 source address ranges of inst2 each overlap the 1 destination address range of inst1, i.e., read-after-write conflicts, as indicated by the arrows, each arrow representing one address dependency determination. In addition, it is necessary to determine whether the 1 destination address range of inst2 overlaps each of the 3 source address ranges and the 1 destination address range of inst1, i.e., write-after-read and write-after-write conflicts; these address dependency determinations are likewise shown by arrows.
It follows that 7 address dependency determinations are required between two multi-operand (3-read, 1-write) instructions.
In some implementations, the address dependency between the instructions may be determined from the results of the multiple address dependency determinations described above. For example, when all of the address dependency determinations indicate that the addresses do not overlap, it may be determined that there is no dependency between inst2 and inst1; when any address dependency determination indicates an overlap, it may be determined that a dependency exists between inst2 and inst1. That is, rather than recording the dependency result for each source and destination address in detail, the dependency relationship may be determined for the instruction as a whole, saving storage space.
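A hedged sketch of the 7 checks of fig. 9, collapsed into a single per-instruction verdict as just described; the 3-read/1-write layout follows the text, while the types reuse the Range and mutex sketches above and everything else is illustrative.

#include <array>

struct InstRanges {
  std::array<Range, 3> src;   // up to 3 source operand address ranges
  Range dst;                  // 1 destination range (atomics share src/dst)
};

// inst2 follows inst1; return true if any of the 7 pairwise checks overlaps.
bool dependsOn(const InstRanges& inst2, const InstRanges& inst1) {
  for (const Range& s2 : inst2.src)        // 3x read-after-write
    if (!mutex(s2, inst1.dst)) return true;
  for (const Range& s1 : inst1.src)        // 3x write-after-read
    if (!mutex(inst2.dst, s1)) return true;
  return !mutex(inst2.dst, inst1.dst);     // 1x write-after-write
}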
Fig. 10 shows an exemplary flowchart of an instruction control method according to an embodiment of the present disclosure.
As shown, in step 1010, the memory access instructions to be issued are registered, where the registration information includes address dependency information between the current memory access instruction and historical memory access instructions. This step may be performed, for example, in the instruction registration unit.
In some embodiments, registering the memory access instructions to be issued may include: calculating the address range of the current memory access instruction; comparing the address range of the current memory access instruction with the address ranges of the historical memory access instructions; and recording the result of the comparison in the registration information as the aforementioned address dependency information.
The granularity of the address range may be selected from any of the following: the minimum and maximum address values of the data block operated on by the memory access instruction; the minimum and maximum address values of each dimension of the data block operated on by the memory access instruction; the byte-level address ranges of the data within the data block operated on by the memory access instruction; or the name of the storage resource holding the data block operated on by the memory access instruction. In some embodiments, the granularity of the address range may be determined based on operational performance requirements and/or storage area overhead, as illustrated below.
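As a rough illustration of the first two granularities, assuming a hypothetical 2-D data block described by a base address, shape, and byte strides (none of which are structures defined by this disclosure, and element size is taken as 1 byte for simplicity), the trade-off might look as follows:

```python
# Illustrative sketch only: one hypothetical 2-D data block, described by
# a base byte address, a shape, and byte strides per dimension.
block = {"base": 0x1000, "shape": (4, 64), "strides": (256, 1)}

# Coarsest listed granularity: a single (min, max) pair over the block.
span = sum((n - 1) * s for n, s in zip(block["shape"], block["strides"]))
coarse = (block["base"], block["base"] + span)          # (4096, 4927)

# Finer listed granularity: a (min, max) offset pair per dimension.
per_dim = [(0, (n - 1) * s)
           for n, s in zip(block["shape"], block["strides"])]  # [(0, 768), (0, 63)]
```

A coarser granularity costs less storage but may report false overlaps for strided blocks; a finer granularity checks more precisely at higher storage cost, which is the performance/area trade-off mentioned above.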
In some embodiments, the result of the above comparison may be represented by a bitmap, where each bit in the bitmap indicates whether the address range of the current memory access instruction overlaps the address range of one historical memory access instruction.
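By way of a hedged example, assuming inclusive (min, max) address ranges as the chosen granularity, the bitmap construction might be sketched as follows (names are hypothetical):

```python
# Illustrative sketch only: register a new memory access instruction and
# record its address dependencies on in-flight history as a bitmap.
def register(history, new_range):
    bitmap = 0
    for i, old in enumerate(history):
        # Bit i is set iff the new range overlaps history entry i.
        if new_range[0] <= old[1] and old[0] <= new_range[1]:
            bitmap |= 1 << i
    history.append(new_range)
    return bitmap

history = [(0, 255), (256, 511)]
print(bin(register(history, (200, 300))))  # 0b11: overlaps both entries
```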
Continuing with FIG. 10, in step 1020, the memory access instruction to be issued is cached. This step may be performed, for example, in an instruction cache unit. Caching the memory access instructions to be issued may include: using multiple instruction cache queues to separately cache memory access instructions of different types, where there is no dependency between the instruction streams corresponding to different instruction cache queues, and memory access instructions within the same instruction cache queue are issued in order.
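A minimal sketch of this queue organization, assuming for illustration three hypothetical instruction types, might look as follows:

```python
from collections import deque

# Illustrative sketch only: one FIFO per hypothetical instruction type.
queues = {"load": deque(), "store": deque(), "move": deque()}

def enqueue(inst):
    queues[inst["type"]].append(inst)

def issue_candidates():
    # The head of every non-empty queue may issue: order is preserved
    # within a queue, while the queues themselves are independent.
    return [q[0] for q in queues.values() if q]

enqueue({"type": "load", "id": 0})
enqueue({"type": "store", "id": 1})
print(issue_candidates())  # both heads are issue candidates
```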
In step 1030, the resource status of the execution resources and storage resources used by the memory access instructions is recorded. This step may be performed, for example, in a resource recording unit.
In step 1040, the issue of the cached memory access instructions is controlled based on the registration information and/or the resource status. This step may be performed, for example, in an issue control unit.
In some embodiments, controlling the issue of the cached memory access instructions may include: in response to the registration information indicating that the address dependencies between the current memory access instruction and historical memory access instructions have been resolved and that there is an idle execution resource, requesting issue of the current memory access instruction.
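The issue condition reduces to a simple predicate; the following sketch assumes a hypothetical per-instruction dependency bitmap (see the registration example above) and a map of decode unit states, neither of which is the disclosed hardware structure:

```python
# Illustrative sketch only of the issue condition in step 1040.
def can_issue(entry, decode_units):
    # dep_bitmap has one bit per unfinished historical instruction and
    # becomes 0 once all address dependencies have been resolved.
    deps_resolved = entry["dep_bitmap"] == 0
    free_unit = any(state == "idle" for state in decode_units.values())
    return deps_resolved and free_unit

entry = {"dep_bitmap": 0}
units = {"decode0": "busy", "decode1": "idle"}
print(can_issue(entry, units))  # True
```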
In some embodiments, the instruction control method may further include: using a temporary cache queue to temporarily cache subsequent unregistered instructions when the registration space is full, and outputting the cached unregistered instructions for registration when the registration space has free entries. The temporary cache queue may be provided in the instruction cache unit, and the registration space may be provided in the instruction registration unit.
In some embodiments, the instruction control method may further include: using a shared queue to cache the information of all memory access instructions currently in the cache space, where the instruction cache queues and the temporary cache queue record only the indices, within the shared queue, of the memory access instructions they hold. The shared queue and the instruction cache queues may be provided in the instruction cache unit.
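The index-only bookkeeping might be sketched as follows, where the shared list, the queue names, and the accept routine are all hypothetical illustrations rather than the disclosed structures:

```python
from collections import deque

# Illustrative sketch only: full instruction payloads live once in a
# shared queue; the typed cache queues and the temporary queue hold
# only indices into it, avoiding duplicated storage.
shared = []           # index -> full instruction information
cache_q = deque()     # indices of registered, issuable instructions
temp_q = deque()      # overflow indices while the registration space is full

def accept(inst, registration_full):
    shared.append(inst)
    idx = len(shared) - 1
    (temp_q if registration_full else cache_q).append(idx)

accept({"op": "load"}, registration_full=False)
accept({"op": "store"}, registration_full=True)
print(list(cache_q), list(temp_q))  # [0] [1]
```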
In some embodiments, the instruction control method may further include: using a shared table to register, in order, the registration information of all to-be-issued memory access instructions entering the registration space; and using multiple private commit queues to register, in order, the memory access instructions entering each private commit queue. The shared table and the private commit queues may be provided in the instruction registration unit.
In some embodiments, the instruction control method may further include: in response to completion of a registered instruction, releasing the entries for that instruction in the shared table and the private commit queue.
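A hedged sketch of the shared table and private commit queues, including the release-on-completion behavior, might look as follows (all names hypothetical):

```python
# Illustrative sketch only: a shared table registers all to-be-issued
# instructions in order; each private commit queue tracks one stream.
shared_table = {}                         # inst_id -> registration info
commit_queues = {"load": [], "store": []}

def register_entry(inst_id, info, queue):
    shared_table[inst_id] = info
    commit_queues[queue].append(inst_id)

def retire(inst_id, queue):
    # On completion, release both the shared-table entry and the
    # private commit queue entry of the registered instruction.
    del shared_table[inst_id]
    commit_queues[queue].remove(inst_id)

register_entry(0, {"range": (0, 63)}, "load")
retire(0, "load")
print(shared_table, commit_queues)  # {} {'load': [], 'store': []}
```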
In some embodiments, the instruction control method may further include: in response to a memory access instruction being issued to a downstream resource, marking the corresponding resource as busy; and/or in response to the corresponding resource completing its use, marking the resource as idle; where the execution resources include one or more decode units, and the storage resources include on-chip and/or off-chip memory circuits.
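The busy/idle bookkeeping of the resource recording step reduces to two state transitions, sketched below with hypothetical resource names:

```python
# Illustrative sketch only: mark busy on issue, idle on completion.
resources = {"decode0": "idle", "decode1": "idle", "sram": "idle"}

def on_issue(name):
    resources[name] = "busy"

def on_done(name):
    resources[name] = "idle"

on_issue("decode0")
print(resources["decode0"])  # busy
on_done("decode0")
print(resources["decode0"])  # idle
```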
Those skilled in the art will appreciate that the various features of the instruction control device described above in connection with FIGS. 5 and 6 may be applied analogously to the instruction control method of FIG. 10, and are therefore not repeated here.
The above instruction control method may be performed within a single processor, within a single processor core, or across multiple processor cores, and embodiments of the present disclosure are not limited in this respect. The method may be implemented entirely in hardware (such as the hardware described above in connection with FIGS. 5 and 6), or implemented in combination with a software system.
It will be appreciated by those skilled in the art that although the above scheme is presented for the data dependency maintenance problem of tensor data, it is not so limited and may be applied to data dependency detection and maintenance between memory access instructions for any data. Furthermore, while the above scheme is presented for an artificial intelligence processor, it may also be applied to general-purpose or special-purpose processors such as CPUs and GPUs, and embodiments of the present disclosure are not limited in this respect. The scheme is suitable for both single-instruction-stream and multi-instruction-stream systems.
Embodiments of the present disclosure also provide a processor including the aforementioned instruction control device for carrying out the aforementioned instruction control method. Embodiments of the present disclosure also provide a chip that may include the processor of any of the embodiments described above in connection with the drawings. Further, the present disclosure also provides a board card that may include the aforementioned chip.
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a PC device, an Internet-of-Things terminal, a mobile terminal, a cell phone, a driving recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage device, a wearable device, a visual terminal, an autonomous driving terminal, a vehicle, a household appliance, and/or a medical device. The vehicle includes an aircraft, a ship, and/or a car; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasound apparatus, and/or an electrocardiograph. The electronic device or apparatus of the present disclosure may also be applied to the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, medical care, and the like. Further, the electronic device or apparatus of the present disclosure may also be used in cloud, edge, and terminal application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, a computation-intensive electronic device or apparatus according to aspects of the present disclosure may be applied to a cloud device (e.g., a cloud server), while a low-power electronic device or apparatus may be applied to a terminal device and/or an edge device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and that of the terminal device and/or edge device are mutually compatible, so that appropriate hardware resources of the cloud device can be matched, according to the hardware information of the terminal device and/or edge device, to simulate the hardware resources of the terminal device and/or edge device, thereby achieving unified management, scheduling, and collaborative work of device-cloud or edge-cloud integration.
It should be noted that, for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of actions and combinations thereof, but those skilled in the art will understand that the aspects of the present disclosure are not limited by the order of the actions described. Accordingly, based on the disclosure or teachings herein, those of ordinary skill in the art will appreciate that certain steps may be performed in other orders or concurrently. Further, those skilled in the art will appreciate that the embodiments described in this disclosure may be regarded as optional embodiments, i.e., the actions or modules involved are not necessarily required for implementing one or more aspects of this disclosure. In addition, depending on the scheme, the descriptions of different embodiments each have their own emphasis. In view of this, those skilled in the art will appreciate that for portions not described in detail in one embodiment of the disclosure, reference may be made to the related descriptions of other embodiments.
In particular implementations, based on the disclosure and teachings herein, those of ordinary skill in the art will appreciate that the several embodiments disclosed in this disclosure may also be realized in ways not disclosed herein. For example, in the foregoing embodiments of the electronic device or apparatus, the units are divided according to logical function, and other division manners are possible in actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features or functions in a unit or component may be selectively disabled. As for the connection relationships between different units or components, the connections discussed above in connection with the figures may be direct or indirect couplings between the units or components. In some scenarios, the foregoing direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, some or all of the units may be selected to achieve the objectives of the embodiments of the present disclosure. In addition, in some scenarios, multiple units in embodiments of the disclosure may be integrated into one unit, or each unit may physically exist separately.
In other implementation scenarios, the integrated units may also be implemented in hardware, that is, as specific hardware circuits, which may include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of a circuit may include, but is not limited to, physical devices, which may include, but are not limited to, devices such as transistors or memristors. In view of this, the various types of devices described herein (e.g., computing devices or other processing devices) may be implemented by appropriate hardware processors, such as CPUs, GPUs, FPGAs, DSPs, ASICs, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic or magneto-optical storage media, etc.), and may be, for example, resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high bandwidth memory (HBM), hybrid memory cube (HMC), ROM, RAM, and the like.
The foregoing may be better understood in light of the following clauses:
clause 1, an instruction control device, comprising: the instruction cache unit is used for caching access instructions to be transmitted; the instruction registration unit is used for registering access instructions to be transmitted, wherein the registration information comprises address dependency information between the current access instructions and the historical access instructions; the resource recording unit is used for recording the states of the execution resource and the storage resource of the access instruction; and the emission control unit is used for controlling the emission of the instruction in the instruction cache unit based on the information of the instruction registration unit and/or the resource recording unit.
Clause 2, the instruction control device according to clause 1, wherein the instruction registration unit is further configured to: calculating the address range of the current access instruction; comparing the address range of the current access instruction with the address range of the historical access instruction; and recording a result of the comparison in the registration information as the address dependency information.
Clause 3, the instruction control device of clause 2, wherein the granularity of the address range is selected from any one of the following: the minimum address value and the maximum address value of the data block operated by the access instruction; the minimum address value and the maximum address value of each dimension of the data block operated by the memory access instruction; an address range of data taking bytes as a unit in a data block operated by the access instruction; and the storage resource name of the data block operated by the access instruction.
Clause 4, the instruction control device of clause 3, wherein the granularity of the address range is determined according to the operational performance requirement and/or the memory area overhead.
Clause 5, the instruction control device of any of clauses 2-4, wherein the result of the comparison is characterized using a bitmap, each bit in the bitmap being used to indicate whether there is an address overlap of the current access instruction with an address range of a historical access instruction.
Clause 6, the instruction control device of any of clauses 1-5, wherein the emission control unit is further configured to: in response to the registration information indicating that address dependencies between current and historical memory instructions have been resolved and that the execution resources are free, the current memory instruction is requested to be launched.
Clause 7, the instruction control device according to any of clauses 1-6, wherein the instruction cache unit comprises: and the instruction cache queues are used for respectively caching access instructions of different types, the instruction flows corresponding to the different instruction cache queues have no dependency, and the access instructions in the same instruction cache queue are sequentially transmitted.
Clause 8, the instruction control device of clause 7, wherein the instruction cache unit further comprises: a temporary buffer queue for temporarily buffering a subsequent unregistered instruction when the instruction registration unit is full, and outputting the buffered unregistered instruction to be registered in the instruction registration unit when the instruction registration unit has a space.
Clause 9, the instruction control device according to clause 8, wherein the instruction cache unit further comprises: the shared queue is used for caching the information of all access instructions which enter the instruction caching unit currently; wherein the instruction cache queue and the temporary cache queue only record the index of access instructions cached in their queues in the shared queue.
Clause 10, the instruction control device according to any of clauses 1-9, wherein the instruction registration unit includes: a shared table for sequentially registering the registration information of all the memory access instructions to be transmitted entering the instruction registration unit; and the private commit queues are used for registering access instructions entering each private commit queue in sequence.
Clause 11, the instruction control device of clause 10, wherein the instruction registration unit is further configured to: responsive to completion of a registered instruction, releasing entries in the shared table and the private commit queue for the registered instruction.
Clause 12, the instruction control device according to any of clauses 1-11, wherein the resource recording unit is further configured to: mark the corresponding resource as busy in response to a memory access instruction being transmitted to a downstream resource; and/or mark the corresponding resource as idle in response to its use being completed; wherein the execution resources include one or more memory access instruction decode units and the storage resources include on-chip and/or off-chip memory circuits.
Clause 13, a processor, comprising the instruction control device according to clause 12.
Clause 14, a chip comprising the processor of clause 13.
Clause 15, a board card comprising the chip of clause 14.
Clause 16, a method of instruction control, comprising: registering access instructions to be transmitted, wherein the registration information comprises address dependency information between the current access instructions and the historical access instructions; caching access instructions to be transmitted; recording the resource states of the execution resource and the storage resource of the access instruction; and controlling the transmission of the cached access instruction based on the registration information and/or the resource state.
Clause 17, the instruction control method according to clause 16, wherein the registering the memory access instruction to be transmitted includes: calculating the address range of the current access instruction; comparing the address range of the current access instruction with the address range of the historical access instruction; and recording a result of the comparison in the registration information as the address dependency information.
Clause 18, the instruction control method of clause 17, wherein the granularity of the address range is selected from any of the following: the minimum address value and the maximum address value of the data block operated by the access instruction; the minimum address value and the maximum address value of each dimension of the data block operated by the memory access instruction; an address range of data taking bytes as a unit in a data block operated by the access instruction; and the storage resource name of the data block operated by the access instruction.
Clause 19, the instruction control method of clause 18, wherein the granularity of the address range is determined according to operational performance requirements and/or storage area overhead.
Clause 20, the instruction control method of any of clauses 17-19, wherein the result of the comparing is characterized using a bitmap, each bit in the bitmap being used to indicate whether there is address overlap of the current access instruction with an address range of a historical access instruction.
Clause 21, the instruction control method according to any of clauses 16-20, wherein the controlling the transmission of the cached memory instruction comprises: in response to the registration information indicating that address dependencies between current and historical memory instructions have been resolved and that the execution resources are free, the current memory instruction is requested to be launched.
Clause 22, the instruction control method according to any of clauses 16-21, wherein the caching the memory access instruction to be transmitted comprises: and a plurality of instruction cache queues are utilized to respectively cache access instructions of different types, no dependency exists among instruction flows corresponding to the different instruction cache queues, and the access instructions in the same instruction cache queue are sequentially transmitted.
Clause 23, the instruction control method of clause 22, wherein the method further comprises: temporarily caching subsequent unregistered instructions when a registration space is full using a temporary cache queue, and outputting the cached unregistered instructions to register in the registration space when the registration space is free.
Clause 24, the instruction control method of clause 23, wherein the method further comprises: caching information of all access instructions currently entering a cache space by utilizing a shared queue; wherein the instruction cache queue and the temporary cache queue only record the index of access instructions cached in their queues in the shared queue.
Clause 25, the instruction control method of any of clauses 16-24, wherein the method further comprises: the shared table is utilized to sequentially register the registration information of all the memory access instructions to be transmitted entering the registration space; and registering access instructions entering each private commit queue in sequence by utilizing the plurality of private commit queues.
Clause 26, the instruction control method of clause 25, wherein the method further comprises: responsive to completion of a registered instruction, releasing entries in the shared table and the private commit queue for the registered instruction.
Clause 27, the instruction control method of any of clauses 16-26, wherein the method further comprises: marking the corresponding resource as busy in response to a memory access instruction being transmitted to a downstream resource; and/or marking the corresponding resource as idle in response to its use being completed; wherein the execution resources include one or more decode units and the storage resources include on-chip and/or off-chip memory circuits.
The embodiments of the present disclosure have been described in detail above, and specific examples have been used herein to illustrate the principles and implementations of the present disclosure; the above descriptions of the embodiments are only intended to help understand the method of the present disclosure and its core ideas. Meanwhile, for those of ordinary skill in the art, there will be variations in the specific implementation and the scope of application based on the ideas of the present disclosure. In summary, the contents of this specification should not be construed as limiting the present disclosure.

Claims (27)

1. An instruction control apparatus comprising:
the instruction cache unit is used for caching access instructions to be transmitted;
the instruction registration unit is used for registering access instructions to be transmitted, wherein the registration information comprises address dependency information between the current access instructions and the historical access instructions;
The resource recording unit is used for recording the states of the execution resource and the storage resource of the access instruction;
and the emission control unit is used for controlling the emission of the instruction in the instruction cache unit based on the information of the instruction registration unit and/or the resource recording unit.
2. The instruction control apparatus according to claim 1, wherein the instruction registration unit is further configured to:
calculating the address range of the current access instruction;
comparing the address range of the current access instruction with the address range of the historical access instruction; and
and recording the result of the comparison in the registration information as the address dependency information.
3. The instruction control apparatus according to claim 2, wherein the granularity of the address range is selected from any one of:
the minimum address value and the maximum address value of the data block operated by the access instruction;
the minimum address value and the maximum address value of each dimension of the data block operated by the memory access instruction;
an address range of data taking bytes as a unit in a data block operated by the access instruction; and
and the memory resource name of the data block operated by the memory access instruction.
4. An instruction control apparatus according to claim 3, wherein the granularity of the address range is determined in dependence on operational performance requirements and/or storage area overhead.
5. The instruction control apparatus of any of claims 2-4, wherein the result of the comparison is characterized using a bitmap, each bit in the bitmap being used to indicate whether there is address overlap of the current memory instruction with an address range of a history memory instruction.
6. The instruction control device according to any one of claims 1 to 5, wherein the emission control unit is further configured to:
in response to the registration information indicating that address dependencies between current and historical memory instructions have been resolved and that the execution resources are free, the current memory instruction is requested to be launched.
7. The instruction control apparatus according to any one of claims 1 to 6, wherein the instruction cache unit includes:
and the instruction cache queues are used for respectively caching access instructions of different types, the instruction flows corresponding to the different instruction cache queues have no dependency, and the access instructions in the same instruction cache queue are sequentially transmitted.
8. The instruction control apparatus according to claim 7, wherein the instruction cache unit further comprises:
A temporary buffer queue for temporarily buffering a subsequent unregistered instruction when the instruction registration unit is full, and outputting the buffered unregistered instruction to be registered in the instruction registration unit when the instruction registration unit has a space.
9. The instruction control apparatus according to claim 8, wherein the instruction cache unit further comprises:
the shared queue is used for caching the information of all access instructions which enter the instruction caching unit currently;
wherein the instruction cache queue and the temporary cache queue only record the index of access instructions cached in their queues in the shared queue.
10. The instruction control apparatus according to any one of claims 1 to 9, wherein the instruction registration unit includes:
a shared table for sequentially registering the registration information of all the memory access instructions to be transmitted entering the instruction registration unit; and
and the private commit queues are used for registering access instructions entering each private commit queue in sequence.
11. The instruction control apparatus according to claim 10, wherein the instruction registration unit is further configured to: responsive to completion of a registered instruction, releasing entries in the shared table and the private commit queue for the registered instruction.
12. The instruction control apparatus according to any one of claims 1 to 11, wherein the resource recording unit is further configured to:
responsive to the memory access instruction being transmitted to a resource downstream, marking the corresponding resource as busy; and/or
marking the corresponding resource as idle in response to its use being completed;
wherein the execution resources include one or more memory access instruction decode units and the memory resources include on-chip and/or off-chip memory circuits.
13. A processor comprising the instruction control device according to claim 12.
14. A chip comprising the processor of claim 13.
15. A board card comprising the chip of claim 14.
16. An instruction control method, comprising:
registering access instructions to be transmitted, wherein the registration information comprises address dependency information between the current access instructions and the historical access instructions;
caching access instructions to be transmitted;
recording the resource states of the execution resource and the storage resource of the access instruction; and
and controlling the transmission of the cached access instruction based on the registration information and/or the resource state.
17. The instruction control method according to claim 16, wherein the registering the memory access instruction to be transmitted includes:
Calculating the address range of the current access instruction;
comparing the address range of the current access instruction with the address range of the historical access instruction; and
and recording the result of the comparison in the registration information as the address dependency information.
18. The instruction control method according to claim 17, wherein a granularity of the address range is selected from any one of:
the minimum address value and the maximum address value of the data block operated by the access instruction;
the minimum address value and the maximum address value of each dimension of the data block operated by the memory access instruction;
an address range of data taking bytes as a unit in a data block operated by the access instruction; and
and the memory resource name of the data block operated by the memory access instruction.
19. The instruction control method of claim 18, wherein granularity of the address range is determined according to operational performance requirements and/or storage area overhead.
20. The instruction control method of any of claims 17-19, wherein the result of the comparison is characterized using a bitmap, each bit in the bitmap being used to indicate whether there is address overlap of the current memory instruction with an address range of a history memory instruction.
21. The instruction control method according to any one of claims 16 to 20, wherein the controlling the transmission of the cached memory instruction includes:
in response to the registration information indicating that address dependencies between current and historical memory instructions have been resolved and that the execution resources are free, the current memory instruction is requested to be launched.
22. The instruction control method according to any one of claims 16 to 21, wherein the caching of the memory instruction to be issued includes:
and a plurality of instruction cache queues are utilized to respectively cache access instructions of different types, no dependency exists among instruction flows corresponding to the different instruction cache queues, and the access instructions in the same instruction cache queue are sequentially transmitted.
23. The instruction control method according to claim 22, wherein the method further comprises:
temporarily caching subsequent unregistered instructions when a registration space is full using a temporary cache queue, and outputting the cached unregistered instructions to register in the registration space when the registration space is free.
24. The instruction control method according to claim 23, wherein the method further comprises:
caching information of all access instructions currently entering a cache space by utilizing a shared queue;
Wherein the instruction cache queue and the temporary cache queue only record the index of access instructions cached in their queues in the shared queue.
25. The instruction control method according to any one of claims 16 to 24, wherein the method further comprises:
the shared table is utilized to sequentially register the registration information of all the memory access instructions to be transmitted entering the registration space; and
and registering access instructions entering each private commit queue in sequence by utilizing a plurality of private commit queues.
26. The instruction control method according to claim 25, wherein the method further comprises:
responsive to completion of a registered instruction, releasing entries in the shared table and the private commit queue for the registered instruction.
27. The instruction control method according to any one of claims 16 to 26, wherein the method further comprises:
responsive to the memory access instruction being transmitted to a resource downstream, marking the corresponding resource as busy; and/or
marking the corresponding resource as idle in response to its use being completed;
wherein the execution resources include one or more decode units and the memory resources include on-chip and/or off-chip memory circuits.