
CN114816529B - Apparatus and method for configuring cooperative thread bundles in a vector computing system - Google Patents


Info

Publication number
CN114816529B
CN114816529B
Authority
CN
China
Prior art keywords
warp
thread
warps
instruction
thread bundle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210479765.2A
Other languages
Chinese (zh)
Other versions
CN114816529A (en)
Inventor
Name withheld at inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bi Ren Technology Co ltd
Shanghai Biren Intelligent Technology Co Ltd
Original Assignee
Shanghai Bi Ren Technology Co ltd
Shanghai Biren Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bi Ren Technology Co ltd, Shanghai Biren Intelligent Technology Co Ltd filed Critical Shanghai Bi Ren Technology Co ltd
Priority to CN202210479765.2A
Publication of CN114816529A
Application granted
Publication of CN114816529B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30003 Arrangements for executing specific machine instructions
    • G06F 9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F 9/3001 Arithmetic instructions
    • G06F 9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G06F 9/30098 Register arrangements
    • G06F 9/3012 Organisation of register space, e.g. banked or distributed register file
    • G06F 9/30123 Organisation of register space, e.g. banked or distributed register file according to context, e.g. thread buffers
    • G06F 9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3836 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F 9/3851 Instruction issuing from multiple instruction streams, e.g. multistreaming
    • G06F 9/3885 Concurrent instruction execution using a plurality of independent parallel functional units
    • G06F 9/3887 Concurrent instruction execution using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
    • G06F 9/3888 Concurrent instruction execution using a plurality of independent parallel functional units controlled by a single instruction for multiple threads [SIMT] in parallel

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Advance Control (AREA)

Abstract

The invention relates to an apparatus and method for configuring cooperative thread bundles in a vector computing system. The apparatus includes a general purpose register, an arithmetic logic unit, a thread bundle instruction scheduler, and a plurality of thread bundle resource registers. According to thread bundle allocation instructions in a program core, the thread bundle instruction scheduler lets each of a plurality of thread bundles contain a portion of relatively independent instructions in the program core; according to the configuration of software at execution time, it lets each thread bundle access all data in the general purpose register, or specified portions of it, through the arithmetic logic unit; and it completes the operations of each thread bundle through the arithmetic logic unit. By enabling software to dynamically adjust and allocate the general purpose register to different thread bundles as described above, the invention is more broadly applicable to different applications, such as big data and artificial intelligence computation.

Description

Apparatus and method for configuring cooperative thread bundles in a vector computing system
This application is a divisional application of the Chinese patent application filed on October 21, 2020, with application number 202011131448.9, entitled "Apparatus and method for configuring cooperative thread bundles in a vector computing system".
Technical Field
The present invention relates to vector computing devices, and more particularly, to a device and method for configuring a cooperative thread bundle in a vector computing system.
Background
A vector computer is a computer equipped with specialized vector instructions to increase the speed of vector processing. A vector computer can process the data computation of multiple thread bundles (Warps) simultaneously, and is therefore much faster than a scalar computer when processing thread bundle data. However, multiple thread bundles may conflict when accessing the General-Purpose Register File (GPR File), and the present invention therefore proposes an apparatus and method for configuring cooperative thread bundles in a vector computing system.
Disclosure of Invention
In view of this, alleviating or eliminating the above-mentioned drawbacks of the related art is a problem to be solved.
An embodiment of the invention relates to an apparatus for configuring cooperative thread bundles in a vector computing system, which includes a general purpose register, an arithmetic logic unit, a thread bundle instruction scheduler, and a plurality of thread bundle resource registers. According to thread bundle allocation instructions in a program core, the thread bundle instruction scheduler lets each of a plurality of thread bundles contain a portion of relatively independent instructions in the program core; according to the configuration of software at execution time, it lets each thread bundle access all data in the general purpose register, or specified portions of it, through the arithmetic logic unit; and it completes the operations of each thread bundle through the arithmetic logic unit. Each thread bundle resource register is associated with one thread bundle and is used to map that thread bundle's data accesses to a specified portion of the general purpose register according to the contents of the corresponding thread bundle resource register, where the specified portions of the general purpose register mapped by different thread bundles do not overlap.
Embodiments of the present invention also relate to a method of configuring cooperative thread bundles in a vector computing system, comprising: letting each of a plurality of thread bundles contain a portion of relatively independent instructions in a program core according to thread bundle allocation instructions in the program core; letting each thread bundle, at execution time and according to the configuration of software, access all data in a general purpose register, or specified portions of it, through an arithmetic logic unit; and completing the operations of each thread bundle through the arithmetic logic unit. The method further comprises mapping the data accesses of each thread bundle to a specified portion of the general purpose register according to the contents of a plurality of respective thread bundle resource registers, wherein the specified portions of the general purpose register mapped by different thread bundles do not overlap.
One of the advantages of the above embodiments is that, by enabling software to dynamically adjust and allocate the general purpose register to different thread bundles as described above, they are more broadly applicable to different applications, such as big data and artificial intelligence computation.
Another advantage of the above embodiments is that, by dynamically assigning multiple sections of relatively independent instructions in the program core to different thread bundles, the thread bundles are prevented from interfering with one another, improving pipeline utilization.
Other advantages of the present invention will be explained in more detail in connection with the following description and accompanying drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application.
FIG. 1 is a block diagram of a vector operation system according to an embodiment of the present invention.
Fig. 2 is a block diagram of a streaming multiprocessor according to an embodiment of the present invention.
FIG. 3 is a split schematic diagram of general registers of some embodiments.
FIG. 4 is a diagram illustrating dynamic partitioning of a general purpose register with thread bundle resource registers according to an embodiment of the present invention.
FIG. 5 is a flow chart of a cooperative thread bundle applied to execute tasks in parallel, in accordance with an embodiment of the present invention.
FIG. 6 is a schematic diagram of a collaboration thread bundle for a producer and consumer in accordance with an embodiment of the present invention.
FIG. 7 is a flow chart of a cooperative thread bundle for performing producer-consumer tasks in accordance with an embodiment of the present invention.
Wherein the symbols in the drawings are briefly described as follows:
10: electronic device; 100: stream multiprocessor; 210: arithmetic logic unit; 220: thread bundle instruction scheduler; 230: general purpose register; 240: instruction cache; 250: barrier register; 260, 260#0 to 260#7: thread bundle resource registers; 300#0 to 300#7: general purpose register memory blocks; Base#0 to Base#7: base addresses; S510 to S540: method steps; 610: consumer thread bundle; 621, 663: barrier instructions; 623, 661: series of instructions; 650: producer thread bundle; S710 to S770: method steps.
Detailed Description
Embodiments of the present invention will be described below with reference to the accompanying drawings. In the drawings, like reference numerals designate identical or similar components or process flows.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, values, method steps, operation processes, components, and/or groups, but do not preclude the addition of further features, values, method steps, operation processes, components, groups, or groups of the above.
In the present invention, terms such as "first," "second," "third," and the like are used for modifying elements of the claims, and are not used for describing a priority order, a precedence order, or a temporal order in which elements of one method step are performed or are used for distinguishing between elements having the same name.
It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may be present. Conversely, when an element is described as being "directly connected" or "directly coupled" to another element, there are no intervening elements present. Other words used to describe the relationship between components may also be interpreted in a similar fashion, such as "between" versus "directly between," or "adjacent" versus "directly adjacent," etc.
Reference is made to fig. 1. The electronic device 10 may be implemented in a mainframe, workstation, personal computer, notebook computer (Laptop PC), tablet computer, mobile phone, digital camera, digital video camera, etc. The electronic device 10 may set up a Streaming Multiprocessor Cluster (SMC) in a vector computing system, including a plurality of Streaming Multiprocessors (SM) 100, and instruction execution between different stream multiprocessors 100 may be synchronized using signals. The stream multiprocessors 100 are programmed to perform a variety of application tasks, including but not limited to linear and nonlinear data transformation, database operations, big data operations, artificial intelligence computation, encoding and decoding, modeling operations on audio and video data, image rendering operations, etc. Each stream multiprocessor 100 may simultaneously execute multiple thread bundles (Warps), each of which consists of a group of threads (Group of Threads); a thread bundle is the smallest unit of operation on the hardware and has its own lifecycle. Thread bundles may be associated with Single Instruction Multiple Data (SIMD) instructions, Single Instruction Multiple Thread (SIMT) techniques, and the like. Execution between different thread bundles may be independent or sequential. A thread may represent a task associated with one or more instructions. For example, each stream multiprocessor 100 may concurrently execute 8 thread bundles, each thread bundle including 32 threads. Although fig. 1 depicts 4 stream multiprocessors 100, those skilled in the art may arrange more or fewer stream multiprocessors in a vector computing system according to different needs, and the invention is not so limited.
Reference is made to fig. 2. Each stream multiprocessor 100 includes an instruction cache (Instruction Cache) 240 for storing a plurality of instructions of a program core (Kernel). Each stream multiprocessor 100 further comprises a thread bundle instruction scheduler (Warp Instruction Scheduler) 220, which fetches a series of instructions for each thread bundle and stores them in the instruction cache 240, and then fetches each thread bundle's next instruction from the instruction cache 240 according to that thread bundle's program counter. Each thread bundle has a separate Program Counter (PC) register for recording the location (i.e., the instruction address) of the instruction currently being executed. Each time an instruction is fetched from the instruction cache for a thread bundle, the corresponding program counter is incremented by one. The thread bundle instruction scheduler 220 sends instructions, defined in the Instruction Set Architecture (ISA) of the particular computing system, to the Arithmetic Logic Unit (ALU) 210 for execution at appropriate points in time. The arithmetic logic unit 210 may perform a wide variety of operations, such as integer and floating point addition and multiplication, comparison, Boolean operations, bit shifting, and algebraic functions (e.g., planar interpolation, trigonometric, exponential, and logarithmic functions). During execution, the arithmetic logic unit 210 may read data from a specified location (also referred to as a source address) of the General-Purpose Registers (GPR) 230 and write execution results back to a specified location (also referred to as a destination address) of the general purpose register 230.
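As an illustrative sketch, not the patented hardware, the per-warp program counter bookkeeping described above can be modeled as follows (all names are hypothetical):

```python
# Minimal model of per-warp program counters: each warp keeps its own PC,
# and fetching an instruction for a warp advances that warp's PC
# independently of the other warps.

class WarpScheduler:
    def __init__(self, num_warps, instruction_cache):
        self.pc = [0] * num_warps          # one program counter per warp
        self.icache = instruction_cache    # list of instructions per warp

    def fetch(self, warp_id):
        """Fetch the next instruction for one warp and advance its PC."""
        instr = self.icache[warp_id][self.pc[warp_id]]
        self.pc[warp_id] += 1
        return instr

sched = WarpScheduler(2, [["mul", "add"], ["ld", "st"]])
assert sched.fetch(0) == "mul"
assert sched.fetch(1) == "ld"     # warp 1's PC is independent of warp 0's
assert sched.fetch(0) == "add"
```

Each fetch touches only one warp's counter, which is why the warps' instruction streams can advance at different rates.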
Each stream multiprocessor 100 also includes a barrier register (Barriers Register) 250, which allows software to synchronize execution among different thread bundles, and respective thread bundle resource registers (Resource-per-Warp Registers) 260, which can be used to dynamically configure the extent of the general purpose register 230 that each thread bundle can use when executing. Although fig. 2 only lists components 210 through 260, this is for purposes of briefly illustrating the features of the present invention, and those skilled in the art will appreciate that each stream multiprocessor 100 also includes many more components.
In some embodiments, the general purpose register 230 may be physically or logically divided into blocks (Blocks), with each block of memory space allocated for access by only one thread bundle. The memory spaces of different blocks do not overlap, avoiding access conflicts between different thread bundles. Referring to FIG. 3, for example, when one stream multiprocessor 100 can process data of eight thread bundles and the general purpose register 230 contains 256 kilobytes (Kilobyte, KB) of memory, the memory of the general purpose register 230 may be divided into eight blocks 300#0 to 300#7, each containing a non-overlapping 32KB of memory and allocated to a specified thread bundle. However, since vector computing systems are often applied to big data and artificial intelligence computation, the amount of data processed is huge, and a fixed partition may give one thread bundle too little space to satisfy the computing requirements of a large amount of data. For application to big data and artificial intelligence computation, these embodiments may be modified to have each stream multiprocessor 100 process data for only one thread bundle, and give the entire memory space of the general purpose register 230 to that thread bundle. However, when two consecutive instructions have a data dependency, that is, the input data of the second instruction is the output result of the first instruction's execution, the operation of the arithmetic logic unit 210 is not efficient. In detail, the second instruction must wait for the execution result of the first instruction to be ready in the general purpose register 230 before starting execution.
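The fixed split of FIG. 3 can be sketched as follows, using the sizes from the example above; the code itself is illustrative, not part of the patent:

```python
# Hypothetical sketch of the fixed partitioning in FIG. 3: a 256 KB general
# purpose register file statically split into eight non-overlapping 32 KB
# blocks, one per warp.

GPR_SIZE_KB = 256
NUM_WARPS = 8
BLOCK_KB = GPR_SIZE_KB // NUM_WARPS   # 32 KB per warp

def block_range(warp_id):
    """Return the [start, end) byte range owned by one warp."""
    start = warp_id * BLOCK_KB * 1024
    return start, start + BLOCK_KB * 1024

# Blocks do not overlap: warp i ends exactly where warp i+1 begins.
for i in range(NUM_WARPS - 1):
    assert block_range(i)[1] == block_range(i + 1)[0]
assert block_range(7)[1] == 256 * 1024
```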
For example, assuming that each instruction requires 8 clock cycles to pass through the pipeline from initiation to writing its result to the general purpose register 230, the second instruction must wait for the execution result of the first instruction and can begin execution only from the 9th clock cycle. At this point, the instruction execution latency (Instruction Execution Latency) is 8 clock cycles, resulting in very low pipeline utilization. Furthermore, because the stream multiprocessor 100 processes only one thread bundle, instructions that could otherwise be executed in parallel must be arranged for sequential execution, which is inefficient.
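The arithmetic behind this stall can be shown with a small sketch; it is an illustration of the timing argument above, not text from the patent:

```python
# Dependency stall: with an 8-cycle pipeline, a dependent instruction cannot
# issue until the producing instruction's result reaches the general purpose
# register.

PIPELINE_CYCLES = 8

def earliest_issue(dep_ready_cycle):
    """A dependent instruction issues the cycle after its input is ready."""
    return dep_ready_cycle + 1

first_result_ready = PIPELINE_CYCLES             # result written at cycle 8
assert earliest_issue(first_result_ready) == 9   # second starts at cycle 9
```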
To solve the above-described problems, in one aspect, the thread bundle instruction scheduler 220, according to the configuration of software at execution time, lets each of the plurality of thread bundles access all or specified portions of the data in the general purpose register 230 through the arithmetic logic unit 210, and completes the operations of each thread bundle through the arithmetic logic unit 210. By enabling software to dynamically adjust and allocate the general purpose register to different thread bundles as described above, the invention is more broadly adaptable to different applications, such as big data and artificial intelligence computation.
In another aspect, embodiments of the present invention provide an environment that enables software to determine the instruction segments that each thread bundle contains. In some embodiments, a program core may divide its instructions into multiple segments, each independent of the others and each executed by one thread bundle. Table 1 is an example of pseudo code for a program core:
TABLE 1
Assuming that each stream multiprocessor 100 runs a maximum of eight thread bundles and that each thread bundle has a unique identifier, when the thread bundle instruction scheduler 220 fetches the instructions of the program core shown in Table 1, it can check the identifier of a particular thread bundle, jump to the instruction segment associated with this thread bundle and store it in the instruction cache 240, and then fetch instructions from the instruction cache 240 according to the corresponding program counter values and send them to the arithmetic logic unit 210 to complete the particular computation. In this case, each thread bundle may perform its task independently, and all thread bundles may run simultaneously, keeping the pipeline in the arithmetic logic unit 210 as busy as possible to avoid bubbles (Bubbles). The instructions of each segment in the same program core may be referred to as relatively independent instructions. While the example of Table 1 uses conditional-decision instructions to achieve segmentation of instructions in a program core, one skilled in the art may use other instructions that achieve the same or similar results. In a program core, the instructions used to divide instructions among multiple thread bundles may also be referred to as thread bundle allocation instructions. In general, the thread bundle instruction scheduler 220 may, according to the thread bundle allocation instructions in the program core, cause each of the plurality of thread bundles to contain a portion of the relatively independent instructions in the program core, so that the arithmetic logic unit 210 executes the plurality of thread bundles independently and in parallel.
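The pseudocode of Table 1 is not reproduced above; as a hedged illustration, the warp-identifier branching it describes might look like the following sketch, in which all segment names are hypothetical:

```python
# Hypothetical sketch of warp allocation instructions: the kernel branches on
# the warp's unique identifier so that each warp executes its own relatively
# independent instruction segment.

def kernel(warp_id):
    if warp_id == 0:
        return "segment_0"     # e.g. one independent block of work
    elif warp_id == 1:
        return "segment_1"     # another independent block
    else:
        return f"segment_{warp_id}"

# All warps can run concurrently, each on its own segment.
assert [kernel(w) for w in range(3)] == ["segment_0", "segment_1", "segment_2"]
```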
The barrier register 250 may store information for synchronizing the execution of different thread bundles, including the number of thread bundles whose completion must be waited for, and the number of thread bundles currently waiting to continue execution. To coordinate the execution of different thread bundles, software may set the contents of the barrier register 250 to record the number of thread bundles whose completion must be waited for. Each instruction segment in the program core may be provided with a barrier instruction (Barrier Instruction) at appropriate points, depending on system requirements. When the thread bundle instruction scheduler 220 fetches a barrier instruction for a thread bundle, it increments by 1 the number of waiting thread bundles recorded in the barrier register 250 and puts that thread bundle into a waiting state. Next, the thread bundle instruction scheduler 220 examines the contents of the barrier register 250 to determine whether the number of thread bundles currently waiting to continue execution is equal to or greater than the number of thread bundles whose completion must be waited for. If so, the thread bundle instruction scheduler 220 wakes up all waiting thread bundles, allowing them to continue execution. Otherwise, the thread bundle instruction scheduler 220 fetches the instructions of the next thread bundle from the instruction cache.
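A simplified software model of this barrier logic, offered as an assumption about the behavior described rather than the actual hardware design, might be:

```python
# Toy model of the barrier register: software sets how many warps must
# arrive, and arriving warps wait until that count is reached, at which
# point all waiting warps are released.

class BarrierRegister:
    def __init__(self, warps_to_wait):
        self.warps_to_wait = warps_to_wait   # set by software
        self.arrived = 0                     # warps currently waiting
        self.released = False

    def arrive(self):
        """A warp hits its barrier instruction; returns True if released."""
        self.arrived += 1
        if self.arrived >= self.warps_to_wait:
            self.released = True             # wake all waiting warps
        return self.released

bar = BarrierRegister(warps_to_wait=3)
assert bar.arrive() is False   # warp 0 waits
assert bar.arrive() is False   # warp 1 waits
assert bar.arrive() is True    # warp 2 arrives; all three continue
```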
Furthermore, the partitioning of the embodiment depicted in FIG. 3 is pre-configured in the stream multiprocessor 100 and cannot be modified by software. However, the memory space required by different thread bundles may not be uniform: for some thread bundles the pre-partitioned memory space may be more than needed, while for others it may be insufficient.
In another aspect, although one stream multiprocessor 100 is capable of executing multiple thread bundles and all thread bundles execute the same program core, embodiments of the present invention do not pre-partition the general purpose register 230 among different thread bundles. In particular, to accommodate different applications more broadly, the stream multiprocessor 100 does not fix a partition of the general purpose register 230 into multiple blocks of memory for multiple thread bundles, but rather provides an environment that enables software to dynamically adjust and allocate the general purpose register 230 to different thread bundles, so that software can have each thread bundle use all or part of the general purpose register 230 depending on application needs.
In other embodiments, each stream multiprocessor 100 may include respective thread bundle resource registers 260 for storing base address information for each thread bundle, each base address pointing to a particular location in the general purpose register 230. To allow different thread bundles to access non-overlapping memory spaces in the general purpose register 230, software may dynamically change the contents of the thread bundle resource registers 260 to set the base address of each thread bundle. For example, referring to FIG. 4, software may divide the general purpose register 230 into eight blocks for eight thread bundles, with block 0, associated with the 0th thread bundle, covering the address range Base#0 through Base#1-1, block 1, associated with the 1st thread bundle, covering the address range Base#1 through Base#2-1, and so on. The software may set the contents of the respective thread bundle resource registers 260#0 to 260#7 before or at the start of program core execution to point to the base address associated with each thread bundle in the general purpose register 230. After the thread bundle instruction scheduler 220 fetches an instruction from the instruction cache 240 for the ith thread bundle, it may adjust the source and destination addresses of the instruction according to the contents of thread bundle resource register 260#i, mapping them to the memory space dynamically allocated to the ith thread bundle in the general purpose register 230. For example, the original instruction is:
Dest_addr=Instr_i(Src_addr0,Src_addr1)
where Instr_i represents the OpCode (operation code) of the instruction assigned to the ith thread bundle, Src_addr0 represents the 0th source address, Src_addr1 represents the 1st source address, and Dest_addr represents the destination address.
The thread bundle instruction scheduler 220 modifies the instructions as described above to become:
Base#i+Dest_addr=Instr_i(Base#i+Src_addr0,Base#i+Src_addr1)
where Base#i represents the base address recorded in the thread bundle resource register 260#i. That is, the thread bundle instruction scheduler 220 adjusts the source and destination addresses of each instruction according to the contents of the respective thread bundle resource registers 260, so that the specified portions of the general purpose register mapped by different thread bundles do not overlap.
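The address adjustment above can be sketched as follows; the base values are illustrative assumptions, not taken from the patent:

```python
# Sketch of the scheduler's address rewrite: the per-warp base recorded in
# thread bundle resource register 260#i is added to every source and
# destination address, so different warps land in disjoint GPR regions.

base = {0: 0x0000, 1: 0x8000}    # illustrative contents of 260#0 and 260#1

def remap(warp_id, dest_addr, src_addr0, src_addr1):
    """Apply Base#i + addr to the destination and both source addresses."""
    b = base[warp_id]
    return b + dest_addr, b + src_addr0, b + src_addr1

# The same kernel-relative addresses map to non-overlapping regions:
assert remap(0, 0x10, 0x20, 0x30) == (0x0010, 0x0020, 0x0030)
assert remap(1, 0x10, 0x20, 0x30) == (0x8010, 0x8020, 0x8030)
```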
In still other embodiments, the two techniques may be combined: one program core divides its instructions into multiple mutually independent sections, each section executes with one thread bundle, and in addition the software sets the contents of the respective thread bundle resource registers 260#0 through 260#7 before or at the start of program-core execution so that they point to the base address associated with each thread bundle in the general purpose register 230.
The invention is applicable to cooperative thread bundles (Cooperative Warps) that perform tasks in parallel; refer to the flowchart example shown in FIG. 5.
In step S510, the thread bundle instruction scheduler 220 starts fetching instructions for each thread bundle and stores them in the instruction cache 240.
Steps S520 through S540 form a loop. The thread bundle instruction scheduler 220 may use a scheduling method (e.g., a round-robin scheduling algorithm) to obtain the designated instructions one by one from the instruction cache 240 according to each thread bundle's program counter, and send them to the arithmetic logic unit 210 for execution. The thread bundle instruction scheduler 220 may sequentially fetch from the instruction cache 240 the instruction indicated by the program counter of thread bundle 0, the instruction indicated by the program counter of thread bundle 1, the instruction indicated by the program counter of thread bundle 2, and so on.
In step S520, the thread bundle instruction scheduler 220 retrieves an instruction for the 0th or the next thread bundle from the instruction cache 240.
In step S530, the thread bundle instruction scheduler 220 sends the fetched instruction to the arithmetic logic unit 210.
In step S540, the arithmetic logic unit 210 performs a specified operation according to the input instruction. In a pipeline executing instructions, arithmetic logic unit 210 fetches data from a source address of general purpose register 230, performs specified operations on the fetched data, and stores the results of the operations to a destination address in general purpose register 230.
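The loop of steps S520 through S540 can be sketched as follows. This is an illustrative model under assumed details (per-warp instruction lists stand in for the instruction cache, and one instruction issues per cycle), not the actual scheduler:

```python
# Sketch of a warp instruction scheduler that keeps an independent
# program counter per thread bundle and issues instructions in
# round-robin order across the warps.
def round_robin_issue(instruction_streams, cycles):
    """instruction_streams: one instruction list per warp.
    Returns the issue order as (warp_id, instruction) pairs."""
    num_warps = len(instruction_streams)
    pcs = [0] * num_warps              # one program counter per warp
    issued = []
    warp = 0
    for _ in range(cycles):
        if pcs[warp] < len(instruction_streams[warp]):
            issued.append((warp, instruction_streams[warp][pcs[warp]]))
            pcs[warp] += 1             # advance only this warp's counter
        warp = (warp + 1) % num_warps  # move on to the next warp
    return issued

# Two warps alternate: warp 0's instruction, warp 1's, warp 0's, ...
order = round_robin_issue([["a0", "a1"], ["b0", "b1"]], 4)
assert order == [(0, "a0"), (1, "b0"), (0, "a1"), (1, "b1")]
```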
To avoid conflicts when different thread bundles access the general purpose registers 230, in some embodiments of step S510 the thread bundle instruction scheduler 220 may, in accordance with thread bundle allocation instructions in the program core (e.g., the example shown in Table 1), make each thread bundle responsible for processing multiple instructions (also referred to as relatively independent instructions) of a designated segment in the same program core, where the segments are arranged to be independent of each other and may be executed in parallel.
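A minimal illustration of this segment assignment, assuming a toy program core with invented segment names and register operands (the patent's Table 1 is not reproduced here):

```python
# Hypothetical program core divided into mutually independent
# instruction segments; a thread bundle allocation step assigns one
# segment to each thread bundle so the segments can run in parallel.
program_core = [
    ("seg0", ["mul r0, r1, r2", "add r3, r0, r1"]),  # handled by warp 0
    ("seg1", ["mul r4, r5, r6", "add r7, r4, r5"]),  # handled by warp 1
]

def allocate_segments(core):
    """Map thread bundle id -> the instruction list of its segment."""
    return {warp_id: instrs for warp_id, (_, instrs) in enumerate(core)}

allocation = allocate_segments(program_core)
assert allocation[0] == ["mul r0, r1, r2", "add r3, r0, r1"]
assert allocation[1][0] == "mul r4, r5, r6"
```

Because the two segments name disjoint registers, neither warp depends on the other's results, which is what lets them execute in parallel without synchronization.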
In other embodiments, before the instruction is provided to the arithmetic logic unit 210 (prior to step S530), the thread bundle instruction scheduler 220 may adjust the source and destination addresses in the instruction according to the contents of the corresponding thread bundle resource register 260 so that they map to the memory space dynamically allocated to that thread bundle in the general purpose register 230.
In still other embodiments, in addition to making each thread bundle responsible for processing multiple instructions of a designated segment in the same program core according to the associated instruction in the program core (as illustrated in the example of Table 1) at step S510, the thread bundle instruction scheduler 220 may also adjust the source and destination addresses in the instruction according to the contents of the respective thread bundle resource registers 260 before the instruction is sent to the arithmetic logic unit 210 (before step S530).
The present invention is also applicable to cooperative thread bundles that perform producer-consumer tasks. Referring to FIG. 6, assume that thread bundle 610 acts as the consumer of data and thread bundle 650 acts as the producer of data. In other words, the execution of a portion of the instructions in thread bundle 610 requires referencing the execution results of a portion of the instructions in thread bundle 650. In some embodiments, software may, at execution time, configure the contents of the thread bundle resource registers corresponding to thread bundles 610 and 650 so that, when executing the instructions of thread bundles 610 and 650, the arithmetic logic unit 210 accesses an overlapping block in the general purpose registers 230.
For the details of producer-consumer task execution, refer to the flowchart example shown in FIG. 7.
In step S710, the thread bundle instruction scheduler 220 starts fetching instructions for each thread bundle and stores them in the instruction cache 240. In accordance with the relevant instructions in the program core (e.g., the example shown in Table 1), the thread bundle instruction scheduler 220 may make each thread bundle responsible for processing multiple instructions of a designated segment in the same program core, where the segments are arranged to form a producer-consumer relationship.
In step S720, the thread bundle instruction scheduler 220 obtains the barrier instruction (Barrier Instruction) 621 of the consumer thread bundle 610 from the instruction cache 240, and accordingly brings the consumer thread bundle 610 into a wait state.
In step S730, the thread bundle instruction scheduler 220 fetches a series of instructions 661 of the producer thread bundle 650 from the instruction cache 240 and sequentially sends the fetched instructions to the arithmetic logic unit 210.
In step S740, the arithmetic logic unit 210 performs a specified operation according to the input instruction 661. In a pipeline executing instructions, arithmetic logic unit 210 fetches data from a source address of general purpose register 230, performs specified operations on the fetched data, and stores the results of the operations to a destination address in general purpose register 230.
In step S750, thread bundle instruction scheduler 220 obtains barrier instruction 663 of producer thread bundle 650 from instruction cache 240 and wakes up consumer thread bundle 610 accordingly. In some embodiments, the thread bundle instruction scheduler 220 may also put the producer thread bundle 650 into a wait state.
In step S760, the thread bundle instruction scheduler 220 fetches a series of instructions 623 of the consumer thread bundle 610 from the instruction cache 240 and sequentially sends the fetched instructions to the arithmetic logic unit 210.
In step S770, the arithmetic logic unit 210 performs a specified operation according to the input instruction 623. In a pipeline executing instructions, arithmetic logic unit 210 fetches data from a source address of general purpose register 230 (including data previously generated by producer thread bundle 650), performs specified operations on the fetched data, and stores the results of the operations to a destination address in general purpose register 230.
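The barrier-driven hand-off of FIG. 7 can be sketched with Python generators standing in for the two thread bundles. The shared-block size, the data value, and the wait-set mechanism are assumptions for illustration only, not the hardware's barrier implementation:

```python
# Minimal model of the producer-consumer flow: the consumer's barrier
# (S720) puts it into a waiting state, the producer writes into the
# overlapping general-purpose-register block (S730/S740), and the
# producer's barrier (S750) wakes the consumer (S760/S770).
gpr = [0] * 8                # block shared by producer and consumer warps
waiting = set()
trace = []

def consumer():
    waiting.add("consumer")              # S720: hit barrier, wait
    yield
    trace.append(("consumer reads", gpr[0]))  # S760/S770: use the data

def producer():
    gpr[0] = 42                          # S730/S740: produce into block
    waiting.discard("consumer")          # S750: barrier wakes consumer
    trace.append(("producer wrote", 42))
    yield

c, p = consumer(), producer()
next(c)                       # run consumer until it blocks on the barrier
assert "consumer" in waiting
next(p)                       # run the producer; it wakes the consumer
assert "consumer" not in waiting
for gen in (p, c):            # drain both; consumer now sees the data
    for _ in gen:
        pass
assert trace == [("producer wrote", 42), ("consumer reads", 42)]
```

The ordering matters: because the consumer parks at its barrier before the producer runs, the consumer's read is guaranteed to observe the value the producer stored in the overlapping block.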
It should be appreciated that the content of steps S730, S740, S760, and S770 is only briefly described for ease of understanding. During the execution of steps S730, S740, S760, and S770, the thread bundle instruction scheduler 220 may also obtain instructions of other thread bundles (i.e., any thread bundle that is neither thread bundle 610 nor thread bundle 650) from the instruction cache 240 and drive the arithmetic logic unit 210 to perform operations.
Although the components described above are included in FIGs. 1 and 2, the use of additional components to achieve better technical effects is not excluded, provided the spirit of the invention is not violated. In addition, although the flowcharts of FIGs. 5 and 7 are executed in the specified order, those skilled in the art may modify the order among these steps without departing from the spirit of the invention, so the invention is not limited to using only the order described above. Furthermore, one skilled in the art may integrate several steps into one step, or perform additional steps sequentially or in parallel, and the invention should not be limited thereby.
The above description covers only preferred embodiments of the present application and does not limit it; any person skilled in the art may make further modifications and variations without departing from the spirit and scope of the present application, and the scope of protection of the present application is defined by the appended claims.

Claims (11)

1. A device for configuring cooperative thread bundles in a vector computing system, comprising: a general purpose register; an arithmetic logic unit coupled to the general purpose register; a thread bundle instruction scheduler coupled to the arithmetic logic unit, wherein the thread bundle instruction scheduler, according to thread bundle allocation instructions in a program core, lets each of a plurality of thread bundles contain a portion of relatively independent instructions in the program core, lets each of the plurality of thread bundles access data of a designated portion of the general purpose register through the arithmetic logic unit according to the configuration of software at execution time, and executes the operations of each of the thread bundles independently and in parallel through the arithmetic logic unit; and a plurality of per-thread-bundle resource registers, wherein each per-thread-bundle resource register is associated with one of the thread bundles and is used to let each thread bundle, in combination with the content of the corresponding per-thread-bundle resource register, map its data accesses to a designated portion of the general purpose register, wherein the content of each per-thread-bundle resource register points to the base address in the general purpose register associated with each thread bundle, and the designated portions of the general purpose register mapped by different thread bundles do not overlap.

2. The device for configuring cooperative thread bundles in a vector computing system of claim 1, wherein the device does not pre-configure an association with a designated portion of the general purpose register for each thread bundle.

3. The device for configuring cooperative thread bundles in a vector computing system of claim 1, wherein the thread bundles comprise a first thread bundle and a second thread bundle; when the thread bundle instruction scheduler obtains a barrier instruction of the first thread bundle from an instruction cache, it puts the first thread bundle into a waiting state, and when the thread bundle instruction scheduler obtains a barrier instruction of the second thread bundle from the instruction cache, it wakes up the first thread bundle, wherein the first thread bundle and the second thread bundle are configured to be associated with an overlapping block in the general purpose register.

4. The device for configuring cooperative thread bundles in a vector computing system of claim 3, wherein the first thread bundle is a consumer thread bundle and the second thread bundle is a producer thread bundle.

5. The device for configuring cooperative thread bundles in a vector computing system of claim 1, wherein the thread bundles are independent of each other, and each thread bundle is configured to be associated with a non-overlapping block in the general purpose register.

6. The device for configuring cooperative thread bundles in a vector computing system of claim 1, wherein the thread bundle instruction scheduler maintains an independent program counter for each of the thread bundles.

7. A method for configuring cooperative thread bundles in a vector computing system, executed in a stream multiprocessor, comprising: according to thread bundle allocation instructions in a program core, letting each of a plurality of thread bundles contain a portion of relatively independent instructions in the program core; according to the configuration of software at execution time, letting each of the plurality of thread bundles access data of a designated portion of a general purpose register through an arithmetic logic unit; and executing the operations of each of the thread bundles independently and in parallel through the arithmetic logic unit, wherein the method further comprises: mapping the data accesses of each thread bundle to a designated portion of the general purpose register according to the contents of a plurality of per-thread-bundle resource registers, wherein the content of each per-thread-bundle resource register points to the base address in the general purpose register associated with each thread bundle, and the designated portions of the general purpose register mapped by different thread bundles do not overlap.

8. The method for configuring cooperative thread bundles in a vector computing system of claim 7, wherein the stream multiprocessor does not pre-configure an association with a designated portion of the general purpose register for each thread bundle.

9. The method for configuring cooperative thread bundles in a vector computing system of claim 7, wherein the thread bundles comprise a first thread bundle and a second thread bundle, the first thread bundle and the second thread bundle being configured to be associated with an overlapping block in the general purpose register, the method comprising: when a barrier instruction of the first thread bundle is obtained from an instruction cache, putting the first thread bundle into a waiting state; and when a barrier instruction of the first thread bundle is obtained from the instruction cache, waking up the second thread bundle.

10. The method for configuring cooperative thread bundles in a vector computing system of claim 9, wherein the first thread bundle is a consumer thread bundle and the second thread bundle is a producer thread bundle.

11. The method for configuring cooperative thread bundles in a vector computing system of claim 7, further comprising: maintaining an independent program counter for each of the thread bundles.
CN202210479765.2A 2020-10-21 2020-10-21 Apparatus and method for configuring cooperative thread bundles in a vector computing system Active CN114816529B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210479765.2A CN114816529B (en) 2020-10-21 2020-10-21 Apparatus and method for configuring cooperative thread bundles in a vector computing system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210479765.2A CN114816529B (en) 2020-10-21 2020-10-21 Apparatus and method for configuring cooperative thread bundles in a vector computing system
CN202011131448.9A CN112214243B (en) 2020-10-21 2020-10-21 Apparatus and method for configuring cooperative thread bundle in vector operation system

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN202011131448.9A Division CN112214243B (en) 2020-10-21 2020-10-21 Apparatus and method for configuring cooperative thread bundle in vector operation system

Publications (2)

Publication Number Publication Date
CN114816529A CN114816529A (en) 2022-07-29
CN114816529B true CN114816529B (en) 2025-07-18

Family

ID=74056291

Family Applications (3)

Application Number Title Priority Date Filing Date
CN202210479765.2A Active CN114816529B (en) 2020-10-21 2020-10-21 Apparatus and method for configuring cooperative thread bundles in a vector computing system
CN202011131448.9A Active CN112214243B (en) 2020-10-21 2020-10-21 Apparatus and method for configuring cooperative thread bundle in vector operation system
CN202210480192.5A Active CN114968358B (en) 2020-10-21 2020-10-21 Device and method for configuring cooperative thread warps in vector computing system

Family Applications After (2)

Application Number Title Priority Date Filing Date
CN202011131448.9A Active CN112214243B (en) 2020-10-21 2020-10-21 Apparatus and method for configuring cooperative thread bundle in vector operation system
CN202210480192.5A Active CN114968358B (en) 2020-10-21 2020-10-21 Device and method for configuring cooperative thread warps in vector computing system

Country Status (3)

Country Link
US (1) US20220121444A1 (en)
CN (3) CN114816529B (en)
TW (1) TWI793568B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114035847B (en) * 2021-11-08 2023-08-29 海飞科(南京)信息技术有限公司 Method and apparatus for parallel execution of kernel programs
CN114896079B (en) * 2022-05-26 2023-11-24 上海壁仞智能科技有限公司 Instruction execution method, processor and electronic device
CN116483536B (en) * 2023-04-24 2024-05-10 上海芷锐电子科技有限公司 Data scheduling method, computing chip and electronic equipment
CN116360708B (en) * 2023-05-26 2023-08-11 摩尔线程智能科技(北京)有限责任公司 Data writing method and device, electronic equipment and storage medium
CN118732958B (en) * 2024-09-02 2025-02-28 山东浪潮科学研究院有限公司 Warp-aware memory controller
CN119440774B (en) * 2025-01-08 2025-05-13 山东浪潮科学研究院有限公司 A conflicting thread warp scheduling method and a GPGPU register file access method and system
CN121166401A (en) * 2025-11-20 2025-12-19 上海壁仞科技股份有限公司 Inter-thread data sharing methods, electronic devices, storage media, and application products
CN121144256A (en) * 2025-11-20 2025-12-16 上海壁仞科技股份有限公司 Artificial intelligence chips and collaborative thread beam calculation methods

Citations (2)

Publication number Priority date Publication date Assignee Title
CN101344842A (en) * 2007-07-10 2009-01-14 北京简约纳电子有限公司 Multithreading processor and multithreading processing method
CN106575219A (en) * 2014-09-26 2017-04-19 英特尔公司 Instruction and logic for a vector format for processing computations

Family Cites Families (39)

Publication number Priority date Publication date Assignee Title
US6925643B2 (en) * 2002-10-11 2005-08-02 Sandbridge Technologies, Inc. Method and apparatus for thread-based memory access in a multithreaded processor
US7472258B2 (en) * 2003-04-21 2008-12-30 International Business Machines Corporation Dynamically shared group completion table between multiple threads
US7290261B2 (en) * 2003-04-24 2007-10-30 International Business Machines Corporation Method and logical apparatus for rename register reallocation in a simultaneous multi-threaded (SMT) processor
CN1278227C (en) * 2004-06-25 2006-10-04 中国科学院计算技术研究所 A processor multithreading method and apparatus based on MIPS instruction set
TWI296387B (en) * 2005-12-13 2008-05-01 Nat Univ Tsing Hua Scheduling method for remote object procedure call and system thereof
US8321849B2 (en) * 2007-01-26 2012-11-27 Nvidia Corporation Virtual architecture and instruction set for parallel thread computing
US7685409B2 (en) * 2007-02-21 2010-03-23 Qualcomm Incorporated On-demand multi-thread multimedia processor
US9639479B2 (en) * 2009-09-23 2017-05-02 Nvidia Corporation Instructions for managing a parallel cache hierarchy
US10360039B2 (en) * 2009-09-28 2019-07-23 Nvidia Corporation Predicted instruction execution in parallel processors with reduced per-thread state information including choosing a minimum or maximum of two operands based on a predicate value
US9710275B2 (en) * 2012-11-05 2017-07-18 Nvidia Corporation System and method for allocating memory of differing properties to shared data objects
US20140258680A1 (en) * 2013-03-05 2014-09-11 Qualcomm Incorporated Parallel dispatch of coprocessor instructions in a multi-thread processor
CN103955356B (en) * 2014-04-24 2017-05-10 深圳中微电科技有限公司 General-purpose register bank distribution method and device in multithreaded processor
US9804666B2 (en) * 2015-05-26 2017-10-31 Samsung Electronics Co., Ltd. Warp clustering
CN106325996B (en) * 2015-06-19 2019-11-19 华为技术有限公司 A method and system for allocating GPU resources
GB2539958B (en) * 2015-07-03 2019-09-25 Advanced Risc Mach Ltd Data processing systems
KR102545176B1 (en) * 2015-11-16 2023-06-19 삼성전자주식회사 Method and apparatus for register management
CN106648545A (en) * 2016-01-18 2017-05-10 天津大学 Register file structure used for branch processing in GPU
US10115175B2 (en) * 2016-02-19 2018-10-30 Qualcomm Incorporated Uniform predicates in shaders for graphics processing units
US10592466B2 (en) * 2016-05-12 2020-03-17 Wisconsin Alumni Research Foundation Graphic processor unit providing reduced storage costs for similar operands
US20170371662A1 (en) * 2016-06-23 2017-12-28 Intel Corporation Extension of register files for local processing of data in computing environments
US10929944B2 (en) * 2016-11-23 2021-02-23 Advanced Micro Devices, Inc. Low power and low latency GPU coprocessor for persistent computing
US10558460B2 (en) * 2016-12-14 2020-02-11 Qualcomm Incorporated General purpose register allocation in streaming processor
GB2558220B (en) * 2016-12-22 2019-05-15 Advanced Risc Mach Ltd Vector generating instruction
US20180203694A1 (en) * 2017-01-16 2018-07-19 Intel Corporation Execution Unit with Selective Instruction Pipeline Bypass
GB201717303D0 (en) * 2017-10-20 2017-12-06 Graphcore Ltd Scheduling tasks in a multi-threaded processor
US10866806B2 (en) * 2017-11-14 2020-12-15 Nvidia Corporation Uniform register file for improved resource utilization
US11163578B2 (en) * 2018-02-23 2021-11-02 Intel Corporation Systems and methods for reducing register bank conflicts based on a software hint bit causing a hardware thread switch
CN108595258B (en) * 2018-05-02 2021-07-27 北京航空航天大学 A dynamic extension method of GPGPU register file
CN108733492A (en) * 2018-05-20 2018-11-02 北京工业大学 A kind of batch scheduling memory method divided based on Bank
US11138009B2 (en) * 2018-08-10 2021-10-05 Nvidia Corporation Robust, efficient multiprocessor-coprocessor interface
US10698689B2 (en) * 2018-09-01 2020-06-30 Intel Corporation Recompiling GPU code based on spill/fill instructions and number of stall cycles
GB2580327B (en) * 2018-12-31 2021-04-28 Graphcore Ltd Register files in a multi-threaded processor
GB2584268B (en) * 2018-12-31 2021-06-30 Graphcore Ltd Load-Store Instruction
CN111562976B (en) * 2019-02-13 2023-04-18 同济大学 GPU (graphics processing unit) acceleration method and system for radar imaging of electrically large target
WO2020177229A1 (en) * 2019-03-01 2020-09-10 Huawei Technologies Co., Ltd. Inter-warp sharing of general purpose register data in gpu
WO2020186630A1 (en) * 2019-03-21 2020-09-24 Huawei Technologies Co., Ltd. Serializing divergent accesses using peeling
CN110716755B (en) * 2019-10-14 2023-05-02 浙江诺诺网络科技有限公司 Thread exit method, device, equipment and readable storage medium
CN111124492B (en) * 2019-12-16 2022-09-20 成都海光微电子技术有限公司 Instruction generation method and device, instruction execution method, processor and electronic equipment
US11934867B2 (en) * 2020-07-23 2024-03-19 Nvidia Corp. Techniques for divergent thread group execution scheduling

Also Published As

Publication number Publication date
CN112214243B (en) 2022-05-27
TWI793568B (en) 2023-02-21
CN114968358B (en) 2025-04-25
CN114816529A (en) 2022-07-29
TW202217601A (en) 2022-05-01
US20220121444A1 (en) 2022-04-21
CN112214243A (en) 2021-01-12
CN114968358A (en) 2022-08-30

Similar Documents

Publication Publication Date Title
CN114816529B (en) Apparatus and method for configuring cooperative thread bundles in a vector computing system
CN111310910B (en) Computing device and method
JP6895484B2 (en) Multithreaded processor register file
JP6944974B2 (en) Load / store instructions
Kapasi et al. The Imagine stream processor
CN112381220B (en) Neural network tensor processor
CN108376097B (en) Register file segments for supporting code block execution by using virtual cores instantiated by partitionable engines
US20170371660A1 (en) Load-store queue for multiple processor cores
US11720332B2 (en) Compiling a program from a graph
TWI794789B (en) Apparatus and method for vector computing
US20100145992A1 (en) Address Generation Unit Using Nested Loops To Scan Multi-Dimensional Data Structures
CN111651203A (en) A device and method for performing vector arithmetic
US20230305844A1 (en) Implementing specialized instructions for accelerating dynamic programming algorithms
CN117808048A (en) Operator execution method, device, equipment and storage medium
Yang et al. A case for a flexible scalar unit in SIMT architecture
WO2017185404A1 (en) Apparatus and method for performing vector logical operation
US20100146241A1 (en) Modified-SIMD Data Processing Architecture
US9003165B2 (en) Address generation unit using end point patterns to scan multi-dimensional data structures
Forsell et al. An extended PRAM-NUMA model of computation for TCF programming
CN120469721B (en) Vector core module of artificial intelligence chip and its operation method
US6785743B1 (en) Template data transfer coprocessor
Stepchenkov et al. Recurrent data-flow architecture: features and realization problems
US11822541B2 (en) Techniques for storing sub-alignment data when accelerating Smith-Waterman sequence alignments
Hussain et al. Mvpa: An fpga based multi-vector processor architecture
US10996960B1 (en) Iterating single instruction, multiple-data (SIMD) instructions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 201114 room 1302, 13 / F, building 16, 2388 Chenhang Road, Minhang District, Shanghai (Country or region after: China)
Applicant after: Shanghai Bi Ren Technology Co.,Ltd.
Address before: 201114 room 1302, 13 / F, building 16, 2388 Chenhang Road, Minhang District, Shanghai (Country or region before: China)
Applicant before: Shanghai Bilin Intelligent Technology Co.,Ltd.
GR01 Patent grant