
WO2006007193A1 - Method and apparatus to vectorize multiple input instructions - Google Patents


Info

Publication number
WO2006007193A1
Authority
WO
WIPO (PCT)
Prior art keywords
instructions
trace
candidate instructions
candidate
optimization unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2005/018444
Other languages
French (fr)
Inventor
Yoav Almog
Roni Rosner
Ronny Ronen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Intel Corp
Original Assignee
Intel Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intel Corp filed Critical Intel Corp
Priority to JP2007518079A priority Critical patent/JP2008503836A/en
Priority to GB0619968A priority patent/GB2429554B/en
Priority to CN2005800212790A priority patent/CN1977241B/en
Priority to DE112005001277T priority patent/DE112005001277B4/en
Publication of WO2006007193A1 publication Critical patent/WO2006007193A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current


Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38: Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3802: Instruction prefetching
    • G06F 9/3808: Instruction prefetching for instruction reuse, e.g. trace cache, branch target cache
    • G06F 9/3885: Concurrent instruction execution using a plurality of independent parallel functional units
    • G06F 9/3887: Concurrent instruction execution using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
    • G06F 15/00: Digital computers in general; Data processing equipment in general
    • G06F 15/76: Architectures of general purpose stored program computers
    • G06F 15/80: Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors


Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Executing Machine-Instructions (AREA)
  • Devices For Executing Special Programs (AREA)
  • Advance Control (AREA)

Abstract

Briefly, an optimization unit to search for two or more candidate instructions in an instruction trace and to merge the two or more candidate instructions into a single instruction with multiple data (SIMD) according to a depth of a trace dependency and a common operation code of the two or more candidate instructions.

Description

METHOD AND APPARATUS TO VECTORIZE MULTIPLE INPUT
INSTRUCTIONS
BACKGROUND OF THE INVENTION
[0001] A central processing unit (CPU) of a computer system may include multiple functional execution units for processing instructions in parallel. The instructions may include single instruction multiple data (SIMD) instructions. A SIMD instruction may execute a common operation on multiple data in parallel. Thus, SIMD instructions may allow the CPU to perform a plurality of iterative calculations simultaneously to reduce the overall execution time. The use of SIMD operations may be exceptionally productive in multi-media applications, such as audio and image processing.
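As an aside, and not part of the patent text, the contrast can be made concrete with a minimal C++ sketch that uses SSE2 intrinsics as one example SIMD instruction set; the data values are arbitrary and chosen only for illustration:

    // Four scalar additions versus one SSE2 packed addition (illustrative only).
    #include <emmintrin.h>   // SSE2 intrinsics
    #include <cstdio>

    int main() {
        int a[4] = {1, 2, 3, 4};
        int b[4] = {10, 20, 30, 40};
        int scalar_out[4], simd_out[4];

        // Scalar form: four separate ADD operations.
        for (int i = 0; i < 4; ++i)
            scalar_out[i] = a[i] + b[i];

        // SIMD form: a single instruction adds all four 32-bit lanes in parallel.
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
        __m128i vc = _mm_add_epi32(va, vb);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(simd_out), vc);

        for (int i = 0; i < 4; ++i)
            std::printf("%d %d\n", scalar_out[i], simd_out[i]);
        return 0;
    }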
BRIEF DESCRIPTION OF THE DRAWINGS
[0002] The subject matter regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of operation, together with objects, features and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings in which:
[0003] FIG. 1 is a block diagram of a computer system according to an exemplary embodiment of the present invention;
[0004] FIG. 2 is a block diagram of an optimizer unit according to an exemplary embodiment of the present invention;
[0005] FIG. 3 is an illustration of an exemplary dependency tree helpful to describe a method for transforming instructions into SIMD instructions according to exemplary embodiments of the invention;
[0006] FIG. 4 is an illustration of a table, helpful with the description of a vectorization operation according to an exemplary embodiment of the invention; and
[0007] FIG. 5 is an illustration of a table, helpful with the description of a vectorization operation according to another exemplary embodiment of the invention.
[0008] It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.
DETAILED DESCRIPTION OF THE INVENTION
[0009] In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those of ordinary skill in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the present invention.
[0010] Some portions of the detailed description which follow are presented in terms of algorithms and symbolic representations of operations on data bits or binary digital signals within a computer memory. These algorithmic descriptions and representations may be the techniques used by those skilled in the data processing arts to convey the substance of their work to others skilled in the art.
[0011] Unless specifically stated otherwise, as apparent from the following discussions, it is appreciated that throughout the specification discussions utilizing terms such as "processing," "computing," "calculating," "determining," or the like, refer to the action and/or processes of a computer or computing system, or similar electronic computing device, that manipulate and/or transform data represented as physical, such as electronic, quantities within the computing system's registers and/or memories into other data similarly represented as physical quantities within the computing system's memories, registers or other such information storage, transmission or display devices. In addition, the term "plurality" may be used throughout the specification to describe two or more components, devices, elements, parameters and the like. For example, "plurality of instructions" describes two or more instructions.
[0012] It should be understood that the terms SIMDification and vectorization are equivalent terms that may refer to the process of merging operations that may be scheduled together for execution and require similar execution resources, such as, for example, registers and functional units, into a single SIMD instruction. Although the scope of the present invention is not limited in this respect, for the simplicity and clarity of the description the term vectorization will be used to describe the process of merging operations that may be scheduled together for execution and require similar execution resources.
[0013] It should be understood that the present invention may be used in a variety of applications. Although the present invention is not limited in this respect, the circuits and techniques disclosed herein may be used in many apparatuses such as computer systems, processors, CPUs or the like. Processors intended to be included within the scope of the present invention include, by way of example only, a reduced instruction set computer (RISC), a processor that has a pipeline, a complex instruction set computer (CISC) and the like.
[0014] Some embodiments of the invention may be implemented, for example, using a machine-readable medium or article which may store an instruction or a set of instructions that, if executed by a machine (for example, by a processor and/or by other suitable machines), cause the machine to perform a method and/or operations in accordance with embodiments of the invention. Such a machine may include, for example, any suitable processing platform, computing platform, computing device, processing device, computing system, processing system, computer, processor, or the like, and may be implemented using any suitable combination of hardware and/or software. The machine-readable medium or article may include, for example, any suitable type of memory unit, memory device, memory article, memory medium, storage device, storage article, storage medium and/or storage unit, for example, memory, removable or non-removable media, erasable or non-erasable media, writeable or re-writeable media, digital or analog media, hard disk, floppy disk, Compact Disk Read Only Memory (CD-ROM), Compact Disk Recordable (CD-R), Compact Disk Rewriteable (CD-RW), optical disk, magnetic media, various types of Digital Versatile Disks (DVDs), a tape, a cassette, or the like. The instructions may include any suitable type of code, for example, source code, compiled code, interpreted code, executable code, static code, dynamic code, or the like, and may be implemented using any suitable high-level, low-level, object-oriented, visual, compiled and/or interpreted programming language, e.g., C, C++, Java, BASIC, Pascal, Fortran, Cobol, assembly language, machine code, or the like.
[0015] Turning to FIG. 1, a block diagram of a computer system 100 according to an exemplary embodiment of the invention is shown. Although the scope of the present invention is not limited in this respect, computer system 100 may be a personal computer (PC), a personal digital assistant (PDA), an Internet appliance, a cellular telephone, or any other computing device. In one example, computer system 100 may include a main processing unit 110 powered by a power supply 120. In embodiments of the invention, main processing unit 110 may include a multi-processing unit 130 electrically coupled by a system interconnect 135 to a memory device 140 and one or more interface circuits 150. For example, the system interconnect 135 may be an address/data bus, if desired. It should be understood that interconnects other than busses may be used to connect multi-processing unit 130 to memory device 140. For example, one or more dedicated lines and/or a crossbar may be used to connect multi-processing unit 130 to memory device 140.
[0016] According to some embodiments of the invention, multi-processing unit 130 may include any type of processing unit, such as, for example, a processor from the Intel® Pentium™ family of microprocessors, the Intel® Itanium™ family of microprocessors, and/or the Intel® XScale™ family of processors. In addition, multi-processing unit 130 may include any type of cache memory, such as, for example, static random access memory (SRAM) and the like. Memory device 140 may include a dynamic random access memory (DRAM), non-volatile memory, or the like. In one example, memory device 140 may store a software program which may be executed by multi-processing unit 130, if desired.
[0017] Although the scope of the present invention is not limited in this respect, interface circuit(s) 150 may include an Ethernet interface and/or a Universal Serial Bus (USB) interface, and/or the like. In some exemplary embodiments of the invention, one or more input devices 160 may be connected to interface circuits 150 for entering data and commands into the main processing unit 110. For example, input devices 160 may include a keyboard, mouse, touch screen, track pad, track ball, isopoint, a voice recognition system, and/or the like.
[0018] Although the scope of the present invention is not limited in this respect, the output devices 170 may be operably coupled to main processing unit 110 via one or more of the interface circuits 150 and may include one or more displays, printers, speakers, and/or other output devices, if desired. For example, one of the output devices may be a display. The display may be a cathode ray tube (CRT), a liquid crystal display (LCD), or any other type of display.
[0019] Although the scope of the present invention is not limited in this respect, computer system 100 may include one or more storage devices 180. For example, computer system 100 may include one or more hard drives, one or more compact disk (CD) drives, one or more digital versatile disk (DVD) drives, and/or other computer media input/output (I/O) devices, if desired.
[0020] Although the scope of the present invention is not limited in this respect, computer system 100 may exchange data with other devices via a connection to a network 190. The network connection may be any type of network connection, such as an Ethernet connection, digital subscriber line (DSL), telephone line, coaxial cable, etc. Network 190 may be any type of network, such as the Internet, a telephone network, a cable network, a wireless network and/or the like.
[0021] Although the scope of the present invention is not limited to this embodiment, in this exemplary embodiment of the invention, multi-processing unit 130 may include an optimization unit 200. According to embodiments of the invention, optimization unit 200 may perform the process of searching for two or more candidate instructions in a trace. Furthermore, optimization unit 200 may merge the two or more candidate instructions into a SIMD instruction according to a depth of a trace dependency tree. In some embodiments of the invention, the candidate instructions may include a similar and/or the same type of operation code that may be included in the SIMD instruction. For example, optimization unit 200 may search for candidate instructions that perform similar operations based on the depth of dependency of the candidate instructions. According to embodiments of the invention, optimization unit 200 may merge at least some of the candidate instructions into a SIMD instruction, if desired. Although the scope of the present invention is not limited in this respect, it should be understood that optimization unit 200 may be implemented in software, in hardware, or in any suitable combination of software and hardware.
[0022] Turning to FIG. 2, a block diagram of optimization unit 200 of FIG. 1 according to an exemplary embodiment of the invention is shown. Although the scope of the present invention is not limited in this respect, optimization unit 200 may include an input trace buffer 210, a sequencer 220, a vectorization unit 230 and an output trace buffer 240. Although the scope of the present invention is not limited in this respect, in some exemplary embodiments of the present invention, vectorization unit 230 may include a first (1st) stage 232, a second (2nd) stage 234 and a memory 236, for example, a cache memory.
[0023] Although the scope of the present invention is not limited in this respect, input trace buffer 210 may receive a trace of instructions which may include operation (op) codes. In some embodiments of the invention, sequencer 220 may pull instructions from input trace buffer 210, and may provide a trace (e.g. a sequence) of operation codes and/or instructions to vectorization unit 230. For example, an instruction may include at least two types of operations: memory operations such as, for example, LOAD, STORE, etc., and arithmetic operations such as, for example, ADD, SUBTRACT, MULT, SHIFT, AND, etc. In addition, the instruction may include input values and output values such as, for example, registers and/or constants.
[0024] According to an embodiment of the invention, vectorization unit 230 may receive the trace from sequencer 220 and may search for candidate instructions according to trace dependencies. In some embodiments of the invention, 1st stage 232 may process the op codes of instructions received from sequencer 220. For example, instructions and/or op codes of the trace may be transformed into single static assignment (SSA) form. In SSA form, a register may be written only once in the trace, and a renaming process may introduce a "virtual" register name in order to satisfy the SSA condition. A program code such as, for example, a program code written in a conventional Instruction Set Architecture (ISA), may present two source registers with the same name as identical registers, although the scope of the present invention is not limited in this respect.
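As an illustration only (the structures and names below are assumptions, not the patent's code), a small C++ sketch of the renaming step described above: each architectural destination register receives a fresh "virtual" name so that every register is written at most once in the trace, and later reads are redirected to the most recent name.

    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    // Hypothetical trace instruction: one destination, register sources, an op code.
    struct Inst {
        std::string dest;
        std::vector<std::string> srcs;
        std::string op;
    };

    // Rename destinations so each register is written only once (SSA-like form).
    std::vector<Inst> to_ssa(std::vector<Inst> trace) {
        std::map<std::string, int> writes;           // how many times a register was written
        std::map<std::string, std::string> current;  // architectural name -> current virtual name
        for (Inst& in : trace) {
            for (std::string& s : in.srcs)           // reads use the latest virtual name
                if (current.count(s)) s = current[s];
            int v = writes[in.dest]++;               // each write gets a fresh virtual name
            std::string vname = (v == 0) ? in.dest : in.dest + "_" + std::to_string(v);
            current[in.dest] = vname;
            in.dest = vname;
        }
        return trace;
    }

    int main() {
        std::vector<Inst> t = {
            {"EAX", {"ESP"}, "LOAD"},  // EAX written a first time
            {"EAX", {"ESP"}, "LOAD"},  // EAX rewritten: renamed to EAX_1
            {"EBX", {"EAX"}, "ADD"},   // reads the most recent EAX version (EAX_1)
        };
        for (const Inst& in : to_ssa(t))
            std::cout << in.dest << " <- " << in.op << "(" << in.srcs[0] << ")\n";
        return 0;
    }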
[0025] Although the scope of the present invention is not limited in this respect, 1st stage 232 may search for a candidate for vectorization by placing the instructions in a dependency tree.
[0026] Turning to FIG. 3, an illustration of an exemplary dependency tree 300 helpful in describing a method for generating SIMD instructions according to exemplary embodiments of the invention is shown. Although the scope of the present invention is not limited in this respect, dependency tree 300 may include instructions at different heights. A level of the dependency tree 300 may include instructions at the same height. A first level 310 may include instructions 312 and 314, a second level 320 may include an instruction 322, a third level 330 may include instructions 332 and 334 and a fourth level 340 may include an instruction 342, although the scope of the present invention is in no way limited in this respect. In addition, the depth of dependency tree 300 may be calculated according to the distance from the first height 310 to the last height 340 of dependency tree 300 (e.g. the distance may be shown by the arrows from level to level).
[0027] Turning back to FIG. 2, although the scope of the present invention is not limited in this respect, 1st stage 232 may store the candidate instructions for vectorization in memory 236. According to embodiments of the invention, 2nd stage 234 may search memory 236 for similar op codes having the same or similar level and may generate the SIMD instruction. Furthermore, 2nd stage 234 may replace the original trace instructions with the SIMD instruction and may store the SIMD instructions in output trace buffer 240.
[0028] Although the scope of the present invention is not limited in this respect, the operation of 1st stage 232 and 2nd stage 234 of optimization unit 200 may be described by an exemplary C-like pseudo code algorithm.
[0029] Although the scope of the present invention is not limited in this respect, the first part of the exemplary C-like pseudo code algorithm may define the constants, variables, structures and the like. For example, the maximum number of instructions in a trace may be defined as:
    const MAX_TRACE_SIZE
The maximum number of sources of an instruction may be defined as:
    const MAX_SOURCES
The maximum number of destinations of an instruction may be defined as:
    const MAX_DEST
The trace range and the internal buffer size may be defined as:
    rangedef [0 .. MAX_TRACE_SIZE-1] inst_index_range
    inst_index_range M, N
[0030] According to the exemplary C-like pseudo code algorithm, an instruction structure may include source registers, an op code, destination registers and a Boolean variable that may indicate if the instruction is suitable for vectorization. The instruction structure may be defined as:
    Structure instruction_type
    {
        source_type      [MAX_SOURCES] sources
        destination_type [MAX_DEST]    destinations
        operation_type   operation
        Boolean          valid
    }
[0031] According to the exemplary C-like pseudo code algorithm, a trace may be defined as a sequence of at most MAX_TRACE_SIZE instructions, represented by a vector of MAX_TRACE_SIZE entries. In addition, a two-dimensional (2D) trace dependencies bitmap may be used to indicate the validity of an instruction of the trace. If the actual number of instructions in the trace is INITIAL_TRACE_SIZE, then only the first INITIAL_TRACE_SIZE entries may be valid:
    inst_index_range INITIAL_TRACE_SIZE
    instruction_type trace     [MAX_TRACE_SIZE]
    Boolean          dependent [MAX_TRACE_SIZE, MAX_TRACE_SIZE]
[0032] According to the exemplary C-like pseudo code algorithm, a SIMD matrix, which may be stored in memory 236, may include the operation codes and may hold N lines of M op code locations (e.g. a total of N*M*log(MAX_TRACE_SIZE) bits):
    Structure entry_type
    {
        Boolean          valid
        inst_index_range loc
    }
    entry_type simd_t [N][M]
[0033] Although the scope of the present invention is not limited in this respect, in this exemplary algorithm, 1st stage 232 of vectorization unit 230 may search for candidate instructions in the trace by iterating over the instructions in the trace in ascending order. 1st stage 232 may compute the set of all predecessors of trace[i], which may be constructed during the renaming process. Furthermore, 1st stage 232 may tag the height (e.g. level) of the instructions in the dependency tree (e.g. dependency tree 300) by computing the dependency height (e.g. level) of trace[i] and its earliest potential scheduling location.
    For i = 0 to INITIAL_TRACE_SIZE - 1
        Predecessors = { j | j < i AND dependent[i, j] }
        Height <- 0
        EarliestLocation <- 0
        For Each p in Predecessors
            Height <- max(Height, Height[p] + 1)
            EarliestLocation <- max(EarliestLocation, p)
        End for
        Height[i] <- Height
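For concreteness, a self-contained C++ rendering of the height and earliest-location pass above (an illustration under assumed data, not the patent's code), applied to a hypothetical four-instruction trace:

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    int main() {
        // dependent[i][j] == true means trace[i] depends on the older instruction trace[j].
        // Hypothetical trace: instruction 2 depends on 0 and 1, instruction 3 depends on 2.
        const int n = 4;
        bool dependent[4][4] = {};
        dependent[2][0] = dependent[2][1] = true;
        dependent[3][2] = true;

        std::vector<int> height(n, 0), earliest(n, 0);
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < i; ++j) {               // predecessors of trace[i]
                if (!dependent[i][j]) continue;
                height[i]   = std::max(height[i], height[j] + 1);
                earliest[i] = std::max(earliest[i], j); // latest predecessor position
            }
            std::printf("trace[%d]: height=%d earliest=%d\n", i, height[i], earliest[i]);
        }
        // Instructions 0 and 1 end up at height 0, instruction 2 at height 1, instruction 3 at height 2.
        return 0;
    }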
[0034] Although the scope of the present invention is not limited in this respect, in this exemplary C-like pseudo code algorithm, 2nd stage 234 may search memory 236 (e.g. the SIMD matrix) for instructions suitable for vectorization. For example, a suitable instruction may be an older instruction trace[j] at the same dependency tree height (e.g. level). In addition, 2nd stage 234 may generate SIMD instructions and may replace the original instructions with the SIMD instructions as is shown below:
        op_type = trace[i].type
        j <- -1
        For m <- 0 to M-1
            If (simd<op_type>[Height][m].valid == true) &&
               (simd<op_type>[Height][m].loc > EarliestLocation) &&
               (additional constraints are satisfied) then
                j = simd<op_type>[Height][m].loc
                Break
            End if
        End for
        If (j == -1) then
            Allocate i into simd<op_type>[Height]
        Else
            trace[j] <- Vectorization(trace[j], trace[i])
            trace[i].valid <- false
            // Update dependencies by replacing each reference to trace[i] by
            // a reference to trace[j]
            // row-vector operation
            dependent[j][*] <- dependent[i][*] | dependent[j][*]
            // column-vector operation
            dependent[*][j] <- dependent[*][i] | dependent[*][j]
        End if
    End for
[0035] According to some embodiments of the invention, optimization unit 200 may generate a SIMD instruction according to the rule that two instructions accessing a memory may be combined into a single SIMD instruction if they access continuous memory addresses. That is, it may be calculated from their memory addresses and corresponding data widths that the data accessed by the two instructions is adjacent (at least in the virtual memory space). For example, in a trace that includes the following instructions:
1. LOAD 4 bytes from ESP + 4
2. LOAD 4 bytes from ESP + 12
3. LOAD 4 bytes from ESP + 8
the instructions may be combined into a single SIMD instruction LOAD 12 bytes from ESP + 4, if desired.
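A minimal C++ sketch of the adjacency test implied by this rule (the field names below are assumptions for illustration, not the patent's structures): two accesses off the same base register may be merged when the end of one access is exactly the start of the other.

    #include <cstdio>

    // Hypothetical memory access: base register id, constant offset, width in bytes.
    struct MemAccess {
        int base_reg;
        int offset;
        int width;
    };

    // True when the two accesses touch back-to-back addresses off the same base register,
    // e.g. LOAD 4 bytes from ESP+4 and LOAD 4 bytes from ESP+8.
    bool adjacent(const MemAccess& a, const MemAccess& b) {
        if (a.base_reg != b.base_reg) return false;
        return a.offset + a.width == b.offset || b.offset + b.width == a.offset;
    }

    int main() {
        MemAccess l1{0, 4, 4}, l2{0, 8, 4}, l3{0, 16, 4};
        std::printf("%d %d\n", adjacent(l1, l2), adjacent(l1, l3));  // prints "1 0"
        return 0;
    }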
[0036] Turning to FIG. 4, table 400 is shown. Although the scope of the present invention is not limited in this respect, table 400 may include a level column that shows the level of the instructions in the dependency tree (e.g. dependency tree 300), an original trace column that shows the original instructions that may be provided by input trace buffer 210 and sequencer 220, and a trace-after-vectorization column that may show the instructions at output trace buffer 240. The rows of table 400 may show the level of an instruction, the original instruction and the instruction after vectorization.
[0037] Although the scope of the present invention is not limited in this respect, optimization unit 200 may tag the depth of the trace dependency graph (e.g. the height of the instructions of the trace). In addition, for example, according to table 400, optimization unit 200 may identify instructions EAX ← LOAD(ESP, 4) and EBX ← LOAD(ESP, 8) that are in the same level (e.g. level 2) as candidates for vectorization and may combine the candidate instructions into a SIMD instruction EAX, EBX ← SIMD_LOAD(ESP, 4), if desired. Although the scope of the present invention is not limited in this respect, optimization unit 200 may generate the SIMD instruction by following the rule that two instructions with a common operation (e.g. LOAD) and at the same depth of the trace dependency graph (e.g. the same height) may be combined into a single SIMD instruction (e.g. SIMD_LOAD) if all their non-constant (i.e. register) sources are similar and/or the constant or immediate sources may differ.
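A corresponding C++ sketch of the candidate test stated above (again illustrative; the record layout is an assumption): two instructions qualify when they share the op code and dependency height and their register sources match, while constant or immediate sources are free to differ.

    #include <string>
    #include <vector>

    // Hypothetical candidate-instruction record: op code, dependency-tree level,
    // register sources and constant/immediate sources.
    struct Cand {
        std::string op;
        int level;
        std::vector<std::string> reg_srcs;
        std::vector<long> const_srcs;
    };

    // Same op code, same height, identical register sources; constants may differ.
    bool can_vectorize(const Cand& a, const Cand& b) {
        return a.op == b.op && a.level == b.level && a.reg_srcs == b.reg_srcs;
    }

    int main() {
        Cand load_eax{"LOAD", 2, {"ESP"}, {4}};  // EAX <- LOAD(ESP, 4)
        Cand load_ebx{"LOAD", 2, {"ESP"}, {8}};  // EBX <- LOAD(ESP, 8)
        return can_vectorize(load_eax, load_ebx) ? 0 : 1;  // returns 0: a valid pair
    }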
[0038] Turning to FIG. 5, a table 500 according to another exemplary embodiment of the invention is shown. Although the scope of the present invention is not limited in this respect, table 500 may include a level column that shows the level of the original instructions in the dependency tree (e.g. dependency tree 300), an original trace column that shows the original instructions that may be provided by input trace buffer 210 and sequencer 220, a level column that shows the level of the instructions after a basic transformation, for example SSA, a column that shows the instructions after the transformation, and a column that shows the instructions in a trace after vectorization at output trace buffer 240. The rows of table 500 may show the level of an instruction, the original instruction, the level of the instruction after the basic transformation, the instruction after the basic transformation and the instruction after vectorization.
[0039] Although the scope of the present invention is not limited in this respect, according to exemplary table 500, optimization unit 200 may tag the height of the original instructions in the trace. Optimization unit 200 may transform the instructions of the trace, for example, into SSA form. Optimization unit 200 may tag the transformed instructions that are at the same level as candidate instructions for vectorization, for example, EAX_1 ← LOAD(ESI+4, 0), EAX ← LOAD(ESI+8, 0) and ASSERT(EAX_1 ≠ 1), ASSERT(EAX ≠ 1), and may combine them into the SIMD instructions EAX, EAX_1 ← SIMD_LOAD(ESI+4, 0) and SIMD_ASSERT(EAX_1 ≠ 1, EAX ≠ 1), respectively.
[0040] While certain features of the invention have been illustrated and described herein, many modifications, substitutions, changes, and equivalents will now occur to those skilled in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the invention.

Claims

[0041] What is claimed is:
1. An apparatus comprising: an optimization unit to search for two or more candidate instructions in a trace and to merge the two or more candidate instructions into a single instruction with multiple data according to a depth of a trace dependency and a common operation code of the two or more candidate instructions.
2. The apparatus of claim 1, wherein the common operation code is selected from a group consisting of memory operation codes and arithmetic operation codes.
3. The apparatus of claim 1, wherein the optimization unit comprises: a first stage to search for the two or more candidate instructions according to the depth of the trace dependency and the common operation code; a cache memory to store the candidate instructions; and a second stage to combine the two or more candidate instructions into the single instruction with multiple data.
4. The apparatus of claim 1, wherein the optimization unit is able to combine the two or more candidate instructions that access continuous memory addresses.
5. The apparatus of claim 1, wherein the optimization unit is able to transform the trace instruction into a desired form.
6. The apparatus of claim 5, wherein the desired form is a single static assignment form.
7. A method comprising: searching for two or more candidate instructions in a trace; and merging the two or more candidate instructions into a single instruction with multiple data (SIMD) according to a depth of a trace dependency and a common operation code of the two or more candidate instructions.
8. The method of claim 7 comprising: selecting the common operation code from a group consisting of memory operation codes and arithmetic operation codes.
9. The method of claim 7 comprising: combining the two or more candidate instructions that access continuous memory addresses.
10. The method of claim 7, wherein merging comprises transforming instructions of the trace into a desired form.
11. A system comprising: a bus; a memory device coupled to the bus; and a processor to include an optimization unit to search for two or more candidate instructions in a trace and to merge the two or more candidate instructions into a single instruction with multiple data according to a depth of a trace dependency and a common operation code of the two or more candidate instructions.
12. The system of claim 11, wherein the common operation code is selected from a group consisting of memory operation codes and arithmetic operation codes.
13. The system of claim 11, wherein the optimization unit comprises: a first stage to search for the two or more candidate instructions according to the depth of the trace dependency and the common operation code; a cache memory to store the candidate instructions; and a second stage to combine the two or more candidate instructions into the single instruction with multiple data.
14. The system of claim 11, wherein the optimization unit is able to combine the two or more candidate instructions that access continuous memory addresses.
15. The apparatus of claim 1, wherein the optimization unit is able to transform the trace instruction into a desired form.
16. The apparatus of claim 15, wherein the desired form is a single static assignment form.
17. An article comprising: a storage medium, having stored thereon instructions, that when executed, result in: searching for two or more candidate instructions in a trace; and merging the two or more candidate instructions into a single instruction with multiple data (SIMD) according to a depth of a trace dependency and a common operation code of the two or more candidate instructions.
18. The article of claim 17, wherein instructions, when executed, result in: selecting the common operation code from a group consisting of memory operation codes and arithmetic operation codes.
19. The article of claim 17, wherein instructions, when executed, result in: combining the two or more candidate instructions that access continuous memory addresses.
20. The article of claim 17, wherein instructions, when executed, result in: transforming instructions of the trace into a desired form.
PCT/US2005/018444 2004-06-24 2005-05-25 Method and apparatus to vectorize multiple input instructions Ceased WO2006007193A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
JP2007518079A JP2008503836A (en) 2004-06-24 2005-05-25 Method and apparatus for vectorizing a plurality of input instructions
GB0619968A GB2429554B (en) 2004-06-24 2005-05-25 Apparatus to vectorize multiple input instructions
CN2005800212790A CN1977241B (en) 2004-06-24 2005-05-25 Method and apparatus for vectorizing multiple input instructions
DE112005001277T DE112005001277B4 (en) 2004-06-24 2005-05-25 Method and device for vectoring multiple input commands

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US10/874,744 US7802076B2 (en) 2004-06-24 2004-06-24 Method and apparatus to vectorize multiple input instructions
US10/874,744 2004-06-24

Publications (1)

Publication Number Publication Date
WO2006007193A1 true WO2006007193A1 (en) 2006-01-19

Family

ID=35033618

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2005/018444 Ceased WO2006007193A1 (en) 2004-06-24 2005-05-25 Method and apparatus to vectorize multiple input instructions

Country Status (6)

Country Link
US (1) US7802076B2 (en)
JP (2) JP2008503836A (en)
CN (1) CN1977241B (en)
DE (2) DE112005001277B4 (en)
GB (1) GB2429554B (en)
WO (1) WO2006007193A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8943484B2 (en) 2012-03-29 2015-01-27 Fujitsu Limited Code generation method and information processing apparatus
GB2520571A (en) * 2013-11-26 2015-05-27 Advanced Risc Mach Ltd A data processing apparatus and method for performing vector processing
US9213548B2 (en) 2012-03-29 2015-12-15 Fujitsu Limited Code generation method and information processing apparatus
US9256437B2 (en) 2012-03-29 2016-02-09 Fujitsu Limited Code generation method, and information processing apparatus
US9823911B2 (en) 2014-01-31 2017-11-21 Fujitsu Limited Method and apparatus for compiling code based on a dependency tree
US10606595B2 (en) 2018-03-23 2020-03-31 Arm Limited Data processing systems

Families Citing this family (49)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7478377B2 (en) * 2004-06-07 2009-01-13 International Business Machines Corporation SIMD code generation in the presence of optimized misaligned data reorganization
US8549501B2 (en) * 2004-06-07 2013-10-01 International Business Machines Corporation Framework for generating mixed-mode operations in loop-level simdization
US7395531B2 (en) * 2004-06-07 2008-07-01 International Business Machines Corporation Framework for efficient code generation using loop peeling for SIMD loop code with multiple misaligned statements
US7475392B2 (en) * 2004-06-07 2009-01-06 International Business Machines Corporation SIMD code generation for loops with mixed data lengths
US7367026B2 (en) * 2004-06-07 2008-04-29 International Business Machines Corporation Framework for integrated intra- and inter-loop aggregation of contiguous memory accesses for SIMD vectorization
US7386842B2 (en) * 2004-06-07 2008-06-10 International Business Machines Corporation Efficient data reorganization to satisfy data alignment constraints
US7849292B1 (en) 2005-09-28 2010-12-07 Oracle America, Inc. Flag optimization of a trace
US7953933B1 (en) 2005-09-28 2011-05-31 Oracle America, Inc. Instruction cache, decoder circuit, basic block cache circuit and multi-block cache circuit
US7937564B1 (en) 2005-09-28 2011-05-03 Oracle America, Inc. Emit vector optimization of a trace
US7877630B1 (en) 2005-09-28 2011-01-25 Oracle America, Inc. Trace based rollback of a speculatively updated cache
US8032710B1 (en) 2005-09-28 2011-10-04 Oracle America, Inc. System and method for ensuring coherency in trace execution
US8037285B1 (en) 2005-09-28 2011-10-11 Oracle America, Inc. Trace unit
US7676634B1 (en) 2005-09-28 2010-03-09 Sun Microsystems, Inc. Selective trace cache invalidation for self-modifying code via memory aging
US8499293B1 (en) 2005-09-28 2013-07-30 Oracle America, Inc. Symbolic renaming optimization of a trace
US8051247B1 (en) 2005-09-28 2011-11-01 Oracle America, Inc. Trace based deallocation of entries in a versioning cache circuit
US7966479B1 (en) 2005-09-28 2011-06-21 Oracle America, Inc. Concurrent vs. low power branch prediction
US7949854B1 (en) 2005-09-28 2011-05-24 Oracle America, Inc. Trace unit with a trace builder
US7953961B1 (en) 2005-09-28 2011-05-31 Oracle America, Inc. Trace unit with an op path from a decoder (bypass mode) and from a basic-block builder
US8019944B1 (en) 2005-09-28 2011-09-13 Oracle America, Inc. Checking for a memory ordering violation after a speculative cache write
US7987342B1 (en) 2005-09-28 2011-07-26 Oracle America, Inc. Trace unit with a decoder, a basic-block cache, a multi-block cache, and sequencer
US7814298B1 (en) 2005-09-28 2010-10-12 Oracle America, Inc. Promoting and appending traces in an instruction processing circuit based upon a bias value
US7783863B1 (en) 2005-09-28 2010-08-24 Oracle America, Inc. Graceful degradation in a trace-based processor
US7870369B1 (en) 2005-09-28 2011-01-11 Oracle America, Inc. Abort prioritization in a trace-based processor
US8024522B1 (en) 2005-09-28 2011-09-20 Oracle America, Inc. Memory ordering queue/versioning cache circuit
US8015359B1 (en) 2005-09-28 2011-09-06 Oracle America, Inc. Method and system for utilizing a common structure for trace verification and maintaining coherency in an instruction processing circuit
US8370576B1 (en) 2005-09-28 2013-02-05 Oracle America, Inc. Cache rollback acceleration via a bank based versioning cache ciruit
US7797517B1 (en) * 2005-11-18 2010-09-14 Oracle America, Inc. Trace optimization via fusing operations of a target architecture operation set
US7681019B1 (en) 2005-11-18 2010-03-16 Sun Microsystems, Inc. Executing functions determined via a collection of operations from translated instructions
US8904151B2 (en) * 2006-05-02 2014-12-02 International Business Machines Corporation Method and apparatus for the dynamic identification and merging of instructions for execution on a wide datapath
US8010745B1 (en) 2006-09-27 2011-08-30 Oracle America, Inc. Rolling back a speculative update of a non-modifiable cache line
US8370609B1 (en) 2006-09-27 2013-02-05 Oracle America, Inc. Data cache rollbacks for failed speculative traces with memory operations
US8056067B2 (en) * 2006-09-29 2011-11-08 International Business Machines Corporation Method, computer program product, and device for reducing delays in data processing
US8640112B2 (en) * 2011-03-30 2014-01-28 National Instruments Corporation Vectorizing combinations of program operations
JP5887811B2 (en) * 2011-10-05 2016-03-16 富士通株式会社 Compiling device, compiling method, compiling program, and recording medium
US9009686B2 (en) * 2011-11-07 2015-04-14 Nvidia Corporation Algorithm for 64-bit address mode optimization
TWI447646B (en) 2011-11-18 2014-08-01 Asmedia Technology Inc Data transmission device and method for merging multiple instructions
WO2013089750A1 (en) * 2011-12-15 2013-06-20 Intel Corporation Methods to optimize a program loop via vector instructions using a shuffle table and a blend table
JP5413473B2 (en) * 2012-03-01 2014-02-12 NEC Corp Vector processing apparatus and vector processing method
US9513915B2 (en) 2012-03-28 2016-12-06 International Business Machines Corporation Instruction merging optimization
US9292291B2 (en) 2012-03-28 2016-03-22 International Business Machines Corporation Instruction merging optimization
WO2014137327A1 (en) 2013-03-05 2014-09-12 Intel Corporation Analyzing potential benefits of vectorization
US9348596B2 (en) 2013-06-28 2016-05-24 International Business Machines Corporation Forming instruction groups based on decode time instruction optimization
CN103440229B (en) * 2013-08-12 2017-11-10 Inspur Electronic Information Industry Co Ltd Vectorization optimization method based on MIC architecture processors
US11042929B2 (en) 2014-09-09 2021-06-22 Oracle Financial Services Software Limited Generating instruction sets implementing business rules designed to update business objects of financial applications
DE102015013627A1 (en) * 2015-10-20 2017-04-20 Fresenius Medical Care Deutschland GmbH Blood treatment device and prescription procedure
US10061580B2 (en) 2016-02-25 2018-08-28 International Business Machines Corporation Implementing a received add program counter immediate shift (ADDPCIS) instruction using a micro-coded or cracked sequence
KR102593320B1 (en) 2016-09-26 2023-10-25 Samsung Electronics Co Ltd Electronic apparatus, process and control method thereof
CN110858150A (en) * 2018-08-22 2020-03-03 Shanghai Cambricon Information Technology Co Ltd A computing device with local real-time reconfigurable pipeline stages
WO2021260888A1 (en) * 2020-06-25 2021-12-30 NEC Corp Information processing device, information processing method, and recording medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4710872A (en) * 1985-08-07 1987-12-01 International Business Machines Corporation Method for vectorizing and executing on an SIMD machine outer loops in the presence of recurrent inner loops
JPH10133885A (en) * 1996-10-28 1998-05-22 Hitachi Ltd Batch instruction generation and compilation method
US5920716A (en) * 1996-11-26 1999-07-06 Hewlett-Packard Company Compiling a predicated code with direct analysis of the predicated code
US5956503A (en) * 1997-04-14 1999-09-21 International Business Machines Corporation Method and system for front-end and back-end gathering of store instructions within a data-processing system
JP4125847B2 (en) 1998-11-27 2008-07-30 Matsushita Electric Industrial Co Ltd Processor, compile device, and recording medium recording compile program
JP2001306332A (en) * 2000-04-20 2001-11-02 NEC Corp Method for avoiding excess overhead by using an SSA form extended to use storage locations other than local variables
US20030023960A1 (en) * 2001-07-25 2003-01-30 Shoab Khan Microprocessor instruction format using combination opcodes and destination prefixes
JP2003131887A (en) * 2001-10-25 2003-05-09 Hitachi Ltd Batch compilation method for variable loading and processing
JP4045802B2 (en) 2002-01-08 2008-02-13 Sony Corp Program processing apparatus, program processing method, storage medium, and computer program

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4792894A (en) * 1987-03-17 1988-12-20 Unisys Corporation Arithmetic computation modifier based upon data dependent operations for SIMD architectures

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BULIC P ET AL: "Fast dependence analysis in a multimedia vectorizing compiler", PARALLEL, DISTRIBUTED AND NETWORK-BASED PROCESSING, 2004. PROCEEDINGS. 12TH EUROMICRO CONFERENCE ON 11-13 FEB. 2004, PISCATAWAY, NJ, USA, IEEE, 11 February 2004 (2004-02-11), pages 176 - 183, XP010685161, ISBN: 0-7695-2083-9 *
PAJUELO A. ET AL.: "Speculative Dynamic Vectorization", PROCEEDINGS OF THE 29TH. INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE, ISCA2002, 25 May 2002 (2002-05-25) - 29 May 2002 (2002-05-29), US, pages 271 - 280, XP002348593 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8943484B2 (en) 2012-03-29 2015-01-27 Fujitsu Limited Code generation method and information processing apparatus
US9213548B2 (en) 2012-03-29 2015-12-15 Fujitsu Limited Code generation method and information processing apparatus
US9256437B2 (en) 2012-03-29 2016-02-09 Fujitsu Limited Code generation method, and information processing apparatus
GB2520571A (en) * 2013-11-26 2015-05-27 Advanced Risc Mach Ltd A data processing apparatus and method for performing vector processing
US9672035B2 (en) 2013-11-26 2017-06-06 Arm Limited Data processing apparatus and method for performing vector processing
GB2520571B (en) * 2013-11-26 2020-12-16 Advanced Risc Mach Ltd A data processing apparatus and method for performing vector processing
US9823911B2 (en) 2014-01-31 2017-11-21 Fujitsu Limited Method and apparatus for compiling code based on a dependency tree
US10606595B2 (en) 2018-03-23 2020-03-31 Arm Limited Data processing systems

Also Published As

Publication number Publication date
DE112005001277B4 (en) 2012-10-31
JP2008503836A (en) 2008-02-07
GB2429554A (en) 2007-02-28
US20050289529A1 (en) 2005-12-29
GB2429554B (en) 2009-04-22
JP5646390B2 (en) 2014-12-24
US7802076B2 (en) 2010-09-21
CN1977241A (en) 2007-06-06
GB0619968D0 (en) 2006-11-29
DE112005003852A5 (en) 2012-10-25
JP2011165216A (en) 2011-08-25
DE112005003852B4 (en) 2016-05-04
CN1977241B (en) 2011-08-03
DE112005001277T5 (en) 2007-05-16

Similar Documents

Publication Publication Date Title
WO2006007193A1 (en) Method and apparatus to vectorize multiple input instructions
US6128614A (en) Method of sorting numbers to obtain maxima/minima values with ordering
US8078828B1 (en) Memory mapped register file
US8225076B1 (en) Scoreboard having size indicators for tracking sequential destination register usage in a multi-threaded processor
CN1045024C (en) Method of Improving Instruction Scheduling Efficiency in Superscalar Processor System
JPH0778738B2 (en) Digital computer system
US6742013B2 (en) Apparatus and method for uniformly performing comparison operations on long word operands
US6279102B1 (en) Method and apparatus employing a single table for renaming more than one class of register
US6036350A (en) Method of sorting signed numbers and solving absolute differences using packed instructions
US6219781B1 (en) Method and apparatus for performing register hazard detection
US20190187988A1 (en) Processor load using a bit vector to calculate effective address
US6272676B1 (en) Method and apparatus for finding loop-level parallelism in a pointer based application
US6871343B1 (en) Central processing apparatus and a compile method
US6857066B2 (en) Apparatus and method to identify the maximum operating frequency of a processor
US6889314B2 (en) Method and apparatus for fast dependency coordinate matching
US10552150B2 (en) Efficient conversion of numbers from database floating point format to binary integer format
US20050138339A1 (en) Method for and a trailing store buffer for use in memory renaming
IL301192A (en) Execution of a conditional statement using an arithmetic and/or bit unit
US7013366B2 (en) Parallel search technique for store operations
US20040128475A1 (en) Widely accessible processor register file and method for use
US6185674B1 (en) Method and apparatus for reconstructing the address of the next instruction to be completed in a pipelined processor
US20190361703A1 (en) Method and apparatus for renaming source operands of instructions
US12223324B2 (en) Methods and apparatus for providing mask register optimization for vector operations
US20130046961A1 (en) Speculative memory write in a pipelined processor
US20190073218A1 (en) Fast reuse of physical register names

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

WWE Wipo information: entry into national phase

Ref document number: 0619968.1

Country of ref document: GB

Ref document number: 0619968

Country of ref document: GB

WWE Wipo information: entry into national phase

Ref document number: 1120050012778

Country of ref document: DE

WWE Wipo information: entry into national phase

Ref document number: 2007518079

Country of ref document: JP

WWE Wipo information: entry into national phase

Ref document number: 200580021279.0

Country of ref document: CN

RET De translation (de og part 6b)

Ref document number: 112005001277

Country of ref document: DE

Date of ref document: 20070516

Kind code of ref document: P

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
REG Reference to national code

Ref country code: DE

Ref legal event code: 8607