
WO2018149495A1 - A method and system to fetch multicore instruction traces from a virtual platform emulator to a performance simulation model - Google Patents

Info

Publication number
WO2018149495A1
WO2018149495A1 (PCT/EP2017/053462)
Authority
WO
WIPO (PCT)
Prior art keywords
instructions
instruction
block
code
virtual platform
Prior art date
Application number
PCT/EP2017/053462
Other languages
French (fr)
Inventor
Ori Chalak
Shlomo PONGRATZ
Haibin Wang
Zuguang WU
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to PCT/EP2017/053462 priority Critical patent/WO2018149495A1/en
Priority to CN201780039897.0A priority patent/CN109690536B/en
Publication of WO2018149495A1 publication Critical patent/WO2018149495A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/30Circuit design
    • G06F30/32Circuit design at the digital level
    • G06F30/33Design verification, e.g. functional simulation or model checking
    • G06F30/3308Design verification, e.g. functional simulation or model checking using simulation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3457Performance evaluation by simulation
    • G06F11/3461Trace driven simulation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/30Circuit design
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/30Circuit design
    • G06F30/32Circuit design at the digital level
    • G06F30/33Design verification, e.g. functional simulation or model checking
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G06F11/3414Workload generation, e.g. scripts, playback
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2115/00Details relating to the type of the circuit
    • G06F2115/10Processors

Definitions

  • the present invention in some embodiments thereof, relates to a method for simulating a hardware processor design, and more specifically but not exclusively, to a method to simulate a multi-core processor design by a virtual platform emulator and a multi-core pipeline model.
  • CPU central processing unit
  • IPC instructions per cycle
  • L2 layer 2
  • OS operating system
  • MIPS million instructions per second
  • a model may be created to simulate the performance of the design.
  • the performance goals of a CPU may be measured with a benchmark, comprising an executable program code that when executed generates processing loads on a CPU and automatically measures various performance metrics. Predicting the eventual performance of a manufactured CPU may be accomplished by a model of the CPU executing a benchmark.
  • a model may be a software code executing on a computing processor that simulates the behavior of the modeled CPU.
  • a CPU benchmark program code may be executed by a CPU model. The degree that a model accurately predicts a CPU performance is impacted by how the model executes flush events and branch mis-prediction.
  • a flush event occurs when the CPU detects an error, for example reading a calculated value from a register before the instruction to write the calculated value has completed, referred to in the art as a read-after-write (RAW) hazard.
  • RAW read-after-write
  • a branch mis-prediction occurs when a branch instruction is dependent on a calculation, and the processor incorrectly predicts the outcome of the branch prior to completing the calculation.
  • ASIM and ZSIM are simulators, but they do not run on unmodified general purpose OS and do not accurately simulate multithreaded and/or multi-process benchmarks.
  • SimOS may run applications on multiple OS but is difficult to adapt to a specific CPU design.
  • SimpleScalar supports multithread and multi-process benchmarks, however the results of benchmarks may not reflect the performance of the modeled CPU.
  • a system for simulating a multicore processor design comprising: an input/output interface, a processor, a virtual platform emulator, a performance simulation model comprising at least one pipeline model, wherein the input/output interface is adapted to receive code instructions comprising a plurality of instructions blocks.
  • the processor is adapted to execute a code for instructing the virtual platform emulator to emulate an execution of an instructions block of the plurality of instructions blocks to generate a stream of a plurality of block derived code instructions based on the instructions block.
  • a mis-prediction branch in an instruction branch of the instructions block is detected, instructing the virtual platform emulator to add a plurality of dummy code instructions to the stream.
  • the stream of a plurality of block derived code instructions is stored in a memory or cache of the virtual platform emulator.
  • This aspect provides the advantages of a system for simulating a processor design including real time processing delays caused by branch mis-prediction and flushing of a pipeline.
  • a method for simulating a processor design comprising: receiving code instructions comprising a plurality of instructions blocks, and instructing a virtual platform emulator to emulate an execution of an instructions block of the plurality of instructions blocks to generate a stream of a plurality of block derived code instructions based on the instructions block.
  • a mis-prediction branch in an instruction branch of the instructions block is detected instructing the virtual platform emulator to add a plurality of dummy code instructions to the stream.
  • the processor is adapted to instruct the virtual platform emulator by a plurality of API instructions.
  • This implementation form provides the advantage of providing a defined set of function calls to enable a virtual platform emulator to supply traces of instruction execution to a pipeline model.
  • the processor is adapted to execute the code for each of a plurality of emulated cores of an emulated multicore processor in parallel.
  • the processor is adapted to maintain a history list of fetched instructions for each of the plurality of emulated cores or each of a plurality of hardware threads, upon detecting the flush pipeline event, move a location pointer or an index to point on an oldest flushed instruction in the history list, acquire new fetched instructions from the history list, progress the fetched instructions pointer to a next instruction in the history list, and upon reaching the end of the history list, returning to a normal operation wherein instructions are fetched from the virtual platform emulator.
  • This implementation provides the advantages of enabling instructing flushing a pipeline model of loaded instructions, and fetch new instructions with minimal latency by means of maintaining a history list of fetched instructions.
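The history-list mechanism described above can be sketched as a small class. This is an illustrative sketch only; the names `HistoryFetcher`, `fetch`, and `flush` are assumptions, not the patent's API, and the emulator is stood in for by a plain Python iterator.

```python
class HistoryFetcher:
    """Keeps a per-core history of fetched instructions so a pipeline
    flush can replay them without returning to the emulator."""

    def __init__(self, emulator_stream):
        self._emulator = iter(emulator_stream)  # stand-in for the VPE
        self._history = []                      # instructions fetched so far
        self._replay_index = None               # None => normal operation

    def fetch(self):
        # While replaying, serve instructions from the history list.
        if self._replay_index is not None:
            instr = self._history[self._replay_index]
            self._replay_index += 1
            if self._replay_index == len(self._history):
                self._replay_index = None       # end of history: back to normal
            return instr
        # Normal operation: fetch from the emulator and record in history.
        instr = next(self._emulator)
        self._history.append(instr)
        return instr

    def flush(self, oldest_flushed):
        # Move the pointer to the oldest flushed instruction in the history.
        self._replay_index = self._history.index(oldest_flushed)
```

After a flush, fetches are served from the history list until it is exhausted, at which point normal fetching from the emulator resumes, matching the "returning to a normal operation" step above.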
  • when the mis-prediction branch is detected, the system is adapted to enter into a sandbox mode wherein at least one false instruction for execution and committing in the pipeline model is sent by the virtual platform emulator, and wherein, upon identification of a mis-prediction, the system is adapted to instruct a rollback at a proxy layer of the virtual platform emulator to the instruction branch for taking a correct branch decision.
  • the plurality of previously executed instructions comprises at least one false instruction for execution without committing in the pipeline model.
  • the processor is adapted to execute the code for instructing a flushing of the pipeline model in response to identification of branch mis-prediction and to instruct a rollback at a proxy layer of the virtual platform emulator to the instruction branch.
  • the processor is further adapted to update a dataset of instructions flushed from the plurality of block derived code instructions and to instruct the virtual platform emulator to add instructions from the dataset of instructions as the plurality of previously executed code instructions.
  • the processor is adapted to instruct said virtual platform emulator to emulate execution of an instruction block in response to said instruction request from said pipeline model exclusively when said requested instruction is a member of said instruction block.
  • the virtual platform emulator comprises a scheduler adapted to schedule processing of a next instructions block, wherein the scheduler is adapted to instruct said virtual platform emulator to emulate execution of the next instructions block when an instruction request received from said pipeline model comprises a code instruction that is a member of said next instructions block.
  • FIG. 1 is a flowchart of a method for simulating a processor design, according to some embodiments of the present invention.
  • FIG. 2 is a schematic illustration of an exemplary system for simulating a processor design, according to some embodiments of the present invention.
  • FIG. 3A is a schematic diagram representing the connections between a virtual platform emulator and a pipeline model, according to some embodiments of the present invention.
  • FIG. 3B is a schematic diagram representing an application programming interface, according to some embodiments of the present invention.
  • FIG. 3C is a schematic diagram representing a pipeline model fetching of code instructions from a virtual platform emulator, according to some embodiments of the present invention.
  • FIG. 3D is a schematic illustration of scheduling execution of code instructions on multiple cores according to some embodiments of the present invention.
  • FIG. 4A is a schematic diagram representing the state of a pipeline model during a branch mis-prediction, according to some embodiments of the present invention.
  • FIG. 4B is a schematic illustration of pipeline model sequentially fetching instructions from a virtual platform emulator, according to some embodiments of the present invention.
  • FIG. 4C is a schematic diagram representing the program counter and sequence of instructions executed in a pipeline model and a virtual platform emulator when a branch mis-prediction occurs, according to some embodiments of the present invention.
  • FIG. 5A is a schematic diagram representing the state of program counter and core pipeline of Simulation Emulation system during a flush event, according to some embodiments of the present invention.
  • FIG. 5B is a schematic diagram representing a pipeline model before and after a flush event, according to some embodiments of the present invention.
  • FIG. 5C is a schematic diagram representing the program counter and sequence of instruction execution in a pipeline model and a virtual platform emulator when a flush event occurs, according to some embodiments of the present invention.
  • FIG. 6 is a schematic diagram of multiple pipeline model cores controlling multiple virtual platform emulator cores, according to some embodiments of the present invention.
  • the present invention in some embodiments thereof, relates to a method for simulating a hardware processor design, and more specifically but not exclusively, to a method to simulate a multi-core processor design by a virtual platform emulator and a multi-core pipeline model.
  • PSM performance simulation models
  • Modern CPUs may comprise single or multiple cores where each core may comprise a single or multiple hardware threads, and application benchmarks may need to run on multiple software threads and/or multiple processes. This in turn requires running the benchmark under an Operating System (OS).
  • OS Operating System
  • PSM are usually unable to run unmodified benchmarks under unmodified OS, which would require the PSM to implement very accurate and detailed functional behavior of the modeled CPU, for example memory management, communications management over networking interfaces, and the like.
  • VPE virtual platform emulators
  • QEMU virtual platform emulators
  • VPE code granularity
  • the smallest code granularity of a VPE is a block of instructions, comprising a set of instructions up to a branch or a maximum size.
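The block granularity described above can be sketched as a simple splitter: a block closes at a branch or at a maximum size. This is a hypothetical illustration; the `is_branch` predicate and the `max_block` limit are assumptions for the example, not values from the patent.

```python
def split_into_blocks(instructions, is_branch, max_block=4):
    """Group an instruction stream into blocks, each ending at a
    branch instruction or at the maximum block size."""
    blocks, current = [], []
    for instr in instructions:
        current.append(instr)
        # A block closes at a branch instruction or at the size limit.
        if is_branch(instr) or len(current) == max_block:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)  # trailing partial block
    return blocks
```

A VPE scheduler, as noted below, may then assign each resulting block to a single emulated core.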
  • a VPE scheduler may assign execution of an entire block of instructions to a single core, and schedule the next block of instructions to a different core.
  • the VPE is not adapted to guarantee any order of code instruction execution among multiple cores, and once in a while, the virtual platform switches to executing instruction blocks of a different core.
  • future implementations of virtual platforms may execute different instructions of different cores in parallel on different threads.
  • Multi-core CPUs have different scheduler code granularity, and may schedule the individual instructions from a block to multiple cores. For this reason it is difficult for a VPE to simulate multiple core CPUs.
  • VPE may not simulate core pipeline operations, for example branch mis-prediction and pipeline flush.
  • a block of code refers to groups of code instructions, for example a group of all code instructions until a branch instruction is encountered.
  • a PSM is a computer program that when executed on a processor simulates one or more hardware threads, referred to herein as cores, of a CPU on a clock cycle basis. Each core of a CPU may execute in parallel a separate software thread.
  • a PSM may be used for testing and benchmarking the performance of a modeled CPU design, including for example modeled CPU behavior during a misprediction and/or a flush event.
  • Mis-prediction of branches occurs in a CPU when a core mis-predicts the result of a calculation that determines a branch instruction.
  • the CPU fetches instructions ahead of execution. When a branch instruction is encountered, the following instruction depends on which branch is executed.
  • a CPU may predict the outcome of the branch instruction based on historical precedent and/or any other method.
  • the CPU identifies that a mis-prediction has occurred after completing execution of the branch calculation.
  • the CPU resets the program counter (PC) to the address of the correct branch, the correct instructions are fetched into the pipeline, and any incorrectly fetched instructions in the pipeline are not committed.
  • the state of the core including values of registers, is reset, or rolled back, to the state previous to executing incorrect branch instructions.
  • a flush event may result when a CPU detects error events, for example a RAW hazard, a Wait for Interrupt (WFI) instruction committed, an out of order read that executes prior to an earlier write where both access the same address, and the like.
  • a RAW hazard may occur when a write instruction to a register precedes, by one clock cycle, a read instruction from the same register.
  • each core pipeline performs the following five operations for each instruction during five consecutive clock cycles: Instruction Fetch, Instruction Decode, Execute, Memory access, and Register write back.
  • a write instruction writes a value to the register (Register write back) in the fifth clock cycle.
  • a read instruction that immediately follows the write instruction will read the register (Memory access) also in the fifth clock cycle, potentially reading an incorrect value from the register.
  • the read instruction is re-fetched, or re-played, in order to execute the correct read operation.
  • the instructions that are already in the core pipeline may be incorrect due to the RAW, so those instructions are not committed and/or are flushed.
  • the CPU may store recently fetched instructions in a short term memory or cache, for example L1 Cache and/or L2 Cache, to be available for re-fetching in the event of a flush.
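The five-stage timing above explains why a read immediately following a write to the same register is hazardous: both touch the register in the same cycle. A minimal sketch of that arithmetic, assuming one instruction issued per cycle (the function names are illustrative):

```python
# Stage order from the text: each instruction advances one stage per cycle.
STAGES = ["fetch", "decode", "execute", "memory", "writeback"]

def stage_cycle(issue_cycle, stage):
    # An instruction issued at cycle c reaches stage k at cycle c + k.
    return issue_cycle + STAGES.index(stage)

def raw_hazard(write_issue, read_issue):
    # Hazard if the read accesses the register (Memory access stage) no
    # later than the write commits it (Register write back stage).
    return stage_cycle(read_issue, "memory") <= stage_cycle(write_issue, "writeback")
```

For a write issued at cycle 1, write-back lands in cycle 5; a read issued at cycle 2 reaches its memory access also in cycle 5, reproducing the fifth-clock-cycle collision described above.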
  • a CPU fetches a stream of individual instructions and assigns each instruction to a processing core.
  • a PSM may model a single core of a multi-core CPU by processing individual instructions, and/or multiple cores of a CPU.
  • performance simulation of a CPU must model the real-time behavior, including fetching individual instructions to multiple cores that execute in parallel, recovering from branch mis-predictions, recovering from a flush event, and the like.
  • VPEs and PSMs each offer a subset of capabilities; neither provides a complete simulation of CPU performance.
  • a VPE may fetch and execute code instructions in blocks, which does not allow modeling parallel core processing. VPEs lack the ability to model branch mis-prediction, pipeline flushes, and individual code instruction processing.
  • PSMs do not emulate VPE services, for example memory management, power management, device management, register management, and the like. PSMs also lack the flexibility and scalability to run unmodified large scale OS and applications which limits the ability to test whether the CPU model meets design goals when running a variety of OS.
  • a VPE and a PSM are combined by means of a proxy layer to form an integrated model that simulates performance of a modeled CPU.
  • the integrated model may execute unmodified benchmarks under unmodified OS, fetch and simulate execution of individual code instructions, and simulate multiple core pipeline timing behavior including branch mis-prediction and flush events.
  • the proxy layer may provide function calls from an application programming interface (API) that enable interoperation between the VPE and PSM, for example API function calls to fetch single instructions from the VPE to the PSM, to control the progress of the VPE by the PSM, to synchronize VPE operations with PSM modeling of flush events and branch mis-prediction, and the like.
  • API application programming interface
  • the VPE comprises a scheduler for determining progressing from processing one block of code instructions to processing a next block of code instructions, and the scheduler is turned off and replaced with a modified scheduler.
  • the modified scheduler determines progressing to processing a next block of code instructions according to receiving an instruction fetch request from the PSM comprising a code instruction that is a member of the next instruction block.
  • the VPE scheduler is modified to process an entire block of code instructions and only to proceed to a next block of code instructions when a fetch request for an instruction from the next block is received from the PSM.
  • the PSM controls the progress of the VPE according to instruction fetch requests.
  • the VPE responds to an instruction fetch request with trace information generated by executing the instruction on the VPE. Trace information may comprise for example an opcode, a virtual and/or physical program counter address, a virtual and/or physical program counter of a next instruction, a virtual and/or physical address for loading and/or storing an instruction, and the like.
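The trace information listed above could be carried in a simple record type. The field names below are assumptions based on that list; the patent does not define a concrete structure.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InstructionTrace:
    """Per-instruction trace returned by the VPE for a fetch request."""
    opcode: str
    virtual_pc: int
    physical_pc: int
    next_virtual_pc: int
    next_physical_pc: int
    mem_virtual_addr: Optional[int] = None   # load/store address, if any
    mem_physical_addr: Optional[int] = None
```

The PSM only needs this record, not the VPE's internal state, to advance its pipeline model by one instruction.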
  • the integrated platform may simulate a branch mis-prediction by the PSM requesting an incorrect instruction as a result of a mis-prediction, and the VPE enters a sandbox mode where dummy instructions are fetched to the PSM and progress according to instruction fetch requests is suspended until the PSM corrects the mis-prediction and requests a correct instruction.
  • the VPE is prepared to fetch correct code instructions as soon as the PSM detects the mis-prediction.
  • the integrated platform may simulate a flush event by the PSM requesting the VPE to re-fetch recent instructions and ceasing to commit instructions in the core pipeline where the flush event occurred.
  • the PSM simulates the behavior of a CPU core.
  • a VPE may be a software code that executes on a host CPU that emulates an electronic system comprising a modeled CPU and peripheral devices such as virtual and physical memory, peripheral input-output devices such as Network Interface Card (NIC), block device, keyboards and/or screens, and the like.
  • the VPE may emulate the modeled CPU at a level of detail that an executable software code comprising machine code executable on the modeled CPU may execute on the VPE without any modification. For example, a benchmark software code for testing CPU performance may be executed by a VPE.
  • a VPE provides an interface for executing machine code comprising a virtual CPU, including for example multiple processor cores, registers, power management, virtual and physical memories, page table, interconnection buses, interrupt model, network connections, and the like.
  • a VPE may be OS independent, where performance goals of the CPU may be tested with multiple OSs.
  • the VPE may be executed on a host computer CPU where the CPU that the VPE emulates may be completely different and unrelated to the host computer CPU.
  • When a code executes on a VPE, the VPE translates the executable code instructions of the modeled CPU into machine code native to the host computer CPU, causes the native code to be executed, and updates the modeled CPU state according to the results of the execution.
  • While a VPE may emulate the CPU architecture and interface to an executable code and/or operating system, it does not have the ability to model CPU micro-architecture components such as pipelines and caches, or behavior when flush events or branch mis-predictions occur, and therefore has limited utility for predicting the performance of a CPU design.
  • the progress of the VPE is controlled by the PSM in the following manner: in response to an instruction fetch request from the PSM, the VPE fetches a block of executable instructions, for example from a memory in a host computer, processes the entire block, and returns the requested instruction with trace information to the PSM.
  • the processing comprises executing the instructions in the VPE emulated CPU by translating the executable instructions into machine code for the host computer CPU to execute, and storing the instruction trace of the emulated execution, comprising all information needed by the PSM to execute each of the instructions.
  • Subsequent instructions from the same block are supplied according to fetch requests from the PSM.
  • the VPE does not proceed to another block of code instructions until the PSM requests an instruction from another block.
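The fetch-driven progression described above can be sketched as follows: the VPE processes a whole block eagerly, serves individual traces from it, and only advances when the PSM requests an instruction outside the current block. All names are illustrative; blocks are represented here as precomputed lists of (pc, trace) pairs rather than emulated execution.

```python
class FetchDrivenVPE:
    """VPE stand-in whose progress is controlled by PSM fetch requests."""

    def __init__(self, blocks):
        # blocks: list of lists of (pc, trace) pairs, one list per block
        self._blocks = blocks
        self._current = {}          # pc -> trace for the current block
        self._block_index = -1

    def _emulate_next_block(self):
        # A real VPE would translate and execute the block here;
        # this sketch just indexes the precomputed traces.
        self._block_index += 1
        self._current = dict(self._blocks[self._block_index])

    def fetch(self, pc):
        if pc not in self._current:
            # Requested instruction belongs to the next block:
            # only now does the VPE proceed.
            self._emulate_next_block()
        return self._current[pc]
```

This mirrors the modified-scheduler behavior: block-to-block progress happens exclusively in response to a fetch request from the PSM.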
  • CPU branch mis-prediction is modeled by the PSM and VPE in the following manner: when PSM mis-predicts a branch instruction, as described above, the resulting instruction fetch requests to the VPE will be incorrect.
  • the VPE, which as described above had already executed all instructions in the block, identifies the requested instruction as incorrect, and enters a mode of operation, referred to herein as Sandbox mode.
  • Sandbox mode comprises the VPE sending dummy, no operation (NOP), and/or fake instructions, referred to herein as dummy instructions, to the PSM in response to the incorrect fetch request and suspending progressing according to fetch requests from the PSM.
  • the dummy instructions are sent in place of the correct instructions in order to prevent the correct instructions from changing the state of VPE PC and/or registers.
  • the VPE exits Sandbox mode when the correct instruction fetch request is received from the PSM.
  • the branch instruction that was mis-predicted is calculated, whereby the PSM detects the mis-prediction.
  • the PSM then performs a function call to the proxy layer API to recover from the mis-prediction.
  • the API function referred to herein as rollback(), informs the VPE to roll back the program counter from the dummy instructions to the correct branch instruction address, and to fetch the correct instruction.
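The Sandbox mode and rollback() recovery described above amount to a small state machine, sketched below. The class and method names are assumptions, NOP stands in for the dummy instructions, and the executed block is modeled as a pc-to-trace mapping.

```python
NOP = "nop"  # stand-in for the dummy/NOP instructions sent in Sandbox mode

class SandboxVPE:
    def __init__(self, traces):
        self._traces = traces       # pc -> trace for the executed block
        self.in_sandbox = False

    def fetch(self, pc):
        if pc not in self._traces:
            # Incorrect request after a mis-predicted branch: enter Sandbox
            # mode and send a dummy instruction so VPE state is untouched.
            self.in_sandbox = True
            return NOP
        self.in_sandbox = False     # correct request: exit Sandbox mode
        return self._traces[pc]

    def rollback(self, branch_pc):
        # PSM detected the mis-prediction: roll the PC back and resume
        # fetching from the correct branch address.
        self.in_sandbox = False
        return self.fetch(branch_pc)
```

In this sketch, dummy instructions flow until rollback() (or a correct fetch) arrives, matching the suspend-until-corrected behavior above.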
  • CPU flush events are modeled by the integrated model in the following manner: when the PSM detects a hazard, for example a RAW, a function call is made to the proxy layer API, referred to herein as replay(), which re-fetches a set of instructions that were recently fetched.
  • the PSM does not commit the instructions already in the pipeline that are associated with the hazard, and the re-fetched instructions are re-executed and committed. For example, by re-fetching the read operation in a RAW hazard, the write instruction has sufficient time to be completely processed, thereby eliminating the hazard.
  • the proxy layer between the PSM and VPE may be a set of executable code instructions executing on the host computer CPU, for example a Simulation Emulation system 200 as described below.
  • the proxy layer may comprise API function calls, for example a rollback() function that when called by the PSM signals the VPE to roll back the PC to a corrected branch instruction, a replay() function call that when called by the PSM re-fetches a set of previously fetched instructions, and the like.
  • the proxy layer may also provide memory management services to the PSM, for example storing previously fetched instructions in a non-volatile memory of the host computer to support the replay() function call, managing physical and virtual memory of the host computer, and the like.
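Putting the pieces together, the proxy layer surface might look like the facade below. All names and signatures are illustrative assumptions; the patent describes rollback() and replay() calls but does not publish concrete signatures, and the VPE is stubbed out.

```python
class ProxyLayer:
    """Mediates between PSM fetch requests and the VPE, and stores
    previously fetched instructions to support replay()."""

    def __init__(self, vpe):
        self._vpe = vpe
        self._fetched = []           # previously fetched traces

    def fetch(self, pc):
        trace = self._vpe.fetch(pc)
        self._fetched.append(trace)
        return trace

    def rollback(self, branch_pc):
        # Signal the VPE to roll the PC back to the corrected branch.
        return self._vpe.rollback(branch_pc)

    def replay(self, count):
        # Re-supply the last `count` previously fetched instructions
        # after a flush event.
        return self._fetched[-count:]
```

A PSM would call fetch() in its normal fetch stage, replay() on a detected hazard, and rollback() on a detected branch mis-prediction.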
  • the present invention in some embodiments thereof, provides a number of advantages over the existing art.
  • the integration of VPE and PSM provides microprocessor designers with a tool that decouples functionality from timing by emulating a modeled CPU code instruction interface, is compatible with multiple OS, and models clock cycle level timing of multi-core pipeline activities.
  • the throughput and operational behavior of multiple pipelines may be modeled for branch mis-prediction, pipeline flush, and re-fetching of flushed instructions.
  • the present invention may be a system, a method, and/or a computer program product.
  • the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
  • the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
  • the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
  • the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • LAN local area network
  • WAN wide area network
  • electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
  • FPGA field-programmable gate arrays
  • PLA programmable logic arrays
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the block may occur out of the order noted in the figures.
  • two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • FIG. 1 is a flowchart schematically representing Simulation Emulation method 100 for simulating a CPU, including processing of branch mis-prediction and flush events, according to some embodiments of the current invention.
  • Simulation Emulation method 100 may be executed by code instructions executing on a CPU, for example by code instructions executing on processor(s) 204 of Simulation Emulation system 200 as described below in FIG. 2.
  • Simulation Emulation method 100 begins when a request from a PSM to fetch an executable instruction is received by a VPE; in response, the VPE processes a block of instructions as described above and returns the requested instruction to the PSM.
  • code instructions from PSM 212 executing on processor 204 may instruct sending the request and receiving the instructions.
  • code instructions from VPE 211 executing on processor 204 may instruct responding to the request.
  • When the fetch request comprises a branch mis-prediction, the VPE proxy layer returns dummy instructions in place of the requested instructions, until the PSM requests fetching the correct instruction, as described below. When a flush event is detected, a request to re-fetch the recent instructions is sent, as described below.
  • Simulation Emulation method 100 comprises a method for multiple cores operating in parallel.
  • a PSM may model multiple cores, where each modeled core may control the progress of a corresponding VPE core, as described below in FIG. 6.
  • Simulation Emulation system 200 comprises an input/output (I/O) interface 202, processor(s) 204, and storage 208.
  • I/O input/output
  • Simulation Emulation system 200 is adapted to receive code instructions that are executable on a modeled CPU, for example from user device 260 as described below, and to simulate the performance of the Modeled CPU when executing the received code instructions, for example by executing on processor(s) 204 code in storage 208.
  • Simulation Emulation system 200 may comprise for example a server, a desktop computer, an embedded computing system, an industrial computer, a ruggedized computer, a laptop, a cloud computer, a private cloud, a public cloud, a hybrid cloud, and/or any other type of computing system.
  • Optionally, Simulation Emulation system 200 comprises a virtual machine (VM) in place of I/O 202, processor(s) 204, and storage 208.
  • I/O 202 may include one or more input interfaces, for example a Network Interface Card (NIC), a block device, a keyboard, a soft keyboard, a voice to text system, and/or any other data input interface.
  • I/O 202 may comprise one or more output interfaces, for example a screen, a touch screen, video display, and or any other visual display device.
  • Processor(s) 204 may comprise one or more hardware processors, a multi-core processor, and/or any other type of CPU.
  • Storage 208 may include one or more non-transitory persistent storage devices, for example, a hard drive, a Flash array and the like.
  • Simulation Emulation system 200 is connected to a network 230 via I/O 202.
  • I/O 202 may comprise a network interface card (NIC), a wireless router, and/or any other type of network interface adapted to communicate with network 230.
  • Network 230 may be any type of data network, for example, a local area network (LAN), an Ethernet LAN, a fiber optic LAN, a digital subscriber line (DSL), a wireless LAN, a broadband connection, an Internet connection using an Internet Service Provider (ISP) and/or any other type of computer network.
  • Network 230 may employ any type of data networking protocols, including transport control protocol and internet protocol (TCP/IP), user datagram protocol (UDP), and the like.
  • user device 260 may be connected to Simulation Emulation system 200, for example via network 230.
  • User device 260 may be a smartphone, a computer, and/or any other computing platform.
  • User device 260 may be adapted to transmit to Simulation Emulation system 200 via network 230 a plurality of code instructions to execute on a Modeled CPU, for example an OS and benchmark application for testing performance of a modeled CPU design.
  • Simulation Emulation method 100 may be executed by processor(s) 204 executing code from one or more software modules in storage 208, for example Process Manager 210, Virtual Platform Emulator 211, Pipeline Model 212, and/or proxy layer 213.
  • a software module refers to a plurality of program instructions stored in a non-transitory medium such as the storage 208 and executed by a processor such as the processor(s) 204.
  • Simulation Emulation method 100 begins by sending a request to fetch a code instruction; for example, code instructions from PSM 212 executing on processor(s) 204 may instruct sending a request to fetch a code instruction from VPE 211.
  • the fetch request, and/or all messaging and communication between PSM 212 and VPE 211 may be a proxy layer API function call, comprising code instructions from proxy layer 213 executing on processor(s) 204.
  • Proxy layer 213 may comprise code instructions that, when executed on processor(s) 204, instruct receiving and sending messages, for example a request to fetch a code instruction, a response to the request comprising the requested code instruction, and the like.
  • FIG. 3A is a schematic diagram representing the connections between code modules VPE 211, PSM 212, and proxy layer 213, according to some embodiments of the invention.
  • multiple virtual cores in VPE 211 may communicate with corresponding pipeline cores 330 in PSM 212 via proxy layer 213.
  • the communication is implemented by code instructions in proxy layer 213 executing on processor(s) 204, as described below in FIG. 3B.
  • the virtual platform emulator 211 comprises CPU cores 320, memory 214, and devices 216.
  • FIG. 3B is a schematic diagram representing the services provided by proxy layer 213 to PSM 212 and VPE 211, according to some embodiments of the invention.
  • Providing API function calls for sending messages, communication, data, instruction fetch requests, and/or requested instructions between PSM 212 and VPE 211.
  • Storing instruction traces comprising making a temporary copy of all instructions and corresponding trace that were fetched by PSM 212, for example storing traces in a local memory cache and/or first in first out (FIFO) record in storage 208.
  • Executing Replay() API function calls comprising re-fetching previously fetched instructions, for example fetching instructions from the instruction trace described above and/or from VPE 211.
  • Managing virtual and/or physical PCs, for example directing Proxy Layer 213 to change the value of a respective PC according to a Replay() function call by PSM 212.
  • Managing virtual and/or physical storage addresses, for example receiving a request from VPE 211 to fetch a block of code according to a virtual address, and translating the virtual address into a physical address in storage 208.
  • Managing power states, for example instructing VPE 211 and/or PSM 212 to cease operations upon detecting that a set of code instructions has been fully executed by VPE 211 and/or PSM 212.
  • Tracing packets, for example keeping a record of packets of data passed between VPE 211 and PSM 212.
  • Tracing disk I/O, for example keeping a record of content of storage 208 that is retrieved by VPE 211.
  • proxy layer 213 may comprise any additional code instructions that instruct any communication and/or interaction between PSM 212 and VPE 211.
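By way of a non-limiting illustration, two of the proxy layer services listed above, keeping a FIFO trace of fetched instructions and serving Replay() requests from that trace, may be sketched as follows. The class and method names (ProxyLayer, fetch, replay) are hypothetical and not part of the claimed embodiments:

```python
from collections import deque

# Illustrative sketch of proxy-layer trace storage and replay services.
class ProxyLayer:
    def __init__(self, vpe_fetch, history_depth=16):
        self._vpe_fetch = vpe_fetch                # callback into the VPE
        self._trace = deque(maxlen=history_depth)  # FIFO instruction trace

    def fetch(self, pc):
        # Forward the fetch request and keep a temporary copy of the trace.
        instr = self._vpe_fetch(pc)
        self._trace.append((pc, instr))
        return instr

    def replay(self, count):
        # Re-deliver the most recent instructions from the local trace
        # instead of re-emulating them in the VPE.
        return list(self._trace)[-count:]


# A toy program stands in for the VPE; dict.get plays the VPE fetch role.
proxy = ProxyLayer({10: "ADD", 11: "SUB", 12: "LOAD"}.get)
for pc in (10, 11, 12):
    proxy.fetch(pc)
replayed = proxy.replay(2)
```

A bounded deque is used here so that the trace naturally holds only the instructions that could still be in the PSM pipeline.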
  • FIG. 3C is a schematic diagram representing fetching of code instructions from VPE 211 to PSM 212, according to some embodiments of the present invention.
  • code instructions from PSM 212 executing on processor(s) 204 instruct sending a fetch instruction request to VPE 211.
  • code instructions from VPE 211 executing on processor(s) 204 instruct fetching a block of code instructions, processing the instructions in the block as described above, and as shown in 353 sending to PSM 212 the requested instruction.
  • PSM 212 requests and receives another instruction.
  • VPE 211 stores the traces of the generated code instructions until PSM 212 requests the corresponding code instruction.
  • The instructions to be fetched are identified according to a PC of the corresponding instruction, for example by code instructions from PSM 212 executing on processor(s) 204 specifying a PC for the requested instruction.
  • Code instructions from VPE 211 executing on processor(s) 204 instruct disabling a scheduler of VPE 211 and enabling a modified scheduler that progresses according to code fetch requests from PSM 212.
  • a scheduler may be code instructions in VPE 211 executing on processor(s) 204 that instruct the progress of each VPE core, and the modified scheduler instructs fetching blocks of code instructions to process according to fetch requests from PSM 212 as described above.
  • FIG. 3D is a schematic illustration of scheduling execution of code instructions on multiple cores, according to some embodiments of the invention.
  • Optionally, an entire block of code instructions is scheduled and executed on one VPE 211 core, followed by a next block of code instructions executed on another VPE 211 core.
  • PSM 212 cores execute code instructions individually or in smaller groups than a code block.
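By way of a non-limiting illustration, the block-level scheduling of FIG. 3D, in which the VPE emulates a whole instructions block on one core before the next block is emulated on another core, may be sketched as a round-robin assignment. The function name schedule_blocks is hypothetical:

```python
# Minimal sketch of block-level scheduling across emulated VPE cores,
# while PSM cores consume the resulting instructions one at a time.
def schedule_blocks(blocks, num_cores):
    """Assign whole instruction blocks to VPE cores in round-robin order."""
    assignment = {core: [] for core in range(num_cores)}
    for i, block in enumerate(blocks):
        assignment[i % num_cores].append(block)
    return assignment


blocks = [["ADD", "SUB"], ["LOAD", "STORE"], ["MUL", "DIV"]]
assignment = schedule_blocks(blocks, num_cores=2)
# Core 0 is assigned blocks 0 and 2; core 1 is assigned block 1.
```

Round-robin order is only one possible policy; the embodiments above require only that each block is emulated as a unit on a single VPE core.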
  • VPE 211 detects a branch mis-prediction of the code instruction requested by PSM 212.
  • code instructions from PSM 212 executing on processor(s) 204 that instruct comparing a predicted branch instruction with a calculation of a branch instruction, where a mis-match indicates a mis-prediction.
  • proxy layer 213 detects a mis-prediction by a mismatch between the PC of the requested instruction and the PC of the fetched instruction from VPE 211.
  • As shown in step 103, in response to the detected mis-prediction, VPE 211 optionally enters a Sandbox mode, as described above.
  • VPE 211 may respond to the fetch request and subsequent fetch requests from the PSM 212 by sending a dummy instruction, and exiting sandbox mode when a fetch request with the correct branch PC is received.
  • When the dummy instruction is fetched by PSM 212, it is input to the core pipeline. Prior to the commit stage of the pipeline, PSM 212 detects the mis-prediction as described above and performs the following actions to recover from the mis-prediction: the location of the branch mis-prediction is identified, and a recovery message is sent, for example as a proxy layer API rollback() function call, instructing a roll back to the branch point. PSM 212 does not commit the dummy instructions.
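By way of a non-limiting illustration, the Sandbox mode behavior described above, returning dummy instructions for mis-directed fetches until the correct branch target is requested, may be sketched as follows. The class name SandboxVPE and all field names are hypothetical:

```python
# Illustrative sketch of Sandbox mode: on a mis-directed fetch the VPE
# returns dummy instructions (never committed by the PSM) until the PSM
# requests the correct branch target, at which point sandbox mode exits.
class SandboxVPE:
    def __init__(self, program, correct_target):
        self.program = program            # dict: PC -> instruction
        self.correct_target = correct_target
        self.in_sandbox = False

    def fetch(self, pc):
        if self.in_sandbox:
            if pc == self.correct_target:
                self.in_sandbox = False   # rollback() restored the PC
                return self.program[pc]
            return "DUMMY"                # placeholder, never committed
        if pc not in self.program:        # mis-directed fetch detected
            self.in_sandbox = True
            return "DUMMY"
        return self.program[pc]


# Branch at PC 13 actually jumps to PC 80; the PSM mis-predicts and
# fetches PC 14 and 15 before rolling back to the correct target.
vpe = SandboxVPE({10: "ADD", 13: "BZ", 80: "TARGET"}, correct_target=80)
results = [vpe.fetch(pc) for pc in (13, 14, 15, 80)]
```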
  • FIG. 4A is a schematic diagram representing the state of PSM 212 PC during a branch mis-prediction, where each circle represents an instruction in a core pipeline, according to some embodiments of the current invention.
  • PSM 212 requests incorrect instructions to be fetched from VPE 211 due to a mis-prediction.
  • PSM 212 sets the PC to the correct branch instruction, and as shown in 403 the correct instruction is fetched.
  • FIG. 4B is a schematic illustration of PSM 212 sequentially fetching instructions from VPE 211, according to some embodiments of the current invention.
  • the thick arrows represent messages within PSM 212.
  • the medium arrows represent fetch requests from PSM 212.
  • the thin arrows represent fetch replies from VPE 211.
  • proxy layer 213 provides interoperability services to VPE 211 and PSM 212, as described above.
  • FIG. 4C is a schematic diagram representing the PC (program counter) and sequence of instruction execution in PSM 212 and VPE 211 when a branch mis-prediction occurs, according to some embodiments of the invention.
  • 471 represents the order of execution of instructions in a CPU with PC numbered from 10 to 81
  • 472 represents the order of execution of instructions in VPE
  • 473 represents the order of execution of instructions in a PSM 212 with PC numbered from 10 to 81 when a branch mis- prediction occurs
  • 474 represents the order of execution of committed instructions in PSM 212 after the rollback() API has been executed.
  • The instruction with PC 13 labeled "BZ LABEL" is a conditional branch instruction, and the following correct instruction is located at PC 80.
  • As shown in 472, code instructions from VPE 211 executing on processor(s) 204 instruct executing the instructions in the correct order, wherein the instruction at PC 80 is executed immediately after the branch instruction at PC 13.
  • As shown in 473, code instructions in PSM 212 executing on processor(s) 204 instruct mis-predicting the branch at PC 13, and instructions with PC 14, 15, and 16 are fetched from VPE 211.
  • code instructions in PSM 212 executing on processor(s) 204 instruct calling API function rollback(), as described above, resulting in fetching the correct instruction at PC 80, and not committing the incorrectly fetched instructions.
  • a flush pipeline event is generated and/or detected, for example by code instructions from PSM 212 executing on processor(s) 204 that instruct detection of a hazard.
  • PSM 212 generates one or more flush pipeline events during normal operations, and then detects the generated pipeline flush event.
  • the flush pipeline event may be the result of a branch mis-prediction and/or a pipeline hazard, for example a load/store data cache miss, a memory disambiguation for out of order or RAW hazard, a stalled pipeline, and the like.
  • Code instructions from PSM 212, code instructions from proxy layer 213, and/or code instructions from VPE 211 executing on processor(s) 204 may instruct the following when a flush event is detected by PSM 212:
  • PSM 212 calls the replay() API function, which requests to re-fetch, or replay, recently fetched instructions currently in the PSM 212 core pipeline.
  • Optionally, proxy layer 213 function replay() returns to PSM 212 the recently fetched instructions from a local cache, as described above.
  • Optionally, proxy layer 213 function replay() returns to PSM 212 the recently fetched instructions from VPE 211.
  • VPE 211 maintains a cache of recently fetched instructions which comprises the instructions currently in PSM 212 pipeline, for example a historical transactional memory stored in local cache and/or storage 208.
  • VPE 211 sends the re-fetched instructions to PSM 212.
  • PSM 212 does not commit the instructions in the pipeline associated with the hazard, and executes and commits the re-fetched instructions.
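By way of a non-limiting illustration, the flush recovery steps above, discarding the uncommitted instructions associated with the hazard and replaying them from a cache of recently fetched instructions in their original order, may be sketched as follows. The function name handle_flush is hypothetical:

```python
# Illustrative sketch of flush recovery: the PSM drops the uncommitted
# pipeline entries and re-inserts them from the fetch history, preserving
# the original execution order.
def handle_flush(pipeline, fetch_history, flushed_count):
    """Drop the last flushed_count uncommitted instructions and return
    the replayed instructions in their original execution order."""
    del pipeline[-flushed_count:]            # flushed, never committed
    replayed = fetch_history[-flushed_count:]
    pipeline.extend(replayed)                # re-fetched via replay()
    return replayed


pipeline = ["I10", "I11", "I12", "I13"]
history = ["I10", "I11", "I12", "I13"]      # cache of recent fetches
replayed = handle_flush(pipeline, history, flushed_count=2)
```

Because the replayed instructions come from the history cache, the VPE need not re-emulate the corresponding instructions block.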
  • Simulation Emulation method 100 is completed.
  • FIG. 5A is a schematic diagram representing the state of PC and core pipeline of Simulation Emulation system 200 during a flush event, according to some embodiments of the current invention.
  • each circle represents an instruction in PSM 212 core pipeline, for example by code instructions from PSM 212 executing on processor(s) 204.
  • a flush event is detected, as described above.
  • the curved arrow represents the result of the detected flush event, wherein the previously fetched instructions are re-fetched, as described above.
  • FIG. 5B is a schematic diagram representing a PSM 212 core pipeline before and after a flush event, according to some embodiments of the current invention.
  • each square represents a code instruction in a core pipeline, where code instructions proceed from left to right as they are processed by PSM 212.
  • The instruction that was most recently processed is referred to herein as the latest instruction.
  • The execution of instructions in the pipeline is not committed, and as a result of the replay() API function call, as described above, the latest instruction is re-fetched and is now at an earlier pipeline location.
  • FIG. 5C is a schematic diagram representing the PC and sequence of instruction execution in PSM 212 and VPE 211 when a flush event occurs, according to some embodiments of the invention.
  • 571 represents the PC of instructions in a core pipeline of a CPU with PC numbered from 10 to 81
  • 572 represents the PC of executed instructions in VPE 211 with PC numbered from 10 to 81
  • 573 represents the PC of instructions in a core pipeline of PSM 212 with PC numbered from 10 to 81 when a flush event occurs
  • 574 represents the PC of committed instructions in PSM 212 after the replay() API has been executed.
  • Code instructions in PSM 212 executing on processor(s) 204 instruct detecting the possible hazard, and by calling API function call replay() as described above, the most recent instructions, in this example with PC 13-16, are re-fetched.
  • Code instructions in PSM 212 executing on processor(s) 204 instruct not committing PC 13-16, and committing only the instructions PC 13-16 that were re-fetched.
  • FIG. 6 is a schematic diagram of multiple PSM core pipelines controlling multiple VPE 211 cores, according to some embodiments of the current invention.
  • Bold arrow lines represent the movement of instructions as they are retrieved by VPE 211 from code blocks and fetched to PSM 212.
  • PSM 212 comprises code instructions that when executed on processor(s) 204 instruct multiple CPU core pipeline models, where each core pipeline model controls progress of a corresponding VPE 211 core. For example, as shown in 601 and 602, PSM core "0" controls the progress of corresponding VPE 211 core "0", and as shown in 603 and 604, PSM 212 core "1" controls VPE 211 core "1".
  • Code instructions in process manager 210 when executed on processor(s) 204 may instruct assigning code instructions to individual PSM 212 core pipelines.
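By way of a non-limiting illustration, the per-core pairing of FIG. 6, in which each PSM core pipeline model controls the progress of exactly one corresponding VPE core, may be sketched as follows. The class names VpeCore and PsmCore are hypothetical:

```python
# Illustrative sketch of FIG. 6: each PSM pipeline model drives the
# progress of its paired VPE core through its own fetch requests.
class VpeCore:
    def __init__(self, trace):
        self._trace = iter(trace)   # instructions emulated for this core

    def step(self):
        # The VPE core only advances when its PSM core asks for more.
        return next(self._trace)


class PsmCore:
    def __init__(self, vpe_core):
        self.vpe_core = vpe_core    # one VPE core per pipeline model
        self.committed = []

    def cycle(self):
        # Fetch one instruction from the paired VPE core and commit it.
        self.committed.append(self.vpe_core.step())


# Two PSM cores each control their own VPE core, advancing in parallel.
cores = [PsmCore(VpeCore(["A0", "A1"])), PsmCore(VpeCore(["B0", "B1"]))]
for _ in range(2):
    for core in cores:              # one fetch per core per cycle
        core.cycle()
```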
  • The terms VPE and PSM are intended to include all such new technologies a priori.
  • The term "consisting essentially of" means that the composition or method may include additional ingredients and/or steps, but only if the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method.
  • For example, the term "a compound" or "at least one compound" may include a plurality of compounds, including mixtures thereof.
  • The description of ranges in a range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.


Abstract

A system and a method for simulating a multicore processor design is provided. The system comprises an input/output interface, a processor, a virtual platform emulator, and a performance simulation model comprising at least one pipeline model. The input/output interface receives code instructions comprising a plurality of instructions blocks. The processor executes a code for instructing the virtual platform emulator to emulate an execution of an instructions block of the plurality of instructions blocks to generate a stream of a plurality of block derived code instructions based on the instructions block. When a mis-prediction branch in an instruction branch of the instructions block is detected, the processor instructs the virtual platform emulator to add a plurality of dummy code instructions to the stream. When a flush pipeline event is detected the processor instructs the virtual platform emulator to add a plurality of previously executed instructions to the stream in an original execution order of the previously executed instructions, and in response to each of a plurality of sequential and independent instruction requests received from a pipeline model to sequentially and independently fetch and transfer each of the plurality of block derived code instructions and at least one of the plurality of dummy code instructions and the plurality of previously executed instructions for execution and committing in the pipeline model.

Description

A METHOD AND SYSTEM TO FETCH MULTICORE INSTRUCTION TRACES FROM A VIRTUAL PLATFORM EMULATOR TO A PERFORMANCE SIMULATION MODEL
BACKGROUND
The present invention, in some embodiments thereof, relates to a method for simulating a hardware processor design, and more specifically but not exclusively, to a method to simulate a multi-core processor design by a virtual platform emulator and a multi-core pipeline model.
Development of new microprocessors is an expensive and time-consuming process. This problem is even more challenging in multiple-core architectures, where overall performance of a single or multiple core hardware processor, referred to herein as a central processing unit (CPU), is dependent on multiple independent core processors. Each new CPU design is expected to meet performance goals and/or metrics, for example a specific number of instructions per cycle (IPC), a specific hit ratio of layer one (L1) and/or layer two (L2) cache, the ability to execute multiple operating systems (OS), how many million instructions per second (MIPS) are executed, the ability to run multiple CPU benchmark application programs, and other performance metrics.
In order to reduce the time and cost of CPU design, it is desirable prior to manufacturing the CPU to be able to predict whether the CPU design meets performance goals. In order to predict whether a proposed design will reach the stated goals, a model may be created to simulate the performance of the design.
The performance goals of a CPU may be measured with a benchmark, comprising an executable program code that when executed generates processing loads on a CPU and automatically measures various performance metrics. Predicting the eventual performance of a manufactured CPU may be accomplished by a model of the CPU executing a benchmark.
A model may be a software code executing on a computing processor that simulates the behavior of the modeled CPU. For example, a CPU benchmark program code may be executed by a CPU model. The degree that a model accurately predicts a CPU performance is impacted by how the model executes flush events and branch mis-prediction.
A flush event occurs when the CPU detects an error, for example reading a calculated value from a register before the instruction to write the calculated value has completed, referred to in the art as read-after-write (RAW) hazard. After a flush event is detected, the instructions already in the pipeline are flushed, meaning the results of executing the instructions are not committed by the CPU.
A branch mis-prediction occurs when a branch instruction is dependent on a calculation, and the processor incorrectly predicts the outcome of the branch prior to completing the calculation.
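By way of a non-limiting illustration, a branch mis-prediction under a simple "predict not-taken" policy (an assumed policy, used only for illustration) may be sketched as follows:

```python
# Toy illustration of branch mis-prediction: the predictor guesses the
# branch outcome before the condition register value is available.
def predict_not_taken():
    return False                       # predicted outcome: not taken

def resolve_branch(register_value):
    return register_value == 0         # actual outcome, known only later

predicted = predict_not_taken()
actual = resolve_branch(0)             # the branch is actually taken
mispredicted = predicted != actual     # mismatch reveals mis-prediction
```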
While individual existing models may provide a subset of critical features of CPU models, there is currently no single model that provides a complete solution.
A number of existing products provide partial answers to the challenges described here. For example, ASIM and ZSIM are simulators, but they do not run on unmodified general purpose OS and do not accurately simulate multithreaded and/or multi-process benchmarks. SimOS may run applications on multiple OS but is difficult to adapt to a specific CPU design. SimpleScalar supports multithread and multi-process benchmarks, however the results of benchmarks may not reflect the performance of the modeled CPU.
SUMMARY
It is an object of the present invention to provide a system, a computer program product, and a method for simulating a processor design.
The foregoing and other objects are achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.
According to a first aspect of the invention there is provided a system for simulating a multicore processor design, comprising: an input/output interface, a processor, a virtual platform emulator, a performance simulation model comprising at least one pipeline model, wherein the input/output interface is adapted to receive code instructions comprising a plurality of instructions blocks. The processor is adapted to execute a code for instructing the virtual platform emulator to emulate an execution of an instructions block of the plurality of instructions blocks to generate a stream of a plurality of block derived code instructions based on the instructions block. When a mis-prediction branch in an instruction branch of the instructions block is detected, instructing the virtual platform emulator to add a plurality of dummy code instructions to the stream. When a flush pipeline event is detected (in the instruction block) instructing the virtual platform emulator to add a plurality of previously executed instructions to the stream in an original execution order of the previously executed instructions, and in response to each of a plurality of sequential and independent instruction requests received from a pipeline model sequentially and independently fetching and transferring each of the plurality of block derived code instructions and at least one of the plurality of dummy code instructions and the plurality of previously executed instructions for execution and committing in the pipeline model.
Preferably, the stream of a plurality of block derived code instructions is stored in a memory or cache of the virtual platform emulator.
This aspect provides the advantages of a system for simulating a processor design including real time processing delays caused by branch mis-prediction and flushing of a pipeline.
According to a second aspect of the invention there is provided a method for simulating a processor design, comprising: receiving code instructions comprising a plurality of instructions blocks, and instructing a virtual platform emulator to emulate an execution of an instructions block of the plurality of instructions blocks to generate a stream of a plurality of block derived code instructions based on the instructions block. When a mis-prediction branch in an instruction branch of the instructions block is detected instructing the virtual platform emulator to add a plurality of dummy code instructions to the stream. When a flush pipeline event is detected instructing the virtual platform emulator to add a plurality of previously executed instructions to the stream in an original execution order of the previously executed instructions, and in response to each of a plurality of sequential and independent instruction requests received from a pipeline model sequentially and independently fetching and transferring each of the plurality of block derived code instructions and at least one of the plurality of dummy code instructions and the plurality of previously executed instructions for execution and committing in the pipeline model. This aspect provides the advantages of a method for simulating a processor design including real time processing delays caused by branch mis-prediction and flushing of a pipeline.
In a further implementation form of the first and/or second aspects, the processor is adapted to instruct the virtual platform emulator by a plurality of API instructions. This implementation form provides the advantage of providing a defined set of function calls to enable a virtual platform emulator to supply traces of instruction execution to a pipeline model.
In a further implementation form of the first and/or second aspects, the processor is adapted to execute the code for each of a plurality of emulated cores of an emulated multicore processor in parallel.
In a further implementation form of the first and/or second aspects, the processor is adapted to maintain a history list of fetched instructions for each of the plurality of emulated cores or each of a plurality of hardware threads, upon detecting the flush pipeline event, move a location pointer or an index to point on an oldest flushed instruction in the history list, acquire new fetched instructions from the history list, progress the fetched instructions pointer to a next instruction in the history list, and upon reaching to the end of the history list, returning to a normal operation wherein instructions are fetched from the virtual platform emulator. This implementation provides the advantages of enabling instructing flushing a pipeline model of loaded instructions, and fetch new instructions with minimal latency by means of maintaining a history list of fetched instructions.
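By way of a non-limiting illustration, the history-list mechanism of this implementation form, rewinding a pointer to the oldest flushed instruction and serving fetches from the history list until it is exhausted, may be sketched as follows. The class name HistoryFetcher and its members are hypothetical:

```python
# Sketch of the per-core history list: after a flush, an index rewinds to
# the oldest flushed instruction and fetches are served from the history
# list; when the list is exhausted, normal VPE fetching resumes.
class HistoryFetcher:
    def __init__(self, vpe_fetch):
        self._vpe_fetch = vpe_fetch
        self.history = []          # one list per emulated core in practice
        self.replay_index = None   # None means normal operation

    def fetch(self, pc):
        if self.replay_index is not None:
            instr = self.history[self.replay_index]
            self.replay_index += 1
            if self.replay_index == len(self.history):
                self.replay_index = None   # end of list: normal operation
            return instr
        instr = self._vpe_fetch(pc)
        self.history.append(instr)
        return instr

    def flush(self, flushed_count):
        # Point at the oldest flushed instruction in the history list.
        self.replay_index = len(self.history) - flushed_count


f = HistoryFetcher({10: "ADD", 11: "SUB", 12: "LOAD"}.get)
normal = [f.fetch(pc) for pc in (10, 11, 12)]
f.flush(2)                                  # last two instructions flushed
replayed = [f.fetch(pc) for pc in (11, 12)]
```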
In a further implementation form of the first and/or second aspects, when the mis-prediction branch is detected, the system is adapted to enter into a sandbox mode wherein at least one false instruction for execution and committing in the pipeline model is sent by the virtual platform emulator, and wherein, upon identification of a mis-prediction the system is adapted to instruct a rollback at a proxy layer of the virtual platform emulator to the instruction branch for taking a correct branch decision. This implementation provides the advantages of enabling a pipeline model to simulate detection and recovery of branch mis-prediction in a modeled processor design.
In a further implementation form of the first and/or second aspects, when the mis-prediction branch is detected during the emulation of the plurality of block derived code instructions, the plurality of previously executed instructions comprises at least one false instruction for execution without committing in the pipeline model, wherein the processor is adapted to execute the code for instructing a flushing of the pipeline model in response to identification of branch mis-prediction and to instruct a rollback at a proxy layer of the virtual platform emulator to the instruction branch.
In a further implementation form of the first and/or second aspects, the processor is further adapted to update a dataset of instructions flushed from the plurality of block derived code instructions and to instruct the virtual platform emulator to add instructions from the dataset of instructions as the plurality of previously executed code instructions. This implementation provides the advantages of enabling a virtual platform emulator to send dummy instructions to a pipeline model when the pipeline model has requested incorrect instructions, for example when a pipeline model has mis-predicted a branch.
In a further implementation form of the first and/or second aspects, the processor is adapted to instruct said virtual platform emulator to emulate execution of an instruction block in response to said instruction request from said pipeline model exclusively when said requested instruction is a member of said instruction block.
In a further implementation form of the first and/or second aspects, the virtual platform emulator comprises a scheduler adapted to schedule processing of a next instructions block, wherein the scheduler is adapted to instruct said virtual platform emulator to emulate execution of the next instructions block when an instruction request received from said pipeline model comprises a code instruction that is a member of said next instructions block.
Unless otherwise defined, all technical and/or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the invention pertains. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments of the invention, exemplary methods and/or materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting. BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
Some embodiments of the invention are herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of embodiments of the invention. In this regard, the description taken with the drawings makes apparent to those skilled in the art how embodiments of the invention may be practiced.
In the drawings:
FIG. 1 is a flowchart of a method for simulating a processor design, according to some embodiments of the present invention;
FIG. 2 is schematic illustration of an exemplary system for simulating a processor design, according to some embodiments of the present invention;
FIG. 3A is a schematic diagram representing the connections between a virtual platform emulator and a pipeline model, according to some embodiments of the present invention;
FIG. 3B is a schematic diagram representing an application programming interface, according to some embodiments of the present invention;
FIG. 3C is a schematic diagram representing a pipeline model fetching of code instructions from a virtual platform emulator, according to some embodiments of the present invention;
FIG. 3D is a schematic illustration of scheduling execution of code instructions on multiple cores according to some embodiments of the present invention;
FIG. 4A is a schematic diagram representing the state of a pipeline model during a branch mis-prediction, according to some embodiments of the present invention;
FIG. 4B is a schematic illustration of pipeline model sequentially fetching instructions from a virtual platform emulator, according to some embodiments of the present invention;
FIG. 4C is a schematic diagram representing the program counter and sequence of instructions executed in a pipeline model and a virtual platform emulator when a branch mis-prediction occurs, according to some embodiments of the present invention;
FIG. 5A is a schematic diagram representing the state of the program counter and core pipeline of the Simulation Emulation system during a flush event, according to some embodiments of the present invention;
FIG. 5B is a schematic diagram representing a pipeline model before and after a flush event, according to some embodiments of the present invention;
FIG. 5C is a schematic diagram representing the program counter and sequence of instruction execution in a pipeline model and a virtual platform emulator when a flush event occurs, according to some embodiments of the present invention; and
FIG. 6 is a schematic diagram of multiple pipeline model cores controlling multiple virtual platform emulator cores, according to some embodiments of the present invention.
DETAILED DESCRIPTION
The present invention, in some embodiments thereof, relates to a method for simulating a hardware processor design, and more specifically but not exclusively, to a method to simulate a multi-core processor design by a virtual platform emulator and a multi-core pipeline model.
Architectural timing simulation models, referred to herein as performance simulation models (PSM), are used to explore and optimize the performance of multi- core CPU micro-architecture design at an early stage of development. The models may be software codes that execute on a hardware processor of a host platform.
A modern CPU may comprise one or more cores, where each core may comprise one or more hardware threads, and application benchmarks may need to run on multiple software threads and/or multiple processes. This in turn requires running the benchmark under an Operating System (OS).
PSMs are usually unable to run unmodified benchmarks under an unmodified OS, as this would require the PSM to implement very accurate and detailed functional behavior of the modeled CPU, for example memory management, communications management over networking interfaces, and the like.
There are virtual platform emulators (VPE), such as QEMU, that operate with an unmodified OS and largely unmodified benchmark programs and model detailed functional behavior. Integrating a VPE with a PSM enables decoupling functionality from timing: the VPE knows everything about functionality and nothing about timing, while the PSM knows nothing about functionality and everything about timing.
However, processing of instructions by a VPE is difficult to control in terms of code granularity. The smallest code granularity of a VPE is a block of instructions, comprising a set of instructions up to a branch or a maximum size. A VPE scheduler may assign execution of an entire block of instructions to a single core, and schedule the next block of instructions to a different core. The VPE is not adapted to guarantee any order of code instruction execution among multiple cores, and from time to time the virtual platform switches to executing instruction blocks of a different core. Furthermore, future implementations of virtual platforms may execute different instructions of different cores in parallel on different threads. A multi-core CPU has a different scheduler code granularity, and may schedule individual instructions from a block to multiple cores. For this reason it is difficult for a VPE to simulate multi-core CPUs.
Furthermore, the VPE may not simulate core pipeline operations, for example branch mis-prediction and pipeline flush.
A block of code refers to a group of code instructions, for example a group of all code instructions until a branch instruction is encountered.
Another type of CPU model is a PSM. A PSM is a computer program that, when executed on a processor, simulates one or more hardware threads, referred to herein as cores, of a CPU on a clock cycle basis. Each core of a CPU may execute a separate software thread in parallel.
A PSM may be used for testing and benchmarking the performance of a modeled CPU design, including for example modeled CPU behavior during a misprediction and/or a flush event.
Branch mis-prediction occurs in a CPU when a core mis-predicts the result of a calculation that determines a branch instruction. In order to increase the speed of code execution, the CPU fetches instructions ahead of execution. When a branch instruction is encountered, the following instruction depends on which branch is executed. A CPU may predict the outcome of the branch instruction based on historical precedent and/or any other method. The CPU identifies that a mis-prediction has occurred after completing execution of the branch calculation. To recover from a mis-prediction, the CPU resets the program counter (PC) to the address of the correct branch, the correct instructions are fetched into the pipeline, and any incorrectly fetched instructions in the pipeline are not committed. The state of the core, including the values of registers, is reset, or rolled back, to the state prior to executing the incorrect branch instructions.
A flush event may result when a CPU detects error events, for example a RAW hazard, a Wait for Interrupt (WFI) instruction being committed, an out-of-order read that executes prior to an earlier write where both access the same address, and the like. For example, a RAW hazard may occur when a write instruction precedes, by one clock cycle, a read instruction to the same register. For example, when a CPU requires five clock cycles to execute each instruction, each core pipeline performs the following five operations for each instruction during five consecutive clock cycles: Instruction Fetch, Instruction Decode, Execute, Memory access, and Register write back. A write instruction writes a value to the register (Register write back) in the fifth clock cycle. However, a read instruction that immediately follows the write instruction will read the register (Memory access) also in the fifth clock cycle, potentially reading an incorrect value from the register.
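For illustration only, the five-stage timing described above may be sketched as follows; the code structure and function names are illustrative assumptions and are not part of the specification:

```python
# Sketch of the five-stage timing described above: each instruction enters
# the pipeline one cycle after its predecessor, so stage k of the i-th
# instruction occurs in cycle i + k (0-indexed). A read that immediately
# follows a write reaches its register-access stage in the same cycle as
# the write's write-back stage, producing a read-after-write (RAW) hazard.

STAGES = ["Fetch", "Decode", "Execute", "Memory", "WriteBack"]

def stage_cycle(issue_order, stage_name):
    """Cycle (0-indexed) in which a given instruction reaches a stage."""
    return issue_order + STAGES.index(stage_name)

def raw_hazard(write_order, read_order):
    """A RAW hazard occurs when the read accesses the register file no
    later than the cycle in which the write commits its result."""
    return stage_cycle(read_order, "Memory") <= stage_cycle(write_order, "WriteBack")

# A write issued at position 0 commits in cycle 4; a read issued at
# position 1 accesses memory in cycle 1 + 3 = 4, the same cycle, so the
# value may be stale and the read must be re-played.
print(raw_hazard(0, 1))  # True: hazard
print(raw_hazard(0, 2))  # False: one extra cycle separates them
```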
When the flush event is detected by the CPU, for example according to a specific sequence of a "read" instruction following a "write" instruction, the read instruction is re-fetched, or re-played, in order to execute the correct read operation. The instructions that are already in the core pipeline may be incorrect due to the RAW, so those instructions are not committed, and/or are flushed. In order to minimize the latency caused by re-fetching instructions after a detected hazard, the CPU may store recently fetched instructions in a short-term memory or cache, for example L1 Cache and/or L2 Cache, to be available for re-fetching in the event of a flush.
In order to execute code instructions in parallel, a CPU fetches a stream of individual instructions and assigns each instruction to a processing core. A PSM may model a single core of a multi-core CPU by processing individual instructions, and/or multiple cores of a CPU.
As described above, performance simulation of a CPU must model the real-time behavior, including fetching individual instructions to multiple cores that execute in parallel, recovering from branch mis-predictions, recovering from a flush event, and the like. However, while VPEs and PSMs each offer a subset of capabilities, neither provides a complete simulation of CPU performance.
A VPE may fetch and execute code instructions in blocks, which does not allow modeling parallel core processing. VPEs lack the ability to model branch mis-prediction, pipeline flushes, and individual code instruction processing.
PSMs do not emulate VPE services, for example memory management, power management, device management, register management, and the like. PSMs also lack the flexibility and scalability to run unmodified large-scale OSs and applications, which limits the ability to test whether the CPU model meets design goals when running a variety of OSs.
In exemplary embodiments, a VPE and a PSM are combined by means of a proxy layer to form an integrated model that simulates performance of a modeled CPU. The integrated model may execute unmodified benchmarks under unmodified OS, fetch and simulate execution of individual code instructions, and simulate multiple core pipeline timing behavior including branch mis-prediction and flush events.
Optionally, the proxy layer may provide function calls from an application programming interface (API) that enable interoperation between the VPE and PSM, for example API function calls to fetch single instructions from the VPE to the PSM, to control the progress of the VPE by the PSM, to synchronize VPE operations with PSM modeling of flush events and branch mis-prediction, and the like.
Optionally, the VPE comprises a scheduler for determining progressing from processing one block of code instructions to processing a next block of code instructions, and the scheduler is turned off and replaced with a modified scheduler. The modified scheduler determines progressing to processing a next block of code instructions according to receiving an instruction fetch request from the PSM comprising a code instruction that is a member of the next instruction block.
Optionally, the VPE scheduler is modified to process an entire block of code instructions and only to proceed to a next block of code instructions when a fetch request for an instruction from the next block is received from the PSM. Advantageously, the PSM controls the progress of the VPE according to instruction fetch requests. Optionally, the VPE responds to an instruction fetch request with trace information generated by executing the instruction on the VPE. Trace information may comprise, for example, an opcode, a virtual and/or physical program counter address, a virtual and/or physical program counter of a next instruction, a virtual and/or physical address for loading and/or storing an instruction, and the like.
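For illustration only, the trace information described above may be represented as in the following sketch; the field names and the Python representation are illustrative assumptions and are not defined by the specification:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InstructionTrace:
    # Fields mirror the trace contents listed above; names are illustrative.
    opcode: int
    virtual_pc: int           # virtual program counter of this instruction
    physical_pc: int          # physical program counter of this instruction
    next_virtual_pc: int      # PC of the next instruction (reveals taken branches)
    next_physical_pc: int
    mem_virtual_addr: Optional[int] = None   # load/store address, if any
    mem_physical_addr: Optional[int] = None

# The PSM can detect a taken branch by comparing next_virtual_pc with the
# fall-through address (assuming fixed 4-byte instructions for this sketch):
t = InstructionTrace(opcode=0x91000421, virtual_pc=0x1000, physical_pc=0x41000,
                     next_virtual_pc=0x1004, next_physical_pc=0x41004)
print(t.next_virtual_pc == t.virtual_pc + 4)  # True: fall-through, not a taken branch
```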
In exemplary embodiments, the integrated platform may simulate a branch mis-prediction by the PSM requesting an incorrect instruction as a result of a mis-prediction, whereupon the VPE enters a sandbox mode in which dummy instructions are fetched to the PSM and progress according to instruction fetch requests is suspended until the PSM corrects the mis-prediction and requests a correct instruction. Advantageously, by recognizing the mis-prediction the VPE is prepared to fetch correct code instructions as soon as the PSM detects the mis-prediction.
In exemplary embodiments, the integrated platform may simulate a flush event by the PSM requesting the VPE to re-fetch recent instructions and ceasing to commit instructions in the core pipeline where the flush event occurred. Advantageously, by requesting re-fetching of instructions and not committing instructions in the pipeline, the PSM simulates the behavior of a CPU core.
A VPE may be a software code that executes on a host CPU and emulates an electronic system comprising a modeled CPU and peripheral devices, such as virtual and physical memory, and peripheral input-output devices such as a Network Interface Card (NIC), a block device, keyboards and/or screens, and the like. The VPE may emulate the modeled CPU at a level of detail such that an executable software code comprising machine code executable on the modeled CPU may execute on the VPE without any modification. For example, a benchmark software code for testing CPU performance may be executed by a VPE.
A VPE provides an interface for executing machine code, comprising a virtual CPU, including for example multiple processor cores, registers, power management, virtual and physical memories, a page table, interconnection buses, an interrupt model, network connections, and the like. A VPE may be OS independent, whereby performance goals of the CPU may be tested with multiple OSs.
The VPE may be executed on a host computer CPU where the CPU that the VPE emulates may be completely different and unrelated to the host computer CPU. When a code executes on a VPE, the VPE translates the executable code instructions of the modeled CPU into machine code native to the host computer CPU, causes the native code to be executed, and updates the modeled CPU state according to the results of the execution.
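For illustration only, the translate-and-execute loop described above may be sketched as follows, using an invented two-instruction toy instruction set; the structure and names are illustrative assumptions:

```python
# Toy sketch of the translate-then-execute loop: modeled-CPU opcodes are
# mapped to host-native operations (Python callables here), executed, and
# the modeled CPU state is updated from the results. The instruction set
# ("inc", "dbl") is invented purely for illustration.

def translate(opcode):
    """Map a modeled-CPU opcode to a host-native operation."""
    host_ops = {
        "inc": lambda state: {**state, "r0": state["r0"] + 1},
        "dbl": lambda state: {**state, "r0": state["r0"] * 2},
    }
    return host_ops[opcode]

def emulate(program, state):
    for opcode in program:
        native = translate(opcode)   # modeled opcode -> host-native code
        state = native(state)        # execute natively, update modeled state
    return state

print(emulate(["inc", "dbl"], {"r0": 1}))  # {'r0': 4}
```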
While a VPE may emulate the CPU architecture and interface to an executable code and/or operating system, it does not have the ability to model CPU micro-architecture components such as pipelines and caches, or behavior when flush events or branch mis-predictions occur, and therefore has limited utility for predicting the performance of a CPU design. In addition, it is difficult to control the progress of a VPE, as code instructions are executed in blocks, such that all instructions in a block of code are continuously executed.
In an exemplary embodiment, the progress of the VPE is controlled by the PSM in the following manner: in response to an instruction fetch request from the PSM, the VPE fetches a block of executable instructions, for example from a memory in a host computer, processes the entire block, and returns the requested instruction with trace information to the PSM. The processing comprises executing the instructions in the VPE emulated CPU by translating the executable instructions into machine code for the host computer CPU to execute, and storing the instruction trace of the emulated execution, comprising all information needed by the PSM to execute each of the instructions.
Subsequent instructions from the same block are supplied according to fetch requests from the PSM. The VPE does not proceed to another block of code instructions until the PSM requests an instruction from another block.
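For illustration only, this control flow may be sketched as follows: the VPE side executes a whole block on the first fetch into it, buffers the per-instruction traces, and serves subsequent fetches from that buffer until a PC outside the block arrives. The class structure, and the representation of blocks as pre-split PC-to-trace maps, are illustrative assumptions:

```python
class BlockServingEmulator:
    """Sketch of a VPE whose scheduler only advances to the next block when
    the PSM requests an instruction outside the current one."""

    def __init__(self, blocks):
        self.blocks = blocks      # list of dicts: pc -> trace (illustrative)
        self.block_idx = -1       # no block processed yet
        self.current = {}

    def fetch(self, pc):
        if pc not in self.current:
            # The PSM requested an instruction from another block: execute
            # the next block in its entirety and buffer all its traces.
            self.block_idx += 1
            self.current = self.blocks[self.block_idx]
        return self.current[pc]

# Two blocks of two instructions each; traces are stand-in strings.
emu = BlockServingEmulator([{0: "add", 4: "branch"}, {100: "load", 104: "store"}])
print(emu.fetch(0), emu.fetch(4))   # served from block 0
print(emu.fetch(100))               # crossing into block 1 advances the scheduler
```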
Optionally, CPU branch mis-prediction is modeled by the PSM and VPE in the following manner: when the PSM mis-predicts a branch instruction, as described above, the resulting instruction fetch requests to the VPE will be incorrect. The VPE, which as described above has already executed all instructions in the block, identifies the requested instruction as incorrect, and enters a mode of operation, referred to herein as Sandbox mode.
Sandbox mode comprises the VPE sending dummy, no operation (NOP), and/or fake instructions, referred to herein as dummy instructions, to the PSM in response to the incorrect fetch request and suspending progressing according to fetch requests from the PSM. The dummy instructions are sent in place of the correct instructions in order to prevent the correct instructions from changing the state of VPE PC and/or registers. The VPE exits Sandbox mode when the correct instruction fetch request is received from the PSM.
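For illustration only, the Sandbox-mode behavior described above may be sketched as follows; the class structure is an illustrative assumption, and in this sketch the VPE is reduced to knowing only the correct next PC after the executed block:

```python
NOP = "nop"  # dummy instruction returned while in Sandbox mode

class SandboxingEmulator:
    """Sketch of Sandbox mode: after a block is executed, the correct next
    PC is known; any other fetch request is answered with a dummy
    instruction until the PSM rolls back and requests the correct one."""

    def __init__(self, correct_next_pc, correct_trace):
        self.correct_next_pc = correct_next_pc
        self.correct_trace = correct_trace
        self.in_sandbox = False

    def fetch(self, pc):
        if pc == self.correct_next_pc:
            self.in_sandbox = False   # exit Sandbox mode: progress resumes
            return self.correct_trace
        self.in_sandbox = True        # mis-predicted path: freeze progress
        return NOP

emu = SandboxingEmulator(correct_next_pc=0x40, correct_trace="sub")
print(emu.fetch(0x80))   # mis-predicted fetch -> 'nop', emulator enters Sandbox mode
print(emu.in_sandbox)    # True
print(emu.fetch(0x40))   # correct fetch after rollback() -> 'sub'
print(emu.in_sandbox)    # False
```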
As the PSM continues to process instructions already in the pipeline, the branch instruction that was mis-predicted is calculated, whereby the PSM detects the mis-prediction. The PSM then performs a function call to the proxy layer API to recover from the mis-prediction. The API function, referred to herein as rollback(), informs the VPE to roll back the program counter from the dummy instructions to the correct branch instruction address, and to fetch the correct instruction.
Optionally, CPU flush events are modeled by the integrated model in the following manner: when the PSM detects a hazard, for example a RAW, a function call is made to the proxy layer API, referred to herein as replay(), which re-fetches a set of instructions that were recently fetched. The PSM does not commit the instructions already in the pipeline that are associated with the hazard, and the re-fetched instructions are re-executed and committed. For example, by re-fetching the read operation in a RAW hazard, the write instruction has sufficient time to be completely processed, thereby eliminating the hazard.
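For illustration only, the record-and-replay mechanism described here may be sketched as a bounded first-in-first-out store of recently fetched instructions; the capacity and class structure are illustrative assumptions:

```python
from collections import deque

class ReplayBuffer:
    """Sketch of the replay() support: recently fetched instructions are
    kept in a bounded FIFO so that, on a flush event, the most recent
    fetches can be re-played (capacity is an illustrative assumption)."""

    def __init__(self, capacity=8):
        self.recent = deque(maxlen=capacity)

    def record(self, pc, trace):
        self.recent.append((pc, trace))

    def replay(self, count):
        """Return the last `count` fetched instructions, oldest first."""
        return list(self.recent)[-count:]

buf = ReplayBuffer()
buf.record(0x10, "store r1")
buf.record(0x14, "load r1")    # RAW hazard detected against the store above
print(buf.replay(1))           # [(20, 'load r1')] -- only the read is re-played
```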
Optionally, the proxy layer between the PSM and VPE may be a set of executable code instructions executing on the host computer CPU, for example a Simulation Emulation system 200 as described below. The proxy layer may comprise API function calls, for example a rollback() function that, when called by the PSM, instructs the VPE to roll back the PC to a corrected branch instruction, a replay() function call that, when called by the PSM, re-fetches a set of previously fetched instructions, and the like. The proxy layer may also provide memory management services to the PSM, for example storing previously fetched instructions in a non-volatile memory of the host computer to support the replay() function call, managing physical and virtual memory of the host computer, and the like.
The present invention, in some embodiments thereof, provides a number of advantages over the existing art. The integration of VPE and PSM provides microprocessor designers with a tool that decouples functionality from timing by emulating a modeled CPU code instruction interface, is compatible with multiple OSs, and models clock-cycle-level timing of multi-core pipeline activities. In particular, the throughput and operational behavior of multiple pipelines may be modeled for branch mis-prediction, pipeline flush, and re-fetching of flushed instructions.

Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not necessarily limited in its application to the details of construction and the arrangement of the components and/or methods set forth in the following description and/or illustrated in the drawings and/or the Examples. The invention is capable of other embodiments or of being practiced or carried out in various ways.
The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
Reference is now made to FIG. 1, a flowchart schematically representing Simulation Emulation method 100 for simulating a CPU, including processing of branch mis-prediction and flush events, according to some embodiments of the current invention.
Simulation Emulation method 100 may be executed by code instructions executing on a CPU, for example by code instructions executing on processor(s) 204 of Simulation Emulation system 200 as described below in FIG. 2.
Simulation Emulation method 100 begins when a request from a PSM to fetch an executable instruction is received by a VPE; in response, the VPE processes a block of instructions as described above and returns the requested instruction to the PSM. For example, code instructions from PSM 212 executing on processor 204, as described below, may instruct sending the request and receiving the instructions, and code instructions from VPE 211 executing on processor 204, as described below, may instruct responding to the request.
When the fetch request comprises a branch mis-prediction, the VPE proxy layer returns dummy instructions in place of the requested instructions, until the PSM requests fetching the correct instruction, as described below. When a flush event is detected, a request to re-fetch the recent instructions is sent, as described below.
Optionally, Simulation Emulation method 100 comprises a method for multiple cores operating in parallel. For example, a PSM may model multiple cores, where each modeled core may control the progress of a corresponding VPE core, as described below in FIG. 6.
Reference is now made to FIG. 2, a schematic illustration of exemplary Simulation Emulation system 200 for simulating a CPU, according to some embodiments of the present invention. Simulation Emulation system 200 comprises an input/output (I/O) interface 202, processor(s) 204, and storage 208.
Simulation Emulation system 200 is adapted to receive code instructions that are executable on a modeled CPU, for example from user device 260 as described below, and to simulate the performance of the Modeled CPU when executing the received code instructions, for example by executing on processor(s) 204 code in storage 208.
Simulation Emulation system 200 may comprise for example a server, a desktop computer, an embedded computing system, an industrial computer, a ruggedized computer, a laptop, a cloud computer, a private cloud, a public cloud, a hybrid cloud, and/or any other type of computing system. Optionally, Simulation Emulation system 200 comprises a virtual machine (VM) in place of I/O 202, processor(s) 204, and storage 208.
I/O 202 may include one or more input interfaces, for example a Network Interface Card (NIC), a block device, a keyboard, a soft keyboard, a voice-to-text system, and/or any other data input interface. I/O 202 may comprise one or more output interfaces, for example a screen, a touch screen, a video display, and/or any other visual display device.
Processor(s) 204 may comprise one or more hardware processors, a multi-core processor, and/or any other type of CPU. Storage 208 may include one or more non-transitory persistent storage devices, for example, a hard drive, a Flash array and the like.
Optionally, Simulation Emulation system 200 is connected to a network 230 via I/O 202. For example, I/O 202 may be a network interface card (NIC), a wireless router, and/or any other type of network interface adapted to communicate with network 230.
Network 230 may be any type of data network, for example, a local area network (LAN), an Ethernet LAN, a fiber optic LAN, a digital subscriber line (DSL), a wireless LAN, a broadband connection, an Internet connection using an Internet Service Provider (ISP) and/or any other type of computer network. Network 230 may employ any type of data networking protocols, including transport control protocol and internet protocol (TCP/IP), user datagram protocol (UDP), and the like.
Optionally, user device 260 may be connected to Simulation Emulation system 200, for example via network 230. User device 260 may be a smartphone, a computer, and/or any other computing platform. User device 260 may be adapted to transmit to Simulation Emulation system 200 via network 230 a plurality of code instructions to execute on a Modeled CPU, for example an OS and benchmark application for testing performance of a modeled CPU design.
Simulation Emulation method 100 may be executed by processor(s) 204 executing code from one or more software modules in storage 208, for example Process Manager 210, Virtual Platform Emulator 211, Pipeline Model 212, and/or proxy layer 213. A software module refers to a plurality of program instructions stored in a non-transitory medium, such as the storage 208, and executed by a processor, such as the processor(s) 204.
Reference is now made again to FIG. 1. As shown in 101, Simulation Emulation method 100 begins by sending a request to fetch a code instruction; for example, code instructions from PSM 212 executing on processor(s) 204 may instruct sending a request to fetch a code instruction from VPE 211.
Optionally, the fetch request, and/or all messaging and communication between PSM 212 and VPE 211, may be a proxy layer API function call, comprising code instructions from proxy layer 213 executing on processor(s) 204. Proxy layer 213 may comprise code instructions that, when executed on processor(s) 204, instruct receiving and sending messages, for example a request to fetch a code instruction, a response to the request comprising the requested code instruction, and the like.
Reference is now made to FIG. 3A, a schematic diagram representing the connections between code modules VPE 211, PSM 212, and proxy layer 213, according to some embodiments of the invention. As shown in 320, multiple virtual cores in VPE 211 may communicate with corresponding pipeline cores 330 in PSM 212 via proxy layer 213. The communication is implemented by code instructions in proxy layer 213 executing on processor(s) 204, as described below in FIG. 3B. The virtual platform emulator 211 comprises CPU cores 320, memory 214, and devices 216.
Reference is now made to FIG. 3B, a schematic diagram representing the services provided by proxy layer 213 to PSM 212 and VPE 211, according to some embodiments of the invention.
As shown in 213, optionally code instructions from proxy layer 213 when executed on processor(s) 204 instruct the following set of actions: API function calls for sending messages, communication, data, instruction fetch requests, and/or requested instructions between PSM 212 and VPE 211. Storing instruction traces, comprising making a temporary copy of all instructions and corresponding trace that were fetched by PSM 212, for example storing traces in a local memory cache and/or first in first out (FIFO) record in storage 208. Executing Replay() API function calls, comprising re-fetching previously fetched instructions, for example fetching instructions from the instruction trace described above and/or from VPE 211. Managing virtual and/or physical PC, for example directing Proxy Layer 213 to change the value of respective PC according to a Replay() function call by PSM 212. Managing virtual and/or physical storage addresses, for example receiving a request from VPE 211 to fetch a block of code according to a virtual address, and translating the virtual address into a physical address in storage 208. Managing power states, for example instructing VPE 211 and/or PSM 212 to cease operations according to detecting that a set of code instructions has been fully executed by VPE 211 and/or PSM 212. Tracing packets, for example keeping a record of packets of data passed between VPE 211 and/or PSM 212. Tracing Disk I/O, for example keeping a record of content of storage 208 that is retrieved by VPE 211. In addition, proxy layer 213 may comprise any additional code instructions that instruct any communication and/or interaction between PSM 212 and VPE 211.
Reference is now made to FIG. 3C, a schematic diagram representing fetching of code instructions from VPE 211 to PSM 212, according to some embodiments of the present invention. As shown in 351, code instructions from PSM 212 executing on processor(s) 204 instruct sending a fetch instruction request to VPE 211. As shown in 352, code instructions from VPE 211 executing on processor(s) 204 instruct fetching a block of code instructions, processing the instructions in the block as described above, and as shown in 353 sending to PSM 212 the requested instruction. As shown in 354 and 355, PSM 212 requests and receives another instruction.
Optionally, VPE 211 stores the traces of the generated code instructions until PSM 212 requests the corresponding code instruction.
Optionally, the instruction to be fetched is identified according to a PC of the corresponding instruction, for example by code instructions from PSM 212 executing on processor(s) 204 specifying a PC for the requested instruction.
Optionally, code instructions from VPE 211 executing on processor(s) 204 instruct disabling a scheduler of VPE 211 and enabling a modified scheduler that progresses according to code fetch requests from PSM 212. A scheduler may be code instructions in VPE 211 executing on processor(s) 204 that instruct the progress of each VPE core, and the modified scheduler instructs fetching blocks of code instructions to process according to fetch requests from PSM 212, as described above.
Reference is now made to FIG. 3D, a schematic illustration of scheduling execution of code instructions on multiple cores, according to some embodiments of the invention. As shown in 371, an entire block of code instructions is scheduled and executed on one VPE 211 core, followed by a next block of code instructions executed on another VPE 211 core. As shown in 372, PSM 212 cores execute code instructions individually or in smaller groups than a code block.
Reference is now made again to FIG. 1. As shown in 102, VPE 211 detects a branch mis-prediction of the code instruction requested by PSM 212, for example by code instructions from PSM 212 executing on processor(s) 204 instructing comparison of a predicted branch instruction with a calculation of the branch instruction, where a mismatch indicates a mis-prediction. Optionally, proxy layer 213 detects a mis-prediction by a mismatch between the PC of the requested instruction and the PC of the fetched instruction from VPE 211.
As shown in FIG. 1, step 103, in response to the detected mis-prediction, optionally VPE 211 enters a Sandbox mode, as described above. VPE 211 may respond to the fetch request and subsequent fetch requests from PSM 212 by sending a dummy instruction, exiting Sandbox mode when a fetch request with the correct branch PC is received.
When the dummy instruction is fetched by PSM 212, it is input to the core pipeline. Prior to the commit stage of the pipeline, PSM 212 detects the mis-prediction as described above and performs the following actions to recover from it: the location of the branch mis-prediction is identified, and a recovery message is sent to VPE 211 instructing it to roll back to the branch point. The recovery message may be sent as a proxy layer API rollback() function call. PSM 212 does not commit the dummy instructions.
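The sandbox behaviour above can be sketched in a few lines. This is a hedged illustration, not the patent's implementation; the names SandboxVpe, enter_sandbox, and DUMMY are assumptions. While in sandbox mode the emulator answers every wrong-path fetch with a dummy instruction, and exits only when the fetch request carries the correct branch PC:

```python
DUMMY = ("DUMMY", None)


class SandboxVpe:
    def __init__(self, trace):
        self.trace = dict(trace)      # pc -> instruction on the correct path
        self.sandbox = False
        self.resume_pc = None         # correct PC that exits sandbox mode

    def enter_sandbox(self, correct_pc):
        # Entered when the emulator detects a mis-predicted branch.
        self.sandbox, self.resume_pc = True, correct_pc

    def fetch(self, pc):
        if self.sandbox:
            if pc != self.resume_pc:
                return DUMMY          # wrong-path fetch: feed a dummy
            self.sandbox = False      # correct branch PC received: exit sandbox
        return (self.trace[pc], pc)


vpe = SandboxVpe({13: "bz", 80: "target", 14: "wrong"})
vpe.enter_sandbox(correct_pc=80)
assert vpe.fetch(14) == DUMMY           # wrong path after mis-prediction
assert vpe.fetch(80) == ("target", 80)  # rollback() re-fetches the correct PC
```

The dummy instructions flow through the performance model's pipeline like real instructions but are never committed, which preserves pipeline timing on the wrong path.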
Reference is now made to FIG. 4A, a schematic diagram representing the state of the PSM 212 PC during a branch mis-prediction, where each circle represents an instruction in a core pipeline, according to some embodiments of the current invention. As shown in 401, PSM 212 requests incorrect instructions to be fetched from VPE 211 due to a mis-prediction. As shown in 402, as a result of detecting the mis-prediction, PSM 212 sets the PC to the correct branch instruction, and as shown in 403 the correct instructions are fetched.
Reference is now made to FIG. 4B, a schematic illustration of PSM 212 sequentially fetching instructions from VPE 211, according to some embodiments of the current invention. As shown in 451, the thick arrows represent messages within PSM 212. As shown in 452 the medium arrows represent fetch requests from PSM 212. As shown in 453 the thin arrows represent fetch replies from VPE 211. As shown in 310, proxy layer 213 provides interoperability services to VPE 211 and PSM 212, as described above.
Reference is now made to FIG. 4C, a schematic diagram representing the PC (program counter) and sequence of instructions executed in PSM 212 and VPE 211 when a branch mis-prediction occurs, according to some embodiments of the current invention. 471 represents the order of execution of instructions in a CPU with PC numbered from 10 to 81, 472 represents the order of execution of instructions in VPE 211 with PC numbered from 10 to 81, 473 represents the order of execution of instructions in PSM 212 with PC numbered from 10 to 81 when a branch mis-prediction occurs, and 474 represents the order of execution of instructions in PSM 212 after the rollback() API has been executed.
As shown in 471, the instruction with PC 13 labeled "BZ LABEL" is a conditional branch instruction, and the following correct instruction is located at PC 80.
As shown in 472, code instructions from VPE 211 executing on processor(s) 204 instruct correctly calculating the branch located at PC 13, and the following instruction is at PC 80.
As shown in 473, code instructions in PSM 212 executing on processor(s) 204 instruct mis-predicting the branch at PC 13, and instructions with PC 14, 15, and 16 are fetched from VPE 211.
As shown in 474, code instructions in PSM 212 executing on processor(s) 204 instruct calling the API function rollback(), as described above, resulting in fetching the correct instruction at PC 80 and not committing the incorrectly fetched instructions.
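The FIG. 4C scenario can be reproduced as a small worked example. This is a hypothetical re-creation of the trace, not production code; the function name run and its parameters are illustrative. The branch at PC 13 is mis-predicted as falling through to PC 14, three wrong-path instructions are fetched, and the rollback discards them before committing from the correct target PC 80:

```python
def run(branch_pc, predicted, actual):
    """Return the committed PC sequence after mis-prediction recovery."""
    pipeline = list(range(10, branch_pc + 1))              # PCs 10..13 fetched
    pipeline += [predicted, predicted + 1, predicted + 2]  # wrong path 14..16
    if predicted != actual:                   # detected before the commit stage
        pipeline = pipeline[: branch_pc - 10 + 1]  # rollback() drops wrong path
        pipeline += [actual, actual + 1]      # re-fetch from correct PC 80
    return pipeline


assert run(branch_pc=13, predicted=14, actual=80) == [10, 11, 12, 13, 80, 81]
```

The committed sequence matches 474 in FIG. 4C: the wrong-path PCs 14-16 are fetched and then dropped, never committed.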
Reference is now made again to FIG. 1. As shown in 104, a flush pipeline event is generated and/or detected, for example by code instructions from PSM 212 executing on processor(s) 204 that instruct detection of a hazard. Optionally, PSM 212 generates one or more flush pipeline events during normal operations, and then detects the generated pipeline flush event. Optionally, the flush pipeline event may be the result of a branch mis-prediction and/or a pipeline hazard, for example a load/store data cache miss, a memory disambiguation hazard for out-of-order execution, a read-after-write (RAW) hazard, a stalled pipeline, and the like.
As shown in 105, in response to the detected flush event, previously fetched instructions are re-fetched, and the results of the instructions in the pipeline associated with the hazard are not committed. For example, code instructions from PSM 212, code instructions from proxy layer 213, and/or code instructions from VPE 211 executing on processor(s) 204 may instruct the following: when a flush event is detected by PSM 212, PSM 212 calls the replay() API function, which requests to re-fetch, or replay, the recently fetched instructions currently in the PSM 212 core pipeline. Optionally, the proxy layer 213 replay() function returns the recently fetched instructions to PSM 212 from a local cache, as described above. Optionally, the proxy layer 213 replay() function returns the recently fetched instructions to PSM 212 from VPE 211. Optionally, VPE 211 maintains a cache of recently fetched instructions which comprises the instructions currently in the PSM 212 pipeline, for example a historical transactional memory stored in local cache and/or storage 208. VPE 211 sends the re-fetched instructions to PSM 212. PSM 212 does not commit the instructions in the pipeline associated with the hazard, and executes and commits the re-fetched instructions.
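The replay() recovery path above can be sketched with a proxy layer that keeps a bounded history of recently fetched instructions. The class and method names (ProxyLayer, replay) are illustrative assumptions; the sketch shows only the caching idea, under the assumption that the history is at least as deep as the PSM pipeline:

```python
from collections import deque


class ProxyLayer:
    def __init__(self, vpe_fetch, depth=8):
        self.vpe_fetch = vpe_fetch          # callable: pc -> instruction
        self.history = deque(maxlen=depth)  # (pc, instruction) of recent fetches

    def fetch(self, pc):
        # Normal fetch: forward to the emulator and record in the history.
        instr = self.vpe_fetch(pc)
        self.history.append((pc, instr))
        return instr

    def replay(self, n):
        """Return the last n fetched instructions, oldest first, served from
        the local cache without going back to the emulator."""
        return list(self.history)[-n:]


proxy = ProxyLayer(vpe_fetch=lambda pc: f"instr@{pc}")
for pc in (10, 11, 12, 13):
    proxy.fetch(pc)
# flush event: the PSM does not commit PCs 12-13 and replays them
assert proxy.replay(2) == [(12, "instr@12"), (13, "instr@13")]
```

Serving replays from the proxy cache keeps the emulator's state untouched, since the emulator already executed those instructions exactly once.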
As shown in 106, when the re-fetched instructions have been fetched and executed, Simulation Emulation method 100 is completed.
Reference is now made to FIG. 5A, a schematic diagram representing the state of PC and core pipeline of Simulation Emulation system 200 during a flush event, according to some embodiments of the current invention. As shown in 501, each circle represents an instruction in PSM 212 core pipeline, for example by code instructions from PSM 212 executing on processor(s) 204. As shown in 502, a flush event is detected, as described above. As shown in 503, the curved arrow represents the result of the detected flush event, wherein the previously fetched instructions are re-fetched, as described above.
Reference is now made to FIG. 5B, a schematic diagram representing a PSM 212 core pipeline before and after a flush event, according to some embodiments of the current invention. As shown in 551, each square represents a code instruction in a core pipeline, where code instructions proceed from left to right as they are processed by PSM 212. As shown in 552, before detection of a flush event, the instruction that was most recently processed, referred to herein as the latest instruction, is the farthest right instruction in the pipeline. As shown in 553, after detecting a flush event in the same core pipeline, the instructions in the pipeline are not committed, and as a result of the replay() API function call, as described above, the latest instruction is re-fetched and is now at an earlier pipeline location.
Reference is now made to FIG. 5C, a schematic diagram representing the PC and sequence of instruction execution in PSM 212 and VPE 211 when a flush event occurs, according to some embodiments of the invention. 571 represents the PC of instructions in a core pipeline of a CPU with PC numbered from 10 to 81, 572 represents the PC of executed instructions in VPE 211 with PC numbered from 10 to 81, 573 represents the PC of instructions in a core pipeline of PSM 212 with PC numbered from 10 to 81 when a flush event occurs, and 574 represents the PC of committed instructions in PSM 212 after the replay() API has been executed.
As shown in 571, the instruction with PC 13, labeled "load R3 <- [R1]", generates a hazard since the value of register "R1" may not be correct due to the write to R1 at PC 11.
As shown in 572, code instructions from VPE 211 executing on processor(s) 204 instruct correctly calculating the value of "R1".
As shown in 573, code instructions in PSM 212 executing on processor(s) 204 instruct detecting the possible hazard and, by calling the API function replay() as described above, re-fetching the most recent instructions, in this example those with PC 13-16.
As shown in 574, code instructions in PSM 212 executing on processor(s) 204 instruct not committing the original instructions at PC 13-16, and committing only the re-fetched instructions at PC 13-16.
Reference is now made to FIG. 6, a schematic diagram of multiple PSM core pipelines controlling multiple VPE 211 cores, according to some embodiments of the current invention. Bold arrow lines represent the movement of instructions as they are retrieved by VPE 211 from code blocks and fetched to PSM 212. PSM 212 comprises code instructions that, when executed on processor(s) 204, instruct multiple CPU core pipeline models, where each core pipeline model controls progress of a corresponding VPE 211 core. For example, as shown in 601 and 602, PSM core "0" controls the progress of corresponding VPE 211 core "0", and as shown in 603 and 604, PSM 212 core "1" controls VPE 211 core "1". Code instructions in process manager 210, when executed on processor(s) 204, may instruct assigning code instructions to individual PSM 212 core pipelines.
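The one-to-one pairing of FIG. 6 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patent's implementation; VpeCoreStub, PsmCorePipeline, and step are names introduced for illustration. The key point is that each PSM pipeline model, not the emulator's own scheduler, decides when its paired VPE core advances:

```python
class VpeCoreStub:
    """Stand-in for one emulated VPE core serving its own instruction stream."""

    def __init__(self, core_id, program):
        self.core_id, self.program, self.pos = core_id, program, 0

    def fetch_next(self):
        instr = self.program[self.pos]
        self.pos += 1
        return instr


class PsmCorePipeline:
    """One PSM core pipeline model driving its paired VPE core."""

    def __init__(self, vpe_core):
        self.vpe_core = vpe_core
        self.committed = []

    def step(self):
        # The PSM pipeline pulls one instruction, advancing the paired
        # VPE core by exactly one instruction.
        self.committed.append(self.vpe_core.fetch_next())


cores = [PsmCorePipeline(VpeCoreStub(i, [f"c{i}i{j}" for j in range(3)]))
         for i in range(2)]
for _ in range(3):
    for c in cores:                 # the PSM schedules each core independently
        c.step()
assert cores[0].committed == ["c0i0", "c0i1", "c0i2"]
assert cores[1].committed == ["c1i0", "c1i1", "c1i2"]
```

Because each pipeline model pulls from its own core, the interleaving of the cores is determined by the performance model's timing rather than by the emulator's block-at-a-time scheduling.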
The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
It is expected that during the life of a patent maturing from this application many relevant processor emulators and pipeline models will be developed and the scope of the terms VPE and PSM are intended to include all such new technologies a priori.
As used herein the term "about" refers to ± 10 %.
The terms "comprises", "comprising", "includes", "including", "having" and their conjugates mean "including but not limited to". These terms encompass the terms "consisting of" and "consisting essentially of".
The phrase "consisting essentially of" means that the composition or method may include additional ingredients and/or steps, but only if the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method.
As used herein, the singular form "a", "an" and "the" include plural references unless the context clearly dictates otherwise. For example, the term "a compound" or "at least one compound" may include a plurality of compounds, including mixtures thereof.
The word "exemplary" is used herein to mean "serving as an example, instance or illustration". Any embodiment described as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments.
The word "optionally" is used herein to mean "is provided in some embodiments and not provided in other embodiments". Any particular embodiment of the invention may include a plurality of "optional" features unless such features conflict.
Throughout this application, various embodiments of this invention may be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
Whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range. The phrases "ranging/ranges between" a first indicated number and a second indicated number and "ranging/ranges from" a first indicated number "to" a second indicated number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals therebetween.
It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination or as suitable in any other described embodiment of the invention. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.

Claims

WHAT IS CLAIMED IS:
1. A system (200) for simulating a multicore processor design, comprising:
an input/output interface (202);
a processor (204);
a virtual platform emulator (211);
a performance simulation model (212), wherein the performance simulation model (212) comprises at least one pipeline model (330);
wherein the input/output interface (202) is adapted to receive code instructions comprising a plurality of instructions blocks;
wherein the processor (204) is adapted to execute a code for:
instructing the virtual platform emulator (211) to emulate an execution of an instructions block of the plurality of instructions blocks to generate a stream of a plurality of block derived code instructions based on said instructions block;
when a mis-prediction branch in an instruction branch of said instructions block is detected, instructing said virtual platform emulator (211) to add a plurality of dummy code instructions to said stream;
when a flush pipeline event is detected instructing said virtual platform emulator (211) to add a plurality of previously executed instructions to said stream in an original execution order of said previously executed instructions; in response to each of a plurality of sequential and independent instruction requests received from the pipeline model, sequentially and independently fetching and transferring each of said plurality of block derived code instructions and at least one of said plurality of dummy code instructions and said plurality of previously executed instructions for execution and committing in said pipeline model.
2. The system (200) of claim 1, wherein said processor is adapted to instruct said virtual platform emulator by a plurality of application programming interface instructions.
3. The system (200) of any of the previous claims, wherein said processor is adapted to execute said code for each of a plurality of emulated cores of the multicore processor in parallel.
4. The system (200) of claim 3, wherein said processor is adapted to maintain a history list of fetched instructions for each of said plurality of emulated cores, upon detecting said flush pipeline event, move a location pointer or an index to point on an oldest flushed instruction in said history list, acquire new fetched instructions from the history list, progress the fetched instructions pointer to a next instruction in the history list, and upon reaching the end of the history list, returning to a normal operation wherein instructions are fetched from said virtual platform emulator.
5. The system (200) of any of the previous claims, wherein when said misprediction branch is detected, the system is adapted to enter into a sandbox mode wherein at least one false instruction for execution and committing in said pipeline model is sent by said virtual platform emulator, and upon identification of a misprediction, to instruct a rollback at a proxy layer of said virtual platform emulator (211) to said instruction branch for taking a correct branch decision.
6. The system (200) of any of the previous claims, wherein said mis-prediction branch is detected during said emulation of said plurality of block derived code instructions, said plurality of previously executed instructions comprises at least one false instruction for execution without committing in said pipeline model; wherein said processor is adapted to execute said code for instructing a flushing of said pipeline model in response to identification of branch mis-prediction and instructing a rollback at a proxy layer of said virtual platform emulator (211) to said instruction branch.
7. The system (200) of any of the previous claims, further comprising updating a dataset of instructions not yet committed, flushed from said plurality of block derived code instructions and instructing said virtual platform emulator (211) to add instructions from said dataset of instructions as said plurality of previously executed code instructions.
8. The system (200) of any of the previous claims, wherein said processor is adapted to instruct said virtual platform emulator (211) to emulate execution of an instruction block in response to said instruction request from said pipeline model exclusively when said requested instruction is a member of said instruction block.
9. The system (200) of any of the previous claims, wherein the virtual platform emulator (211) comprises a scheduler adapted to schedule processing of a next instructions block, wherein the scheduler is adapted to instruct said virtual platform emulator to emulate execution of the next instructions block when an instruction request received from said pipeline model comprises a code instruction that is a member of said next instructions block.
10. A method for simulating a multicore processor design, comprising:
receiving code instructions comprising a plurality of instructions blocks;
instructing a virtual platform emulator (211) to emulate an execution of an instructions block of the plurality of instructions blocks to generate a stream of a plurality of block derived code instructions based on said instructions block;
when a mis-prediction branch in an instruction branch of said instructions block is detected, instructing said virtual platform emulator (211) to add a plurality of dummy code instructions to said stream;
when a flush pipeline event is detected, instructing said virtual platform emulator (211) to add a plurality of previously executed instructions to said stream in an original execution order of said previously executed instructions;
in response to each of a plurality of sequential and independent instruction requests received from a pipeline model, sequentially and independently fetching and transferring each of said plurality of block derived code instructions and at least one of said plurality of dummy code instructions and said plurality of previously executed instructions for execution and committing in said pipeline model.
PCT/EP2017/053462 2017-02-16 2017-02-16 A method and system to fetch multicore instruction traces from a virtual platform emulator to a performance simulation model WO2018149495A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/EP2017/053462 WO2018149495A1 (en) 2017-02-16 2017-02-16 A method and system to fetch multicore instruction traces from a virtual platform emulator to a performance simulation model
CN201780039897.0A CN109690536B (en) 2017-02-16 2017-02-16 Method and system for fetching multicore instruction traces from virtual platform emulators to performance simulation models

Publications (1)

Publication Number Publication Date
WO2018149495A1 true WO2018149495A1 (en) 2018-08-23








Also Published As

Publication number Publication date
CN109690536A (en) 2019-04-26
CN109690536B (en) 2021-03-23


Legal Events

121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 17705406; Country of ref document: EP; Kind code of ref document: A1)
NENP: Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 17705406; Country of ref document: EP; Kind code of ref document: A1)