
WO2018149495A1 - A method and system to fetch multicore instruction traces from a virtual platform emulator to a performance simulation model - Google Patents

Info

Publication number
WO2018149495A1
WO2018149495A1 (PCT/EP2017/053462)
Authority
WO
WIPO (PCT)
Prior art keywords
instructions
instruction
block
code
virtual platform
Prior art date
Application number
PCT/EP2017/053462
Other languages
French (fr)
Inventor
Ori Chalak
Shlomo PONGRATZ
Haibin Wang
Zuguang WU
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to PCT/EP2017/053462 priority Critical patent/WO2018149495A1/en
Priority to CN201780039897.0A priority patent/CN109690536B/en
Publication of WO2018149495A1 publication Critical patent/WO2018149495A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/30Circuit design
    • G06F30/32Circuit design at the digital level
    • G06F30/33Design verification, e.g. functional simulation or model checking
    • G06F30/3308Design verification, e.g. functional simulation or model checking using simulation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3457Performance evaluation by simulation
    • G06F11/3461Trace driven simulation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/30Circuit design
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/30Circuit design
    • G06F30/32Circuit design at the digital level
    • G06F30/33Design verification, e.g. functional simulation or model checking
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G06F11/3414Workload generation, e.g. scripts, playback
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2115/00Details relating to the type of the circuit
    • G06F2115/10Processors

Definitions

  • the present invention in some embodiments thereof, relates to a method for simulating a hardware processor design, and more specifically but not exclusively, to a method to simulate a multi-core processor design by a virtual platform emulator and a multi-core pipeline model.
  • CPU central processing unit
  • IPC instructions per cycle
  • L2 layer 2
  • OS operating system
  • MIPS million instructions per second
  • a model may be created to simulate the performance of the design.
  • the performance goals of a CPU may be measured with a benchmark, comprising an executable program code that when executed generates processing loads on a CPU and automatically measures various performance metrics. Predicting the eventual performance of a manufactured CPU may be accomplished by a model of the CPU executing a benchmark.
  • a model may be a software code executing on a computing processor that simulates the behavior of the modeled CPU.
  • a CPU benchmark program code may be executed by a CPU model. The degree that a model accurately predicts a CPU performance is impacted by how the model executes flush events and branch mis-prediction.
  • a flush event occurs when the CPU detects an error, for example reading a calculated value from a register before the instruction to write the calculated value has completed, referred to in the art as a read-after-write (RAW) hazard.
  • RAW read-after-write
  • a branch mis-prediction occurs when a branch instruction is dependent on a calculation, and the processor incorrectly predicts the outcome of the branch prior to completing the calculation.
  • ASIM and ZSIM are simulators, but they do not run on unmodified general purpose OS and do not accurately simulate multithreaded and/or multi-process benchmarks.
  • SimOS may run applications on multiple OS but is difficult to adapt to a specific CPU design.
  • SimpleScalar supports multithread and multi-process benchmarks, however the results of benchmarks may not reflect the performance of the modeled CPU.
  • a system for simulating a multicore processor design comprising: an input/output interface, a processor, a virtual platform emulator, a performance simulation model comprising at least one pipeline model, wherein the input/output interface is adapted to receive code instructions comprising a plurality of instructions blocks.
  • the processor is adapted to execute a code for instructing the virtual platform emulator to emulate an execution of an instructions block of the plurality of instructions blocks to generate a stream of a plurality of block derived code instructions based on the instructions block.
  • a mis-prediction branch in an instruction branch of the instructions block is detected, instructing the virtual platform emulator to add a plurality of dummy code instructions to the stream.
  • the stream of a plurality of block derived code instructions is stored in a memory or cache of the virtual platform emulator.
  • This aspect provides the advantages of a system for simulating a processor design including real time processing delays caused by branch mis-prediction and flushing of a pipeline.
  • a method for simulating a processor design comprising: receiving code instructions comprising a plurality of instructions blocks, and instructing a virtual platform emulator to emulate an execution of an instructions block of the plurality of instructions blocks to generate a stream of a plurality of block derived code instructions based on the instructions block.
  • a mis-prediction branch in an instruction branch of the instructions block is detected instructing the virtual platform emulator to add a plurality of dummy code instructions to the stream.
  • the processor is adapted to instruct the virtual platform emulator by a plurality of API instructions.
  • This implementation form provides the advantage of providing a defined set of function calls to enable a virtual platform emulator to supply traces of instruction execution to a pipeline model.
  • the processor is adapted to execute the code for each of a plurality of emulated cores of an emulated multicore processor in parallel.
  • the processor is adapted to maintain a history list of fetched instructions for each of the plurality of emulated cores or each of a plurality of hardware threads, upon detecting the flush pipeline event, move a location pointer or an index to point on an oldest flushed instruction in the history list, acquire new fetched instructions from the history list, progress the fetched instructions pointer to a next instruction in the history list, and upon reaching the end of the history list, returning to a normal operation wherein instructions are fetched from the virtual platform emulator.
  • This implementation provides the advantages of enabling instructing flushing a pipeline model of loaded instructions, and fetch new instructions with minimal latency by means of maintaining a history list of fetched instructions.
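The history-list mechanism described above can be sketched as a small class. This is an illustrative sketch only; the names `HistoryFetcher`, `fetch`, and `flush` are assumptions, not the patent's API, and the emulator is stood in for by a plain Python iterator.

```python
class HistoryFetcher:
    """Keeps a per-core history of fetched instructions so a pipeline
    flush can replay them without returning to the emulator."""

    def __init__(self, emulator_stream):
        self._emulator = iter(emulator_stream)  # stand-in for the VPE
        self._history = []                      # instructions fetched so far
        self._replay_index = None               # None => normal operation

    def fetch(self):
        # While replaying, serve instructions from the history list.
        if self._replay_index is not None:
            instr = self._history[self._replay_index]
            self._replay_index += 1
            if self._replay_index == len(self._history):
                self._replay_index = None       # end of history: back to normal
            return instr
        # Normal operation: fetch from the emulator and record in history.
        instr = next(self._emulator)
        self._history.append(instr)
        return instr

    def flush(self, oldest_flushed):
        # Move the pointer to the oldest flushed instruction in the history.
        self._replay_index = self._history.index(oldest_flushed)
```

After a flush, fetches are served from the history list until it is exhausted, at which point normal fetching from the emulator resumes, matching the "returning to a normal operation" step above.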
  • when the mis-prediction branch is detected, the system is adapted to enter into a sandbox mode wherein at least one false instruction for execution and committing in the pipeline model is sent by the virtual platform emulator, and wherein, upon identification of a mis-prediction, the system is adapted to instruct a rollback at a proxy layer of the virtual platform emulator to the instruction branch for taking a correct branch decision.
  • the plurality of previously executed instructions comprises at least one false instruction for execution without committing in the pipeline model.
  • the processor is adapted to execute the code for instructing a flushing of the pipeline model in response to identification of branch mis-prediction and to instruct a rollback at a proxy layer of the virtual platform emulator to the instruction branch.
  • the processor is further adapted to update a dataset of instructions flushed from the plurality of block derived code instructions and to instruct the virtual platform emulator to add instructions from the dataset of instructions as the plurality of previously executed code instructions.
  • the processor is adapted to instruct said virtual platform emulator to emulate execution of an instruction block in response to said instruction request from said pipeline model exclusively when said requested instruction is a member of said instruction block.
  • the virtual platform emulator comprises a scheduler adapted to schedule processing of a next instructions block, wherein the scheduler is adapted to instruct said virtual platform emulator to emulate execution of the next instructions block when an instruction request received from said pipeline model comprises a code instruction that is a member of said next instructions block.
  • FIG. 1 is a flowchart of a method for simulating a processor design, according to some embodiments of the present invention.
  • FIG. 2 is a schematic illustration of an exemplary system for simulating a processor design, according to some embodiments of the present invention.
  • FIG. 3A is a schematic diagram representing the connections between a virtual platform emulator and a pipeline model, according to some embodiments of the present invention.
  • FIG. 3B is a schematic diagram representing an application programming interface, according to some embodiments of the present invention.
  • FIG. 3C is a schematic diagram representing a pipeline model fetching of code instructions from a virtual platform emulator, according to some embodiments of the present invention.
  • FIG. 3D is a schematic illustration of scheduling execution of code instructions on multiple cores according to some embodiments of the present invention.
  • FIG. 4A is a schematic diagram representing the state of a pipeline model during a branch mis-prediction, according to some embodiments of the present invention.
  • FIG. 4B is a schematic illustration of pipeline model sequentially fetching instructions from a virtual platform emulator, according to some embodiments of the present invention.
  • FIG. 4C is a schematic diagram representing the program counter and sequence of instructions executed in a pipeline model and a virtual platform emulator when a branch mis-prediction occurs, according to some embodiments of the present invention.
  • FIG. 5A is a schematic diagram representing the state of program counter and core pipeline of Simulation Emulation system during a flush event, according to some embodiments of the present invention.
  • FIG. 5B is a schematic diagram representing a pipeline model before and after a flush event, according to some embodiments of the present invention.
  • FIG. 5C is a schematic diagram representing the program counter and sequence of instruction execution in a pipeline model and a virtual platform emulator when a flush event occurs, according to some embodiments of the present invention.
  • FIG. 6 is a schematic diagram of multiple pipeline model cores controlling multiple virtual platform emulator cores, according to some embodiments of the present invention.
  • the present invention in some embodiments thereof, relates to a method for simulating a hardware processor design, and more specifically but not exclusively, to a method to simulate a multi-core processor design by a virtual platform emulator and a multi-core pipeline model.
  • PSM performance simulation models
  • Modern CPUs may comprise single or multiple cores where each core may comprise a single or multiple hardware threads, and application benchmarks may need to run on multiple software threads and/or multiple processes. This in turn requires running the benchmark under an Operating System (OS).
  • OS Operating System
  • PSM are usually unable to run unmodified benchmarks under unmodified OS, which would require the PSM to implement very accurate and detailed functional behavior of the modeled CPU, for example memory management, communications management over networking interfaces, and the like.
  • VPE virtual platform emulators
  • QEMU virtual platform emulators
  • VPE code granularity
  • the smallest code granularity of a VPE is a block of instructions, comprising a set of instructions up to a branch or a maximum size.
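The block granularity described above can be sketched as a simple splitter: a block closes at a branch or at a maximum size. This is a hypothetical illustration; the `is_branch` predicate and the `max_block` limit are assumptions for the example, not values from the patent.

```python
def split_into_blocks(instructions, is_branch, max_block=4):
    """Group an instruction stream into blocks, each ending at a
    branch instruction or at the maximum block size."""
    blocks, current = [], []
    for instr in instructions:
        current.append(instr)
        # A block closes at a branch instruction or at the size limit.
        if is_branch(instr) or len(current) == max_block:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)  # trailing partial block
    return blocks
```

A VPE scheduler, as noted below, may then assign each resulting block to a single emulated core.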
  • a VPE scheduler may assign execution of an entire block of instructions to a single core, and schedule the next block of instructions to a different core.
  • the VPE is not adapted to guarantee any order of code instruction execution among multiple cores, and once in a while, the virtual platform switches to executing instruction blocks of a different core.
  • future implementations of virtual platforms may execute different instructions of different cores in parallel on different threads.
  • Multi-core CPUs have different scheduler code granularity, and may schedule the individual instructions from a block to multiple cores. For this reason it is difficult for a VPE to simulate multiple core CPUs.
  • VPE may not simulate core pipeline operations, for example branch mis-prediction and pipeline flush.
  • a block of code refers to groups of code instructions, for example a group of all code instructions until a branch instruction is encountered.
  • a PSM is a computer program that when executed on a processor simulates one or more hardware threads, referred to herein as cores, of a CPU on a clock cycle basis. Each core of a CPU may execute in parallel a separate software thread.
  • a PSM may be used for testing and benchmarking the performance of a modeled CPU design, including for example modeled CPU behavior during a misprediction and/or a flush event.
  • Mis-prediction of branches occurs in a CPU when a core mis-predicts the result of a calculation that determines a branch instruction.
  • the CPU fetches instructions ahead of execution. When a branch instruction is encountered, the following instruction depends on which branch is executed.
  • a CPU may predict the outcome of the branch instruction based on historical precedent and/or any other method.
  • the CPU identifies that a mis-prediction has occurred after completing execution of the branch calculation.
  • the CPU resets the program counter (PC) to the address of the correct branch, the correct instructions are fetched into the pipeline, and any incorrectly fetched instructions in the pipeline are not committed.
  • the state of the core including values of registers, is reset, or rolled back, to the state previous to executing incorrect branch instructions.
  • a flush event may result when a CPU detects error events, for example a RAW hazard, a Wait for Interrupt (WFI) instruction committed, an out of order read that executes prior to an earlier write where both access the same address, and the like.
  • a RAW hazard may occur when a write instruction to a register precedes, by one clock cycle, a read instruction from the same register.
  • each core pipeline performs the following five operations for each instruction during five consecutive clock cycles: Instruction Fetch, Instruction Decode, Execute, Memory access, and Register write back.
  • a write instruction writes a value to the register (Register write back) in the fifth clock cycle.
  • a read instruction that immediately follows the write instruction will read the register (Memory access) also in the fifth clock cycle, potentially reading an incorrect value from the register.
  • the read instruction is re-fetched, or re-played, in order to execute the correct read operation.
  • the instructions that are already in the core pipeline may be incorrect due to the RAW, so those instructions are not committed and/or are flushed.
  • the CPU may store recently fetched instructions in a short term memory or cache, for example L1 Cache and/or L2 Cache, to be available for re-fetching in the event of a flush.
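The five-stage timing above explains why a read immediately following a write to the same register is hazardous: both touch the register in the same cycle. A minimal sketch of that arithmetic, assuming one instruction issued per cycle (the function names are illustrative):

```python
# Stage order from the text: each instruction advances one stage per cycle.
STAGES = ["fetch", "decode", "execute", "memory", "writeback"]

def stage_cycle(issue_cycle, stage):
    # An instruction issued at cycle c reaches stage k at cycle c + k.
    return issue_cycle + STAGES.index(stage)

def raw_hazard(write_issue, read_issue):
    # Hazard if the read accesses the register (Memory access stage) no
    # later than the write commits it (Register write back stage).
    return stage_cycle(read_issue, "memory") <= stage_cycle(write_issue, "writeback")
```

For a write issued at cycle 1, write-back lands in cycle 5; a read issued at cycle 2 reaches its memory access also in cycle 5, reproducing the fifth-clock-cycle collision described above.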
  • a CPU fetches a stream of individual instructions and assigns each instruction to a processing core.
  • a PSM may model a single core of a multi-core CPU by processing individual instructions, and/or multiple cores of a CPU.
  • performance simulation of a CPU must model the real-time behavior, including fetching individual instructions to multiple cores that execute in parallel, recovering from branch mis-predictions, recovering from a flush event, and the like.
  • VPEs and PSMs each offer a subset of capabilities; neither provides a complete simulation of CPU performance.
  • a VPE may fetch and execute code instructions in blocks, which does not allow modeling parallel core processing. VPEs lack the ability to model branch mis-prediction, pipeline flushes, and individual code instruction processing.
  • PSMs do not emulate VPE services, for example memory management, power management, device management, register management, and the like. PSMs also lack the flexibility and scalability to run unmodified large scale OS and applications which limits the ability to test whether the CPU model meets design goals when running a variety of OS.
  • a VPE and a PSM are combined by means of a proxy layer to form an integrated model that simulates performance of a modeled CPU.
  • the integrated model may execute unmodified benchmarks under unmodified OS, fetch and simulate execution of individual code instructions, and simulate multiple core pipeline timing behavior including branch mis-prediction and flush events.
  • the proxy layer may provide function calls from an application programming interface (API) that enable interoperation between the VPE and PSM, for example API function calls to fetch single instructions from the VPE to the PSM, to control the progress of the VPE by the PSM, to synchronize VPE operations with PSM modeling of flush events and branch mis-prediction, and the like.
  • API application programming interface
  • the VPE comprises a scheduler for determining progressing from processing one block of code instructions to processing a next block of code instructions, and the scheduler is turned off and replaced with a modified scheduler.
  • the modified scheduler determines progressing to processing a next block of code instructions according to receiving an instruction fetch request from the PSM comprising a code instruction that is a member of the next instruction block.
  • the VPE scheduler is modified to process an entire block of code instructions and only to proceed to a next block of code instructions when a fetch request for an instruction from the next block is received from the PSM.
  • the PSM controls the progress of the VPE according to instruction fetch requests.
  • the VPE responds to an instruction fetch request with trace information generated by executing the instruction on the VPE. Trace information may comprise for example an opcode, a virtual and/or physical program counter address, a virtual and/or physical program counter of a next instruction, a virtual and/or physical address for loading and/or storing an instruction, and the like.
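The trace information listed above could be carried in a simple record type. The field names below are assumptions based on that list; the patent does not define a concrete structure.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InstructionTrace:
    """Per-instruction trace returned by the VPE for a fetch request."""
    opcode: str
    virtual_pc: int
    physical_pc: int
    next_virtual_pc: int
    next_physical_pc: int
    mem_virtual_addr: Optional[int] = None   # load/store address, if any
    mem_physical_addr: Optional[int] = None
```

The PSM only needs this record, not the VPE's internal state, to advance its pipeline model by one instruction.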
  • the integrated platform may simulate a branch mis-prediction by the PSM requesting an incorrect instruction as a result of a mis-prediction, and the VPE enters a sandbox mode where dummy instructions are fetched to the PSM and progress according to instruction fetch requests is suspended until the PSM corrects the mis-prediction and requests a correct instruction.
  • the VPE is prepared to fetch correct code instructions as soon as the PSM detects the mis-prediction.
  • the integrated platform may simulate a flush event by the PSM requesting the VPE to re-fetch recent instructions and ceasing to commit instructions in the core pipeline where the flush event occurred.
  • the PSM simulates the behavior of a CPU core.
  • a VPE may be a software code that executes on a host CPU that emulates an electronic system comprising a modeled CPU and peripheral devices such as virtual and physical memory, peripheral input-output devices such as Network Interface Card (NIC), block device, keyboards and/or screens, and the like.
  • the VPE may emulate the modeled CPU at a level of detail that an executable software code comprising machine code executable on the modeled CPU may execute on the VPE without any modification. For example, a benchmark software code for testing CPU performance may be executed by a VPE.
  • a VPE provides an interface for executing machine code comprising a virtual CPU, including for example multiple processor cores, registers, power management, virtual and physical memories, page table, interconnection buses, interrupt model, network connections, and the like.
  • a VPE may be OS independent, where performance goals of the CPU may be tested with multiple OSs.
  • the VPE may be executed on a host computer CPU where the CPU that the VPE emulates may be completely different and unrelated to the host computer CPU.
  • When a code executes on a VPE, the VPE translates the executable code instructions of the modeled CPU into machine code native to the host computer CPU, causes the native code to be executed, and updates the modeled CPU state according to the results of the execution.
  • While a VPE may emulate the CPU architecture and interface to an executable code and/or operating system, it does not have the ability to model CPU micro-architecture components such as pipelines and caches, or behavior when flush events or branch mis-predictions occur, and therefore has limited utility for predicting the performance of a CPU design.
  • the progress of the VPE is controlled by the PSM in the following manner: in response to an instruction fetch request from the PSM, the VPE fetches a block of executable instructions, for example from a memory in a host computer, processes the entire block, and returns the requested instruction with trace information to the PSM.
  • the processing comprises executing the instructions in the VPE emulated CPU by translating the executable instructions into machine code for the host computer CPU to execute, and storing the instruction trace of the emulated execution, comprising all information needed by the PSM to execute each of the instructions.
  • Subsequent instructions from the same block are supplied according to fetch requests from the PSM.
  • the VPE does not proceed to another block of code instructions until the PSM requests an instruction from another block.
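The fetch-driven progression described above can be sketched as follows: the VPE processes a whole block eagerly, serves individual traces from it, and only advances when the PSM requests an instruction outside the current block. All names are illustrative; blocks are represented here as precomputed lists of (pc, trace) pairs rather than emulated execution.

```python
class FetchDrivenVPE:
    """VPE stand-in whose progress is controlled by PSM fetch requests."""

    def __init__(self, blocks):
        # blocks: list of lists of (pc, trace) pairs, one list per block
        self._blocks = blocks
        self._current = {}          # pc -> trace for the current block
        self._block_index = -1

    def _emulate_next_block(self):
        # A real VPE would translate and execute the block here;
        # this sketch just indexes the precomputed traces.
        self._block_index += 1
        self._current = dict(self._blocks[self._block_index])

    def fetch(self, pc):
        if pc not in self._current:
            # Requested instruction belongs to the next block:
            # only now does the VPE proceed.
            self._emulate_next_block()
        return self._current[pc]
```

This mirrors the modified-scheduler behavior: block-to-block progress happens exclusively in response to a fetch request from the PSM.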
  • CPU branch mis-prediction is modeled by the PSM and VPE in the following manner: when PSM mis-predicts a branch instruction, as described above, the resulting instruction fetch requests to the VPE will be incorrect.
  • the VPE, which as described above had already executed all instructions in the block, identifies the requested instruction as incorrect, and enters a mode of operation, referred to herein as Sandbox mode.
  • Sandbox mode comprises the VPE sending dummy, no operation (NOP), and/or fake instructions, referred to herein as dummy instructions, to the PSM in response to the incorrect fetch request and suspending progressing according to fetch requests from the PSM.
  • the dummy instructions are sent in place of the correct instructions in order to prevent the correct instructions from changing the state of VPE PC and/or registers.
  • the VPE exits Sandbox mode when the correct instruction fetch request is received from the PSM.
  • the branch instruction that was mis-predicted is calculated, whereby the PSM detects the mis-prediction.
  • the PSM then performs a function call to the proxy layer API to recover from the mis-prediction.
  • the API function referred to herein as rollback(), informs the VPE to roll back the program counter from the dummy instructions to the correct branch instruction address, and to fetch the correct instruction.
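The Sandbox mode and rollback() recovery described above amount to a small state machine, sketched below. The class and method names are assumptions, NOP stands in for the dummy instructions, and the executed block is modeled as a pc-to-trace mapping.

```python
NOP = "nop"  # stand-in for the dummy/NOP instructions sent in Sandbox mode

class SandboxVPE:
    def __init__(self, traces):
        self._traces = traces       # pc -> trace for the executed block
        self.in_sandbox = False

    def fetch(self, pc):
        if pc not in self._traces:
            # Incorrect request after a mis-predicted branch: enter Sandbox
            # mode and send a dummy instruction so VPE state is untouched.
            self.in_sandbox = True
            return NOP
        self.in_sandbox = False     # correct request: exit Sandbox mode
        return self._traces[pc]

    def rollback(self, branch_pc):
        # PSM detected the mis-prediction: roll the PC back and resume
        # fetching from the correct branch address.
        self.in_sandbox = False
        return self.fetch(branch_pc)
```

In this sketch, dummy instructions flow until rollback() (or a correct fetch) arrives, matching the suspend-until-corrected behavior above.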
  • CPU flush events are modeled by the integrated model in the following manner: when the PSM detects a hazard, for example a RAW, a function call is made to the proxy layer API, referred to herein as replay(), which re-fetches a set of instructions that were recently fetched.
  • the PSM does not commit the instructions already in the pipeline that are associated with the hazard, and the re-fetched instructions are re-executed and committed. For example, by re-fetching the read operation in a RAW hazard, the write instruction has sufficient time to be completely processed, thereby eliminating the hazard.
  • the proxy layer between the PSM and VPE may be a set of executable code instructions executing on the host computer CPU, for example a Simulation Emulation system 200 as described below.
  • the proxy layer may comprise API function calls, for example a rollback() function that when called by the PSM signals the VPE to roll back the PC to a corrected branch instruction, a replay() function call that when called by the PSM re-fetches a set of previously fetched instructions, and the like.
  • the proxy layer may also provide memory management services to the PSM, for example storing previously fetched instructions in a non-volatile memory of the host computer to support the replay() function call, managing physical and virtual memory of the host computer, and the like.
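Putting the pieces together, the proxy layer surface might look like the facade below. All names and signatures are illustrative assumptions; the patent describes rollback() and replay() calls but does not publish concrete signatures, and the VPE is stubbed out.

```python
class ProxyLayer:
    """Mediates between PSM fetch requests and the VPE, and stores
    previously fetched instructions to support replay()."""

    def __init__(self, vpe):
        self._vpe = vpe
        self._fetched = []           # previously fetched traces

    def fetch(self, pc):
        trace = self._vpe.fetch(pc)
        self._fetched.append(trace)
        return trace

    def rollback(self, branch_pc):
        # Signal the VPE to roll the PC back to the corrected branch.
        return self._vpe.rollback(branch_pc)

    def replay(self, count):
        # Re-supply the last `count` previously fetched instructions
        # after a flush event.
        return self._fetched[-count:]
```

A PSM would call fetch() in its normal fetch stage, replay() on a detected hazard, and rollback() on a detected branch mis-prediction.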
  • the present invention in some embodiments thereof, provides a number of advantages over the existing art.
  • the integration of VPE and PSM provides microprocessor designers with a tool that decouples functionality from timing by emulating a modeled CPU code instruction interface, is compatible with multiple OS, and models clock cycle level timing of multi-core pipeline activities.
  • the throughput and operational behavior of multiple pipelines may be modeled for branch mis-prediction, pipeline flush, and re-fetching of flushed instructions.
  • the present invention may be a system, a method, and/or a computer program product.
  • the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
  • the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
  • the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
  • the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • LAN local area network
  • WAN wide area network
  • electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
  • FPGA field-programmable gate arrays
  • PLA programmable logic arrays
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the block may occur out of the order noted in the figures.
  • two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • FIG. 1 is a flowchart schematically representing Simulation Emulation method 100 for simulating a CPU, including processing of branch mis-prediction and flush events, according to some embodiments of the current invention.
  • Simulation Emulation method 100 may be executed by code instructions executing on a CPU, for example by code instructions executing on processor(s) 204 of Simulation Emulation system 200 as described below in FIG. 2.
  • Simulation Emulation method 100 begins when a request from a PSM to fetch an executable instruction is received by a VPE; in response, the VPE processes a block of instructions as described above and returns the requested instruction to the PSM.
  • code instructions from PSM 212 executing on processor 204 may instruct sending the request and receiving the instructions.
  • code instructions from VPE 211 executing on processor 204 may instruct responding to the request.
  • When the fetch request comprises a branch mis-prediction, the VPE proxy layer returns dummy instructions in place of the requested instructions, until the PSM requests fetching the correct instruction, as described below. When a flush event is detected, a request to re-fetch the recent instructions is sent, as described below.
  • Simulation Emulation method 100 comprises a method for multiple cores operating in parallel.
  • a PSM may model multiple cores, where each modeled core may control the progress of a corresponding VPE core, as described below in FIG. 6.
  • Simulation Emulation system 200 comprises an input/output (I/O) interface 202, processor(s) 204, and storage 208.
  • I/O input/output
  • Simulation Emulation system 200 is adapted to receive code instructions that are executable on a modeled CPU, for example from user device 260 as described below, and to simulate the performance of the Modeled CPU when executing the received code instructions, for example by executing on processor(s) 204 code in storage 208.
  • Simulation Emulation system 200 may comprise for example a server, a desktop computer, an embedded computing system, an industrial computer, a ruggedized computer, a laptop, a cloud computer, a private cloud, a public cloud, a hybrid cloud, and/or any other type of computing system.
  • Optionally, Simulation Emulation system 200 comprises a virtual machine (VM) in place of I/O 202, processor(s) 204, and storage 208.
  • I/O 202 may include one or more input interfaces, for example a Network Interface Card (NIC), a block device, a keyboard, a soft keyboard, a voice to text system, and/or any other data input interface.
  • I/O 202 may comprise one or more output interfaces, for example a screen, a touch screen, video display, and or any other visual display device.
  • Processor(s) 204 may comprise one or more hardware processors, a multi-core processor, and/or any other type of CPU.
  • Storage 208 may include one or more non-transitory persistent storage devices, for example, a hard drive, a Flash array and the like.
  • Simulation Emulation system 200 is connected to a network 230 via I/O 202.
  • I/O 202 may comprise a network interface card (NIC), a wireless router, and/or any other type of network interface adapted to communicate with network 230.
  • Network 230 may be any type of data network, for example, a local area network (LAN), an Ethernet LAN, a fiber optic LAN, a digital subscriber line (DSL), a wireless LAN, a broadband connection, an Internet connection using an Internet Service Provider (ISP) and/or any other type of computer network.
  • Network 230 may employ any type of data networking protocols, including transport control protocol and internet protocol (TCP/IP), user datagram protocol (UDP), and the like.
  • user device 260 may be connected to Simulation Emulation system 200, for example via network 230.
  • User device 260 may be a smartphone, a computer, and/or any other computing platform.
  • User device 260 may be adapted to transmit to Simulation Emulation system 200 via network 230 a plurality of code instructions to execute on a Modeled CPU, for example an OS and benchmark application for testing performance of a modeled CPU design.
  • Simulation Emulation method 100 may be executed by processor(s) 204 executing code from one or more software modules in storage 208, for example Process Manager 210, Virtual Platform Emulator 211, Pipeline Model 212, and/or proxy layer 213.
  • a software module refers to a plurality of program instructions stored in a non-transitory medium such as the storage 208 and executed by a processor such as the processor(s) 204.
  • Simulation Emulation method 100 begins by sending a request to fetch a code instruction; for example, code instructions from PSM 212 executing on processor(s) 204 may instruct sending a request to fetch a code instruction from VPE 211.
  • the fetch request, and/or all messaging and communication between PSM 212 and VPE 211 may be a proxy layer API function call, comprising code instructions from proxy layer 213 executing on processor(s) 204.
  • Proxy layer 213 may comprise code instructions that, when executed on processor(s) 204, instruct receiving and sending messages, for example a request to fetch a code instruction, a response to the request comprising the requested code instruction, and the like.
  • FIG. 3A is a schematic diagram representing the connections between code modules VPE 211, PSM 212, and proxy layer 213, according to some embodiments of the invention.
  • multiple virtual cores in VPE 211 may communicate with corresponding pipeline cores 330 in PSM 212 via proxy layer 213.
  • the communication is implemented by code instructions in proxy layer 213 executing on processor(s) 204, as described below in FIG. 3B.
  • the virtual platform emulator 211 comprises CPU cores 320, memory 214, and devices 216.
  • FIG. 3B is a schematic diagram representing the services provided by proxy layer 213 to PSM 212 and VPE 211, according to some embodiments of the invention.
  • Providing API function calls for sending messages, communication, data, instruction fetch requests, and/or requested instructions between PSM 212 and VPE 211.
  • Storing instruction traces comprising making a temporary copy of all instructions and corresponding trace that were fetched by PSM 212, for example storing traces in a local memory cache and/or first in first out (FIFO) record in storage 208.
  • Executing Replay() API function calls comprising re-fetching previously fetched instructions, for example fetching instructions from the instruction trace described above and/or from VPE 211.
  • Managing virtual and/or physical PCs, for example directing Proxy Layer 213 to change the value of a respective PC according to a Replay() function call by PSM 212.
  • Managing virtual and/or physical storage addresses, for example receiving a request from VPE 211 to fetch a block of code according to a virtual address, and translating the virtual address into a physical address in storage 208.
  • Managing power states, for example instructing VPE 211 and/or PSM 212 to cease operations upon detecting that a set of code instructions has been fully executed by VPE 211 and/or PSM 212.
  • Tracing packets, for example keeping a record of packets of data passed between VPE 211 and PSM 212.
  • Tracing disk I/O, for example keeping a record of content of storage 208 that is retrieved by VPE 211.
  • proxy layer 213 may comprise any additional code instructions that instruct any communication and/or interaction between PSM 212 and VPE 211.
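By way of a non-limiting illustration, two of the proxy layer services listed above, keeping a FIFO trace of fetched instructions and serving Replay() requests from that trace, may be sketched as follows. The class and method names (ProxyLayer, fetch, replay) are hypothetical and not part of the claimed embodiments:

```python
from collections import deque

# Illustrative sketch of proxy-layer trace storage and replay services.
class ProxyLayer:
    def __init__(self, vpe_fetch, history_depth=16):
        self._vpe_fetch = vpe_fetch                # callback into the VPE
        self._trace = deque(maxlen=history_depth)  # FIFO instruction trace

    def fetch(self, pc):
        # Forward the fetch request and keep a temporary copy of the trace.
        instr = self._vpe_fetch(pc)
        self._trace.append((pc, instr))
        return instr

    def replay(self, count):
        # Re-deliver the most recent instructions from the local trace
        # instead of re-emulating them in the VPE.
        return list(self._trace)[-count:]


# A toy program stands in for the VPE; dict.get plays the VPE fetch role.
proxy = ProxyLayer({10: "ADD", 11: "SUB", 12: "LOAD"}.get)
for pc in (10, 11, 12):
    proxy.fetch(pc)
replayed = proxy.replay(2)
```

A bounded deque is used here so that the trace naturally holds only the instructions that could still be in the PSM pipeline.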
  • FIG. 3C is a schematic diagram representing fetching of code instructions from VPE 211 to PSM 212, according to some embodiments of the present invention.
  • code instructions from PSM 212 executing on processor(s) 204 instruct sending a fetch instruction request to VPE 211.
  • code instructions from VPE 211 executing on processor(s) 204 instruct fetching a block of code instructions, processing the instructions in the block as described above, and as shown in 353 sending to PSM 212 the requested instruction.
  • PSM 212 requests and receives another instruction.
  • VPE 211 stores the traces of the generated code instructions until PSM 212 requests the corresponding code instruction.
  • The instructions to be fetched are identified according to a PC of the corresponding instruction, for example by code instructions from PSM 212 executing on processor(s) 204 specifying a PC for the requested instruction.
  • Code instructions from VPE 211 executing on processor(s) 204 instruct disabling a scheduler of VPE 211 and enabling a modified scheduler that progresses according to code fetch requests from PSM 212.
  • a scheduler may be code instructions in VPE 211 executing on processor(s) 204 that instruct the progress of each VPE core, and the modified scheduler instructs fetching blocks of code instructions to process according to fetch requests from PSM 212 as described above.
  • FIG. 3D is a schematic illustration of scheduling execution of code instructions on multiple cores, according to some embodiments of the invention.
  • Optionally, an entire block of code instructions is scheduled and executed on one VPE 211 core, followed by a next block of code instructions executed on another VPE 211 core.
  • PSM 212 cores execute code instructions individually or in smaller groups than a code block.
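By way of a non-limiting illustration, the block-level scheduling of FIG. 3D, in which the VPE emulates a whole instructions block on one core before the next block is emulated on another core, may be sketched as a round-robin assignment. The function name schedule_blocks is hypothetical:

```python
# Minimal sketch of block-level scheduling across emulated VPE cores,
# while PSM cores consume the resulting instructions one at a time.
def schedule_blocks(blocks, num_cores):
    """Assign whole instruction blocks to VPE cores in round-robin order."""
    assignment = {core: [] for core in range(num_cores)}
    for i, block in enumerate(blocks):
        assignment[i % num_cores].append(block)
    return assignment


blocks = [["ADD", "SUB"], ["LOAD", "STORE"], ["MUL", "DIV"]]
assignment = schedule_blocks(blocks, num_cores=2)
# Core 0 is assigned blocks 0 and 2; core 1 is assigned block 1.
```

Round-robin order is only one possible policy; the embodiments above require only that each block is emulated as a unit on a single VPE core.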
  • VPE 211 detects a branch mis-prediction of the code instruction requested by PSM 212.
  • code instructions from PSM 212 executing on processor(s) 204 that instruct comparing a predicted branch instruction with a calculation of a branch instruction, where a mis-match indicates a mis-prediction.
  • proxy layer 213 detects a mis-prediction by a mismatch between the PC of the requested instruction and the PC of the fetched instruction from VPE 211.
  • As shown in step 103, in response to the detected mis-prediction, VPE 211 optionally enters a Sandbox mode, as described above.
  • VPE 211 may respond to the fetch request and subsequent fetch requests from the PSM 212 by sending a dummy instruction, and exiting sandbox mode when a fetch request with the correct branch PC is received.
  • When the dummy instruction is fetched by PSM 212, it is input to the core pipeline. Prior to the commit stage of the pipeline, PSM 212 detects the mis-prediction as described above and performs the following actions to recover from the mis-prediction: the location of the branch mis-prediction is identified, and a recovery message is sent, for example as a proxy layer API rollback() function call, instructing a roll back to the branch point. PSM 212 does not commit the dummy instructions.
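By way of a non-limiting illustration, the Sandbox mode behavior described above, returning dummy instructions for mis-directed fetches until the correct branch target is requested, may be sketched as follows. The class name SandboxVPE and all field names are hypothetical:

```python
# Illustrative sketch of Sandbox mode: on a mis-directed fetch the VPE
# returns dummy instructions (never committed by the PSM) until the PSM
# requests the correct branch target, at which point sandbox mode exits.
class SandboxVPE:
    def __init__(self, program, correct_target):
        self.program = program            # dict: PC -> instruction
        self.correct_target = correct_target
        self.in_sandbox = False

    def fetch(self, pc):
        if self.in_sandbox:
            if pc == self.correct_target:
                self.in_sandbox = False   # rollback() restored the PC
                return self.program[pc]
            return "DUMMY"                # placeholder, never committed
        if pc not in self.program:        # mis-directed fetch detected
            self.in_sandbox = True
            return "DUMMY"
        return self.program[pc]


# Branch at PC 13 actually jumps to PC 80; the PSM mis-predicts and
# fetches PC 14 and 15 before rolling back to the correct target.
vpe = SandboxVPE({10: "ADD", 13: "BZ", 80: "TARGET"}, correct_target=80)
results = [vpe.fetch(pc) for pc in (13, 14, 15, 80)]
```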
  • FIG. 4A is a schematic diagram representing the state of PSM 212 PC during a branch mis-prediction, where each circle represents an instruction in a core pipeline, according to some embodiments of the current invention.
  • PSM 212 requests incorrect instructions to be fetched from VPE 211 due to a mis-prediction.
  • PSM 212 sets the PC to the correct branch instruction, and as shown in 403 the correct instruction is fetched.
  • FIG. 4B is a schematic illustration of PSM 212 sequentially fetching instructions from VPE 211, according to some embodiments of the current invention.
  • the thick arrows represent messages within PSM 212.
  • the medium arrows represent fetch requests from PSM 212.
  • the thin arrows represent fetch replies from VPE 211.
  • proxy layer 213 provides interoperability services to VPE 211 and PSM 212, as described above.
  • FIG. 4C is a schematic diagram representing the PC (program counter) and sequence of instruction execution in PSM 212 and VPE 211 when a branch mis-prediction occurs, according to some embodiments of the invention.
  • 471 represents the order of execution of instructions in a CPU with PC numbered from 10 to 81
  • 472 represents the order of execution of instructions in VPE
  • 473 represents the order of execution of instructions in a PSM 212 with PC numbered from 10 to 81 when a branch mis- prediction occurs
  • 474 represents the order of execution of committed instructions in PSM 212 after the rollback() API has been executed.
  • The instruction with PC 13 labeled "BZ LABEL" is a conditional branch instruction, and the following correct instruction is located at PC 80.
  • As shown in 472, code instructions from VPE 211 executing on processor(s) 204 instruct executing the instructions in the correct order, wherein the instruction at PC 80 is executed immediately after the branch instruction at PC 13.
  • As shown in 473, code instructions in PSM 212 executing on processor(s) 204 instruct mis-predicting the branch at PC 13, and instructions with PC 14, 15, and 16 are fetched from VPE 211.
  • code instructions in PSM 212 executing on processor(s) 204 instruct calling API function rollback(), as described above, resulting in fetching the correct instruction at PC 80, and not committing the incorrectly fetched instructions.
  • a flush pipeline event is generated and/or detected, for example by code instructions from PSM 212 executing on processor(s) 204 that instruct detection of a hazard.
  • PSM 212 generates one or more flush pipeline events during normal operations, and then detects the generated pipeline flush event.
  • the flush pipeline event may be the result of a branch mis-prediction and/or a pipeline hazard, for example a load/store data cache miss, a memory disambiguation for out of order or RAW hazard, a stalled pipeline, and the like.
  • Code instructions from PSM 212, code instructions from proxy layer 213, and/or code instructions from VPE 211 executing on processor(s) 204 may instruct the following when a flush event is detected by PSM 212:
  • PSM 212 calls the replay() API function, which requests to re-fetch, or replay, recently fetched instructions currently in the PSM 212 core pipeline.
  • Optionally, proxy layer 213 function replay() returns to PSM 212 the recently fetched instructions from a local cache, as described above.
  • Optionally, proxy layer 213 function replay() returns to PSM 212 the recently fetched instructions from VPE 211.
  • VPE 211 maintains a cache of recently fetched instructions which comprises the instructions currently in PSM 212 pipeline, for example a historical transactional memory stored in local cache and/or storage 208.
  • VPE 211 sends the re-fetched instructions to PSM 212.
  • PSM 212 does not commit the instructions in the pipeline associated with the hazard, and executes and commits the re-fetched instructions.
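By way of a non-limiting illustration, the flush recovery steps above, discarding the uncommitted instructions associated with the hazard and replaying them from a cache of recently fetched instructions in their original order, may be sketched as follows. The function name handle_flush is hypothetical:

```python
# Illustrative sketch of flush recovery: the PSM drops the uncommitted
# pipeline entries and re-inserts them from the fetch history, preserving
# the original execution order.
def handle_flush(pipeline, fetch_history, flushed_count):
    """Drop the last flushed_count uncommitted instructions and return
    the replayed instructions in their original execution order."""
    del pipeline[-flushed_count:]            # flushed, never committed
    replayed = fetch_history[-flushed_count:]
    pipeline.extend(replayed)                # re-fetched via replay()
    return replayed


pipeline = ["I10", "I11", "I12", "I13"]
history = ["I10", "I11", "I12", "I13"]      # cache of recent fetches
replayed = handle_flush(pipeline, history, flushed_count=2)
```

Because the replayed instructions come from the history cache, the VPE need not re-emulate the corresponding instructions block.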
  • Simulation Emulation method 100 is completed.
  • FIG. 5A is a schematic diagram representing the state of PC and core pipeline of Simulation Emulation system 200 during a flush event, according to some embodiments of the current invention.
  • each circle represents an instruction in PSM 212 core pipeline, for example by code instructions from PSM 212 executing on processor(s) 204.
  • a flush event is detected, as described above.
  • the curved arrow represents the result of the detected flush event, wherein the previously fetched instructions are re-fetched, as described above.
  • FIG. 5B is a schematic diagram representing a PSM 212 core pipeline before and after a flush event, according to some embodiments of the current invention.
  • each square represents a code instruction in a core pipeline, where code instructions proceed from left to right as they are processed by PSM 212.
  • The instruction that was most recently processed is referred to herein as the latest instruction.
  • The execution of instructions in the pipeline is not committed, and as a result of the replay() API function call, as described above, the latest instruction is re-fetched and is now at an earlier pipeline location.
  • FIG. 5C is a schematic diagram representing the PC and sequence of instruction execution in PSM 212 and VPE 211 when a flush event occurs, according to some embodiments of the invention.
  • 571 represents the PC of instructions in a core pipeline of a CPU with PC numbered from 10 to 81
  • 572 represents the PC of executed instructions in VPE 211 with PC numbered from 10 to 81
  • 573 represents the PC of instructions in a core pipeline of PSM 212 with PC numbered from 10 to 81 when a flush event occurs
  • 574 represents the PC of committed instructions in PSM 212 after the replay() API has been executed.
  • Code instructions in PSM 212 executing on processor(s) 204 instruct detecting the possible hazard, and by calling API function call replay() as described above, the most recent instructions, in this example with PC 13-16, are re-fetched.
  • Code instructions in PSM 212 executing on processor(s) 204 instruct not committing PC 13-16, and committing only the instructions PC 13-16 that were re-fetched.
  • FIG. 6 is a schematic diagram of multiple PSM core pipelines controlling multiple VPE 211 cores, according to some embodiments of the current invention.
  • Bold arrow lines represent the movement of instructions as they are retrieved by VPE 211 from code blocks and fetched to PSM 212.
  • PSM 212 comprises code instructions that when executed on processor(s) 204 instruct multiple CPU core pipeline models, where each core pipeline model controls progress of a corresponding VPE 211 core. For example, as shown in 601 and 602, PSM core "0" controls the progress of corresponding VPE 211 core "0", and as shown in 603 and 604, PSM 212 core "1" controls VPE 211 core "1".
  • Code instructions in process manager 210 when executed on processor(s) 204 may instruct assigning code instructions to individual PSM 212 core pipelines.
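By way of a non-limiting illustration, the per-core pairing of FIG. 6, in which each PSM core pipeline model controls the progress of exactly one corresponding VPE core, may be sketched as follows. The class names VpeCore and PsmCore are hypothetical:

```python
# Illustrative sketch of FIG. 6: each PSM pipeline model drives the
# progress of its paired VPE core through its own fetch requests.
class VpeCore:
    def __init__(self, trace):
        self._trace = iter(trace)   # instructions emulated for this core

    def step(self):
        # The VPE core only advances when its PSM core asks for more.
        return next(self._trace)


class PsmCore:
    def __init__(self, vpe_core):
        self.vpe_core = vpe_core    # one VPE core per pipeline model
        self.committed = []

    def cycle(self):
        # Fetch one instruction from the paired VPE core and commit it.
        self.committed.append(self.vpe_core.step())


# Two PSM cores each control their own VPE core, advancing in parallel.
cores = [PsmCore(VpeCore(["A0", "A1"])), PsmCore(VpeCore(["B0", "B1"]))]
for _ in range(2):
    for core in cores:              # one fetch per core per cycle
        core.cycle()
```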
  • The terms VPE and PSM are intended to include all such new technologies a priori.
  • The term "consisting essentially of" means that the composition or method may include additional ingredients and/or steps, but only if the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method.
  • For example, the term "a compound" or "at least one compound" may include a plurality of compounds, including mixtures thereof.
  • The description of ranges in a range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.


Abstract

A system and a method for simulating a multicore processor design is provided. The system comprises an input/output interface, a processor, a virtual platform emulator, and a performance simulation model comprising at least one pipeline model. The input/output interface receives code instructions comprising a plurality of instructions blocks. The processor executes a code for instructing the virtual platform emulator to emulate an execution of an instructions block of the plurality of instructions blocks to generate a stream of a plurality of block derived code instructions based on the instructions block. When a mis-prediction branch in an instruction branch of the instructions block is detected, the processor instructs the virtual platform emulator to add a plurality of dummy code instructions to the stream. When a flush pipeline event is detected the processor instructs the virtual platform emulator to add a plurality of previously executed instructions to the stream in an original execution order of the previously executed instructions, and in response to each of a plurality of sequential and independent instruction requests received from a pipeline model to sequentially and independently fetch and transfer each of the plurality of block derived code instructions and at least one of the plurality of dummy code instructions and the plurality of previously executed instructions for execution and committing in the pipeline model.

Description

A METHOD AND SYSTEM TO FETCH MULTICORE INSTRUCTION TRACES FROM A VIRTUAL PLATFORM EMULATOR TO A PERFORMANCE SIMULATION MODEL
BACKGROUND
The present invention, in some embodiments thereof, relates to a method for simulating a hardware processor design, and more specifically but not exclusively, to a method to simulate a multi-core processor design by a virtual platform emulator and a multi-core pipeline model.
Development of new microprocessors is an expensive and time-consuming process. This problem is even more challenging in multiple-core architectures, where overall performance of a single or multiple core hardware processor, referred to herein as a central processing unit (CPU), is dependent on multiple independent core processors. Each new CPU design is expected to meet performance goals and/or metrics, for example a specific number of instructions per cycle (IPC), a specific hit ratio of layer one (L1) and/or layer two (L2) cache, the ability to execute multiple operating systems (OS), how many million instructions per second (MIPS) are executed, the ability to run multiple CPU benchmark application programs, and other performance metrics.
In order to reduce the time and cost of CPU design, it is desirable prior to manufacturing the CPU to be able to predict whether the CPU design meets performance goals. In order to predict whether a proposed design will reach the stated goals, a model may be created to simulate the performance of the design.
The performance goals of a CPU may be measured with a benchmark, comprising an executable program code that when executed generates processing loads on a CPU and automatically measures various performance metrics. Predicting the eventual performance of a manufactured CPU may be accomplished by a model of the CPU executing a benchmark.
A model may be a software code executing on a computing processor that simulates the behavior of the modeled CPU. For example, a CPU benchmark program code may be executed by a CPU model. The degree that a model accurately predicts a CPU performance is impacted by how the model executes flush events and branch mis-prediction.
A flush event occurs when the CPU detects an error, for example reading a calculated value from a register before the instruction to write the calculated value has completed, referred to in the art as read-after-write (RAW) hazard. After a flush event is detected, the instructions already in the pipeline are flushed, meaning the results of executing the instructions are not committed by the CPU.
A branch mis-prediction occurs when a branch instruction is dependent on a calculation, and the processor incorrectly predicts the outcome of the branch prior to completing the calculation.
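By way of a non-limiting illustration, a branch mis-prediction under a simple "predict not-taken" policy (an assumed policy, used only for illustration) may be sketched as follows:

```python
# Toy illustration of branch mis-prediction: the predictor guesses the
# branch outcome before the condition register value is available.
def predict_not_taken():
    return False                       # predicted outcome: not taken

def resolve_branch(register_value):
    return register_value == 0         # actual outcome, known only later

predicted = predict_not_taken()
actual = resolve_branch(0)             # the branch is actually taken
mispredicted = predicted != actual     # mismatch reveals mis-prediction
```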
While individual existing models may provide a subset of critical features of CPU models, there is currently no single model that provides a complete solution.
A number of existing products provide partial answers to the challenges described here. For example, ASIM and ZSIM are simulators, but they do not run on unmodified general purpose OS and do not accurately simulate multithreaded and/or multi-process benchmarks. SimOS may run applications on multiple OS but is difficult to adapt to a specific CPU design. SimpleScalar supports multithread and multi-process benchmarks, however the results of benchmarks may not reflect the performance of the modeled CPU.
SUMMARY
It is an object of the present invention to provide a system, a computer program product, and a method for simulating a processor design.
The foregoing and other objects are achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.
According to a first aspect of the invention there is provided a system for simulating a multicore processor design, comprising: an input/output interface, a processor, a virtual platform emulator, a performance simulation model comprising at least one pipeline model, wherein the input/output interface is adapted to receive code instructions comprising a plurality of instructions blocks. The processor is adapted to execute a code for instructing the virtual platform emulator to emulate an execution of an instructions block of the plurality of instructions blocks to generate a stream of a plurality of block derived code instructions based on the instructions block. When a mis-prediction branch in an instruction branch of the instructions block is detected, instructing the virtual platform emulator to add a plurality of dummy code instructions to the stream. When a flush pipeline event is detected (in the instruction block) instructing the virtual platform emulator to add a plurality of previously executed instructions to the stream in an original execution order of the previously executed instructions, and in response to each of a plurality of sequential and independent instruction requests received from a pipeline model sequentially and independently fetching and transferring each of the plurality of block derived code instructions and at least one of the plurality of dummy code instructions and the plurality of previously executed instructions for execution and committing in the pipeline model.
Preferably, the stream of a plurality of block derived code instructions is stored in a memory or cache of the virtual platform emulator.
This aspect provides the advantages of a system for simulating a processor design including real time processing delays caused by branch mis-prediction and flushing of a pipeline.
According to a second aspect of the invention there is provided a method for simulating a processor design, comprising: receiving code instructions comprising a plurality of instructions blocks, and instructing a virtual platform emulator to emulate an execution of an instructions block of the plurality of instructions blocks to generate a stream of a plurality of block derived code instructions based on the instructions block. When a mis-prediction branch in an instruction branch of the instructions block is detected instructing the virtual platform emulator to add a plurality of dummy code instructions to the stream. When a flush pipeline event is detected instructing the virtual platform emulator to add a plurality of previously executed instructions to the stream in an original execution order of the previously executed instructions, and in response to each of a plurality of sequential and independent instruction requests received from a pipeline model sequentially and independently fetching and transferring each of the plurality of block derived code instructions and at least one of the plurality of dummy code instructions and the plurality of previously executed instructions for execution and committing in the pipeline model. This aspect provides the advantages of a method for simulating a processor design including real time processing delays caused by branch mis-prediction and flushing of a pipeline.
In a further implementation form of the first and/or second aspects, the processor is adapted to instruct the virtual platform emulator by a plurality of API instructions. This implementation form provides the advantage of providing a defined set of function calls to enable a virtual platform emulator to supply traces of instruction execution to a pipeline model.
In a further implementation form of the first and/or second aspects, the processor is adapted to execute the code for each of a plurality of emulated cores of an emulated multicore processor in parallel.
In a further implementation form of the first and/or second aspects, the processor is adapted to maintain a history list of fetched instructions for each of the plurality of emulated cores or each of a plurality of hardware threads, upon detecting the flush pipeline event, move a location pointer or an index to point on an oldest flushed instruction in the history list, acquire new fetched instructions from the history list, progress the fetched instructions pointer to a next instruction in the history list, and upon reaching to the end of the history list, returning to a normal operation wherein instructions are fetched from the virtual platform emulator. This implementation provides the advantages of enabling instructing flushing a pipeline model of loaded instructions, and fetch new instructions with minimal latency by means of maintaining a history list of fetched instructions.
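By way of a non-limiting illustration, the history-list mechanism of this implementation form, rewinding a pointer to the oldest flushed instruction and serving fetches from the history list until it is exhausted, may be sketched as follows. The class name HistoryFetcher and its members are hypothetical:

```python
# Sketch of the per-core history list: after a flush, an index rewinds to
# the oldest flushed instruction and fetches are served from the history
# list; when the list is exhausted, normal VPE fetching resumes.
class HistoryFetcher:
    def __init__(self, vpe_fetch):
        self._vpe_fetch = vpe_fetch
        self.history = []          # one list per emulated core in practice
        self.replay_index = None   # None means normal operation

    def fetch(self, pc):
        if self.replay_index is not None:
            instr = self.history[self.replay_index]
            self.replay_index += 1
            if self.replay_index == len(self.history):
                self.replay_index = None   # end of list: normal operation
            return instr
        instr = self._vpe_fetch(pc)
        self.history.append(instr)
        return instr

    def flush(self, flushed_count):
        # Point at the oldest flushed instruction in the history list.
        self.replay_index = len(self.history) - flushed_count


f = HistoryFetcher({10: "ADD", 11: "SUB", 12: "LOAD"}.get)
normal = [f.fetch(pc) for pc in (10, 11, 12)]
f.flush(2)                                  # last two instructions flushed
replayed = [f.fetch(pc) for pc in (11, 12)]
```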
In a further implementation form of the first and/or second aspects, when the mis-prediction branch is detected, the system is adapted to enter into a sandbox mode wherein at least one false instruction for execution and committing in the pipeline model is sent by the virtual platform emulator, and wherein, upon identification of a mis-prediction the system is adapted to instruct a rollback at a proxy layer of the virtual platform emulator to the instruction branch for taking a correct branch decision. This implementation provides the advantages of enabling a pipeline model to simulate detection and recovery of branch mis-prediction in a modeled processor design.
In a further implementation form of the first and/or second aspects, when the mis-prediction branch is detected during the emulation of the plurality of block derived code instructions, the plurality of previously executed instructions comprises at least one false instruction for execution without committing in the pipeline model, wherein the processor is adapted to execute the code for instructing a flushing of the pipeline model in response to identification of branch mis-prediction and to instruct a rollback at a proxy layer of the virtual platform emulator to the instruction branch.
In a further implementation form of the first and/or second aspects, the processor is further adapted to update a dataset of instructions flushed from the plurality of block derived code instructions and to instruct the virtual platform emulator to add instructions from the dataset of instructions as the plurality of previously executed code instructions. This implementation provides the advantages of enabling a virtual platform emulator to send dummy instructions to a pipeline model when the pipeline model has requested incorrect instructions, for example when a pipeline model has mis-predicted a branch.
In a further implementation form of the first and/or second aspects, the processor is adapted to instruct said virtual platform emulator to emulate execution of an instruction block in response to said instruction request from said pipeline model exclusively when said requested instruction is a member of said instruction block.
In a further implementation form of the first and/or second aspects, the virtual platform emulator comprises a scheduler adapted to schedule processing of a next instructions block, wherein the scheduler is adapted to instruct said virtual platform emulator to emulate execution of the next instructions block when an instruction request received from said pipeline model comprises a code instruction that is a member of said next instructions block.
Unless otherwise defined, all technical and/or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the invention pertains. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments of the invention, exemplary methods and/or materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting. BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
Some embodiments of the invention are herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of embodiments of the invention. In this regard, the description taken with the drawings makes apparent to those skilled in the art how embodiments of the invention may be practiced.
In the drawings:
FIG. 1 is a flowchart of a method for simulating a processor design, according to some embodiments of the present invention;
FIG. 2 is schematic illustration of an exemplary system for simulating a processor design, according to some embodiments of the present invention;
FIG. 3A is a schematic diagram representing the connections between a virtual platform emulator and a pipeline model, according to some embodiments of the present invention;
FIG. 3B is a schematic diagram representing an application programming interface, according to some embodiments of the present invention;
FIG. 3C is a schematic diagram representing a pipeline model fetching of code instructions from a virtual platform emulator, according to some embodiments of the present invention;
FIG. 3D is a schematic illustration of scheduling execution of code instructions on multiple cores according to some embodiments of the present invention;
FIG. 4A is a schematic diagram representing the state of a pipeline model during a branch mis-prediction, according to some embodiments of the present invention;
FIG. 4B is a schematic illustration of pipeline model sequentially fetching instructions from a virtual platform emulator, according to some embodiments of the present invention;
FIG. 4C is a schematic diagram representing the program counter and sequence of instructions executed in a pipeline model and a virtual platform emulator when a branch mis-prediction occurs, according to some embodiments of the present invention;
FIG. 5A is a schematic diagram representing the state of the program counter and core pipeline of the Simulation Emulation system during a flush event, according to some embodiments of the present invention;
FIG. 5B is a schematic diagram representing a pipeline model before and after a flush event, according to some embodiments of the present invention;
FIG. 5C is a schematic diagram representing the program counter and sequence of instruction execution in a pipeline model and a virtual platform emulator when a flush event occurs, according to some embodiments of the present invention; and
FIG. 6 is a schematic diagram of multiple pipeline model cores controlling multiple virtual platform emulator cores, according to some embodiments of the present invention.
DETAILED DESCRIPTION
The present invention, in some embodiments thereof, relates to a method for simulating a hardware processor design, and more specifically but not exclusively, to a method to simulate a multi-core processor design by a virtual platform emulator and a multi-core pipeline model.
Architectural timing simulation models, referred to herein as performance simulation models (PSM), are used to explore and optimize the performance of multi- core CPU micro-architecture design at an early stage of development. The models may be software codes that execute on a hardware processor of a host platform.
A modern CPU may comprise one or more cores, where each core may comprise one or more hardware threads, and application benchmarks may need to run on multiple software threads and/or multiple processes. This in turn requires running the benchmark under an Operating System (OS).
PSMs are usually unable to run unmodified benchmarks under an unmodified OS, as this would require the PSM to implement very accurate and detailed functional behavior of the modeled CPU, for example memory management, communications management over networking interfaces, and the like.
There are virtual platform emulators (VPE), such as QEMU, that operate with an unmodified OS and largely unmodified benchmark programs and model detailed functional behavior. Integrating a VPE with a PSM enables decoupling functionality from timing: the VPE knows everything about functionality and nothing about timing, while the PSM knows nothing about functionality and everything about timing.
However, processing of instructions by a VPE is difficult to control in terms of code granularity. The smallest code granularity of a VPE is a block of instructions, comprising a set of instructions up to a branch or a maximum size. A VPE scheduler may assign execution of an entire block of instructions to a single core, and schedule the next block of instructions to a different core. The VPE is not adapted to guarantee any order of code instruction execution among multiple cores, and from time to time the virtual platform switches to executing instruction blocks of a different core. Furthermore, future implementations of virtual platforms may execute different instructions of different cores in parallel on different threads. A multi-core CPU has a different scheduler code granularity, and may schedule individual instructions from a block to multiple cores. For this reason it is difficult for a VPE to simulate multi-core CPUs.
Furthermore, the VPE may not simulate core pipeline operations, for example branch mis-prediction and pipeline flush.
A block of code refers to a group of code instructions, for example a group of all code instructions until a branch instruction is encountered.
Another type of CPU model is a PSM. A PSM is a computer program that, when executed on a processor, simulates one or more hardware threads, referred to herein as cores, of a CPU on a clock cycle basis. Each core of a CPU may execute a separate software thread in parallel.
A PSM may be used for testing and benchmarking the performance of a modeled CPU design, including for example modeled CPU behavior during a misprediction and/or a flush event.
Branch mis-prediction occurs in a CPU when a core mis-predicts the result of a calculation that determines a branch instruction. In order to increase the speed of code execution, the CPU fetches instructions ahead of execution. When a branch instruction is encountered, the following instruction depends on which branch is executed. A CPU may predict the outcome of the branch instruction based on historical precedent and/or any other method. The CPU identifies that a mis-prediction has occurred after completing execution of the branch calculation. To recover from a mis-prediction, the CPU resets the program counter (PC) to the address of the correct branch, the correct instructions are fetched into the pipeline, and any incorrectly fetched instructions in the pipeline are not committed. The state of the core, including the values of registers, is reset, or rolled back, to the state prior to executing the incorrect branch instructions.
A flush event may result when a CPU detects error events, for example a RAW hazard, a Wait for Interrupt (WFI) instruction being committed, an out-of-order read that executes prior to an earlier write where both access the same address, and the like. For example, a RAW hazard may occur when a write instruction precedes, by one clock cycle, a read instruction to the same register. For example, when a CPU requires five clock cycles to execute each instruction, each core pipeline performs the following five operations for each instruction during five consecutive clock cycles: Instruction Fetch, Instruction Decode, Execute, Memory access, and Register write back. A write instruction writes a value to the register (Register write back) in the fifth clock cycle. However, a read instruction that immediately follows the write instruction will read the register (Memory access) also in the fifth clock cycle, potentially reading an incorrect value from the register.
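For illustration only, the five-stage timing described above may be sketched as follows; the code structure and function names are illustrative assumptions and are not part of the specification:

```python
# Sketch of the five-stage timing described above: each instruction enters
# the pipeline one cycle after its predecessor, so stage k of the i-th
# instruction occurs in cycle i + k (0-indexed). A read that immediately
# follows a write reaches its register-access stage in the same cycle as
# the write's write-back stage, producing a read-after-write (RAW) hazard.

STAGES = ["Fetch", "Decode", "Execute", "Memory", "WriteBack"]

def stage_cycle(issue_order, stage_name):
    """Cycle (0-indexed) in which a given instruction reaches a stage."""
    return issue_order + STAGES.index(stage_name)

def raw_hazard(write_order, read_order):
    """A RAW hazard occurs when the read accesses the register file no
    later than the cycle in which the write commits its result."""
    return stage_cycle(read_order, "Memory") <= stage_cycle(write_order, "WriteBack")

# A write issued at position 0 commits in cycle 4; a read issued at
# position 1 accesses memory in cycle 1 + 3 = 4, the same cycle, so the
# value may be stale and the read must be re-played.
print(raw_hazard(0, 1))  # True: hazard
print(raw_hazard(0, 2))  # False: one extra cycle separates them
```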
When the flush event is detected by the CPU, for example according to a specific sequence of a "read" instruction following a "write" instruction, the read instruction is re-fetched, or re-played, in order to execute the correct read operation. The instructions that are already in the core pipeline may be incorrect due to the RAW, so those instructions are not committed, and/or are flushed. In order to minimize the latency caused by re-fetching instructions after a detected hazard, the CPU may store recently fetched instructions in a short-term memory or cache, for example L1 Cache and/or L2 Cache, to be available for re-fetching in the event of a flush.
In order to execute code instructions in parallel, a CPU fetches a stream of individual instructions and assigns each instruction to a processing core. A PSM may model a single core of a multi-core CPU by processing individual instructions, and/or multiple cores of a CPU.
As described above, performance simulation of a CPU must model the real-time behavior, including fetching individual instructions to multiple cores that execute in parallel, recovering from branch mis-predictions, recovering from a flush event, and the like. However, while VPEs and PSMs each offer a subset of capabilities, neither provides a complete simulation of CPU performance.
A VPE may fetch and execute code instructions in blocks, which does not allow modeling parallel core processing. VPEs lack the ability to model branch mis-prediction, pipeline flushes, and individual code instruction processing.
PSMs do not emulate VPE services, for example memory management, power management, device management, register management, and the like. PSMs also lack the flexibility and scalability to run unmodified large-scale OSs and applications, which limits the ability to test whether the CPU model meets design goals when running a variety of OSs.
In exemplary embodiments, a VPE and a PSM are combined by means of a proxy layer to form an integrated model that simulates performance of a modeled CPU. The integrated model may execute unmodified benchmarks under unmodified OS, fetch and simulate execution of individual code instructions, and simulate multiple core pipeline timing behavior including branch mis-prediction and flush events.
Optionally, the proxy layer may provide function calls from an application programming interface (API) that enable interoperation between the VPE and PSM, for example API function calls to fetch single instructions from the VPE to the PSM, to control the progress of the VPE by the PSM, to synchronize VPE operations with PSM modeling of flush events and branch mis-prediction, and the like.
Optionally, the VPE comprises a scheduler for determining progressing from processing one block of code instructions to processing a next block of code instructions, and the scheduler is turned off and replaced with a modified scheduler. The modified scheduler determines progressing to processing a next block of code instructions according to receiving an instruction fetch request from the PSM comprising a code instruction that is a member of the next instruction block.
Optionally, the VPE scheduler is modified to process an entire block of code instructions and only to proceed to a next block of code instructions when a fetch request for an instruction from the next block is received from the PSM. Advantageously, the PSM controls the progress of the VPE according to instruction fetch requests. Optionally, the VPE responds to an instruction fetch request with trace information generated by executing the instruction on the VPE. Trace information may comprise, for example, an opcode, a virtual and/or physical program counter address, a virtual and/or physical program counter of a next instruction, a virtual and/or physical address for loading and/or storing an instruction, and the like.
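For illustration only, the trace information described above may be represented as in the following sketch; the field names and the Python representation are illustrative assumptions and are not defined by the specification:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InstructionTrace:
    # Fields mirror the trace contents listed above; names are illustrative.
    opcode: int
    virtual_pc: int           # virtual program counter of this instruction
    physical_pc: int          # physical program counter of this instruction
    next_virtual_pc: int      # PC of the next instruction (reveals taken branches)
    next_physical_pc: int
    mem_virtual_addr: Optional[int] = None   # load/store address, if any
    mem_physical_addr: Optional[int] = None

# The PSM can detect a taken branch by comparing next_virtual_pc with the
# fall-through address (assuming fixed 4-byte instructions for this sketch):
t = InstructionTrace(opcode=0x91000421, virtual_pc=0x1000, physical_pc=0x41000,
                     next_virtual_pc=0x1004, next_physical_pc=0x41004)
print(t.next_virtual_pc == t.virtual_pc + 4)  # True: fall-through, not a taken branch
```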
In exemplary embodiments, the integrated platform may simulate a branch mis-prediction by the PSM requesting an incorrect instruction as a result of a mis-prediction, whereupon the VPE enters a sandbox mode in which dummy instructions are fetched to the PSM and progress according to instruction fetch requests is suspended until the PSM corrects the mis-prediction and requests a correct instruction. Advantageously, by recognizing the mis-prediction the VPE is prepared to fetch correct code instructions as soon as the PSM detects the mis-prediction.
In exemplary embodiments, the integrated platform may simulate a flush event by the PSM requesting the VPE to re-fetch recent instructions and ceasing to commit instructions in the core pipeline where the flush event occurred. Advantageously, by requesting re-fetching of instructions and not committing instructions in the pipeline, the PSM simulates the behavior of a CPU core.
A VPE may be a software code that executes on a host CPU and emulates an electronic system comprising a modeled CPU and peripheral devices, such as virtual and physical memory, and peripheral input-output devices such as a Network Interface Card (NIC), a block device, keyboards and/or screens, and the like. The VPE may emulate the modeled CPU at a level of detail such that an executable software code comprising machine code executable on the modeled CPU may execute on the VPE without any modification. For example, a benchmark software code for testing CPU performance may be executed by a VPE.
A VPE provides an interface for executing machine code, comprising a virtual CPU, including for example multiple processor cores, registers, power management, virtual and physical memories, a page table, interconnection buses, an interrupt model, network connections, and the like. A VPE may be OS independent, whereby performance goals of the CPU may be tested with multiple OSs.
The VPE may be executed on a host computer CPU where the CPU that the VPE emulates may be completely different and unrelated to the host computer CPU. When a code executes on a VPE, the VPE translates the executable code instructions of the modeled CPU into machine code native to the host computer CPU, causes the native code to be executed, and updates the modeled CPU state according to the results of the execution.
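For illustration only, the translate-and-execute loop described above may be sketched as follows, using an invented two-instruction toy instruction set; the structure and names are illustrative assumptions:

```python
# Toy sketch of the translate-then-execute loop: modeled-CPU opcodes are
# mapped to host-native operations (Python callables here), executed, and
# the modeled CPU state is updated from the results. The instruction set
# ("inc", "dbl") is invented purely for illustration.

def translate(opcode):
    """Map a modeled-CPU opcode to a host-native operation."""
    host_ops = {
        "inc": lambda state: {**state, "r0": state["r0"] + 1},
        "dbl": lambda state: {**state, "r0": state["r0"] * 2},
    }
    return host_ops[opcode]

def emulate(program, state):
    for opcode in program:
        native = translate(opcode)   # modeled opcode -> host-native code
        state = native(state)        # execute natively, update modeled state
    return state

print(emulate(["inc", "dbl"], {"r0": 1}))  # {'r0': 4}
```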
While a VPE may emulate the CPU architecture and interface to an executable code and/or operating system, it does not have the ability to model CPU micro-architecture components such as pipelines and caches, or behavior when flush events or branch mis-predictions occur, and therefore has limited utility for predicting the performance of a CPU design. In addition, it is difficult to control the progress of a VPE, as code instructions are executed in blocks, such that all instructions in a block of code are continuously executed.
In an exemplary embodiment, the progress of the VPE is controlled by the PSM in the following manner: in response to an instruction fetch request from the PSM, the VPE fetches a block of executable instructions, for example from a memory in a host computer, processes the entire block, and returns the requested instruction with trace information to the PSM. The processing comprises executing the instructions in the VPE emulated CPU by translating the executable instructions into machine code for the host computer CPU to execute, and storing the instruction trace of the emulated execution, comprising all information needed by the PSM to execute each of the instructions.
Subsequent instructions from the same block are supplied according to fetch requests from the PSM. The VPE does not proceed to another block of code instructions until the PSM requests an instruction from another block.
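For illustration only, this control flow may be sketched as follows: the VPE side executes a whole block on the first fetch into it, buffers the per-instruction traces, and serves subsequent fetches from that buffer until a PC outside the block arrives. The class structure, and the representation of blocks as pre-split PC-to-trace maps, are illustrative assumptions:

```python
class BlockServingEmulator:
    """Sketch of a VPE whose scheduler only advances to the next block when
    the PSM requests an instruction outside the current one."""

    def __init__(self, blocks):
        self.blocks = blocks      # list of dicts: pc -> trace (illustrative)
        self.block_idx = -1       # no block processed yet
        self.current = {}

    def fetch(self, pc):
        if pc not in self.current:
            # The PSM requested an instruction from another block: execute
            # the next block in its entirety and buffer all its traces.
            self.block_idx += 1
            self.current = self.blocks[self.block_idx]
        return self.current[pc]

# Two blocks of two instructions each; traces are stand-in strings.
emu = BlockServingEmulator([{0: "add", 4: "branch"}, {100: "load", 104: "store"}])
print(emu.fetch(0), emu.fetch(4))   # served from block 0
print(emu.fetch(100))               # crossing into block 1 advances the scheduler
```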
Optionally, CPU branch mis-prediction is modeled by the PSM and VPE in the following manner: when the PSM mis-predicts a branch instruction, as described above, the resulting instruction fetch requests to the VPE will be incorrect. The VPE, which as described above has already executed all instructions in the block, identifies the requested instruction as incorrect, and enters a mode of operation, referred to herein as Sandbox mode.
Sandbox mode comprises the VPE sending dummy, no operation (NOP), and/or fake instructions, referred to herein as dummy instructions, to the PSM in response to the incorrect fetch request and suspending progressing according to fetch requests from the PSM. The dummy instructions are sent in place of the correct instructions in order to prevent the correct instructions from changing the state of VPE PC and/or registers. The VPE exits Sandbox mode when the correct instruction fetch request is received from the PSM.
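For illustration only, the Sandbox-mode behavior described above may be sketched as follows; the class structure is an illustrative assumption, and in this sketch the VPE is reduced to knowing only the correct next PC after the executed block:

```python
NOP = "nop"  # dummy instruction returned while in Sandbox mode

class SandboxingEmulator:
    """Sketch of Sandbox mode: after a block is executed, the correct next
    PC is known; any other fetch request is answered with a dummy
    instruction until the PSM rolls back and requests the correct one."""

    def __init__(self, correct_next_pc, correct_trace):
        self.correct_next_pc = correct_next_pc
        self.correct_trace = correct_trace
        self.in_sandbox = False

    def fetch(self, pc):
        if pc == self.correct_next_pc:
            self.in_sandbox = False   # exit Sandbox mode: progress resumes
            return self.correct_trace
        self.in_sandbox = True        # mis-predicted path: freeze progress
        return NOP

emu = SandboxingEmulator(correct_next_pc=0x40, correct_trace="sub")
print(emu.fetch(0x80))   # mis-predicted fetch -> 'nop', emulator enters Sandbox mode
print(emu.in_sandbox)    # True
print(emu.fetch(0x40))   # correct fetch after rollback() -> 'sub'
print(emu.in_sandbox)    # False
```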
As the PSM continues to process instructions already in the pipeline, the branch instruction that was mis-predicted is calculated, whereby the PSM detects the mis-prediction. The PSM then performs a function call to the proxy layer API to recover from the mis-prediction. The API function, referred to herein as rollback(), informs the VPE to roll back the program counter from the dummy instructions to the correct branch instruction address, and to fetch the correct instruction.
Optionally, CPU flush events are modeled by the integrated model in the following manner: when the PSM detects a hazard, for example a RAW, a function call is made to the proxy layer API, referred to herein as replay(), which re-fetches a set of instructions that were recently fetched. The PSM does not commit the instructions already in the pipeline that are associated with the hazard, and the re-fetched instructions are re-executed and committed. For example, by re-fetching the read operation in a RAW hazard, the write instruction has sufficient time to be completely processed, thereby eliminating the hazard.
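For illustration only, the record-and-replay mechanism described here may be sketched as a bounded first-in-first-out store of recently fetched instructions; the capacity and class structure are illustrative assumptions:

```python
from collections import deque

class ReplayBuffer:
    """Sketch of the replay() support: recently fetched instructions are
    kept in a bounded FIFO so that, on a flush event, the most recent
    fetches can be re-played (capacity is an illustrative assumption)."""

    def __init__(self, capacity=8):
        self.recent = deque(maxlen=capacity)

    def record(self, pc, trace):
        self.recent.append((pc, trace))

    def replay(self, count):
        """Return the last `count` fetched instructions, oldest first."""
        return list(self.recent)[-count:]

buf = ReplayBuffer()
buf.record(0x10, "store r1")
buf.record(0x14, "load r1")    # RAW hazard detected against the store above
print(buf.replay(1))           # [(20, 'load r1')] -- only the read is re-played
```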
Optionally, the proxy layer between the PSM and VPE may be a set of executable code instructions executing on the host computer CPU, for example a Simulation Emulation system 200 as described below. The proxy layer may comprise API function calls, for example a rollback() function that, when called by the PSM, instructs the VPE to roll back the PC to a corrected branch instruction, a replay() function call that, when called by the PSM, re-fetches a set of previously fetched instructions, and the like. The proxy layer may also provide memory management services to the PSM, for example storing previously fetched instructions in a non-volatile memory of the host computer to support the replay() function call, managing physical and virtual memory of the host computer, and the like.
The present invention, in some embodiments thereof, provides a number of advantages over the existing art. The integration of VPE and PSM provides microprocessor designers with a tool that decouples functionality from timing by emulating a modeled CPU code instruction interface, is compatible with multiple OSs, and models clock-cycle-level timing of multi-core pipeline activities. In particular, the throughput and operational behavior of multiple pipelines may be modeled for branch mis-prediction, pipeline flush, and re-fetching of flushed instructions.

Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not necessarily limited in its application to the details of construction and the arrangement of the components and/or methods set forth in the following description and/or illustrated in the drawings and/or the Examples. The invention is capable of other embodiments or of being practiced or carried out in various ways.
The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
Reference is now made to FIG. 1, a flowchart schematically representing Simulation Emulation method 100 for simulating a CPU, including processing of branch mis-prediction and flush events, according to some embodiments of the current invention.
Simulation Emulation method 100 may be executed by code instructions executing on a CPU, for example by code instructions executing on processor(s) 204 of Simulation Emulation system 200 as described below in FIG. 2.
Simulation Emulation method 100 begins when a request from a PSM to fetch an executable instruction is received by a VPE; in response, the VPE processes a block of instructions as described above and returns the requested instruction to the PSM. For example, code instructions from PSM 212 executing on processor 204, as described below, may instruct sending the request and receiving the instructions, and code instructions from VPE 211 executing on processor 204, as described below, may instruct responding to the request.
When the fetch request comprises a branch mis-prediction, the VPE proxy layer returns dummy instructions in place of the requested instructions, until the PSM requests fetching the correct instruction, as described below. When a flush event is detected, a request to re-fetch the recent instructions is sent, as described below.
Optionally, Simulation Emulation method 100 comprises a method for multiple cores operating in parallel. For example, a PSM may model multiple cores, where each modeled core may control the progress of a corresponding VPE core, as described below in FIG. 6.
Reference is now made to FIG. 2, a schematic illustration of exemplary Simulation Emulation system 200 for simulating a CPU, according to some embodiments of the present invention. Simulation Emulation system 200 comprises an input/output (I/O) interface 202, processor(s) 204, and storage 208.
Simulation Emulation system 200 is adapted to receive code instructions that are executable on a modeled CPU, for example from user device 260 as described below, and to simulate the performance of the Modeled CPU when executing the received code instructions, for example by executing on processor(s) 204 code in storage 208.
Simulation Emulation system 200 may comprise for example a server, a desktop computer, an embedded computing system, an industrial computer, a ruggedized computer, a laptop, a cloud computer, a private cloud, a public cloud, a hybrid cloud, and/or any other type of computing system. Optionally, Simulation Emulation system 200 comprises a virtual machine (VM) in place of I/O 202, processor(s) 204, and storage 208.
I/O 202 may include one or more input interfaces, for example a Network Interface Card (NIC), a block device, a keyboard, a soft keyboard, a voice-to-text system, and/or any other data input interface. I/O 202 may comprise one or more output interfaces, for example a screen, a touch screen, a video display, and/or any other visual display device.
Processor(s) 204 may comprise one or more hardware processors, a multi-core processor, and/or any other type of CPU. Storage 208 may include one or more non-transitory persistent storage devices, for example, a hard drive, a Flash array and the like.
Optionally, Simulation Emulation system 200 is connected to a network 230 via I/O 202. For example, I/O 202 may be a network interface card (NIC), a wireless router, and/or any other type of network interface adapted to communicate with network 230.
Network 230 may be any type of data network, for example, a local area network (LAN), an Ethernet LAN, a fiber optic LAN, a digital subscriber line (DSL), a wireless LAN, a broadband connection, an Internet connection using an Internet Service Provider (ISP) and/or any other type of computer network. Network 230 may employ any type of data networking protocols, including transport control protocol and internet protocol (TCP/IP), user datagram protocol (UDP), and the like.
Optionally, user device 260 may be connected to Simulation Emulation system 200, for example via network 230. User device 260 may be a smartphone, a computer, and/or any other computing platform. User device 260 may be adapted to transmit to Simulation Emulation system 200 via network 230 a plurality of code instructions to execute on a Modeled CPU, for example an OS and benchmark application for testing performance of a modeled CPU design.
Simulation Emulation method 100 may be executed by processor(s) 204 executing code from one or more software modules in storage 208, for example Process Manager 210, Virtual Platform Emulator 211, Pipeline Model 212, and/or proxy layer 213. A software module refers to a plurality of program instructions stored in a non-transitory medium, such as the storage 208, and executed by a processor, such as the processor(s) 204.
Reference is now made again to FIG. 1. As shown in 101, Simulation Emulation method 100 begins by sending a request to fetch a code instruction; for example, code instructions from PSM 212 executing on processor(s) 204 may instruct sending a request to fetch a code instruction from VPE 211.
Optionally, the fetch request, and/or all messaging and communication between PSM 212 and VPE 211, may be a proxy layer API function call, comprising code instructions from proxy layer 213 executing on processor(s) 204. Proxy layer 213 may comprise code instructions that, when executed on processor(s) 204, instruct receiving and sending messages, for example a request to fetch a code instruction, a response to the request comprising the requested code instruction, and the like.
Reference is now made to FIG. 3A, a schematic diagram representing the connections between code modules VPE 211, PSM 212, and proxy layer 213, according to some embodiments of the invention. As shown in 320, multiple virtual cores in VPE 211 may communicate with corresponding pipeline cores 330 in PSM 212 via proxy layer 213. The communication is implemented by code instructions in proxy layer 213 executing on processor(s) 204, as described below in FIG. 3B. The virtual platform emulator 211 comprises CPU cores 320, memory 214, and devices 216.
Reference is now made to FIG. 3B, a schematic diagram representing the services provided by proxy layer 213 to PSM 212 and VPE 211, according to some embodiments of the invention.
As shown in 213, optionally code instructions from proxy layer 213 when executed on processor(s) 204 instruct the following set of actions: API function calls for sending messages, communication, data, instruction fetch requests, and/or requested instructions between PSM 212 and VPE 211. Storing instruction traces, comprising making a temporary copy of all instructions and corresponding trace that were fetched by PSM 212, for example storing traces in a local memory cache and/or first in first out (FIFO) record in storage 208. Executing Replay() API function calls, comprising re-fetching previously fetched instructions, for example fetching instructions from the instruction trace described above and/or from VPE 211. Managing virtual and/or physical PC, for example directing Proxy Layer 213 to change the value of respective PC according to a Replay() function call by PSM 212. Managing virtual and/or physical storage addresses, for example receiving a request from VPE 211 to fetch a block of code according to a virtual address, and translating the virtual address into a physical address in storage 208. Managing power states, for example instructing VPE 211 and/or PSM 212 to cease operations according to detecting that a set of code instructions has been fully executed by VPE 211 and/or PSM 212. Tracing packets, for example keeping a record of packets of data passed between VPE 211 and/or PSM 212. Tracing Disk I/O, for example keeping a record of content of storage 208 that is retrieved by VPE 211. In addition, proxy layer 213 may comprise any additional code instructions that instruct any communication and/or interaction between PSM 212 and VPE 211.
Reference is now made to FIG. 3C, a schematic diagram representing fetching of code instructions from VPE 211 to PSM 212, according to some embodiments of the present invention. As shown in 351, code instructions from PSM 212 executing on processor(s) 204 instruct sending a fetch instruction request to VPE 211. As shown in 352, code instructions from VPE 211 executing on processor(s) 204 instruct fetching a block of code instructions, processing the instructions in the block as described above, and as shown in 353 sending to PSM 212 the requested instruction. As shown in 354 and 355, PSM 212 requests and receives another instruction.
Optionally, VPE 211 stores the traces of the generated code instructions until PSM 212 requests the corresponding code instruction.
Optionally, the instruction to be fetched is identified according to a PC of the corresponding instruction, for example by code instructions from PSM 212 executing on processor(s) 204 specifying a PC for the requested instruction.
Optionally, code instructions from VPE 211 executing on processor(s) 204 instruct disabling a scheduler of VPE 211 and enabling a modified scheduler that progresses according to code fetch requests from PSM 212. A scheduler may be code instructions in VPE 211 executing on processor(s) 204 that instruct the progress of each VPE core, and the modified scheduler instructs fetching blocks of code instructions to process according to fetch requests from PSM 212, as described above.
Reference is now made to FIG. 3D, a schematic illustration of scheduling execution of code instructions on multiple cores, according to some embodiments of the invention. As shown in 371, an entire block of code instructions is scheduled and executed on one VPE 211 core, followed by a next block of code instructions executed on another VPE 211 core. As shown in 372, PSM 212 cores execute code instructions individually or in smaller groups than a code block.
Reference is now made again to FIG. 1. As shown in 102, VPE 211 detects a branch mis-prediction of the code instruction requested by PSM 212, for example by code instructions from PSM 212 executing on processor(s) 204 instructing comparison of a predicted branch instruction with a calculation of the branch instruction, where a mismatch indicates a mis-prediction. Optionally, proxy layer 213 detects a mis-prediction by a mismatch between the PC of the requested instruction and the PC of the fetched instruction from VPE 211.
As shown in FIG. 1, step 103, in response to the detected mis-prediction, optionally VPE 211 enters a Sandbox mode, as described above. VPE 211 may respond to the fetch request and subsequent fetch requests from PSM 212 by sending a dummy instruction, exiting Sandbox mode when a fetch request with the correct branch PC is received.
When the dummy instruction is fetched by PSM 212, it is input to the core pipeline. Prior to the commit stage of the pipeline, PSM 212 detects the mis-prediction as described above and performs the following actions to recover from it: the location of the branch mis-prediction is identified, and a recovery message is sent to VPE 211 instructing it to roll back to the branch point. The recovery message may be sent as a proxy layer API rollback() function call. PSM 212 does not commit the dummy instructions.
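The sandbox behaviour above can be sketched in a few lines. This is a hedged illustration, not the patent's implementation; the names SandboxVpe, enter_sandbox, and DUMMY are assumptions. While in sandbox mode the emulator answers every wrong-path fetch with a dummy instruction, and exits only when the fetch request carries the correct branch PC:

```python
DUMMY = ("DUMMY", None)


class SandboxVpe:
    def __init__(self, trace):
        self.trace = dict(trace)      # pc -> instruction on the correct path
        self.sandbox = False
        self.resume_pc = None         # correct PC that exits sandbox mode

    def enter_sandbox(self, correct_pc):
        # Entered when the emulator detects a mis-predicted branch.
        self.sandbox, self.resume_pc = True, correct_pc

    def fetch(self, pc):
        if self.sandbox:
            if pc != self.resume_pc:
                return DUMMY          # wrong-path fetch: feed a dummy
            self.sandbox = False      # correct branch PC received: exit sandbox
        return (self.trace[pc], pc)


vpe = SandboxVpe({13: "bz", 80: "target", 14: "wrong"})
vpe.enter_sandbox(correct_pc=80)
assert vpe.fetch(14) == DUMMY           # wrong path after mis-prediction
assert vpe.fetch(80) == ("target", 80)  # rollback() re-fetches the correct PC
```

The dummy instructions flow through the performance model's pipeline like real instructions but are never committed, which preserves pipeline timing on the wrong path.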
Reference is now made to FIG. 4A, a schematic diagram representing the state of the PSM 212 PC during a branch mis-prediction, where each circle represents an instruction in a core pipeline, according to some embodiments of the current invention. As shown in 401, PSM 212 requests incorrect instructions to be fetched from VPE 211 due to a mis-prediction. As shown in 402, as a result of detecting the mis-prediction, PSM 212 sets the PC to the correct branch instruction, and as shown in 403 the correct instructions are fetched.
Reference is now made to FIG. 4B, a schematic illustration of PSM 212 sequentially fetching instructions from VPE 211, according to some embodiments of the current invention. As shown in 451, the thick arrows represent messages within PSM 212. As shown in 452 the medium arrows represent fetch requests from PSM 212. As shown in 453 the thin arrows represent fetch replies from VPE 211. As shown in 310, proxy layer 213 provides interoperability services to VPE 211 and PSM 212, as described above.
Reference is now made to FIG. 4C, a schematic diagram representing the PC (program counter) and sequence of instructions executed in PSM 212 and VPE 211 when a branch mis-prediction occurs, according to some embodiments of the current invention. 471 represents the order of execution of instructions in a CPU with PC numbered from 10 to 81, 472 represents the order of execution of instructions in VPE 211 with PC numbered from 10 to 81, 473 represents the order of execution of instructions in PSM 212 with PC numbered from 10 to 81 when a branch mis-prediction occurs, and 474 represents the order of execution of instructions in PSM 212 after the rollback() API has been executed.
As shown in 471, the instruction with PC 13 labeled "BZ LABEL" is a conditional branch instruction, and the following correct instruction is located at PC 80.
As shown in 472, code instructions from VPE 211 executing on processor(s) 204 instruct correctly calculating the branch located at PC 13, and the following instruction is at PC 80.
As shown in 473, code instructions in PSM 212 executing on processor(s) 204 instruct mis-predicting the branch at PC 13, and instructions with PC 14, 15, and 16 are fetched from VPE 211.
As shown in 474, code instructions in PSM 212 executing on processor(s) 204 instruct calling the API function rollback(), as described above, resulting in fetching the correct instruction at PC 80 and not committing the incorrectly fetched instructions.
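The FIG. 4C scenario can be reproduced as a small worked example. This is a hypothetical re-creation of the trace, not production code; the function name run and its parameters are illustrative. The branch at PC 13 is mis-predicted as falling through to PC 14, three wrong-path instructions are fetched, and the rollback discards them before committing from the correct target PC 80:

```python
def run(branch_pc, predicted, actual):
    """Return the committed PC sequence after mis-prediction recovery."""
    pipeline = list(range(10, branch_pc + 1))              # PCs 10..13 fetched
    pipeline += [predicted, predicted + 1, predicted + 2]  # wrong path 14..16
    if predicted != actual:                   # detected before the commit stage
        pipeline = pipeline[: branch_pc - 10 + 1]  # rollback() drops wrong path
        pipeline += [actual, actual + 1]      # re-fetch from correct PC 80
    return pipeline


assert run(branch_pc=13, predicted=14, actual=80) == [10, 11, 12, 13, 80, 81]
```

The committed sequence matches 474 in FIG. 4C: the wrong-path PCs 14-16 are fetched and then dropped, never committed.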
Reference is now made again to FIG. 1. As shown in 104, a flush pipeline event is generated and/or detected, for example by code instructions from PSM 212 executing on processor(s) 204 that instruct detection of a hazard. Optionally, PSM 212 generates one or more flush pipeline events during normal operations, and then detects the generated pipeline flush event. Optionally, the flush pipeline event may be the result of a branch mis-prediction and/or a pipeline hazard, for example a load/store data cache miss, a memory disambiguation hazard for out-of-order execution, a read-after-write (RAW) hazard, a stalled pipeline, and the like.
As shown in 105, in response to the detected flush event, previously fetched instructions are re-fetched, and the results of the instructions in the pipeline associated with the hazard are not committed. For example, code instructions from PSM 212, code instructions from proxy layer 213, and/or code instructions from VPE 211 executing on processor(s) 204 may instruct the following: when a flush event is detected by PSM 212, PSM 212 calls the replay() API function, which requests to re-fetch, or replay, the recently fetched instructions currently in the PSM 212 core pipeline. Optionally, the proxy layer 213 replay() function returns the recently fetched instructions to PSM 212 from a local cache, as described above. Optionally, the proxy layer 213 replay() function returns the recently fetched instructions to PSM 212 from VPE 211. Optionally, VPE 211 maintains a cache of recently fetched instructions which comprises the instructions currently in the PSM 212 pipeline, for example a historical transactional memory stored in local cache and/or storage 208. VPE 211 sends the re-fetched instructions to PSM 212. PSM 212 does not commit the instructions in the pipeline associated with the hazard, and executes and commits the re-fetched instructions.
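The replay() recovery path above can be sketched with a proxy layer that keeps a bounded history of recently fetched instructions. The class and method names (ProxyLayer, replay) are illustrative assumptions; the sketch shows only the caching idea, under the assumption that the history is at least as deep as the PSM pipeline:

```python
from collections import deque


class ProxyLayer:
    def __init__(self, vpe_fetch, depth=8):
        self.vpe_fetch = vpe_fetch          # callable: pc -> instruction
        self.history = deque(maxlen=depth)  # (pc, instruction) of recent fetches

    def fetch(self, pc):
        # Normal fetch: forward to the emulator and record in the history.
        instr = self.vpe_fetch(pc)
        self.history.append((pc, instr))
        return instr

    def replay(self, n):
        """Return the last n fetched instructions, oldest first, served from
        the local cache without going back to the emulator."""
        return list(self.history)[-n:]


proxy = ProxyLayer(vpe_fetch=lambda pc: f"instr@{pc}")
for pc in (10, 11, 12, 13):
    proxy.fetch(pc)
# flush event: the PSM does not commit PCs 12-13 and replays them
assert proxy.replay(2) == [(12, "instr@12"), (13, "instr@13")]
```

Serving replays from the proxy cache keeps the emulator's state untouched, since the emulator already executed those instructions exactly once.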
As shown in 106, when the re-fetched instructions have been fetched and executed, Simulation Emulation method 100 is completed.
Reference is now made to FIG. 5A, a schematic diagram representing the state of PC and core pipeline of Simulation Emulation system 200 during a flush event, according to some embodiments of the current invention. As shown in 501, each circle represents an instruction in PSM 212 core pipeline, for example by code instructions from PSM 212 executing on processor(s) 204. As shown in 502, a flush event is detected, as described above. As shown in 503, the curved arrow represents the result of the detected flush event, wherein the previously fetched instructions are re-fetched, as described above.
Reference is now made to FIG. 5B, a schematic diagram representing a PSM 212 core pipeline before and after a flush event, according to some embodiments of the current invention. As shown in 551, each square represents a code instruction in a core pipeline, where code instructions proceed from left to right as they are processed by PSM 212. As shown in 552, before detection of a flush event, the instruction that was most recently processed, referred to herein as the latest instruction, is the farthest right instruction in the pipeline. As shown in 553, after detecting a flush event in the same core pipeline, the instructions in the pipeline are not committed, and as a result of the replay() API function call, as described above, the latest instruction is re-fetched and is now at an earlier pipeline location.
Reference is now made to FIG. 5C, a schematic diagram representing the PC and sequence of instruction execution in PSM 212 and VPE 211 when a flush event occurs, according to some embodiments of the invention. 571 represents the PC of instructions in a core pipeline of a CPU with PC numbered from 10 to 81, 572 represents the PC of executed instructions in VPE 211 with PC numbered from 10 to 81, 573 represents the PC of instructions in a core pipeline of PSM 212 with PC numbered from 10 to 81 when a flush event occurs, and 574 represents the PC of committed instructions in PSM 212 after the replay() API has been executed.
As shown in 571, the instruction with PC 13, labeled "load R3 <- [R1]", generates a hazard since the value of register "R1" may not be correct due to the write to R1 at PC 11.
As shown in 572, code instructions from VPE 211 executing on processor(s) 204 instruct correctly calculating the value of "R1".
As shown in 573, code instructions in PSM 212 executing on processor(s) 204 instruct detecting the possible hazard and, by calling the API function replay() as described above, re-fetching the most recent instructions, in this example those with PC 13-16.
As shown in 574, code instructions in PSM 212 executing on processor(s) 204 instruct not committing the original instructions at PC 13-16, and committing only the re-fetched instructions at PC 13-16.
Reference is now made to FIG. 6, a schematic diagram of multiple PSM core pipelines controlling multiple VPE 211 cores, according to some embodiments of the current invention. Bold arrow lines represent the movement of instructions as they are retrieved by VPE 211 from code blocks and fetched to PSM 212. PSM 212 comprises code instructions that, when executed on processor(s) 204, instruct multiple CPU core pipeline models, where each core pipeline model controls progress of a corresponding VPE 211 core. For example, as shown in 601 and 602, PSM core "0" controls the progress of corresponding VPE 211 core "0", and as shown in 603 and 604, PSM 212 core "1" controls VPE 211 core "1". Code instructions in process manager 210, when executed on processor(s) 204, may instruct assigning code instructions to individual PSM 212 core pipelines.
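The one-to-one pairing of FIG. 6 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patent's implementation; VpeCoreStub, PsmCorePipeline, and step are names introduced for illustration. The key point is that each PSM pipeline model, not the emulator's own scheduler, decides when its paired VPE core advances:

```python
class VpeCoreStub:
    """Stand-in for one emulated VPE core serving its own instruction stream."""

    def __init__(self, core_id, program):
        self.core_id, self.program, self.pos = core_id, program, 0

    def fetch_next(self):
        instr = self.program[self.pos]
        self.pos += 1
        return instr


class PsmCorePipeline:
    """One PSM core pipeline model driving its paired VPE core."""

    def __init__(self, vpe_core):
        self.vpe_core = vpe_core
        self.committed = []

    def step(self):
        # The PSM pipeline pulls one instruction, advancing the paired
        # VPE core by exactly one instruction.
        self.committed.append(self.vpe_core.fetch_next())


cores = [PsmCorePipeline(VpeCoreStub(i, [f"c{i}i{j}" for j in range(3)]))
         for i in range(2)]
for _ in range(3):
    for c in cores:                 # the PSM schedules each core independently
        c.step()
assert cores[0].committed == ["c0i0", "c0i1", "c0i2"]
assert cores[1].committed == ["c1i0", "c1i1", "c1i2"]
```

Because each pipeline model pulls from its own core, the interleaving of the cores is determined by the performance model's timing rather than by the emulator's block-at-a-time scheduling.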
The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
It is expected that during the life of a patent maturing from this application many relevant processor emulators and pipeline models will be developed and the scope of the terms VPE and PSM are intended to include all such new technologies a priori.
As used herein the term "about" refers to ± 10 %.
The terms "comprises", "comprising", "includes", "including", "having" and their conjugates mean "including but not limited to". These terms encompass the terms "consisting of" and "consisting essentially of".
The phrase "consisting essentially of" means that the composition or method may include additional ingredients and/or steps, but only if the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method.
As used herein, the singular form "a", "an" and "the" include plural references unless the context clearly dictates otherwise. For example, the term "a compound" or "at least one compound" may include a plurality of compounds, including mixtures thereof.
The word "exemplary" is used herein to mean "serving as an example, instance or illustration". Any embodiment described as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments.
The word "optionally" is used herein to mean "is provided in some embodiments and not provided in other embodiments". Any particular embodiment of the invention may include a plurality of "optional" features unless such features conflict.
Throughout this application, various embodiments of this invention may be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
Whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range. The phrases "ranging/ranges between" a first indicated number and a second indicated number and "ranging/ranges from" a first indicated number "to" a second indicated number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals therebetween.
It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination or as suitable in any other described embodiment of the invention. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.

Claims

WHAT IS CLAIMED IS:
1. A system (200) for simulating a multicore processor design, comprising:
an input/output interface (202);
a processor (204);
a virtual platform emulator (211);
a performance simulation model (212), wherein the performance simulation model (212) comprises at least one pipeline model (330);
wherein the input/output interface (202) is adapted to receive code instructions comprising a plurality of instructions blocks;
wherein the processor (204) is adapted to execute a code for:
instructing the virtual platform emulator (211) to emulate an execution of an instructions block of the plurality of instructions blocks to generate a stream of a plurality of block derived code instructions based on said instructions block;
when a mis-prediction branch in an instruction branch of said instructions block is detected, instructing said virtual platform emulator (211) to add a plurality of dummy code instructions to said stream;
when a flush pipeline event is detected instructing said virtual platform emulator (211) to add a plurality of previously executed instructions to said stream in an original execution order of said previously executed instructions; in response to each of a plurality of sequential and independent instruction requests received from the pipeline model, sequentially and independently fetching and transferring each of said plurality of block derived code instructions and at least one of said plurality of dummy code instructions and said plurality of previously executed instructions for execution and committing in said pipeline model.
2. The system (200) of claim 1, wherein said processor is adapted to instruct said virtual platform emulator by a plurality of application programming interface instructions.
3. The system (200) of any of the previous claims, wherein said processor is adapted to execute said code for each of a plurality of emulated cores of the multicore processor in parallel.
4. The system (200) of claim 3, wherein said processor is adapted to maintain a history list of fetched instructions for each of said plurality of emulated cores, upon detecting said flush pipeline event, move a location pointer or an index to point on an oldest flushed instruction in said history list, acquire new fetched instructions from the history list, progress the fetched instructions pointer to a next instruction in the history list, and upon reaching the end of the history list, returning to a normal operation wherein instructions are fetched from said virtual platform emulator.
5. The system (200) of any of the previous claims, wherein when said misprediction branch is detected, the system is adapted to enter into a sandbox mode wherein at least one false instruction for execution and committing in said pipeline model is sent by said virtual platform emulator, and upon identification of a misprediction, to instruct a rollback at a proxy layer of said virtual platform emulator (211) to said instruction branch for taking a correct branch decision.
6. The system (200) of any of the previous claims, wherein said mis-prediction branch is detected during said emulation of said plurality of block derived code instructions, said plurality of previously executed instructions comprises at least one false instruction for execution without committing in said pipeline model; wherein said processor is adapted to execute said code for instructing a flushing of said pipeline model in response to identification of branch mis-prediction and instructing a rollback at a proxy layer of said virtual platform emulator (211) to said instruction branch.
7. The system (200) of any of the previous claims, further comprising updating a dataset of instructions not yet committed, flushed from said plurality of block derived code instructions and instructing said virtual platform emulator (211) to add instructions from said dataset of instructions as said plurality of previously executed code instructions.
8. The system (200) of any of the previous claims, wherein said processor is adapted to instruct said virtual platform emulator (211) to emulate execution of an instruction block in response to said instruction request from said pipeline model exclusively when said requested instruction is a member of said instruction block.
9. The system (200) of any of the previous claims, wherein the virtual platform emulator (211) comprises a scheduler adapted to schedule processing of a next instructions block, wherein the scheduler is adapted to instruct said virtual platform emulator to emulate execution of the next instructions block when an instruction request received from said pipeline model comprises a code instruction that is a member of said next instructions block.
10. A method for simulating a multicore processor design, comprising:
receiving code instructions comprising a plurality of instructions blocks;
instructing a virtual platform emulator (211) to emulate an execution of an instructions block of the plurality of instructions blocks to generate a stream of a plurality of block derived code instructions based on said instructions block;
when a mis-prediction branch in an instruction branch of said instructions block is detected, instructing said virtual platform emulator (211) to add a plurality of dummy code instructions to said stream;
when a flush pipeline event is detected, instructing said virtual platform emulator (211) to add a plurality of previously executed instructions to said stream in an original execution order of said previously executed instructions;
in response to each of a plurality of sequential and independent instruction requests received from a pipeline model, sequentially and independently fetching and transferring each of said plurality of block derived code instructions and at least one of said plurality of dummy code instructions and said plurality of previously executed instructions for execution and committing in said pipeline model.
PCT/EP2017/053462 2017-02-16 2017-02-16 A method and system to fetch multicore instruction traces from a virtual platform emulator to a performance simulation model WO2018149495A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/EP2017/053462 WO2018149495A1 (en) 2017-02-16 2017-02-16 A method and system to fetch multicore instruction traces from a virtual platform emulator to a performance simulation model
CN201780039897.0A CN109690536B (en) 2017-02-16 2017-02-16 Method and system for fetching multicore instruction traces from virtual platform emulators to performance simulation models

Publications (1)

Publication Number Publication Date
WO2018149495A1 true WO2018149495A1 (en) 2018-08-23








Also Published As

Publication number Publication date
CN109690536A (en) 2019-04-26
CN109690536B (en) 2021-03-23


Legal Events

121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 17705406; Country of ref document: EP; Kind code of ref document: A1)
NENP: Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 17705406; Country of ref document: EP; Kind code of ref document: A1)