CN100392618C - System, method and apparatus for protecting memory in a computer from being written to
- Publication number
- CN100392618C (application number CN97182229A)
- Authority
- CN
- China
- Prior art keywords
- instruction
- target
- address
- translation
- memory
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Landscapes
- Executing Machine-Instructions (AREA)
Abstract
Description
Technical Field
The present invention relates to computer systems and, more particularly, to methods and apparatus for preventing the misuse of instructions that have been translated from a first instruction set into another instruction set when the instructions of the first instruction set are overwritten in memory.
Background Art
Thousands of application programs run on computers designed around particular families of microprocessors. The largest number of these programs were written for computers (generally called "IBM-compatible personal computers") that use the "X86" family of microprocessors, which includes the Intel 8088, Intel 8086, Intel 80186, Intel 80286, i386, i486, and the successive Pentium microprocessors, all designed and manufactured by Intel Corporation of Santa Clara, California. Many programs have also been written for computers that use other processor families. Because so many application programs run on these computers, there is a huge market for microprocessors that can be used in them, especially in computers that process X86 programs. The microprocessor market is not only large but also very profitable.
Although the market for microprocessors that can run large numbers of application programs is large and profitable, designing a new competitive microprocessor is far from easy. For example, although the X86 family of microprocessors has existed for many years and is included in most computers sold and used, only a few competitive microprocessors that can run X86 programs have been successful. There are many reasons for this.
To be successful, a microprocessor must run all of the programs designed for an existing processor family (including operating systems and legacy programs) as fast as the existing processors do, at a cost no higher than theirs. In addition, to be economically successful, a new microprocessor must outperform existing processors in at least one respect, so that buyers have a reason to choose it.
Making a microprocessor run as fast as existing microprocessors is difficult and expensive. A processor executes instructions through primitive operations such as load, shift, add, store, and similar low-level operations, and it responds only to such primitive operations when executing the instructions supplied by an application program. For example, a processor designed to run the instructions of a complex instruction set computer (CISC) such as the X86, in which instructions specify the processing to be performed at a relatively high level, contains read-only memory (ROM) that stores so-called microinstructions. Each microinstruction contains a sequence of primitive instructions which, when executed in order, produce the result commanded by the high-level CISC instruction. Typically, an "add A to B" CISC instruction is decoded to look up an address in the ROM at which the microinstruction that carries out the "add A to B" function is stored. The microinstruction is loaded and its primitive instructions are executed in sequence, completing the "add A to B" instruction. In such a CISC computer, the primitive operations within a microinstruction can never change during program execution. Each CISC instruction can run only by decoding the instruction, addressing and fetching the microinstruction, and running the sequence of primitive operations in the order the microinstruction provides. The same sequence must be followed every time the microinstruction runs.
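The decode, ROM lookup, and fixed primitive sequence described above can be illustrated with a minimal sketch. The opcode name `ADD_MEM`, the table contents, and the function `execute_cisc` are hypothetical and chosen only for illustration; they do not describe any real X86 microcode.

```python
# Hypothetical model: a microcode ROM maps each CISC opcode to a fixed
# sequence of primitive operations. All names here are illustrative.
MICROCODE_ROM = {
    "ADD_MEM": ("load", "load", "add", "store"),
}


def execute_cisc(opcode, memory, a_addr, b_addr, dest_addr):
    """Decode the opcode, fetch its microinstruction, run primitives in order."""
    regs = []                      # scratch values produced by primitives
    operands = [a_addr, b_addr]    # addresses consumed by successive loads
    trace = []                     # primitives actually executed, in order
    for prim in MICROCODE_ROM[opcode]:
        trace.append(prim)
        if prim == "load":
            regs.append(memory[operands.pop(0)])
        elif prim == "add":
            regs = [regs[0] + regs[1]]
        elif prim == "store":
            memory[dest_addr] = regs[0]
    return trace
```

In this toy model, every execution of the same opcode walks the same primitive sequence, which is the property the paragraph above emphasizes.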
Today's processors that run X86 applications use a large number of techniques to deliver the fastest processing possible at a reasonable price. Any new processor that uses known hardware techniques to increase processor speed necessarily increases the complexity of the processing hardware, and this increases hardware cost.
For example, a superscalar microprocessor, which uses multiple processing channels in order to execute two or more operations at once, imposes a series of additional requirements. At the most basic level, a simple superscalar microprocessor decodes each application instruction into the microinstructions that carry out the function of that instruction. The simple superscalar microprocessor then schedules two microinstructions to execute together if they do not require the same hardware resources and the execution of one does not depend on the result of the other.
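The two scheduling conditions stated above (no shared hardware resource, no dependence on the other's result) can be sketched as a simple predicate. The `MicroOp` fields and unit names are assumptions made for this illustration, not a description of any actual processor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroOp:
    """Hypothetical microinstruction descriptor used only for illustration."""
    unit: str     # hardware resource required, e.g. "alu" or "load_store"
    dest: str     # register written
    srcs: tuple   # registers read


def can_issue_together(first, second):
    """Return True if 'second' may execute in the same cycle as 'first'."""
    if first.unit == second.unit:
        return False              # both need the same hardware resource
    if first.dest in second.srcs:
        return False              # second depends on the result of first
    return True
```

A real scheduler would check more hazards than these two; the sketch covers only the conditions named in the text.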
More advanced superscalar microprocessors typically decode each application instruction into a series of primitive instructions so that these primitive instructions can be reordered and scheduled in the most efficient execution order. This requires that each primitive operation be individually addressable and fetchable. To accomplish the reordering, the processor must ensure that a primitive instruction that needs the data produced by another primitive instruction executes only after that other instruction has produced the data. Such a superscalar microprocessor must also ensure that two primitive instructions executing together do not need the same hardware resource, and it must resolve conditional branches before the effects of the branch operation can be completed.
Superscalar microprocessors therefore require a great deal of hardware to compare the relationships among primitive instructions and to reorder and schedule them in order to execute any instruction. As the number of processing channels increases, the amount and cost of the hardware needed for these superscalar acceleration techniques grows approximately quadratically. All of these hardware requirements increase the complexity and cost of the circuitry involved. As with microinstructions, every time an application instruction is executed, a superscalar microprocessor must use relatively complex addressing and fetching hardware to fetch each primitive instruction, must reorder and reschedule those primitive instructions according to the other primitive instructions and the state of hardware usage, and must then execute all of the rescheduled primitive instructions. Because every application instruction must pass through the entire hardware sequence each time it executes, the speed at which a superscalar processor can execute instructions is limited.
Even where these hardware techniques increase processing speed, the resulting hardware complexity greatly increases the cost of such microprocessors. For example, the Intel i486 DX4 processor uses approximately 1.5 million transistors. Adding the hardware needed to perform the dependency checking and scheduling required to process instructions through two channels in a basic superscalar processor such as the Intel Pentium requires more than 3 million transistors. In the Intel Pentium Pro microprocessor, adding the hardware needed to reorder primitive instructions derived from different target instructions, provide speculative execution, allow register renaming, and provide branch prediction raises the count to more than 6 million transistors. Adding hardware to increase operating speed has thus produced an extraordinary increase in the transistor count of the latest generation of microprocessors.
Even with these known techniques, a faster microprocessor may not result, because existing microprocessor manufacturers already use most of the economically feasible techniques known to increase the speed of their microprocessors. Designing a faster processor is therefore a very difficult and expensive task.
Reducing the cost of a processor is also very difficult. As noted above, the hardware acceleration techniques needed to build a sufficiently capable processor are very expensive. Anyone designing a new processor must have access to facilities that can produce the hardware. Such facilities are difficult to obtain because chip manufacturers generally do not invest in small-volume devices. The capital investment required to build a chip fabrication facility is so large that it is beyond the reach of most companies.
Even if someone designs a new processor that runs all of the application programs designed for a competing processor family at least as fast as the competing processors, the price of the competing processors includes enough profit that the newcomer is certain to face substantial price cuts from its competitors.
Although designing a competitive processor by increasing hardware complexity is very difficult, there is another way to run application programs (target application programs) designed for a particular family of microprocessors (target microprocessors): emulate the target microprocessor in software on another, faster microprocessor (the host microprocessor). This is an increasingly inexpensive approach, because it requires only the addition of some form of emulation software to run the application programs on the faster microprocessor. The emulator software rewrites the target instructions of an application program written for the target processor family into host instructions that the host microprocessor can execute. These rewritten instructions then run under the control of the operating system on the faster host microprocessor.
Many different designs allow target application programs to run on a host computer whose processor is faster than that of the target computer. Because reduced instruction set (RISC) microprocessors are in theory simpler and faster than other types of processors, host computers that execute target programs using emulation software generally use RISC microprocessors.
However, even though RISC computer systems using emulator software can often run X86 (or other) programs, they usually run them at a rate substantially slower than an X86 computer system runs the same programs. Moreover, these emulator programs often cannot run all, or even a large number, of the existing target programs.
The reasons why emulator programs cannot run target programs as fast as the target microprocessor does are quite complicated, and understanding them requires some knowledge of the different emulation operations. Figure 1 includes a series of diagrams showing the ways in which different types of microprocessors execute target application programs.
In Figure 1(a), a typical CISC microprocessor such as an Intel X86 microprocessor runs a target application program designed to run on that target microprocessor. As shown, the application runs on the CISC processor under a CISC operating system (such as MS DOS, Windows 3.1, Windows NT, or OS/2, which are used with X86 computers) that provides the interface for accessing the computer's hardware. Typically, the instructions of the application program are chosen so that the devices of the computer are used only through the access provided by the operating system. The operating system thus handles the manipulations that allow applications to access the computer's memory and its various input/output devices. The target computer contains memory and hardware that the operating system recognizes, and a call to the operating system from the target application causes an operating system device driver to produce the expected operation on a defined device of the target computer. The instructions of the application execute on the processor, where they are transformed into operations the processor can perform; these operations are embodied in microcode or in the more primitive operations from which microcode is assembled. As described above, each time a complex target instruction executes, it calls the same subroutine stored as microcode (or as the same set of primitive operations). The same subroutine is always executed. If the processor is superscalar, the primitive operations that carry out a target instruction can often be reordered, rescheduled, and executed by the processor over its various processing channels in the manner described above; the subroutine is nevertheless still fetched and executed.
In Figure 1(b), a typical RISC microprocessor, such as the PowerPC used in Apple Macintosh computers, runs the same target application program that was designed to run on the CISC processor of Figure 1(a). As shown, the target application runs on the host processor with the aid of at least part of the target operating system, which responds to a portion of the calls the target application generates. Typically these are calls to the application-like portions of the target operating system, which provide the graphical interface on the display and small utility programs that are generally application-like. The target application and these portions of the target operating system are transformed by a software emulator such as SoftPC, which breaks the instructions supplied by the target application and the application-like target operating system programs into instructions that the host processor and its host operating system can execute. The host operating system provides the interface for accessing the memory and input/output hardware of the RISC computer.
However, the host RISC processor and the hardware devices associated with it in a RISC computer usually differ considerably from the devices associated with the processor for which the target application was designed, and the various instructions supplied by the target application are designed to cooperate with the device drivers of the target operating system in accessing the parts of the target computer. Consequently, the emulation program, which transforms the instructions of the target application into primitive host instructions that the host operating system can use, must somehow link the operations designed to manipulate hardware devices in the target computer to operations that the hardware devices of the host system can carry out. Because the target devices are not those of the host computer, this often requires the emulator software to create virtual devices that respond to the instructions of the target application and carry out operations the host system cannot perform. Sometimes the emulator must create links from these virtual devices through the host operating system to host hardware devices that are physically present but are addressed in a different manner by the host operating system.
Target programs run in this manner are relatively slow for a number of reasons. First, each target instruction from the target application and the target operating system must be transformed by the emulator into the host primitive functions used by the host processor. If the target application was designed for a CISC machine such as the X86, the target instructions are of variable length and quite complex, so transforming them into host primitive instructions is laborious. The original target instruction is first decoded, and the sequence of host primitive instructions that make up the target instruction is determined. The address of each sequence of host primitive instructions is then determined, each sequence is fetched, and the host primitive instructions are executed in order or out of order. Because the emulator must transform the target application and operating system instructions into host instructions understood by the host processor every time an instruction executes, a large number of extra steps are required, and the emulation process is slowed.
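The cost of repeating the transformation on every execution can be seen in a small sketch of an interpretive emulator loop. The toy instruction strings, the lookup table, and the function names are assumptions made for illustration and are not taken from any real emulator.

```python
def decode(target_insn):
    """Map one toy target instruction to host primitives (illustrative table)."""
    table = {
        "inc [x]": ["load r0, x", "add r0, 1", "store x, r0"],
        "nop": [],
    }
    return table[target_insn]


def interpret(program, iterations):
    """Run the program repeatedly, decoding every instruction on every pass."""
    decodes = 0    # how many times the emulator had to decode an instruction
    host_ops = 0   # how many host primitives were issued in total
    for _ in range(iterations):
        for insn in program:
            primitives = decode(insn)   # repeated even for instructions seen before
            decodes += 1
            host_ops += len(primitives)
    return decodes, host_ops
```

Under these assumptions, a two-instruction loop executed 100 times performs 200 decodes, even though only two distinct instructions exist.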
Second, many target instructions refer to operations performed by particular hardware devices that function in a particular manner in the target computer but do not exist in the host computer. To carry out such an operation, the emulation software must either make software connections to the hardware devices of the host computer through the existing host operating system or furnish a virtual hardware device. Emulating the hardware of another computer in software is very difficult. The emulation software must generate virtual devices for each of the target application's calls to the host operating system, and each of these virtual devices must furnish calls to the actual host devices. Emulating a hardware device requires that, when a target instruction is to use the device, the code representing the virtual device required by that instruction be fetched from memory and run to implement the device's function. Either method of solving the problem adds another series of operations to the execution of the instruction sequence.
Complicating the emulation problem is the requirement that the target application handle the various exceptions that the hardware of the target computer and the target operating system take in order for the computer system to operate. When an exception is taken during the operation of the target computer, the state of the computer at the time of the exception must be saved, typically by calling a microcode sequence to accomplish the operation; the correct exception handler must be retrieved; the exception must be handled; and the correct point in the program must be found at which to continue. Sometimes this requires the program to revert to the state of the target computer at the point the exception was taken, and at other times a branch provided by the exception handler is taken. In either case, the target computer hardware and software required to accomplish these operations must somehow be emulated. Because the correct target state must be available when any such exception occurs in order for execution to proceed properly, the emulator must keep accurate track of that state at all times so that it can respond correctly to these exceptions. In the prior art, this has required executing each instruction in the order provided by the target application, because only in this way could the correct target state be maintained.
Moreover, prior art emulators have always had to maintain the execution order of the target application for another reason. Target instructions can be of two kinds: those that affect memory and those that affect memory-mapped input/output (I/O) devices. There is no way to know, without attempting to execute an instruction, whether its operation is directed to memory or to a memory-mapped I/O device. When instructions operate on memory, optimization and reordering are possible and help greatly in speeding up the system. Operations that affect I/O devices, however, often must be carried out in the precise order in which they were programmed, without eliminating any steps, or the operation of the I/O device may be adversely affected. For example, a particular I/O operation may clear the contents of an I/O register. If operations take place out of order so that a register is cleared of a value that is still needed, the result may differ from that commanded by the target instructions. Without a means of distinguishing memory from memory-mapped I/O, all instructions must be treated as though they affect memory-mapped I/O, which greatly limits the performance that optimization can achieve. Because prior art emulators lack both a means of detecting the nature of the memory being addressed and a means of recovering from such failures, they must process target instructions sequentially, as though each instruction affected memory-mapped I/O. This greatly restricts the possibility of optimizing the host instructions.
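The difference between memory and memory-mapped I/O can be illustrated with a small sketch in which a common optimization, reusing the result of an earlier load, is harmless for ordinary memory but changes behavior for a register that clears itself when read. Both classes and both helper functions are hypothetical and exist only for this illustration.

```python
class PlainMemory:
    """Ordinary memory location: reading has no side effect."""
    def __init__(self, value):
        self.value = value

    def read(self):
        return self.value


class ClearOnReadRegister:
    """Hypothetical I/O register whose contents are cleared by a read."""
    def __init__(self, value):
        self.value = value

    def read(self):
        value, self.value = self.value, 0
        return value


def read_twice(location):
    """Unoptimized code: perform both reads as programmed."""
    return location.read(), location.read()


def read_once_reuse(location):
    """'Optimized' code: eliminate the second read and reuse the first value."""
    value = location.read()
    return value, value
```

For `PlainMemory` the two functions agree, so the optimization is safe; for `ClearOnReadRegister` they disagree, which is why an emulator that cannot tell the two apart has to assume the worst case.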
Another problem that limits the ability of prior art emulators to optimize host code stems from self-modifying code. If a target instruction has been translated into a sequence of host instructions, and a subsequent write changes the original target instruction, the host instructions are no longer valid. The emulator must therefore constantly check whether a store is directed to the target code area. All of these problems make this type of emulation much slower than running the target application on the target processor.
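The store check that the paragraph above attributes to prior art emulators can be sketched as follows. This is a software illustration under stated assumptions (one translation per target address, a dictionary standing in for memory); it is not the protection mechanism claimed by this patent, and the class and method names are hypothetical.

```python
class TranslationCache:
    """Toy cache of host translations keyed by target instruction address."""

    def __init__(self):
        self.translations = {}   # target address -> list of host instructions

    def add(self, target_addr, host_insns):
        self.translations[target_addr] = host_insns

    def store(self, memory, addr, value):
        """Perform a store and discard any translation of the written address.

        Returns True if a now-stale translation had to be invalidated.
        """
        memory[addr] = value
        return self.translations.pop(addr, None) is not None
```

Every store pays for the lookup whether or not it touches translated code, which is the overhead the text describes.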
Another example of the type of emulation software shown in Figure 1(b) is described in a paper entitled "Talisman: Fast and Accurate Multicomputer Simulation" (R. C. Bedichek, Laboratory for Computer Science, Massachusetts Institute of Technology). This is a more complete example of translation in that it can emulate a complete research system and run the research target operating system. Talisman uses a host UNIX operating system.
Figure 1(c) shows another example of emulation. In this example, a PowerPC microprocessor used in an Apple Macintosh computer runs a target application program designed for the Motorola 68000 family of CISC processors used in earlier Macintosh computers; the purpose is to allow Apple's legacy programs to run on Macintosh computers with RISC processors. As shown, the target application runs on the host processor with the aid of at least part of the target operating system, which responds to calls to the application-like portions of the target operating system. A software emulator breaks the instructions supplied by the target application and the application-like target operating system programs into instructions that the host processor and its host operating system can execute. The host operating system provides the interface for accessing the memory and input/output hardware of the host computer.
Again, however, the host RISC processor and the devices associated with it in the host RISC computer differ considerably from the devices associated with the Motorola CISC processor, and the various target instructions are designed to cooperate with the target CISC operating system in accessing the parts of the target computer. The emulation program must therefore link the operations designed to manipulate hardware devices in the target computer to operations that the hardware devices of the host system can carry out. This requires the emulator to create software virtual devices that respond to the instructions of the target application and to create links from these virtual devices through the host operating system to host hardware devices that are physically present but are addressed in a different manner by the host operating system.
Target programs run in this manner are slow for the same reasons as the emulation of Figure 1(b). First, each target instruction from the target application and the target operating system must be fetched in order to be transformed, and all of the host primitive functions derived from that instruction must be run in sequence each time the instruction executes. Second, the emulation software must generate virtual devices for each of the target application's calls to the host operating system, and each of these virtual devices must furnish calls to the actual host devices. Third, the emulator must treat all instructions as conservatively as it treats instructions directed to memory-mapped I/O devices, or risk generating exceptions from which it cannot recover. Finally, the emulator must maintain the correct target state at all times, and store operations must always be checked to determine whether a store is directed to the target code area. All of these requirements undermine the emulator's ability to perform significant optimization of the code running on the host processor, and they make this type of emulation much slower than running the target application on the target processor. Even on a very optimistic estimate, the emulation speed is less than one quarter of that of existing processors. In general, such emulation software has found use only where the ability to run applications designed for another processor is useful but not the primary purpose.
Figure 1(d) shows a particular method of emulating a target application on a host processor that provides relatively good performance for a very limited set of target applications. The target application supplies instructions to an emulator, which transforms them into instructions for the host processor and the host operating system. The host processor is an Alpha RISC processor from Digital Equipment Corporation, and the host operating system is Microsoft NT. The only target applications that can run on this system are 32-bit applications designed for a target X86 processor with a Windows WIN32s-compliant operating system. Because the host and target operating systems are almost identical and are designed to handle the same instructions, the emulator software can transform the instructions very easily. Moreover, the host operating system is already designed to respond to the same calls the target application generates, so the number of virtual devices that must be generated is greatly reduced.
Although this is technically an emulation system that runs a target application on a host processor, it is a very special case. Here the emulation software runs on a host operating system already designed to run similar applications. This allows calls from the target application to be directed more simply to the correct facilities of the host processor and host operating system. More importantly, the system runs only 32-bit Windows applications, which account for less than one percent of all X86 applications. Furthermore, the system runs applications on only one operating system, Windows NT, whereas X86 processors run applications designed for a large number of operating systems. Such a system is therefore not compatible in the sense expressed earlier in this specification, and a processor running such an emulator cannot be considered a competitive X86 processor.
Figure 1(e) shows another emulation method, which uses software to run portions of applications written for a first instruction set on a computer that recognizes a different instruction set. This form of emulation software is typically used by programmers porting an application from one computer system to another. Typically, the target application is designed for some target computer other than the host machine on which the emulator runs. The emulator software analyzes the target instructions, translates them into instructions that can run on the host machine, and caches those host instructions so that they can be reused. This dynamic translation and caching allow portions of an application to run very quickly. This form of emulator is normally used with software tracing tools, which provide detailed information about the behavior of the target program being run. The output of a tracing tool can in turn be used to drive an analysis program that analyzes the trace information.
To determine how the code actually functions, an emulator of this type runs with the host operating system on the host machine, furnishes the virtual hardware that the host operating system does not provide, and maps the operations of the computer for which the application was designed onto the hardware resources of the host machine in order to carry out the operations of the program being run. This software virtualization of hardware and mapping to the host computer can be very slow and incomplete.
Moreover, because it often takes several host instructions to carry out one target instruction, exceptions including faults and traps may be generated that require the target operating system's exception handler, and these may cause the host to cease processing host instructions at a point unrelated to target instruction boundaries. When this happens, the exception cannot be handled correctly because the state of the host processor and memory is incorrect. In that case, the emulator must stop and rerun in order to trace the operations that generated the exception. Thus, although such an emulator can run sequences of target code very rapidly, it has no way of recovering from these exceptions and so cannot run any significant portion of an application rapidly.
This is not a particular problem for this form of emulator, because the functions performed by the emulator, the tracer, and the associated analyzer are directed to generating new programs or porting old programs to another machine, so the speed at which the emulator software runs is rarely at issue. That is, a programmer is usually not interested in how fast the code produced by the emulator runs on the host machine, but in whether the emulator produces code that is executable on the machine for which it is designed and runs rapidly on that machine. Consequently, this type of emulation software does not provide a way to run applications written for a first instruction set on a different type of microprocessor for purposes other than programming. An example of this type of emulation software is discussed in the article entitled "Shade: A Fast Instruction-Set Simulator for Execution Profiling" by Cmelik and Keppel.
It is therefore desirable to provide a competitive microprocessor that is faster and less expensive than prior art microprocessors yet is entirely compatible with target applications designed for prior art microprocessors running any of the available operating systems.
More particularly, it is desirable to provide a host processor that includes circuitry for increasing the speed at which the processor functions.
Summary of the invention
It is therefore an object of the present invention to provide a host processor with apparatus for enhancing microprocessor performance, the resulting microprocessor being less expensive than prior art microprocessors, compatible with them, and able to run applications and operating systems designed for other microprocessors faster than those microprocessors can.
These objects of the present invention are realized by an apparatus and a method for responding to an attempt to write to a memory address that includes a target instruction which has been translated into a host instruction for execution by a host processor. The method includes the steps of: marking a memory address that includes a target instruction which has been translated into a host instruction; detecting a marked memory address when an attempt is made to write to it; and responding to the detection of a marked memory address by protecting the target instruction at the memory address until it is assured that a translation associated with the memory address will not be used before being updated.
According to a first aspect of the present invention, there is provided a system for protecting memory in a computer from being written to, the computer including a host processor designed to execute a host instruction set and software for translating instructions from a target instruction set into instructions of the host instruction set, the system comprising: hardware means, including a translation lookaside buffer, for indicating whether a memory address stores a target instruction that has been translated into a host instruction, the translation lookaside buffer including a plurality of storage entries for virtual addresses and their associated physical addresses, each storage entry of the translation lookaside buffer having a storage position; and software means, including a translator, a translation buffer for holding translations and related information, and a trap handler, responsive to an indication that a memory address stores a target instruction that has been translated into a host instruction, for protecting against a write to the memory address until it is assured that a translation associated with the memory address will not be used before being updated once the memory address has been written.
According to a second aspect of the present invention, there is provided a computer system for protecting memory in a computer from being written to, comprising: a host processor designed to execute instructions of a host instruction set; software for translating instructions from a target instruction set into instructions of the host instruction set; memory for storing target instructions from a program being translated; a translation buffer for storing host instructions translated from target instructions for execution, together with related information; hardware means, including a translation lookaside buffer, for generating an exception on a write access to a target address that stores a target instruction which has been translated into a host instruction, the translation lookaside buffer including a plurality of storage entries for virtual addresses of recently accessed memory and their associated physical addresses, each storage entry of the translation lookaside buffer having a storage position; and a trap handler, responsive to the exception on a write access to the target address, for protecting against a write to the memory address until it is assured that a translation associated with the memory address will not be used before being updated.
According to a third aspect of the present invention, there is provided a method for protecting memory in a computer from being written to, where a memory address includes a target instruction that has been translated into a host instruction for execution by a host processor, the method comprising the steps of: marking a memory address that includes a target instruction which has been translated into a host instruction; detecting a marked memory address when an attempt is made to write to it; and responding to the detection of a marked memory address by protecting the target instruction at the memory address until it is assured that a translation associated with the memory address will not be used before being updated; wherein the marking step comprises storing, in a storage entry of a translation lookaside buffer, an indication that the target address has been translated together with the physical address of the target instruction, and the responding step comprises: generating an exception in response to detecting a marked memory address; and responding to the exception by invalidating the translation associated with the memory address before the memory address is written.
According to a fourth aspect of the present invention, there is provided a microprocessor for protecting memory in a computer from being written to, comprising: a host processor capable of executing a first instruction set; code morphing software, including a translator, a translation buffer, and a trap handler, for translating programs written for a target processor having a second, different instruction set into instructions of the first instruction set for execution by the host processor; and a memory controller comprising: an address translation buffer including a plurality of storage entries that record recently accessed virtual target addresses and the physical memory addresses represented by those virtual target addresses, each storage entry including means for indicating whether the target instruction at the physical address has been translated into a host instruction; and means, responsive to a write access to an address in a storage entry of the address translation buffer for which the indicating means indicates that the target instruction at the physical address has been translated into a host instruction, for protecting against a write to the memory address until it is assured that a translation associated with the memory address will not be used before being updated.
According to a fifth aspect of the present invention, there is provided a memory controller for protecting memory in a computer from being written to, comprising: an address translation buffer including a plurality of storage entries that record recently accessed virtual addresses and the physical addresses represented by those virtual addresses, each storage entry including means for indicating whether the physical address stores an instruction of a target instruction set that has been translated into an instruction of a host instruction set; and means for detecting the indication in a storage entry in order to prevent the physical address from being written and to indicate a subsequent operation before the address is accessed.
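As an illustration only, the marking, detecting, and responding steps of the method aspect can be sketched in Python. This is a toy model under stated assumptions and not the claimed hardware: the class `ProtectedMemory`, its method names, and the 4096-byte page granularity are all introduced here for illustration, and a Python set stands in for the per-entry indication held in the translation lookaside buffer.

```python
class ProtectedMemory:
    """Toy model of write protection for translated target code (hypothetical names)."""

    PAGE_SIZE = 4096  # assumed granularity; not specified by the text above

    def __init__(self):
        self.memory = {}        # physical address -> stored value
        self.marked = set()     # pages whose target code has been translated
        self.translations = {}  # page -> ids of translations derived from it
        self.invalidated = []   # ids of translations the trap handler discarded

    def mark_translated(self, address, translation_id):
        """Marking step: record that this address holds translated target code."""
        page = address // self.PAGE_SIZE
        self.marked.add(page)
        self.translations.setdefault(page, []).append(translation_id)

    def write(self, address, value):
        """Detecting step: a write to a marked page traps before the store."""
        page = address // self.PAGE_SIZE
        if page in self.marked:
            self._trap_handler(page)
        self.memory[address] = value

    def _trap_handler(self, page):
        """Responding step: invalidate stale translations, then clear the mark."""
        self.invalidated.extend(self.translations.pop(page, []))
        self.marked.discard(page)


mem = ProtectedMemory()
mem.mark_translated(0x1000, "t1")
mem.write(0x1004, 0x90)  # traps: translation "t1" is invalidated before the write
```

The design point the sketch tries to capture is ordering: the translation is invalidated before the store reaches memory, so a stale translation can never run against modified target code.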
The various objects and features of the invention will be better understood from the following detailed description taken together with the drawings, in which like elements carry like designations.
Brief description of the drawings
Figures 1(a)-(e) are diagrams illustrating the manner of operation of microprocessors designed in accordance with the prior art.
Figure 2 is a block diagram of a microprocessor designed in accordance with the present invention that can run applications designed for a different microprocessor.
Figure 3 is a diagram illustrating a portion of the microprocessor shown in Figure 2.
Figure 4 is a block diagram of a register file used in a microprocessor designed in accordance with the present invention.
Figure 5 is a block diagram of a gated store buffer designed in accordance with the present invention.
Figures 6(a)-(c) illustrate instructions used in various prior art microprocessors and in a microprocessor designed in accordance with the present invention.
Figure 7 illustrates a method practiced by the software portion of a microprocessor designed in accordance with the present invention.
Figure 8 illustrates another method practiced by the software portion of a microprocessor designed in accordance with the present invention.
Figure 9 is a block diagram of an improved computer system that includes the present invention.
Figure 10 is a block diagram of a portion of the microprocessor shown in Figure 3.
Figure 11 is a more detailed block diagram of the translation lookaside buffer in the microprocessor of Figure 3.
Notation and terminology
Some portions of the detailed description that follows are presented in terms of symbolic representations of operations on data bits within a computer memory. These descriptions and representations are the means by which those skilled in the data processing arts most effectively convey the substance of their work to others skilled in the art. The operations are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. For convenience, and principally for reasons of common usage, these signals are referred to as bits, values, elements, symbols, characters, terms, numbers, or the like. It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to those quantities.
The manipulations performed are often referred to in terms, such as adding or comparing, that are commonly associated with human mental operations. No such human capability is necessary or desirable in most of the operations that form part of the present invention; the operations are machine operations. Useful machines for performing the operations of the present invention include general purpose digital computers or other similar devices. In all cases, the distinction between the method of operating a computer and the method of computation itself should be borne in mind. The present invention relates to a method and apparatus for operating a computer that processes electrical or other (for example, mechanical or chemical) physical signals to generate other desired physical signals.
In the following description, because the majority of emulators run X86 applications, the target program is in some cases treated as a program designed to execute on an X86 microprocessor in order to illustrate details of operation. However, the target program may be one designed to run on any family of target computers. This includes target virtual computers, such as Pcode machines, Postscript machines, or Java virtual machines.
Detailed description
The present invention helps overcome the problems of the prior art and provides a microprocessor that is faster than prior art microprocessors, is capable of running all of the software for all of the operating systems that a large number of prior art microprocessors can run, and yet is less expensive than prior art microprocessors.
Rather than using more complicated hardware to accelerate its operation, the present invention combines an enhanced hardware processing portion that is much simpler than prior art microprocessors (referred to in this specification as a "morph host") with emulating software (referred to as "code morphing software") to form a microprocessor more capable than known competitive microprocessors. More particularly, the morph host is a processor that includes hardware enhancements allowing the state of the target computer to be restored immediately when an exception or error occurs, while the code morphing software translates the instructions of a target program into morph host instructions and, when necessary, responds to exceptions and errors by replacing working state with correct target state so that correct retranslations can occur. The code morphing software can also include various processes for increasing processing speed. Rather than providing hardware to increase processing speed, as do all of the very fast prior art microprocessors, the improved microprocessor allows a large number of acceleration techniques to be carried out in selectable stages by the code morphing software. Providing the acceleration techniques in the code morphing software allows the morph host to be implemented with hardware that is considerably less complicated, faster than the hardware of prior art microprocessors, and substantially less expensive. By way of comparison, one embodiment of the invention, implemented with a morph host containing one quarter the number of gates of the Pentium Pro microprocessor, runs X86 applications faster than the Pentium Pro microprocessor or any other known microprocessor capable of processing such applications.
The code morphing software uses certain techniques that have previously been used only by programmers designing new software or emulating new hardware. The morph host includes hardware enhancements especially adapted to allow the acceleration techniques provided by the code morphing software to be used efficiently. These hardware enhancements allow the code morphing software to apply acceleration techniques over a broader range of instructions. They also permit additional acceleration techniques that are unavailable in hardware processors and could not be implemented in such processors except at exorbitant cost. These techniques significantly increase the speed of a microprocessor that includes the invention compared with prior art microprocessors executing their native instruction sets.
For example, code morphing software combined with the enhanced morph host can reorder and reschedule the primitive instructions generated from a sequence of target instructions without adding significant circuitry. Because a large number of target instructions can be reordered and rescheduled together, other optimization techniques can be used to reduce the number of processor steps needed to execute a group of target instructions to fewer than any other microprocessor running the target application requires.
The code morphing software combined with the enhanced morph host quickly translates target instructions into instructions for the morph host and caches those host instructions in a memory data structure (referred to in this specification as a "translation buffer"). Using a translation buffer to hold translated instructions allows instructions to be recalled without rerunning, each time a target instruction is executed, the lengthy process of determining which primitive instructions are required to implement the target instruction, addressing each primitive instruction, fetching each primitive instruction, optimizing the sequence of primitive instructions, allocating resources to each primitive instruction, reordering the primitive instructions, and executing each step of each sequence of primitive instructions. Once a target instruction has been translated, it can be recalled from the translation buffer and executed without any of these steps.
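The caching idea can be sketched as follows. This is a minimal illustration, assuming a hypothetical `translate` callback in place of the real translator; the class name `TranslationBuffer`, the `lookup` method, and the string-valued "host instructions" are all inventions of this sketch rather than details from the specification.

```python
class TranslationBuffer:
    """Cache of host-instruction sequences keyed by target address (illustrative)."""

    def __init__(self, translate):
        self._translate = translate   # hypothetical translator callback
        self._cache = {}              # target address -> translated host sequence
        self.translation_count = 0    # how many times the slow path actually ran

    def lookup(self, target_address):
        """Return the cached translation, translating only on the first request."""
        if target_address not in self._cache:
            self._cache[target_address] = self._translate(target_address)
            self.translation_count += 1
        return self._cache[target_address]


def toy_translate(target_address):
    # Stand-in for decode, optimize, schedule; returns a fake host sequence.
    return [f"host_op_for_{target_address:#x}"]


buffer = TranslationBuffer(toy_translate)
first = buffer.lookup(0x400)    # slow path: translation is generated and stored
second = buffer.lookup(0x400)   # fast path: the stored translation is reused
```

The second lookup returns the same stored object without invoking the translator again, which is the saving the paragraph above describes.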
A primary problem of prior art emulation techniques has been their inability to handle, with good performance, exceptions generated during the execution of a target program. This is especially true of exceptions generated while running a target application that are directed to the target operating system, where the correct target state at the time of the exception must be available so that the exception can be handled properly and the instructions that follow can be executed. Consequently, the emulator is forced to keep accurate track of the target state at all times and must constantly check whether a store is to a target code area. Other exceptions create similar problems. For example, exceptions can be generated when the emulator detects particular target operations that have been replaced by some particular host function. In particular, various hardware operations of a target processor may be replaced by software operations provided by the emulator software. Additionally, the host processor executing the host instructions derived from the target instructions can also generate exceptions. All of these exceptions can occur either while the emulator is attempting to convert target instructions into host instructions or while the host translations are being executed on the host processor. An efficient emulation must provide some means of recovering from these exceptions efficiently and in a manner that allows the exception to be handled correctly. None of the prior art does this for all software that might be emulated.
To overcome these limitations of the prior art, a number of hardware improvements are included in the enhanced morph host. These improvements include a gated store buffer and a large number of additional processor registers. Some of the additional registers allow register renaming to be used to lessen the problem of instructions needing the same hardware resources. The additional registers also allow the maintenance of a set of host or working registers for processing the host instructions and a set of target registers that hold the official state of the target processor for which the target application was created. The target (or shadow) registers are connected to their working register equivalents through a dedicated interface that allows an operation called "commit" to transfer the contents of all working registers quickly to the official target registers, and an operation called "rollback" to transfer the contents of all official target registers quickly back to their working register equivalents. The gated store buffer holds working memory state changes on the "uncommitted" side of a hardware "gate" and official memory state changes on the "committed" side of the gate, from which committed stores "drain" to main memory. A commit operation transfers stores from the uncommitted side of the gate to the committed side. The additional official registers and the gated store buffer allow the memory state and the target register state to be updated together once one target instruction or a group of target instructions has been translated and run without error.
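The interplay of working registers, official registers, and the gated store buffer can be sketched in a few lines. This is a simplified model under stated assumptions: the class `MorphHostState` and its method names are hypothetical, registers are a plain dictionary, and the "drain" of committed stores to main memory is collapsed into the commit step rather than modeled as a separate stage.

```python
class MorphHostState:
    """Toy model of commit/rollback with a gated store buffer (hypothetical names)."""

    def __init__(self, registers):
        self.working = dict(registers)    # registers used while running host code
        self.official = dict(registers)   # last committed target register state
        self.uncommitted = []             # stores held on the uncommitted side of the gate
        self.memory = {}                  # stands in for main memory

    def store(self, address, value):
        """Buffer a store; it is not visible in memory until commit."""
        self.uncommitted.append((address, value))

    def commit(self):
        """Copy working registers to official ones and release buffered stores."""
        self.official = dict(self.working)
        for address, value in self.uncommitted:
            self.memory[address] = value  # simplification: drain happens immediately
        self.uncommitted.clear()

    def rollback(self):
        """Restore working registers and discard every uncommitted store."""
        self.working = dict(self.official)
        self.uncommitted.clear()


state = MorphHostState({"eax": 0})
state.working["eax"] = 5
state.store(0x10, 7)
state.rollback()          # exception path: eax returns to 0, the store is dropped
state.working["eax"] = 9
state.store(0x10, 1)
state.commit()            # success path: registers and memory update together
```

Keeping registers and stores behind a single commit point is what lets both kinds of state advance together, or be discarded together, at a target instruction boundary.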
The code morphing software chooses these updates so that they occur on integral target instruction boundaries. Thus, if the primitive host instructions making up a translation of a series of target instructions are run by the host processor without generating exceptions, the working memory stores and working register state generated by those instructions are transferred to official memory and to the official target registers. If, on the other hand, an exception occurs while host instructions are being processed at a point that is not on the boundary of the target instruction or instructions being translated, the original state in the target registers at the last update (or commit) can be recalled to the working registers, and the uncommitted memory stores in the gated store buffer can be discarded. If the exception is a target exception, the target instructions that caused it can then be retranslated one at a time and executed in sequence, as a target microprocessor would execute them. As each target instruction is executed without error, the state of the target registers can be updated and the data in the store buffer can be gated to memory. Then, when the exception occurs again while running the host instructions, the correct state of the target computer is held by the target registers of the morph host and by memory, and the operation can be handled correctly without delay. Each new translation generated by this corrective retranslation can be cached for later use as it is produced, or alternatively discarded for a one-time or rare occurrence such as a page fault. This allows the microprocessor formed by the combination of the code morphing software and the morph host to execute instructions more rapidly than the processors for which the software was originally written.
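The recovery strategy described above, rolling back and then replaying one target instruction at a time, can be sketched as follows. This is an illustration under stated assumptions, not the actual mechanism: target instructions are modeled as Python callables, `ArithmeticError` stands in for a target exception, and the function and helper names are hypothetical.

```python
def run_translation(ops, state):
    """Run a block speculatively; return None on success or the faulting op's index."""
    official = dict(state)            # state at the last commit
    try:
        for op in ops:
            op(state)                 # speculative execution of the whole block
        return None                   # no exception: the block's results stand
    except ArithmeticError:
        state.clear()
        state.update(official)        # rollback: discard all speculative changes

    # Replay one target instruction at a time, committing after each success.
    for index, op in enumerate(ops):
        official = dict(state)
        try:
            op(state)
        except ArithmeticError:
            state.clear()
            state.update(official)    # exact target state at the faulting instruction
            return index
    return None


def increment(state):
    state["x"] += 1


def faulting(state):
    state["x"] += 100                 # partial effect before the fault
    raise ZeroDivisionError("simulated target exception")


state = {"x": 0}
fault_index = run_translation([increment, increment, faulting, increment], state)
```

After the call, `state` reflects only the two instructions that completed before the fault, which is the precise target state an exception handler would need.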
It should be noted that, in executing target programs using the microprocessor of the present invention, many different types of exceptions can occur that must be handled in different ways. For example, some exceptions are caused by the target software generating an exception that uses a target operating system exception handler. The use of such an exception handler requires that the code morphing software include routines for emulating the entire exception handling process, including any hardware provided by the target computer for handling the process. This requires that the code morphing software save the state of the target processor so that execution can proceed correctly after the exception has been handled. Some exceptions, such as a page fault, which requires fetching data into a new page of memory before the process being translated can be carried out, require a return to the beginning of the process being translated after the exception has been handled. Other exceptions implement in software a particular operation that is not provided by the hardware. These exceptions require that the exception handler return the operation to the next step in the translation after the exception has been handled. Each of these different types of exceptions can be handled efficiently by the microprocessor of the present invention.
In addition, some exceptions are generated by the host hardware and detect a variety of host and target conditions. Some of these exceptions behave as they would on a conventional microprocessor, while others are used by the code morphing software to detect the failure of various speculations. In these cases, the code morphing software, using the state saving and restoring mechanisms described above, restores the target state to its most recent official version and generates and saves a new translation (or reuses a previously generated safe translation) that avoids the failed speculation. That translation is then executed.
The morph host includes additional hardware exception detection mechanisms that, in combination with the rollback and retranslation method described above, allow further optimization. Examples are a means for distinguishing memory from memory-mapped I/O, and a means for eliminating memory references by protecting addresses or address ranges, which allows target variables to be held in registers.
When exceptions are used to detect the failure of other speculations, such as whether an operation affects memory or memory-mapped I/O, recovery is accomplished by generating a new translation with different memory operations and different optimizations.
Figure 2 is a diagram of morph host hardware running the same application program as the CISC processor of Figure 1(a). As shown, the microprocessor includes the code morphing software portion described above and the enhanced hardware morph host portion. The target application program supplies target instructions to the code morphing software for translation into host instructions that the morph host can execute. Meanwhile, the target operating system receives calls from the target application program and passes them to the code morphing software. In the preferred embodiment of the microprocessor, the morph host is a very long instruction word (VLIW) processor designed with multiple processing paths. Figure 6(c) shows the overall operation of such a processor.
Figures 6(a)-(c) show instructions suited to a CISC processor, a RISC processor, and a VLIW processor, respectively. CISC instructions are of varying length and may include several more primitive operations (for example, a load and an add). RISC instructions, on the other hand, are of equal length and are essentially primitive operations. A single very long instruction for the illustrated VLIW processor includes each of the more primitive operations of the CISC and RISC instructions (for example, load, store, integer add, compare, floating point multiply, and branch). As shown in Figure 6(c), each of the primitive instructions that together make up a single very long instruction word is furnished, in parallel with the other primitive instructions, either to one of the separate processing paths of the VLIW processor or to memory, so that the processing paths and memory handle them in parallel. The results of all the parallel operations are transferred to a multiported register file.
A VLIW processor that can serve as the basis of the morph host is much simpler than the other processors described above. It includes neither circuitry to detect result dependencies nor circuitry to reorder, optimize, and reschedule primitive instructions. This in turn allows faster processing at higher clock rates than is possible with either the processors for which the target application programs were originally designed or other processors that run target application programs under an emulator. The approach is not limited to VLIW processors, however; any type of processor, such as a RISC processor, can achieve the same result.
The code morphing software of the microprocessor shown in Figure 2 includes a translator portion that decodes the instructions of the target application, converts those target instructions into primitive host instructions that the morph host can execute, optimizes the operations required by the target instructions, reorders and schedules the primitive instructions into VLIW instructions (a translation) for the morph host, and executes the host VLIW instructions. Figure 7 illustrates the operation of the translator and shows the main loop of the code morphing software.
To speed up the microprocessor that combines the code morphing software with the enhanced morph host hardware, the code morphing software includes a translation buffer, as shown in Figure 2. In one embodiment the translation buffer is a software data structure that can be stored in memory; a hardware cache could also be used in a particular embodiment. The translation buffer stores the host instructions that embody each completed translation of target instructions. Once individual target instructions have been translated and the resulting host instructions optimized, reordered, and rescheduled, the resulting host translation is stored in the translation buffer. The host instructions that make up the translation are then executed by the morph host. If the host instructions execute without raising an exception, the translation can be recalled whenever the operations implemented by that target instruction or instructions are required again.
As shown in Figure 7, when the application program furnishes the address of a target instruction, the code morphing software of the microprocessor typically begins by determining whether the target instruction at that address has already been translated. If it has not, that target instruction and the target instructions that follow it are fetched, decoded, translated, and then (possibly) optimized, reordered, and rescheduled into a new host translation, which the translator stores in the translation buffer. As will be seen below, optimization can be carried out to varying degrees. In this specification, the term "optimization" generally refers to techniques that speed up processing. Reordering, for example, is one form of optimization that allows faster processing and so falls within the term. Many of the possible optimizations have been described in the prior art of compiler optimization, and some optimizations that were difficult to perform in the prior art, such as "superblocks", emerged from VLIW research. Control is then transferred to the translation so that execution resumes on the enhanced morph host hardware.
When that particular sequence of target instructions is next encountered while running the application, the host translation is found in the translation buffer and executed immediately, without translating, optimizing, reordering, or rescheduling. Using the advanced techniques described below, it is estimated that the translation for a target instruction (once completely translated) will be found in the translation buffer in all but about one of every one million executions of the translation. Consequently, after the first translation, all of the steps required for translation (decoding, fetching primitive instructions, optimizing the primitive instructions, rescheduling them into a host translation, and storing the result in the translation buffer) can be skipped. Because the processor for which the target instructions were written must decode, fetch, reorder, and reschedule each instruction every time it is executed, this drastically reduces the work required to execute the target instructions and increases the speed of the improved microprocessor.
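The lookup-then-translate behaviour of the main loop can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `TranslationBuffer`, `lookup_or_translate`, and `toy_translate` are hypothetical, and the "translation" is a placeholder string standing in for the decode, optimize, reorder, and reschedule steps.

```python
# Illustrative sketch only; all names here are hypothetical.
from typing import Callable, Dict, List


class TranslationBuffer:
    """Software cache mapping a target address to its host translation."""

    def __init__(self) -> None:
        self.entries: Dict[int, List[str]] = {}
        self.translations_made = 0  # how often the slow path was taken

    def lookup_or_translate(
        self, target_addr: int, translate: Callable[[int], List[str]]
    ) -> List[str]:
        host_code = self.entries.get(target_addr)
        if host_code is None:
            # Slow path: translate once, then cache the result for reuse.
            host_code = translate(target_addr)
            self.entries[target_addr] = host_code
            self.translations_made += 1
        return host_code


def toy_translate(target_addr: int) -> List[str]:
    # Stand-in for decode / translate / optimize / reorder / reschedule.
    return [f"host_op_for_{target_addr:#x}"]


buf = TranslationBuffer()
for _ in range(1000):
    buf.lookup_or_translate(0x1000, toy_translate)
print(buf.translations_made)  # the slow path ran only on the first lookup
```

In this sketch the cost of translation is paid once per target address; every later execution of the same address is a dictionary lookup.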
By eliminating all of these steps that prior art processors require to execute a target application, the microprocessor of the present invention overcomes the problems of the prior art that made such operations impossible at any reasonable speed. For example, some of the techniques of the improved microprocessor were used in the emulators described above to port applications to other systems. However, some of those emulators could not run anything but short portions of an application, because when translated instructions were processed, exceptions that call various system exception handlers occurred at points in the operation where the state of the host processor bore no relation to the state of a target processor executing the same instructions. The state of the target processor at the moment such an exception was generated was therefore unknown. The correct state of the target machine could not be determined; the operation had to be stopped, restarted, and the correct state ascertained before the exception could be serviced and execution continued. This made it impossible to run an application at host speed.
The morph host hardware includes a number of enhancements that address this problem. These enhancements are shown in Figures 3, 4, and 5. To determine the correct state of the registers at the time an error occurs, the enhanced hardware provides a set of official target registers that hold the register state of the target processor for which the original application was designed. These target registers may be included in each floating point unit, any integer unit, and any other execution unit. The official registers are added to the morph host along with an increased number of normal working registers, so that a range of optimizations, including register renaming, can be practiced. One embodiment of the enhanced hardware includes sixty-four working registers in the integer unit and thirty-two working registers in the floating point unit. The embodiment also includes an enhanced set of target registers comprising all of the frequently changed target processor registers needed to provide the state of that processor; these include condition control registers and other registers required to control the emulated system.
It should be noted that, depending on the type of enhanced processing hardware used by the morph host, a translated instruction sequence may include the primitive operations that make up several target instructions from the original application. A VLIW microprocessor, for example, can execute several CISC or RISC instructions at once, as shown in Figures 6(a)-(c). Whatever the type of morph host, the state of the target registers of the morph host hardware does not change except at an integral target instruction boundary, and at that point all of the target registers are updated together. Thus, if the microprocessor is executing one or more target instructions that have been translated into a sequence of primitive operations, which may have been reordered and rescheduled into a host translation, then when the processor begins executing the translated sequence the official target registers hold the values that the registers of the target processor for which the application was designed would hold when the first target instruction is addressed. Once the morph host has begun executing the translation, however, the working registers hold values determined by the primitive operations of the translation executed up to that point. So although some working registers may hold the same values as the official target registers, others hold values that are meaningless to the target processor. This is especially true in an embodiment that provides many more registers than the particular target machine has in order to allow advanced acceleration techniques. Once execution of the translated host instructions begins, the working registers hold whatever values the translated host instructions determine. If a set of translated host instructions executes without raising an exception, the new working register values determined at the end of the set are transferred together to the official target registers (possibly including a target instruction pointer register). In the present embodiment of the processor, this transfer takes place in an additional pipeline stage outside the execution of the host instructions, so it does not slow the operation of the morph host.
Similarly, a gated store buffer, an embodiment of which is shown in Figure 5, is used in the hardware of the improved microprocessor to control the transfer of data to memory. The gated store buffer includes a number of elements, each of which can hold the address and data of a memory store operation. These elements may be implemented by any of a number of different hardware arrangements (for example, first-in first-out buffers); the embodiment shown is implemented using random access memory and three dedicated working registers. The three registers store, respectively, a pointer to the head of the queue of memory stores, a pointer to the gate, and a pointer to the tail of the queue. Memory stores between the head of the queue and the gate are already committed to memory, while those between the gate and the tail are not yet committed. Memory stores generated during the execution of host translations are placed in the store buffer by the integer unit in the order in which the morph host executes the host instructions, but they are not allowed to be written to memory until a commit operation is encountered in a host instruction. Thus, as translations execute, store operations are placed in the queue. Assuming these are the first stores, so that no other stores are in the gated store buffer, the head and gate pointers point to the same position. As each store is executed, it is placed in the next position in the queue and the tail pointer is incremented to the next position (upward in the figure). This continues until a commit instruction is executed, which normally happens when the translation of a set of target instructions has been completed without an exception or an error exit condition. When the morph host has executed a translation without error, the memory stores generated during execution are moved together past the gate of the store buffer (committed) and subsequently written to memory. In the embodiment shown, this is done by copying the value in the register holding the tail pointer into the register holding the gate pointer.
It can thus be seen that the transfer of register state from the working registers to the official target registers and the transfer of working memory stores to official memory occur together, and only on boundaries between integral target instructions, in response to an explicit commit operation.
This allows the microprocessor to recover, with almost no delay, from target exceptions that occur while the enhanced morph host is executing. If a target exception is generated while any translated instruction or instructions are running, the exception is detected by the morph host hardware or software. In response to the detection of the target exception, the code morphing software can copy the values held in the official registers back into the working registers and discard any uncommitted memory stores in the gated store buffer (an operation referred to as "rollback"). The memory stores in the gated store buffer of Figure 5 can be discarded by copying the value in the register holding the gate pointer into the register holding the tail pointer.
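The commit and rollback operations on the working registers, the official registers, and the gated store buffer can be modelled with a short Python sketch. This is a software model under stated assumptions, not the hardware described: the names `GatedStoreBuffer`, `HostState`, and `drain` are hypothetical, registers are a plain dictionary, and memory is a dictionary keyed by address.

```python
# Illustrative sketch only; class and method names are hypothetical.
from typing import Dict, List, Tuple


class GatedStoreBuffer:
    """Queue of (address, data) stores with head, gate and tail pointers."""

    def __init__(self) -> None:
        self.slots: List[Tuple[int, int]] = []
        self.head = 0  # oldest store not yet written to memory
        self.gate = 0  # stores in [head, gate) are committed
        self.tail = 0  # stores in [gate, tail) are uncommitted

    def store(self, addr: int, data: int) -> None:
        del self.slots[self.tail:]  # drop stale slots left by a rollback
        self.slots.append((addr, data))
        self.tail += 1

    def commit(self) -> None:
        self.gate = self.tail  # copy the tail pointer into the gate pointer

    def rollback(self) -> None:
        self.tail = self.gate  # copy the gate pointer into the tail pointer

    def drain(self, memory: Dict[int, int]) -> None:
        # Write committed stores to memory in program order.
        while self.head < self.gate:
            addr, data = self.slots[self.head]
            memory[addr] = data
            self.head += 1


class HostState:
    """Working and official register sets plus the gated store buffer."""

    def __init__(self, registers: Dict[str, int]) -> None:
        self.official = dict(registers)
        self.working = dict(registers)
        self.stores = GatedStoreBuffer()
        self.memory: Dict[int, int] = {}

    def commit(self) -> None:
        self.official = dict(self.working)  # registers and stores move together
        self.stores.commit()
        self.stores.drain(self.memory)

    def rollback(self) -> None:
        self.working = dict(self.official)  # discard speculative register state
        self.stores.rollback()              # discard uncommitted stores


state = HostState({"eax": 0})
state.working["eax"] = 5
state.stores.store(0x10, 5)
state.rollback()  # an exception occurred: nothing becomes visible
after_rollback = (state.working["eax"], dict(state.memory))

state.working["eax"] = 7
state.stores.store(0x10, 7)
state.commit()  # the translation finished cleanly
after_commit = (state.official["eax"], dict(state.memory))
```

Both operations reduce to pointer copies plus a register-set copy, which is why the text can describe recovery as occurring with almost no delay.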
Copying the values of the target registers into the working registers places the address of the first of the target instructions that were running when the exception occurred in the working instruction pointer register. Starting from this official state of the target processor in the working registers, the target instructions that were running when the exception occurred are retranslated in serial order, without any reordering or other optimization. After each target instruction is decoded again and translated into a new host translation, the translated host instructions representing that target instruction are executed by the morph host, and they either cause an exception or do not. (If the morph host is a processor other than a VLIW processor, each primitive operation of the host translation is executed in sequence; if no exception occurs while the host translation runs, the next primitive function is run.) This continues until an exception recurs or the single target instruction has been translated and executed. In one embodiment, if the translation of a target instruction executes without raising an exception, the state of the working registers is transferred to the target registers and any data in the gated store buffer is committed so that it can be transferred to memory. If, however, an exception recurs while a translation is running, the state of the target registers and memory has not changed and is identical to the state that would exist in a target computer at the moment the exception occurs. Consequently, when the target exception is generated, it is handled correctly by the target operating system.
Similarly, once the first target instruction of the sequence whose translation generated the exception has executed without an exception, the target instruction pointer points to the next target instruction. This second target instruction is handled in the same way as the first: it is decoded and retranslated without optimization or reordering. As the morph host processes each host translation of a single target instruction, any exception that occurs will be generated when the state of the target registers and memory is identical to the state that would exist in the target computer. The exception can therefore be handled immediately and correctly. These new translations can be stored in the translation buffer as the correct translations for that sequence of instructions in the target application and recalled whenever the instructions are executed again.
Other embodiments that achieve the same result as the gated store buffer of Figure 5 include arrangements that transfer stores directly to memory while recording enough data to restore the state of the target computer if executing a translation results in an exception or an error that requires a rollback. In such a case, the effect of any memory stores that occurred during translation and execution would have to be reversed and the memory state that existed at the beginning of the translation restored, while the working registers would receive the data held in the official target registers in the manner described above. One embodiment that accomplishes this maintains a separate target memory holding the original memory state, which is used to replace the overwritten memory if a rollback occurs. Another embodiment that accomplishes memory rollback logs each store and the memory data it replaces as the store occurs, and then reverses the store process if a rollback is required.
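The logging alternative can be sketched in Python as an undo log. The sketch assumes a dictionary-backed memory; the class name `LoggedMemory` and its methods are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only; LoggedMemory is a hypothetical name.
from typing import Dict, List, Optional, Tuple


class LoggedMemory:
    """Stores go straight to memory; an undo log makes rollback possible."""

    def __init__(self) -> None:
        self.cells: Dict[int, int] = {}
        # Each log entry: (address, whether a value existed, the old value).
        self.log: List[Tuple[int, bool, Optional[int]]] = []

    def store(self, addr: int, value: int) -> None:
        self.log.append((addr, addr in self.cells, self.cells.get(addr)))
        self.cells[addr] = value

    def commit(self) -> None:
        self.log.clear()  # the stores become permanent

    def rollback(self) -> None:
        # Undo the stores in reverse order to restore the original state.
        while self.log:
            addr, existed, old = self.log.pop()
            if existed:
                self.cells[addr] = old
            else:
                self.cells.pop(addr, None)


mem = LoggedMemory()
mem.store(0x20, 1)
mem.commit()
mem.store(0x20, 2)  # overwrites a committed value
mem.store(0x24, 3)  # writes a previously unused address
mem.rollback()
```

Compared with the gated store buffer, this arrangement makes commit cheap and rollback proportional to the number of stores since the last commit.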
The code morphing software provides an additional operation that greatly increases the speed at which translated programs are processed. In addition to simply translating the instructions, optimizing, reordering, rescheduling, caching, and executing each translation so that it can be rerun whenever that set of instructions needs to be executed, the translator also links the different translations so that, in almost all cases, a return to the main loop of the translation process is avoided. Figure 8 shows the steps carried out by the translator portion of the code morphing software in accomplishing this linking process. Those of ordinary skill in the art will appreciate that this linking operation essentially eliminates the return to the main loop for most translations of instructions, which eliminates that overhead.
For purposes of illustration, assume that the target program being run consists of X86 instructions. When a sequence of target instructions is translated and the primitive host instructions are reordered and rescheduled, two primitive instructions may occur at the end of each host translation. The first is a primitive instruction that updates the value of the instruction pointer of the target processor (or its equivalent); it places the correct address of the next target instruction in the target instruction pointer register. This primitive instruction is followed by a branch instruction that contains the address of each of the two possible targets of the branch. The primitive instruction that precedes the branch instruction can update the value of the target processor's instruction pointer by testing the condition codes in the condition code register and then determining whether one of the two branch addresses indicated by the condition controlling the branch is stored in the translation buffer. The first time the sequence of target instructions is translated, both branch targets of the host instruction hold the same host processor address: that of the main loop of the translator software.
When the host translation has been completed, stored in the translation buffer, and executed for the first time, the instruction pointer in the target instruction pointer register is updated (as are the rest of the target registers), and the operation branches back to the main loop. In the main loop, the translator software looks up the instruction pointer to the next target instruction in the target instruction pointer register. The next sequence of target instructions is then addressed. Assuming that this sequence of target instructions has not yet been translated, so that no translation resides in the translation buffer, the next set of target instructions is fetched from memory, decoded, translated, optimized, reordered, rescheduled, cached in the translation buffer, and executed. Because the second set of target instructions follows the first set, the primitive branch instruction at the end of the host translation of the first set is automatically updated so that the address of the host translation of the second set replaces the main loop address as the branch address for the particular condition controlling the branch.
If the second host translation then loops back to the first host translation, the branch operation at the end of the second translation contains the main loop address and the X86 address of the first translation as the two possible targets of the branch. The update-instruction-pointer primitive operation that precedes the branch tests the condition, determines that the loop back to the first translation is to be taken, and updates the target instruction pointer to the X86 address of the first translation. This causes the translator to look in the translation buffer to determine whether the X86 address being sought appears there. The address of the first translation is found, and its value in host memory space is substituted for the X86 address in the branch at the end of the second host translation. The second host translation is then cached and executed. This causes the loop to run until the condition that causes the branch from the first translation to the second translation fails and the branch takes the path back to the main loop. When that happens, the first host translation branches back to the main loop, where the next set of target instructions designated by the target instruction pointer is searched for in the translation buffer; the host translation is either fetched from the cache or, if it is not found in the translation buffer, the target instructions are fetched from memory and translated. When this translated host instruction is cached in the translation buffer, its address replaces the main loop address in the branch instruction that ended the loop.
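The chaining of translations can be sketched in Python under a deliberate simplification: each translation has a single successor rather than the two branch targets described above, and the sentinel `MAIN_LOOP` stands in for the host address of the translator main loop. The names `Translation`, `Dispatcher`, and `program` are hypothetical.

```python
# Illustrative sketch only; names are hypothetical and each block has one exit.
from typing import Dict, Optional

MAIN_LOOP = None  # sentinel: "branch back to the translator main loop"


class Translation:
    def __init__(self, target_addr: int, next_target_addr: int) -> None:
        self.target_addr = target_addr
        self.next_target_addr = next_target_addr  # target IP set on exit
        self.successor: Optional["Translation"] = MAIN_LOOP  # patched later


class Dispatcher:
    def __init__(self, program: Dict[int, int]) -> None:
        self.program = program  # toy control flow: block address -> next block
        self.buffer: Dict[int, Translation] = {}
        self.main_loop_entries = 0

    def main_loop(self, target_ip: int, previous: Optional[Translation]) -> Translation:
        self.main_loop_entries += 1
        translation = self.buffer.get(target_ip)
        if translation is None:
            translation = Translation(target_ip, self.program[target_ip])
            self.buffer[target_ip] = translation
        if previous is not None:
            # Chain: replace the main-loop address in the previous branch.
            previous.successor = translation
        return translation

    def run(self, start: int, steps: int) -> Translation:
        current = self.main_loop(start, None)
        for _ in range(steps):
            nxt = current.successor
            if nxt is MAIN_LOOP:
                nxt = self.main_loop(current.next_target_addr, current)
            current = nxt
        return current


# Two blocks that branch to each other, forming a loop.
dispatcher = Dispatcher({0x100: 0x200, 0x200: 0x100})
last = dispatcher.run(0x100, 10)
print(dispatcher.main_loop_entries)
```

In this model the main loop is entered once at start-up and once for each edge that has not yet been patched; after that, every transfer follows a chained link directly.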
In this way, the various translated host instructions are chained to one another, so the long path through the translator main loop needs to be followed only where such a link does not exist. Eventually, the main loop references in the branch instructions of the host instructions are almost completely eliminated. When this condition is reached, the time required to fetch target instructions, decode target instructions, fetch the primitive instructions that make up the target instructions, optimize those primitive operations, reorder them, and reschedule them before running any host instruction is eliminated. Thus, in contrast to all prior art microprocessors, which must carry out each of these steps every time an application instruction sequence is executed, the work required to run any set of target instructions on the improved microprocessor after the first translation has taken place is drastically reduced. The work is reduced further as each set of translated host instructions is linked to the other sets of translated host instructions. In fact, it is estimated that translation will be needed in fewer than one in one million executions of translated instructions while an application is running.
Those skilled in the art will recognize that the microprocessor needs a large translation buffer, since each set of instructions that is translated is cached so that it need not be translated again. Translators designed to function with applications programmed for different systems will vary in their need for supporting buffer memory. One embodiment of the microprocessor designed to run X86 programs, however, uses 2Mb of random access memory as the translation buffer.
Two additional hardware enhancements help to increase the speed at which the microprocessor of the present invention can process applications. The first is an abnormal/normal (A/N) protection bit stored with each address translation in the translation lookaside buffer (TLB) (see Figure 3), where the physical address of a target instruction is first looked up. Target memory operations within a translation are of two types: those that operate on memory (normal) and those that operate on a memory-mapped I/O device (abnormal).
Normal accesses, which affect memory, complete in the normal way. When instructions operate on memory, optimizing and reordering them is appropriate and greatly speeds up any system using the microprocessor of the present invention. Abnormal accesses, which affect I/O devices, must instead be carried out in the precise order in which they were programmed, with no step eliminated; otherwise they may have an adverse effect on the I/O device. For example, a particular I/O operation may have the effect of clearing an I/O register; if the primitive operations take place out of order, the result may differ from the operation the target instruction commands. Without a means of distinguishing memory from memory-mapped I/O, all memory operations would have to be translated under the conservative assumption that they affect memory-mapped I/O, which severely restricts the performance that optimization can achieve. Prior art emulators are limited in performance because they have neither a means of detecting that a speculation about the nature of the addressed memory has failed nor a means of recovering from such a failure.
In one embodiment of the microprocessor, illustrated in FIG. 11, the A/N bit is a bit that can be set in the translation lookaside buffer to indicate either a memory page or memory-mapped I/O. The translation lookaside buffer stores page table entries for memory accesses. Each entry includes the virtual address being accessed and the physical address at which the data can be found, together with other information about the entry. In the present invention, the A/N bit is part of that other information and indicates whether the physical address is a memory address or a memory-mapped I/O address. Translating an operation as though it affects memory is, in effect, a speculation that it does affect memory.
In one embodiment, when the code morphing software first translates an instruction that requires an access to either memory or a memory-mapped I/O device, it assumes that the access is a memory access. In another embodiment, the software might assume that the target instruction requires an I/O access. Assuming the address has not been accessed before, the translation lookaside buffer holds no entry for it, and the access misses in the translation lookaside buffer. The miss causes the software to perform a page table lookup and to fill a storage location in the translation lookaside buffer with the page table entry, which provides the correct physical address translation for the virtual address. In doing so, the software enters the A/N bit for the physical address into the translation lookaside buffer. The access is then attempted again, still on the assumption that it is to a memory address. When the access is attempted, the target memory reference is checked by comparing the presumed access type (normal or abnormal) with the A/N protection bit in the TLB page table entry. When the access type does not match the A/N protection, an exception occurs. If the operation does in fact affect memory, the optimizing, reordering, and rescheduling techniques described above were correctly applied during translation. If the comparison with the A/N bit in the TLB shows that the operation affects an I/O device, execution causes an exception, and the translator produces a new translation one target instruction at a time, without optimizing, reordering, or rescheduling. Similarly, if a translation incorrectly assumes that an operation affecting memory is an I/O operation, execution causes an exception, and the target instructions are retranslated using the optimizing, reordering, and rescheduling techniques. In this manner the processor can achieve performance beyond what has traditionally been possible.
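The A/N check described above can be sketched in C. This is a simplified model under stated assumptions: the `tlb_entry` layout, the two translation modes, and the function names `an_exception` and `mode_after_access` are hypothetical and chosen only for illustration. A translation mode stands for the speculation it was built on (optimized means "memory", in-order means "I/O").

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { ACCESS_NORMAL, ACCESS_ABNORMAL } access_kind;      /* memory vs. memory-mapped I/O */
typedef enum { MODE_OPTIMIZED, MODE_IN_ORDER } translation_mode;  /* hypothetical names */

typedef struct {
    uint32_t virtual_page;
    uint32_t physical_page;
    bool abnormal;               /* A/N bit: true for a memory-mapped I/O page */
} tlb_entry;

/* True when the access kind assumed at translation time disagrees with the A/N bit. */
bool an_exception(const tlb_entry *entry, access_kind assumed) {
    access_kind actual = entry->abnormal ? ACCESS_ABNORMAL : ACCESS_NORMAL;
    return assumed != actual;
}

/* Returns the mode the translation should have after the access is attempted:
   unchanged when the speculation held, otherwise the mode matching the A/N bit. */
translation_mode mode_after_access(const tlb_entry *entry, translation_mode current) {
    access_kind assumed = (current == MODE_OPTIMIZED) ? ACCESS_NORMAL : ACCESS_ABNORMAL;
    if (!an_exception(entry, assumed))
        return current;                                  /* speculation was correct */
    return entry->abnormal ? MODE_IN_ORDER : MODE_OPTIMIZED;  /* retranslate */
}
```

An optimized translation that touches an I/O page is replaced by an in-order one, and an in-order translation that turns out to touch ordinary memory is replaced by an optimized one, mirroring the two retranslation cases in the text.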
Those skilled in the art will recognize that the technique of using the A/N bit to determine whether a speculation about a memory access versus a memory-mapped I/O access has failed can also be used for speculation about other properties of memory-mapped addresses. For example, such a normal/abnormal bit can be used to distinguish different types of memory. Those skilled in the art will find other similar uses for distinguishing the properties of memory.
The speculation the improved microprocessor makes most often concerns whether target exceptions will occur within a translation. This permits significant optimization compared with the prior art. First, target state does not have to be updated at every target instruction boundary, but only at the target instruction boundaries that fall on translation boundaries. This eliminates the instructions needed to save target state at every target instruction boundary. Scheduling and redundant-operation-removal optimizations that were previously impossible also become feasible.
The improved microprocessor is well suited to selecting the appropriate translation process. Under the translation method described above, a set of instructions may first be translated as though it affects memory. When the optimized, reordered, and rescheduled host instructions are then executed, the state of the A/N bit held in the translation lookaside buffer may reveal that the address refers to an I/O device. Comparing the A/N bit with the translated instruction's address, which indicates an I/O operation, generates an error exception. The exception starts a software-initiated rollback procedure that discards the uncommitted memory stores and places the values in the official target registers back into the working registers. The translation is then redone one target instruction at a time, without optimizing, reordering, or rescheduling. This retranslation is the appropriate host translation for an I/O device.
Similarly, a memory operation may be incorrectly translated as an I/O operation. The error that results can be used to trigger a correct retranslation in which the instructions are optimized, reordered, and rescheduled to provide faster operation.
Prior art emulators have also struggled with the problem generally known as self-modifying code. If a target program writes to memory that contains target instructions, the translations that already exist for those target instructions become "stale" and are no longer valid. Such stores must be detected as they occur dynamically. In the prior art, this detection has to be accomplished with extra instructions for each store. The problem is broader than programs modifying themselves: any agent that can write to memory, such as a second processor or a DMA device, can cause it as well.
The present invention addresses this problem with a further enhancement to the morph host. A translation bit (T bit), which may also be stored in the translation lookaside buffer, indicates the target memory pages for which translations exist. The T bit thus indicates that a particular page of target memory contains target instructions that have been translated, and whose translations would become stale if those target instructions were overwritten. If an attempt is made to write to a protected page in memory, the presence of the translation bit causes an exception; when the code morphing software handles that exception, the affected translations are invalidated or removed from the translation buffer. The T bit can also be used to mark other target pages on which a translation relies not being written.
This may be understood by reference to FIG. 3, a block diagram of the general functional units of the microprocessor of the present invention. When the morph host executes a target program, it actually runs the translator portion of the code morphing software, which consists only of original untranslated host instructions that run effectively on the morph host. At the right of the figure is memory, divided into a host portion, which contains the translator and the translation buffer, and a target portion, which contains the target instructions and data, including the target operating system. The morph host hardware begins executing the translator by fetching host instructions from memory and placing them in the instruction cache. The translator instructions generate a fetch of the first target instruction stored in the target portion of memory. The target fetch causes the integer unit to look in the official target instruction pointer register for the starting address of the target instruction. That address is then looked up in the translation lookaside buffer of the memory management unit. The memory management unit includes hardware for page table lookup and provides memory mapping facilities for the TLB. Assuming the TLB is correctly mapped so that it holds lookup data for the correct page of target memory, the target instruction pointer value is translated to the physical address of the target instruction. At this point the state of the bit (T bit) indicating whether a translation exists for the target instruction is checked; the access is a read, however, so no T bit exception occurs. The state of the A/N bit, which indicates whether the access is to memory or to memory-mapped I/O, is also checked. Assuming the A/N bit indicates a memory location, the target instruction is accessed in target memory, since no translation exists.
The target instruction and subsequent target instructions are transferred as data to the computational units of the morph host and translated under control of the translator instructions held in the instruction cache. The translator instructions apply reordering, optimizing, and rescheduling techniques, treating the target instructions as affecting memory. The resulting translation, a sequence of host instructions, is then stored in the translation buffer in host memory, to which it is transferred directly through the gated store buffer. Once the translation has been stored in host memory, the translator branches to the translation, which then executes. That execution (and subsequent executions) determines whether the translation made correct assumptions about exceptions and memory. Before the translation is executed, the T bit is set for the target page or pages containing the translated target instructions. This indication warns that the instructions have been translated; if a write to the target address is then attempted, the attempt causes an exception, which may result in the translation being invalidated or removed.
If a write is attempted to a target page marked with a T bit, an exception is generated and the write is aborted. The write is allowed to continue only after the response to the exception has ensured that the translations associated with the target memory address being written are either marked invalid or otherwise protected until they can be appropriately updated. Some writes require nothing to be done, because no valid translation is affected. Other writes require one or more translations associated with the addressed target memory to be appropriately marked or removed. FIG. 11 illustrates one embodiment of a translation lookaside buffer in which each entry includes a storage position that holds a T bit indication.
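The T bit handling described above can be sketched in C. This is an illustrative model only: the 4 KB page size, the tiny translation table, and the names `invalidate_page` and `store_to_page` are assumptions, and the sketch conservatively invalidates every translation on the written page and then clears the T bit, which is one possible policy rather than the one the patent prescribes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12            /* assumes 4 KB pages */
#define MAX_TRANSLATIONS 4       /* hypothetical, tiny translation buffer */

typedef struct { uint32_t page; bool t_bit; } tlb_entry;
typedef struct { uint32_t target_addr; bool valid; } translation;

static translation translations[MAX_TRANSLATIONS];

/* Exception handler: invalidate every valid translation whose target code
   lies on the written page, and report how many were affected. */
static unsigned invalidate_page(uint32_t page) {
    unsigned count = 0;
    for (size_t i = 0; i < MAX_TRANSLATIONS; i++) {
        if (translations[i].valid &&
            (translations[i].target_addr >> PAGE_SHIFT) == page) {
            translations[i].valid = false;
            count++;
        }
    }
    return count;
}

/* Attempt a store to addr through the given TLB entry. Returns the number of
   translations invalidated before the store (not modeled here) may proceed. */
unsigned store_to_page(tlb_entry *entry, uint32_t addr) {
    if (!entry->t_bit)
        return 0;                          /* unprotected page: no exception */
    unsigned invalidated = invalidate_page(addr >> PAGE_SHIFT);
    entry->t_bit = false;                  /* no valid translation remains on the page */
    return invalidated;
}
```

The second store to the same page finds the T bit clear and proceeds without an exception, which corresponds to the case in which a write requires nothing to be done.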
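A further hardware enhancement to the morph host is a circuit that allows data which is normally stored in memory, but which is used often in executing an operation, to be replicated (or "aliased") in an execution unit register, saving the time needed to fetch the data from memory or store it to memory. To accomplish this in one embodiment, the morph host is designed to respond to a "load and protect" command by copying the memory data into a working register 111 of the execution unit 110 shown in FIG. 10 and placing the memory address in a register 112 of that unit. A comparator 113 is associated with the address register. The comparator receives the addresses of loads, and of stores directed to memory through the gated store buffer, during translations. If the memory address of a load or a store matches the address in register 112 (or in additional registers, depending on the implementation), an exception is generated. The code morphing software responds to the exception by ensuring that the memory address and the register hold the same correct data. In one embodiment, this is done by rolling back the translation and re-executing it without any "aliased" data in the execution unit register. Other possible ways of correcting the problem are to update the execution unit register with the latest memory data or to update memory with the latest load data.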
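The "load and protect" mechanism described above can be sketched in C. The model is deliberately small and rests on assumptions: memory is an array of words indexed directly, only one protected address is tracked, and the names `alias_unit`, `load_and_protect`, `alias_exception`, and `refresh_from_memory` are hypothetical. The recovery shown is the register-refresh alternative mentioned in the text, not the rollback embodiment.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t value;        /* working register 111: the aliased copy */
    uint32_t address;      /* register 112: the protected memory address */
    bool active;           /* set once load_and_protect has run */
} alias_unit;

/* "Load and protect": copy memory[address] into the register and remember the address. */
void load_and_protect(alias_unit *unit, const uint32_t *memory, uint32_t address) {
    unit->value = memory[address];
    unit->address = address;
    unit->active = true;
}

/* Comparator 113: does a later load or store touch the protected address? */
bool alias_exception(const alias_unit *unit, uint32_t access_address) {
    return unit->active && unit->address == access_address;
}

/* One recovery option named in the text: refresh the register from memory. */
void refresh_from_memory(alias_unit *unit, const uint32_t *memory) {
    unit->value = memory[unit->address];
}
```

A handler built on this sketch would call `alias_exception` for each load or store address and, on a match, bring the register and memory back into agreement before execution continues.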
Those skilled in the art will recognize that the microprocessor may be connected in circuit with typical computer elements to form a computer such as that illustrated in FIG. 9. As may be seen, when used in a modern X86 computer the microprocessor is joined by a processor bus to memory and bus control circuitry. The memory and bus control circuitry provides access to main memory as well as to the cache memory used with the microprocessor. It also provides access to a bus, such as a PCI or other local bus, through which I/O devices are accessed. The particular computer system depends upon the circuitry used with the typical microprocessor that the microprocessor of the present invention replaces.
To illustrate the operation of the processor and the manner in which execution is accelerated, the translation of a small sample of X86 target code into host primitive instructions is presented here. The sample follows the translation of X86 target instructions into morph host instructions, including the optimizing, reordering, and rescheduling steps performed by the microprocessor of the present invention. By following the process described below, those skilled in the art will understand the difference between the operations required to execute the original instructions on a target processor and the operations required to execute the translation on the host processor.
The original instruction, written in C language source code, describes a very simple loop. While a variable "n", which is decremented on each pass through the loop, is greater than "0", a value "c" is stored at the address indicated by the pointer "*s", which is incremented after each pass.
Original C code
while ((n--) > 0) {
    *s++ = c;
}
====================
Win32 x86 instructions produced by a compiler compiling this C code.
mov %ecx,[%ebp+0xc]      // load c from memory address into %ecx
mov %eax,[%ebp+0x8]      // load s from memory address into %eax
mov [%eax],%ecx          // store c into memory address s held in %eax
add %eax,#4              // increment s by 4
mov [%ebp+0x8],%eax      // store (s+4) back into memory
mov %eax,[%ebp+0x10]     // load n from memory address into %eax
lea %ecx,[%eax-1]        // decrement n and store the result in %ecx
mov [%ebp+0x10],%ecx     // store (n-1) into memory
and %eax,%eax            // test n to set the condition codes
jg .-0x1b                // branch to the top of this section if "n>0"
In the first portion of the sample, each X86 assembly language instruction used to carry out the operation defined by the C language statement is listed by its assembly language mnemonic, followed by the operands of that particular primitive operation. A comment on each instruction explains the operation. Although the target processor may change the order of execution from that shown, every one of these assembly language instructions must be executed each time the loop of the target C language instruction is executed.
Thus, if the loop is executed 100 times, each instruction shown must also be executed 100 times.
Shows each X86 instruction shown above followed by the host instructions necessary to implement the X86 instruction.
mov %ecx,[%ebp+0xc]      // load c from memory address into %ecx
  add R0,Rebp,0xc        ; form the memory address and put it in R0
  ld Recx,[R0]           ; load c from memory address in R0 into Recx
mov %eax,[%ebp+0x8]      // load s from memory address into %eax
  add R2,Rebp,0x8        ; form the memory address and put it in R2
  ld Reax,[R2]           ; load s from memory address in R2 into Reax
mov [%eax],%ecx          // store c into memory address s held in %eax
  st [Reax],Recx         ; store c into memory address s held in Reax
add %eax,#4              // increment s by 4
  add Reax,Reax,4        ; increment s by 4
mov [%ebp+0x8],%eax      // store (s+4) back into memory
  add R5,Rebp,0x8        ; form the memory address and put it in R5
  st [R5],Reax           ; store (s+4) back into memory
mov %eax,[%ebp+0x10]     // load n from memory address into %eax
  add R7,Rebp,0x10       ; form the memory address and put it in R7
  ld Reax,[R7]           ; load n from memory address into Reax
lea %ecx,[%eax-1]        // decrement n and store the result in %ecx
  sub Recx,Reax,1        ; decrement n and store the result in Recx
mov [%ebp+0x10],%ecx     // store (n-1) into memory
  add R9,Rebp,0x10       ; form the memory address and put it in R9
  st [R9],Recx           ; store (n-1) into memory
and %eax,%eax            // test n to set the condition codes
  andcc R11,Reax,Reax    ; test n to set the condition codes
jg .-0x1b                // branch to the top of this section if "n>0"
  jg mainloop,mainloop   ; jump to the main loop
Host Instruction key:
ld=load  add=ADD  st=store
sub=subtract  jg=jump if condition codes indicate greater
andcc=and set the condition codes
The next sample, listed above, shows the same target primitive instructions that carry out the C language instruction. After each primitive target instruction, however, it lists the primitive host instructions required to accomplish the same operation in one particular embodiment of the microprocessor, in which the morph host is a VLIW processor designed in the manner described herein. Note that the host registers shadowed by official target registers are designated by an "R" followed by the X86 register name, so that, for example, Reax is the working register associated with the EAX official target register.
Adds host instructions necessary to perform X86 address computation and upper and lower segment limit checks.
mov %ecx,[%ebp+0xc]      // load c
  add R0,Rebp,0xc        ; form logical address into R0
  chkl R0,Rss_limit      ; Check the logical address against segment lower limit
  chku R0,R_FFFFFFFF     ; Check the logical address against segment upper limit
  add R1,R0,Rss_base     ; add the segment base to form the linear address
  ld Recx,[R1]           ; load c from memory address in R1 into Recx
mov %eax,[%ebp+0x8]      // load s
  add R2,Rebp,0x8        ; form logical address into R2
  chkl R2,Rss_limit      ; Check the logical address against segment lower limit
  chku R2,R_FFFFFFFF     ; Check the logical address against segment upper limit
  add R3,R2,Rss_base     ; add the segment base to form the linear address
  ld Reax,[R3]           ; load s from memory address in R3 into Reax
mov [%eax],%ecx          // store c into [s]
  chku Reax,Rds_limit    ; Check the logical address against segment upper limit
  add R4,Reax,Rds_base   ; add the segment base to form the linear address
  st [R4],Recx           ; store c into memory address s
add %eax,#4              // increment s by 4
  addcc Reax,Reax,4      ; increment s by 4
mov [%ebp+0x8],%eax      // store (s+4) to memory
  add R5,Rebp,0x8        ; form logical address into R5
  chkl R5,Rss_limit      ; Check the logical address against segment lower limit
  chku R5,R_FFFFFFFF     ; Check the logical address against segment upper limit
  add R6,R5,Rss_base     ; add the segment base to form the linear address
  st [R6],Reax           ; store (s+4) to memory address in R6
mov %eax,[%ebp+0x10]     // load n
  add R7,Rebp,0x10       ; form logical address into R7
  chkl R7,Rss_limit      ; Check the logical address against segment lower limit
  chku R7,R_FFFFFFFF     ; Check the logical address against segment upper limit
  add R8,R7,Rss_base     ; add the segment base to form the linear address
  ld Reax,[R8]           ; load n from memory address in R8 into Reax
lea %ecx,[%eax-1]        // decrement n
  sub Recx,Reax,1        ; decrement n
mov [%ebp+0x10],%ecx     // store (n-1)
  add R9,Rebp,0x10       ; form logical address into R9
  chkl R9,Rss_limit      ; Check the logical address against segment lower limit
  chku R9,R_FFFFFFFF     ; Check the logical address against segment upper limit
  add R10,R9,Rss_base    ; add the segment base to form the linear address
  st [R10],Recx          ; store n-1 in Recx into memory using address in R10
and %eax,%eax            // test n to set the condition codes
  andcc R11,Reax,Reax    ; test n to set the condition codes
jg .-0x1b                // branch to the top of this section if "n>0"
  jg mainloop,mainloop   ; jump to the main loop
Host Instruction key:
chkl=check lower limit
chku=check upper limit
The next sample, listed above, shows for each primitive target instruction the addition of the host primitive instructions with which the code morphing software generates the addresses required for the target operation. Note that host address generation instructions are required only in an embodiment of the microprocessor in which code morphing software, rather than address generation hardware, generates addresses. In a target processor such as an X86 microprocessor, addresses are generated by address generation hardware. In this embodiment, the calculation is carried out whenever an address is generated, and host primitive instructions are also added to check the address values and determine whether the calculated addresses lie within the appropriate X86 segment limits.
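The address computation above, which the listing expresses as an add, the chkl and chku checks, and a second add, can be sketched in C. This is a simplified illustration under stated assumptions: the `segment` structure with separate lower and upper limits, the status codes, and the function name `linear_address` are hypothetical, and a failed check is reported as a return value rather than raised as an exception.

```c
#include <stdint.h>

typedef struct {
    uint32_t base;    /* segment base, e.g. the value held in Rss_base */
    uint32_t lower;   /* lowest valid logical address (checked by chkl) */
    uint32_t upper;   /* highest valid logical address (checked by chku) */
} segment;

typedef enum { ADDR_OK, ADDR_BELOW_LIMIT, ADDR_ABOVE_LIMIT } addr_status;

/* Forms the logical address reg + disp, checks it against the segment limits,
   and on success writes the linear address (logical + base) to *linear. */
addr_status linear_address(const segment *seg, uint32_t reg, uint32_t disp,
                           uint32_t *linear) {
    uint32_t logical = reg + disp;          /* add R0,Rebp,0xc */
    if (logical < seg->lower)
        return ADDR_BELOW_LIMIT;            /* chkl would fault here */
    if (logical > seg->upper)
        return ADDR_ABOVE_LIMIT;            /* chku would fault here */
    *linear = logical + seg->base;          /* add R1,R0,Rss_base */
    return ADDR_OK;
}
```

For a stack segment with base 0x1000 and lower limit 0x100, a register value of 0x200 with displacement 0xc passes both checks and yields the linear address 0x120c in this model.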
Adds instructions to maintain the target X86 instruction pointer "eip" and the commit instructions that use the special morph host hardware to update X86 state.
mov %ecx,[%ebp+0xc]      // load c
  add R0,Rebp,0xc
  chkl R0,Rss_limit
  chku R0,R_FFFFFFFF
  add R1,R0,Rss_base
  ld Recx,[R1]
  add Reip,Reip,3        ; add X86 instruction length to eip in Reip
  commit                 ; commits working state to official state
mov %eax,[%ebp+0x8]      // load s
  add R2,Rebp,0x8
  chkl R2,Rss_limit
  chku R2,R_FFFFFFFF
  add R3,R2,Rss_base
  ld Reax,[R3]
  add Reip,Reip,3        ; add X86 instruction length to eip in Reip
  commit                 ; commits working state to official state
mov [%eax],%ecx          // store c into [s]
  chku Reax,Rds_limit
  add R4,Reax,Rds_base
  st [R4],Recx
  add Reip,Reip,2        ; add X86 instruction length to eip in Reip
  commit                 ; commits working state to official state
add %eax,#4              // increment s by 4
  addcc Reax,Reax,4
  add Reip,Reip,5        ; add X86 instruction length to eip in Reip
  commit                 ; commits working state to official state
mov [%ebp+0x8],%eax      // store (s+4)
  add R5,Rebp,0x8
  chkl R5,Rss_limit
  chku R5,R_FFFFFFFF
  add R6,R5,Rss_base
  st [R6],Reax
  add Reip,Reip,3        ; add X86 instruction length to eip in Reip
  commit                 ; commits working state to official state
mov %eax,[%ebp+0x10]     // load n
  add R7,Rebp,0x10
  chkl R7,Rss_limit
  chku R7,R_FFFFFFFF
  add R8,R7,Rss_base
  ld Reax,[R8]
  add Reip,Reip,3        ; add X86 instruction length to eip in Reip
  commit                 ; commits working state to official state
lea %ecx,[%eax-1]        // decrement n
  sub Recx,Reax,1
  add Reip,Reip,3        ; add X86 instruction length to eip in Reip
  commit                 ; commits working state to official state
mov [%ebp+0x10],%ecx     // store (n-1)
  add R9,Rebp,0x10
  chkl R9,Rss_limit
  chku R9,R_FFFFFFFF
  add R10,R9,Rss_base
  st [R10],Recx
  add Reip,Reip,3        ; add X86 instruction length to eip in Reip
  commit                 ; commits working state to official state
and %eax,%eax            // test n
  andcc R11,Reax,Reax
  add Reip,Reip,3
  commit                 ; commits working state to official state
jg .-0x1b                // branch "n>0"
  add Rseq,Reip,Length(jg)
  ldc Rtarg,EIP(target)
  selcc Reip,Rseq,Rtarg
  commit                 ; commits working state to official state
  jg mainloop,mainloop
Host Instruction key:
commit=copy the contents of the working registers to the official target registers and send working stores to memory
This sample illustrates the addition of two steps to each set of primitive host instructions: after the host instructions needed to carry out each primitive target instruction have executed, the official target registers are updated and the uncommitted values in the gated store buffer are committed to memory. As may be seen, in each case the length of the target instruction is added to the value in the working instruction pointer register (Reip). A commit instruction is then executed. In one embodiment, the commit instruction copies the current value of each working register that is shadowed into its associated official target register, and moves the pointer designating the position of the gate of the gated store buffer from immediately in front of the uncommitted stores to immediately behind those stores so that they are placed in memory.
It will be appreciated that the list of instructions illustrated last above contains all of the instructions necessary to form a host translation of the original target assembly language instructions. If the translation were to stop at this point, the number of primitive host instructions would be much larger than the number of target instructions (about six times as many), and execution could take longer than on a target processor. At this point, however, no reordering, optimizing, or rescheduling has yet taken place.
If an instruction is to be run only once, the time required to accomplish further reordering and other optimization may be greater than the time needed to execute the translation as it exists at this point. If so, one embodiment of the microprocessor ceases the translation at this point, stores the translation, and then executes it to determine whether exceptions or errors occur. In this embodiment, reordering and the other optimization steps take place only if it is determined that the particular translation will be run a number of times or otherwise should be optimized. This may be accomplished, for example, by placing in each translation host instructions which count executions of the translation and generate an exception (or branch) when a certain count is reached. The exception (or branch) transfers operation to the code morphing software, which then implements some or all of the following optimizations and any additional optimizations found useful for that translation. A second method of determining which translations are run a number of times and require optimization is to interrupt the execution of translations at some frequency, or on some statistical basis, and optimize whichever translation is running at that time. This ultimately ensures that the instructions run most often are optimized. Another solution is to optimize each host instruction of certain particular types, such as those which create loops or are otherwise likely to be run most often.
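The first triggering method described above (a counter embedded in each translation that traps to the code morphing software) can be pictured with a minimal Python sketch. The class, the function names, and the threshold value are all hypothetical and chosen only for illustration; the text says no more than that "a certain count" is reached.

```python
# Hypothetical sketch: count executions of a stored translation and hand it
# to an optimizer once it proves to be run repeatedly.
HOT_THRESHOLD = 3  # assumed value; the source only says "a certain count"


class Translation:
    def __init__(self, name):
        self.name = name
        self.executions = 0
        self.optimized = False

    def run(self, optimize):
        """Execute once; trap to the optimizer when the count is reached."""
        self.executions += 1
        if not self.optimized and self.executions >= HOT_THRESHOLD:
            optimize(self)  # stands in for the exception (or branch)


def optimize(translation):
    # Placeholder for reordering, renaming, scheduling and so on.
    translation.optimized = True


loop_body = Translation("copy_loop")
history = []
for _ in range(5):
    loop_body.run(optimize)
    history.append(loop_body.optimized)
print(history)
```

In this sketch the translation runs unoptimized until the third execution, so translations that execute only once or twice never pay the optimization cost.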
Optimization
============================================================
Assumes 32 bit flat address space which allows the elimination of segment base additions and
some limit checks.
Win32 uses Flat 32b segmentation
Record Assumptions:
    Rss_base==0
    Rss_limit==0
    Rds_base==0
    Rds_limit==FFFFFFFF
    SS and DS protection check
mov %ecx,[%ebp+0xc]  // load c
    add R0,Rebp,0xc
    chku R0,R_FFFFFFFF
    ld Recx,[R0]
    add Reip,Reip,3
    commit
mov %eax,[%ebp+0x8]  // load s
    add R2,Rebp,0x8
    chku R2,R_FFFFFFFF
    ld Reax,[R2]
    add Reip,Reip,3
    commit
mov [%eax],%ecx  // store c into [s]
    chku Reax,R_FFFFFFFF
    st [Reax],Recx
    add Reip,Reip,2
    commit
add %eax,#4  // increment s by 4
    addcc Reax,Reax,4
    add Reip,Reip,5
    commit
mov [%ebp+0x8],%eax  // store (s+4)
    add R5,Rebp,0x8
    chku R5,R_FFFFFFFF
    st [R5],Reax
    add Reip,Reip,3
    commit
mov %eax,[%ebp+0x10]  // load n
    add R7,Rebp,0x10
    chku R7,R_FFFFFFFF
    ld Reax,[R7]
    add Reip,Reip,3
    commit
lea %ecx,[%eax-1]  // decrement n
    sub Recx,Reax,1
    add Reip,Reip,3
    commit
mov [%ebp+0x10],%ecx  // store (n-1)
    add R9,Rebp,0x10
    chku R9,R_FFFFFFFF
    st [R9],Recx
    add Reip,Reip,3
    commit
and %eax,%eax  // test n
    andcc R11,Reax,Reax
    add Reip,Reip,3
    commit
jg .-0x1b  // branch "n>0"
    add Rseq,Reip,Length(jg)
    ldc Rtarg,EIP(target)
    selcc Reip,Rseq,Rtarg
    commit
    jg mainloop,mainloop
This sample illustrates a first stage of optimization which may be practiced utilizing the improved microprocessor. This stage of optimization, like many other operations of the code morphing software, assumes an optimistic result. The particular optimization assumes that a target application program which has begun as a 32 bit program written for the flat memory model of the X86 family of processors will continue as such a program. It will be noted that this assumption is particular to the X86 family and would not necessarily be made for other families of processors being emulated.
If this assumption holds, then all segments in an X86 application are mapped to the same address space. This allows the primitive host instructions required by X86 segmentation to be reduced. As may be seen, the segment values are first set to zero. Then the base for data is set to zero, and the limit is set to the maximum available memory. Then, in each set of primitive host instructions that executes a target primitive instruction, both the check of the segment base value and the computation of the segment base address required by segmentation are eliminated. This removes two host primitive instructions from the loop for each target primitive instruction requiring an addressing function. At this point, the host instruction that checks the upper memory limit still remains.
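The effect of the flat-model assumption on address arithmetic can be sketched in Python as follows. This is a simplified illustration, not the patent's mechanism: the function names are hypothetical, and the limit check models a generic expand-up segment rather than the exact `chkl`/`chku` semantics of the listings.

```python
# Simplified sketch of the address arithmetic that the flat-model
# assumption removes. All function names are hypothetical.
MASK32 = 0xFFFFFFFF


class LimitFault(Exception):
    """Raised when an offset falls outside the segment limit."""


def segmented_address(rebp, offset, seg_base, seg_limit):
    """Unoptimized form: form the offset, check the limit, add the base."""
    ea = (rebp + offset) & MASK32
    if ea > seg_limit:                  # models a limit check such as chku
        raise LimitFault(hex(ea))
    return (ea + seg_base) & MASK32     # models add R3,R2,Rss_base


def flat_address(rebp, offset):
    """Flat model: the base is zero, so the base addition disappears."""
    return (rebp + offset) & MASK32


seg = segmented_address(0x1000, 0x8, seg_base=0, seg_limit=MASK32)
flat = flat_address(0x1000, 0x8)
print(hex(seg), hex(flat))
```

With a zero base and a limit of FFFFFFFF the two functions agree for every input, which is why the translation may drop the extra host instructions as long as the recorded assumptions remain true.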
It should be noted that this optimization relies on the speculation that the application uses a 32 bit flat memory model. If this is not the case, the error will be discovered when the main loop resolves the destination of a control transfer and detects that the source assumptions do not match the destination assumptions. A new translation will then be necessary. This technique is very general and can be applied to a variety of segmentation and other "moded" cases in which the "mode" changes infrequently, such as debug, system management mode, or "real" mode.
mov %ecx,[%ebp+0xc]  // load c
    add R0,Rebp,0xc
    ld Recx,[R0]
    add Reip,Reip,3
    commit
mov %eax,[%ebp+0x8]  // load s
    add R2,Rebp,0x8
    ld Reax,[R2]
    add Reip,Reip,3
    commit
mov [%eax],%ecx  // store c into [s]
    st [Reax],Recx
    add Reip,Reip,2
    commit
add %eax,#4  // increment s by 4
    addcc Reax,Reax,4
    add Reip,Reip,5
    commit
mov [%ebp+0x8],%eax  // store (s+4)
    add R5,Rebp,0x8
    st [R5],Reax
    add Reip,Reip,3
    commit
mov %eax,[%ebp+0x10]  // load n
    add R7,Rebp,0x10
    ld Reax,[R7]
    add Reip,Reip,3
    commit
lea %ecx,[%eax-1]  // decrement n
    sub Recx,Reax,1
    add Reip,Reip,3
    commit
mov [%ebp+0x10],%ecx  // store (n-1)
    add R9,Rebp,0x10
    st [R9],Recx
    add Reip,Reip,3
    commit
and %eax,%eax  // test n
    andcc R11,Reax,Reax
    add Reip,Reip,3
    commit
jg .-0x1b  // branch "n>0"
    add Rseq,Reip,Length(jg)
    ldc Rtarg,EIP(target)
    selcc Reip,Rseq,Rtarg
    commit
    jg mainloop,mainloop
Host Instruction key:
    selcc = Select one of the source registers and copy its contents to the destination register based on the condition codes.
The above sample illustrates the next stage of optimization, in which a speculative translation eliminates the upper memory boundary check, a check that is needed only for unaligned memory references that cross the top of the memory address space. Failure of this assumption is detected by a hardware or software alignment fix-up. This reduces the translation by another host primitive instruction for each target primitive instruction requiring addressing. This optimization requires both the assumption made earlier, that the application uses a 32 bit flat memory model, and the speculation that the instruction is aligned. If either does not hold, the translation will fail when it is executed, and a new translation will be necessary.
Detect and eliminate redundant address calculations. The example shows the code after
eliminating the redundant operations.
mov %ecx,[%ebp+0xc]  // load c
    add R0,Rebp,0xc
    ld Recx,[R0]
    add Reip,Reip,3
    commit
mov %eax,[%ebp+0x8]  // load s
    add R2,Rebp,0x8
    ld Reax,[R2]
    add Reip,Reip,3
    commit
mov [%eax],%ecx  // store c into [s]
    st [Reax],Recx
    add Reip,Reip,2
    commit
add %eax,#4  // increment s by 4
    addcc Reax,Reax,4
    add Reip,Reip,5
    commit
mov [%ebp+0x8],%eax  // store (s+4)
    st [R2],Reax
    add Reip,Reip,3
    commit
mov %eax,[%ebp+0x10]  // load n
    add R7,Rebp,0x10
    ld Reax,[R7]
    add Reip,Reip,3
    commit
lea %ecx,[%eax-1]  // decrement n
    sub Recx,Reax,1
    add Reip,Reip,3
    commit
mov [%ebp+0x10],%ecx  // store (n-1)
    st [R7],Recx
    add Reip,Reip,3
    commit
and %eax,%eax  // test n
    andcc R11,Reax,Reax
    add Reip,Reip,3
    commit
jg .-0x1b  // branch "n>0"
    add Rseq,Reip,Length(jg)
    ldc Rtarg,EIP(target)
    selcc Reip,Rseq,Rtarg
    commit
    jg mainloop,mainloop
This sample illustrates the next optimization, in which common host expressions are eliminated. More particularly, in translating the second target primitive instruction, the value in working register Rebp (the working register representing the stack base pointer register of an X86 processor) is added to the offset value 0x8 and placed in host working register R2. It will be noted that the same operation took place in translating target primitive instruction five in the previous sample, except that the result of the addition was placed in working register R5. Consequently, the value to be placed in working register R5 already exists in working register R2 when host primitive instruction five is about to execute. The host addition instruction may therefore be eliminated from the translation of target primitive instruction five, and the value in working register R2 copied to working register R5 (in the listing above, the store simply addresses [R2] directly). Similarly, the host instruction adding the value in working register Rebp to the offset value 0x10 may be eliminated from the translation of target primitive instruction eight, because that step was already accomplished in translating target primitive instruction six and the result resides in register R7. It should be noted that this optimization does not depend on speculation and thus is not subject to failure and retranslation.
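The elimination of repeated address additions can be sketched as a small pass over a straight-line block. This is an illustrative Python sketch under stated assumptions, not the code morphing software's algorithm: instructions are modeled as hypothetical `(op, dst, a, b)` tuples, where a store is `("st", None, address_register, value_register)`, and the function name is invented for the example.

```python
# Hypothetical sketch of common-expression elimination for address adds
# in a straight-line block of host primitives.
def eliminate_common_adds(instrs):
    available = {}  # (source register, immediate) -> register holding the sum
    alias = {}      # eliminated register -> register that already holds its value
    out = []
    for op, dst, a, b in instrs:
        a, b = alias.get(a, a), alias.get(b, b)
        reuse = available.get((a, b)) if op == "add" else None
        if dst is not None:
            # dst gets a new value: registers aliased to it need a real copy first
            for name in [k for k, v in alias.items() if v == dst]:
                out.append(("mov", name, dst, None))
                del alias[name]
            alias.pop(dst, None)
            available = {k: v for k, v in available.items()
                         if v != dst and dst not in k}
        if reuse is not None and reuse != dst:
            alias[dst] = reuse          # drop the add, reuse the earlier result
            continue
        if op == "add" and dst not in (a, b):
            available[(a, b)] = dst
        out.append((op, dst, a, b))
    return out


block = [
    ("add", "R2", "Rebp", 0x8),
    ("ld", "Reax", "R2", None),
    ("st", None, "Reax", "Recx"),
    ("add", "Reax", "Reax", 4),
    ("add", "R5", "Rebp", 0x8),   # same sum as the first instruction
    ("st", None, "R5", "Reax"),
]
optimized = eliminate_common_adds(block)
for instr in optimized:
    print(instr)
```

The second `add` disappears and the final store addresses `R2`, matching the `st [R2],Reax` form in the listing. No speculation is involved, so nothing here can force a retranslation.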
Assume that target exceptions will not occur within the translation so delay updating eip and
target state.
mov %ecx,[%ebp+0xc]  // load c
    add R0,Rebp,0xc
    ld Recx,[R0]
mov %eax,[%ebp+0x8]  // load s
    add R2,Rebp,0x8
    ld Reax,[R2]
mov [%eax],%ecx  // store c into [s]
    st [Reax],Recx
add %eax,#4  // increment s by 4
    add Reax,Reax,4
mov [%ebp+0x8],%eax  // store (s+4)
    st [R2],Reax
mov %eax,[%ebp+0x10]  // load n
    add R7,Rebp,0x10
    ld Reax,[R7]
lea %ecx,[%eax-1]  // decrement n
    sub Recx,Reax,1
mov [%ebp+0x10],%ecx  // store (n-1)
    st [R7],Recx
and %eax,%eax  // test n
    andcc R11,Reax,Reax
jg .-0x1b  // branch "n>0"
The above sample illustrates an optimization which speculates that the primitive target instructions making up the entire translation can be executed without generating an exception. If this is true, there is no need to update the official target registers or to commit the uncommitted stores in the store buffer at the end of each sequence of host primitive instructions that carries out an individual target primitive instruction. If the speculation holds, the official target registers need only be updated, and the stores committed, once at the end of the sequence of target primitive instructions. This allows the two primitive host instructions that accompany each primitive target instruction to be eliminated. They are replaced by a single host primitive instruction which updates the official target registers and commits the uncommitted stores to memory.
As will be understood, this is another speculative operation in which the speculation is highly likely to be correct. If the speculation holds, this step offers a very great advantage over prior art emulation techniques. It allows all of the primitive host instructions which carry out the entire sequence of target primitive instructions to be grouped into one sequence in which the individual host primitives can all be optimized together. This has the advantage of allowing a great number of operations to be run in parallel on a morph host which takes advantage of very long instruction word techniques. It also allows a greater number of other optimizations to be applied, because more choices for optimization exist. If the speculation proves untrue and an exception is taken when the loop is executed, however, the official target registers and memory still hold the official target state that existed at the beginning of the sequence of target primitive instructions, since a commit does not occur until the sequence of host instructions has actually executed. All that is necessary to recover from the exception is to dump the uncommitted stores, roll the official registers back into the working registers, and restart translation of the target primitive instructions at the beginning of the sequence. This retranslation translates one target instruction at a time, and the official state is updated after the host sequence representing each target primitive instruction. The translation is then executed. When the exception occurs during this retranslation, the correct target state is immediately available in the official target registers and memory for handling the exception.
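The recovery path just described (dump the uncommitted stores, roll the official registers back into the working registers) can be modeled with a minimal Python sketch. The class and its method names are hypothetical; the gated store buffer is reduced to a simple list of pending stores.

```python
# Hypothetical model of working/official state with a gated store buffer.
class SpeculativeState:
    def __init__(self, registers):
        self.official = dict(registers)  # committed target registers
        self.working = dict(registers)   # registers the host code operates on
        self.memory = {}                 # committed memory
        self.store_buffer = []           # uncommitted (address, value) stores

    def store(self, address, value):
        self.store_buffer.append((address, value))

    def commit(self):
        """Copy working registers to official ones and drain the stores."""
        self.official = dict(self.working)
        for address, value in self.store_buffer:
            self.memory[address] = value
        self.store_buffer.clear()

    def rollback(self):
        """Discard uncommitted stores and restore the working registers."""
        self.store_buffer.clear()
        self.working = dict(self.official)


state = SpeculativeState({"eax": 0x100, "ecx": 7})

# First attempt: an exception occurs before the single commit is reached.
state.store(state.working["eax"], state.working["ecx"])
state.working["eax"] += 4
state.rollback()
after_rollback = (state.working["eax"], dict(state.memory))

# Second attempt runs to completion and commits once.
state.store(state.working["eax"], state.working["ecx"])
state.working["eax"] += 4
state.commit()
print(after_rollback, state.official, state.memory)
```

Because nothing reaches `official` or `memory` before `commit`, the rollback leaves the target state exactly as it was at the start of the sequence.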
In summary:
    add R0,Rebp,0xc
    ld Recx,[R0]
    add R2,Rebp,0x8
    ld Reax,[R2]
    st [Reax],Recx
    add Reax,Reax,4
    st [R2],Reax
    add R7,Rebp,0x10
    ld Reax,[R7]  // Live out
    sub Recx,Reax,1  // Live out
    st [R7],Recx
    andcc R11,Reax,Reax
    add Rseq,Reip,Length(block)
    ldc Rtarg,EIP(target)
    selcc Reip,Rseq,Rtarg
    commit
    jg mainloop,mainloop
The comment "Live Out" refers to the need to actually maintain Reax and Recx
correctly prior to the commit. Otherwise further optimization might be
possible.
============================================================
The above summary illustrates the sequence of host primitive instructions which remains at this point in the optimization process. While this example shows the target instruction pointer (EIP) being maintained inline, it is also possible to maintain the pointer EIP for branches out of line at translation time, which would remove the EIP updating sequence from this and subsequent steps of the example.
Renaming to reduce register resource dependencies. This will allow subsequent scheduling to be
more effective. From this point on, the original target X86 code is omitted as the relationship
between individual target X86 instructions and host instructions becomes increasingly blurred.
add R0,Rebp,0xc
ld R1,[R0]
add R2,Rebp,0x8
ld R3,[R2]
st [R3],R1
add R4,R3,4
st [R2],R4
add R7,Rebp,0x10
ld Reax,[R7]  // Live out
sub Recx,Reax,1  // Live out
st [R7],Recx
andcc R11,Reax,Reax
add Rseq,Reip,Length(block)
ldc Rtarg,EIP(target)
selcc Reip,Rseq,Rtarg
commit
jg mainloop,mainloop
This sample illustrates the next step of optimization, normally called register renaming, in which operations that use a working register already used by another operation in the sequence of host primitive instructions are changed to use a different, unused working register, eliminating the possibility that two host instructions will require the same hardware. Thus, for example, the second host primitive instruction in the two samples above uses working register Recx, which represents the official target register ECX. The tenth host primitive instruction also uses working register Recx. By changing the operation in the second host primitive instruction so that the value pointed to by the address in R0 is stored in working register R1 rather than in register Recx, the two host instructions no longer use the same register. Similarly, the fourth, fifth, and sixth host primitive instructions all use working register Reax in the earlier sample; by changing the fourth host primitive instruction to use the previously unused working register R3, and the sixth host primitive instruction to use the previously unused working register R4, their dependence on the same hardware is eliminated.
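A renaming pass of this kind can be sketched in Python for a straight-line block. The tuple encoding `(op, dst, sources)`, the function name, and the rule used here (every definition of a register except its last one in the block receives a fresh name, so live-out values stay in their architectural registers) are assumptions made for illustration; the sketch also assumes the caller supplies enough unused register names.

```python
# Hypothetical sketch of register renaming over a straight-line block.
def rename(block, fresh_names):
    fresh = iter(fresh_names)
    last_def = {}                      # register -> index of its final write
    for i, (_, dst, _) in enumerate(block):
        if dst is not None:
            last_def[dst] = i
    current = {}                       # original name -> name holding its value
    out = []
    for i, (op, dst, srcs) in enumerate(block):
        srcs = tuple(current.get(s, s) for s in srcs)   # reads see old mapping
        if dst is not None:
            if last_def[dst] != i:
                current[dst] = next(fresh)   # overwritten later: use a temporary
            else:
                current.pop(dst, None)       # final (live-out) value keeps its name
            dst = current.get(dst, dst)
        out.append((op, dst, srcs))
    return out


block = [
    ("add", "R0", ("Rebp", 0xC)),
    ("ld", "Recx", ("R0",)),
    ("add", "R2", ("Rebp", 0x8)),
    ("ld", "Reax", ("R2",)),
    ("st", None, ("Reax", "Recx")),
    ("add", "Reax", ("Reax", 4)),
    ("st", None, ("R2", "Reax")),
    ("add", "R7", ("Rebp", 0x10)),
    ("ld", "Reax", ("R7",)),
    ("sub", "Recx", ("Reax", 1)),
    ("st", None, ("R7", "Recx")),
]
renamed = rename(block, ["R1", "R3", "R4"])
for instr in renamed:
    print(instr)
```

Run on the summary sequence, the sketch assigns R1, R3, and R4 in the same places as the renamed listing, while the last writes to Reax and Recx keep their names because those values are live out at the commit.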
After the scheduling process which organizes the primitive host operations as multiple
operations that can execute in parallel on the host VLIW hardware. Each line shows the
parallel operations that the VLIW machine executes, and the "&" indicates the parallelism.
add R2,Rebp,0x8       & add R0,Rebp,0xc
nop                   & add R7,Rebp,0x10
ld R3,[R2]            & add Rseq,Reip,Length(block)
ld R1,[R0]            & add R4,R3,4
st [R3],R1            & ldc Rtarg,EIP(target)
ld Reax,[R7]          & nop
st [R2],R4            & sub Recx,Reax,1
st [R7],Recx          & andcc R11,Reax,Reax
selcc Reip,Rseq,Rtarg & jg mainloop,mainloop & commit
Host Instruction key:
    nop = no operation
The above sample illustrates the scheduling of host primitive instructions for execution on the morph host. In this example, the morph host is presumed to be a VLIW processor which, in addition to the hardware enhancements provided for cooperating with the code morphing software, includes two arithmetic and logic units (ALUs) among its other processing units. The first line shows two individual add instructions which have been scheduled to run together on the morph host. As may be seen, these are the third and the eighth primitive host instructions in the preceding sample. The second line includes a NOP instruction (no operation, but proceed to the next instruction) and another add operation. The NOP instruction illustrates that there are not always two instructions that can be run together, even after some scheduling optimization has taken place. In any case, this sample shows that only nine sets of primitive host instructions remain at this point to execute the original ten target instructions.
Resolve host branch targets and chain stored translations
add R2,Rebp,0x8       & add R0,Rebp,0xc
nop                   & add R7,Rebp,0x10
ld R3,[R2]            & add Rseq,Reip,Length(block)
ld R1,[R0]            & add R4,R3,4
st [R3],R1            & ldc Rtarg,EIP(target)
ld Reax,[R7]          & nop
st [R2],R4            & sub Recx,Reax,1
st [R7],Recx          & andcc R11,Reax,Reax
selcc Reip,Rseq,Rtarg & jg Sequential,Target & commit
This sample shows essentially the same set of host primitive instructions, except that the instructions have now been stored in the translation buffer and executed one or more times, because the last jump (jg) instruction now points to a jump address furnished by chaining to another sequence of translated instructions. The chaining process takes the sequence of instructions out of the translator main loop, so that translation of the sequence is complete.
Advanced Optimizations, Backward Code Motion:
This and subsequent examples start with the code prior to scheduling. This
optimization first depends on detecting that the code is a loop. Then
invariant operations can be moved out of the loop body and executed once
before entering the loop body.
Entry:
    add R0,Rebp,0xc
    add R2,Rebp,0x8
    add R7,Rebp,0x10
    add Rseq,Reip,Length(block)
    ldc Rtarg,EIP(target)
Loop:
    ld R1,[R0]
    ld R3,[R2]
    st [R3],R1
    add R4,R3,4
    st [R2],R4
    ld Reax,[R7]
    sub Recx,Reax,1
    st [R7],Recx
    andcc R11,Reax,Reax
    selcc Reip,Rseq,Rtarg
    commit
    jg mainloop,Loop
The above sample illustrates an advanced optimization step which is usually applied only to sequences that are repeated a large number of times. The process first detects translations that form loops, and then reviews the individual primitive host instructions to determine which instructions produce constant results within the loop body. These instructions are removed from the loop and executed only once, placing their values in registers; from that point on, the value stored in the register is reused rather than rerunning the instruction.
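The hoisting decision can be sketched in Python as a single pass over the loop body. This is a simplified illustration with invented names: an instruction is moved to the preheader only if it is a side-effect-free operation (an assumed set of opcodes), it is the sole writer of its destination, and every register it reads is either never written in the loop or was itself hoisted. The EIP bookkeeping instructions are left out of the example body, since recognizing them as invariant needs reasoning beyond this simple test.

```python
# Hypothetical sketch of loop-invariant ("backward") code motion.
PURE_OPS = {"add", "sub", "ldc"}   # assumed set of side-effect-free opcodes


def hoist_invariants(body):
    """Split a loop body into (preheader, loop) instruction lists."""
    writes = {}
    for _, dst, _ in body:
        if dst is not None:
            writes[dst] = writes.get(dst, 0) + 1
    preheader, loop, hoisted = [], [], set()
    for instr in body:
        op, dst, srcs = instr
        regs = [s for s in srcs if isinstance(s, str)]
        invariant_inputs = all(r not in writes or r in hoisted for r in regs)
        if op in PURE_OPS and writes.get(dst) == 1 and invariant_inputs:
            preheader.append(instr)    # computed once before the loop
            hoisted.add(dst)
        else:
            loop.append(instr)         # must run on every iteration
    return preheader, loop


body = [
    ("add", "R0", ("Rebp", 0xC)),
    ("ld", "R1", ("R0",)),
    ("add", "R2", ("Rebp", 0x8)),
    ("ld", "R3", ("R2",)),
    ("st", None, ("R3", "R1")),
    ("add", "R4", ("R3", 4)),
    ("st", None, ("R2", "R4")),
    ("add", "R7", ("Rebp", 0x10)),
    ("ld", "Reax", ("R7",)),
    ("sub", "Recx", ("Reax", 1)),
    ("st", None, ("R7", "Recx")),
    ("andcc", "R11", ("Reax", "Reax")),
]
preheader, loop = hoist_invariants(body)
print(preheader)
print(len(loop))
```

Only the three address computations based on Rebp move out of the loop; `add R4,R3,4` stays because R3 is reloaded on every iteration.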
Schedule the loop body after backward code motion. For example purposes, only the
code in the loop body is shown scheduled.
Entry:
    add R0,Rebp,0xc
    add R2,Rebp,0x8
    add R7,Rebp,0x10
    add Rseq,Reip,Length(block)
    ldc Rtarg,EIP(target)
Loop:
    ld R3,[R2]            & nop
    ld R1,[R0]            & add R4,R3,4
    st [R3],R1            & nop
    ld Reax,[R7]          & nop
    st [R2],R4            & sub Recx,Reax,1
    st [R7],Recx          & andcc R11,Reax,Reax
    selcc Reip,Rseq,Rtarg & jg Sequential,Loop & commit
Host Instruction key:
    ldc = load a 32-bit constant
When these non-repetitive instructions are removed from the loop and the sequence is scheduled for execution, the scheduled instructions appear as in the last sample above. It can be seen that the initial instructions are performed only once, during the first iteration of the loop, and that thereafter only the host primitive instructions remaining in the seven clock intervals shown are executed during the loop. The execution time is thus reduced to seven instruction intervals from the ten instructions necessary to execute the primitive target instructions.
As can be seen, the steps removed from the loop are address generation steps. The improved microprocessor therefore needs to generate each address only once, at the start of the loop. The address generation hardware of an X86 target processor, by contrast, must generate these addresses every time the loop is executed. If the loop executes 100 times, the improved microprocessor generates the addresses once, whereas the target processor generates them 100 times.
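The saving in address generation can be counted with a small Python sketch. The function name, the frame pointer value, and the choice to count the three stack addresses together are assumptions made for illustration only.

```python
def count_address_generations(iterations: int, hoisted: bool) -> int:
    """Count how many stack addresses are computed for a loop run."""
    ebp = 0x1000  # hypothetical frame pointer value
    count = 0

    def gen_addresses():
        nonlocal count
        count += 3  # one computation per address: ebp+0xc, ebp+0x8, ebp+0x10
        return ebp + 0xC, ebp + 0x8, ebp + 0x10

    addrs = None
    if hoisted:
        addrs = gen_addresses()      # computed once, before the loop
    for _ in range(iterations):
        if not hoisted:
            addrs = gen_addresses()  # recomputed on every iteration
    return count


once = count_address_generations(100, hoisted=True)
every_time = count_address_generations(100, hoisted=False)
```

For 100 iterations, each of the three addresses is computed once in the hoisted case and 100 times otherwise.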
After Backward Code Motion:
Target:
    add    R0,Rebp,0xc
    add    R2,Rebp,0x8
    add    R7,Rebp,0x10
    add    Rseq,Reip,Length(block)
    ldc    Rtarg,EIP(target)
Loop:
    ld     R1,[R0]
    ld     R3,[R2]
    st     [R3],R1
    add    R4,R3,4
    st     [R2],R4
    ld     Reax,[R7]          //Live out
    sub    Recx,Reax,1        //Live out
    st     [R7],Recx
    andcc  R11,Reax,Reax
    selcc  Reip,Rseq,Rtarg
    commit
    jg     mainloop,Loop
========================================================================
Register Allocation:
This shows the use of register alias detection hardware of the morph host
that allows variables to be safely moved from memory into registers. The
starting point is the code after "backward code motion". This shows the
optimization that can eliminate loads.
First the loads are performed. The address is protected by the alias
hardware, such that should a store to the address occur, an "alias" exception
is raised. The loads in the loop body are then replaced with copies. After
the main body of the loop, the alias hardware is freed.
Entry:
    add    R0,Rebp,0xc
    add    R2,Rebp,0x8
    add    R7,Rebp,0x10
    add    Rseq,Reip,Length(block)
    ldc    Rtarg,EIP(target)
    ld     Rc,[R0]            ;First do the load of the variable from memory
    prot   [R0],Alias1        ;Then protect the memory location from stores
    ld     Rs,[R2]
    prot   [R2],Alias2
    ld     Rn,[R7]
    prot   [R7],Alias3
Loop:
    copy   R1,Rc
    copy   R3,Rs
    st     [R3],R1
    add    R4,Rs,4
    copy   Rs,R4
    st     [R2],Rs,NoAliasCheck
    copy   Reax,Rn            //Live out
    sub    Recx,Reax,1        //Live out
    copy   Rn,Recx
    st     [R7],Rn,NoAliasCheck
    andcc  R11,Reax,Reax
    selcc  Reip,Rseq,Rtarg
    commit
    jg     Epilog,Loop
Epilog:
    FA     Alias1             ;Free the alias detection hardware
    FA     Alias2             ;Free the alias detection hardware
    FA     Alias3             ;Free the alias detection hardware
    j      Sequential
Host Instruction key:
    protect = protect address from loads    FA = free alias
    copy = copy                             j = jump
This example illustrates a still more advanced optimization performed by the microprocessor of the present invention. Referring to the second example preceding this one, note that the first three add instructions involve address computations on the stack. These addresses do not change during execution of the sequence of host operations. Consequently, the values stored at these addresses can be fetched from memory and loaded into registers, where they are immediately available. As can be seen, this is done in host primitive instructions 6, 8, and 10. In instructions 7, 9, and 11, each memory address is marked as protected by the special host alias hardware and the registers are designated as aliases for those memory addresses, so any attempt to change the data raises an exception. At this point, each load operation that moves data from these stack memory addresses becomes a simple register-to-register copy, which is much faster than a load from memory. Note that once the loop has run until n = 0, the protection must be removed from each memory address so that the alias registers can be used again.
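The protect, check, and free cycle described above can be modeled in software with the following Python sketch. The class names `AliasUnit` and `Memory`, the dictionary-based bookkeeping, and the example address are all assumptions made for illustration; the patent describes hardware, and this is only a behavioral approximation of it.

```python
class AliasException(Exception):
    """Raised when a checked store touches a protected address."""


class AliasUnit:
    """Software stand-in for the alias detection hardware."""

    def __init__(self):
        self.protected = {}  # alias name -> protected address

    def prot(self, address, alias):
        self.protected[alias] = address

    def free(self, alias):
        self.protected.pop(alias, None)

    def check_store(self, address):
        for alias, addr in self.protected.items():
            if addr == address:
                raise AliasException(f"store to {address:#x} hits {alias}")


class Memory:
    def __init__(self, alias_unit):
        self.cells = {}
        self.alias_unit = alias_unit

    def load(self, address):
        return self.cells.get(address, 0)

    def store(self, address, value, no_alias_check=False):
        # Stores issued by the optimized loop itself skip the check,
        # like the NoAliasCheck stores in the listing.
        if not no_alias_check:
            self.alias_unit.check_store(address)
        self.cells[address] = value


alias_unit = AliasUnit()
memory = Memory(alias_unit)
memory.store(0x100C, 7)
rc = memory.load(0x100C)            # ld   Rc,[R0]
alias_unit.prot(0x100C, "Alias1")   # prot [R0],Alias1
try:
    memory.store(0x100C, 9)         # an unexpected store is trapped
    trapped = False
except AliasException:
    trapped = True
memory.store(0x100C, rc + 1, no_alias_check=True)
alias_unit.free("Alias1")           # FA   Alias1
memory.store(0x100C, 0)             # allowed again after the alias is freed
```

While the address is protected, the register copy `rc` can stand in for the memory location, because any store that would make the two disagree raises an exception instead of completing silently.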
Copy Propagation:
After using the alias hardware to turn loads within the loop body into
copies, copy propagation allows the elimination of some copies.
Entry:
    add    R0,Rebp,0xc
    add    R2,Rebp,0x8
    add    R7,Rebp,0x10
    add    Rseq,Reip,Length(block)
    ldc    Rtarg,EIP(target)
    ld     Rc,[R0]
    prot   [R0],Alias1
    ld     Rs,[R2]
    prot   [R2],Alias2
    ld     Recx,[R7]
    prot   [R7],Alias3
Loop:
    st     [Rs],Rc
    add    Rs,Rs,4
    st     [R2],Rs,NoAliasCheck
    copy   Reax,Recx          //Live out
    sub    Recx,Reax,1        //Live out
    st     [R7],Recx,NoAliasCheck
    andcc  R11,Reax,Reax
    selcc  Reip,Rseq,Rtarg
    commit
    jg     Epilog,Loop
Epilog:
    FA     Alias1
    FA     Alias2
    FA     Alias3
    j      Sequential
This example illustrates the next stage of optimization, in which it is recognized that most of the copy instructions that replaced load instructions in the previous example are unnecessary and can be eliminated. That is, when a register-to-register copy takes place, the data already exists in the source register before the copy. The data can therefore be accessed in that first register rather than in the register to which it is copied, and the copy can be eliminated. As can be seen, this eliminates the first, second, fifth, and ninth host primitive instructions shown in the loop of the previous example. The registers used in other host primitive instructions are also changed to reflect where the data actually resides. For example, once the first and second copy instructions are eliminated, the third instruction (a store) must take its data from working register Rc, where the data actually resides, rather than from register R1, and must store it at the address held in working register Rs, where the address actually resides, rather than in register R3.
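A minimal Python sketch of forward copy propagation on a straight-line block follows. The (opcode, destination, sources) encoding and the name `propagate_copies` are assumptions for this sketch. It covers only the simple forward case, such as the first two copies and the store discussed above; it does not perform the further register renaming that the listing applies to `Rs` and `Recx`.

```python
def propagate_copies(block, live_out=frozenset()):
    """Replace uses of copied registers with their sources, then drop
    copies whose destination is neither read later nor live out."""
    copy_of = {}      # register -> register it currently duplicates
    rewritten = []
    for op, dst, srcs in block:
        # Read operands through any known copy relationship.
        srcs = tuple(copy_of.get(s, s) for s in srcs)
        if dst is not None:
            # dst is being redefined: forget facts that mention it.
            copy_of = {d: s for d, s in copy_of.items() if dst not in (d, s)}
            if op == "copy" and srcs[0] != dst:
                copy_of[dst] = srcs[0]
        rewritten.append((op, dst, srcs))

    result = []
    for i, (op, dst, srcs) in enumerate(rewritten):
        read_later = any(dst in later for _, _, later in rewritten[i + 1:])
        if op == "copy" and dst not in live_out and not read_later:
            continue  # dead copy
        result.append((op, dst, srcs))
    return result


# First instructions of the register-allocation loop body (store operands
# are written as address register, then data register).
BLOCK = [
    ("copy", "R1", ("Rc",)),
    ("copy", "R3", ("Rs",)),
    ("st", None, ("R3", "R1")),
    ("add", "R4", ("Rs", 4)),
    ("st", None, ("R2", "R4")),
]

optimized = propagate_copies(BLOCK)
```

Here the store is rewritten to use `Rs` and `Rc` directly, after which both copies are dead and are removed. For a loop body, `live_out` would also have to name any register whose value is needed on the next iteration.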
Example illustrating scheduling of the loop body only.
Entry:
    add    R0,Rebp,0xc
    add    R2,Rebp,0x8
    add    R7,Rebp,0x10
    add    Rseq,Reip,Length(block)
    ldc    Rtarg,EIP(target)
    ld     Rc,[R0]
    prot   [R0],Alias1
    ld     Rs,[R2]
    prot   [R2],Alias2
    ld     Recx,[R7]
    prot   [R7],Alias3
Loop:
    st     [Rs],Rc           & add Rs,Rs,4 & copy Reax,Recx
    st     [R2],Rs,NAC       & sub Recx,Reax,1
    st     [R7],Recx,NAC     & andcc R11,Reax,Reax
    selcc  Reip,Rseq,Rtarg   & jg Epilog,Loop & commit
Epilog:
    FA     Alias1
    FA     Alias2
    FA     Alias3
    j      Sequential
Host Instruction key:
    NAC = No Alias Check
The example above shows the host instructions after scheduling. As can be seen, the sequence requires fewer clocks to execute the loop than the target primitive instructions derived from the source code. Thus, in addition to all of the other acceleration techniques, the total number of operations executed is also smaller than the number needed to execute the original target code.
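To give a feel for how packing several operations into one instruction word reduces the clock count, the Python sketch below bundles a sequential instruction list in order. The packing rule (an operation may not read or write a resource already written in the same bundle, on the assumption that all reads in a bundle happen before its writes), the bundle width of three, and the read/write sets are assumptions for this sketch. The packer never reorders instructions, so it is less aggressive than the schedule in the listing.

```python
def pack_in_order(ops, width=3):
    """Greedily pack (name, reads, writes) operations into bundles
    without reordering them."""
    bundles, current, written = [], [], set()
    for name, reads, writes in ops:
        touches = set(reads) | set(writes)
        if current and (len(current) >= width or touches & written):
            bundles.append(current)          # close the current bundle
            current, written = [], set()
        current.append(name)
        written |= set(writes)
    if current:
        bundles.append(current)
    return bundles


# Loop body after copy propagation; "mem" and "cc" model memory and the
# condition codes as resources (an assumption of this sketch).
LOOP_OPS = [
    ("st [Rs],Rc", {"Rs", "Rc"}, {"mem"}),
    ("add Rs,Rs,4", {"Rs"}, {"Rs"}),
    ("st [R2],Rs,NAC", {"R2", "Rs"}, {"mem"}),
    ("copy Reax,Recx", {"Recx"}, {"Reax"}),
    ("sub Recx,Reax,1", {"Reax"}, {"Recx"}),
    ("st [R7],Recx,NAC", {"R7", "Recx"}, {"mem"}),
    ("andcc R11,Reax,Reax", {"Reax"}, {"R11", "cc"}),
    ("selcc Reip,Rseq,Rtarg", {"cc", "Rseq", "Rtarg"}, {"Reip"}),
    ("commit", set(), set()),
    ("jg Epilog,Loop", {"cc"}, set()),
]

bundles = pack_in_order(LOOP_OPS)
```

Even this simple rule places the ten operations in fewer bundles than there are operations; a scheduler that is allowed to move `copy Reax,Recx` earlier, as the listing does, can shorten the loop further.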
Store Elimination by use of the alias hardware.
Entry:
    add    R0,Rebp,0xc
    add    R2,Rebp,0x8
    add    R7,Rebp,0x10
    add    Rseq,Reip,Length(block)
    ldc    Rtarg,EIP(target)
    ld     Rc,[R0]
    prot   [R0],Alias1       ;protect the address from loads and stores
    ld     Rs,[R2]
    prot   [R2],Alias2       ;protect the address from loads and stores
    ld     Recx,[R7]
    prot   [R7],Alias3       ;protect the address from loads and stores
Loop:
    st     [Rs],Rc           & add Rs,Rs,4 & copy Reax,Recx
    sub    Recx,Reax,1       & andcc R11,Reax,Reax
    selcc  Reip,Rseq,Rtarg   & jg Epilog,Loop & commit
Epilog:
    FA     Alias1
    FA     Alias2
    FA     Alias3
    st     [R2],Rs           ;write back the final value of Rs
    st     [R7],Recx         ;write back the final value of Recx
    j      Sequential
The final optimization shown in this example uses the alias hardware to eliminate stores. The stores are removed from the loop body and performed only in the loop epilog. Compared with the original ten target instructions, this reduces the number of host instructions in the loop body to three.
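The effect of deferring the stores to the epilog can be checked with a short Python sketch that runs the loop both ways on a dictionary standing in for memory. The function names, the addresses, and the initial values are assumptions for illustration, and the sketch assumes the destination pointer never overlaps the protected addresses, which is the condition the alias hardware would otherwise trap.

```python
def run_with_stores(mem, r0, r2, r7):
    """Loop that writes the pointer and counter back on every iteration."""
    stores = 0
    rc, rs, recx = mem[r0], mem[r2], mem[r7]
    while True:
        mem[rs] = rc; stores += 1        # st [Rs],Rc
        rs += 4                          # add Rs,Rs,4
        mem[r2] = rs; stores += 1        # st [R2],Rs,NAC
        reax = recx                      # copy Reax,Recx
        recx = reax - 1                  # sub Recx,Reax,1
        mem[r7] = recx; stores += 1      # st [R7],Recx,NAC
        if not reax > 0:                 # andcc / jg test the old count
            break
    return stores


def run_with_writeback(mem, r0, r2, r7):
    """Same loop, with the two bookkeeping stores moved to the epilog."""
    stores = 0
    rc, rs, recx = mem[r0], mem[r2], mem[r7]
    while True:
        mem[rs] = rc; stores += 1
        rs += 4
        reax = recx
        recx = reax - 1
        if not reax > 0:
            break
    mem[r2] = rs; stores += 1            # write back the final value of Rs
    mem[r7] = recx; stores += 1          # write back the final value of Recx
    return stores


# Hypothetical frame: value at 0x100C, pointer at 0x1008, count at 0x1010.
initial = {0x100C: 7, 0x1008: 0x2000, 0x1010: 3}
mem_a, mem_b = dict(initial), dict(initial)
stores_a = run_with_stores(mem_a, 0x100C, 0x1008, 0x1010)
stores_b = run_with_writeback(mem_b, 0x100C, 0x1008, 0x1010)
```

Both versions leave memory in the same final state, but the second issues only one store per iteration plus two in the epilog.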
Although the present invention has been described above by way of embodiments, those skilled in the art can make various modifications and alterations without departing from the scope and spirit of the invention. For example, although the invention has been described with reference to emulation of X86 processors, it can also be applied to programs designed for other processor architectures and to programs that run on virtual machines, such as P-code, PostScript, or Java programs. The invention is therefore defined by the claims that follow.
Claims (14)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB971822298A CN100392618C (en) | 1997-08-11 | 1997-08-11 | System, method and apparatus for protecting memory in a computer from being written to |
Publications (2)
Publication Number | Publication Date |
---|---|
CN1286772A CN1286772A (en) | 2001-03-07 |
CN100392618C true CN100392618C (en) | 2008-06-04 |
Family
ID=5178347
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNB971822298A Expired - Fee Related CN100392618C (en) | 1997-08-11 | 1997-08-11 | System, method and apparatus for protecting memory in a computer from being written to |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN100392618C (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2543306B (en) * | 2015-10-14 | 2019-05-01 | Advanced Risc Mach Ltd | Exception handling |
CN112363759B (en) * | 2020-10-22 | 2022-10-14 | 海光信息技术股份有限公司 | Register configuration method and device, CPU chip and electronic equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1060731A (en) * | 1990-10-01 | 1992-04-29 | 国际商业机器公司 | The Memory Controller direct or interleave memory accessing is used |
US5282274A (en) * | 1990-05-24 | 1994-01-25 | International Business Machines Corporation | Translation of multiple virtual pages upon a TLB miss |
US5361340A (en) * | 1990-01-05 | 1994-11-01 | Sun Microsystems, Inc. | Apparatus for maintaining consistency in a multiprocessor computer system using virtual caching |
US5437017A (en) * | 1992-10-09 | 1995-07-25 | International Business Machines Corporation | Method and system for maintaining translation lookaside buffer coherency in a multiprocessor data processing system |
Also Published As
Publication number | Publication date |
---|---|
CN1286772A (en) | 2001-03-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US5958061A (en) | Host microprocessor with apparatus for temporarily holding target processor state | |
US6011908A (en) | Gated store buffer for an advanced microprocessor | |
US5832205A (en) | Memory controller for a microprocessor for detecting a failure of speculation on the physical nature of a component being addressed | |
JP3753743B2 (en) | Method and apparatus for memory data aliasing in advanced processors | |
US7840776B1 (en) | Translated memory protection apparatus for an advanced microprocessor | |
US6031992A (en) | Combining hardware and software to provide an improved microprocessor | |
JP3776132B2 (en) | Microprocessor improvements | |
JP3621116B2 (en) | Conversion memory protector for advanced processors | |
JP3654913B2 (en) | Host microprocessor with a device that temporarily holds the state of the target processor | |
CN100392618C (en) | System, method and apparatus for protecting memory in a computer from being written to |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
ASS | Succession or assignment of patent right | Owner name: TRANSMITAR CO., LTD; Free format text: FORMER OWNER: TRANSMITAR CO., LTD.; Effective date: 20091030. Owner name: KNOWLEDGE VENTURE CAPITAL ROMPLAST-14 O., LTD; Free format text: FORMER OWNER: TRANSMITAR CO., LTD; Effective date: 20091030 |
C41 | Transfer of patent application or patent right or utility model | |
TR01 | Transfer of patent right | Effective date of registration: 20091030; Address after: Nevada; Patentee after: TRANSMETA Corp.; Address before: California, USA; Patentee before: Full simeida LLC. Effective date of registration: 20091030; Address after: California, USA; Patentee after: Full simeida LLC; Address before: California, USA; Patentee before: Transmeta Corp. |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20080604; Termination date: 20140811 |
EXPY | Termination of patent right or utility model | |