
CN1295599C - Instruction prefetch method combining dynamic and static techniques for a very long instruction word (VLIW) microprocessor - Google Patents

Info

Publication number
CN1295599C
CN1295599C · CNB2004100467655A · CN200410046765A
Authority
CN
China
Prior art keywords
instruction
nop
pref
prefetch
address
Prior art date
Legal status
Expired - Lifetime
Application number
CNB2004100467655A
Other languages
Chinese (zh)
Other versions
CN1598761A (en)
Inventor
扈啸
陈书明
陈宝民
张丹瑜
胡定磊
郭阳
万江华
刘祥远
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CNB2004100467655A priority Critical patent/CN1295599C/en
Publication of CN1598761A publication Critical patent/CN1598761A/en
Application granted granted Critical
Publication of CN1295599C publication Critical patent/CN1295599C/en
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Landscapes

  • Advance Control (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses an instruction prefetch method for a very long instruction word (VLIW) microprocessor that combines dynamic and static prefetching. The technical problem to be solved is to improve the efficiency of instruction prefetching and to hide memory latency. The technical solution is to design a prefetch controller responsible for instruction prefetching, build an emulation debugging environment to control prefetching, and design a prefetch instruction nop_pref that is compatible with NOP. At compile time NOP instructions are replaced with nop_pref; during debugging the prefetch controller dynamically adjusts the prefetch timing and feeds the timing changes back to the compiler to optimize the static prefetch strategy; the new program generated after the compiler adjusts the prefetch strategy is downloaded to RAM, and after several feedback rounds a prefetch prediction mechanism tailored to the application system is formed. While keeping the embedded microprocessor's hardware implementation simple and its power consumption low, the method hides memory latency and improves program execution efficiency, overcoming both the code-size growth of static prefetching and the low accuracy of dynamic prefetching.

Description

Instruction Prefetching Method Combining Dynamic and Static Techniques for a VLIW Microprocessor

Technical field: The present invention relates to instruction prefetching methods in the design of very long instruction word (VLIW) microprocessors, and in particular to instruction prefetching in embedded VLIW microprocessor design.

Background: Many current high-performance microprocessors adopt the VLIW architecture, in which multiple functional units work in parallel and instruction-level parallelism and data transfers are determined entirely by the compiler at compile time. This mechanism is simple to implement, has low hardware complexity, and exposes more instruction-level parallelism, so it is widely used.

As the performance gap between processors and memory keeps widening, memory latency has become a system bottleneck. To hide this latency, many processors use prefetching: through dedicated prefetch instructions or dedicated hardware mechanisms, data in external memory is brought into the on-chip cache ahead of time, scheduled by the compiler in software or by a combination of software and hardware. Hardware prefetching alone is often not very accurate; with compiler support, prefetches are issued only when appropriate, which improves efficiency and reduces unnecessary prefetches. Both the IBM 370/168 and the Amdahl 470V processors used prefetching.

In his doctoral dissertation, Steven Paul VanderWiel studied a compiler-assisted data prefetch controller, a combined dynamic and static data prefetching method that achieves roughly a 50% improvement in execution efficiency. However, it targets general-purpose CPUs, does not exploit the characteristics of embedded VLIW microprocessors, and the inserted prefetch instructions increase code size.

A typical embedded VLIW microprocessor system consists of four parts: a CPU core, a cache system, a memory controller, and memory, connected by an instruction bus, a data bus, and an address bus. The CPU core executes instructions and writes results back to memory; the cache system is a high-speed buffer for instructions and data, consisting of an instruction cache and a data cache; the memory controller provides the interface between memory and the cache system, and when an instruction or datum required by the CPU core is not in the cache, the memory controller reads it from memory into the cache; the memory stores instructions and data.

The executable program of an embedded VLIW microprocessor can be divided into basic blocks. A basic block is the basic building unit of a program: it has a single entry (the first statement of the block) and a single exit (the last statement of the block), and the exit of a basic block is a branch instruction. The program code of an embedded microprocessor is usually stored in non-volatile memory, but to reduce memory latency it is copied to RAM before execution.

Embedded VLIW microprocessors are mainly used in compute-intensive or data-intensive embedded application systems, running an embedded real-time operating system or no operating system at all. The application code they execute is relatively fixed, is usually stored in non-volatile memory, and is rarely changed.

During the debugging stage of an embedded application system, the system works in an environment that is as realistic as possible, and the application code runs on the embedded VLIW microprocessor through an emulator. This mode of program execution is called emulation debugging; the debug host accesses the registers and memory of the VLIW microprocessor through the emulator.

Summary of the invention: The technical problem to be solved by the present invention is to make full use of the characteristics of embedded VLIW microprocessors and, through a combination of dynamic and static methods, to further improve the efficiency of instruction prefetching so as to hide memory latency. The technical solution is to design a prefetch controller in the VLIW microprocessor dedicated to instruction prefetching, build an emulation debugging environment for prefetch control, and design a prefetch instruction nop_pref that is compatible with the NOP instruction. At compile time, no-operation (NOP) instructions in the VLIW microprocessor are replaced with prefetch instructions; during debugging, the prefetch controller dynamically adjusts the prefetch timing and feeds the timing changes back to the compiler, which optimizes its static prefetch strategy; the new program code generated after the compiler adjusts the static prefetch strategy is downloaded to RAM, and after several feedback rounds a prefetch prediction mechanism tailored to this application system is formed. While keeping the embedded microprocessor's hardware implementation simple and its power consumption low, the method hides memory latency and improves program execution efficiency, overcoming both the code-size growth of static prefetching and the low accuracy of dynamic prefetching. To our knowledge, no method of prefetching instructions in this way has been reported at home or abroad.

The specific scheme of the present invention is as follows:

An emulation debugging environment is built for instruction prefetch control. It consists of a compiler on the debug host, the embedded application system running in its real environment, and a hardware emulator connecting the debug host with the VLIW microprocessor in the embedded application system. Through the hardware emulator, the compiler reads back the prefetch adjustment information stored in RAM from the embedded application system, and downloads the program code with adjusted prefetch instructions to the embedded application system for the VLIW microprocessor to execute.

The instruction set of the embedded VLIW microprocessor includes a no-operation instruction (NOP), used to fill the delay slots of multi-cycle instructions and to align the boundaries of parallel instruction packets. In typical applications NOP instructions account for roughly 5%-20% of all instructions.

The present invention designs the nop_pref instruction to be compatible with the NOP instruction; from the CPU core's point of view, a nop_pref instruction is simply a NOP.

The nop_pref instruction contains instruction validity fields V and E, prefetch address fields intrc and cst, and a NOP-compatible field spec. The V field indicates whether this NOP may be replaced by a nop_pref instruction: if V is valid, the NOP may be replaced by nop_pref and its address may be stored in the NOP buffer; if V is invalid, the NOP may not be replaced. The E field indicates whether the nop_pref instruction is active: if E is valid, the nop_pref instruction may be executed; if E is invalid, it may not. The intrc and cst fields together encode the address of the instructions to be prefetched. The spec field carries the meaning of the NOP instruction in the original instruction set, including the NOP opcode, the number of idle cycles, and whether it executes in parallel.
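
As a concrete illustration of this layout, the following C sketch models one possible nop_pref encoding. The patent names the fields (V, E, intrc, cst, spec) but does not fix their widths or positions, so the bit widths below are assumptions chosen only for illustration.

```c
#include <stdint.h>

/* Illustrative model of a 32-bit nop_pref word. Field widths are assumed;
 * only the field names and roles come from the description above. */
typedef struct {
    uint32_t v     : 1;   /* V: this NOP slot may carry a prefetch            */
    uint32_t e     : 1;   /* E: the prefetch is enabled and may be executed   */
    uint32_t intrc : 6;   /* intrc + cst together encode the prefetch address */
    uint32_t cst   : 16;
    uint32_t spec  : 8;   /* NOP-compatible part: NOP opcode, idle cycles,    */
                          /* parallel bit -- all that a plain CPU core decodes */
} nop_pref_t;

/* The CPU core treats the whole word as an ordinary NOP; the prefetch
 * controller looks only at V and E. */
static inline int may_be_buffered(nop_pref_t i) { return i.v; }
static inline int may_be_executed(nop_pref_t i) { return i.v && i.e; }
```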

The present invention designs a prefetch controller in the embedded VLIW microprocessor to execute prefetch instructions, analyze the prefetch effect, and adjust the prefetch timing. It consists of three parts: a bus monitoring module, a NOP buffer, and a prefetch module. The bus monitoring module is connected to the cache system and the CPU core through the instruction bus and the address bus; the NOP buffer is connected to the bus monitoring module and the prefetch module through the data bus and the address bus.

The bus monitoring module consists of a snoop submodule, a decode submodule, and a timing submodule. The snoop submodule is a comparator that monitors the instruction bus and the address bus between the CPU core and the instruction cache. The decode submodule is connected to the snoop submodule and the NOP buffer through the data bus and the address bus and has two functions: it stores every nop_pref instruction whose V field is valid into the NOP buffer, and it sends every nop_pref instruction whose E field is valid to the prefetch module for execution. The timing submodule records the time t0 at which the execution submodule issues the prefetch, the time tp at which the prefetched instruction block enters the program cache, and the time tr at which the CPU core requests that block, and sends these three times over the data bus to the prefetch module for analysis of the prefetch effect.
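
The following sketch shows how these submodules cooperate for one word observed on the instruction bus, reusing the nop_pref_t model above. nop_buffer_push() and issue_prefetch() are placeholders for the NOP buffer and the execution submodule sketched further below, and the stamped times correspond to t0, tp and tr.

```c
#include <stdint.h>

void nop_buffer_push(nop_pref_t insn, uint32_t addr);  /* NOP buffer, sketched below */
void issue_prefetch(nop_pref_t insn);                   /* execution submodule        */

typedef struct { uint64_t t0, tp, tr; } pref_times_t;   /* kept by the timing submodule */

/* Called once per instruction word snooped on the core-to-icache bus. */
void snoop_instruction(nop_pref_t insn, uint32_t pc, uint64_t now, pref_times_t *t)
{
    if (insn.v)                      /* decode rule 1: remember replaceable NOPs */
        nop_buffer_push(insn, pc);
    if (insn.v && insn.e) {          /* decode rule 2: hand active prefetches on */
        issue_prefetch(insn);
        t->t0 = now;                 /* timing submodule: prefetch issued        */
    }
}
/* t->tp is stamped when the prefetched block arrives in the instruction cache,
 * and t->tr when the CPU core requests that block, from other bus events.      */
```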

The NOP buffer is a buffer implemented in RAM with a first-in first-out structure. It caches the contents and addresses of recently executed nop_pref instructions, has a depth of N entries, and has one write port and one read port. The write port writes the contents and address of the most recently executed nop_pref instruction whose V field is valid into the buffer entry designated by the write pointer, and the write pointer increments cyclically as 1, 2, 3, ..., N, 1, 2, .... The read port is controlled by the prefetch module, which searches the NOP buffer through it.
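
A minimal software model of this circular buffer is sketched below, reusing the nop_pref_t layout assumed earlier. The depth of 64 entries matches the YHFT embodiment mentioned at the end of the description; the patent itself only requires a depth of N.

```c
#include <stdint.h>

#define NOP_BUF_DEPTH 64                 /* N: 64 entries in the YHFT embodiment */

typedef struct {
    nop_pref_t insn;                     /* contents of the buffered nop_pref    */
    uint32_t   addr;                     /* its address in program memory        */
    int        used;
} nop_buf_entry_t;

static nop_buf_entry_t nop_buf[NOP_BUF_DEPTH];
static unsigned        wr_ptr;           /* wraps 0..N-1 (the text counts 1..N)  */

/* Write port: store the most recently executed nop_pref with a valid V field. */
void nop_buffer_push(nop_pref_t insn, uint32_t addr)
{
    nop_buf[wr_ptr] = (nop_buf_entry_t){ insn, addr, 1 };
    wr_ptr = (wr_ptr + 1) % NOP_BUF_DEPTH;
}
/* The read port is not modelled separately; the prefetch module simply indexes
 * nop_buf[] when it searches for a free slot (see the adjustment sketch below). */
```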

The prefetch module consists of an execution submodule and an analysis submodule. The execution submodule is connected to the decode submodule and the instruction cache through the data bus and issues prefetch commands to the instruction cache. The analysis submodule is connected to the NOP buffer and the memory controller through the data bus and the address bus; it analyzes the prefetch effect from t0, tp and tr and judges whether the prefetch timing is appropriate: if tp - tr < 0, the prefetch lags and the nop_pref must be moved earlier; if tp - tr > Tm (where Tm is an empirical value), the prefetch is too early, the prefetched contents occupy cache space without being used immediately and waste it, and the nop_pref must be moved later; if Tm > tp - tr > 0, the prefetch position is appropriate and no adjustment is needed. The value of Tm is determined by (tp - t0), the cache capacity, and the cache line capacity.
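
The timing rule can be written compactly as below. Times are in core clock cycles; the Tm formula uses the 3*(tp - t0) choice reported for the YHFT embodiment, whereas the patent only requires Tm to be derived from (tp - t0) and the cache and cache-line capacities. The sign convention follows the text above.

```c
#include <stdint.h>

typedef enum { PREF_OK, PREF_LAGS, PREF_TOO_EARLY } pref_verdict_t;

/* Analysis submodule: classify one prefetch from the three recorded times. */
pref_verdict_t judge_prefetch(int64_t t0, int64_t tp, int64_t tr)
{
    int64_t d  = tp - tr;          /* comparison used by the patent text       */
    int64_t tm = 3 * (tp - t0);    /* empirical bound; YHFT uses 3*(tp - t0)   */

    if (d < 0)  return PREF_LAGS;        /* move the nop_pref earlier  */
    if (d > tm) return PREF_TOO_EARLY;   /* move the nop_pref later    */
    return PREF_OK;                      /* 0 < d < tm: leave it alone */
}
```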

The method of adjusting the nop_pref position is as follows. If the nop_pref must be moved earlier, the analysis submodule searches the NOP buffer for the address of the nearest available nop_pref (one whose E field is invalid) preceding this nop_pref instruction, writes this nop_pref instruction to that address in memory through the memory controller, and then sets the E field of this nop_pref instruction to invalid and writes it back to its original address. If the nop_pref must be moved later, the prefetch module searches the NOP buffer for the nearest available nop_pref address following this nop_pref instruction, writes this nop_pref instruction to that address, and writes the instruction, with its E field set to invalid, back to the original address.
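
The adjustment itself can be sketched on top of the NOP buffer model above. memory_write() stands in for the write path through the memory controller; "earlier" and "later" refer to entries buffered before or after the current one, i.e. NOPs that appear earlier or later in the executed instruction stream.

```c
#include <stdint.h>

void memory_write(uint32_t addr, nop_pref_t insn);   /* via the memory controller */

/* Move the nop_pref held in nop_buf[cur] to the nearest free slot: a buffered
 * nop_pref whose E field is invalid. Direction follows the verdict above.     */
void move_nop_pref(unsigned cur, int earlier)
{
    for (unsigned step = 1; step < NOP_BUF_DEPTH; step++) {
        unsigned idx = earlier
            ? (cur + NOP_BUF_DEPTH - step) % NOP_BUF_DEPTH   /* walk back  */
            : (cur + step) % NOP_BUF_DEPTH;                  /* walk ahead */
        if (nop_buf[idx].used && !nop_buf[idx].insn.e) {
            memory_write(nop_buf[idx].addr, nop_buf[cur].insn); /* re-arm the prefetch
                                                                   at the new NOP slot */
            nop_pref_t old = nop_buf[cur].insn;
            old.e = 0;                                          /* disable the      */
            memory_write(nop_buf[cur].addr, old);               /* original slot    */
            return;
        }
    }
}
```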

When a program runs on the embedded VLIW microprocessor, the instruction prefetch method of the present invention proceeds as follows:

1. Based on the CPU clock, memory bandwidth, and cache structure, the compiler estimates the time needed to transfer one instruction block requested by the CPU core, and replaces a NOP instruction in the compiled assembly code with the prefetch instruction nop_pref, so that the program executes the prefetch instruction first and the instruction block is fetched from memory into the instruction cache in advance. When the block is executed, its instructions are already in the cache and the CPU core can execute them immediately without waiting. The compiler performs this prefetch prediction for all instruction blocks and replaces the corresponding NOPs with nop_pref.

2. To improve prefetch performance, during program execution the prefetch controller dynamically adjusts the position of nop_pref according to the actual run-time behavior of the program. The specific process is as follows:

1) The snoop submodule of the prefetch controller monitors the instruction bus and the address bus between the CPU core and the instruction cache, and the decode submodule continuously stores into the NOP buffer every nop_pref instruction with a valid V field, together with its address, seen on the instruction bus from the CPU core to the instruction cache (hereafter simply the instruction bus).

2) As soon as the snoop submodule detects a nop_pref instruction with a valid V field on the instruction bus, a prefetch request is issued to the instruction cache and this moment is recorded as t0. When the program block specified by nop_pref is read into the instruction cache, that moment is recorded as tp. When the CPU core requests the prefetched block, that moment is recorded as tr; if tr never arrives, the prefetch is considered ineffective and tr is set to TMAX. The analysis submodule judges from t0, tp and tr whether the prefetch timing is appropriate: if tp - tr < 0, the prefetch lags and the nop_pref must be moved earlier; if tp - tr > Tm, the prefetch is early, and being too early occupies a cache line prematurely and wastes it, so the nop_pref must be moved later; if Tm > tp - tr > 0, the prefetch position is appropriate and no adjustment is needed.

3) Based on the records in the NOP buffer, the analysis submodule moves this nop_pref instruction to an available nop_pref position (one whose E field is invalid, as in the adjustment method above) that satisfies Tm > tp - tr > 0, and writes this nop_pref instruction to the address of that available nop_pref, thereby dynamically adjusting the prefetch.

3. After the program has run for a while in the emulation environment, the debug host stops it through the emulator, reads the program code in RAM back to the debug host, extracts the prefetch controller's dynamic adjustments of the nop_pref positions, i.e. the addresses and contents of all nop_pref instructions, and feeds them back to the compiler. The compiler adjusts its static prefetch strategy accordingly, i.e. adjusts where nop_pref appears in the program; the debug host downloads the new program code with the adjusted nop_pref positions to RAM, and the VLIW microprocessor executes the program again. This process is repeated: the compiler keeps revising its static prefetch strategy according to the prefetch positions dynamically adjusted by the hardware, and after a certain number of iterations the revision stops, yielding a prefetch prediction mechanism tailored to this application system. When the debugging process ends, the compiler produces program code with fully optimized prefetching, which is burned into the embedded application system as the final program. A sketch of this debug-host loop follows.
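
The sketch below shows the loop as seen from the debug host. The emulator and compiler entry points (emu_download, emu_run_for, emu_halt, emu_read_ram, recompile_with_feedback) are hypothetical names standing in for whatever the real tool chain provides; the control flow is what the step above describes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tool-chain hooks -- not real APIs. */
void   emu_download(const uint8_t *image, size_t len);   /* write program to RAM    */
void   emu_run_for(unsigned seconds);                    /* run under emulation     */
void   emu_halt(void);                                   /* stop the processor      */
size_t emu_read_ram(uint8_t *image, size_t max_len);     /* read adjusted code back */
/* Rewrites `image` with the compiler's updated static nop_pref placement and
 * returns nonzero while the last round still moved any prefetch instruction.   */
int    recompile_with_feedback(uint8_t *image, size_t len);

void tune_prefetch(uint8_t *image, size_t len, int max_rounds)
{
    for (int round = 0; round < max_rounds; round++) {
        emu_download(image, len);            /* download current code              */
        emu_run_for(10);                     /* prefetch controller adjusts        */
        emu_halt();                          /*   nop_pref positions at run time   */
        len = emu_read_ram(image, len);      /* feed adjustments back              */
        if (!recompile_with_feedback(image, len))
            break;                           /* placement stable: stop revising    */
    }
    /* `image` now holds the individually tuned program to be burned in. */
}
```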

Adopting the present invention produces the following beneficial technical effects:

1. Guided by the program's execution, instruction packets needed by the VLIW microprocessor are moved from off-chip memory into the instruction cache ahead of time, hiding the latency of the processor's accesses to external memory.

2. The VLIW microprocessor is emulated and debugged in its real environment, and the prefetch controller dynamically adjusts the position of nop_pref according to actual run-time results, which greatly improves prefetch accuracy and achieves a good prefetch effect.

3. The prefetch timing information dynamically adjusted by the hardware is fed back to the compiler through the emulator; the compiler adjusts its static prefetch strategy and regenerates the program code that performs the prefetching. Because the NOP instructions already present in the program are reused, instruction prefetching is achieved without increasing code size at all.

Brief description of the drawings:

Figure 1 is the overall logical structure of a background-art VLIW microprocessor;

Figure 2 is a schematic diagram of the hardware emulation debugging environment built by the present invention;

Figure 3 is the instruction format of the prefetch instruction nop_pref of the present invention;

Figure 4 is the logical structure of a VLIW microprocessor that adopts the present invention;

Figure 5 is a schematic diagram of the structure of the NOP buffer in the prefetch controller of the present invention;

Figure 6 compares the effect of the present invention with that of traditional prefetch methods.

Specific embodiments:

Figure 1 is the overall logical structure of the background art. The entire embedded VLIW microprocessor system consists of four parts: a CPU core, a cache system, a memory controller, and memory. The CPU core executes instructions and writes results back to memory; it is connected to the cache system through the instruction/data bus and the address bus. The cache system is a high-speed buffer for instructions and data, consisting of an instruction cache and a data cache. The memory controller provides the interface between memory and the cache system; when an instruction or datum required by the CPU core is not in the cache, the memory controller reads it from memory into the cache. The memory stores instructions and data. These four modules are connected by the instruction bus, the data bus, and the address bus.

Figure 2 is a schematic diagram of the hardware emulation debugging environment built by the present invention. It consists of three parts: a debug host, a hardware emulator, and the embedded application system. The debug host debugs the program code running on the embedded VLIW microprocessor, and the compiler runs on the debug host; the embedded application system is the working environment of the embedded VLIW microprocessor; the hardware emulator provides the debug communication channel between the debug host and the embedded VLIW microprocessor, and also provides the prefetch communication channel between the compiler and the prefetch controller.

During the debugging stage of the embedded application system, the system works in an environment that is as realistic as possible, and the application code runs on the VLIW microprocessor of the application system through the hardware emulator. This mode of program execution is called emulation debugging; the debug host can access the registers and memory of the VLIW microprocessor through the emulator.

Based on the VLIW microprocessor clock, memory bandwidth, and cache structure, the compiler estimates the time needed to transfer one instruction block requested by the CPU core, replaces a NOP instruction with the prefetch instruction nop_pref, and downloads the program code through the hardware emulator into the RAM of the embedded application system for execution. During execution, the prefetch controller dynamically adjusts the positions of the prefetch instructions according to the actual run-time behavior and writes the adjusted results back to RAM. After the program code has run for a while in the real application system, the debug host stops the VLIW microprocessor and reads the memory contents back, and the compiler revises its static prefetch strategy according to the prefetch information dynamically adjusted by the hardware; the corrected program code is then downloaded to the application system and executed again. After repeating this several times, the compiler forms a prefetch prediction mechanism tailored to this application system, achieving efficient prefetching at low cost. A sketch of the static placement step is given below.
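
The following is a minimal sketch of that static placement step. The program representation (a flat array of decoded words with a flag on replaceable NOPs) and the helper set_prefetch_target() are assumptions made only for illustration; the latency model is the simple one implied above, where the lead time for a block is its size divided by the external memory bandwidth, expressed in core cycles.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed compiler-side representation: one entry per instruction word. */
typedef struct { int is_nop; int cycles; nop_pref_t word; } asm_word_t;

void set_prefetch_target(nop_pref_t *w, uint32_t block_addr); /* fills intrc/cst */

/* Core cycles needed to move one instruction block in from external memory.
 * The caller computes lead = block_fetch_cycles(block_bytes, bytes_per_cycle). */
static int block_fetch_cycles(int block_bytes, int mem_bytes_per_cycle)
{
    return (block_bytes + mem_bytes_per_cycle - 1) / mem_bytes_per_cycle;
}

/* For the block starting at index blk, walk backwards through the already
 * scheduled code and turn the first replaceable NOP at least `lead` cycles
 * ahead of the block into an active nop_pref. Returns its index, or -1.    */
int place_prefetch(asm_word_t *prog, size_t blk, int lead, uint32_t blk_addr)
{
    int cycles = 0;
    for (size_t i = blk; i-- > 0; ) {
        cycles += prog[i].cycles;
        if (prog[i].is_nop && cycles >= lead) {
            prog[i].word.v = 1;                      /* may carry a prefetch  */
            prog[i].word.e = 1;                      /* and it is enabled     */
            set_prefetch_target(&prog[i].word, blk_addr);
            return (int)i;
        }
    }
    return -1;                                       /* no suitable NOP found */
}
/* Remaining replaceable NOPs are emitted with V = 1, E = 0 elsewhere, so the
 * prefetch controller can later move the prefetch into them. */
```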

Figure 3 is the instruction format of the prefetch instruction nop_pref of the present invention. The nop_pref instruction contains instruction validity fields (V and E), the prefetch address (intrc and cst fields), and a NOP-compatible part (the spec field). The V field indicates whether this NOP may be replaced by a nop_pref instruction: if V is valid, the NOP may be replaced and its address may be stored in the NOP buffer; if V is invalid, it may not be replaced. The E field indicates whether the nop_pref instruction is active: if E is valid, the nop_pref instruction may be executed; if E is invalid, it may not. The intrc and cst fields together encode the address of the instructions to be prefetched. The spec field carries the meaning of the NOP instruction in the original instruction set, including the NOP opcode, the number of idle cycles, and whether it executes in parallel.

Figure 4 is the logical structure of the prefetch controller of the present invention. The prefetch controller is connected to the CPU core, the instruction cache, and the memory controller through buses. It consists of three parts: a bus monitoring module, a NOP buffer, and a prefetch module.

The bus monitoring module consists of a snoop submodule, a decode submodule, and a timing submodule. The snoop submodule is a comparator used to monitor the instruction bus and the address bus between the CPU core and the instruction cache. The decode submodule has two functions: it stores every nop_pref instruction with a valid V field into the NOP buffer, and it sends every nop_pref instruction with a valid E field to the prefetch module for prefetching. The timing submodule records the time t0 at which the prefetch is issued, the time tp at which the prefetched instruction block enters the program cache, and the time tr at which the CPU core requests that block, and sends these three times to the prefetch module to analyze the prefetch effect.

The NOP buffer caches the contents and addresses of the most recently executed nop_pref instructions. Its depth is N and it is written in a circular fashion. Its contents are searched by the prefetch module.

The prefetch module consists of an execution submodule and an analysis submodule. The execution submodule decodes valid nop_pref instructions and issues prefetch commands to the instruction cache. The analysis submodule analyzes the prefetch effect from t0, tp and tr, judges whether the prefetch timing is appropriate, and writes the adjusted prefetch instructions back to memory through the memory controller.

Figure 5 is a schematic diagram of the structure of the NOP buffer in the present invention. The NOP buffer is a RAM that caches the contents and addresses of the most recently executed nop_pref instructions; it is M bits wide and N entries deep, with one write port and one read port. The write port writes the contents and address of the most recently executed nop_pref instruction whose V field is 1 into the buffer entry designated by the write pointer, and the write pointer increments cyclically as 1, 2, 3, ..., N, 1, 2, .... The read port is controlled by the prefetch module.

Figure 6 compares the effect of the present invention with that of traditional prefetch methods. Panel (a) shows execution without prefetching, panel (b) shows the prefetch effect of a traditional method, and panel (c) shows the prefetch effect of the present invention. In the figure, r1, r2, and r3 denote three instruction fetch requests issued by the program running in the CPU core. The horizontal axis is time, and each small cell represents one clock cycle. Dark gray cells indicate that the CPU core is computing; light gray cells indicate a memory access in progress; white cells indicate that the CPU core's fetch hits the cache and the instruction is obtained immediately; black cells indicate that the fetch misses the cache and the instruction cannot be obtained immediately; patterned cells indicate that the CPU core executes a prefetch instruction and the memory begins prefetching.

In panel (a), the CPU core issues fetch requests at r1, r2, and r3. Without a prefetch mechanism, every cache access misses, the CPU core cannot obtain instructions, and the pipeline stalls until the memory access completes; once the instructions arrive, the pipeline resumes.

In panel (b), the CPU core issues a prefetch request one cycle before r1, but the request comes too late: when the CPU core issues the fetch request at r1, the prefetched contents have not yet reached the cache, and the pipeline stalls. The situation is similar for the fetch request at r2, and the pipeline again stalls. The third prefetch instruction, by contrast, is executed too early: when the fetch request is issued at r3, the instructions have already been sitting in the cache for three cycles, causing some waste.

In panel (c), the moment at which the CPU core executes each prefetch instruction has been repeatedly adjusted by the prefetch controller and the compiler in the emulation environment, so that instructions are used as soon as they are fetched into the cache. This neither stalls the pipeline nor wastes cache space, and the overall performance of the processor is greatly improved.

The present invention has been applied in the YHFT series DSPs developed by the National University of Defense Technology. In that design the NOP buffer in the prefetch controller is 64 bits wide and 64 entries deep, and Tm is 3×(tp - t0).

Claims (5)

1. An instruction prefetch method combining dynamic and static techniques for a very long instruction word (VLIW) microprocessor, characterized in that a prefetch controller dedicated to instruction prefetching is designed in the VLIW microprocessor; an emulation debugging environment is built for instruction prefetch control; a prefetch instruction nop_pref compatible with the NOP instruction is designed; at compile time, no-operation (NOP) instructions in the VLIW microprocessor are replaced with prefetch instructions; during debugging, the prefetch controller dynamically adjusts the prefetch timing and feeds the timing changes back to the compiler to optimize the static prefetch strategy; the new program code generated after the compiler adjusts the static prefetch strategy is downloaded to RAM; and after repeated feedback a prefetch prediction mechanism tailored to this application system is formed.
2. The instruction prefetch method combining dynamic and static techniques for a VLIW microprocessor according to claim 1, characterized in that the emulation debugging environment is built as follows: the environment consists of a compiler on the debug host, the embedded application system running in its real environment, and a hardware emulator connecting the debug host with the VLIW microprocessor in the embedded application system; through the hardware emulator, the compiler reads back the prefetch adjustment information stored in RAM from the embedded application system, and downloads the program code with adjusted prefetch instructions through the emulator to the embedded application system for the VLIW microprocessor to execute.
3. The instruction prefetch method combining dynamic and static techniques for a VLIW microprocessor according to claim 1, characterized in that the nop_pref instruction is designed as follows: it contains instruction validity fields V and E, prefetch address fields intrc and cst, and a NOP-compatible field spec; the V field indicates whether this NOP may be replaced by a nop_pref instruction: if V is valid, the NOP may be replaced by nop_pref and its address may be stored in the NOP buffer; if V is invalid, the NOP may not be replaced by nop_pref; the E field indicates whether the nop_pref instruction is active: if E is valid, the nop_pref instruction may be executed; if E is invalid, it may not; the intrc and cst fields together encode the prefetch instruction address; and the spec field carries the meaning of the NOP instruction in the original instruction set, including the NOP opcode, the number of idle cycles, and whether it executes in parallel.
4. The instruction prefetch method combining dynamic and static techniques for a VLIW microprocessor according to claim 1, characterized in that the prefetch controller is a component designed in the VLIW microprocessor to execute prefetch instructions, analyze the prefetch effect, and dynamically adjust the prefetch timing, and is designed as follows: it consists of three parts, a bus monitoring module, a NOP buffer, and a prefetch module; the bus monitoring module is connected to the cache system and the CPU core through the instruction bus and the address bus, and the NOP buffer is connected to the bus monitoring module and the prefetch module through the data bus and the address bus;
the bus monitoring module consists of a snoop submodule, a decode submodule, and a timing submodule; the snoop submodule is a comparator that monitors the instruction bus and the address bus between the CPU core and the instruction cache, continuously capturing nop_pref instructions with a valid V field, together with their addresses, on the instruction bus from the CPU core to the instruction cache; the decode submodule is connected to the snoop submodule and the NOP buffer through the data bus and the address bus and has two functions: it stores every nop_pref instruction with a valid V field, together with its address, into the NOP buffer, and it sends every nop_pref instruction with a valid E field to the prefetch module for execution; the timing submodule records the time t0 at which the execution submodule issues the prefetch, the time tp at which the prefetched instruction block enters the program cache, and the time tr at which the CPU core requests that block, and sends these three times over the data bus to the prefetch module to analyze the prefetch effect;
the NOP buffer is a buffer with a first-in first-out structure implemented in RAM, responsible for caching the contents and addresses of recently executed nop_pref instructions; its depth is N entries and it has one write port and one read port; the write port writes the contents and address of the most recently executed nop_pref instruction whose V field is valid into the buffer entry designated by the write pointer, the write pointer incrementing cyclically as 1, 2, 3, ..., N, 1, 2, ...; the read port is controlled by the prefetch module, which searches the NOP buffer through it;
the prefetch module consists of an execution submodule and an analysis submodule; the execution submodule is connected to the decode submodule and the instruction cache through the data bus and issues prefetch commands to the instruction cache; the analysis submodule is connected to the NOP buffer and the memory controller through the data bus and the address bus, analyzes the prefetch effect from t0, tp and tr, and judges whether the prefetch timing is appropriate: if tp - tr < 0, the prefetch lags and the nop_pref must be moved earlier; if tp - tr > Tm, the prefetch is too early, the prefetched contents occupy cache space without being used immediately, wasting it, and the nop_pref must be moved later; if Tm > tp - tr > 0, the prefetch position is appropriate and no adjustment is needed; the value of Tm is determined by (tp - t0), the cache capacity, and the cache line capacity; the method of adjusting the nop_pref position is as follows: if the nop_pref must be moved earlier, the analysis submodule searches the NOP buffer for the address of the nearest available nop_pref with an invalid E field preceding this nop_pref instruction, writes this nop_pref instruction to that address in memory through the memory controller, and sets the E field of this nop_pref instruction to invalid before writing it back to its original address; if the nop_pref must be moved later, the prefetch module searches the NOP buffer for the nearest available nop_pref address following this nop_pref instruction, writes this nop_pref instruction to that address, and writes the instruction, with its E field set to invalid, back to the original address.
5. The instruction prefetch method combining dynamic and static techniques for a VLIW microprocessor according to claim 1, characterized in that, when a program runs on the embedded VLIW microprocessor, the specific execution steps are:
the compiler estimates, from the CPU clock, memory bandwidth, and cache structure, the time needed to transfer one instruction block requested by the CPU core, and changes a NOP instruction in the compiled assembly code into the prefetch instruction nop_pref, so that the program executes the prefetch instruction first and the instruction block is fetched from memory into the instruction cache in advance; when the block is executed, its instructions are already ready in the cache and the CPU core can execute them immediately without waiting; the compiler performs this prefetch prediction for all instruction blocks and replaces the corresponding NOPs with nop_pref;
during program execution, the prefetch controller dynamically adjusts the position of nop_pref according to the actual run-time results;
after the program has run for a while in the emulation environment, the debug host stops it through the emulator, reads the program code in RAM back to the debug host, extracts the prefetch controller's dynamic adjustments of the nop_pref positions, i.e. the addresses and contents of all nop_pref instructions, and feeds them back to the compiler; the compiler adjusts its static prefetch strategy accordingly, i.e. adjusts the position of nop_pref in the program; the debug host downloads the new program code with the adjusted nop_pref positions to RAM and the VLIW microprocessor re-executes the program; this process is repeated, the compiler continually revising its static prefetch strategy according to the prefetch positions dynamically adjusted by the hardware, and after a certain number of iterations the revision stops, yielding a prefetch prediction mechanism tailored to this application system; finally, when the debugging process ends, the compiler produces program code with fully optimized prefetching, which is burned into the embedded application system as the final program.
CNB2004100467655A 2004-09-17 2004-09-17 Instruction prefetch method combining dynamic and static techniques for a VLIW microprocessor Expired - Lifetime CN1295599C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2004100467655A CN1295599C (en) 2004-09-17 2004-09-17 Instruction prefetch method combining dynamic and static techniques for a VLIW microprocessor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2004100467655A CN1295599C (en) 2004-09-17 2004-09-17 Instruction prefetch method combining dynamic and static techniques for a VLIW microprocessor

Publications (2)

Publication Number Publication Date
CN1598761A CN1598761A (en) 2005-03-23
CN1295599C 2007-01-17

Family

ID=34665699

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2004100467655A Expired - Lifetime CN1295599C (en) 2004-09-17 2004-09-17 Instruction prefetch method of association of dynamic and inactive of overlength instruction word structure microprocessor

Country Status (1)

Country Link
CN (1) CN1295599C (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9043579B2 (en) 2012-01-10 2015-05-26 International Business Machines Corporation Prefetch optimizer measuring execution time of instruction sequence cycling through each selectable hardware prefetch depth and cycling through disabling each software prefetch instruction of an instruction sequence of interest
CN103377033B (en) * 2012-04-12 2016-01-13 无锡江南计算技术研究所 Arithmetic core and instruction management method thereof
US9330011B2 (en) * 2013-09-20 2016-05-03 Via Alliance Semiconductor Co., Ltd. Microprocessor with integrated NOP slide detector
CN113672555B (en) * 2021-07-13 2024-04-19 杭州中天微系统有限公司 Processor core, processor, system on chip and debug system

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6363475B1 (en) * 1997-08-01 2002-03-26 Micron Technology, Inc. Apparatus and method for program level parallelism in a VLIW processor

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6363475B1 (en) * 1997-08-01 2002-03-26 Micron Technology, Inc. Apparatus and method for program level parallelism in a VLIW processor

Also Published As

Publication number Publication date
CN1598761A (en) 2005-03-23

Similar Documents

Publication Publication Date Title
Lin et al. Reducing DRAM latencies with an integrated memory hierarchy design
US10846450B2 (en) Device for simulating multicore processors
US9235393B2 (en) Statically speculative compilation and execution
EP3028149B1 (en) Software development tool
Raza et al. GPU-accelerated data management under the test of time.
Arnau et al. Boosting mobile GPU performance with a decoupled access/execute fragment processor
US20110022821A1 (en) System and Methods to Improve Efficiency of VLIW Processors
CN111538679A (en) A Design of Processor Data Prefetching Based on Embedded DMA
CN102968395B (en) Method and device for accelerating memory copy of microprocessor
CN1650266A (en) Supports time-division multiplexed speculative multithreading for single-threaded applications
Vander Wiel et al. A compiler-assisted data prefetch controller
CN1295599C (en) Instruction prefetch method of association of dynamic and inactive of overlength instruction word structure microprocessor
US20080201312A1 (en) Systems and methods for a devicesql parallel query
CN114281543A (en) A system and method for realizing storage-computing integration based on solid-state storage
CN112148366A (en) FLASH acceleration method for reducing power consumption and improving performance of chip
US11403082B1 (en) Systems and methods for increased bandwidth utilization regarding irregular memory accesses using software pre-execution
Ro et al. Design and evaluation of a hierarchical decoupled architecture
Li et al. HODS: Hardware object deserialization inside SSD storage
Ungethüm et al. Overview on hardware optimizations for database engines
CN1934542A (en) A cache mechanism
CN103559154B (en) The method of memory access time delay is hidden in a kind of reconfigurable system
Yang et al. A programmable memory hierarchy for prefetching linked data structures
Kim et al. Adaptive Compiler Directed Prefetching for EPIC Processors.
Lee et al. Application specific low latency instruction cache for NAND flash memory based embedded systems
CN120448302A (en) A low-latency tightly coupled data communication method for custom function modules

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CX01 Expiry of patent term
CX01 Expiry of patent term

Granted publication date: 20070117