CN102566974A

CN102566974A - Instruction acquisition control method based on simultaneous multithreading

Info

Publication number: CN102566974A
Application number: CN2012100108958A
Authority: CN
Inventors: 李静梅; 关海洋
Original assignee: Harbin Engineering University
Current assignee: Harbin Engineering University
Priority date: 2012-01-14
Filing date: 2012-01-14
Publication date: 2012-07-11
Anticipated expiration: 2032-01-14
Also published as: CN102566974B

Abstract

The invention provides an instruction fetch control method based on simultaneous multithreading. In each clock cycle of the processor, the instruction fetching unit reads the PC value of the instruction according to the program counter, first selects two threads with higher priority as the instruction fetching thread, and then calculates the actual number of instructions required by each thread, Perform the operation of reading instructions; the dual-priority resource allocation mechanism calculates the system resources required by the thread in the fetching phase according to the two parameters of the thread IPC value and the Cache failure rate, and completes the dynamic allocation of resources; and the TBHBP branch predictor Cooperate with the fetching operation of the fetching component, by connecting the global historical information and local historical information read into the branch instruction Bi, as the index of the secondary pattern matching table PHT, to obtain the pattern matching bit Sc, and input the calculation result to Branch result output table BRT; when the branch instruction Bi is executed again, judge whether the CONF field is greater than or equal to 2 through the selector Selector, if so, directly output the recorded branch result, and finally put the fetched instruction into the instruction cache, Complete all operations of instruction fetch control.

Description

Instruction Fetching Control Method Based on Simultaneous Multithreading

技术领域 technical field

本发明涉及的是一种取指控制方法。具体地说是一种基于同时多线程的指令获取处理方法。The invention relates to an instruction fetch control method. Specifically, it is an instruction acquisition processing method based on simultaneous multithreading.

背景技术 Background technique

随着计算机体系结构的发展，为顺应人们对高性能处理器的迫切需求，同时多线程处理器应运而生，成为目前主流的微处理器结构。针对同时多线程处理器的各项研究变得十分活跃，同时多线程处理器的取指控制方法作为高性能处理器领域的研究热点备受关注。With the development of computer architecture, in order to meet people's urgent needs for high-performance processors, multi-thread processors emerged as the times require, and have become the mainstream microprocessor structure. Various researches on simultaneous multi-threaded processors have become very active, and at the same time, the instruction fetch control method of multi-threaded processors has attracted much attention as a research hotspot in the field of high-performance processors.

近年来，国内外许多专家学者和科研机构对其展开积极的研究和探索。在取指策略方面，美国华盛顿大学的Tullsen教授提出目前公认的性能较好的取指策略ICOUNT。ICOUNT策略授予运行速度快的线程以较高的优先级，有效地阻止某一线程阻塞指令队列，使得指令队列中的指令并行最大化，也是传统处理器中取指性能最好的，但因其取指带宽利用不均衡、指令队列冲突率高的缺点，因此极大地限制了同时多线程处理器性能的充分发挥。在分支预测器方面，由McFarling提出Gshare预测器，通过将地址高位和历史低位做“异或”处理，使得出现干扰的分支指令被映像到不同的预测表目项，有效地缓解了线程间指令互扰现象的发生，但却可能将新的分支别名干扰引入到原本不发生冲突的分支指令之间，因此分支预测性能还有待提高。In recent years, many experts, scholars and scientific research institutions at home and abroad have carried out active research and exploration on it. In terms of fetching strategies, Professor Tullsen of the University of Washington in the United States proposed the currently recognized fetching strategy ICOUNT with better performance. The ICOUNT policy grants higher priority to fast-running threads, effectively preventing a certain thread from blocking the instruction queue, maximizing the parallelism of instructions in the instruction queue, and it is also the best instruction fetch performance in traditional processors, but due to its The shortcomings of unbalanced instruction fetch bandwidth utilization and high instruction queue conflict rate greatly limit the full performance of simultaneous multi-threaded processors. In terms of branch predictors, the Gshare predictor proposed by McFarling, through the "XOR" processing of the high address and the historical low, makes the branch instructions that appear to be interfered be mapped to different prediction entries, effectively alleviating the inter-thread instructions. Mutual interference occurs, but it may introduce new branch alias interference into branch instructions that do not conflict originally, so the branch prediction performance needs to be improved.

发明内容 Contents of the invention

本发明的目的在于提供一种能提高处理器的指令吞吐率，取指带宽利用均匀，能降低指令队列冲突率和提高分支预测性能的基于同时多线程的取指控制方法。The object of the present invention is to provide an instruction fetching control method based on simultaneous multi-threading that can improve the instruction throughput rate of the processor, uniformly utilize the instruction fetch bandwidth, reduce the instruction queue conflict rate and improve the branch prediction performance.

本发明的目的是这样实现的：The purpose of the present invention is achieved like this:

步骤一：在处理器的每一个时钟周期，取指部件根据程序计数器读取指令的PC值；Step 1: In each clock cycle of the processor, the fetching component reads the PC value of the instruction according to the program counter;

步骤二：通过T选2多路选择器选择指令队列项数计数器值最小的两个线程进行输出，假设线程1的优先级高于线程2；Step 2: Select two threads with the smallest counter values of the number of command queue items through the T selection 2 multiplexer to output, assuming that the priority of thread 1 is higher than that of thread 2;

步骤三：线程1的计数值先经加法器和乘法器执行多项表达式的运算，之后将结果值依次进行一次按位取反和模16运算操作，将输出值通过2选1选择器与取指带宽进行比较，取较小值；除读取指令的计算外，线程2的执行过程与1相同，对于线程2，读取的指令数为线程1的取指数与取指带宽的差值；Step 3: The count value of thread 1 first performs the operation of multiple expressions through the adder and multiplier, and then performs a bitwise inversion and modulo 16 operation on the result value, and passes the output value through the 2-to-1 selector and Compare the instruction fetch bandwidth and take the smaller value; except for the calculation of reading instructions, the execution process of thread 2 is the same as that of 1. For thread 2, the number of instructions read is the difference between thread 1's index fetch and instruction fetch bandwidth ;

步骤四：将两个线程的输出结果送入取指部件寄存器完成取指带宽的划分；Step 4: Send the output results of the two threads into the register of the instruction fetch unit to complete the division of the instruction fetch bandwidth;

步骤五：双优先级资源分配机制按照线程IPC值和Cache失效率这两项参数，计算线程在取指阶段所需的系统资源，完成资源的动态分配操作。Step 5: The dual-priority resource allocation mechanism calculates the system resources required by the thread in the instruction fetching phase according to the two parameters of the thread IPC value and the cache failure rate, and completes the dynamic allocation of resources.

步骤六：判断是否存在分支指令；若有，则根据分支指令Bi的PC值索引分支预测信息表BPIT，读取分支指令所属线程索引号TID；反之，则将读取的指令送入指令Cache；Step 6: Determine whether there is a branch instruction; if so, index the branch prediction information table BPIT according to the PC value of the branch instruction Bi, and read the thread index number TID to which the branch instruction belongs; otherwise, send the read instruction into the instruction cache;

步骤七：通过获取的TID编号索引线程分支历史寄存器信息表TBHRIT，读取线程预测的分支历史信息BPHI，作为分支预测的全局历史信息；同时，通过获取的指令PC值索引分支目标地址历史寄存器信息表BTAHRIT，读取分支指令的目标地址BPTA，并根据指令地址读取用于分支预测的局部历史信息；Step 7: Index the thread branch history register information table TBHRIT through the obtained TID number, read the branch history information BPHI predicted by the thread, as the global history information of the branch prediction; at the same time, index the branch target address history register information through the obtained instruction PC value Table BTAHRIT, reads the target address BPTA of the branch instruction, and reads the local history information used for branch prediction according to the instruction address;

步骤八：将每个线程的分支历史信息BHR和根据目标地址读取的历史信息BHT通过哈希函数结合在一起，作为二级模式匹配表PHT的索引；Step 8: Combine the branch history information BHR of each thread and the history information BHT read according to the target address through a hash function as an index of the secondary pattern matching table PHT;

步骤九：通过拼接的历史信息索引PHT表获取分支指令的模式历史位Sc，用于实际的分支预测操作；Step 9: Obtain the mode history bit Sc of the branch instruction through the concatenated historical information index PHT table, and use it for the actual branch prediction operation;

步骤十：将获取的模式历史位Sc输入到预测决定函数，完成分支预测结果的计算操作，同时，通过状态转换函数δ来完成模式历史位的更新操作，更新后的模式历史位将由原来的Ri，c-k Ri，c-k+1......Ri，c-1变为Ri，c-k+1Ri，c-k+2......Ri，c；Step 10: Input the obtained mode history bit Sc into the prediction decision function to complete the calculation operation of the branch prediction result. At the same time, complete the update operation of the mode history bit through the state transition function δ. The updated mode history bit will be replaced by the original Ri , c-k Ri, c-k+1...Ri, c-1 becomes Ri, c-k+1Ri, c-k+2...Ri, c;

步骤十一：将分支指令Bi的预测结果写入分支结果输出表BRT中；当下次有相同的分支指令被预测时，若预测结果与BRT表中的PRED值相同，则CONF加1；反之，CONF做减1操作；Step 11: Write the prediction result of the branch instruction Bi into the branch result output table BRT; when the same branch instruction is predicted next time, if the prediction result is the same as the value of PRED in the BRT table, add 1 to CONF; otherwise, CONF does minus 1 operation;

步骤十二：通过TBHRIT表的更新电路，将获取的分支输出结果Ri，c左移入线程历史寄存器中的最末位上，并将预测的历史信息更新为分支指令提交的历史信息；Step 12: through the update circuit of the TBHRIT table, the obtained branch output results Ri, c are left shifted into the last bit in the thread history register, and the predicted historical information is updated to the historical information submitted by the branch instruction;

步骤十三：通过BTAHRIT表的更新电路，将获取分支输出结果Ri，c所对应的目标地址历史信息左移入地址历史寄存器中的最末位上，并将预测的分支指令目标地址更新为分支指令提交时的实际地址信息；Step 13: Through the update circuit of the BTAHRIT table, move the history information of the target address corresponding to the obtained branch output result Ri, c to the left into the last bit in the address history register, and update the predicted target address of the branch instruction to the branch instruction Physical address information at the time of submission;

步骤十四：当分支预测器对下一分支指令Bi+1进行分支预测时，首先根据其PC值索引BRT表中的CONF字段；若CONF大于等于2，则BPIT表中的TAG字段记为1，分支预测电路将不对指令Bi+1进行分支预测操作，而是直接将存储的分支结果输出；反之，若CONF小于2，则BPIT表中的TAG字段记为0，分支指令重新执行分支预测操作，并将预测结果与BRT表中的数据进行比对，完成CONF字段和PRED字段的更新操作；最后，将预测结果告知取指单元；Step 14: When the branch predictor performs branch prediction on the next branch instruction Bi+1, it first indexes the CONF field in the BRT table according to its PC value; if CONF is greater than or equal to 2, then the TAG field in the BPIT table is recorded as 1 , the branch prediction circuit will not perform the branch prediction operation on the instruction Bi+1, but directly output the stored branch result; on the contrary, if CONF is less than 2, the TAG field in the BPIT table will be recorded as 0, and the branch instruction will re-execute the branch prediction operation , and compare the prediction result with the data in the BRT table, and complete the update operation of the CONF field and the PRED field; finally, inform the fetching unit of the prediction result;

步骤十五：若在分支预测的整个过程中出现分支误预测现象，则将处理器启动误预测处理机制即时停止剩余操作，并取消误预测分支指令后所属同一线程的正在流水线中运行的全部指令，线程的PC值调整为分支后的正确目标指令地址，然后从新的地址重新开始读取指令执行；同时，根据分支的实际执行结果调整分支结果输出表BRT中相应表目项的CONF字段和PRED字段，以供该分支指令再次执行时使用。Step 15: If a branch misprediction phenomenon occurs during the whole process of branch prediction, the processor will start the misprediction processing mechanism to stop the remaining operations immediately, and cancel all instructions running in the pipeline belonging to the same thread after the mispredicted branch instruction , the PC value of the thread is adjusted to the correct target instruction address after the branch, and then restarts reading instruction execution from the new address; at the same time, adjusts the CONF field and PRED of the corresponding entry in the branch result output table BRT according to the actual execution result of the branch field for use when the branch instruction is executed again.

本发明还可以包括：The present invention may also include:

1、所述计算线程在取指阶段所需的系统资源中，所述系统资源包括：取指带宽、指令队列长度、保留站队列长度，1. Among the system resources required by the calculation thread in the instruction fetch phase, the system resources include: instruction fetch bandwidth, instruction queue length, reserved station queue length,

资源分配的具体方式为：The specific method of resource allocation is as follows:

$Ni Ni = = \frac{{P P}_{Ti Ti}}{{P P}_{Ti Ti} + + {P P}_{Tj Tj}} \times \times R R$

其中，P_Ti和P_Tj分别表示线程Ti和Tj的资源分配优先级，Ni表示分配给线程Ti的资源数目，R表示系统资源的总数；Wherein, P _Ti and P _Tj respectively represent the resource allocation priorities of threads Ti and Tj, Ni represents the number of resources allocated to thread Ti, and R represents the total number of system resources;

在主优先级和次优先级均不相同的情况，以线程的IPC值与次优先级数的比值作为资源分配评定的依据，资源分配的具体方式为：In the case where the primary priority and secondary priority are different, the ratio of the IPC value of the thread to the secondary priority is used as the basis for resource allocation evaluation. The specific method of resource allocation is as follows:

$Ni Ni = = \frac{TLi TL / / CLi CLi}{TLi TL / / CLi CLi + + TLj TLj / / CLj CL} \times \times R R$

其中，TLi和TLj分别表示线程Ti和Tj的主优先级；CLi和CLj分别表示线程Ti和Tj的次优先级数，其值可取1、2、3；Ni表示分配给线程Ti的资源数目，R表示系统资源的总数。Among them, TLi and TLj represent the main priority of threads Ti and Tj respectively; CLi and CLj represent the sub-priority numbers of threads Ti and Tj respectively, and their values can be 1, 2, 3; Ni represents the number of resources allocated to thread Ti, R represents the total number of system resources.

2、在将每个线程的分支历史信息BHR和根据目标地址读取的历史信息BHT通过哈希函数结合在一起，作为二级模式匹配表PHT的索引过程中，分别对

BHR+BHT和BHT+BHR三种历史信息的连接方式进行分支预测性能的测试，确定两种历史信息最佳的连接方式；对于二级模式匹配表PHT的索引，采取线程历史信息和地址历史信息拼接的方式。2. In the process of combining the branch history information BHR of each thread and the history information BHT read according to the target address through a hash function as the indexing process of the secondary pattern matching table PHT, respectively

BHR+BHT and BHT+BHR are used to test the branch prediction performance of three historical information connection methods, and determine the best connection method of the two historical information; for the index of the secondary pattern matching table PHT, thread historical information and address historical information are used The way of splicing.

本发明旨于设计取指策略和分支预测器相结合的取指控制方法FCMBSMT。具体改进为：设计IFSBSMT取指策略控制取指部件的工作时序，提高处理器的取指效率。同时，设计TBHBP分支预测器辅助取指部件，提高取出指令的可用性和有效性，更为有效地提高处理器的指令吞吐率和分支预测性能，具有良好的应用前景和应用价值。The present invention aims to design an instruction fetch control method FCMBSMT which combines an instruction fetch strategy and a branch predictor. The specific improvements are as follows: designing the IFSBSMT instruction fetch strategy to control the working sequence of the instruction fetch components and improve the instruction fetch efficiency of the processor. At the same time, the TBHBP branch predictor is designed to assist instruction fetching components to improve the availability and effectiveness of fetching instructions, and more effectively improve the processor's instruction throughput and branch prediction performance, which has good application prospects and application value.

本发明主要包含如下特点：The present invention mainly comprises following characteristics:

IFSBSMT取指策略部分的整个实施过程包括线程选择、取指带宽划分和系统资源分配三个阶段。The entire implementation process of IFSBSMT fetching strategy includes three stages: thread selection, fetching bandwidth division and system resource allocation.

所谓的线程选择是指在每个时钟周期内，取指部件选择多少线程和对哪些线程取指。在这里，IFSBSMT策略采取ICOUNT 2.8的线程选择方式，即每个时钟周期内选择两个线程进行取指，每次最多读取八条指令，有效避免对取指带宽划分过细导致某些线程因指令Cache失效等原因无法进行取指现象的发生。The so-called thread selection refers to how many threads are selected by the instruction fetching unit and which threads are fetched in each clock cycle. Here, the IFSBSMT strategy adopts the ICOUNT 2.8 thread selection method, that is, two threads are selected for instruction fetching in each clock cycle, and a maximum of eight instructions are read each time, which effectively avoids the fine division of the instruction fetch bandwidth and causes some threads to fail due to instruction cache Instruction fetching cannot be performed due to failure or other reasons.

IFSBSMT策略的第二个阶段是取指带宽的划分，也是整个策略实施的关键阶段。在这一阶段，取指部件根据线程指令的流速和线程在指令队列中的指令数计算本周期内需要读取的指令数。若指令队列中有足够的指令执行则不进行指令的读取，否则按照需求读取一定数目的指令，读取指令的最大数目为最初设定的取指带宽，取指带宽为8。线程在某一时钟周期内执行的指令数约为线程在指令队列中指令数的平方根，那么所需指令数的计算公式就应为公式(1)。The second stage of the IFSBSMT strategy is the division of the fetch bandwidth, which is also a key stage for the implementation of the entire strategy. At this stage, the fetching component calculates the number of instructions to be read in this period according to the flow rate of thread instructions and the number of instructions of the thread in the instruction queue. If there are enough instructions in the instruction queue to be executed, the instruction will not be read. Otherwise, a certain number of instructions will be read as required. The maximum number of read instructions is the initially set instruction fetch bandwidth, and the instruction fetch bandwidth is 8. The number of instructions executed by a thread in a certain clock cycle is about the square root of the number of instructions in the instruction queue of the thread, so the calculation formula for the required number of instructions should be formula (1).

$I I = = {I I}_{fs fs} - - \sqrt{{I I}^{' '}} - - - - - - ((11))$

在公式(1)中，I为线程在某个时钟周期内需读取的指令数，Ifs为线程在运行过程中的指令流速，其值应为线程IPC与某一系数的乘积，I′为线程在指令队列中的指令数。将指令流速计算方式带入公式(1)后改写为公式(2)。In the formula (1), I is the number of instructions that the thread needs to read in a certain clock cycle, Ifs is the instruction flow rate of the thread during the running process, and its value should be the product of the thread IPC and a certain coefficient, and I' is the thread The number of instructions in the instruction queue. Bring the command flow rate calculation method into formula (1) and rewrite it as formula (2).

$I I = = IPC IPC \times \times P P - - \sqrt{{I I}^{' '}} - - - - - - ((22))$

在处理器的实际运行过程中，由于Cache失效和分支误预测等因素的存在，系统实际获取的线程IPC值往往低于预估值，因此需乘以一系数P来纠正这一误差，即上面提到的指令流速计算方式。During the actual operation of the processor, due to the existence of factors such as cache failure and branch misprediction, the thread IPC value actually obtained by the system is often lower than the estimated value, so it needs to be multiplied by a coefficient P to correct this error, that is, the above Mentioned command flow rate calculation method.

在系统初始化、Cache失效和分支误预测的情况下，取指部件并不进行指令的读取操作，因此此时的线程IPC值均为0，那么线程的指令速度也相应的为0，严重影响线程的执行速度，为避免此种现象的出现，将线程的IPC进行加1处理，那么公式(2)将改写为(3)。In the case of system initialization, Cache failure, and branch misprediction, the instruction fetch unit does not read instructions, so the thread IPC value at this time is 0, and the instruction speed of the thread is correspondingly 0, which seriously affects For the execution speed of the thread, in order to avoid this phenomenon, the IPC of the thread is increased by 1, then the formula (2) will be rewritten as (3).

$I I = = ((IPC IPC + + 11)) \times \times P P - - \sqrt{{I I}^{' '}} - - - - - - ((33))$

在IFSBSMT策略具体的硬件实现过程中，线程IPC值的计算不仅需要额外的硬件开销，并且需通过线程的预执行和采样来修正IPC值，严重影响指令的执行速度。因此，通过采用将参数I按位取反并对因子P进行取模的方式来化简繁琐的线程IPC值计算，有效地减少了IFSBSMT策略所需的硬件开销。新的公式表示见公式(4)。In the specific hardware implementation process of the IFSBSMT strategy, the calculation of the thread IPC value not only requires additional hardware overhead, but also needs to correct the IPC value through thread pre-execution and sampling, which seriously affects the execution speed of instructions. Therefore, by inverting the parameter I bit by bit and moduloing the factor P to simplify the cumbersome thread IPC value calculation, the hardware overhead required by the IFSBSMT strategy is effectively reduced. See formula (4) for the new formula expression.

$I I = = \overset{&OverBar; &OverBar;}{\sqrt{{I I}^{' '}}} mod mod P P - - - - - - ((44))$

在公式(4)中，需对线程在指令队列中的指令数I′进行开根号操作，其硬件实现的复杂度过高，通过采用二阶泰勒公式对根号进行处理，将公式(4)改写为(5)。In formula (4), it is necessary to carry out the root number operation on the instruction number I′ of the thread in the instruction queue, and the complexity of its hardware implementation is too high. By using the second-order Taylor formula to process the root number, the formula (4 ) is rewritten as (5).

$I I = = \overset{&OverBar; &OverBar;}{11 / / 22 (({I I}^{' '} + + 11))} mod mod P P - - - - - - ((55))$

尽管利用二阶泰勒公式对公式(5)进行优化所得的结果为近似值，但并不影响取指的正确性，况且和实际取出的指令数相比，这点误差可以忽略不计。Although the result obtained by using the second-order Taylor formula to optimize formula (5) is an approximate value, it does not affect the correctness of instruction fetching, and compared with the actual number of fetched instructions, this error can be ignored.

在每个时钟周期内，某一线程所取指令数的最大值不应超过预设的取指带宽N。因此，线程需读取的指令数应为公式(5)的计算结果与取指带宽的较小值，且参数P的最优值应取16，预设的取指带宽为8。那么，最终每周期线程所取指令数的计算公式如(6)所示。In each clock cycle, the maximum number of instructions fetched by a certain thread should not exceed the preset instruction fetch bandwidth N. Therefore, the number of instructions to be read by a thread should be the smaller value of the calculation result of formula (5) and the instruction fetch bandwidth, and the optimal value of the parameter P should be 16, and the preset instruction fetch bandwidth is 8. Then, the formula for calculating the number of instructions fetched by the thread per cycle is shown in (6).

$I I = = MIN MIN ((\overset{&OverBar; &OverBar;}{11 / / 22 (({I I}^{' '} + + 11))} mod mod 16.8 16.8)) - - - - - - ((66))$

第二阶段只对取指带宽的划分进行设计，但由于在线程取指过程中L2 Cache失效现象时有发生，导致共享资源可能会被某一线程独占，影响其他后继线程的顺利执行，限制SMT(Simultaneous Multithreading)处理器总体性能的提升。因此，还需对线程的共享资源进行合理分配来解决这一问题。In the second stage, only the division of instruction fetch bandwidth is designed. However, due to frequent occurrence of L2 Cache failure during thread instruction fetching, shared resources may be monopolized by a certain thread, which affects the smooth execution of other subsequent threads and limits SMT. (Simultaneous Multithreading) The improvement of the overall performance of the processor. Therefore, it is also necessary to reasonably allocate shared resources of threads to solve this problem.

IFSBSMT策略的最后阶段就是对系统资源的分配，采用线程IPC和L2 Cache失效率的双优先级方式来对线程的共享资源进行动态分配，其实现的基本原理：根据线程IPC值设定的资源分配优先级为主优先级；根据L2Cache失效率设定的优先级为次优先级，次优先级由高到低依次划分为CLevel 1、CLeve2和CLeve3，其评定的标准：线程未发生L1 data Cache和L2 Cache失效为CLevel 1、线程发生L1 data Cache失效、L2 Cache未失效为CLevel 2、线程发生L2 Cache失效为CLevel 3。在主优先级不同次优先级相同的情况下，将以主优先级作为资源分配的依据，主优先级高的线程具有较高的资源分配权限。在主优先级相同次优先级不同的情况下，将以次优先级作为资源分配的依据，次优先级高的线程具有较高的资源分配权限。资源分配的具体公式如(7)所示。The final stage of the IFSBSMT strategy is the allocation of system resources. The dual priority method of thread IPC and L2 Cache failure rate is used to dynamically allocate thread shared resources. The basic principle of its realization is: resource allocation set according to thread IPC value The priority is the main priority; the priority set according to the L2Cache failure rate is the secondary priority, and the secondary priority is divided into CLevel 1, CLeve2, and CLeve3 from high to low. The evaluation standard: the thread does not have L1 data Cache and L2 Cache invalidation is CLevel 1, thread L1 data Cache invalidation, L2 Cache not invalidation is CLevel 2, thread L2 Cache invalidation is CLevel 3. When the main priority is different and the secondary priority is the same, the main priority will be used as the basis for resource allocation, and threads with higher main priority have higher resource allocation rights. In the case of the same main priority and different sub-priorities, the sub-priority will be used as the basis for resource allocation, and threads with higher sub-priority have higher resource allocation rights. The specific formula of resource allocation is shown in (7).

$Ni Ni = = \frac{{P P}_{Ti Ti}}{{P P}_{Ti Ti} + + {P P}_{Tj Tj}} \times \times R R - - - - - - ((77))$

在公式(7)中，PTi和PTj分别表示线程Ti和Tj的资源分配优先级，Ni表示分配给线程Ti的资源数目，R表示系统资源的总数。In formula (7), PTi and PTj represent the resource allocation priorities of threads Ti and Tj respectively, Ni represents the number of resources allocated to thread Ti, and R represents the total number of system resources.

在主优先级和次优先级均不相同的情况，将以线程的IPC值与次优先级数的比值作为资源分配评定的依据。资源分配的具体公式如(8)所示。In the case where the primary priority and secondary priority are not the same, the ratio of the thread's IPC value to the number of secondary priorities will be used as the basis for resource allocation evaluation. The specific formula of resource allocation is shown in (8).

$Ni Ni = = \frac{TLi TL / / CLi CLi}{TLi TL / / CLi CLi + + TLj TLj / / CLj CL} \times \times R R - - - - - - ((88))$

在公式(8)中，TLi和TLj分别表示线程Ti和Tj的主优先级，CLi和CLj分别表示线程Ti和Tj的次优先级数，其值可取1、2、3，Ni表示分配给线程Ti的资源数目，R表示系统资源的总数。In formula (8), TLi and TLj represent the main priority of threads Ti and Tj respectively, CLi and CLj represent the secondary priorities of threads Ti and Tj respectively, and their values can be 1, 2, 3, and Ni represents the Ti resource number, R represents the total number of system resources.

TBHBP分支预测器采用两级自适应分支预测器为基础，线程间采用独立的分支历史寄存器和地址历史寄存器，对于模式匹配表则采取线程共享的方式，并应用分支结果输出表来对分支指令结果进行存储。其具体的硬件结构如图1所示。The TBHBP branch predictor is based on a two-level adaptive branch predictor. Threads use independent branch history registers and address history registers. For the pattern matching table, thread sharing is adopted, and the branch result output table is used to check the result of the branch instruction. to store. Its specific hardware structure is shown in Figure 1.

如图1所示，TBHBP分支预测器主要包含六个部分：分支预测信息表BPIT(Branch PredictInformation Table)、线程分支历史寄存器信息表TBHRIT(Thread Branch History RegisterInformation Table)、分支目标地址历史寄存器信息表BTAHRIT(Branch Target Address HistoryRegister Information Table)、模式匹配表PHT(Pattern History Table)、分支结果输出表BRT(Branch Result Table)、以及除分支预测信息表和PHT表外，其他三个表的逻辑更新电路。As shown in Figure 1, the TBHBP branch predictor mainly includes six parts: branch prediction information table BPIT (Branch Predict Information Table), thread branch history register information table TBHRIT (Thread Branch History Register Information Table), branch target address history register information table BTAHRIT (Branch Target Address HistoryRegister Information Table), pattern matching table PHT (Pattern History Table), branch result output table BRT (Branch Result Table), and the logic update circuit of the other three tables except the branch prediction information table and PHT table.

分支预测信息表BPIT是根据分支指令的PC值进行索引的，每个线程对应一组独立的表目项，每一个表目项包含4个字段：TID字段为线程的索引号，用于索引TBHRIT表；PC字段用于BTAHRIT表的索引；TAG字段用于与分支结果表的PC值进行比较，以确定该分支指令是否需要进行分支预测；CONF字段作为分支预测的阈值，以判断是否应用BRT表中的分支预测结果。当线程的某一分支指令进入流水线后，分支预测电路用其PC值索引BPIT表的某一表目项。The branch prediction information table BPIT is indexed according to the PC value of the branch instruction. Each thread corresponds to a set of independent entries. Each entry contains 4 fields: the TID field is the index number of the thread, which is used to index TBHRIT table; the PC field is used for the index of the BTAHRIT table; the TAG field is used to compare with the PC value of the branch result table to determine whether the branch instruction needs to perform branch prediction; the CONF field is used as the threshold of the branch prediction to determine whether to apply the BRT table The branch prediction results in . When a certain branch instruction of the thread enters the pipeline, the branch prediction circuit uses its PC value to index a certain entry of the BPIT table.

线程分支历史寄存器信息表TBHRIT是通过BPIT表中的TID字段进行索引的，每个线程对应一组独立的表目项，每一个表目项包含3个字段：TID字段为线程的索引号；PC字段用于对分支指令的索引；BPHI字段为预测的分支历史信息，用于分支历史信息位的拼接，并在分支指令提交时进行更新。当线程的某一指令进入流水线的译码阶段时，分支预测电路用其PC值索引TBHRIT表的某一表目项。The thread branch history register information table TBHRIT is indexed through the TID field in the BPIT table. Each thread corresponds to a set of independent entries, and each entry contains 3 fields: the TID field is the index number of the thread; PC The field is used to index the branch instruction; the BPHI field is the predicted branch history information, which is used for splicing the branch history information bits, and is updated when the branch instruction is submitted. When a certain instruction of the thread enters the decoding stage of the pipeline, the branch prediction circuit uses its PC value to index a certain entry of the TBHRIT table.

分支目标地址历史寄存器信息表BTAHRIT是通过BPIT表中的PC字段进行索引的，每个线程对应一组独立的表目项，每一个表目项包含3个字段：TID字段为线程的索引号；PC字段用于对分支指令的索引；BPTA字段为分支指令的目标地址信息，用于读取每个分支指令目标地址的局部分支历史信息，之后来完成分支历史信息位的拼接，并在分支指令提交时进行更新。当线程的某一分支指令进入流水线的译码阶段时，分支预测电路用其PC值索引BTAHRIT表的某一表目项。The branch target address history register information table BTAHRIT is indexed through the PC field in the BPIT table. Each thread corresponds to a set of independent entries, and each entry contains 3 fields: the TID field is the index number of the thread; The PC field is used to index the branch instruction; the BPTA field is the target address information of the branch instruction, which is used to read the local branch history information of the target address of each branch instruction, and then complete the splicing of the branch history information bits, and in the branch instruction Update when submitted. When a branch instruction of a thread enters the decoding stage of the pipeline, the branch prediction circuit uses its PC value to index a certain entry of the BTAHRIT table.

模式匹配表PHT是通过线程的综合分支历史进行索引的，综合历史信息是由TBHRIT表记录的线程分支历史信息和BTAHRIT表中依据分支指令目标地址读取的局部分支历史信息拼接而成的，对PHT表的应用采用多线程共享的方式。The pattern matching table PHT is indexed through the comprehensive branch history of the thread. The comprehensive historical information is spliced from the thread branch historical information recorded in the TBHRIT table and the local branch historical information read according to the branch instruction target address in the BTAHRIT table. The application of the PHT table adopts the method of multi-thread sharing.

分支结果输出表BRT是通过BPIT表中的PC字段进行索引的，每个线程对应一组独立的表目项，每一个表目项包含4个字段：TID字段为线程的索引号；PC字段用于对分支指令的索引；PRED用于存储分支指令的预测结果；CONF作为分支预测的阈值。当线程的某一分支指令进入流水线的写回阶段时，分支结果更新电路用其PC值索引BRT表的所有表目项，来完成对其的更新操作。同时，对TBHRIT表和BTAHRIT表也将应用相关的更新电路来完成表目项的更新操作。The branch result output table BRT is indexed through the PC field in the BPIT table. Each thread corresponds to a set of independent entries, and each entry contains 4 fields: the TID field is the index number of the thread; the PC field is used for For the index of the branch instruction; PRED is used to store the prediction result of the branch instruction; CONF is used as the threshold of the branch prediction. When a certain branch instruction of the thread enters the write-back stage of the pipeline, the branch result update circuit uses its PC value to index all entries of the BRT table to complete its update operation. At the same time, the relevant update circuit will also be applied to the TBHRIT table and the BTAHRIT table to complete the update operation of the entries.

本发明在处理器的取指控制上有独到的优势，通过取指单元将IFSBSMT取指策略和TBHBP分支预测器有效的结合在一起，不仅使其能充分发挥自身的技术优势，两者的无缝融合更进一步提高了FCMBSMT方法的优越性。The present invention has unique advantages in the fetching control of the processor. The IFSBSMT fetching strategy and the TBHBP branch predictor are effectively combined through the fetching unit, which not only enables it to give full play to its own technical advantages, but also makes the two seamless Seam fusion further improves the superiority of the FCMBSMT method.

在处理器的取指过程中，IFSBSMT策略通过线程选择、取指带宽划分和动态资源分配这三个实施阶段，对取指部件读取指令的操作进行控制，使处理器的取指带宽利用更为均衡，线程平均占用指令队列的长度明显减小，指令队列的冲突率接近0，极大地提高处理器的指令吞吐率。但美中不足的是，由于SMT处理器在每个时钟周期内的取指数较传统处理器大幅增多，处理器的分支预测性能呈现明显的下降趋势。In the process of fetching instructions of the processor, the IFSBSMT strategy controls the operation of reading instructions of the fetching parts through three implementation stages of thread selection, fetching bandwidth division and dynamic resource allocation, so as to make the utilization of the fetching bandwidth of the processor more efficient. For balance, the average length of the command queue occupied by threads is significantly reduced, and the conflict rate of the command queue is close to 0, which greatly improves the command throughput rate of the processor. But the fly in the ointment is that the branch prediction performance of the processor shows an obvious downward trend because the number of index fetches in each clock cycle of the SMT processor is significantly higher than that of the traditional processor.

TBHBP分支预测器的实现有效的解决了这一问题，通过将线程的全局历史信息和局部历史信息结合为综合历史信息，作为模式匹配表PHT的索引，有效减少了SMT处理中分支信息过时、混乱问题的出现。同时，线程独立共享分支预测资源的模式，极大降低了在SMT处理器下发生分支别名冲突和容量冲突的概率，提高了分支执行的正确性。相比于传统分支预测器，新增的分支预测结果输出表BRT这一硬件结构，通过记录常用分支指令的预测结果，推进了分支指令的预测执行速度，避免分支指令队列堆积现象的出现，促进后续指令的顺利执行。The implementation of the TBHBP branch predictor effectively solves this problem. By combining the global history information and local history information of threads into comprehensive history information, which is used as the index of the pattern matching table PHT, it effectively reduces the obsolescence and confusion of branch information in SMT processing. problem arises. At the same time, the mode that threads independently share branch prediction resources greatly reduces the probability of branch alias conflicts and capacity conflicts under the SMT processor, and improves the correctness of branch execution. Compared with traditional branch predictors, the newly added branch prediction result output table BRT, a hardware structure, improves the predicted execution speed of branch instructions by recording the prediction results of commonly used branch instructions, avoids the phenomenon of branch instruction queue accumulation, and promotes The smooth execution of subsequent instructions.

在两者独立发挥自身技术优势的同时，彼此之间更是通过功能互补的方式，使其各自性能都得到充分的发挥。TBHBP预测器的精度预测使得流水线中分支指令得以正常执行，有效地缓解了分支指令对取指操作的影响，促进了IFSBSMT策略在取指性能上的进一步提升。同时，IFSBSMT策略通过合理的利用取指带宽，减少高优先级线程的取指数量，使得读取的分支指令数目相对减少，减轻了TBHBP预测器的分支预测压力，提高了分支预测器的预测精度和准确性。While the two independently exert their own technical advantages, they also complement each other with functions so that their respective performances can be fully utilized. The accuracy prediction of the TBHBP predictor enables the normal execution of branch instructions in the pipeline, effectively alleviates the impact of branch instructions on instruction fetch operations, and promotes the further improvement of IFSBSMT strategy in instruction fetch performance. At the same time, the IFSBSMT strategy reduces the number of index fetches of high-priority threads through reasonable use of the fetch bandwidth, so that the number of branch instructions read is relatively reduced, which reduces the branch prediction pressure of the TBHBP predictor and improves the prediction accuracy of the branch predictor. and accuracy.

本发明的优点在于有效地克服了传统方法中存在的取指策略不够优化、分支预测性能低下等缺点。多次的实例分析和性能测试结果表明：FCMBSMT取指控制方法与传统的ICG方法相比，指令吞吐率提升了59.3％，指令队列的平均长度减少17.33，分支误预测率和误预测路径取指率分别下降了2.16％和3.28％，极大地促进处理器指令吞吐率和分支预测预测性能的提升，具有良好的应用前景和研究价值。The invention has the advantages of effectively overcoming the disadvantages of insufficient optimization of the instruction fetching strategy and low branch prediction performance existing in the traditional method. Multiple instance analysis and performance test results show that compared with the traditional ICG method, the FCMBSMT instruction fetching control method increases the instruction throughput by 59.3%, reduces the average length of the instruction queue by 17.33%, and leads to branch misprediction rate and misprediction path fetching. The rate drops by 2.16% and 3.28% respectively, which greatly promotes the improvement of processor instruction throughput and branch prediction performance, and has good application prospects and research value.

附图说明 Description of drawings

图1是本发明的TBHBP分支预测器的硬件结构图。Fig. 1 is a hardware structural diagram of the TBHBP branch predictor of the present invention.

图2是本发明的FCMBSMT取指控制方法的硬件结构图。Fig. 2 is a hardware structural diagram of the FCMBSMT fetching control method of the present invention.

图3是本发明的FCMBSMT取指控制方法的实现流程图。Fig. 3 is the realization flowchart of the FCMBSMT fetching control method of the present invention.

图4是本发明的处理器IPC性能测试对比图。Fig. 4 is a comparison chart of processor IPC performance test of the present invention.

图5是本发明的单线程IPC性能测试对比图。FIG. 5 is a comparison chart of the single-thread IPC performance test of the present invention.

图6是本发明的指令队列平均长度性能测试对比图。Fig. 6 is a comparison chart of the performance test of the average length of the instruction queue in the present invention.

图7是本发明的分支误预测率性能测试对比图。Fig. 7 is a comparison chart of branch misprediction rate performance test of the present invention.

图8是本发明的分支误预路径取指率性能测试对比图。Fig. 8 is a comparison chart of performance test of branch misprediction path instruction fetch rate in the present invention.

具体实施方式 Detailed ways

下面结合附图举例对本发明做更详细地描述：The present invention is described in more detail below in conjunction with accompanying drawing example:

FCMBSMT取指控制方法的整个实施过程分为两个阶段：读取指令、指令的分支预测，且这两者的执行顺序没有先后之分，通过两者的互相作用来完成同时多线程处理器的取指操作。结合图2和图3发明的基于同时多线程取指控制FCMBSMT的具体实现流程如下：The entire implementation process of the FCMBSMT instruction fetching control method is divided into two stages: reading instructions, branch prediction of instructions, and the execution order of the two is not sequential, and the simultaneous multi-threaded processor is completed through the interaction of the two. Fetch operation. The specific implementation process of the FCMBSMT based on simultaneous multi-threaded instruction fetching control invented in conjunction with Fig. 2 and Fig. 3 is as follows:

步骤一：在处理器的每一个时钟周期，取指部件根据程序计数器读取指令的PC值。Step 1: In each clock cycle of the processor, the fetching unit reads the PC value of the instruction according to the program counter.

步骤二：通过T选2多路选择器选择指令队列项数计数器值最小的两个线程进行输出，假设线程1的优先级高于线程2。Step 2: Select the two threads with the smallest counter values of the number of items in the instruction queue through the T selection 2 multiplexer for output, assuming that the priority of thread 1 is higher than that of thread 2 .

步骤三：线程1的计数值先经加法器和乘法器执行多项表达式的运算，之后将结果值依次进行一次按位取反和模16运算操作，将输出值通过2选1选择器与取指带宽进行比较，取其较小值。Step 3: The count value of thread 1 first performs the operation of multiple expressions through the adder and multiplier, and then performs a bitwise inversion and modulo 16 operation on the result value, and passes the output value through the 2-to-1 selector and Fetch the bandwidth for comparison and choose the smaller value.

步骤四：除读取指令的计算外，线程2的执行过程与1相同。对于线程2，读取的指令数应为线程1的取指数与取指带宽的差值。Step 4: Except for the calculation of the read instruction, the execution process of thread 2 is the same as that of 1. For thread 2, the number of read instructions should be the difference between thread 1's index fetch and instruction fetch bandwidth.

步骤五：将两个线程的输出结果送入取指部件寄存器完成取指带宽的划分。Step 5: Send the output results of the two threads into the registers of the instruction fetch unit to complete the division of the instruction fetch bandwidth.

步骤六：双优先级资源分配机制按照线程IPC值和Cache失效率这两项参数，通过公式(7)和公式(8)计算线程在取指阶段所需的系统资源，例如：取指带宽、指令队列长度、保留站队列长度等，完成资源的动态分配操作。Step 6: The dual-priority resource allocation mechanism calculates the system resources required by the thread in the instruction fetching phase through formulas (7) and (8) according to the two parameters of thread IPC value and cache failure rate, such as: fetching bandwidth, Command queue length, reserved station queue length, etc., to complete the dynamic allocation of resources.

步骤七：判断是否存在分支指令。若有，则根据分支指令Bi的PC值索引分支预测信息表BPIT，读取分支指令所属线程索引号TID。反之，则将读取的指令送入指令Cache。Step 7: Determine whether there is a branch instruction. If yes, index the branch prediction information table BPIT according to the PC value of the branch instruction Bi, and read the index number TID of the thread to which the branch instruction belongs. Otherwise, the read instruction is sent to the instruction cache.

步骤八：通过获取的TID编号索引线程分支历史寄存器信息表TBHRIT，读取线程预测的分支历史信息BPHI，用其作为分支预测的全局历史信息。同时，通过获取的指令PC值索引分支目标地址历史寄存器信息表BTAHRIT，读取分支指令的目标地址BPTA，并根据指令地址读取用于分支预测的局部历史信息。Step 8: Index the thread branch history register information table TBHRIT through the obtained TID number, read the branch history information BPHI predicted by the thread, and use it as the global history information of the branch prediction. At the same time, index the branch target address history register information table BTAHRIT through the obtained instruction PC value, read the target address BPTA of the branch instruction, and read the local history information for branch prediction according to the instruction address.

步骤九：将每个线程的分支历史信息BHR和根据目标地址读取的历史信息BHT通过哈希函数结合在一起，作为二级模式匹配表PHT的索引。在这里，分别对BHR+BHT和BHT+BHR这三种历史信息的连接方式进行分支预测性能的测试，以确定两种历史信息最佳的连接方式。实验通过运行art-perlbmk、craft-mcf和bip2-lucas这三组两线程工作负载程序，分析在不同的连接方式下分支误预测率和分支误预测路径取指率。具体分析结果如表1所示。Step 9: combine the branch history information BHR of each thread and the history information BHT read according to the target address through a hash function, and use it as an index of the secondary pattern matching table PHT. Here, respectively for BHR+BHT and BHT+BHR are three connection methods of historical information to test the branch prediction performance to determine the best connection method of the two kinds of historical information. The experiment runs three sets of two-thread workload programs, art-perlbmk, craft-mcf, and bip2-lucas, to analyze the branch misprediction rate and branch misprediction path index fetch rate under different connection modes. The specific analysis results are shown in Table 1.

表1不同历史信息连接方式的分支预测性能对比表Table 1 Comparison table of branch prediction performance of different historical information connection methods

由表1分析可知，BHR+BHT相比于其他两种历史信息的连接方式，在分支预测性能上有一定的优势。因此，对于二级模式匹配表PHT的索引，采取线程历史信息和地址历史信息拼接的方式。From the analysis in Table 1, it can be seen that BHR+BHT has certain advantages in branch prediction performance compared to the other two connection methods of historical information. Therefore, for the index of the secondary pattern matching table PHT, the stitching method of thread history information and address history information is adopted.

步骤十：通过拼接的历史信息索引PHT表获取分支指令的模式历史位Sc，用于实际的分支预测操作。Step 10: Obtain the mode history bit Sc of the branch instruction through the concatenated historical information index PHT table, which is used for the actual branch prediction operation.

步骤十一：将获取的模式历史位Sc输入到预测决定函数，完成分支预测结果的计算操作。同时，通过状态转换函数δ来完成模式历史位的更新操作，更新后的模式历史位将有原来的Ri，c-k Ri，c-k+1......Ri，c-1变为Ri，c-k+1Ri，c-k+2......Ri，c。Step 11: Input the obtained pattern history bit Sc into the prediction decision function to complete the calculation operation of the branch prediction result. At the same time, the update operation of the mode history bits is completed through the state transition function δ, and the updated mode history bits will have the original Ri, c-k Ri, c-k+1...Ri, c-1 become Ri , c-k+1Ri, c-k+2...Ri, c.

步骤十二：将分支指令Bi的预测结果写入分支结果输出表BRT中。当下次有相同的分支指令被预测时，若预测结果与BRT表中的PRED值相同，则CONF加1；反之，CONF做减1操作。Step 12: Write the prediction result of the branch instruction Bi into the branch result output table BRT. When the same branch instruction is predicted next time, if the prediction result is the same as the PRED value in the BRT table, then CONF will be incremented by 1; otherwise, CONF will be decremented by 1.

步骤十三：通过TBHRIT表的更新电路，将获取的分支输出结果Ri，c左移入线程历史寄存器中的最末位上，并将预测的历史信息更新为分支指令提交的历史信息。Step 13: Through the update circuit of the TBHRIT table, shift the obtained branch output results Ri, c to the left to the last bit in the thread history register, and update the predicted history information to the history information submitted by the branch instruction.

步骤十四：通过BTAHRIT表的更新电路，将获取分支输出结果Ri，c所对应的目标地址历史信息左移入地址历史寄存器中的最末位上，并将预测的分支指令目标地址更新为分支指令提交时的实际地址信息。Step 14: Through the update circuit of the BTAHRIT table, move the history information of the target address corresponding to the obtained branch output result Ri, c to the left into the last bit in the address history register, and update the predicted target address of the branch instruction to the branch instruction Physical address information at the time of submission.

步骤十五：当分支预测器对下一分支指令Bi+1进行分支预测时，首先根据其PC值索引BRT表中的CONF字段。若CONF大于等于2，则BPIT表中的TAG字段记为1，分支预测电路将不对指令Bi+1进行分支预测操作，而是直接将存储的分支结果输出。反之，若CONF小于2，则BPIT表中的TAG字段记为0，分支指令会按照上述的八个步骤重新执行分支预测操作，并将预测结果与BRT表中的数据进行比对，完成CONF字段和PRED字段的更新操作。最后，将预测结果告知取指单元使其正确的完成取指操作。Step fifteen: When the branch predictor performs branch prediction on the next branch instruction Bi+1, it first indexes the CONF field in the BRT table according to its PC value. If CONF is greater than or equal to 2, the TAG field in the BPIT table is recorded as 1, and the branch prediction circuit will not perform branch prediction operation on instruction Bi+1, but directly output the stored branch result. Conversely, if CONF is less than 2, the TAG field in the BPIT table is recorded as 0, and the branch instruction will re-execute the branch prediction operation according to the above eight steps, and compare the prediction result with the data in the BRT table to complete the CONF field and the update operation of the PRED field. Finally, the prediction result is notified to the fetching unit so that it can correctly complete the fetching operation.

步骤十六：若在分支预测的整个过程中出现分支误预测现象，则将处理器会启动误预测处理机制即时停止剩余操作，并取消误预测分支指令后所属同一线程的正在流水线中运行的全部指令，线程的PC值调整为分支后的正确目标指令地址，然后从新的地址重新开始读取指令执行。同时，根据分支的实际执行结果调整分支结果输出表BRT中相应表目项的CONF字段和PRED字段，以供该分支指令再次执行时使用。Step 16: If a branch misprediction phenomenon occurs during the whole process of branch prediction, the processor will start the misprediction processing mechanism to stop the remaining operations immediately, and cancel all the programs running in the pipeline belonging to the same thread after the mispredicted branch instruction. Instructions, the PC value of the thread is adjusted to the correct target instruction address after the branch, and then the instruction execution is restarted from the new address. At the same time, the CONF field and the PRED field of the corresponding entry in the branch result output table BRT are adjusted according to the actual execution result of the branch, so as to be used when the branch instruction is executed again.

以SPEC 2000基准测试程序为例来说明FCMBSMT方法取指控制的过程，本实验还需设置性能测试基准程序参数、同时多线程模拟器、性能参照对象和性能参数指标，具体参数配置如下：Taking the SPEC 2000 benchmark test program as an example to illustrate the process of FCMBSMT method fetch control, this experiment also needs to set the performance test benchmark program parameters, simultaneous multi-thread simulator, performance reference object and performance parameter indicators. The specific parameter configuration is as follows:

(1)设置性能测试基准程序参数。实验将选取SPEC 2000测试集中的7个定点程序和5个浮点程序，并将其随机组合为6个双线程负载集进行性能测评。同时，由于在实验中完整的模拟测试程序需要花费大量的时间，有时甚至不可能完成，因此针对不同测试程序的运行指令数也进行具体的配置。具体的测试程序参数及运行指令数配置如表2所示，运行指令数的单位为十亿。(1) Set the performance test benchmark program parameters. The experiment will select 7 fixed-point programs and 5 floating-point programs from the SPEC 2000 test set, and randomly combine them into 6 dual-thread load sets for performance evaluation. At the same time, because it takes a lot of time to complete the simulation test program in the experiment, sometimes it is even impossible to complete, so the number of running instructions for different test programs is also specifically configured. The specific test program parameters and the configuration of the number of operating instructions are shown in Table 2, and the unit of the number of operating instructions is one billion.

表2FCMBSMT方法性能测试基准程序参数配置表Table 2 FCMBSMT method performance test benchmark program parameter configuration table

(2)同时多线程模拟器。实验采用西部实验室Dean.M.Tullsen等人研发的SMISIM模拟器进行实验研究。SMISIM模拟器是基于James Lames编写的SPIM模拟器进行开发的，可同时运行8个线程，每个线程运行的指令可以达到300M。同时，SMTSIM模拟器还支持Alpha可执行程序的运行，且运行速度也是目前SMT模拟器中最快的。模拟器参数的基本配置如表3所示。(2) Simultaneous multithreading simulator. The experiment uses the SMISIM simulator developed by Dean.M.Tullsen et al. in the Western Laboratory for experimental research. The SMISIM simulator is developed based on the SPIM simulator written by James Lames. It can run 8 threads at the same time, and each thread can run up to 300M instructions. At the same time, the SMTSIM simulator also supports the operation of the Alpha executable program, and the running speed is the fastest among the current SMT simulators. The basic configuration of simulator parameters is shown in Table 3.

表3SMTSIM模拟器参数的基本配置表Table 3 Basic configuration table of SMTSIM simulator parameters

(3)性能参照对象。性能参照将采用目前取指性能最好的ICOUNT2.8和Gshare分支预测器相结合的ICG方法进行性能比对，通过与高性能的取指控制方法进行性能对比，更能显现FCMBSMT方法在取指性能上的优越性和可用性。(3) Performance reference object. The performance reference will use the ICG method which combines the ICOUNT2.8 with the best index fetching performance and the Gshare branch predictor for performance comparison. Through the performance comparison with the high-performance instruction fetch control method, it can better show the performance of the FCMBSMT method in fetching instructions. Performance superiority and usability.

(4)性能参数指标。针对SMT处理器体系结构特点和FCMBSMT方法的实现原理，综合考虑各方面因素的影响，性能测试实验将采用的评估参数包括：处理器IPC、指令队列长度及队列冲突率、分支误预测率和误预测路径取指率。(4) Performance parameter index. According to the characteristics of the SMT processor architecture and the implementation principle of the FCMBSMT method, considering the influence of various factors, the evaluation parameters used in the performance test experiment include: processor IPC, instruction queue length and queue conflict rate, branch misprediction rate and misprediction rate. Prediction path fetch rate.

处理器的IPC值是指处理器在每个时钟周期内执行的指令数，是衡量处理器指令吞吐率和加速比的重要性能指标。The IPC value of a processor refers to the number of instructions the processor executes in each clock cycle, and is an important performance indicator for measuring the instruction throughput and speedup ratio of the processor.

指令队列长度是指基准测试程序占用定点队列、浮点队列和访存队列的长度之和。指令队列冲突率是指基准测试程序所占定点队列冲突率、浮点队列冲突率和访存队列冲突率的算术平均值。The instruction queue length refers to the sum of the lengths of the fixed-point queue, floating-point queue and memory access queue occupied by the benchmark test program. The instruction queue conflict rate refers to the arithmetic average of the fixed-point queue conflict rate, floating-point queue conflict rate and memory access queue conflict rate occupied by the benchmark test program.

分支误预测率是指误预测的分支指令数与分支指令总数的比值。误预测路径的取指率是指误预测路径读取的指令数与读取指令总数的比值。The branch misprediction rate refers to the ratio of the number of mispredicted branch instructions to the total number of branch instructions. The instruction fetch rate of the mispredicted path refers to the ratio of the number of instructions read by the mispredicted path to the total number of read instructions.

同时，为了使得测试环境更为接近实际程序运行状态，实验采用12个工作负载程序两两随机组合的方式，最终形成6个复合型测试程序来其进行性能测试。具体的性能测试结果如图4所示。At the same time, in order to make the test environment closer to the actual program running state, the experiment adopts the random combination of 12 workload programs, and finally forms 6 composite test programs for performance testing. The specific performance test results are shown in Figure 4.

由图4分析可知，相比于传统的ICG取指控制方法，在FCMBSMT方法下处理器的指令吞吐率大幅度提升。在两线程负载运行的条件下处理器的IPC性能达到2.95，而在ICG方法下处理器IPC的性能仅为1.89，工作负载程序性能的平均加权加速比约为26.1％，提升幅度相比单独的IFSBSMT策略和TBHBP预测器都有所提高。处理器指令吞吐率的提升是在IFSBSMT策略和TBHBP预测器共同的推动下实现的，IFSBSMT策略通过合理利用取指带宽和动态分配线程执行所需的系统资源，极大地提高了处理器的指令吞吐率。同时，高精度的TBHBP分支预测器通过减少分支指令的别名冲突和容量冲突，提高了处理器的取指质量和取指效率，进而促进了指令吞吐率性能的提升。From the analysis of Figure 4, it can be seen that compared with the traditional ICG instruction fetch control method, the instruction throughput rate of the processor is greatly improved under the FCMBSMT method. Under the condition of two-thread load running, the processor's IPC performance reaches 2.95, while under the ICG method, the processor's IPC performance is only 1.89, and the average weighted speedup ratio of the workload program performance is about 26.1%, which is an improvement compared to the single Both the IFSBSMT strategy and the TBHBP predictor are improved. The improvement of the processor instruction throughput is achieved under the joint promotion of the IFSBSMT strategy and the TBHBP predictor. The IFSBSMT strategy greatly improves the instruction throughput of the processor by rationally utilizing the instruction fetch bandwidth and dynamically allocating the system resources required for thread execution. Rate. At the same time, the high-precision TBHBP branch predictor improves the quality and efficiency of instruction fetching by reducing the alias conflicts and capacity conflicts of branch instructions, thereby promoting the improvement of instruction throughput performance.

IFSBSMT策略和TBHBP分支预测器在实现的过程中都具有一定的取指公平性，因此两者相结合的FCMBSMT取指控制方法也将具有相同的优点，即在提高处理器整体指令吞吐率性能的同时，单线程的指令吞吐率性能应所有提升。下面将针对上一实验6个两线程工作负载中的12个单线程的IPC性能进行测试，具体的性能测试结果如图5所示。Both the IFSBSMT strategy and the TBHBP branch predictor have a certain fairness in fetching instructions during the implementation process, so the FCMBSMT fetching control method combined with the two will also have the same advantages, that is, in improving the overall instruction throughput performance of the processor. At the same time, the single-threaded instruction throughput performance should be improved. The following will test the IPC performance of 12 single-threaded workloads among the 6 two-threaded workloads in the previous experiment. The specific performance test results are shown in Figure 5.

由图5分析可知，与ICG方法相比，在FCMBSMT方法下的12个工作负载程序的指令吞吐率均呈现不同程度的涨幅。性能测试结果经过统计后显示，FCMBSMT方法下单线程的平均IPC值为1.45，而ICG方法仅为0.91，平均加权加速比约为29.3％。可见，FCMBSMT取指控制方法完全继承了IFSBSMT策略和TBHBP预测器的公平性这一优点，且与二者相比，单线程指令吞吐率性能的提升幅度更为明显。From the analysis in Figure 5, it can be seen that compared with the ICG method, the instruction throughput rates of the 12 workload programs under the FCMBSMT method all show different degrees of increase. The statistics of the performance test results show that the average IPC value of a single thread under the FCMBSMT method is 1.45, while that of the ICG method is only 0.91, and the average weighted speedup ratio is about 29.3%. It can be seen that the FCMBSMT instruction fetch control method fully inherits the advantages of the IFSBSMT strategy and the fairness of the TBHBP predictor, and compared with the two, the single-threaded instruction throughput performance has a more obvious improvement.

IFSBSMT取指策略通过合理划分取指带宽，减少了高优先级线程的取指数，进而有效地减少了线程占用指令队列的平均长度，极大地提高了系统资源的利用率。而TBHBP分支预测器通过增加分支结果输出表BRT，有效地避免了分支指令在指令队列中堆积现象的出现，加快了分支指令的预测执行速度，推进后继指令的顺利执行，进而有效地降低了线程对指令队列等系统资源的占用率。在两种因素共同的推动作用下，FCMBSMT方法下指令队列的平均长度应有所减少。具体测试结果如图6所示。The IFSBSMT fetching strategy reduces the index fetching of high-priority threads by reasonably dividing the fetching bandwidth, thereby effectively reducing the average length of the instruction queue occupied by threads and greatly improving the utilization of system resources. By increasing the branch result output table BRT, the TBHBP branch predictor effectively avoids the accumulation of branch instructions in the instruction queue, speeds up the predicted execution speed of branch instructions, promotes the smooth execution of subsequent instructions, and effectively reduces the number of threads. Occupancy of system resources such as instruction queues. Under the common impetus of the two factors, the average length of the instruction queue under the FCMBSMT method should be reduced. The specific test results are shown in Figure 6.

由图6测试结果数据的统计分析可知，除applu-sixtrack程序负载外，其他程序负载占用指令队列的长度均所有减小，这主要是由于applu-sixtrack程序负载读取可用指令的数量增多，使得其占用指令队列的长度有所增加，但最终表现为工作负载IPC性能的提升。总体而言，ICG方法下程序负载的平均占用指令队列长度为36.83，而FCMBSMT方法仅为19.50，平均降幅约为47.05％。From the statistical analysis of the test result data in Figure 6, it can be seen that, except for the applu-sixtrack program load, the length of the instruction queue occupied by other program loads is all reduced. This is mainly due to the increase in the number of available instructions read by the applu-sixtrack program load. The length of the instruction queue it occupies has increased, but it is ultimately reflected in the improvement of workload IPC performance. Overall, the average occupied instruction queue length of the program load under the ICG method is 36.83, while that of the FCMBSMT method is only 19.50, and the average decrease is about 47.05%.

TBHBP分支预测器精度的提升有效地提高了FCMBSMT取指控制方法的分支预测命中率，进而使得处理器的分支误预测率有所下降。具体的性能测试结果如图7所示。The improvement of the accuracy of the TBHBP branch predictor effectively improves the branch prediction hit rate of the FCMBSMT instruction fetching control method, thereby reducing the branch misprediction rate of the processor. The specific performance test results are shown in Figure 7.

由图7分析可知，除bzip2-lucas和applu-sixtrack程序负载因其自身原因外，其他程序负载的分支误预测率均呈现下降的趋势。总体而言，ICG方法下的分支误预测率为6.03％，而FCMBSMT方法下的分支误预测率仅为3.87％，平均降幅将近2.16个百分点。From the analysis in Figure 7, it can be seen that except for the bzip2-lucas and applu-sixtrack program loads due to their own reasons, the branch misprediction rates of other program loads show a downward trend. Overall, the branch misprediction rate under the ICG method is 6.03%, while the branch misprediction rate under the FCMBSMT method is only 3.87%, with an average drop of nearly 2.16 percentage points.

由此可见，FCMBSMT方法对于处理器分支预测性能的改善非常显著。It can be seen that the improvement of the branch prediction performance of the processor by the FCMBSMT method is very significant.

同时，分支误预测率的下降有效地减少了取指部件在误预测路径上的取指数，处理器在误预测路径上的取指率也相应的下降。具体的性能测试结果如图8所示。At the same time, the reduction of the branch misprediction rate effectively reduces the index fetching of the instruction fetching unit on the misprediction path, and the index fetching rate of the processor on the misprediction path also decreases accordingly. The specific performance test results are shown in Figure 8.

由图8分析可知，分支误预路径取指率与分支预测误预测率的降幅趋势基本保持一致，即除bzip2-lucas和applu-sixtrack程序负载外因其自身原因外，其他程序负载的分支误预测率路径取指率均有所下降。总体而言，ICG方法下的分支误预测路径取指率为10.64％，而FCMBSMT方法下的分支误预测路径取指率仅为7.42％，平均降幅将近3.28个百分点。分支误预测率和误预测路径取指率的下降有效地提高了处理器的分支预测性能，并促进了处理器指令吞吐率性能的提升。From the analysis in Figure 8, it can be seen that the branch misprediction path index fetch rate and the branch prediction misprediction rate are basically consistent, that is, except for the bzip2-lucas and applu-sixtrack program load due to its own reasons, the branch misprediction rate of other program loads The index fetch rate of the rate path has decreased. Overall, the branch misprediction path index retrieval rate under the ICG method is 10.64%, while the branch misprediction path index retrieval rate under the FCMBSMT method is only 7.42%, with an average drop of nearly 3.28 percentage points. The reduction of the branch misprediction rate and the misprediction path instruction fetch rate effectively improves the branch prediction performance of the processor, and promotes the improvement of the processor instruction throughput performance.

以上是本发明的较佳实施例，凡依本发明技术方案作为改变的，所产生的功能作用未超出本发明方案范围的，均属于本发明的保护范围。The above are the preferred embodiments of the present invention, and any changes made according to the technical solution of the present invention, and the resulting functional effects do not exceed the scope of the solution of the present invention, all belong to the protection scope of the present invention.

Claims

1. A method for fetching instructions based on simultaneous multithreading, characterized in that:

Step 1: In each clock cycle of the processor, the fetching component reads the PC value of the instruction according to the program counter;

Step 2: Select two threads with the smallest counter values of the number of command queue items through the T selection 2 multiplexer to output, assuming that the priority of thread 1 is higher than that of thread 2;

Step 3: The count value of thread 1 first performs the operation of multiple expressions through the adder and multiplier, and then performs a bitwise inversion and modulo 16 operation on the result value, and passes the output value through the 2-to-1 selector and Compare the instruction fetch bandwidth and take the smaller value; except for the calculation of reading instructions, the execution process of thread 2 is the same as that of 1. For thread 2, the number of instructions read is the difference between thread 1's index fetch and instruction fetch bandwidth ;

Step 4: Send the output results of the two threads into the register of the instruction fetch unit to complete the division of the instruction fetch bandwidth;

Step 5: The dual-priority resource allocation mechanism calculates the system resources required by the thread in the instruction fetching phase according to the two parameters of the thread IPC value and the cache failure rate, and completes the dynamic allocation of resources.

Step 6: Determine whether there is a branch instruction; if so, index the branch prediction information table BPIT according to the PC value of the branch instruction Bi, and read the thread index number TID to which the branch instruction belongs; otherwise, send the read instruction into the instruction cache;

Step 7: Index the thread branch history register information table TBHRIT through the obtained TID number, read the branch history information BPHI predicted by the thread, as the global history information of the branch prediction; at the same time, index the branch target address history register information through the obtained instruction PC value Table BTAHRIT, reads the target address BPTA of the branch instruction, and reads the local history information used for branch prediction according to the instruction address;

Step 8: Combine the branch history information BHR of each thread and the history information BHT read according to the target address through a hash function as an index of the secondary pattern matching table PHT;

Step 9: Obtain the mode history bit Sc of the branch instruction through the concatenated historical information index PHT table, and use it for the actual branch prediction operation;

Step 10: Input the obtained mode history bit Sc into the prediction decision function to complete the calculation operation of the branch prediction result. At the same time, complete the update operation of the mode history bit through the state transition function δ. The updated mode history bit will be replaced by the original Ri , c-k Ri, c-k+1...Ri, c-1 becomes Ri, c-k+1Ri, c-k+2...Ri, c;

Step 11: Write the prediction result of the branch instruction Bi into the branch result output table BRT; when the same branch instruction is predicted next time, if the prediction result is the same as the value of PRED in the BRT table, add 1 to CONF; otherwise, CONF does minus 1 operation;

Step 12: through the update circuit of the TBHRIT table, the obtained branch output results Ri, c are left shifted into the last bit in the thread history register, and the predicted historical information is updated to the historical information submitted by the branch instruction;

Step 13: Through the update circuit of the BTAHRIT table, move the history information of the target address corresponding to the obtained branch output result Ri, c to the left into the last bit in the address history register, and update the predicted target address of the branch instruction to the branch instruction Physical address information at the time of submission;

Step 14: When the branch predictor performs branch prediction on the next branch instruction Bi+1, it first indexes the CONF field in the BRT table according to its PC value; if CONF is greater than or equal to 2, then the TAG field in the BPIT table is recorded as 1 , the branch prediction circuit will not perform the branch prediction operation on the instruction Bi+1, but directly output the stored branch result; on the contrary, if CONF is less than 2, the TAG field in the BPIT table will be recorded as 0, and the branch instruction will re-execute the branch prediction operation , and compare the prediction result with the data in the BRT table, and complete the update operation of the CONF field and the PRED field; finally, inform the fetching unit of the prediction result;

Step 15: If a branch misprediction phenomenon occurs during the whole process of branch prediction, the processor will start the misprediction processing mechanism to stop the remaining operations immediately, and cancel all instructions running in the pipeline belonging to the same thread after the mispredicted branch instruction , the PC value of the thread is adjusted to the correct target instruction address after the branch, and then restarts reading instruction execution from the new address; at the same time, adjusts the CONF field and PRED of the corresponding entry in the branch result output table BRT according to the actual execution result of the branch field for use when the branch instruction is executed again.

2. The instruction fetching control method based on simultaneous multithreading according to claim 1, characterized in that: among the system resources required by the computing thread in the fetching stage, the system resources include: fetching bandwidth, instruction queue length, reserved station queue length,

The specific method of resource allocation is as follows:

Ni Ni = = \frac{{P P}_{Ti Ti}}{{P P}_{Ti Ti} + + {P P}_{Tj Tj}} \times \times R R

Wherein, P _Ti and P _Tj respectively represent the resource allocation priorities of threads Ti and Tj, Ni represents the number of resources allocated to thread Ti, and R represents the total number of system resources;

In the case where the primary priority and secondary priority are different, the ratio of the IPC value of the thread to the secondary priority is used as the basis for resource allocation evaluation. The specific method of resource allocation is as follows:

Ni Ni = = \frac{TLi TL / / CLi CLi}{TLi TL / / CLi CLi + + TLj TLj / / CLj CL} \times \times R R

Among them, TLi and TLj represent the main priority of threads Ti and Tj respectively; CLi and CLj represent the sub-priority numbers of threads Ti and Tj respectively, and their values can be 1, 2, 3; Ni represents the number of resources allocated to thread Ti, R represents the total number of system resources.

3. The instruction fetching control method based on simultaneous multithreading according to claim 1 or 2, characterized in that: the branch history information BHR of each thread is combined with the history information BHT read according to the target address through a hash function Together, as the indexing process of the secondary pattern matching table PHT, respectively