
CN114327641B - Instruction prefetching method, instruction prefetching device, processor and electronic equipment

Instruction prefetching method, instruction prefetching device, processor and electronic equipment

Info

Publication number: CN114327641B (granted publication; earlier published as CN114327641A)
Authority: CN (China)
Application number: CN202111671514.6A
Inventors: 赵春尧, 胡世文, 邵奇
Current and original assignee: Hygon Information Technology Co Ltd
Prior art keywords: cache, target, access request, state processing, instruction
Other languages: Chinese (zh)
Legal status: Active (application filed by and granted to Hygon Information Technology Co Ltd)

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Abstract

An instruction prefetching method, an instruction prefetching device, a processor and electronic equipment are provided. The instruction prefetching method includes: in response to a target instruction missing in a target cache, writing a target access request for the target instruction into a lost state processing queue, where the lost state processing queue includes a plurality of access requests, the target access request is one of the plurality of access requests, and the lost state processing queue is configured to sequentially send the plurality of access requests to the next-level cache of the target cache; in response to the target instruction being mispredicted, sending a cancel request for the target instruction to the lost state processing queue; and in response to the cancel request, releasing the queue space occupied by access requests located after the target access request in the lost state processing queue. The instruction prefetching method can improve prefetching precision, improve the utilization rate of the lost state processing queue, and help improve overall performance.

Description

Instruction prefetching method, instruction prefetching device, processor and electronic equipment
Technical Field
Embodiments of the present disclosure relate to an instruction prefetching method, an instruction prefetching device, a processor and an electronic device.
Background
In the development of computer technology, low-cost, high-capacity storage typically has relatively high latency and cannot supply data to the processor in time. Low-latency, low-capacity storage serves as an intermediate buffer between the processor and mass storage, effectively alleviating the impact of data-transfer latency on processor performance. Cache is one such low-latency, low-capacity storage technology.
To balance capacity and latency, the cache is divided into multiple levels. A lower-level cache has smaller capacity and lower latency, while a higher-level cache is the opposite. Commonly used processors typically have three levels of cache, with capacity increasing from the first level to the third level. The capacity of the first-level cache (L1 Cache) is typically tens of KB, the capacity of the second-level cache (L2 Cache) is several MB, and the capacity of the third-level cache (L3 Cache) is several hundred MB. Since code and the data it operates on generally occupy non-overlapping regions of memory, the first-level cache is usually divided into a first-level instruction cache and a first-level data cache to improve processor efficiency.
As the footprint of code running on servers grows, the capacity of the instruction cache becomes increasingly unable to meet the code's caching requirements. Because enlarging the instruction cache also increases its latency, the growing demand for cache space cannot be met simply by making the instruction cache bigger. Code that cannot be held in the instruction cache leads to more instruction cache misses, i.e., the code must be fetched from the higher-latency second-level cache, third-level cache, or even memory. The added latency of accessing such larger storage degrades processor performance.
This leaves processor designers room for optimization, since not all code in the instruction cache will be used by the processor in the near term. If the code the processor is about to execute can be predicted, there is an opportunity to replace code in the instruction cache that is temporarily unused. This reduces the probability of the processor accessing high-latency storage and reduces data-transfer latency, thereby improving processor performance. This technique is instruction prefetching.
Disclosure of Invention
At least one embodiment of the present disclosure provides an instruction prefetching method, which includes: in response to a target instruction missing in a target cache, writing a target access request for the target instruction into a lost state processing queue, where the lost state processing queue includes a plurality of access requests, the target access request is one of the plurality of access requests, and the lost state processing queue is configured to sequentially send the plurality of access requests to a next-level cache of the target cache; in response to the target instruction being mispredicted, sending a cancel request for the target instruction to the lost state processing queue; and in response to the cancel request, releasing the queue space occupied by access requests located after the target access request in the lost state processing queue.
For example, in a method provided by an embodiment of the present disclosure, at least one of the access requests located after the target access request has been sent to the next level cache.
For example, in a method provided by an embodiment of the present disclosure, the access request following the target access request is an access request sent to the lost state processing queue later than the target access request.
For example, the method provided by an embodiment of the present disclosure further includes: the lost state processing queue sequentially sends the plurality of access requests to the next-level cache, and the next-level cache responds to each access request in turn.
For example, the method provided by an embodiment of the present disclosure further includes: comparing first preset information in data returned by the next-level cache with second preset information stored, in the lost state processing queue, in the queue space corresponding to the access request to which the data belongs; in response to the first preset information matching the second preset information, sending the data to the target cache and releasing that queue space in the lost state processing queue, where the data includes the target instruction; and in response to the first preset information not matching the second preset information, sending a coherency maintenance request to the next-level cache.
For example, a method provided by an embodiment of the present disclosure further includes, in response to the coherency maintenance request, the next level cache discarding the data.
For example, the method provided by an embodiment of the present disclosure further includes, in response to the coherency maintenance request, updating, by the next-level cache, the flag information in the entry corresponding to the data to an invalid state.
For example, in the method provided by an embodiment of the present disclosure, the first preset information includes address information of the data, and the second preset information includes the address information stored, in the lost state processing queue, in the queue space corresponding to the access request to which the data belongs, and/or flag information indicating an idle state.
For example, in a method provided by an embodiment of the present disclosure, the target cache includes a first level instruction cache and the next level cache includes a second level cache.
For example, in the method provided by an embodiment of the present disclosure, releasing the queue space occupied by access requests located after the target access request in the lost state processing queue includes: clearing the contents of the queue space occupied by those access requests, or updating the state identifier of that queue space to an idle state.
For example, the method provided by an embodiment of the present disclosure further includes: selecting a cache way in the target cache; updating the latest access information of the selected cache way before the next-level cache returns the data; and, when the next-level cache returns the data, storing the target instruction contained in the data into the selected cache way.
For example, the method provided by an embodiment of the present disclosure further includes: selecting a cache way in the target cache; and, when the next-level cache returns the data, storing the target instruction contained in the data into the selected cache way and updating the latest access information of the selected cache way.
For example, in the method provided by an embodiment of the present disclosure, selecting a cache way in the target cache includes taking the cache way that has not been used for the longest time as the selected cache way, according to the latest access information of each cache way in the target cache.
At least one embodiment of the present disclosure further provides an instruction prefetching device, which includes a request writing unit, a request cancellation unit, and a request processing unit. The request writing unit is configured to, in response to a target instruction missing in a target cache, write a target access request for the target instruction into a lost state processing queue, where the lost state processing queue includes a plurality of access requests, the target access request is one of the plurality of access requests, and the lost state processing queue is configured to sequentially send the plurality of access requests to a next-level cache of the target cache. The request cancellation unit is configured to, in response to the target instruction being mispredicted, send a cancel request for the target instruction to the lost state processing queue. The request processing unit is configured to, in response to the cancel request, release the queue space occupied by access requests located after the target access request in the lost state processing queue.
At least one embodiment of the present disclosure further provides a processor, which includes a target cache, a next-level cache of the target cache, and a lost state processing queue. The lost state processing queue is configured to receive a target access request for a target instruction in response to the target instruction missing in the target cache, and to sequentially send a plurality of access requests to the next-level cache, where the lost state processing queue includes the plurality of access requests and the target access request is one of them. The lost state processing queue is further configured to receive a cancel request for the target instruction in response to the target instruction being mispredicted, and to release, in response to the cancel request, the queue space occupied by access requests located after the target access request.
At least one embodiment of the present disclosure also provides an electronic device, including the instruction prefetching apparatus provided in any one embodiment of the present disclosure.
At least one embodiment of the present disclosure also provides an electronic device including a processor provided in any one of the embodiments of the present disclosure.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly described below, and it is apparent that the drawings in the following description relate only to some embodiments of the present disclosure, not to limit the present disclosure.
FIG. 1 is a diagram of a cache architecture;
FIGS. 2A-2C are schematic diagrams of instruction prefetch flows;
FIG. 3 is a schematic diagram of a request to clear a misprediction in a lost state processing queue;
FIG. 4 is a flow chart of an instruction prefetching method according to some embodiments of the present disclosure;
FIG. 5 is a schematic diagram of a request for clearing mispredictions in a lost state processing queue in an instruction prefetch method according to some embodiments of the present disclosure;
FIG. 6 is a flow chart of another method for instruction prefetching according to some embodiments of the present disclosure;
FIG. 7 is a flow chart of another method for instruction prefetching according to some embodiments of the present disclosure;
FIG. 8 is a flow chart of another method for instruction prefetching according to some embodiments of the present disclosure;
FIG. 9 is a schematic block diagram of an instruction pre-fetching apparatus provided by some embodiments of the present disclosure;
FIG. 10 is a schematic block diagram of a processor provided by some embodiments of the present disclosure;
FIG. 11 is a schematic block diagram of an electronic device provided by some embodiments of the present disclosure; and
Fig. 12 is a schematic block diagram of another electronic device provided by some embodiments of the present disclosure.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present disclosure. It will be apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments, which can be made by one of ordinary skill in the art without the need for inventive faculty, are within the scope of the present disclosure, based on the described embodiments of the present disclosure.
Unless defined otherwise, technical or scientific terms used in this disclosure should be given the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The terms "first," "second," and the like, as used in this disclosure, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. Likewise, the terms "a," "an," or "the" and similar terms do not denote a limitation of quantity, but rather denote the presence of at least one. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may also be changed when the absolute position of the object to be described is changed.
In instruction prefetching technology, the main factors affecting effectiveness are timeliness, coverage, and accuracy. Timeliness measures whether code is provided just as the processor needs it: code provided too late causes a performance loss, while code provided too early occupies cache space. Coverage measures whether the prefetched code covers as many potential instruction cache misses as possible. Accuracy measures the probability that prefetched code is used before it is evicted from the instruction cache.
Instruction prefetching is based on speculative execution, and erroneous speculation can reduce its effectiveness. Speculative execution predicts the execution outcome of certain instructions so that subsequent instructions can enter the processor in advance and a performance benefit can be obtained. The instructions whose outcomes are predicted are branch instructions. A branch instruction resembles a fork in a road, only one branch of which is the correct way, and the processor does not know which way to go before the branch instruction executes. For example, the processor could wait until the branch instruction finishes executing, i.e., until its correct jump direction is known, and only then continue; but waiting lengthens program execution time. Alternatively, the processor can predict the direction of the branch instruction by some mechanism and take a branch in advance. If the prediction is correct, program execution time is reduced; if the prediction is wrong, execution returns to the fork and takes the other path. This branch prediction mechanism is one form of speculative execution. Instructions that enter the processor ahead of time improve processor performance if the predicted outcome matches the actual execution outcome. If an instruction's prediction is wrong, the wrong instructions that entered the processor in advance are cancelled, and execution restarts from the correct instruction after the branch instruction, ensuring that the program produces the correct result.
However, if an instruction prefetch has already sent an access request to the second-level cache, that request cannot be cancelled and its data must be backfilled into the instruction cache even if the request lies on the mispredicted path. After a branch misprediction, the instructions that entered the processor can be cancelled, but the access requests those instructions already sent to the second-level cache cannot. The erroneous instruction is no longer executed, yet its data is still backfilled into the instruction cache, and this mistakenly fetched data is likely not to be used by the program for some time. As a result, instruction prefetching accuracy drops and instruction cache misses increase, which affects processor performance.
FIG. 1 is a schematic diagram of a cache structure, for example a 4-way set-associative cache with 256 sets. The addresses in FIG. 1 are memory addresses. The index takes the 3rd through 10th bits of the memory address, corresponding to bits 9 to 2 in FIG. 1. The tag takes the 11th through 32nd bits of the memory address, corresponding to bits 31 to 10 in FIG. 1. Bits 1 to 0 of the memory address indicate that the minimum unit of cache access is 4 bytes.
When the processor accesses the cache, the corresponding "set" is found by the index. Within the "set", the tag is compared with the tag stored in each "way". If a matching tag is found, the cache hits; otherwise, a cache miss occurs.
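As an illustration of the lookup just described, the following C++ sketch splits a 32-bit memory address into the offset, index, and tag fields using the bit positions given for FIG. 1; the structure and function names are assumptions for illustration and are not part of the patent.

```cpp
#include <cstdint>
#include <cstdio>

// Field layout assumed from the FIG. 1 description:
// bits 1..0   -> byte offset (4-byte access unit)
// bits 9..2   -> index (selects one of 256 sets)
// bits 31..10 -> tag (compared against each way in the selected set)
struct DecodedAddress {
    uint32_t offset;
    uint32_t index;
    uint32_t tag;
};

DecodedAddress decode(uint32_t addr) {
    DecodedAddress d;
    d.offset = addr & 0x3u;          // bits 1..0
    d.index  = (addr >> 2) & 0xFFu;  // bits 9..2
    d.tag    = addr >> 10;           // bits 31..10
    return d;
}

int main() {
    DecodedAddress d = decode(0x12345678u);
    // A hit means the tag matches one of the 4 ways stored in set d.index.
    std::printf("tag=0x%X index=%u offset=%u\n", d.tag, d.index, d.offset);
    return 0;
}
```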
Assume there is only one level of cache and that memory holds all the data needed. When the processor starts running, the cache is empty, so the processor's accesses to the cache miss. The processor then issues access requests to memory and backfills the data into the cache. During backfill, the index is calculated first; data with the same index can only be backfilled into the same "set". If the "set" has a free slot, one is selected to hold the backfilled data. If there is no free slot, the data to be "kicked out" is chosen by some mechanism, for example the data that has not been used for the longest time (Least Recently Used). Here, when all "ways" of a "set" are occupied by valid contents and a new fill request arrives, an existing valid content is replaced or discarded; this process is called "kicking out". For example, the least recently used "way" is marked with 0, and the larger the number, the more recently the "way" was used. Each time an access hits a "way", that "way's" latest access information is set to the maximum value and the latest access information of the remaining "ways" is decremented by 1, down to a minimum of 0.
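The recency numbering just described can be sketched as follows; the class and member names are illustrative assumptions, and the logic simply mirrors the description (0 marks the least recently used "way", a hit sets a "way's" counter to the maximum and ages the others).

```cpp
#include <array>
#include <cstddef>
#include <cstdio>

// Recency counters for one 4-way set (names are assumptions).
struct SetRecency {
    std::array<unsigned, 4> counters{ {0, 1, 2, 3} };

    // The "way" to kick out is the one whose counter is 0.
    std::size_t victim() const {
        for (std::size_t w = 0; w < counters.size(); ++w)
            if (counters[w] == 0) return w;
        return 0;  // not reached if the counters stay consistent
    }

    // On a hit to way `w`: set its counter to the maximum and decrement the
    // others by 1, stopping at 0, exactly as described above.
    void touch(std::size_t w) {
        for (std::size_t i = 0; i < counters.size(); ++i)
            if (i != w && counters[i] > 0) --counters[i];
        counters[w] = static_cast<unsigned>(counters.size() - 1);
    }
};

int main() {
    SetRecency set;
    set.touch(0);                                        // way 0 becomes most recent
    std::printf("next victim: way %zu\n", set.victim()); // way 1 is now least recent
    return 0;
}
```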
In the first manner (also referred to, for example, as the "update-first" manner), when a cache miss occurs, the latest access information of the "set" is consulted immediately, the number of the "way" to be "kicked out" is determined and saved, and the latest access information of that "way" is updated (hereinafter, "latest access information" refers to the per-"way" latest access information). When memory backfills data into the cache, the data is written directly into the selected "way" and the latest access information does not need to be updated again. In the second manner (also referred to, for example, as the "post-update" manner), when a cache miss occurs, the "way" to be "kicked out" is not determined immediately; when the data is backfilled, the "way" to be "kicked out" is determined and the latest access information is updated. The advantage of the first manner is that the backfill itself takes less time; however, if the access request is cancelled (which, in this scheme, can only happen before the request has been sent to the second-level cache), the latest access information that was "updated first" becomes erroneous, and if such errors occur often they harm processor performance.
FIGS. 2A-2C are schematic diagrams of an instruction prefetch flow. For example, access requests from the first-level instruction cache 100 and the instruction prefetcher 105 to the second-level cache 115 are issued through the lost state processing queue 110. When an instruction or a prefetch instruction misses in the cache, the first-level instruction cache 100 or the instruction prefetcher 105 further issues an access request to the second-level cache 115 or to larger-capacity storage. This access request is first written into the lost state processing queue 110 and occupies a location, as in process ① of FIG. 2A (each stripe-filled rectangle in the figure represents an access request written into the lost state processing queue 110; the further right, the earlier it was written). The latest access information is updated, for example, in the "update-first" manner. The information written into the lost state processing queue 110 includes the memory address and the "way" information for backfilling into the first-level instruction cache 100.
When the second-level cache 115 is able to receive new access requests, the access requests held in the lost state processing queue 110 are issued to the second-level cache 115, as in process ② of FIG. 2B (each dot-filled rectangle represents an access request that has been sent). An access request's location in the lost state processing queue 110 is not released until the data is obtained. When the second-level cache 115 has the data ready for backfill, it queries the backfill "way" information of the access request (dot-filled rectangle) still held in the lost state processing queue 110, as in process ④ of FIG. 2C. With this location information, the second-level cache 115 backfills the data into the corresponding location of the first-level instruction cache 100, as in process ③ of FIG. 2C, and the corresponding access request is released from the lost state processing queue 110 (as in the rightmost blank rectangle of the lost state processing queue 110 in FIG. 2C).
If the processor finds a mispredicted instruction, it sends a request to cancel the mispredicted access to the lost state processing queue 110. As shown in FIG. 3, the further right an access request is, the earlier it was written into the lost state processing queue 110 (dot-filled rectangles are access requests already sent to the second-level cache, stripe-filled rectangles are access requests not yet sent to the second-level cache, and the mark X is the location where the misprediction occurred). Cancelling the erroneous access requests cancels only those requests that have not yet been sent to the second-level cache (the stripe-filled portion) and that were written later than the request of the mispredicted instruction (the portion to the left of X).
In this instruction prefetching scheme, because requests already sent to the second-level cache are not cancelled, their data is still backfilled even when the prediction is wrong, so the prefetching precision is low and the utilization rate of the lost state processing queue is low. Moreover, the latest access information of the first-level instruction cache is updated erroneously.
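To make this baseline behaviour concrete, the following sketch (an illustration under assumed field names, not the patent's implementation) releases only the entries that are both later than the mispredicted request and not yet sent to the second-level cache.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal entry of the lost state processing queue for this sketch
// (field names are assumptions).
struct Entry {
    uint32_t address    = 0;
    bool     sent_to_l2 = false;  // already issued to the second-level cache
    bool     valid      = false;  // false marks the queue space as free
};

// Baseline cancellation described above: among entries written later than the
// mispredicted request (index > error_index, entries stored in write order),
// only those NOT yet sent to the second-level cache are released. Requests
// already sent will still have their data backfilled into the instruction cache.
void cancel_baseline(std::vector<Entry>& queue, std::size_t error_index) {
    for (std::size_t i = error_index + 1; i < queue.size(); ++i) {
        if (!queue[i].sent_to_l2) {
            queue[i].valid = false;
        }
    }
}
```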
At least one embodiment of the present disclosure provides an instruction prefetching method, an instruction prefetching device, a processor and an electronic device. The instruction prefetching method can improve prefetching precision, improve the utilization rate of the lost state processing queue and is beneficial to improving overall performance.
Hereinafter, embodiments of the present disclosure will be described in detail with reference to the accompanying drawings. It should be noted that the same reference numerals in different drawings will be used to refer to the same elements already described.
At least one embodiment of the present disclosure provides an instruction prefetching method. The instruction prefetching method includes: in response to a target instruction missing in a target cache, writing a target access request for the target instruction into a lost state processing queue, where the lost state processing queue includes a plurality of access requests, the target access request is one of the plurality of access requests, and the lost state processing queue is configured to sequentially send the plurality of access requests to the next-level cache of the target cache; in response to the target instruction being mispredicted, sending a cancel request for the target instruction to the lost state processing queue; and in response to the cancel request, releasing the queue space occupied by access requests located after the target access request in the lost state processing queue.
Fig. 4 is a flowchart illustrating an instruction prefetching method according to some embodiments of the present disclosure. As shown in FIG. 4, in some embodiments, the instruction pre-fetching method includes the following operations.
Step S11, in response to a target instruction missing in a target cache, writing a target access request for the target instruction into a lost state processing queue, where the lost state processing queue includes a plurality of access requests, the target access request is one of the plurality of access requests, and the lost state processing queue is configured to sequentially send the plurality of access requests to a next-level cache of the target cache;
Step S12, in response to the target instruction being mispredicted, sending a cancel request for the target instruction to the lost state processing queue;
And step S13, in response to the cancel request, releasing the queue space occupied by access requests located after the target access request in the lost state processing queue.
For example, in step S11, if the target instruction misses in the target cache, the target access request for the target instruction is written into the lost state processing queue, similar to process ① of FIG. 2A. For example, the lost state processing queue includes a plurality of access requests, and the target access request is one of them. The lost state processing queue is configured to sequentially send the plurality of access requests to the next-level cache of the target cache to request instructions. For example, the target instruction may be the currently predicted instruction, i.e., an instruction that needs to be prefetched. For example, in some examples, the target cache is a first-level cache, the next-level cache is a second-level cache, and the lost state processing queue is configured to sequentially send the plurality of access requests to the second-level cache; the process by which the lost state processing queue sends an access request to the second-level cache is similar to process ② of FIG. 2B. The lost state processing queue, also referred to as the miss status handling queue (MSHQ), is a component that, after a cache miss occurs, holds the missing address and some other information and handles the access request to the next-level cache.
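For illustration, an entry of such a queue might carry fields like the following; the layout and names are assumptions rather than the patent's definition.

```cpp
#include <cstdint>
#include <vector>

// One entry of the lost state processing queue (MSHQ), holding the kind of
// information this description mentions: the address that missed, the
// target-cache "way" chosen for backfill (in the "update-first" manner),
// and bookkeeping flags. Field names and widths are illustrative only.
struct MshqEntry {
    uint32_t miss_address = 0;            // address of the missing target instruction
    uint8_t  backfill_way = 0;            // target-cache way selected for backfill
    bool     sent_to_next_level = false;  // request already issued downstream
    bool     valid = false;               // false marks the queue space as idle
};

// Entries are kept in the order they were written, so "located after the
// target access request" is a statement about write order, not physical layout.
using Mshq = std::vector<MshqEntry>;
```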
For example, in some examples, the target cache may be a first level instruction cache (e.g., an instruction cache in the aforementioned first level cache) and the next level cache may be a second level cache (e.g., the aforementioned second level cache). Of course, embodiments of the present disclosure are not limited thereto, and the target cache and the next level cache thereof may be any level of cache, for example, a second level cache and a third level cache, respectively, as long as they are two adjacent levels of cache, which may be determined according to actual needs.
For example, in step S12, if the target instruction is mispredicted, a cancel request for the target instruction is sent to the miss status processing queue. For example, a target instruction misprediction may refer to a misprediction of a branch instruction, i.e., the current target instruction is not the instruction to be executed next, and therefore, in order to cancel a target access request corresponding to the target instruction in the lost state processing queue, a cancellation request for the target instruction needs to be sent to the lost state processing queue.
For example, in step S13, upon receiving the cancel request, the lost state processing queue releases the queue space occupied by the access request located after the target access request, that is, cancels the access request located after the target access request. At this time, it is not necessary to determine whether these requests have been sent to the next level cache, and they are canceled as long as they are located after the target access request. For example, in embodiments of the present disclosure, even though requests in the lost state processing queue have been sent to the next level of cache, the occupation of the requests in the lost state processing queue is canceled as long as the requests are from erroneous predictions.
For example, the manner in which the queue space occupied by the access request located after the target access request in the lost state processing queue is freed may include flushing the contents of the queue space occupied by the access request located after the target access request in the lost state processing queue, or updating the state identification of the queue space occupied by the access request located after the target access request in the lost state processing queue to an idle state. Thereby, a release of space can be achieved, so that new access requests can be written.
For example, in some examples, at least one of the access requests located after the target access request has already been sent to the next-level cache. As shown in FIG. 5, in some examples, X is the location where the misprediction occurred, dot-filled rectangles represent access requests that have been sent to the next-level cache, and stripe-filled rectangles represent access requests that have not yet been sent to the next-level cache. When the lost state processing queue receives the cancel request, the access requests to the left of the misprediction location are cancelled; among the cancelled access requests there are both requests not yet sent to the next-level cache and requests already sent to the next-level cache. In this embodiment, one more access request is deleted than in the case shown in FIG. 3, namely an access request that has already been sent to the next-level cache; compared with the case shown in FIG. 3, the embodiment of the present disclosure additionally avoids backfilling the mispredicted request.
For example, an access request that follows the target access request is an access request that is sent to the lost state processing queue later than the target access request. For example, "located after a target access request" is not a description of physical location, but rather a description of chronological order, which is used to indicate the chronological order in which individual access requests were sent to the lost state processing queue, consistent with the meaning of the chronological order in the queue in conventional computer technology.
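A minimal sketch of steps S12 and S13 under the assumed entry layout above (redeclared here so the snippet stands alone): on a misprediction, every entry written after the target access request is released, whether or not its request has already been sent to the next-level cache. The function and field names are assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal MSHQ entry for this sketch (names are assumptions).
struct MshqEntry {
    uint32_t miss_address = 0;
    bool     sent_to_next_level = false;
    bool     valid = false;          // false == queue space released / idle
};

// Entries are stored in write order (index 0 is the oldest). On a
// misprediction of the instruction behind entry `target_index`, release every
// entry written after it -- regardless of whether that entry's request has
// already been sent to the next-level cache. This is the difference from the
// baseline scheme sketched earlier, which skipped already-sent requests.
void cancel_after_target(std::vector<MshqEntry>& queue, std::size_t target_index) {
    for (std::size_t i = target_index + 1; i < queue.size(); ++i) {
        queue[i].valid = false;      // mark the space idle so new requests can reuse it
    }
}
```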
By the method, the requests corresponding to the instructions with the prediction errors can be completely canceled, so that the instructions with the prediction errors are avoided from being prefetched, and the prefetching precision is improved. In addition, more positions in the lost state processing queue are empty, the released queue space after cancellation can be used for storing new access requests, and more new and effective access requests can be allowed to be written into the lost state processing queue in advance, so that the utilization rate of the lost state processing queue is improved, and the situation that the request corresponding to the instruction with the prediction error occupies the queue space of the lost state processing queue is avoided. The instruction prefetching method provided by the embodiment of the disclosure provides a strategy capable of improving the instruction prefetching precision, and is beneficial to improving the overall performance of a system.
FIG. 6 is a flow chart illustrating another method of instruction prefetching according to some embodiments of the present disclosure. As shown in fig. 6, in some examples, the instruction pre-fetching method may further include the following operations.
Step S14, the lost state processing queue sequentially sends a plurality of access requests to the next-level cache, and the next-level cache sequentially responds to each access request;
Step S15, comparing first preset information in data returned by the next-level cache with second preset information stored, in the lost state processing queue, in the queue space corresponding to the access request to which the data belongs;
Step S16, in response to the first preset information matching the second preset information, sending the data to the target cache and releasing the queue space corresponding, in the lost state processing queue, to the access request to which the data belongs, where the data includes the target instruction;
Step S17, in response to the first preset information not matching the second preset information, sending a coherency maintenance request to the next-level cache;
Step S18, in response to the coherency maintenance request, the next-level cache discards the data;
And step S19, in response to the coherency maintenance request, the next-level cache updates the flag information in the entry corresponding to the data to an invalid state.
For example, in step S14, the lost state processing queue sequentially sends the multiple access requests to the next-level cache, and the next-level cache responds to each access request in turn, for example by sequentially providing the data corresponding to each access request; the data includes the instruction corresponding to each access request.
For example, in step S15, the first preset information in the data returned by the next-level cache is compared with the second preset information stored, in the lost state processing queue, in the queue space corresponding to the access request to which the data belongs. In the aforementioned step S13, if a misprediction occurred, the lost state processing queue received the cancel request and, in response, released the queue space occupied by the access requests located after the target access request, i.e., cancelled those access requests. Among these cancelled access requests there may be requests that have already been sent to the next-level cache, so in step S15 a comparison must be made on the data returned for each access request, to prevent data belonging to a cancelled access request from being backfilled into the target cache.
For example, the first preset information may include the address information of the data. For example, the second preset information may include the address information stored, in the lost state processing queue, in the queue space corresponding to the access request to which the data belongs, and/or flag information indicating an idle state. That is, if the queue space corresponding to that access request has not been released in the lost state processing queue, the second preset information may be the address information stored in that queue space; if the queue space has been released, the second preset information may be flag information indicating an idle state. By comparing the first preset information with the second preset information, it can be determined whether the access request to which the data belongs has been cancelled.
For example, in step S16, if the first preset information matches the second preset information, that is, the address indicated by the first preset information is the same as the address indicated by the second preset information, it indicates that the corresponding access request is not canceled, and therefore, the corresponding data needs to be backfilled into the target cache. For example, the data that needs to be backfilled includes target instructions. After the data is stored in the target cache, the corresponding queue space of the access request corresponding to the data in the lost state processing queue is released, namely, the corresponding access request is deleted. Thus, prefetching of instructions may be accomplished in the manner described above.
For example, if the "update first" method is adopted, storing the data in the target cache may include sending the data to the target cache and storing the data in a cache way corresponding to the target cache based on the location information stored in the corresponding queue space in the lost state processing queue for the access request corresponding to the data. For example, the above-described location information indicates in which cache way of the target cache the corresponding data should be stored. For example, if the "post update" mode is used, the location information does not need to be acquired in the lost state processing queue.
For example, in step S17, if the first preset information does not match the second preset information, that is, the first preset information indicates an address and the second preset information indicates flag information of an idle state, or the address indicated by the first preset information is different from the address indicated by the second preset information, it indicates that the corresponding access request has been canceled, so that it is not necessary to backfill the corresponding data into the target cache. At this time, in order to maintain cache coherency, a coherency-maintaining request needs to be sent to the next-level cache. For example, when different processor cores or threads have their own caches, it is ensured that content modifications to an address by one of the cores or threads can be "seen" by the other cores or threads to ensure proper execution of the program, which is referred to as cache coherency.
It should be noted that the first preset information not matching the second preset information may mean that their values differ, that their types differ, or other situations; as long as the two are not the same, they can be considered as not matching. The first preset information is not limited to address information, and the second preset information is not limited to address information or flag information indicating an idle state; the first preset information and the second preset information may be any type of information, as long as the state of the access request can be confirmed (i.e., whether it has been cancelled), and the embodiments of the present disclosure are not limited in this respect.
For example, in some examples, if the corresponding access request in the lost state processing queue is canceled and the corresponding queue space has not been written with a new request, then flag information indicating an idle state may be stored in the queue space, that is, the second preset information is flag information indicating an idle state. The first preset information is address information, so that the first preset information and the second preset information are different in type and are not matched. For example, in other examples, if the corresponding access request in the lost state processing queue is canceled and the corresponding queue space is written with a new request, then the second preset information stored in the queue space is the address to which the new request corresponds, and thus the first preset information and the second preset information are different in value (specific address is different), and the two do not match. Of course, other situations may exist and are not listed here.
For example, in step S18, the next level cache discards the data in response to the coherency maintenance request. In order to maintain cache coherency, the data may be discarded since it does not need to be backfilled to the target cache.
For example, in step S19, in response to the coherency maintenance request, the next-level cache updates the flag information in the entry corresponding to the data to an invalid state. In order to maintain cache coherency, since the data does not need to be backfilled to the target cache, the next-level cache updates the flag information in the entry corresponding to the data to an invalid state, thereby invalidating the data. For example, flag information indicating whether the data of an address is stored in the target cache may be added to each entry of the next-level cache: if the data of the address is stored in the target cache, the flag information is in a valid state; if not, the flag information is in an invalid state.
It should be noted that, the above step S18 and step S19 provide two different ways to maintain cache consistency. Embodiments of the present disclosure are not limited thereto and may also employ other ways to maintain cache coherency, as may be practical.
In the above manner, when the next-level cache has prepared the data and is ready to backfill, the location previously occupied in the lost state processing queue by the corresponding access request is queried first and the addresses are compared. If that location is no longer occupied, or the address of a newly written access request does not match the backfill address, the data from the next-level cache is not backfilled into the target cache, and an operation to maintain cache coherency is performed instead. If the access request address at that location matches the backfill address, the data from the next-level cache is backfilled into the target cache. Adding this address check before backfilling ensures that data is backfilled correctly when the prediction is correct, and that cache coherency is maintained when the prediction is wrong.
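The address check of steps S15 to S19 can be sketched as follows, assuming the returned data carries its address as the "first preset information" and the queue space holds the stored address or an idle flag as the "second preset information"; the names are illustrative assumptions.

```cpp
#include <cstdint>

// Outcome of the pre-backfill check described in steps S15-S17.
enum class BackfillDecision {
    BackfillToTargetCache,    // information matches: the request was not cancelled
    SendCoherencyRequest      // mismatch: do not backfill; the next-level cache
                              // should drop the data (step S18) or mark its
                              // entry's flag invalid (step S19)
};

// Minimal view of the queue space the returned data points back to
// (field names are assumptions).
struct QueueSpace {
    uint32_t stored_address = 0;
    bool     idle           = false;   // true if the space was released
};

// `returned_address` is the address carried with the data returned by the
// next-level cache.
BackfillDecision check_before_backfill(const QueueSpace& space,
                                       uint32_t returned_address) {
    if (space.idle || space.stored_address != returned_address)
        return BackfillDecision::SendCoherencyRequest;
    return BackfillDecision::BackfillToTargetCache;
}
```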
Fig. 7 is a flow chart illustrating another instruction prefetching method according to some embodiments of the present disclosure. For example, as shown in fig. 7, in some examples, the instruction pre-fetching method may further include the following operations.
Step S21, selecting a cache way in the target cache;
Step S22, updating the latest access information of the selected cache way before the next-level cache returns the data;
And step S23, when the next-level cache returns the data, storing the target instruction contained in the data into the selected cache way.
For example, steps S21-S23 may be the "update-first" approach described above, i.e., the latest access information is updated first, and then the data is prefetched and backfilled.
For example, step S21 may include taking the cache way that has not been used for the longest time (Least Recently Used) as the selected cache way, according to the latest access information of each cache way in the target cache. For example, after the cache way is selected, in step S22 the latest access information of the selected cache way is updated before the next-level cache returns the data, for example by marking that "way's" latest access information as the maximum value and decrementing the latest access information of the remaining "ways" by 1, down to a minimum of 0. For example, in step S23, when the next-level cache returns the data, the target instruction contained in the data is stored into the selected cache way. In this way, the "update-first" manner is realized.
Fig. 8 is a flow chart illustrating another instruction prefetching method according to some embodiments of the present disclosure. For example, as shown in fig. 8, in some examples, the instruction pre-fetching method may further include the following operations.
Step S24, selecting a cache way in the target cache;
And step S25, when the next-level cache returns the data, storing the target instruction contained in the data into the selected cache way and updating the latest access information of the selected cache way.
For example, steps S24-S25 described above may be in the "post-update" manner described above, i.e., the most recently accessed information is updated while the data is being backfilled.
For example, step S24 may include taking the cache way that has not been used for the longest time (Least Recently Used) as the selected cache way, according to the latest access information of each cache way in the target cache. For example, in step S25, when the next-level cache returns the data, the target instruction contained in the data is stored into the selected cache way and the latest access information of the selected cache way is updated; that is, the latest access information is updated at the same time as the data is backfilled. In this way, the "post-update" manner is realized, and erroneous updates of the latest access information of the target cache can be avoided, for example erroneous updates of the latest access information of the first-level instruction cache.
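The two manners can be contrasted in a single sketch covering steps S21 to S23 ("update-first") and steps S24 to S25 ("post-update"); the per-set structure and function names are assumptions, and the recency scheme is the same as in the earlier sketch, repeated here so this snippet stands alone.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Per-set state for this sketch: 4 data slots plus recency counters
// (0 = least recently used). Names are illustrative assumptions.
struct CacheSet {
    std::array<uint32_t, 4> data{};
    std::array<unsigned, 4> recency{ {0, 1, 2, 3} };

    std::size_t victim() const {
        for (std::size_t w = 0; w < recency.size(); ++w)
            if (recency[w] == 0) return w;
        return 0;
    }
    void touch(std::size_t w) {
        for (std::size_t i = 0; i < recency.size(); ++i)
            if (i != w && recency[i] > 0) --recency[i];
        recency[w] = static_cast<unsigned>(recency.size() - 1);
    }
};

// "Update-first" (steps S21-S23): choose the way and update its recency as
// soon as the miss is detected; the chosen way is recorded (e.g. in the MSHQ
// entry) so the later backfill knows where to write.
std::size_t on_miss_update_first(CacheSet& set) {
    std::size_t way = set.victim();
    set.touch(way);                 // recency updated before the data returns
    return way;                     // saved for the backfill
}
void on_backfill_update_first(CacheSet& set, std::size_t saved_way, uint32_t value) {
    set.data[saved_way] = value;    // no further recency update needed
}

// "Post-update" (steps S24-S25): nothing changes at miss time; the way is
// chosen and the recency updated only when the data actually returns, so a
// request cancelled on a mispredicted path never disturbs the recency state.
void on_backfill_post_update(CacheSet& set, uint32_t value) {
    std::size_t way = set.victim();
    set.data[way] = value;
    set.touch(way);
}
```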
It should be noted that, in the embodiment of the present disclosure, the manner of maintaining cache consistency is not limited to the two manners described above in connection with steps S18 and S19, but may be any other applicable manner, which may be determined according to actual needs, and the embodiment of the present disclosure is not limited thereto.
It should be noted that, the instruction prefetching method provided by the embodiment of the present disclosure is not limited to the steps and sequences described above, and may include more or fewer steps, and the execution sequence of each step may be determined according to the actual needs, which is not limited by the embodiment of the present disclosure. The instruction prefetching method may be used in the cache architecture shown in fig. 1, and may be used in any other applicable cache architecture, which embodiments of the present disclosure do not limit.
At least one embodiment of the present disclosure also provides an instruction prefetching apparatus. The instruction prefetching device can improve prefetching precision, improve the utilization rate of the lost state processing queue and is beneficial to improving overall performance.
Fig. 9 is a schematic block diagram of an instruction prefetching apparatus provided in some embodiments of the present disclosure. As shown in fig. 9, the instruction prefetch apparatus 10 includes a request write unit 11, a request cancel unit 12, and a request processing unit 13.
The request writing unit 11 is configured to write a target access request for a target instruction to the lost state processing queue in response to the target instruction not hitting in the target cache. For example, the lost state processing queue includes a plurality of access requests, and the target access request is one of the plurality of access requests. The lost state processing queue is configured to sequentially send a plurality of access requests to a next level cache of the target cache. For example, the request writing unit 11 may perform step S11 in the instruction prefetch method shown in fig. 4.
The request cancellation unit 12 is configured to send a cancellation request for a target instruction to the lost state processing queue in response to the target instruction prediction error. For example, the request canceling unit 12 may perform step S12 in the instruction prefetch method shown in fig. 4.
The request processing unit 13 is configured to release, in response to the cancellation request, a queue space occupied by an access request located after the target access request in the lost state processing queue. For example, the request processing unit 13 may execute step S13 in the instruction prefetch method shown in fig. 4.
For example, in some examples, at least one of the access requests that is located after the target access request has been sent to the next level cache. For example, an access request that follows the target access request is an access request that is sent to the lost state processing queue later than the target access request. For example, "located after a target access request" is not a description of physical location, but rather a description of chronological order, which is used to indicate the chronological order in which individual access requests were sent to the lost state processing queue, consistent with the meaning of the chronological order in the queue in conventional computer technology.
By the method, the requests corresponding to the instructions with the prediction errors can be completely canceled, so that the instructions with the prediction errors are avoided from being prefetched, and the prefetching precision is improved. In addition, more positions in the lost state processing queue are empty, the released queue space after cancellation can be used for storing new access requests, and more new and effective access requests can be allowed to be written into the lost state processing queue in advance, so that the utilization rate of the lost state processing queue is improved, and the situation that the request corresponding to the instruction with the prediction error occupies the queue space of the lost state processing queue is avoided. The instruction prefetching device provided by the embodiment of the disclosure provides a strategy capable of improving the instruction prefetching precision, and is beneficial to improving the overall performance of a system.
For example, the lost state processing queue sequentially sends a plurality of access requests to the next-level cache, and the next-level cache sequentially responds to the respective access requests.
For example, the instruction pre-fetch apparatus 10 may further include a comparison unit, a backfill unit, and a coherency processing unit. For example, the instruction pre-fetch apparatus 10 may further include a first processing unit or a second processing unit.
The comparison unit is configured to compare the first preset information in the data returned by the next-level cache with the second preset information stored, in the lost state processing queue, in the queue space corresponding to the access request to which the data belongs. The backfill unit is configured to, in response to the first preset information matching the second preset information, send the data to the target cache and release the queue space corresponding, in the lost state processing queue, to the access request to which the data belongs. For example, the data includes the target instruction. The coherency processing unit is configured to, in response to the first preset information not matching the second preset information, send a coherency maintenance request to the next-level cache.
The first processing unit is configured to cause the next level cache to discard data in response to the coherency maintenance request. The second processing unit is configured to cause the next-level cache to update the flag information in the entry corresponding to the data to an invalid state in response to the coherency maintenance request.
For example, the first preset information includes the address information of the data. The second preset information includes the address information stored, in the lost state processing queue, in the queue space corresponding to the access request to which the data belongs, and/or flag information indicating an idle state.
For example, the target cache includes a first level instruction cache and the next level cache includes a second level cache.
For example, the request processing unit 13 may include a first subunit or a second subunit. The first subunit is configured to empty the contents of the lost state processing queue in a queue space occupied by an access request located after the target access request. The second subunit is configured to update the state identification of the queue space occupied by the access request located after the target access request in the lost state processing queue to an idle state.
For example, in some examples, the instruction pre-fetch apparatus 10 may further include a selection unit, an update unit, a return unit. The selection unit is configured to select one cache way in the target cache. The updating unit is configured to update the latest access information of the selected cache way before the next level of cache returns data. The return unit is configured to store a target instruction contained in the data into the selected cache way when the next level of cache returns the data.
For example, in other examples, the instruction pre-fetch apparatus 10 may further include a selection unit and an update return unit. The selection unit is configured to select a cache way in the target cache. The update return unit is configured to, when the next-level cache returns the data, store the target instruction contained in the data into the selected cache way and update the latest access information of the selected cache way.
For example, the selection unit includes a selection subunit configured to take a cache way that is not used for the longest time as a selected cache way according to the latest access information of each cache way in the target cache.
For example, the various units and sub-units described above may be hardware, software, firmware, and any feasible combination thereof. For example, each of the above units and sub-units may be dedicated or general-purpose circuits, chips, devices, or the like, or may be a combination of a processor and a memory. With respect to the specific implementation forms of the individual units and sub-units described above, embodiments of the disclosure are not limited in this regard.
It should be noted that, in the embodiments of the present disclosure, each unit of the instruction prefetching apparatus 10 corresponds to a step of the instruction prefetching method; for the specific functions and technical effects of the instruction prefetching apparatus 10, reference may be made to the related description of the instruction prefetching method, which is not repeated here. The components and structures of the instruction prefetching apparatus 10 shown in fig. 9 are merely exemplary and not limiting; the instruction prefetching apparatus 10 may also include other components and structures as needed.
At least one embodiment of the present disclosure also provides a processor. The processor can improve prefetching precision, increase the utilization of the lost state processing queue, and help improve overall performance.
Fig. 10 is a schematic block diagram of a processor provided in some embodiments of the present disclosure. As shown in fig. 10, the processor 20 includes a target cache 21, a next level cache 22 of the target cache, and a lost state processing queue 23.
The lost state processing queue 23 is configured to receive a target access request for a target instruction in response to the target instruction missing in the target cache 21, and to sequentially send a plurality of access requests to the next-level cache 22. For example, the lost state processing queue 23 includes a plurality of access requests, and the target access request is one of them. The lost state processing queue 23 is further configured to receive a cancel request for the target instruction in response to a misprediction of the target instruction, and, in response to the cancel request, to release the queue space occupied by the access requests located after the target access request. The processor 20 may be a central processing unit (CPU), a graphics processing unit (GPU), or another form of processing unit having data processing capability and/or program execution capability; for example, the CPU may be of an X86 or ARM architecture. The processor 20 may be a general-purpose processor or a special-purpose processor.
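As a simplified, non-authoritative model of how the lost state processing queue 23 cooperates with the two cache levels, consider the following C++ sketch; the fixed-slot layout, the class name LostStateQueue, and all field and function names are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Illustrative slot of the lost state processing queue.
struct QueueSlot {
    bool     idle = true;    // state identification of the slot
    bool     issued = false; // already sent to the next-level cache?
    uint64_t line_addr = 0;  // address of the missing instruction line
};

class LostStateQueue {
public:
    explicit LostStateQueue(std::size_t slots) : slots_(slots) {}

    // On a miss in the target cache, the target access request is written into the next slot.
    // (Overflow handling is omitted in this sketch.)
    std::size_t write(uint64_t line_addr) {
        std::size_t idx = tail_++;
        slots_[idx] = {false, false, line_addr};
        return idx;  // position of the target access request
    }

    // Requests are sent to the next-level cache in queue order.
    void issue_in_order(const std::function<void(uint64_t)>& send_to_next_level) {
        for (auto& s : slots_)
            if (!s.idle && !s.issued) { send_to_next_level(s.line_addr); s.issued = true; }
    }

    // On a misprediction of the target instruction, release the queue space of
    // every request written after the target one.
    void cancel_after(std::size_t target_idx) {
        for (std::size_t i = target_idx + 1; i < tail_; ++i)
            slots_[i].idle = true;
        tail_ = target_idx + 1;
    }

private:
    std::vector<QueueSlot> slots_;
    std::size_t            tail_ = 0;
};
```

In this sketch, requests already issued to the next-level cache are not recalled when cancel_after runs; their returned data would simply fail the later preset-information comparison and trigger the coherence maintenance path described above.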
It should be noted that, in the embodiments of the present disclosure, the operation of the processor 20 corresponds to the steps of the instruction prefetching method; for the specific functions and technical effects of the processor 20, reference may be made to the related description of the instruction prefetching method, which is not repeated here. The components and structures of the processor 20 shown in fig. 10 are merely exemplary and not limiting; the processor 20 may also include other components and structures as needed.
At least one embodiment of the present disclosure also provides an electronic device, which includes the instruction prefetching apparatus provided in any embodiment of the present disclosure. The electronic device can improve prefetching precision, increase the utilization of the lost state processing queue, and help improve overall performance.
Fig. 11 is a schematic block diagram of an electronic device provided in some embodiments of the present disclosure. As shown in fig. 11, the electronic device 30 includes an instruction prefetching apparatus 31, and the instruction prefetching apparatus 31 may be the instruction prefetching apparatus 10 shown in fig. 9. For example, electronic device 30 may be any device having data processing capabilities and/or program execution capabilities, as embodiments of the present disclosure are not limited in this regard. For specific functions and technical effects of the electronic device 30, reference is made to the description of the instruction prefetching apparatus 10 above, and details thereof are not repeated here.
At least one embodiment of the present disclosure also provides an electronic device, which includes the processor provided in any embodiment of the present disclosure. The electronic device can improve prefetching precision, increase the utilization of the lost state processing queue, and help improve overall performance.
Fig. 12 is a schematic block diagram of another electronic device provided by some embodiments of the present disclosure. As shown in fig. 12, the electronic device 40 includes a processor 41, and the processor 41 may be the processor 20 shown in fig. 10. For example, electronic device 40 may be any device having data processing capabilities and/or program execution capabilities, as embodiments of the present disclosure are not limited in this regard. Reference may be made to the above description regarding the processor 20 for specific functions and technical effects of the electronic device 40, and no further description is given here.
Finally, the following points should be noted:
(1) The drawings of the embodiments of the present disclosure relate only to the structures to which the embodiments of the present disclosure relate, and reference may be made to the general design for other structures.
(2) In the absence of conflict, the embodiments of the present disclosure and the features in the embodiments may be combined with each other to obtain new embodiments.
The foregoing describes only specific embodiments of the disclosure, but the scope of the disclosure is not limited thereto; the scope of the disclosure shall be determined by the claims.

Claims (15)

1. An instruction prefetching method, comprising:
in response to a target instruction missing in a target cache, writing a target access request for the target instruction into a lost state processing queue, wherein the lost state processing queue comprises a plurality of access requests, the target access request is one of the plurality of access requests, and the lost state processing queue is configured to sequentially send the plurality of access requests to a next-level cache of the target cache;
in response to a misprediction of the target instruction, sending a cancel request for the target instruction to the lost state processing queue; and
in response to the cancel request, releasing the queue space occupied by access requests located after the target access request in the lost state processing queue;
wherein the method further comprises:
sequentially sending, by the lost state processing queue, the plurality of access requests to the next-level cache, the next-level cache responding to each access request in turn;
comparing first preset information in data returned by the next-level cache with second preset information stored in the queue space of the lost state processing queue corresponding to the access request to which the data corresponds;
in response to the first preset information matching the second preset information, sending the data to the target cache and releasing the queue space corresponding to the access request to which the data corresponds in the lost state processing queue, wherein the data comprises the target instruction; and
in response to the first preset information not matching the second preset information, sending a coherence maintenance request to the next-level cache.
2. The method of claim 1, wherein at least one of the access requests following the target access request has been sent to the next level cache.
3. The method of claim 2, wherein the access request following the target access request is an access request sent to the lost state processing queue later than the target access request.
4. The method of claim 1, further comprising:
discarding, by the next-level cache, the data in response to the coherence maintenance request.
5. The method of claim 1, further comprising:
updating, by the next-level cache, the flag information in the entry corresponding to the data to an invalid state in response to the coherence maintenance request.
6. The method of claim 1, wherein the first preset information includes address information of the data;
the second preset information comprises address information of the access request corresponding to the data and/or flag information representing an idle state, which are stored in the corresponding queue space of the lost state processing queue.
7. The method of claim 1, wherein the target cache comprises a first level instruction cache and the next level cache comprises a second level cache.
8. The method of claim 1, wherein releasing the queue space occupied by access requests located after the target access request in the lost state processing queue comprises:
emptying the contents of the queue space occupied by the access requests located after the target access request in the lost state processing queue; or
updating the state identification of the queue space occupied by the access requests located after the target access request in the lost state processing queue to an idle state.
9. The method of claim 1, further comprising:
selecting one cache way in the target cache;
updating the latest access information of the selected cache way before the next-level cache returns data; and
storing, when the next-level cache returns the data, the target instruction contained in the data into the selected cache way.
10. The method of claim 1, further comprising:
selecting one cache way in the target cache; and
storing, when the next-level cache returns data, the target instruction contained in the data into the selected cache way, and updating the latest access information of the selected cache way.
11. The method of claim 9 or 10, wherein selecting one cache way in the target cache comprises:
taking, according to the latest access information of each cache way in the target cache, the cache way that has not been used for the longest time as the selected cache way.
12. An instruction prefetching apparatus comprising:
a request writing unit, configured to write a target access request for a target instruction into a lost state processing queue in response to the target instruction missing in a target cache, wherein the lost state processing queue comprises a plurality of access requests, the target access request is one of the plurality of access requests, and the lost state processing queue is configured to sequentially send the plurality of access requests to a next-level cache of the target cache, the next-level cache responding to each access request in turn;
a request cancellation unit, configured to send a cancel request for the target instruction to the lost state processing queue in response to a misprediction of the target instruction; and
a request processing unit, configured to release, in response to the cancel request, the queue space occupied by access requests located after the target access request in the lost state processing queue;
wherein the apparatus further comprises:
a comparison unit, configured to compare first preset information in data returned by the next-level cache with second preset information stored in the queue space of the lost state processing queue corresponding to the access request to which the data corresponds;
a backfilling unit, configured to, in response to the first preset information matching the second preset information, send the data to the target cache and release the queue space corresponding to the access request to which the data corresponds in the lost state processing queue, wherein the data comprises the target instruction; and
a coherence processing unit, configured to send a coherence maintenance request to the next-level cache in response to the first preset information not matching the second preset information.
13. A processor, comprising a target cache, a next-level cache of the target cache, and a lost state processing queue, wherein
the lost state processing queue is configured to receive a target access request for a target instruction in response to the target instruction missing in the target cache, and to sequentially send a plurality of access requests to the next-level cache;
the lost state processing queue comprises the plurality of access requests, and the target access request is one of the plurality of access requests;
the lost state processing queue is further configured to receive a cancel request for the target instruction in response to a misprediction of the target instruction, and to release, in response to the cancel request, the queue space occupied by access requests located after the target access request; and
the processor is further configured to: compare first preset information in data returned by the next-level cache with second preset information stored in the queue space of the lost state processing queue corresponding to the access request to which the data corresponds; in response to the first preset information matching the second preset information, send the data to the target cache and release the queue space corresponding to the access request to which the data corresponds in the lost state processing queue, wherein the data comprises the target instruction; and in response to the first preset information not matching the second preset information, send a coherence maintenance request to the next-level cache.
14. An electronic device comprising the instruction prefetching apparatus of claim 12.
15. An electronic device comprising the processor of claim 13.
CN202111671514.6A 2021-12-31 2021-12-31 Instruction prefetching method, instruction prefetching device, processor and electronic equipment Active CN114327641B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111671514.6A CN114327641B (en) 2021-12-31 2021-12-31 Instruction prefetching method, instruction prefetching device, processor and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111671514.6A CN114327641B (en) 2021-12-31 2021-12-31 Instruction prefetching method, instruction prefetching device, processor and electronic equipment

Publications (2)

Publication Number Publication Date
CN114327641A CN114327641A (en) 2022-04-12
CN114327641B 2025-08-19

Family

ID=81021023

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111671514.6A Active CN114327641B (en) 2021-12-31 2021-12-31 Instruction prefetching method, instruction prefetching device, processor and electronic equipment

Country Status (1)

Country Link
CN (1) CN114327641B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114816533B (en) * 2022-04-29 2025-08-26 北京奕斯伟计算技术股份有限公司 Instruction processing method, processor, device and storage medium
CN114721727B (en) * 2022-06-10 2022-09-13 成都登临科技有限公司 Processor, electronic equipment and multithreading shared instruction prefetching method
CN117453435B (en) * 2023-12-20 2024-03-15 北京开源芯片研究院 Cache data reading method, device, equipment and storage medium
CN117573573B (en) * 2024-01-15 2024-04-23 北京开源芯片研究院 Processing method, device, equipment and storage medium for cache request

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110737475A (en) * 2019-09-29 2020-01-31 上海高性能集成电路设计中心 instruction buffer filling filter
CN110825442A (en) * 2019-04-30 2020-02-21 海光信息技术有限公司 Instruction prefetching method and processor
CN112527395A (en) * 2020-11-20 2021-03-19 海光信息技术股份有限公司 Data prefetching method and data processing apparatus

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6012134A (en) * 1998-04-09 2000-01-04 Institute For The Development Of Emerging Architectures, L.L.C. High-performance processor with streaming buffer that facilitates prefetching of instructions
US6438656B1 (en) * 1999-07-30 2002-08-20 International Business Machines Corporation Method and system for cancelling speculative cache prefetch requests
CN102446087B (en) * 2010-10-12 2014-02-26 无锡江南计算技术研究所 Instruction prefetching method and device
CN111782698A (en) * 2020-07-03 2020-10-16 广州探途网络技术有限公司 Cache updating method and device and electronic equipment
CN112579175B (en) * 2020-12-14 2023-03-31 成都海光微电子技术有限公司 Branch prediction method, branch prediction device and processor core
CN113342709B (en) * 2021-06-04 2023-02-21 海光信息技术股份有限公司 Method for accessing data in multiprocessor system and multiprocessor system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110825442A (en) * 2019-04-30 2020-02-21 海光信息技术有限公司 Instruction prefetching method and processor
CN110737475A (en) * 2019-09-29 2020-01-31 上海高性能集成电路设计中心 instruction buffer filling filter
CN112527395A (en) * 2020-11-20 2021-03-19 海光信息技术股份有限公司 Data prefetching method and data processing apparatus

Also Published As

Publication number Publication date
CN114327641A (en) 2022-04-12

Similar Documents

Publication Publication Date Title
CN114327641B (en) Instruction prefetching method, instruction prefetching device, processor and electronic equipment
US7657726B2 (en) Context look ahead storage structures
US7917701B2 (en) Cache circuitry, data processing apparatus and method for prefetching data by selecting one of a first prefetch linefill operation and a second prefetch linefill operation
US7493480B2 (en) Method and apparatus for prefetching branch history information
US6678795B1 (en) Method and apparatus for memory prefetching based on intra-page usage history
US5790823A (en) Operand prefetch table
US7783837B2 (en) System and storage medium for memory management
US7895399B2 (en) Computer system and control method for controlling processor execution of a prefetech command
EP1573555B1 (en) Page descriptors for prefetching and memory management
US20080189487A1 (en) Control of cache transactions
US20110072218A1 (en) Prefetch promotion mechanism to reduce cache pollution
EP0507066A1 (en) Ownership interlock for cache data units
EP0106668A2 (en) Computer system with multiple operating systems
US20090106499A1 (en) Processor with prefetch function
US11249762B2 (en) Apparatus and method for handling incorrect branch direction predictions
JP2003005956A (en) Branch prediction device, processor, and branch prediction method
GB2502663A (en) Handling of Deallocation Requests and Castouts in System Having Upper and Lower Level Caches
US20070186048A1 (en) Cache memory and control method thereof
US6668307B1 (en) System and method for a software controlled cache
US7711904B2 (en) System, method and computer program product for executing a cache replacement algorithm
US6959363B2 (en) Cache memory operation
CN118550853B (en) Cache replacement method and device, electronic equipment and readable storage medium
US10922082B2 (en) Branch predictor
CN110737475B (en) Instruction cache filling and filtering device
US11061824B2 (en) Deferring cache state updates in a non-speculative cache memory in a processor-based system in response to a speculative data request until the speculative data request becomes non-speculative

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant