Detailed Description
For the purpose of making the objects, technical solutions, and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present disclosure. It will be apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments, which can be obtained by one of ordinary skill in the art without inventive effort based on the described embodiments of the present disclosure, fall within the scope of the present disclosure.
Unless defined otherwise, technical or scientific terms used herein shall have the ordinary meaning understood by one of ordinary skill in the art to which this disclosure belongs. The terms "first," "second," and the like, as used in this disclosure, do not denote any order, quantity, or importance, but are merely used to distinguish one element from another. Likewise, the words "comprising," "comprises," and the like mean that the element or item preceding the word covers the elements or items listed after the word and equivalents thereof, without excluding other elements or items. The terms "connected," "coupled," and the like are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Upper," "lower," "left," "right," etc. are used merely to indicate relative positional relationships, which may change accordingly when the absolute position of the described object changes.
In existing processor (also referred to herein as processor core or CPU core) architectures, both programs and data are stored in memory (e.g., DRAM), so programs contain a large number of memory read instructions (Load instructions). Because the operating frequency of the processor core is far higher than that of the memory, hundreds of processor core clock cycles are required to acquire data from the memory, which often leaves the processor core idle because it cannot continue executing dependent instructions, causing performance loss. High-performance processor cores typically include multiple levels of caches (e.g., a level one cache L1, a level two cache L2, etc.) to reduce the latency of memory accesses and speed up operation of the processor core; however, when reading data that has never been accessed, or data that has been evicted due to cache capacity limits, the processor core still needs to wait tens or even hundreds of clock cycles, which results in performance loss.
In a multi-level cache, the level one cache (L1 Cache) is typically integrated directly within the processor core and operated directly by it. The level one cache is very fast but has a relatively small capacity, typically between a few KB and a few tens of KB, and is further divided into an instruction cache (IC, for storing instructions to be executed) and a data cache (DC, for storing data being processed); it holds the data and instructions most frequently accessed by the processor core, to reduce the number of reads from slower caches or from memory. The level two cache (L2 Cache, or L2 for short) may be integrated in the processor core or located outside it; it is slower than the level one cache but still fast enough to respond quickly to read/write requests of the CPU, has a larger capacity than the level one cache, and is generally not subdivided into an instruction cache and a data cache, so it can store data and instructions that cannot be found in the level one cache but were used recently or may be used again soon. The level three cache (L3 Cache) is usually located in the CPU package but may be separate from the processor core, or shared in a multi-core processor; it is slower than the level two cache but still much faster than main memory, has a larger capacity than the level two cache, typically between a few MB and tens of MB, and further expands the cache capacity by storing data that cannot be found in the level two cache but may be accessed again in the near future. In multi-core processors, the level three cache is typically designed as a shared cache accessible to all cores, to reduce cache coherency issues.
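The latency relationship described above can be illustrated with a minimal sketch. This is not part of the disclosure: the cycle counts and the fill-on-hit policy are illustrative assumptions only, chosen to show why a hit at a faster level avoids the much larger latency of the levels below it.

```python
# Illustrative multi-level cache lookup. Latency values are hypothetical
# cycle counts, not from the disclosure.
L1_LAT, L2_LAT, L3_LAT, MEM_LAT = 4, 14, 40, 200

def access(addr, l1, l2, l3):
    """Return (data_source, total_latency_cycles) for a read of addr.

    Each cache level is a set of cached line addresses; a hit below
    fills the faster levels above it (a simple inclusive policy).
    """
    if addr in l1:
        return "L1", L1_LAT
    if addr in l2:
        l1.add(addr)                       # fill L1 on an L2 hit
        return "L2", L1_LAT + L2_LAT
    if addr in l3:
        l2.add(addr); l1.add(addr)         # fill L2 and L1 on an L3 hit
        return "L3", L1_LAT + L2_LAT + L3_LAT
    # Miss at every level: pay the full memory latency, then fill all levels.
    l3.add(addr); l2.add(addr); l1.add(addr)
    return "memory", L1_LAT + L2_LAT + L3_LAT + MEM_LAT
```

A first access to a never-seen address pays hundreds of cycles, while a repeat access hits the level one cache, which is the gap the prefetching techniques below aim to hide.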
For example, each cache may include a cache miss (Miss) queue: when a read/write request or prefetch request misses in the cache, the data must be read from the next-level cache or from memory, and the request and its corresponding attributes are held in the cache miss queue until the next-level cache or memory returns the data or instruction targeted by the request.
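A cache miss queue of the kind just described can be sketched as follows. This is a hypothetical illustration, not the disclosure's implementation: the class name, the merge-on-duplicate behavior, and the attribute format are assumptions.

```python
# Hypothetical sketch of a cache miss queue: a request that misses the
# cache is parked, with its attributes, until the next level returns the
# line; duplicate requests for the same line share one entry.
class MissQueue:
    def __init__(self):
        self.entries = {}  # line address -> list of waiting request attributes

    def allocate(self, line_addr, attrs):
        """Park a missing request. Returns True if a new entry was created,
        i.e. a request to the next-level cache or memory must actually be sent."""
        is_new = line_addr not in self.entries
        self.entries.setdefault(line_addr, []).append(attrs)
        return is_new

    def backfill(self, line_addr):
        """The next level returned the line: release all waiting requests."""
        return self.entries.pop(line_addr, [])
```

Only the first miss to a line triggers traffic to the next level; later requests for the same line simply wait on the existing entry until the backfill arrives.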
In addition, a high-performance processor core not only includes a multi-level cache architecture to store recently accessed data, but also uses a prefetcher that discovers the patterns of the processor's data and instruction accesses so as to prefetch the data and instructions to be accessed into a cache in advance. If instructions are prefetched, the corresponding operation is called instruction prefetching and the corresponding prefetcher is an instruction prefetcher; if data is prefetched, the corresponding prefetcher is a data prefetcher. The latter may be further subdivided into L1 prefetchers (prefetching into the level one cache), L2 prefetchers (prefetching into the level two cache), LLC data prefetchers (prefetching into the last level cache (Last Level Cache)), and so on, depending on the target cache level.
Currently, there are various methods for prefetching instructions, such as prefetching the next N cache lines or the next (+1) cache line consecutively, a method using branch prediction, an instruction prefetching method that fetches according to a specific pattern, a prefetching method based on the characteristics of the cache hierarchy in which the instruction is located, and so on. Some processors employ a branch-prediction-based instruction prefetch method (Fetch-Directed Instruction Prefetch, FDIP) that is capable of issuing instruction prefetch requests relatively accurately.
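The simplest of these, next-N-line prefetching, can be sketched in a few lines. This is an illustrative example, not from the disclosure; the 64-byte line size is an assumption.

```python
# Hypothetical sketch of next-N-line instruction prefetching: on a fetch
# of one cache line, the following N sequential lines are requested ahead
# of time, exploiting the mostly sequential layout of instruction streams.
LINE_SIZE = 64  # assumed cache line size in bytes

def next_n_line_prefetch(fetch_addr, n):
    """Return the line-aligned addresses of the next n cache lines."""
    line = fetch_addr // LINE_SIZE * LINE_SIZE   # align to the cache line
    return [line + (i + 1) * LINE_SIZE for i in range(n)]
```

Such sequential schemes are cheap but blind to control flow; FDIP instead follows the predicted branch path, which is why it issues prefetch requests more accurately.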
FIG. 2 illustrates a schematic diagram of an instruction fetch portion of a processor core that involves branch prediction.
As shown in FIG. 2, the front end of the processor core includes an instruction fetch unit, a decode unit, an execution unit, an instruction release unit, a branch prediction unit, an instruction cache unit (L1 instruction Cache), control logic, etc., with other components omitted. The control logic is disposed wherever corresponding control needs to be implemented, for example, branch prediction address queue write control, branch prediction address queue release control, prefetch control, instruction fetch control, instruction write-back control, miss instruction fetch control, etc. The processor core may include a built-in secondary cache unit (L2 Cache) or be coupled to an external secondary cache unit.
The instruction fetch unit fetches instructions based on the branch prediction unit and obtains instruction data, which it provides to the decode unit; after the decode unit decodes the instruction data, the result is provided to the execution unit for execution, and execution information is provided to the instruction release unit. The instruction release unit provides instruction release information, according to the related information submitted upon instruction commit, to control whether to release the branch prediction address information queue.
As shown in FIG. 2, the branch prediction unit is coupled to the instruction release unit and the instruction fetch unit, receives instruction release information from the instruction release unit, and provides branch prediction information to the instruction fetch unit after a branch prediction has been made. For example, the branch prediction unit includes a branch prediction component and a branch prediction address information queue, and the control logic accordingly provides branch prediction address queue release control and branch prediction address queue write control. In the branch prediction unit, the branch prediction component performs branch prediction, and the virtual addresses of a branch prediction fragment (i.e., an instruction fragment) obtained by the branch prediction are stored in the branch prediction address information queue based on the branch prediction address queue write control. The physical address after address translation may also be stored in the branch prediction address information queue at the same time. The branch prediction address information written into the branch prediction address information queue is also transferred to the instruction fetch unit as instruction fetch information.
As shown in FIG. 2, the instruction fetch unit is coupled to the instruction cache unit and includes a fetch information queue, a fetch request queue, etc., and accordingly the control logic provides fetch control, prefetch control, miss fetch control, etc.
After receiving the branch prediction address information from the branch prediction unit, the instruction fetch unit fills the instruction fetch information into the instruction fetch information queue, and may perform a prefetch operation based on prefetch control while filling the queue. In addition, an instruction prefetch operation, or an instruction fetch request to the secondary cache unit (L2 Cache) caused by a miss (Miss) when querying the instruction cache unit during normal instruction fetching, is stored in the instruction fetch request queue, and the secondary cache itself, a lower-level cache, or the memory is then waited on to return the required instruction data; the instruction fetch request caused by a miss of the instruction cache unit query is performed based on miss instruction fetch control.
In the process in which the secondary cache unit acquires instruction data and the instruction data is backfilled to the instruction cache unit through instruction write-back control, the instruction fetch information queue is awakened through the information stored in the instruction fetch request queue, so that the instruction data can be acquired from the instruction backfill bus by the instruction read operation corresponding to the instruction fetch information at the head of the instruction fetch information queue. Otherwise, if the currently backfilled instruction data does not correspond to the head of the instruction fetch information queue, the instruction data is first backfilled into the instruction cache unit and then read out from the instruction cache unit again to complete the corresponding instruction fetch operation.
In the architecture design implementing the branch-prediction-based instruction prefetching method described above, instruction data is either fetched from the instruction backfill bus of the secondary cache unit or read from the storage array (i.e., SRAM array) of the instruction cache unit itself. In theory, the best case is that the instruction data is taken directly from the backfill bus of the secondary cache unit, which reduces the power consumption of accessing the storage array of the instruction cache unit and allows the instruction data to be read earlier, thereby improving instruction fetch efficiency. In practice, however, because branch prediction runs far ahead, a large portion of the instruction data is stored into the cache early, so that the cache has to be read repeatedly in order to fetch the instruction data. This both increases the power consumption of read operations and reduces the instruction fetch efficiency of the CPU core. Moreover, branch mispredictions occur frequently in processors, and instruction prefetching based on branch prediction can hardly avoid prefetching instructions along the wrong path of a mispredicted branch, resulting in instruction cache pollution and thus more cache query misses. Furthermore, aggressive prefetch operations may pollute the instruction cache even more severely.
At least one embodiment of the present disclosure provides a processor, an instruction processing method, and an electronic device.
A processor in accordance with at least one embodiment of the present disclosure includes control logic, an instruction fetch unit, an instruction release unit, an instruction cache unit L1, and an instruction cache queue L0. The instruction fetch unit is configured to fetch an object instruction according to an instruction prefetch request and to cache the fetched object instruction in the instruction cache queue L0; the instruction release unit is configured to provide instruction release information; and the control logic is configured to fill the object instruction from the instruction cache queue L0 into the instruction cache unit L1 in response to the instruction release information.
Accordingly, an instruction processing method according to at least one embodiment of the present disclosure includes fetching an object instruction according to an instruction prefetch request, buffering the fetched object instruction in an instruction cache queue L0, and filling the object instruction from the instruction cache queue L0 into an instruction cache unit L1 in response to instruction release information from a processor pipeline.
The term "object instruction" is used herein to refer to any instruction that is the object of the description. The instruction cache unit L1 is configured to be directly accessed by the processor core, while the secondary cache, correspondingly, is not directly accessed by the processor core. For example, if a query of the instruction cache unit misses while the processor core is fetching, the instruction cache unit will request the required instruction data from the secondary cache and return it to the processor core after it is fetched from the secondary cache, i.e., the processor core accesses the secondary cache through the instruction cache unit.
The above-described embodiments of the present disclosure add an instruction cache queue (which may be understood herein as a "zero-level (L0) instruction cache") in addition to the instruction cache unit in a processor (or processor core), and in at least one example of an embodiment may further defer the time at which instructions are backfilled into the instruction cache unit, improving instruction fetch efficiency and avoiding pollution of the instruction cache unit by prefetched instructions. Because the instruction cache queue is added to the front end of the processor core, the control logic of the processor is correspondingly extended. This control logic in the processor core of embodiments of the present disclosure relates to the implementation of various functions in the processor and may be centralized or distributed.
A processor according to embodiments of the present disclosure may include a single processor core or multiple processor cores, and may be provided in the form of a separately packaged processor or integrated with other functional components in the form of a system on a chip (SoC), for example. For example, the processor core may employ a microarchitecture required by the x86, ARM, RISC-V, or MIPS instruction set, as embodiments of the present disclosure are not limited in this respect.
The processor and instruction processing method of the embodiments of the present disclosure will be described below with reference to specific examples.
FIG. 3 illustrates a schematic diagram of a processor core in accordance with at least one embodiment of the present disclosure, primarily illustrating the front end of the processor core, which relates to an instruction fetch portion involving branch prediction.
As shown in FIG. 3, the processor core includes an instruction fetch unit, a decode unit, an execution unit, an instruction release unit, a branch prediction unit, an instruction cache queue, control logic, etc., with other components omitted. The control logic is respectively disposed where corresponding control needs to be implemented, for example, branch prediction address queue write control, branch prediction address queue release control, prefetch control, instruction fetch control, instruction cache queue write-back control, instruction cache unit write-back control, miss (Miss) instruction fetch control, etc. The processor core may include a built-in secondary cache unit (or simply "secondary cache"), or be coupled to a secondary cache unit external to the processor core.
The instruction fetch unit fetches instructions based on the branch prediction unit and obtains instruction data, which it provides to the decode unit; after the decode unit decodes the instruction data, instruction execution completion information is provided to the instruction release unit, which provides instruction release information, according to the related information submitted upon instruction commit, to control whether to release the branch prediction address information queue.
In embodiments of the present disclosure, the control logic may be implemented by microprogram or by hard-wiring, as embodiments of the present disclosure are not limited in this respect. The execution unit includes a plurality of different types of functional units to handle different types of operations, such as integer operations, floating point operations, vector operations, etc.; for example, the execution unit may include an arithmetic logic unit (ALU), a floating point unit (FPU), a vector execution unit, a load/store unit (LSU), a special function execution unit, etc., as embodiments of the present disclosure are not limited in this respect.
As shown in FIG. 3, the branch prediction unit is coupled to the instruction release unit and the instruction fetch unit, receives instruction release information from the instruction release unit, and provides branch prediction information to the instruction fetch unit after a branch prediction has been made. For example, the branch prediction unit includes a branch prediction component and a branch prediction address information queue, and the control logic accordingly provides branch prediction address queue release control and branch prediction address queue write control. In the branch prediction unit, the branch prediction component performs branch prediction, and the virtual addresses of a branch prediction fragment (i.e., an instruction fragment) obtained by the branch prediction are stored in the branch prediction address information queue based on the branch prediction address queue write control. The physical address after address translation may also be stored in the branch prediction address information queue at the same time. The branch prediction address information written into the branch prediction address information queue is also transferred to the instruction fetch unit as instruction fetch information.
As shown in fig. 3, the instruction fetch unit is coupled to the instruction cache unit and the instruction cache queue, and includes a fetch information queue, a fetch request queue, and the like, and accordingly the control logic provides fetch control, prefetch control, miss fetch control, and the like.
The instruction cache unit and the instruction cache queue are coupled to each other and both cache instruction data (but not the application data processed during instruction execution). The instruction cache queue is further coupled to the secondary cache unit, and the instruction cache unit may also be coupled to the secondary cache unit as needed, whereby the instruction cache unit and the instruction cache queue can receive backfilled instructions directly from the secondary cache unit. Accordingly, the control logic provides instruction cache unit write-back control for backfilling instruction data from the instruction cache queue to the instruction cache unit, and instruction cache queue write-back control for backfilling instruction data from the secondary cache unit to the instruction cache queue.
For example, in at least one embodiment of the present disclosure, the instruction cache unit and the secondary cache unit may be implemented in a conventional manner, and the instruction cache queue may be identical in implementation (including structure and control) to a conventional cache unit; for example, it may include static random access memory (SRAM) and use the cache line as the basic data storage unit, and the address mapping manner may be direct mapping, fully associative mapping, set-associative mapping, and the like. Embodiments of the present disclosure do not limit the implementation of the instruction cache queue, the instruction cache unit, or the secondary cache unit.
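As one of the address mapping manners mentioned above, direct mapping can be illustrated with a short sketch. This is a generic illustration, not the disclosure's design: the 64-byte line size and 128-set geometry are assumed values.

```python
# Hypothetical sketch of direct-mapped address decomposition: the address
# is split into tag, set index, and byte offset, so a line can occupy
# exactly one slot and a lookup needs only a single tag compare.
LINE_SIZE = 64   # assumed bytes per cache line
NUM_SETS = 128   # assumed number of sets

def split_addr(addr):
    """Decompose an address into (tag, set index, byte offset)."""
    offset = addr % LINE_SIZE
    index = (addr // LINE_SIZE) % NUM_SETS
    tag = addr // (LINE_SIZE * NUM_SETS)
    return tag, index, offset
```

Fully associative and set-associative mappings differ only in how many slots a given index may select; the tag/index/offset decomposition itself is the common starting point.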
After receiving the branch prediction address information from the branch prediction unit, the instruction fetch unit fills the instruction fetch information into the instruction fetch information queue, and may perform a prefetch operation based on prefetch control while filling the queue. In addition, an instruction prefetch operation, or an instruction fetch request to the secondary cache unit (L2 Cache) caused by a miss (Miss) when querying the instruction cache unit during normal instruction fetching, is stored in the instruction fetch request queue, and the secondary cache itself, a lower-level cache, or the memory is then waited on to return the required instruction data; the instruction fetch request caused by a miss of the instruction cache unit query is performed based on miss instruction fetch control.
In the process in which the secondary cache unit acquires instruction data and the instruction data is backfilled to the instruction cache unit through instruction write-back control, the instruction fetch information queue is awakened through the information stored in the instruction fetch request queue, so that the instruction data can be acquired from the instruction backfill bus by the instruction read operation corresponding to the instruction fetch information at the head of the instruction fetch information queue. Otherwise, if the currently backfilled instruction data does not correspond to the head of the instruction fetch information queue, the instruction data is first backfilled into the instruction cache unit and then read out from the instruction cache unit again to complete the corresponding instruction fetch operation.
For example, an instruction processing method according to at least one embodiment of the present disclosure further includes generating an instruction prefetch request according to the branch prediction result. For example, a branch prediction component of a branch prediction unit of a processor generates an instruction prefetch request based on a branch prediction result.
For example, in a processor and an instruction processing method according to at least one embodiment of the present disclosure, fetching an object instruction according to an instruction prefetch request includes acquiring an instruction prefetch address according to the instruction prefetch request, transmitting an access request to a secondary cache L2 using the instruction prefetch address in response to an instruction prefetch operation according to the instruction prefetch address, and writing the object instruction corresponding to the instruction prefetch address to an instruction cache queue L0 in response to an instruction data backfilling operation of the secondary cache L2.
For example, in a processor and an instruction processing method according to at least one embodiment of the present disclosure, fetching an object instruction according to an instruction prefetch request further includes determining whether to perform an instruction prefetch operation according to a prefetch algorithm.
For example, an instruction processing method according to at least one embodiment of the present disclosure further includes obtaining an object instruction according to a fetch request to the instruction cache unit L1 or the instruction cache queue L0, and then sending the object instruction to the decode unit and dispatching it to the execution unit. For example, the decode unit of the processor decodes the object instruction after receiving it, and then provides the decoding result to the corresponding execution unit for execution. The decoded result includes micro-instructions, which are dispatched by a dispatch unit (not shown) to an execution unit for execution.
For example, in a processor and an instruction processing method according to at least one embodiment of the present disclosure, obtaining an object instruction according to a fetch request to an instruction cache unit L1 or an instruction cache queue L0 includes writing the fetch request to a fetch information queue, and querying the instruction cache unit L1 or the instruction cache queue L0 according to the fetch information queue to obtain the object instruction.
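The fetch path just described can be condensed into a sketch. This is a hypothetical simplification, not the disclosure's implementation: the function name, the dict-based caches, and the in-order servicing are assumptions made for illustration.

```python
# Hypothetical sketch of the fetch path: a fetch request is written into
# the fetch information queue, and requests are serviced in order by
# querying the instruction cache queue L0 first and the instruction
# cache unit L1 on an L0 miss.
from collections import deque

def fetch(fetch_info_queue, l0, l1):
    """Service the request at the head of the fetch information queue."""
    addr = fetch_info_queue.popleft()
    if addr in l0:
        return ("L0", l0[addr])      # hit in the instruction cache queue
    if addr in l1:
        return ("L1", l1[addr])      # hit in the instruction cache unit
    return ("miss", None)            # would allocate into the instruction Miss queue
```

Checking L0 before L1 is what lets recently prefetched instructions be consumed without ever touching, or polluting, the instruction cache unit.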
The above-described processor and instruction processing method are described below with reference to specific embodiments.
Fig. 4 illustrates an exemplary flow chart of an instruction processing method in accordance with at least one embodiment of the present disclosure.
As shown in fig. 4, first, (1) the branch prediction unit receives instruction release information from the instruction release unit, generates a branch prediction result according to a branch prediction algorithm, obtains a new instruction address, and stores the new instruction address into the branch prediction address information queue.
And (2) whether to perform a prefetch operation is determined according to an instruction prefetch algorithm based on the new instruction address; if so, an instruction Miss queue entry is allocated for the prefetch request, the instruction prefetch address corresponding to the prefetch request is stored into the instruction Miss queue, and an instruction fetch request operation is then initiated to the secondary cache to wait for the response of the secondary cache.
And (3) storing the fetch information corresponding to the new instruction address into a fetch information queue. The fetch information at the head of the fetch information queue is the oldest fetch information.
And (4) reading the instruction fetch information at the head of the instruction fetch information queue, and judging whether the current instruction fetch operation occupies the instruction Miss queue.
(5) Based on the result of (4), if the instruction Miss queue is not occupied, the instruction cache queue will be looked up.
(6) Based on the result of the step (4), if the instruction Miss queue is occupied, continuing to judge whether the instruction data to be extracted by the current instruction fetching operation is in a backfilling process or a backfilled state of the secondary cache.
(7) Based on the result of (6), if the instruction data to be extracted by the current fetching operation is not in the backfilling process or the backfilling state of the secondary cache, the current fetching operation is stopped, and the fetching information is not read out from the fetching information queue.
(8) Based on the result of (6), if the instruction data to be extracted by the current instruction fetching operation is in the backfilling process or the backfilled state of the secondary cache, judging whether the instruction data is in the backfilled state.
Further, (9) based on the result of (8), if the currently required instruction data is in the backfilled state, the instruction cache queue is queried, the entry of the instruction information corresponding to the fetch address in the instruction cache queue is obtained by indexing, and the required instruction data is read from the instruction cache queue. Otherwise, (10) based on the result of (8), if the currently required instruction data is not yet in the backfilled state, the required instruction data is obtained directly from the backfill bus running from the secondary cache L2 to the instruction cache unit.
(11) In either of the cases (9) and (10), the instruction fetch is successful and the instruction fetch operation is completed.
Further, in the above operation, (12) based on the result of (4), if the current fetch operation does not occupy the instruction Miss queue, the instruction cache queue is searched; at this time, if the query hits the instruction cache queue, i.e., no query miss (Miss) occurs, the fetch is successful, which means that this fetch operation is completed.
Otherwise, (13) based on the result of (4), if the query of the instruction cache queue does not hit, i.e., a query miss (Miss) occurs, the instruction cache unit L1 is further searched; if the query of the instruction cache unit L1 hits, the instruction fetch operation is completed; otherwise, (14) based on the result of (13), a memory access request is sent to the instruction Miss queue.
(15) In either case of (2) or (14) above, it is necessary to read the memory request from the instruction Miss queue and send the memory request to the secondary cache.
In the above operation, if a memory access request is sent to the instruction Miss queue and an instruction fetch operation is then initiated to the secondary cache, then after the response of the secondary cache is received, the above queues, the instruction cache unit, the instruction cache queue, and the like need to be updated according to the response of the secondary cache.
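The decision made for the fetch information at the head of the queue, steps (4) through (10) above, can be condensed into a sketch. This is a hypothetical summary for illustration only; the function name and the string outcomes are assumptions, not the disclosure's signals.

```python
# Hypothetical condensation of the fetch decision flow of steps (4)-(10):
# the fetch at the head of the fetch information queue either looks up the
# caches, stalls, reads the instruction cache queue L0, or taps the
# backfill bus of the secondary cache directly.
def service_head(occupies_miss_queue, backfill_started, backfill_done):
    if not occupies_miss_queue:
        return "lookup L0/L1"            # steps (5)/(12)-(13): query the caches
    if not backfill_started and not backfill_done:
        return "stall"                   # step (7): wait, keep the head entry
    if backfill_done:
        return "read from L0"            # step (9): data already in the queue
    return "read from backfill bus"      # step (10): backfill in progress
```

The stall case in step (7) keeps the fetch information at the head of the queue, so the same entry is retried once the secondary cache's backfill makes progress.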
For example, in a processor and an instruction processing method in accordance with at least one embodiment of the present disclosure, sending a memory access request to the secondary cache L2 using the instruction prefetch address includes updating the instruction miss (Miss) queue of the instruction cache unit using the instruction prefetch address and then sending the memory access request to the secondary cache L2. Writing the object instruction corresponding to the instruction prefetch address into the instruction cache queue L0 in response to the instruction data backfill operation of the secondary cache L2 includes waking up the instruction miss queue in response to a backfill request of the secondary cache L2 for the object instruction, and determining, according to the record information corresponding to the object instruction in the instruction miss queue, whether to write the object instruction corresponding to the instruction prefetch address into the instruction cache queue L0.
For example, in a processor and an instruction processing method according to at least one embodiment of the present disclosure, the record information corresponding to the object instruction in the instruction miss queue includes whether the entry corresponding to the object instruction is valid (Valid) and whether it is cacheable (Non-cacheable attribute).
FIG. 5 illustrates a flow diagram of a secondary cache backfilling operation in accordance with at least one embodiment of the present disclosure.
As shown in FIG. 5, the secondary cache backfills a piece of requested instruction data to the instruction cache queue: (1) at this time, the entry information corresponding to the instruction data to be backfilled in the instruction Miss queue needs to be checked, and the entry in the instruction Miss queue is awakened to start the data backfill process.
Then, (2) it is checked whether the corresponding entry in the instruction Miss queue is valid (VALID); if the entry is not valid at this time (VALID value is 0), this piece of instruction data backfilled from the secondary cache L2 will not be backfilled into the instruction cache queue.
Otherwise, (3) if the entry is VALID at this time (VALID value is 1), the cache attribute of the instruction fetch request needs to be continuously checked, if the cache attribute is Non-Cacheable, the instruction data of the pen backfilled from the secondary cache L2 will not be backfilled into the instruction cache queue, otherwise, the instruction data is backfilled into the instruction cache queue.
In addition, (4) after the corresponding entry in the wake-up instruction Miss queue of (1) above, the corresponding entry in the get instruction information queue may be awakened.
And (5) judging whether the awakened instruction fetch item is positioned at the head of the instruction fetch information queue, if so, obtaining instruction data in a backfill bus of the direct secondary cache, and otherwise, performing subsequent indexing/query operation.
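The backfill decision of steps (1) through (5) above can be summarized as a minimal Python sketch. This is an illustrative model only; the function name, the dictionary-based entry, and the flag names are assumptions, not the actual hardware signals of the embodiment.

```python
# Hypothetical sketch of the L2 backfill decision of FIG. 5; names are
# illustrative, not the actual hardware signals.

def handle_l2_backfill(miss_entry, at_queue_head):
    """Decide what happens to one piece of instruction data backfilled by L2.

    miss_entry: dict modeling the awakened instruction Miss queue entry,
                with 'valid' and 'cacheable' flags.
    at_queue_head: whether the awakened fetch entry sits at the head of the
                   instruction fetch information queue.
    Returns (write_to_l0, data_source).
    """
    # (2) An invalid entry (VALID == 0) means the data is dropped.
    if not miss_entry["valid"]:
        return (False, None)
    # (3) Non-cacheable fetches bypass the instruction cache queue L0.
    if not miss_entry["cacheable"]:
        return (False, None)
    # (4)/(5) A fetch waiting at the queue head takes the data straight
    # off the L2 backfill bus; otherwise it indexes/queries L0 later.
    source = "backfill_bus" if at_queue_head else "l0_query"
    return (True, source)
```

For example, an awakened entry that is valid and cacheable, and whose fetch entry waits at the head of the instruction fetch information queue, takes its data directly from the backfill bus.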
As described above, the processor of the embodiment of the present disclosure adds an instruction cache queue, which may be regarded as a level 0 cache, to the processor shown in FIG. 2, for example; accordingly, two cache structures, the instruction cache queue and the instruction cache unit, need to be maintained.
FIG. 6 illustrates an exemplary operational flow diagram of an instruction cache unit update in an instruction processing method in accordance with at least one embodiment of the present disclosure.
As shown in FIG. 6, first, (1) the instruction release unit releases a number of branch prediction entries to the branch prediction unit; it is judged whether the number of released branch prediction entries is 0, and if so, the instruction release unit continues to be monitored; otherwise, the update operation of the instruction cache unit is started.
(2) After the update operation is started, the entry in the branch prediction address information queue corresponding to the branch prediction entry released by the release unit to the branch prediction unit is determined by indexing or searching.
(3) On the basis of (2), for the determined branch prediction entry to be released, the physical position in the instruction cache unit into which the instruction information corresponding to the instruction fetch request needs to be backfilled is determined according to the physical address.
Thereafter, (4) on the basis of (2), the entry in the instruction Miss queue corresponding to the branch prediction entry to be released is awakened.
(5) The "backfill process flag bit" in the corresponding entry in the instruction Miss queue is set high (i.e., to 1, indicating that the backfill is in process).
Thereafter, (6) the instruction Miss queue is searched for an entry whose "backfill" and "valid" bits are both 1 and whose "shared" flag bit is 0 (after the shared instruction fetch requests have been handled, the "shared" flag is set to 0).
(7) According to the result of (6), if such an entry exists, it is determined as the entry of the instruction Miss queue to be backfilled into the instruction cache unit, and the "backfill process" flag bit of the entry is set low (i.e., to 0, indicating that the backfill process is completed).
Then, (8) the way (WAY) information to be backfilled into the instruction cache unit is determined according to the set (SET) index of the instruction cache unit stored in the instruction Miss queue entry.
Then, (9) the corresponding instruction data in the instruction cache queue is backfilled to the designated position in the instruction cache unit.
Finally, (10) the corresponding entry of the instruction Miss queue is released.
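Steps (6) through (10) above can be sketched in Python as follows. The entry fields ("backfill", "valid", "shared", the SET index, and the WAY value) follow the text; the data layout and function name are modeling assumptions.

```python
# Illustrative sketch of the instruction cache unit update of FIG. 6.
# Flag names ('backfill', 'valid', 'shared') follow the text; everything
# else is a modeling assumption.

def update_instruction_cache(miss_queue, l0_data, icache):
    """Backfill L0 data into the instruction cache unit for every Miss
    queue entry that is valid, in the backfill process, and not shared."""
    released = []
    for idx, entry in enumerate(miss_queue):
        # (6) Look for entries with backfill == 1, valid == 1, shared == 0.
        if entry["backfill"] and entry["valid"] and not entry["shared"]:
            # (7) Mark the backfill process of this entry as complete.
            entry["backfill"] = 0
            # (8) The SET index / WAY stored in the entry select the slot.
            slot = (entry["set"], entry["way"])
            # (9) Copy the matching L0 cache line into the designated slot.
            icache[slot] = l0_data[entry["addr"]]
            # (10) Release the Miss queue entry.
            released.append(idx)
    return released
```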
In the processor of the embodiment of the present disclosure, since there is an instruction cache queue, the snoop (SNOOP) management control mechanism may need to be updated. Snoop operations are used primarily in cache coherency protocols to ensure the coherency of data in multiprocessor or multi-cache systems. The snoop mechanism allows a cache controller to monitor communications between other caches or memory controllers, thereby being able to detect modifications to shared data and update its own cached copy as needed. This ensures that all processors see up-to-date data and avoids problems due to inconsistency.
The MESI protocol is a common cache coherency protocol, whose name derives from the four possible states of a cache line: Modified, Exclusive, Shared, and Invalid. These states help manage the states of the cache blocks in the different processor caches to ensure data coherency.
For example, according to the MESI protocol, when a cache attempts to modify shared data, it sends a "Modify" request to the bus, and other cache controllers, upon snooping the request, check whether they hold a copy of the data. If another cache does hold a copy of the data, it marks the copy as invalid (Invalid), thereby ensuring that only the cache requesting the modification can modify the data.
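As a minimal illustration of this invalidation rule, the following Python model (the dictionary-based cache layout is an assumption made for brevity) shows one cache modifying a shared line while the other holders invalidate their copies.

```python
# Minimal model of the MESI invalidation described above: when one cache
# modifies a shared line, every other holder invalidates its copy.
# The states follow the protocol; the data layout is illustrative only.

MODIFIED, EXCLUSIVE, SHARED, INVALID = "M", "E", "S", "I"

def modify_line(caches, writer, addr):
    """'writer' sends a Modify request for 'addr' on the bus; the other
    caches snoop the request and invalidate their copies."""
    for i, cache in enumerate(caches):
        if i == writer:
            cache[addr] = MODIFIED       # the requester now owns the line
        elif cache.get(addr) in (MODIFIED, EXCLUSIVE, SHARED):
            cache[addr] = INVALID        # the snooped copy is stale
```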
For example, an instruction processing method according to at least one embodiment of the present disclosure further includes maintaining cache coherency between the instruction cache unit L1 and the instruction cache queue L0.
For example, in a processor and instruction processing method in accordance with at least one embodiment of the present disclosure, maintaining cache coherency between instruction cache unit L1 and instruction cache queue L0 includes maintaining cache coherency by a snoop operation issued by secondary cache L2. For example, the control logic of the secondary cache includes performing snoop operations to maintain cache coherency.
For example, in a processor and an instruction processing method according to at least one embodiment of the present disclosure, maintaining cache coherency by snoop operations issued by the level two cache L2 includes: in response to a snoop operation hitting a first target entry of the instruction cache queue L0, clearing the valid bit in the first target entry and determining whether the cleared first target entry is performing a backfill operation to the instruction cache unit; and in response to a snoop operation hitting a second target entry of the instruction cache unit L1, clearing the valid bit in the second target entry.
Fig. 7 illustrates an exemplary operational flow diagram of a snoop operation in an instruction processing method in accordance with at least one embodiment of the present disclosure.
As shown in fig. 7, (1) a SNOOP (SNOOP) operation request is sent from the level two cache L2.
Then, (2) on the basis of (1), it is determined whether the branch prediction address information queue is hit; if an entry in the branch prediction address information queue is hit, indicating that the cache line to be snooped is already being processed in the processor pipeline, the valid bit (VALID) of the entry is cleared, and the instructions following the cache line (CACHELINE) in the processor pipeline are flushed (FLUSH).
(3) Based on (1), it is determined whether a cache line in the instruction cache queue is hit.
(4) On the basis of (3), if a cache line in the instruction cache queue is hit, the valid bit (VALID) of that cache line in the instruction cache queue is cleared, and it is further judged whether the cache line is backfilling the instruction cache unit; if the backfill operation is in progress, the backfill bus is flushed; otherwise, no further operation is performed.
(5) On the basis of (3), if no cache line in the instruction cache queue is hit, the valid bit (VALID) clearing operation for this cache line is abandoned.
(6) On the basis of (1), it is judged whether a cache line in the instruction cache unit is hit; if so, the valid bit corresponding to that cache line in the instruction cache unit is cleared; otherwise, the snoop request operation is abandoned.
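Steps (1) through (6) of the snoop flow can be sketched as follows; the queue layouts and the returned action names are illustrative assumptions, not the actual hardware interface.

```python
# Hedged sketch of the snoop flow of FIG. 7; the queue/flag names are
# assumptions made for illustration.

def handle_snoop(addr, bp_queue, l0_queue, icache):
    """Apply one snoop request for 'addr' to the three structures."""
    actions = []
    # (2) A hit in the branch prediction address information queue means
    # the line is already in the pipeline: clear valid, flush younger ops.
    for entry in bp_queue:
        if entry["addr"] == addr and entry["valid"]:
            entry["valid"] = 0
            actions.append("flush_pipeline")
    # (3)-(5) A hit in the instruction cache queue clears the line's valid
    # bit; if that line is mid-backfill, the backfill bus is flushed too.
    for line in l0_queue:
        if line["addr"] == addr and line["valid"]:
            line["valid"] = 0
            if line["backfilling"]:
                actions.append("flush_backfill_bus")
    # (6) A hit in the instruction cache unit clears its valid bit.
    if addr in icache and icache[addr]["valid"]:
        icache[addr]["valid"] = 0
        actions.append("invalidate_icache_line")
    return actions
```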
In at least one embodiment of the present disclosure, because of the addition of the instruction cache queue, in some cases the instruction fetch unit only needs to read the prefetched instruction into the instruction cache queue, while in other cases it needs to look up the instruction cache queue first and, on a miss, then look up the instruction cache unit. To better reduce the impact on timing, another embodiment of the present disclosure provides a modified instruction processing method, an exemplary flow chart of which is shown in FIG. 8.
Fig. 8 illustrates an exemplary flow chart of an instruction processing method in accordance with at least one further embodiment of the present disclosure. In a processor of at least one embodiment of the present disclosure, the control logic adjusts accordingly.
As shown in FIG. 8, first, (1) the branch prediction unit receives instruction release information from the instruction release unit, generates a branch prediction result according to a branch prediction algorithm to obtain a new instruction address, and stores the new instruction address into the branch prediction address information queue.
Then, (2) based on the new instruction address, it is determined according to an instruction prefetch algorithm whether to perform a prefetch operation; if so, an entry of the instruction Miss queue is allocated to the prefetch request, the instruction prefetch address corresponding to the prefetch request is stored into the instruction Miss queue, the instruction prefetch operation is then initiated to the secondary cache, and the response of the secondary cache is awaited.
(3) The instruction fetch information corresponding to the new instruction address is stored into the instruction fetch information queue.
(4) On the basis of (2), if the prefetch operation is performed, subsequent operations such as querying the instruction cache queue are abandoned.
(5) On the basis of (2), if no prefetch operation is performed, the instruction Miss queue is looked up according to the instruction address.
(6) According to the result of (5), if the instruction Miss queue is not hit, subsequent operations such as querying the instruction cache queue are abandoned; if the instruction Miss queue is hit, the hit entry is marked as occupied by the instruction fetch operation.
(7) The instruction fetch information at the head of the instruction fetch information queue is read, and it is judged whether the current instruction fetch operation occupies the instruction Miss queue.
(8) According to the result of (7), if the instruction Miss queue is not occupied, the instruction cache unit is looked up.
(9) According to the result of (7), if the instruction Miss queue is occupied, it is further judged whether the instruction data to be fetched by the current instruction fetch operation is in the backfill process of the secondary cache or already in the backfilled state.
(10) According to the result of (9), if the instruction data to be fetched by the current fetch operation is neither in the backfill process of the secondary cache nor in the backfilled state, the current fetch operation is stalled, and the fetch information is not read out from the fetch information queue.
(11) According to the result of (9), if the instruction data to be fetched by the current instruction fetch operation is in the backfill process of the secondary cache or already in the backfilled state, it is further judged whether the instruction data is in the backfilled state.
Further, (12) according to the result of (11), if the currently required instruction data is in the backfilled state, the instruction cache queue is queried, the entry of the instruction information corresponding to the instruction fetch address in the instruction cache queue is obtained by indexing, and the required instruction data is read from the instruction cache queue.
(13) According to the result of (11), if the currently required instruction data is not yet in the backfilled state, the required instruction data is obtained directly from the backfill bus of the second-level cache L2 to the instruction cache unit.
(14) In either of the two cases (12) and (13), the instruction fetch is determined to be successful, and the instruction fetch operation is completed.
Further, in the above operation, (15) according to the result of (8), the instruction cache unit is looked up; at this time, if the query of the instruction cache unit hits, i.e., no query miss (Miss) occurs, the instruction fetch is also successful, and the instruction fetch operation is completed.
(16) If the query of the instruction cache unit is not hit, i.e., a query miss (Miss) occurs, a memory access request is sent to the instruction Miss queue; if the query of the instruction cache unit is hit, the instruction fetch is successful and the instruction fetch operation is completed.
(17) In either case of (2) or (16) above, it is necessary to read the memory request from the instruction Miss queue and send the memory request to the secondary cache.
Similarly, in the above operation, if a memory access request is sent to the instruction Miss queue, the instruction fetch operation is then initiated to the secondary cache, and after the response of the secondary cache is received, the above queues, the instruction cache unit, the instruction cache queue, and the like need to be updated according to the response of the secondary cache.
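The fetch-path decision of steps (7) through (13) above, i.e., where one fetch at the head of the instruction fetch information queue reads its data from, can be summarized in a minimal Python sketch; the state encoding and function name are assumptions made for illustration.

```python
# Hypothetical sketch of the fetch-path decision of FIG. 8 (steps 7-13).
# All names and the state encoding are illustrative.

def fetch_data_source(occupies_miss_queue, backfill_state):
    """backfill_state: None (not started), 'in_progress', or 'done'."""
    if not occupies_miss_queue:
        return "lookup_icache"       # (8) normal instruction cache lookup
    if backfill_state is None:
        return "stall"               # (10) data not yet coming back: wait
    if backfill_state == "done":
        return "read_l0"             # (12) data already in the L0 queue
    return "read_backfill_bus"       # (13) take it off the L2 backfill bus
```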
In this processing method, the instruction Miss queue is looked up when the instruction fetch request enters the instruction fetch information queue, so that whether the instruction data is read from the instruction cache queue or from the instruction cache unit is determined during the instruction fetch process. Since instruction fetch requests enter the instruction fetch information queue faster than they are read out of it, this operation can be performed before entering the instruction fetch information queue.
In at least one embodiment of the present disclosure, a processor (or processor core) translates each architectural instruction into one or more micro-instructions (uops) within the micro-architecture, each micro-instruction performing only a limited operation, which helps keep each pipeline stage short so as to increase the processor core operating frequency. For example, a load instruction may be translated into an address generation micro-instruction and a memory read micro-instruction, where the second micro-instruction depends on the result of the first, so that the second micro-instruction begins execution only after the first micro-instruction has completed execution. The micro-instructions include a plurality of microarchitecture-related fields that are used to communicate related information between the pipeline stages.
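As a hypothetical illustration of this decomposition, the sketch below splits one architectural load into two dependent micro-instructions; the uop encoding is invented for the example and is not the actual micro-instruction format of the embodiment.

```python
# Illustrative decomposition of one architectural load into two dependent
# micro-instructions, as described above; the uop fields are hypothetical.

def decode_load(dst, base, offset):
    """Split 'dst = MEM[base + offset]' into an address-generation uop and
    a memory-read uop that consumes its result."""
    agen = {"op": "agen", "srcs": [base, offset], "dst": "tmp_addr"}
    load = {"op": "mem_read", "srcs": ["tmp_addr"], "dst": dst}
    # The load uop reads 'tmp_addr', so it can only issue after agen
    # completes: an explicit data dependence between the two uops.
    return [agen, load]
```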
Accordingly, in this embodiment, the instruction cache system at the front end of the pipeline of the processor core includes a micro instruction cache (OC) in addition to the instruction cache unit (IC) and the instruction cache queue, and in order to better increase the respective effective capacities of the instruction cache unit and the micro instruction cache, it is necessary to maintain a mutual exclusion (exclusive) relationship between the two.
Fig. 9 shows a schematic diagram of a processor front-end architecture according to one embodiment of the present disclosure. As shown in fig. 9, the front-end architecture 10 of the processor includes a branch prediction unit 101, a branch prediction address information queue 102, an instruction cache unit 103, a decode unit 104, a microinstruction processing module 105, a microinstruction cache unit 106, a microinstruction queue 107, and an issue unit 108.
For this front end, the corresponding instruction fetch method can be as follows.
The branch prediction unit 101 sends prediction information to the branch prediction address information queue 102 for caching to await processing of the prediction information.
The processor core initially enables the instruction cache mode to process the prediction information. For example, first, based on the address information in the prediction information from the branch prediction address information queue 102, an attempt is made to fetch the instruction data requested by the prediction information from the instruction cache unit 103 and send it to the decoding unit 104 for decoding. Here, the instruction data may be continuous binary data. The decode unit 104 may decode the fetched instruction data into corresponding micro instruction groups (each including one or more micro instructions) and send the micro instruction groups to the micro instruction queue 107 to be cached awaiting dispatch (not shown in FIG. 2).
The decode unit 104 also provides the decoded micro instruction groups to the micro instruction cache unit 106 for caching. At this point, a micro instruction register entry may be created in the micro instruction cache unit 106 for storing the micro instructions. One or more micro instructions in the micro instruction group are cached in the created micro instruction register entry; for example, one micro instruction register entry may store 8 micro instructions. When caching a micro instruction, the micro instruction cache unit 106 determines whether the micro instruction is already present in the micro instruction cache unit 106. For example, when the micro instruction is already present in the micro instruction cache unit 106, the micro instruction cache unit 106 may give information about a cache hit (build hit), and when the micro instruction is not present in the micro instruction cache unit 106, the micro instruction cache unit 106 may give information about a cache miss (build miss).
The processor core determines whether to enable the micro instruction cache fetch mode based on the cache hit or cache miss information provided by the micro instruction cache unit 106. In one embodiment, for example, when the micro instruction cache unit 106 gives cache hit information indicating that several consecutive micro instruction groups exist in the micro instruction cache unit 106, the determination result is yes and the micro instruction cache fetch mode is enabled; when the determination result is no, the prediction information continues to be processed in the instruction cache mode.
In response to enabling the micro instruction cache fetch mode, the prediction information in the branch prediction address information queue 102 is sent to a micro instruction cache fetch queue contained in the micro instruction processing module 105, and the sending of the prediction information to the instruction cache unit 103 is stopped.
The micro instruction cache unit 106 determines, according to the address information in the prediction information, whether the micro instruction group corresponding to the prediction information can be fetched from the micro instruction cache unit 106. For example, in response to failing to fetch the micro instruction group corresponding to the prediction information, the system resumes processing the prediction information in the instruction cache mode and processes the prediction information of the current miss in the instruction cache mode.
In response to being able to fetch the set of micro instructions corresponding to the prediction information, the fetched set of micro instructions is sent to the micro instruction queue 107 to await dispatch.
The micro instruction queue 107 sequentially sends the groups of micro instructions produced in the instruction cache mode or the micro instruction cache mode to the issue unit 108 for back-end processing, e.g., register renaming, execution, retirement (retire), etc.
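The mode decision described above can be sketched as follows. The text does not specify how many consecutive hits are required before switching, so the threshold below is an assumption made purely for illustration, as are the function and mode names.

```python
# Hedged sketch of the fetch-mode decision described above: the core
# switches to the micro-instruction cache fetch mode only after several
# consecutive micro-instruction groups hit in the micro-instruction cache.
# The threshold of 3 is an assumption; the text says only "several".

def choose_fetch_mode(recent_oc_hits, threshold=3):
    """Return 'oc_fetch' once 'threshold' consecutive groups hit in the
    micro-instruction cache; otherwise stay in 'icache' mode."""
    consecutive = 0
    for hit in recent_oc_hits:
        consecutive = consecutive + 1 if hit else 0
        if consecutive >= threshold:
            return "oc_fetch"
    return "icache"
```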
In this embodiment, it is checked at instruction release whether the corresponding cache line was fetched via the micro instruction fetch path; if so, the corresponding cache line in the instruction cache queue will not be used to update the instruction cache unit.
For example, in a processor core of an embodiment of the present disclosure as shown in FIG. 2, the processor core front end includes a plurality of queues, e.g., a branch prediction address information queue, an instruction fetch request queue, and an instruction cache queue. In at least one embodiment of the present disclosure, the size of the instruction cache queue is made the same as the size of the instruction fetch request queue, while the size of the instruction fetch request queue is the same as the size of the branch prediction address information queue.
FIG. 10 illustrates a schematic diagram of a relationship between various queues included in a processor core in at least one embodiment of the present disclosure.
In embodiments of the present disclosure, entries of a branch prediction address information queue are used to store branch prediction related data, e.g., including branch prediction address information, branch prediction instruction information, and the like.
The instruction fetch information queue may be understood here as storing a subset of the information in the branch prediction address information queue, supplemented with additional fetch-related information; for example, in at least one embodiment, the index of the entry of the branch prediction address information queue into which a branch prediction fragment is placed is also stored. As shown in FIG. 10, each entry of the fetch information queue includes a branch prediction address queue index number, a fetch physical address, a level two cache (L2) request queue index number, an instruction cache queue backfill "done/in process" flag bit, a fetch request queue index, and the like.
The instruction fetch request queue is used for storing prefetch requests sent by the branch prediction unit or instruction cache unit miss requests generated by the instruction fetch unit; the related information allocated to the instruction fetch request queue is also stored in the instruction fetch information queue and used for waking up the related instruction fetch operations, and meanwhile the corresponding instruction cache queue can be awakened to read the instruction information. As shown in FIG. 10, each entry of the fetch request queue may include a valid bit (Valid), a branch prediction address queue tag, the way value of the corresponding instruction cache unit (IC), a flag bit indicating that backfill of the instruction cache unit is in progress, a fetch address, and so forth.
As shown in FIG. 10, the instruction cache queue includes a plurality of cache lines; each cache line has a size of, for example, 32 bytes, each entry stores 1 cache line, and the minimum storage granularity is 1/4 cache line.
In this embodiment, the branch prediction address information queue has 64 entries, the fetch information queue has 16 entries, the fetch request queue has 64 entries, and the instruction cache queue has 64 cache lines, and at this time, the capacity (number of entries) of the branch prediction address information queue, the capacity (number of entries) of the fetch request queue, and the size (number of cache lines) of the instruction cache queue are identical to each other, which helps control the instruction backfill time.
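The sizing relationship of this example can be stated as a small sketch; the structures below are simple placeholder lists standing in for the queues, not the actual hardware organization, and the depths are the example values from the text.

```python
# Sketch of the sizing relationship described above: making the branch
# prediction address information queue, the fetch request queue, and the
# instruction cache queue the same size lets one index address all three.
# Depths are the example values from the text; the layout is illustrative.

N_ENTRIES = 64          # shared depth of the three same-sized structures
FETCH_INFO_DEPTH = 16   # the fetch information queue is a smaller subset

bp_addr_queue   = [None] * N_ENTRIES   # branch prediction address info
fetch_req_queue = [None] * N_ENTRIES   # prefetch / miss requests
l0_cache_lines  = [None] * N_ENTRIES   # 32-byte cache lines of the L0 queue

# The invariant that helps control the instruction backfill time:
assert len(bp_addr_queue) == len(fetch_req_queue) == len(l0_cache_lines)
```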
Some embodiments of the present disclosure also provide an electronic device including the processor of any one of the above embodiments or capable of executing the instruction processing method of any one of the above embodiments.
Fig. 11 is a schematic diagram of an electronic device according to at least one embodiment of the present disclosure. The electronic device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a notebook computer, a PDA (personal digital assistant), a PAD (tablet computer), etc., and a fixed terminal such as a desktop computer.
The electronic device 1000 shown in FIG. 11 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments. For example, as shown in FIG. 11, in some examples, the electronic device 1000 includes a processor of any of the embodiments of the present disclosure, which can perform various suitable actions and processes, such as the processing method of an embodiment of the present disclosure, according to a program stored in a read-only memory (ROM) 1002 or a program loaded from a storage device 1008 into a random access memory (RAM) 1003. In the RAM 1003, various programs and data required for the operation of the computer system are also stored. The processor 1001, the ROM 1002, and the RAM 1003 are connected to one another by a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
For example, the following components may also be connected to the I/O interface 1005: input devices 1006 such as a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 1007 such as a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices 1008 such as magnetic tape, hard disk, etc.; and communication devices 1009 such as network interface cards (e.g., LAN cards) and modems. The communication device 1009 may allow the electronic device 1000 to perform wireless or wired communication with other apparatuses to exchange data, performing communication processing via a network such as the Internet. A drive 1010 is also connected to the I/O interface 1005 as needed. A removable storage medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is installed in the drive 1010 as needed, so that a computer program read therefrom is installed into the storage device 1008 as needed.
While fig. 11 illustrates an electronic device 1000 that includes various devices, it should be understood that not all illustrated devices are required to be implemented or included. More or fewer devices may be implemented or included instead.
For example, the electronic device 1000 may further include a peripheral interface (not shown), and the like. The peripheral interface may be various types of interfaces, such as a USB interface, a Lightning interface, etc. The communication device 1009 may communicate with a network, such as the Internet, an intranet, and/or a wireless network such as a cellular telephone network, a wireless local area network (LAN), and/or a metropolitan area network (MAN), and with other devices via wireless communication. The wireless communication may use any of a variety of communication standards, protocols, and technologies, including, but not limited to, Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), Wideband Code Division Multiple Access (W-CDMA), Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Bluetooth, Wi-Fi (e.g., based on the IEEE 802.11a, IEEE 802.11b, IEEE 802.11g, and/or IEEE 802.11n standards), Voice over Internet Protocol (VoIP), WiMAX, protocols for email, instant messaging, and/or Short Message Service (SMS), or any other suitable communication protocol.
For the purposes of this disclosure, the following points are also noted:
(1) The drawings of the embodiments of the present disclosure relate only to the structures related to the embodiments of the present disclosure, and other structures may refer to the general design.
(2) The embodiments of the present disclosure and features in the embodiments may be combined with each other to arrive at a new embodiment without conflict.
The foregoing is merely exemplary embodiments of the present disclosure and is not intended to limit the scope of the disclosure, which is defined by the appended claims.