
WO2025242009A1 - Cache performance evaluation method, apparatus, electronic device, and readable storage medium - Google Patents

Cache performance evaluation method, apparatus, electronic device, and readable storage medium

Info

Publication number
WO2025242009A1
Authority
WO
WIPO (PCT)
Prior art keywords
memory access
cache
access request
statistics
request
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/CN2025/095503
Other languages
English (en)
French (fr)
Inventor
刘宇航
满洋
陈泓佚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute Of Open Source Chip
Original Assignee
Beijing Institute Of Open Source Chip
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute Of Open Source Chip filed Critical Beijing Institute Of Open Source Chip
Publication of WO2025242009A1 publication Critical patent/WO2025242009A1/zh
Pending legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3452Performance evaluation by statistical analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/3003Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3024Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a central processing unit [CPU]

Definitions

  • This application relates to the field of computer technology, and in particular to a method, apparatus, electronic device and readable storage medium for evaluating cache performance.
  • The cache is a crucial component of the central processing unit (CPU), used to store frequently accessed data or instructions, thus improving CPU speed and efficiency.
  • the performance of the cache significantly impacts CPU performance; therefore, evaluating cache performance is a critical issue that needs to be addressed.
  • This application provides a method, apparatus, electronic device, and readable storage medium for evaluating cache performance, which can solve the problem of how to evaluate cache performance in the prior art.
  • this application discloses a cache performance evaluation method, the method comprising:
  • memory access statistics of the test program are obtained; the memory access statistics include at least the number of times each memory access request is hit in the cache to be evaluated;
  • the performance of the cache to be evaluated is assessed based on the memory access statistics and the memory access patterns of each memory access request.
  • embodiments of this application disclose a cache performance evaluation device, the device comprising:
  • the acquisition module is used to acquire memory access statistics of the test program in response to multiple memory access operations of each memory access request in the test program; the memory access statistics include at least the number of times each memory access request is hit in the cache to be evaluated;
  • the first evaluation module is used to evaluate the performance of the cache to be evaluated based on the memory access statistics and the memory access patterns of each memory access request.
  • embodiments of this application also disclose an electronic device, which includes a processor, a memory, a communication interface, and a communication bus.
  • the processor, the memory, and the communication interface communicate with each other through the communication bus.
  • the memory is used to store executable instructions, which cause the processor to execute the aforementioned cache performance evaluation method.
  • This application also discloses a readable storage medium, which, when the instructions in the readable storage medium are executed by the processor of an electronic device, enables the electronic device to perform the aforementioned cache performance evaluation method.
  • This application also discloses a computer program product containing instructions that, when run on a computer, cause the computer to execute the aforementioned cache performance evaluation method.
  • This application provides a method for evaluating cache performance.
  • memory access statistics of the test program are obtained. These statistics include at least the number of hits of each memory access request in the cache to be evaluated. Based on the memory access statistics and the access patterns of each memory access request, the performance of the cache to be evaluated is assessed. Thus, cache performance can be evaluated by obtaining the number of hits of memory access requests in the cache. Furthermore, by using the memory access statistics and the access patterns of each memory access request, cache performance can be evaluated from different access pattern dimensions, achieving multi-dimensional evaluation and improving the accuracy and interpretability of cache performance evaluation.
  • Figure 1 is a flowchart of the steps of an embodiment of a cache performance evaluation method of this application
  • Figure 2 is a schematic diagram illustrating one method of obtaining the hit count in this application
  • Figure 3 is a schematic diagram of obtaining one memory access mode of this application.
  • Figure 4 is a schematic diagram of one statistical result of this application.
  • FIG. 5 is a schematic diagram of another statistical result of this application.
  • Figure 6 is a structural block diagram of an embodiment of a cache performance evaluation device according to this application.
  • Figure 7 is a structural block diagram of an electronic device for cache performance evaluation provided in this application example.
  • FIG. 1 a flowchart illustrating an embodiment of a cache performance evaluation method according to this application is shown.
  • the method may specifically include the following steps:
  • Step 101 In response to multiple memory access operations of each memory access request in the test program, obtain the memory access statistics of the test program; the memory access statistics include at least the number of times each memory access request hits the cache to be evaluated.
  • Step 102 Based on the memory access statistics and the memory access patterns of each memory access request, evaluate the performance of the cache to be evaluated.
  • the cache in this application can be a cache system design with caching functionality, which can belong to the processor to be evaluated. It is understood that a processor typically includes modules or systems with different functions, and to evaluate the processor's functionality, different modules or systems are usually evaluated separately.
  • This application embodiment evaluates the cache system.
  • this application embodiment can be applied to the processor to be evaluated, or to a processor simulator, such as gem5.
  • the cache to be evaluated can be built on gem5.
  • gem5 is a cycle-accurate processor simulator; its processor core simulation is accurate at the granularity of individual clock cycles.
  • The cache system simulation can be sampled at a certain frequency, for example, once per update cycle (TICK, a simulator variable representing how many times the processor state is updated per second); that is, one memory access statistics result is obtained after each clock cycle.
  • test program can be pre-built, randomly generated, or constructed according to certain test requirements; this application embodiment does not impose any restrictions on this.
  • the test program can contain multiple different memory access requests, and the memory access modes of each memory access request can be the same or different.
  • the memory access request refers to a load instruction.
  • The aforementioned memory access mode refers to the memory access type of the memory access request, which may include strided (step-by-step) memory access, indirect memory access, etc.
  • the memory access statistics of the test program can be obtained first.
  • the memory access statistics include at least the number of times each memory access request hits the cache to be evaluated.
  • the cache refers to the memory between the CPU and main memory; it is typically small in capacity but very fast.
  • the processor executes a memory access request, it often first retrieves data from the cache. If the required data exists in the cache, it indicates that the memory access request has hit the cache, and no main memory access is needed. Conversely, if the required data does not exist in the cache, it indicates that the memory access request has missed the cache, and the required data needs to be retrieved from main memory.
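As a minimal illustration of this hit/miss behavior (not part of the patent), the following toy cache model counts hits and misses; the capacity and the FIFO eviction policy are assumptions chosen only for the example.

```python
from collections import OrderedDict

class SimpleCache:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.lines = OrderedDict()  # address -> data held in the cache
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        if addr in self.lines:
            self.hits += 1              # data found in cache: a hit
            return self.lines[addr]
        self.misses += 1                # miss: fetch from "main memory"
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict the oldest line (FIFO)
        self.lines[addr] = f"data@{addr:#x}"
        return self.lines[addr]

cache = SimpleCache(capacity=2)
for addr in [0x100, 0x200, 0x100, 0x300, 0x200]:
    cache.access(addr)
print(cache.hits, cache.misses)  # 2 3
```

Note that the second access to 0x200 hits only because 0x100 was evicted first; a real cache's hit count depends on capacity and replacement policy in exactly this way.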
  • one memory access request corresponds to one static load instruction.
  • a static load instruction can be executed multiple times as a dynamic instruction.
  • a memory access request can be executed multiple times to obtain the number of cache hits for each memory access request during multiple executions.
  • the aforementioned operation of obtaining memory access statistics can be achieved through the processor's performance counters. These counters can statistically analyze the memory access behavior of the processor during test program execution, tracking the number of cache hits and memory hits for different memory access requests across multiple executions.
  • Alternatively, the memory access statistics can be collected with a dedicated software tool, e.g., the performance analysis tool perf.
  • The aforementioned memory access patterns can be made known in advance by constructing a test program that contains different memory access patterns, so that the pattern of each memory access request is known at construction time.
  • the test program can be randomly constructed.
  • Each memory access request can be output and displayed so that relevant personnel can identify its memory access pattern; the pattern of each request is then obtained by receiving the information they input.
  • the memory access operation in this embodiment can be either data access or instruction access.
  • this embodiment of the application can evaluate cache performance by considering the access modes of each memory access request. Specifically, this embodiment of the application can pre-divide different performance levels and determine the cache performance level for different access modes based on the number of cache hits and the access mode of each memory access request. For example, different performance levels for different access modes can be associated with their corresponding ranges of cache hits. The performance level can be used to characterize performance; the higher the performance level, the larger the range of cache hits. Furthermore, different weighting coefficients can be set for different access modes according to actual memory access needs, and the performance level of the cache to be evaluated can be obtained by weighting the performance levels of the different access modes.
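The weighted, per-mode evaluation just described can be sketched as follows; the mode names, level boundaries, and weights are assumed example values, not figures from the patent.

```python
def performance_level(hit_count, level_ranges):
    """Map a hit count to a performance level.
    level_ranges: (level, min_hits) pairs sorted by min_hits descending;
    a higher level is associated with a higher range of hit counts."""
    for level, min_hits in level_ranges:
        if hit_count >= min_hits:
            return level
    return 0  # below the lowest boundary

# Per-mode level boundaries and per-mode weighting coefficients (assumed).
RANGES = {
    "strided":  [(3, 900), (2, 600), (1, 300)],
    "indirect": [(3, 700), (2, 400), (1, 100)],
}
WEIGHTS = {"strided": 0.6, "indirect": 0.4}

def evaluate(hits_by_mode):
    # Weight the per-mode performance levels into one overall score.
    return sum(WEIGHTS[mode] * performance_level(hits, RANGES[mode])
               for mode, hits in hits_by_mode.items())

score = evaluate({"strided": 950, "indirect": 450})  # 0.6*3 + 0.4*2 ≈ 2.6
```

With these assumed boundaries, a cache that handles strided accesses well but indirect accesses only moderately receives an intermediate overall level, which is the multi-dimensional effect the embodiment aims for.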
  • the method provided in this application embodiment can be applied to a processor, thereby allowing the aforementioned different performance levels to be pre-uploaded to the processor.
  • the cache performance evaluation method provided in this application obtains the memory access statistics of the test program by responding to multiple memory access operations of each memory access request in the test program.
  • the memory access statistics include at least the number of hits of each memory access request in the cache to be evaluated.
  • the performance of the cache to be evaluated is evaluated. In this way, the performance of the cache can be evaluated by obtaining the number of hits of memory access requests in the cache.
  • the performance of the cache can be evaluated from the dimensions of different memory access patterns, realizing multi-dimensional evaluation and improving the accuracy and interpretability of cache performance evaluation.
  • this embodiment evaluates cache performance based on the number of hits and the memory access patterns of memory access requests.
  • the cache miss rate reflects the proportion of missed memory access requests received by the cache, and it cannot pinpoint the timing of the miss.
  • Program segments with the same miss rate may have different impacts on IPC due to different timings of the misses, thus failing to identify the program segments that have a significant impact on cache performance.
  • IPC is sensitive to factors such as branch prediction and cannot directly reflect the coverage of memory accesses by the cache system.
  • This embodiment evaluates cache performance from different dimensions of memory access patterns by using the number of hits and the memory access patterns of each memory access request, achieving multi-dimensional evaluation. This allows for the identification of memory access patterns that have a significant impact on cache performance, improving the accuracy of cache performance evaluation.
  • the number of hits of each memory access request in each level of cache is obtained as the memory access statistics.
  • the aforementioned hierarchy refers to different memory hierarchies.
  • When a cache has at least two memory hierarchies, it is a multi-level cache system; different cache levels have different speeds.
  • the processor typically accesses different cache levels sequentially, starting with the highest-level cache (L1 cache) to retrieve the required data. If the data is found in the L1 cache, it is returned. If the data is not found in the L1 cache, the processor accesses the second-level cache (L2 cache), and so on. If the lowest-level cache (last-level cache) also fails to retrieve the data, then main memory is accessed.
  • Higher-level caches generally have smaller capacities and faster read speeds. Correspondingly, the more frequently memory access requests hit in higher-level caches, the better the cache system's performance; conversely, the more frequently they hit in lower-level caches or main memory, the worse the performance.
  • Performance counters can count the number of hits of memory access requests in different levels of cache, and obtain the hit distribution in the storage hierarchy.
  • the number of times each memory access request is hit in each level of cache can be obtained as memory access statistics. In this way, the performance of the cache can be evaluated more accurately and in a more detailed manner based on the number of hits in different levels of cache.
  • the operation of obtaining the number of cache hits for each memory access request at each level may specifically include the following steps in this embodiment:
  • the memory access request is encapsulated into a request packet, and a hierarchical parameter is set in the request packet.
  • During the return of the request packet from the target level, the level parameter in the request packet is incremented by 1 for each level it passes through; the target level is the level of the cache hit by the request packet.
  • this application embodiment can set multiple hit counters for each memory access request. Different hit counters can correspond to different levels of cache, so that each hit counter of each memory access request can count the number of hits of different levels of cache.
  • the aforementioned request packet refers to a message packet. Encapsulating memory access requests into request packets facilitates their transmission across different cache levels. Furthermore, this embodiment also sets a level parameter within the request packet, which characterizes the cache level hit by the request packet. Specifically, a variable can be created within the request packet as the level parameter. Further, after creating the level parameter, its initial value can be set to 0. Specifically, the aforementioned level parameter can be an integer variable, or it can be a floating-point number; this embodiment does not impose any limitations on this.
  • When the request packet is sent, it is first passed to the L1 cache. If the target data to be accessed exists in the L1 cache, the target data is added to the request packet and the request packet is returned. If not, the request packet continues to be passed to the L2 cache, and so on, until the target data is accessed. The request packet is then passed back up level by level from the hit level. In this embodiment, while the request packet returns from the level of the hit cache, the level parameter in the request packet is incremented by 1 for each level it passes through. In this way, the level of the cache hit by the request packet can be determined from the level parameter, which is convenient for the hit counter to count.
  • the target level of the cache hit by the memory access request can be determined based on the value of the level parameter. This allows the hit counter corresponding to the target level to be incremented by 1, enabling the counting of hits at different levels.
  • the aforementioned evaluation conditions can be the number of times the test program is executed reaching a preset threshold, or the execution time of the test program reaching an execution time threshold. These can be set according to actual needs, and this application embodiment does not impose any restrictions on them. Furthermore, after the evaluation conditions are met, the current value of each hit counter corresponding to each memory access request can be determined as the number of times each memory access request is hit in the cache at each level.
  • the program counter (PC value) of each memory access request can be used as the index value of each memory access request.
  • the PC value of a memory access request serves as the identifier of the hit counter corresponding to that memory access request.
  • the hit counters corresponding to all memory access requests can be indexed based on the PC value of the memory access request carried in the request packet. The hit counter whose identifier matches the PC value is determined as the hit counter for the memory access request corresponding to that request packet.
  • Multiple hit counters are set for each memory access request, different hit counters corresponding to different cache levels; for any memory access operation of each memory access request, the memory access request is encapsulated into a request packet, and a level parameter is set in the request packet; while the request packet returns from the target level, the level parameter is incremented by 1 for each level it passes through, the target level being the level of the cache hit by the request packet; the target level of the memory access request is determined from the value of the level parameter in the returned request packet, and among the hit counters corresponding to the memory access request, the one corresponding to the target level is incremented by 1; when the evaluation condition is satisfied, the number of hits of each memory access request in each cache level is obtained from the current value of each hit counter corresponding to each memory access request.
  • Based on the level parameter, the level of the cache hit by the memory access request can be determined.
  • the number of hits of the memory access request in the cache at different levels can be counted separately.
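The level-parameter mechanism above can be sketched as follows; the class and variable names (RequestPacket, level_param) are illustrative assumptions, not identifiers from the patent. A parameter starting at 0 is incremented once per level on the return path, so its final value identifies the hit level (0 = L1, 1 = L2, 2 = L3, 3 = memory) and selects which per-PC hit counter to increment.

```python
from collections import defaultdict

LEVELS = ["L1", "L2", "L3", "memory"]  # level parameter values 0..3

class RequestPacket:
    """Message packet wrapping one memory access request (names assumed)."""
    def __init__(self, pc, addr):
        self.pc = pc            # PC value identifying the static load
        self.addr = addr
        self.level_param = 0    # incremented once per level on the return path

def access(packet, caches):
    """caches: address sets held by L1, L2, L3; memory always hits."""
    hit_level = len(caches)     # default: served by memory
    for i, lines in enumerate(caches):
        if packet.addr in lines:
            hit_level = i
            break
    # Return path: the packet passes back up one level at a time,
    # incrementing the level parameter at each step.
    for _ in range(hit_level):
        packet.level_param += 1
    return packet

# One hit counter per cache level, indexed by the PC value of the request.
counters = defaultdict(lambda: [0] * len(LEVELS))

caches = [{0x10}, {0x10, 0x20}, {0x10, 0x20, 0x30}]
for pc, addr in [(0xABC, 0x10), (0xABC, 0x20), (0xABC, 0x40), (0xDEF, 0x30)]:
    pkt = access(RequestPacket(pc, addr), caches)
    counters[pkt.pc][pkt.level_param] += 1  # target level = final parameter value

# counters[0xABC] -> [1, 1, 0, 1]: one L1 hit, one L2 hit, one memory access
```

Using the PC value as the counter index mirrors the identifier scheme described above: all dynamic executions of the same static load accumulate into one row of counters.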
  • Figure 2 takes a cache with three levels as an example, namely a private level 1 data cache, a private level 2 cache, and a shared level 3 cache.
  • Processor core 1 can execute a test program to obtain memory access statistics through the performance counters in the processor.
  • the cache queue refers to the Load Store queue.
  • Other cores refer to processor cores other than processor core 1.
  • a memory access request (Load instruction) at point A completes address calculation and sends a memory access request to the cache.
  • the memory access request is encapsulated in a message packet and passed through various cache levels.
  • the message packet contains metadata, which may include the level parameter d, the source of the cache line, and the priority of the replacement algorithm corresponding to the cache line.
  • the metadata of the message packet is recorded in the Load Store queue of the processor performance model.
  • a Load instruction at point B returns a message packet corresponding to the hit storage level. According to Figure 2, this message packet hit memory. Therefore, when returning the message packet, the level parameter in the request packet is incremented by 1 for each level it passes through. When it returns to the cache queue, the level parameter in the message packet is 3, indicating that the target level of this message packet is memory.
  • a Load instruction at point C is processed. At this point, the response status of this memory access instruction in the cache system can be statistically analyzed based on the metadata stored in the Load Store queue.
  • The performance counters can contain five entries: the program counter, an L1 cache hit counter, an L2 cache hit counter, an L3 cache hit counter, and a memory access counter.
  • The program counter entry records the program counter (PC) value of the memory access request. As shown in Figure 2, a memory access request with a PC value of 0xABC hit K times in the L1 cache, L times in the L2 cache, M times in the L3 cache, and N times in memory.
  • the efficiency of the caching system can be reflected by the proportion of data returned by each level of cache in each Load instruction. If a memory access request is mostly returned by the L3 cache or by memory, then the IPC of the program segment containing that memory access request is usually also lower.
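The per-level return proportion described above can be computed directly from the counter entries; the counts below are made-up stand-ins for the K/L/M/N values.

```python
# Illustrative per-level hit counts for one load instruction (K, L, M, N).
counts = {"L1": 700, "L2": 200, "L3": 80, "memory": 20}

total = sum(counts.values())
# Fraction of this load's data returned by each level of the hierarchy.
fractions = {level: n / total for level, n in counts.items()}
print(fractions["L1"], fractions["memory"])  # 0.7 0.02
```

A load whose fractions skew toward L3 or memory is exactly the kind of request flagged above as correlating with lower IPC in its program segment.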
  • the memory access statistics information mentioned above also includes the index value of each memory access request.
  • the embodiments of this application may further include the following steps:
  • the aforementioned index value refers to the program counter (PC) value of a memory access request.
  • the source code of the corresponding memory access request can be obtained through the PC value, and the source code can be output to the information display interface for analysis by relevant testers.
  • the source code corresponding to each PC value can be obtained through addr2line.
  • addr2line is a debugging-information reading tool that can map a program counter (PC) value to a specific line of source code.
  • the source code corresponding to each index value can be sequentially output to the information display interface for display.
  • Relevant testers can analyze the displayed source code to obtain its memory access mode. Users can input the memory access mode as input information, and then this embodiment can determine the memory access mode corresponding to the source code by receiving the mode information input by the user.
  • the high-level language source file refers to the source code of the test program.
  • the compiler can generate a binary executable file containing code and data segments, as well as debugging information (e.g., DWARF format debugging information, Debugging With Arbitrary Record Formats; DWARF is a debugging information file format used by many compilers and debuggers to support source code-level debugging).
  • This debugging information may contain a mapping between source code and PC values.
  • the debugging information reading tool can read the source code corresponding to each memory access request, i.e., the high-level language code, based on the mapping relationship between source code and PC values in the debugging information and the performance counter statistical result. Then, the memory access mode of each memory access request can be determined through the high-level language code.
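As a sketch of this flow, the PC-to-source mapping that addr2line derives from the DWARF debug information can be modeled as a lookup table; the file names and line numbers below are hypothetical, and a real run would invoke `addr2line -e <binary> <pc>` against the compiled test program instead.

```python
# Hypothetical PC -> (source file, line) mapping, standing in for the
# DWARF debug info that addr2line reads from the binary.
debug_info = {
    0x119FA: ("kernel.c", 42),
    0x119FE: ("kernel.c", 43),
}

def resolve(pc):
    """Resolve a PC value to a source location, as addr2line would;
    addr2line prints '??:0' for addresses without debug info."""
    return debug_info.get(pc, ("??", 0))

for pc in (0x119FA, 0x119FE, 0x2000):
    src, line = resolve(pc)
    print(f"{pc:#x} -> {src}:{line}")  # shown to testers, who label the mode
```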
  • embodiments of this application can evaluate the performance of the cache or the effectiveness of optimization algorithms based on the memory access mode, the cache hit count at each level in the memory access statistics, and the percentage of cache hits at each level.
  • the memory access statistics also include the index value of each memory access request; based on the index value of each memory access request, the source code corresponding to each index value is output to the information display interface; the mode information input by the user for the source code corresponding to each index value is received based on the information display interface, and the mode information is determined as the memory access mode of the memory access request corresponding to each index value.
  • the memory access mode corresponding to each memory access request can be determined by receiving user input.
  • the operation of evaluating the performance of the cache to be evaluated based on the memory access statistics and the memory access patterns of each memory access request may specifically include the following steps in this embodiment:
  • the performance of the cache to be evaluated is evaluated based on the number of hits of each memory access request and the reference number of hits.
  • The aforementioned reference hit count can be preset; it is the number of cache hits that a memory access request of each memory access mode reaches when the cache meets the performance requirements.
  • the embodiments of this application can evaluate the performance of the cache to be evaluated by the reference hit counts of different memory access modes.
  • the reference hit count can be used as a performance threshold. If the hit count of the memory access request is not less than the reference hit count corresponding to the memory access pattern of the memory access request, the cache performance of the cache to be evaluated for that memory access pattern is determined to meet the requirements. Conversely, if the hit count of the memory access request is less than the reference hit count corresponding to the memory access pattern of the memory access request, the cache performance of the cache to be evaluated for that memory access pattern is determined to not meet the requirements.
  • Embodiments of this application may also pre-set reference hit counts for different cache levels, i.e., the number of hits that memory access requests of each memory access mode reach in each cache level when the performance requirements are met. Accordingly, the above evaluation can also combine the per-level hit counts with the per-level reference hit counts.
  • This application embodiment obtains the reference hit count corresponding to the memory access mode of each memory access request; based on the hit count of each memory access request and the reference hit count, the performance of the cache to be evaluated is assessed. By setting the reference hit count, the performance of the cache can be effectively evaluated.
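A minimal sketch of this threshold check, with assumed per-mode reference hit counts:

```python
# Assumed per-mode reference hit counts (performance thresholds).
REFERENCE_HITS = {"strided": 800, "indirect": 300}

def meets_requirement(mode, hit_count):
    # Performance for a mode meets the requirement when the observed hit
    # count is not less than the mode's reference hit count.
    return hit_count >= REFERENCE_HITS[mode]

print(meets_requirement("strided", 850))   # True
print(meets_requirement("indirect", 120))  # False
```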
  • embodiments of this application may further include:
  • the cache to be evaluated is optimized using the first optimization algorithm, and the operation of obtaining the memory access statistics of the test program is re-executed based on the optimized cache to obtain the second memory access statistics.
  • the memory access statistics corresponding to the cache before optimization are used as the first memory access statistics.
  • the second memory access statistics include the number of times each memory access request hits the optimized cache and the optimization state of the first optimization algorithm when each memory access request hits the optimized cache.
  • the first optimization algorithm refers to an optimization technique for the cache, which can be any prefetcher, prefetch technique, prefetch algorithm, or replacement strategy, etc.
  • the first optimization algorithm can be selected according to actual needs, and this application embodiment does not limit it. It is understood that the first optimization algorithm can optimize the performance of the cache, and different optimization algorithms have different optimization effects. This application embodiment can evaluate the optimization effect of the first optimization algorithm.
  • the optimization effect of the first optimization algorithm in this application embodiment can be evaluated from the perspective of different memory access modes. Specifically, for any memory access request and its memory access mode, the number of hits for the memory access request can be obtained from the first memory access statistics as the first number, and the number of hits for the memory access request can be obtained from the second memory access statistics as the second number. If the second number is greater than the first number, it indicates that the first optimization algorithm can improve the processing efficiency of the cache for this memory access mode. Furthermore, if the second number is greater than the first number, and the difference between the two is greater than a preset threshold, it indicates that the first optimization algorithm can greatly improve the processing efficiency of the cache for this memory access mode, and its optimization effect is good.
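The before/after comparison above can be sketched as follows; the threshold value and the returned labels are assumptions for illustration, not terms from the patent.

```python
def optimization_effect(first_hits, second_hits, threshold=100):
    """Compare hit counts for one memory access mode before (first) and
    after (second) applying the optimization algorithm; threshold is an
    assumed preset for what counts as a large improvement."""
    if second_hits <= first_hits:
        return "no improvement"
    if second_hits - first_hits > threshold:
        return "large improvement"
    return "improvement"

print(optimization_effect(500, 680))  # large improvement
print(optimization_effect(500, 550))  # improvement
print(optimization_effect(500, 480))  # no improvement
```

Run per memory access mode, this yields exactly the mode-by-mode verdict described above rather than a single aggregate number.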
  • FIG. 4 a statistical result diagram of this application is shown. As shown in Figure 4, it illustrates six memory access requests with PC values of 0x119fa, 0x119fe, 0x119ea, 0x119f0, 0x119f8, and 0x119f4, respectively.
  • the row containing each PC value corresponds to the number of times the corresponding memory access request is hit in the L1 cache, L2 cache, L3 cache, and main memory.
  • Assume that 0x119fa and 0x119fe are indirect memory accesses and the other requests are strided memory accesses. As shown, 0x119fa and 0x119fe have more hits in the L3 cache and main memory than the other memory access requests, indicating that the cache performs poorly for the indirect memory access mode but better for the strided memory access mode.
  • Figure 5 illustrates another statistical result of this application.
  • the statistical results of the cache after optimization using a certain hardware prefetching technique can be seen.
  • the hardware prefetching technique can increase the number of hits of indirect memory access 0x119fa and 0x119fe in the L1 cache and reduce the number of hits in the L2 cache and below, which can effectively improve the processing efficiency of the cache system for indirect memory access mode.
  • This prefetching technique uses a stride predictor together with indirect memory access identification to handle the prefetching of indirect memory accesses into the L1 cache.
  • Ideally, the number of indirect memory access instructions returned from the L1 cache should be close to the number of strided memory accesses they depend on.
  • However, the number of indirect memory accesses returned from the L1 cache is still less than the number of strided memory accesses they depend on; therefore, it can be concluded that this prefetching technique still has room for improvement.
  • the IPC of the test program did not increase. This may be due to factors within other modules of the processor core, such as branch prediction. If existing techniques are used to evaluate this prefetching technique solely based on IPC, the conclusion that the prefetching technique is useless will be drawn. However, the embodiments of this application evaluate based on the number of hits and memory access patterns. It can be seen that as the number of L1 cache hits increases, the number of requests sent to the L2 cache decreases accordingly, thus causing the L2 cache hit rate to decrease. If only the cache hit rate is used to evaluate this technique, no intuitive results can be obtained, or the erroneous conclusion that the technique reduces the L2 cache hit rate can be drawn, resulting in a poor evaluation effect.
  • This application addresses the difficulties and poor evaluation results of traditional methods for evaluating cache optimization mechanisms such as prefetchers and replacement algorithms by combining memory access patterns and the effect of each memory access instruction in the caching system.
  • the reason for each cache hit can be recorded in the metadata of the aforementioned message packet. This could be a hit after prefetching using prefetching technology 1, a hit after prefetching using prefetching technology 2, or a previous access to the address. Furthermore, the reason for each cache miss can also be recorded in the metadata, such as: the first access not covered by the prefetcher, the prefetcher covering the cache but not retrieving it in time, being swapped out of the cache due to capacity limitations, or being swapped out of the cache due to conflicts. This allows for a more refined evaluation of the cache and optimization algorithms based on the metadata.
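The per-packet metadata described above can be modeled as a small record. This is an illustrative sketch only: the enum values mirror the hit and miss reasons listed in the text, but the field names and layout are assumptions, not the actual packet format.

```python
# Illustrative model of message-packet metadata recording why each access
# hit or missed, enabling the finer-grained evaluation described above.
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class HitReason(Enum):
    PREFETCH_TECH_1 = auto()   # hit after prefetching by prefetching technology 1
    PREFETCH_TECH_2 = auto()   # hit after prefetching by prefetching technology 2
    PREVIOUS_ACCESS = auto()   # the address was accessed previously

class MissReason(Enum):
    NOT_COVERED = auto()       # first access, not covered by the prefetcher
    PREFETCH_LATE = auto()     # covered by the prefetcher but not retrieved in time
    CAPACITY_EVICTION = auto() # swapped out of the cache due to capacity limits
    CONFLICT_EVICTION = auto() # swapped out of the cache due to conflicts

@dataclass
class PacketMetadata:
    pc: int                               # PC value of the load instruction
    level: int                            # level parameter set on the return path
    hit_reason: Optional[HitReason] = None
    miss_reason: Optional[MissReason] = None
```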
  • the aforementioned second memory access statistics may include the optimization status of the first optimization algorithm when each memory access request hits the optimized cache.
  • optimization status refers to the optimization parameters of the optimization algorithm.
  • Different optimization algorithms have different optimization parameters.
  • the aforementioned optimization status may be the Least Recently Used (LRU) distance of the Least Recently Used replacement algorithm or the rereference interval predicted by the Re-Reference Interval Prediction (RRIP) replacement algorithm, etc.
  • the aforementioned optimization status can be obtained by reading the current values of the optimization parameters of the optimization algorithm.
  • This embodiment of the application optimizes the cache to be evaluated using a first optimization algorithm, and re-executes the operation of obtaining the memory access statistics of the test program based on the optimized cache to obtain second memory access statistics.
  • the memory access statistics corresponding to the cache before optimization are then used as the first memory access statistics.
  • the optimization effect of the first optimization algorithm is evaluated. This allows for an effective evaluation of the first optimization algorithm.
  • embodiments of this application may further include:
  • the second optimization algorithm is used to optimize the optimized cache, and the operation of obtaining the memory access statistics of the test program is re-executed to obtain the third memory access statistics.
  • the third memory access statistics include the optimization status of the first optimization algorithm and the optimization status of the second optimization algorithm when each memory access request hits the optimized cache.
  • the second optimization algorithm mentioned above also refers to the optimization technique for the cache. It can be any prefetcher, prefetch technique, prefetch algorithm or replacement strategy that is different from the first optimization algorithm.
  • the second optimization algorithm can be selected according to actual needs. This application embodiment does not limit this.
  • a caching system may employ two or more optimization techniques simultaneously.
  • the effects of these different techniques may either compound or cancel each other out, leading to poorer caching performance.
  • Although a hardware prefetching technique and a replacement strategy can each effectively optimize computer system performance when compared individually with a baseline, using both methods simultaneously may result in performance worse than using either method alone. This is because hardware prefetching increases memory access traffic and accesses data earlier than normal reads, violating the assumptions made in the replacement algorithm's design.
  • the embodiments of this application further optimize the cache using the second optimization algorithm.
  • the cache applies both the first and second optimization algorithms.
  • the embodiments of this application can use a test program to perform multiple access operations to obtain third memory access statistics.
  • the third memory access statistics can include the optimization status of the first optimization algorithm and the optimization status of the second algorithm when each memory access request hits the cache.
  • embodiments of this application can evaluate the overall optimization effect of the first and second optimization algorithms using second and third memory access statistics. Specifically, the optimization state in the second memory access statistics can be compared with the optimization state of the first optimization algorithm in the third memory access statistics. If the optimization state of the first optimization algorithm in the third memory access statistics deteriorates, it can be concluded that the optimization effect of applying both the first and second optimization algorithms simultaneously is poor.
  • For example, the rereference interval (optimization state) predicted by the replacement algorithm for a certain memory access request may be relatively long, but after the addition of hardware prefetching, the rereference interval predicted for that memory access instruction becomes shorter.
  • the replacement algorithm is affected by the prefetching, resulting in poor overall optimization effect of the two.
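This interference check can be sketched as a comparison of the recorded optimization states. The sketch below assumes RRIP-style predicted re-reference intervals are recorded per PC value in the second statistics (first algorithm alone) and third statistics (both algorithms); a shrinking interval flags the requests where prefetching disturbs the replacement algorithm, as described above.

```python
# Minimal sketch: detect prefetcher/replacement-algorithm interference by
# comparing predicted re-reference intervals before and after the second
# optimization algorithm is added. Data layout is an assumption.

def interference_detected(second_intervals, third_intervals):
    """Both arguments map PC value -> predicted re-reference interval.
    Returns the set of PCs whose predicted interval shrank after adding
    prefetching, indicating the replacement algorithm is being disturbed."""
    return {
        pc
        for pc, before in second_intervals.items()
        if third_intervals.get(pc, before) < before
    }
```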
  • this application provides a method for evaluating cache performance.
  • memory access statistics of the test program are obtained. These statistics include at least the number of hits of each memory access request in the cache to be evaluated. Based on the memory access statistics and the access patterns of each memory access request, the performance of the cache to be evaluated is assessed. Thus, cache performance can be evaluated by obtaining the number of hits of memory access requests in the cache. Furthermore, by using the memory access statistics and the access patterns of each memory access request, cache performance can be evaluated from different access pattern dimensions, achieving multi-dimensional evaluation and improving the accuracy of cache performance evaluation.
  • this embodiment evaluates cache performance based on the number of hits and the access patterns of memory access requests.
  • Compared to evaluation methods using metrics such as the cache miss rate or average instructions per cycle (IPC), this has clear advantages: the cache miss rate only reflects the proportion of missed memory access requests received by the cache and cannot pinpoint the timing of the misses. Program segments with the same miss rate may have different impacts on IPC depending on when the misses occur, so the miss rate fails to identify the program segments that have a significant impact on cache performance.
  • IPC is sensitive to factors such as branch prediction and cannot directly reflect the coverage of memory accesses by the cache system.
  • This embodiment evaluates cache performance from different dimensions of access patterns by using the number of hits and the access patterns of each memory access request, achieving multi-dimensional evaluation. This allows for the identification of access patterns that have a significant impact on cache performance, improving the accuracy and interpretability of cache performance evaluation.
  • the embodiments of this application can also provide a basis for the design optimization of cache systems, and can also provide a basis for the improvement of optimization algorithms.
  • the embodiments of this application can assist in the design of hardware cache systems and can be conveniently used in simulators or simulation environments.
  • the device 20 may specifically include:
  • the acquisition module 201 is used to acquire memory access statistics of the test program in response to multiple memory access operations of each memory access request in the test program; the memory access statistics include at least the number of times each memory access request is hit in the cache to be evaluated;
  • the first evaluation module 202 is used to evaluate the performance of the cache to be evaluated based on the memory access statistics and the memory access patterns of each memory access request.
  • the storage hierarchy of the cache to be evaluated includes at least two levels; the acquisition module 201 is specifically used for:
  • the number of hits of each memory access request in each level of cache is obtained as the memory access statistics.
  • the acquisition module 201 includes:
  • the configuration submodule is used to set multiple hit counters for each memory access request; different hit counters correspond to different levels of cache.
  • the encapsulation submodule is used to encapsulate the memory access request into a request packet for any memory access operation of each memory access request, and set the hierarchical parameters in the request packet;
  • the parameter submodule is used to increment the level parameter in the request packet by 1 each time the packet passes through a level on its return from the target level; the target level is the level of the cache hit by the request packet;
  • the determination submodule is used to determine the target level of the memory access request based on the value of the level parameter in the returned request packet, and, among the hit counters corresponding to the memory access request, to increment the hit counter corresponding to the target level by 1;
  • the hit count acquisition submodule is used to obtain the hit count of each memory access request in the cache at each level based on the current value of each hit counter corresponding to each memory access request, provided that the evaluation conditions are met.
  • the memory access statistics further include the index value of each memory access request; the device further includes:
  • the output module is used to output the source code corresponding to each memory access request to the information display interface based on the index value of each memory access request.
  • the receiving module is used to receive the mode information input by the user for the source code corresponding to each index value based on the information display interface, and to determine the mode information as the memory access mode of the memory access request corresponding to each index value.
  • the first evaluation module includes:
  • the reference acquisition submodule is used to acquire the reference hit count corresponding to the memory access mode of each memory access request for each memory access request.
  • the evaluation submodule is used to evaluate the performance of the cache to be evaluated based on the number of hits for each memory access request and a reference number of hits.
  • the device further includes:
  • the first optimization module is used to optimize the cache to be evaluated using a first optimization algorithm, and to re-execute the operation of obtaining the memory access statistics of the test program based on the optimized cache to obtain second memory access statistics, and to use the memory access statistics corresponding to the cache before optimization as the first memory access statistics;
  • the second memory access statistics include the number of times each memory access request hits the optimized cache and the optimization state of the first optimization algorithm when each memory access request hits the optimized cache;
  • the second evaluation module is used to evaluate the optimization effect of the first optimization algorithm based on the first memory access statistics, the second memory access statistics, and the memory access patterns of each memory access request.
  • the device further includes:
  • the second optimization module is used to optimize the optimized cache using the second optimization algorithm and re-execute the operation of obtaining the memory access statistics of the test program to obtain the third memory access statistics.
  • the third memory access statistics include the optimization status of the first optimization algorithm and the optimization status of the second optimization algorithm when each memory access request hits the optimized cache.
  • the third evaluation module is used to evaluate the optimization effects of the first optimization algorithm and the second optimization algorithm based on the second memory access statistics and the third memory access statistics.
  • this application provides a cache performance evaluation apparatus.
  • By responding to multiple memory access operations of each memory access request in a test program, it obtains memory access statistics of the test program. These statistics include at least the number of hits of each memory access request in the cache to be evaluated. Based on the memory access statistics and the access patterns of each memory access request, the performance of the cache to be evaluated is assessed. Thus, by obtaining the number of hits of memory access requests in the cache, cache performance can be evaluated. Furthermore, by using the memory access statistics and the access patterns of each memory access request, cache performance can be evaluated from different access pattern dimensions, achieving multi-dimensional evaluation and improving the accuracy of cache performance evaluation.
  • This application also provides an electronic device, including: a processor and a memory for storing processor-executable instructions, wherein the processor is configured to execute the above-described cache performance evaluation method.
  • the electronic device includes a processor, a memory, a communication interface, and a communication bus.
  • the processor, the memory, and the communication interface communicate with each other through the communication bus.
  • the memory is used to store at least one executable instruction, which causes the processor to execute the cache performance evaluation method of the aforementioned embodiment.
  • the electronic devices in the embodiments of this application include mobile electronic devices and non-mobile electronic devices.
  • the processor can be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array), or other programmable devices, transistor logic devices, hardware components, or any combination thereof.
  • the processor can also be a combination that implements computational functions, such as a combination of one or more microprocessors, a combination of a DSP and a microprocessor, etc.
  • the communication bus may include a path for transmitting information between the memory and the communication interface.
  • the communication bus may be a PCI (Peripheral Component Interconnect) bus or an EISA (Extended Industry Standard Architecture) bus, etc.
  • the communication bus can be divided into address bus, data bus, control bus, etc. For ease of illustration, only one line is used in Figure 7, but this does not indicate that there is only one bus or one type of bus.
  • the memory may be ROM (Read-Only Memory) or other types of static storage devices that can store static information and instructions, RAM (Random Access Memory) or other types of dynamic storage devices that can store information and instructions, or it may be EEPROM (Electrically Erasable Programmable Read-Only Memory), CD-ROM (Compact Disc Read-Only Memory), magnetic tape, floppy disk, optical data storage devices, etc.
  • This application also provides a non-transitory computer-readable storage medium; when the instructions in the storage medium are executed by the processor of an electronic device (server or terminal), the processor is enabled to execute the cache performance evaluation method shown in FIG. 1.
  • This application also provides a computer program product containing instructions that, when run on a computer, cause the computer to execute the cache performance evaluation method shown in FIG. 1.
  • This application also provides a chip, which includes a processor and a communication interface.
  • the communication interface is coupled to the processor.
  • the processor is used to run programs or instructions to implement the various processes of the above-described cache performance evaluation method embodiments and can achieve the same technical effect. To avoid repetition, it will not be described again here.
  • the chip mentioned in the embodiments of this application may also be referred to as a system-on-a-chip, a system chip, a chip system, or a system-on-chip chip, etc.
  • the embodiments of this application can be provided as methods, apparatus, or computer program products. Therefore, the embodiments of this application can be implemented entirely or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it can be implemented entirely or partially as a computer program product.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of this application are generated.
  • the computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
  • the computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another.
  • the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means.
  • the computer-readable storage medium can be any available medium that a computer can access or a data storage device such as a server or data center that integrates one or more available media.
  • the available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media (e.g., solid-state disks (SSDs)).
  • These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flowcharts and/or one or more block diagrams.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing terminal equipment to cause a series of operational steps to be performed on the computer or other programmable terminal equipment to produce a computer-implemented process, such that the instructions, which execute on the computer or other programmable terminal equipment, provide steps for implementing the functions specified in one or more flowcharts and/or one or more block diagrams.
  • the disclosed apparatus and methods can be implemented in other ways.
  • the apparatus embodiments described above are merely illustrative.
  • the division of units is only a logical functional division, and in actual implementation, there may be other division methods.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical, or other forms.
  • the units described as separate components may or may not be physically separate.
  • the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
  • the functional units in the various embodiments of this disclosure can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.
  • modules, units, and subunits can be implemented in one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), general-purpose processors, controllers, microcontrollers, microprocessors, other electronic units for performing the functions described in this disclosure, or combinations thereof.
  • the techniques described in the embodiments of this disclosure can be implemented by modules (e.g., procedures, functions, etc.) that perform the functions described in the embodiments of this disclosure.
  • the software code can be stored in memory and executed by a processor.
  • the memory can be implemented in the processor or externally.


Abstract

Embodiments of this application provide a cache performance evaluation method, apparatus, electronic device, and readable storage medium, relating to the field of computer technology. The method includes: in response to multiple memory access operations of each memory access request in a test program, obtaining memory access statistics of the test program, the memory access statistics including at least the number of hits of each memory access request in a cache to be evaluated; and evaluating the performance of the cache to be evaluated based on the memory access statistics and the memory access pattern of each memory access request.

Description

Cache performance evaluation method, apparatus, electronic device, and readable storage medium
This application claims priority to Chinese patent application No. 202410634793.6, filed with the China National Intellectual Property Administration on May 21, 2024, entitled "Cache performance evaluation method, apparatus, electronic device, and readable storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of computer technology, and in particular to a cache performance evaluation method, apparatus, electronic device, and readable storage medium.
Background
With the development of computer technology, the design requirements for central processing units (CPUs) have become increasingly demanding, making it necessary to evaluate the performance of the designed CPU.
The cache is an important component of the CPU. It stores data or instructions that the CPU needs to access frequently and can improve the CPU's operating speed and efficiency. Cache performance has a significant impact on CPU performance; therefore, how to evaluate cache performance has become an urgent problem to be solved.
Summary
Embodiments of this application provide a cache performance evaluation method, apparatus, electronic device, and readable storage medium, which can solve the problem in the prior art of how to evaluate cache performance.
To solve the above problem, an embodiment of this application discloses a cache performance evaluation method, the method including:
in response to multiple memory access operations of each memory access request in a test program, obtaining memory access statistics of the test program, the memory access statistics including at least the number of hits of each memory access request in a cache to be evaluated;
evaluating the performance of the cache to be evaluated based on the memory access statistics and the memory access pattern of each memory access request.
In another aspect, an embodiment of this application discloses a cache performance evaluation apparatus, the apparatus including:
an acquisition module, configured to obtain memory access statistics of a test program in response to multiple memory access operations of each memory access request in the test program, the memory access statistics including at least the number of hits of each memory access request in a cache to be evaluated;
a first evaluation module, configured to evaluate the performance of the cache to be evaluated based on the memory access statistics and the memory access pattern of each memory access request.
In yet another aspect, an embodiment of this application further discloses an electronic device including a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with one another via the communication bus; the memory is configured to store executable instructions that cause the processor to execute the aforementioned cache performance evaluation method.
An embodiment of this application further discloses a readable storage medium; when instructions in the readable storage medium are executed by a processor of an electronic device, the electronic device is enabled to execute the aforementioned cache performance evaluation method.
An embodiment of this application further discloses a computer program product containing instructions that, when run on a computer, cause the computer to execute the aforementioned cache performance evaluation method.
Embodiments of this application include the following advantages:
Embodiments of this application provide a cache performance evaluation method: in response to multiple memory access operations of each memory access request in a test program, memory access statistics of the test program are obtained, the statistics including at least the number of hits of each memory access request in a cache to be evaluated; the performance of the cache to be evaluated is then evaluated based on the memory access statistics and the memory access pattern of each request. In this way, cache performance can be evaluated by obtaining the number of cache hits of the memory access requests. Meanwhile, by using the memory access statistics together with the pattern of each request, cache performance can be evaluated along the dimensions of different access patterns, achieving a multi-dimensional evaluation and improving the accuracy and interpretability of cache performance evaluation.
The above description is only an overview of the technical solution of this application. In order to understand the technical means of this application more clearly so that it can be implemented according to the contents of the specification, and to make the above and other objects, features, and advantages of this application more obvious and understandable, specific embodiments of this application are set forth below.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are some embodiments of this application; for a person of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a flowchart of the steps of an embodiment of a cache performance evaluation method of this application;
FIG. 2 is a schematic diagram of obtaining hit counts according to this application;
FIG. 3 is a schematic diagram of obtaining memory access patterns according to this application;
FIG. 4 is a schematic diagram of statistical results according to this application;
FIG. 5 is a schematic diagram of further statistical results according to this application;
FIG. 6 is a structural block diagram of an embodiment of a cache performance evaluation apparatus of this application;
FIG. 7 is a structural block diagram of an electronic device for cache performance evaluation provided by an example of this application.
Specific Embodiments
To make the objects, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are some rather than all of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the scope of protection of this application.
A person skilled in the art should understand that, in the disclosure of this application, the terms "first", "second", "third", "fourth", "fifth", etc. are used only to distinguish different structures and do not limit the number or connection relationship of specific structures. In addition, orientations or positional relationships indicated by "longitudinal", "transverse", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc. are based on the orientations or positional relationships shown in the drawings; they are used only to facilitate and simplify the description of this application, rather than indicating or implying that the referred devices or elements must have a specific orientation or be constructed and operated in a specific orientation. Therefore, the above terms shall not be construed as limiting this application.
Method Embodiment
Referring to FIG. 1, a flowchart of the steps of an embodiment of a cache performance evaluation method of this application is shown. The method may specifically include the following steps:
Step 101: In response to multiple memory access operations of each memory access request in a test program, obtain memory access statistics of the test program; the memory access statistics include at least the number of hits of each memory access request in a cache to be evaluated.
Step 102: Evaluate the performance of the cache to be evaluated based on the memory access statistics and the memory access pattern of each memory access request.
It should be noted that, with respect to steps 101-102 above, the cache in this application may be a cache system design with caching functionality, which may belong to a processor to be evaluated. It can be understood that a processor usually includes modules or systems with different functions, and to evaluate the processor's functionality, the different modules or systems are usually evaluated separately; the embodiments of this application evaluate the cache system. Optionally, the embodiments of this application may be applied on the processor to be evaluated, or on a processor simulator, for example gem5, on which the cache to be evaluated can be built. gem5 is a cycle-accurate processor simulator: its processor core simulation is cycle-accurate, and the cache system can be simulated at a certain frequency, for example once per update cycle (TICK, the simulator variable indicating how many times per second the processor state is updated); that is, a memory access statistical result is obtained once per clock cycle.
The above test program may be constructed in advance, either randomly or according to certain test requirements, which is not limited in the embodiments of this application. Specifically, the test program may contain multiple different memory access requests, whose memory access patterns may be the same or different. In addition, a memory access request in the embodiments of this application refers to a load instruction.
Specifically, the embodiments of this application may construct in advance multiple memory access requests with different memory access patterns to obtain the test program. The memory access pattern refers to the access type of the memory access request, which may include stepping access, indirect access, and the like.
When each memory access request performs memory access operations, the embodiments of this application may first obtain the memory access statistics of the test program. It should be noted that the memory access statistics include at least the number of hits of each memory access request in the cache to be evaluated. Specifically, the cache is memory located between the CPU and main memory; it usually has a small capacity but is very fast. When executing a memory access request, the processor usually first fetches the data from the cache. If the required data exists in the cache, the memory access request hits the cache, and main memory does not need to be accessed. Correspondingly, if the required data does not exist in the cache, the memory access request misses the cache, and the required data must be fetched from main memory. When the CPU reads data from main memory, in addition to the data to be loaded this time, it often also prefetches some data into the cache, so that the data the CPU will subsequently read is already in the cache, which can effectively improve performance. Further, since fetching data from the cache is more efficient than fetching data from main memory, the more times memory access requests hit the cache, the more efficiently the processor executes them; that is, the better the cache performance, the better the processor performance. Therefore, the embodiments of this application can evaluate cache performance by obtaining the memory access statistics of the test program.
In the embodiments of this application, one memory access request corresponds to one static load instruction; during execution of the test program, one static load instruction may be executed multiple times as dynamic instructions. Further, in the embodiments of this application, a memory access request may be executed multiple times, thereby obtaining the number of cache hits of each memory access request over the multiple executions.
Specifically, the above operation of obtaining the memory access statistics may be performed via the processor's performance counters, which can collect statistics on the memory access behavior of the processor while it executes the test program, counting, for the different memory access requests over multiple executions, the respective numbers of hits in the cache and in main memory. The embodiments of this application may read the performance counters through a specified software tool (for example, the performance analysis tool perf) to obtain the number of cache hits of each memory access request. Further, regarding the memory access patterns, the test program may be constructed in advance to contain memory access requests with different patterns, so that the pattern of each request is known. Alternatively, the test program in the embodiments of this application may be constructed randomly; after the memory access statistics are obtained, the memory access requests can be output and displayed so that relevant personnel can assess the pattern of each request, and the pattern of each memory access request is then obtained by receiving the personnel's input. A memory access operation in the embodiments of this application may be a data access or an instruction access.
It can be understood that the more times a memory access request hits the cache, the better the cache performance. For the same cache system, the performance for memory access requests with different patterns may differ; therefore, the embodiments of this application can evaluate cache performance in combination with the memory access pattern of each request. Specifically, different performance levels may be defined in advance, and the performance level of the cache for each pattern may be determined from the number of cache hits of each memory access request and its pattern. For example, the different performance levels of each pattern may each be associated with a corresponding interval of cache hit counts, where a higher performance level indicates better performance and is associated with a hit-count interval with larger values. Further, different weight coefficients may be set for the different patterns according to actual access requirements, and a weighted calculation may be performed on the obtained performance levels of the different patterns to obtain the performance level of the cache to be evaluated.
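The weighted evaluation described above can be sketched as follows. The level intervals and weights here are illustrative assumptions for demonstration only; the patent leaves their concrete values to actual requirements.

```python
# Hypothetical sketch: map each pattern's hit count to a performance level
# via hit-count intervals, then combine the levels with per-pattern weights.

# performance level -> lower bound of the associated cache-hit-count interval
LEVEL_BOUNDS = [(3, 2_000_000), (2, 1_000_000), (1, 0)]  # assumed intervals

def pattern_level(hits):
    """Map a hit count to a performance level (higher level = better)."""
    for level, lower_bound in LEVEL_BOUNDS:
        if hits >= lower_bound:
            return level
    return 0

def overall_level(hits_by_pattern, weights):
    """Weighted combination of per-pattern levels for the cache under test."""
    total = sum(weights.values())
    return sum(
        pattern_level(hits) * weights[p] for p, hits in hits_by_pattern.items()
    ) / total
```

With the example figures from the text (stepping: 2,584,088 hits; indirect: 1,850,707 hits) and equal weights, the stepping pattern maps to level 3, the indirect pattern to level 2, and the overall level is 2.5.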
Further, the method provided by the embodiments of this application can be applied to a processor, so the above different performance levels can be uploaded to the processor in advance.
For example, suppose the test program contains memory access request A with an indirect access pattern and memory access request B with a stepping access pattern. If the memory access statistics indicate that A hits the cache 1,850,707 times and main memory 3,059 times, while B hits the cache 2,584,088 times and main memory 645 times, it can be concluded that the cache performs well for stepping accesses but poorly for indirect accesses. Further, according to the preset performance levels, the performance of the cache to be evaluated for the different access patterns can be classified further.
The cache performance evaluation method provided by the embodiments of this application obtains memory access statistics of a test program in response to multiple memory access operations of each memory access request in the test program, the statistics including at least the number of hits of each request in the cache to be evaluated, and evaluates the performance of the cache based on the statistics and the memory access pattern of each request. In this way, cache performance can be evaluated by obtaining the number of cache hits of the memory access requests. Meanwhile, by using the statistics together with the patterns of the requests, cache performance can be evaluated along the dimensions of different access patterns, achieving a multi-dimensional evaluation and improving the accuracy and interpretability of cache performance evaluation.
Further, the embodiments of this application evaluate cache performance based on hit counts and the access patterns of memory access requests. Compared with evaluation based on metrics such as the cache miss rate or the average number of instructions executed per clock cycle (Instructions Per Cycle, IPC), the cache miss rate only reflects the proportion of misses among the memory access requests received by the cache; it cannot locate the timing of the misses, and program segments with the same miss rate may affect IPC differently depending on when the misses occur, so the segments with a large impact on cache performance cannot be identified. IPC, in turn, is sensitive to factors such as branch prediction and cannot directly reflect how well the cache system covers memory accesses. By using the hit counts and access patterns of the requests, the embodiments of this application can evaluate cache performance along the dimensions of different access patterns, achieving a multi-dimensional evaluation, identifying the patterns that significantly affect cache performance, and improving the accuracy of the evaluation.
In an optional embodiment of this application, the storage hierarchy of the cache to be evaluated includes at least two levels; the operation in step 101 of obtaining the memory access statistics of the test program in response to multiple memory access operations of each memory access request in the test program may specifically include the following step:
in response to multiple memory access operations of each memory access request in the test program, obtaining the number of hits of each memory access request in the cache at each level as the memory access statistics.
The levels refer to different layers of the memory hierarchy. When the storage hierarchy of the cache includes at least two levels, the cache is a multi-level cache system, and the caches at different levels differ in speed. The processor usually accesses the caches level by level: it first accesses the highest level, i.e., the first-level cache (L1 cache), to fetch the required data; if the L1 cache hits, the data is returned; if the data is not in the L1 cache, the second-level cache (L2 cache) is accessed, and so on; if the lowest level, i.e., the last-level cache, still misses, main memory is accessed. Meanwhile, the higher the level of a cache, the smaller its capacity and the faster its read speed. Correspondingly, the more hits memory access requests have in higher-level caches, the better the performance of the cache system; the more hits they have in lower-level caches, the more mediocre the performance.
Specifically, when the storage hierarchy of the cache to be evaluated includes at least two levels, the above memory access statistics can also be obtained through performance counters, which can separately count the hits of the memory access requests in the caches at the different levels, yielding the distribution of hit counts across the memory hierarchy.
Further, when the cache to be evaluated includes caches at at least two levels, the embodiments of this application may obtain the number of hits of each memory access request in the cache at each level as the memory access statistics, so that cache performance can be evaluated more precisely and in a more fine-grained manner according to the hit counts at the different levels.
Optionally, the operation of obtaining the number of hits of each memory access request in the cache at each level may specifically include the following steps in the embodiments of this application:
S11: Set multiple hit counters for each memory access request; different hit counters correspond to caches at different levels.
S12: For any memory access operation of each memory access request, encapsulate the memory access request into a request packet, and set a level parameter in the request packet.
S13: During the return of the request packet from a target level, increment the level parameter in the request packet by 1 each time the packet passes through a level; the target level is the level of the cache hit by the request packet.
S14: Determine the target level of the memory access request based on the value of the level parameter in the returned request packet, and, among the hit counters corresponding to the memory access request, increment the hit counter corresponding to the target level by 1.
S15: When an evaluation condition is satisfied, obtain the number of hits of each memory access request in the cache at each level based on the current values of the hit counters corresponding to each memory access request.
Specifically, with respect to steps S11-S15, the embodiments of this application may set multiple hit counters for each memory access request, with different counters corresponding to caches at different levels, so that the counters of each request can separately count its hits in the caches at the different levels.
The request packet refers to a message packet; encapsulating the memory access request into a request packet facilitates its transmission through the levels of the cache. Further, the embodiments of this application set a level parameter in the request packet, which indicates the level of the cache hit by the packet. Specifically, a variable may be created in the request packet as the level parameter, and after creation its initial value may be set to 0. The level parameter may be an integer variable, or of course a floating-point one, which is not limited in the embodiments of this application.
Further, after the request packet is issued, it is first transmitted to the L1 cache. If the target data to be accessed exists in the L1 cache, the target data is appended to the packet and the packet is returned; if not, the packet continues to the L2 cache, and so on, until the target data is accessed, after which the packet is returned upward level by level from the level that was hit. In the embodiments of this application, when the packet returns from the level of the cache that was hit, the level parameter is incremented by 1 each time a level is passed, so the level of the hit cache can be determined from the level parameter, which facilitates the counting by the hit counters.
Correspondingly, after the returned request packet is obtained, the target level of the cache hit by the memory access request can be determined from the value of the level parameter, and the hit counter corresponding to the target level can be incremented by 1, thereby counting the hits at the different levels.
The evaluation condition may be that the number of executions of the test program reaches a preset count threshold, or that the execution time of the test program reaches an execution time threshold; it can be set according to actual needs, which is not limited in the embodiments of this application. Further, once the evaluation condition is satisfied, the current values of the hit counters corresponding to each memory access request can be determined as the numbers of hits of the request in the caches at the different levels.
Further, in the embodiments of this application, the program counter (PC value) of each memory access request may be used as the index value of the request, so that the hit counters of different requests can be distinguished by PC value; that is, the PC value of a request serves as the identifier of its hit counters. Correspondingly, after a returned request packet is received, the hit counters of all memory access requests can be indexed by the PC value carried in the packet, and the counters whose identifier matches that PC value are determined as the hit counters of the request corresponding to the packet.
In the embodiments of this application, multiple hit counters are set for each memory access request, with different counters corresponding to caches at different levels; for any memory access operation of each request, the request is encapsulated into a request packet in which a level parameter is set; during the packet's return from the target level, the level parameter is incremented by 1 per level passed, the target level being the level of the cache hit by the packet; the target level of the request is determined from the value of the level parameter in the returned packet, and among the request's hit counters the one corresponding to the target level is incremented by 1; when the evaluation condition is satisfied, the numbers of hits of each request in the caches at the different levels are obtained from the current values of the request's hit counters. In this way, the level of the cache hit by a memory access request can be determined by setting the level parameter, and by setting multiple hit counters for one request, its hits in the caches at the different levels can be counted separately.
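Steps S11-S15 above can be sketched as a simplified software model. This is an illustrative simulation only, not the hardware implementation: the level parameter in a request packet starts at 0 and is incremented once per level crossed on the return path, so its final value identifies the level that was hit (0 = L1 cache).

```python
# Simplified model of steps S11-S15: per-PC hit counters indexed by the
# returned level parameter.
from collections import defaultdict

NUM_LEVELS = 4  # L1, L2, L3, main memory (assumed three-level cache)

# S11: per-request hit counters, one per level, indexed by PC value
hit_counters = defaultdict(lambda: [0] * NUM_LEVELS)

def access(pc, hit_level):
    """Simulate one access of the load at `pc` that hits at `hit_level`
    (0 = L1 cache, NUM_LEVELS - 1 = main memory)."""
    level_param = 0                     # S12: level parameter set in the packet
    for _ in range(hit_level):
        level_param += 1                # S13: +1 per level crossed on the way back
    hit_counters[pc][level_param] += 1  # S14: increment the matching counter

access(0xABC, 0)  # hits L1: level parameter stays 0
access(0xABC, 3)  # hits main memory: level parameter returns as 3
```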
示例性地,参照图2,示出了本申请的一种命中次数的获取示意图,图2中以高速缓存包含三个层级的高速缓存为例,分别为私有一级数据缓存、私有二级缓存以及共享三级缓存。处理器核心1可以执行测试程序,通过处理器中的性能计数器获取访存统计信息。其中,缓存队列指的是Load Store队列。其他核心指的是处理器核心1之外的处理器核心。
具体的,A处某一条访存请求(Load指令)完成地址计算并向高速缓存发出访存请求。访存请求被封装在消息包(Message Packet)中,并在高速缓存的各层次中传递。消息包中存放有元数据,元数据中可以包括层级参数d、缓存行的来源,缓存行对应的替换算法优先级等。根据图2,该消息包命中一级缓存,则其直接返回消息包,此时层级参数依然为0,表征该消息包的目标层级为一级缓存。消息包的元数据被记录在处理器性能模型的Load Store队列中。
B处一条Load指令对应的消息包从所命中的存储层级返回。根据图2,该消息包命中内存,则其返回消息包时,每经过一个层级,则将请求包中的层级参数加1,在其返回到缓存队列时,消息包中的层级参数为3,表征该消息包的目标层级为内存。C处一条Load指令被处理完成,此时可以根据Load Store队列中存储的元数据统计该访存指令在缓存系统中的响应情况。
如图2所示,性能计数器中可以包含5个表项,分别为程序计数器、一级缓存命中计数器、二级缓存命中计数器、三级缓存命中计数器一级内存访问计数,其中,程序计数器用于记录访存请求的PC值。如图2所示,其表示PC值为0xABC的访存请求在一级缓存中命中K次,在二级缓存中命中L次,在三级缓存中命中M次,在内存中命中N次。
进一步地,根据每个Load指令中,由各层级高速缓存返回的占比,可以反映高速缓存系统的效率。若某个访存请求由L3缓存或者由内存返回较多,那么这个访存请求所在的程序片段的IPC通常也较低。
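上述各层级返回占比的计算可以用如下示意性代码表示,其中的计数数值与判断方式均为假设数据,仅用于说明"由L3或内存返回较多的访存请求值得关注"这一思路:

```python
# 示意性代码:根据各层级命中计数计算各级返回占比。
def level_ratios(counts):
    """counts依次为一级、二级、三级缓存与内存的命中次数。"""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)

counts = {0xABC: [70, 20, 6, 4], 0x119F: [5, 5, 40, 50]}  # 假设的统计结果
for pc, c in counts.items():
    ratios = level_ratios(c)
    # 由L3或内存返回的占比较高时,该访存请求所在程序片段的IPC通常也较低
    slow_ratio = ratios[2] + ratios[3]
    print(f"PC={pc:#x} 由L3/内存返回的占比={slow_ratio:.2f}")
```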
可选地,上述访存统计信息中还包括各所述访存请求的索引值,上述获取所述测试程序的访存统计信息的操作之后,本申请实施例具体还可以包括如下步骤:
S21、基于各所述访存请求的索引值,将各索引值对应的源代码输出至信息显示界面。
S22、接收用户基于所述信息显示界面为各索引值对应的源代码所输入的各模式信息,并将各所述模式信息确定为各索引值对应的访存请求的访存模式。
其中,上述索引值指的是访存请求的PC值,对于一个静态指令,其PC值是唯一且固定的,因而本申请实施例可以通过PC值得到对应的访存请求的源代码,并将源代码输出至信息显示界面,供相关测试人员进行分析。具体的,可以通过addr2line获取各个PC值对应的源代码。其中,上述addr2line为调试信息读取工具,可以将一个程序计数器(Program Counter,PC)对应到源代码的某一行。
进一步地,本申请实施例可以将各索引值对应的源代码依次输出至信息显示界面显示,相关测试人员通过所显示的源代码进行分析得到其访存模式,用户可以将访存模式作为输入信息进行输入,进而本申请实施例通过接收用户所输入的模式信息确定源代码对应的访存模式即可。
示例性地,参照图3,示出了本申请的一种访存模式的获取示意图,其中,高级语言源文件指的是测试程序的源程序,编译器在编译阶段可以生成包含代码和数据段的二进制可执行文件以及调试信息(例如:DWARF格式的调试信息,Debugging With Arbitrary Record Formats,DWARF是一种调试信息文件格式,被许多编译器和调试器用来支持源码级调试),该调试信息中可以包含源代码与PC值的映射。当测试程序在处理器上运行完毕后,可以得到一个性能计数器的统计结果(访存统计信息)。调试信息读取工具可以基于调试信息中的源代码与PC值的映射关系以及性能计数器的统计结果,读取到每个访存请求所对应的源代码,也就是高级语言代码。进而通过高级语言代码可以确定出各访存请求的访存模式。
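作为参考,下面给出一个构造addr2line调用并解析其输出的示意性代码草图。其中addr2line的-e参数用于指定可执行文件,属于该工具的常见用法;"文件名:行号"为其默认输出格式,而示例中的二进制路径、PC值与输出内容均为假设数据:

```python
# 示意性代码:通过addr2line将PC值映射到源代码位置。
import subprocess

def addr2line_cmd(binary, pcs):
    """构造addr2line命令:-e指定可执行文件,其后跟若干PC值。"""
    return ["addr2line", "-e", binary] + [hex(pc) for pc in pcs]

def parse_addr2line(output):
    """将addr2line的每行输出"file:line"解析为(文件, 行号)元组。"""
    results = []
    for line in output.strip().splitlines():
        path, _, lineno = line.rpartition(":")
        results.append((path, int(lineno) if lineno.isdigit() else -1))
    return results

# 实际使用时(需要带调试信息编译的二进制文件,路径为假设):
# out = subprocess.run(addr2line_cmd("./a.out", [0x119FA]),
#                      capture_output=True, text=True).stdout
sample = "/src/kernel.c:42\n/src/kernel.c:57\n"   # 假设的工具输出
print(parse_addr2line(sample))
```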
进一步地,本申请实施例可以基于访存模式、访存统计信息中的各级高速缓存命中计数和各级高速缓存命中的占比,对高速缓存的性能或者优化算法的效果进行评估。
本申请实施例中,所述访存统计信息中还包括各所述访存请求的索引值;通过基于各所述访存请求的索引值,将各索引值对应的源代码输出至信息显示界面;接收用户基于所述信息显示界面为各索引值对应的源代码所输入的模式信息,并将所述模式信息确定为各索引值对应的访存请求的访存模式。这样,通过设置信息显示界面,通过接收用户的输入即可确定各访存请求对应的访存模式。
可选地,上述基于所述访存统计信息以及各所述访存请求的访存模式,对所述待评估的高速缓存的性能进行评估的操作,本申请实施例具体可以包括如下步骤:
S31、针对各个所述访存请求,获取所述访存请求的访存模式所对应的参照命中次数。
S32、基于各所述访存请求的命中次数以及参照命中次数,对所述待评估的高速缓存的性能进行评估。
其中,上述参照命中次数可以是预先设置的,可以是各访存模式的访存请求在性能满足要求的高速缓存中的命中次数,从而本申请实施例可以通过不同访存模式的参照命中次数对待评估的高速缓存的性能进行评估。
具体的,针对任一访存请求,可以将参照命中次数作为性能阈值,在该访存请求的命中次数不小于该访存请求的访存模式所对应的参照命中次数的情况下,确定该待评估的高速缓存针对该访存模式的缓存性能满足要求。相应地,在该访存请求的命中次数小于该访存请求的访存模式所对应的参照命中次数的情况下,确定该待评估的高速缓存针对该访存模式的缓存性能不满足要求。
可选地,本申请实施例还可以预先设置不同层次的高速缓存的参照命中次数,可以是各访存模式的访存请求在性能满足要求的高速缓存的各层次中的命中次数。相应地,上述评估方式也可以结合不同层次的高速缓存中的命中次数以及参照命中次数进行。
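上述基于参照命中次数的阈值比较可以用如下示意性代码表示,其中各访存模式的参照命中次数数值均为假设数据:

```python
# 示意性代码:按访存模式设置参照命中次数并进行阈值比较。
REFERENCE_HITS = {"步进访存": 90, "间接访存": 40}  # 各访存模式的参照命中次数(假设值)

def evaluate(pattern, hits):
    """命中次数不小于该模式的参照命中次数时,认为缓存性能满足要求。"""
    return hits >= REFERENCE_HITS[pattern]

print(evaluate("步进访存", 95))  # True:满足要求
print(evaluate("间接访存", 12))  # False:不满足要求
```

若按层级分别设置参照命中次数,只需将REFERENCE_HITS的值换成与层级数等长的列表,并逐级比较即可。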
本申请实施例通过针对各个所述访存请求,获取所述访存请求的访存模式所对应的参照命中次数;基于各所述访存请求的命中次数以及参照命中次数,对所述待评估的高速缓存的性能进行评估。通过设置参照命中次数,可以对高速缓存的性能进行有效评估。
可选地,本申请实施例具体还可以包括:
S41、采用第一优化算法对所述待评估的高速缓存进行优化,并基于优化后的高速缓存重新执行所述获取所述测试程序的访存统计信息的操作,得到第二访存统计信息,以及,将优化前的高速缓存对应的访存统计信息作为第一访存统计信息;所述第二访存统计信息包括各所述访存请求在优化后的高速缓存中的命中次数以及各所述访存请求在命中所述优化后的高速缓存时,所述第一优化算法的优化状态。
S42、基于所述第一访存统计信息、所述第二访存统计信息以及各访存请求的访存模式,对所述第一优化算法的优化效果进行评估。
针对上述步骤S41~S42,上述第一优化算法指的是对高速缓存的优化技术,可以是任一预取器、预取技术、预取算法或替换策略等,可以按照实际需求选择上述第一优化算法,本申请实施例对此不作限制。可以理解的,上述第一优化算法可以对高速缓存的性能进行优化,而不同的优化算法的优化效果不同,本申请实施例可以对第一优化算法的优化效果进行评估。
具体的,本申请实施例针对第一优化算法的优化效果,可以从不同访存模式的角度进行评估。具体的,针对任一访存请求的访存模式,可以从第一访存统计信息中获取该访存请求的命中次数,作为第一次数,以及,从第二访存统计信息中获取该访存请求的命中次数,作为第二次数,若第二次数大于第一次数,则表明该第一优化算法可以提高高速缓存对该访存模式的处理效率,进一步地,若第二次数大于第一次数,且两者差值大于预设阈值,则表明该第一优化算法可以大大提高高速缓存对该访存模式的处理效率,其优化效果较好。
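上述对第一优化算法优化效果的比较逻辑可以用如下示意性代码草图表示,其中的阈值与命中次数均为假设数据:

```python
# 示意性代码:比较优化前后同一访存请求的命中次数,评估第一优化算法的效果。
DELTA_THRESHOLD = 20  # 假设的预设阈值

def assess(first_hits, second_hits):
    """first_hits为优化前(第一访存统计信息)的命中次数,second_hits为优化后的。"""
    if second_hits <= first_hits:
        return "无明显优化"
    if second_hits - first_hits > DELTA_THRESHOLD:
        return "优化效果较好"
    return "有一定优化"

first = {0x119FA: 30}    # 第一访存统计信息:优化前一级缓存命中次数(假设)
second = {0x119FA: 80}   # 第二访存统计信息:优化后一级缓存命中次数(假设)
print(assess(first[0x119FA], second[0x119FA]))  # 优化效果较好
```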
示例性地,参照图4,示出了本申请的一种统计结果示意图,如图4所示,其示出了6个访存请求,其PC值分别为0x119fa、0x119fe、0x119ea、0x119f0、0x119f8、0x119f4,每个PC值所在的行对应该PC值对应的访存请求分别在一级缓存、二级缓存、三级缓存以及内存中的命中次数。以0x119fa和0x119fe的访存模式为间接访存、其余请求为步进访存为例,可以看出,0x119fa和0x119fe相较于其他访存请求在三级缓存以及内存中的命中次数较多,可知该高速缓存针对间接访存模式的性能较差,针对步进访存模式的性能较好。
又一示例性地,参照图5,示出了本申请的又一种统计结果示意图,如图5所示,图5为对高速缓存采用某一硬件预取技术优化后的统计结果,可以看到该硬件预取技术能够增加间接访存0x119fa和0x119fe在一级缓存中的命中次数,减少在二级缓存以下的层级中的命中次数,能够有效提高高速缓存系统对间接访存模式的处理效率。
另外,此统计结果还反映了该预取技术仍有改进空间。该预取技术使用步进访存预测器和间接访存识别技术来处理一级间接访存的预取。理想情况下,间接访存指令从一级缓存中返回的数量应与其依赖的步进型访存相近。而实际结果中间接访存从一级缓存中返回的数量仍少于其依赖的步进型访存。因此,可以得出该预取技术仍有改进空间的结论。
同时,该预取技术应用后,测试程序的IPC并没有上升,原因可能在于处理器核内部的其他模块,如分支预测等。如果采用现有技术仅使用IPC对该预取技术进行评估的方式,将会得到该预取技术无用的结论。而本申请实施例根据命中次数以及访存模式进行评估,可以看到一级缓存命中次数增多时,发送到二级缓存的请求数量也相应减少,从而二级缓存的命中率才出现下降;如果仅使用缓存命中率对该技术进行评估,则不能得到直观结果,或者得到该技术降低了二级缓存命中率的错误结论,评估效果较差。
本申请通过结合访存模式和每条访存指令在缓存系统中的效果,能够解决传统方法评估预取器、替换算法等缓存优化机制的困难,以及评估效果较差的问题。
进一步,本申请实施例还可以在上述消息包的元数据中记录每级缓存命中的原因,可以是采用预取技术1预取后命中、预取技术2预取后命中或者之前曾访问过该地址等。进一步地,还可以在元数据中记录每级缓存未命中的原因,例如:预取器未覆盖的首次访问、预取器覆盖但未及时取回、由于容量原因被换出缓存、由于冲突原因被换出缓存等。进而可以根据元数据对高速缓存以及优化算法进行更进一步的细化评估。
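上述在元数据中记录命中/未命中原因的做法,可以用如下示意性代码草图表示,其中的枚举名称与元数据结构均为说明用的假设命名:

```python
# 示意性代码:在消息包元数据中记录每级缓存命中/未命中的原因。
from enum import Enum, auto

class HitReason(Enum):
    PREFETCH_1 = auto()       # 采用预取技术1预取后命中
    PREFETCH_2 = auto()       # 采用预取技术2预取后命中
    PREVIOUS_ACCESS = auto()  # 之前曾访问过该地址

class MissReason(Enum):
    NOT_COVERED = auto()      # 预取器未覆盖的首次访问
    LATE_PREFETCH = auto()    # 预取器覆盖但未及时取回
    CAPACITY_EVICT = auto()   # 由于容量原因被换出缓存
    CONFLICT_EVICT = auto()   # 由于冲突原因被换出缓存

# 假设的消息包元数据:在二级缓存命中,一级缓存因预取未及时取回而未命中
packet_meta = {
    "level_param": 1,
    "reasons": [MissReason.LATE_PREFETCH, HitReason.PREFETCH_1],
}
print(packet_meta["reasons"][1].name)  # PREFETCH_1
```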
进一步地,上述第二访存统计信息可以包括各访存请求在命中优化后的高速缓存时,第一优化算法的优化状态。其中,优化状态指的是优化算法的优化参数,不同优化算法的优化参数不同。示例性地,在第一优化算法为替换算法的情况下,上述优化状态可以是最近最少使用(Least Recently Used,LRU)替换算法的最近最少使用距离,或者是重引用间隔预测(Re-Reference Interval Prediction,RRIP)替换算法所预测的重引用间隔等。具体的,上述优化状态可以通过对优化算法的优化参数的当前值进行读取得到。
本申请实施例通过采用第一优化算法对所述待评估的高速缓存进行优化,并基于优化后的高速缓存重新执行所述获取所述测试程序的访存统计信息的操作,得到第二访存统计信息,以及,将优化前的高速缓存对应的访存统计信息作为第一访存统计信息;基于所述第一访存统计信息、所述第二访存统计信息以及各访存请求的访存模式,对所述第一优化算法的优化效果进行评估。这样,可以实现对第一优化算法的有效评估。
可选地,本申请实施例具体还可以包括:
S51、采用第二优化算法对所述优化后的高速缓存进行优化,并重新执行所述获取所述测试程序的访存统计信息的操作,得到第三访存统计信息;所述第三访存统计信息包括各所述访存请求在命中优化后的高速缓存时,所述第一优化算法的优化状态以及第二优化算法的优化状态。
S52、基于所述第二访存统计信息以及所述第三访存统计信息,对所述第一优化算法以及所述第二优化算法的优化效果进行评估。
其中,上述第二优化算法指的也是对高速缓存的优化技术,可以是与第一优化算法不同的任一预取器、预取技术、预取算法或替换策略等,可以按照实际需求选择上述第二优化算法,本申请实施例对此不作限制。
具体的,由于在一些情况下,一个高速缓存系统可能同时采用两种或两种以上的优化技术,此时不同优化技术的效果可能互相叠加,也可能互相抵消导致缓存效果更差。示例性地,一种硬件预取技术和一种替换策略在单独与基线对比时,可能均能够对计算机系统性能作出有效的优化;但是当两种方法同时使用时,硬件预取可能增加访存流量,并使访问时机早于正常读取数据,不符合替换算法设计时的假设,导致同时使用二者时的效果不如任一方法单独使用时。
在此基础上,本申请实施例在采用第一优化算法对高速缓存进行优化之后,还采用第二优化算法对高速缓存进行进一步优化,此时该高速缓存同时应用了第一优化算法以及第二优化算法。本申请实施例针对同时应用第一优化算法以及第二优化算法的优化后的高速缓存,可以采用测试程序进行多次访问操作,获取第三访存统计信息,第三访存统计信息中可以包括各个访存请求在命中高速缓存时,第一优化算法的优化状态以及第二优化算法的优化状态。
进一步地,本申请实施例可以通过第二访存统计信息以及第三访存统计信息,对第一优化算法以及第二优化算法整体的优化效果进行评估。具体可以通过第二访存统计信息中的优化状态,与第三访存统计信息中第一优化算法的优化状态进行比较,若第三访存统计信息中第一优化算法的优化状态变差,则可以得到同时应用第一优化算法以及第二优化算法的优化效果较差。
示例性地,以第一优化算法为替换算法,第二优化算法为硬件预取技术为例,如加入硬件预取技术前,某访存请求由替换算法预测的重引用间隔(优化状态)都较长,但是加入硬件预取后,该访存指令被预测的重引用间隔变短,由此则可以评估出替换算法受预取影响导致两者的综合优化效果不佳。
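上述通过比较优化状态(预测的重引用间隔)判断两种优化技术互相干扰的逻辑,可以用如下示意性代码草图表示,其中的间隔数值与判定比例均为假设数据:

```python
# 示意性代码:比较加入硬件预取前后,替换算法为某访存请求预测的重引用间隔。
def interference_detected(before_intervals, after_intervals, shrink_ratio=0.5):
    """加入第二优化算法后,预测的重引用间隔明显变短时,判定两者综合效果不佳。"""
    avg = lambda xs: sum(xs) / len(xs)
    return avg(after_intervals) < avg(before_intervals) * shrink_ratio

before = [7, 7, 6, 7]  # 第二访存统计信息中记录的预测重引用间隔(假设)
after = [2, 1, 2, 3]   # 第三访存统计信息中记录的预测重引用间隔(假设)
print(interference_detected(before, after))  # True:替换算法受预取影响
```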
综上,本申请实施例提供了一种高速缓存性能评估方法,通过响应于测试程序中的各访存请求的多次访存操作,获取所述测试程序的访存统计信息;所述访存统计信息至少包括各所述访存请求在待评估的高速缓存中的命中次数;基于所述访存统计信息以及各所述访存请求的访存模式,对所述待评估的高速缓存的性能进行评估。这样,可以通过获取访存请求在高速缓存中的命中次数,可以实现对高速缓存的性能评估。同时,通过访存统计信息以及各访存请求的访存模式,可以从不同访存模式的维度对高速缓存的性能进行评估,实现多维度的评估,提高高速缓存性能评估的准确性。
进一步地,本申请实施例基于命中次数以及访存请求的访存模式对高速缓存的性能进行评估,相较于采用高速缓存缺失率或者每个时钟周期执行的平均指令数(Instruction Per Cycle,IPC)等指标进行评估的方式,高速缓存缺失率只能反映高速缓存收到的访存请求中未命中的比例,其不能对缺失的时机进行定位,相同缺失率的程序片段可能由于缺失的时机不同而对IPC的影响不同,进而也无法确定对高速缓存的性能影响较大的程序片段。而IPC则对分支预测等因素较敏感,不能直接反映出高速缓存系统对内存访问的覆盖程度。而本申请实施例通过各访存请求的命中次数以及各访存请求的访存模式,可以从不同访存模式的维度对高速缓存的性能进行评估,实现多维度的评估,可以确定出对高速缓存的性能影响较大的访存模式,提高高速缓存性能评估的准确性和可解释性。
进一步地,本申请实施例还可以对高速缓存系统的设计优化提供依据,还可以对优化算法的改进提供依据。本申请实施例可以辅助硬件缓存系统的设计,可以在模拟器或仿真环境下也可以便捷地使用。
需要说明的是,对于方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请实施例并不受所描述的动作顺序的限制,因为依据本申请实施例,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作并不一定是本申请实施例所必须的。
装置实施例
参照图6,示出了本申请的一种高速缓存性能评估装置实施例的结构框图,所述装置20具体可以包括:
获取模块201,用于响应于测试程序中的各访存请求的多次访存操作,获取所述测试程序的访存统计信息;所述访存统计信息至少包括各所述访存请求在待评估的高速缓存中的命中次数;
第一评估模块202,用于基于所述访存统计信息以及各所述访存请求的访存模式,对所述待评估的高速缓存的性能进行评估。
可选地,所述待评估的高速缓存的存储层次包含至少两个层级;所述获取模块201具体用于:
响应于测试程序中的各访存请求的多次访存操作,获取各所述访存请求在各个层级的高速缓存中的命中次数,作为所述访存统计信息。
可选地,所述获取模块201,包括:
设置子模块,用于为各所述访存请求设置多个命中计数器;不同命中计数器对应不同层级的高速缓存;
封装子模块,用于对于各个所述访存请求的任一次访存操作,将所述访存请求封装为请求包,并在所述请求包中设置层级参数;
参数子模块,用于在所述请求包从目标层级返回的过程中,每经过一个层级,则将所述请求包中的层级参数加1;所述目标层级为所述请求包命中的高速缓存所在的层级;
确定子模块,用于基于所返回的请求包中的层级参数的数值确定所述访存请求的目标层级,并从所述访存请求所对应的命中计数器中,将所述目标层级对应的命中计数器加1;
次数获取子模块,用于在满足评估条件的情况下,基于各访存请求对应的各个命中计数器的当前数值,获取各所述访存请求在各个层级的高速缓存中的命中次数。
可选地,所述访存统计信息中还包括各所述访存请求的索引值;所述装置还包括:
输出模块,用于基于各所述访存请求的索引值,将各索引值对应的源代码输出至信息显示界面;
接收模块,用于接收用户基于所述信息显示界面为各索引值对应的源代码所输入的模式信息,并将所述模式信息确定为各索引值对应的访存请求的访存模式。
可选地,所述第一评估模块,包括:
参照获取子模块,用于针对各个所述访存请求,获取所述访存请求的访存模式所对应的参照命中次数;
评估子模块,用于基于各所述访存请求的命中次数以及参照命中次数,对所述待评估的高速缓存的性能进行评估。
可选地,所述装置还包括:
第一优化模块,用于采用第一优化算法对所述待评估的高速缓存进行优化,并基于优化后的高速缓存重新执行所述获取所述测试程序的访存统计信息的操作,得到第二访存统计信息,以及,将优化前的高速缓存对应的访存统计信息作为第一访存统计信息;所述第二访存统计信息包括各所述访存请求在优化后的高速缓存中的命中次数以及各所述访存请求在命中所述优化后的高速缓存时,所述第一优化算法的优化状态;
第二评估模块,用于基于所述第一访存统计信息、所述第二访存统计信息以及各访存请求的访存模式,对所述第一优化算法的优化效果进行评估。
可选地,所述装置还包括:
第二优化模块,用于采用第二优化算法对所述优化后的高速缓存进行优化,并重新执行所述获取所述测试程序的访存统计信息的操作,得到第三访存统计信息;所述第三访存统计信息包括各所述访存请求在命中优化后的高速缓存时,所述第一优化算法的优化状态以及第二优化算法的优化状态;
第三评估模块,用于基于所述第二访存统计信息以及所述第三访存统计信息,对所述第一优化算法以及所述第二优化算法的优化效果进行评估。
综上,本申请实施例提供了一种高速缓存性能评估装置,通过响应于测试程序中的各访存请求的多次访存操作,获取所述测试程序的访存统计信息;所述访存统计信息至少包括各所述访存请求在待评估的高速缓存中的命中次数;基于所述访存统计信息以及各所述访存请求的访存模式,对所述待评估的高速缓存的性能进行评估。这样,可以通过获取访存请求在高速缓存中的命中次数,可以实现对高速缓存的性能评估。同时,通过访存统计信息以及各访存请求的访存模式,可以从不同访存模式的维度对高速缓存的性能进行评估,实现多维度的评估,提高高速缓存性能评估的准确性。
对于系统实施例而言,由于其与方法实施例基本相似,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。
本说明书中的各个实施例均采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间相同相似的部分互相参见即可。
关于上述实施例中的高速缓存性能评估装置,其中各个模块执行操作的具体方式已经在有关该方法的实施例中进行了详细描述,此处将不做详细阐述说明。
本申请实施例还提供了一种电子设备,包括:处理器、用于存储处理器可执行指令的存储器,其中,处理器被配置为执行上述高速缓存性能评估方法。
参照图7,是本申请实施例提供的电子设备的结构示意图。如图7所示,所述电子设备包括:处理器、存储器、通信接口和通信总线,所述处理器、所述存储器和所述通信接口通过所述通信总线完成相互间的通信;所述存储器用于存放至少一可执行指令,所述可执行指令使所述处理器执行前述实施例的高速缓存性能评估方法。
需要说明的是,本申请实施例中的电子设备包括移动电子设备和非移动电子设备。
所述处理器可以是CPU(Central Processing Unit,中央处理器)、通用处理器、DSP(Digital Signal Processor,数字信号处理器)、ASIC(Application Specific Integrated Circuit,专用集成电路)、FPGA(Field Programmable Gate Array,现场可编程门阵列)或者其他可编程逻辑器件、晶体管逻辑器件、硬件部件或者其任意组合。所述处理器也可以是实现计算功能的组合,例如包含一个或多个微处理器的组合、DSP和微处理器的组合等。
所述通信总线可包括一通路,在存储器和通信接口之间传送信息。通信总线可以是PCI(Peripheral Component Interconnect,外设部件互连标准)总线或EISA(Extended Industry Standard Architecture,扩展工业标准结构)总线等。所述通信总线可以分为地址总线、数据总线、控制总线等。为便于表示,图7中仅用一条线表示,但并不表示仅有一根总线或一种类型的总线。
所述存储器可以是ROM(Read Only Memory,只读存储器)或可存储静态信息和指令的其他类型的静态存储设备、RAM(Random Access Memory,随机存取存储器)或者可存储信息和指令的其他类型的动态存储设备,也可以是EEPROM(Electrically Erasable Programmable Read Only Memory,电可擦可编程只读存储器)、CD-ROM(Compact Disc Read Only Memory,只读光盘)、磁带、软盘和光数据存储设备等。
本申请实施例还提供了一种非临时性计算机可读存储介质,当所述存储介质中的指令由电子设备(服务器或者终端)的处理器执行时,使得处理器能够执行图1所示的高速缓存性能评估方法。
本申请实施例还提供了一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机执行图1所示的高速缓存性能评估方法。
本申请实施例还提供了一种芯片,所述芯片包括处理器和通信接口,所述通信接口和所述处理器耦合,所述处理器用于运行程序或指令,实现上述高速缓存性能评估方法实施例的各个过程,且能达到相同的技术效果,为避免重复,这里不再赘述。
应理解,本申请实施例提到的芯片还可以称为系统级芯片、系统芯片、芯片系统或片上系统芯片等。
本说明书中的各个实施例均采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间相同相似的部分互相参见即可。
本领域内的技术人员应明白,本申请实施例的实施例可提供为方法、装置、或计算机程序产品。因此,本申请实施例可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘Solid State Disk(SSD))等。
本申请实施例是参照根据本申请实施例的方法、终端设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理终端设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理终端设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理终端设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。
这些计算机程序指令也可装载到计算机或其他可编程数据处理终端设备上,使得在计算机或其他可编程终端设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程终端设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。
尽管已描述了本申请实施例的优选实施例,但本领域内的技术人员一旦得知了基本创造性概念,则可对这些实施例做出另外的变更和修改。所以,所附权利要求意欲解释为包括优选实施例以及落入本申请实施例范围的所有变更和修改。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本公开的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的实施例中,应该理解到,所揭露的装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本公开各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
可以理解的是,本公开实施例描述的这些实施例可以用硬件、软件、固件、中间件、微码或其组合来实现。对于硬件实现,模块、单元、子单元可以实现在一个或多个专用集成电路(Application Specific Integrated Circuits,ASIC)、数字信号处理器(Digital Signal Processor,DSP)、数字信号处理设备(DSP Device,DSPD)、可编程逻辑设备(Programmable Logic Device,PLD)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)、通用处理器、控制器、微控制器、微处理器、用于执行本公开所述功能的其它电子单元或其组合中。
对于软件实现,可通过执行本公开实施例所述功能的模块(例如过程、函数等)来实现本公开实施例所述的技术。软件代码可存储在存储器中并通过处理器执行。存储器可以在处理器中或在处理器外部实现。
本说明书中的各个实施例均采用相关的方式描述,各个实施例之间相同相似的部分互相参见即可,每个实施例重点说明的都是与其他实施例的不同之处。尤其,对于系统实施例而言,由于其基本相似于方法实施例,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。
需要说明的是,本申请实施例中获取各种数据相关过程,都是在遵照所在地国家相应的数据保护法规政策的前提下,并获得由相应装置所有者给予授权的情况下进行的。
最后,还需要说明的是,在本文中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者终端设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者终端设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者终端设备中还存在另外的相同要素。

Claims (20)

  1. 一种高速缓存性能评估方法,其中,所述方法包括:
    响应于测试程序中的各访存请求的多次访存操作,获取所述测试程序的访存统计信息;所述访存统计信息至少包括各所述访存请求在待评估的高速缓存中的命中次数;
    基于所述访存统计信息以及各所述访存请求的访存模式,对所述待评估的高速缓存的性能进行评估。
  2. 根据权利要求1所述的方法,其中,所述待评估的高速缓存的存储层次包含至少两个层级;所述响应于测试程序中的各访存请求的多次访存操作,获取所述测试程序的访存统计信息,包括:
    响应于测试程序中的各访存请求的多次访存操作,获取各所述访存请求在各个层级的高速缓存中的命中次数,作为所述访存统计信息。
  3. 根据权利要求2所述的方法,其中,所述获取各所述访存请求在各个层级的高速缓存中的命中次数,包括:
    为各所述访存请求设置多个命中计数器;不同命中计数器对应不同层级的高速缓存;
    对于各个所述访存请求的任一次访存操作,将所述访存请求封装为请求包,并在所述请求包中设置层级参数;
    在所述请求包从目标层级返回的过程中,每经过一个层级,则将所述请求包中的层级参数加1;所述目标层级为所述请求包命中的高速缓存所在的层级;
    基于所返回的请求包中的层级参数的数值确定所述访存请求的目标层级,并从所述访存请求所对应的命中计数器中,将所述目标层级对应的命中计数器加1;
    在满足评估条件的情况下,基于各访存请求对应的各个命中计数器的当前数值,获取各所述访存请求在各个层级的高速缓存中的命中次数。
  4. 根据权利要求1所述的方法,其中,所述访存统计信息中还包括各所述访存请求的索引值;所述获取所述测试程序的访存统计信息之后,所述方法还包括:
    基于各所述访存请求的索引值,将各索引值对应的源代码输出至信息显示界面;
    接收用户基于所述信息显示界面为各索引值对应的源代码所输入的模式信息,并将所述模式信息确定为各索引值对应的访存请求的访存模式。
  5. 根据权利要求1所述的方法,其中,所述基于所述访存统计信息以及各所述访存请求的访存模式,对所述待评估的高速缓存的性能进行评估,包括:
    针对各个所述访存请求,获取所述访存请求的访存模式所对应的参照命中次数;
    基于各所述访存请求的命中次数以及参照命中次数,对所述待评估的高速缓存的性能进行评估。
  6. 根据权利要求1-5任一项所述的方法,其中,所述方法还包括:
    采用第一优化算法对所述待评估的高速缓存进行优化,并基于优化后的高速缓存重新执行所述获取所述测试程序的访存统计信息的操作,得到第二访存统计信息,以及,将优化前的高速缓存对应的访存统计信息作为第一访存统计信息;所述第二访存统计信息包括各所述访存请求在优化后的高速缓存中的命中次数以及各所述访存请求在命中所述优化后的高速缓存时,所述第一优化算法的优化状态;
    基于所述第一访存统计信息、所述第二访存统计信息以及各访存请求的访存模式,对所述第一优化算法的优化效果进行评估。
  7. 根据权利要求6所述的方法,其中,所述方法还包括:
    采用第二优化算法对所述优化后的高速缓存进行优化,并重新执行所述获取所述测试程序的访存统计信息的操作,得到第三访存统计信息;所述第三访存统计信息包括各所述访存请求在命中优化后的高速缓存时,所述第一优化算法的优化状态以及第二优化算法的优化状态;
    基于所述第二访存统计信息以及所述第三访存统计信息,对所述第一优化算法以及所述第二优化算法的优化效果进行评估。
  8. 根据权利要求1所述的方法,其中,所述方法还包括:
    根据预设的性能等级,对所述待评估的高速缓存针对不同访存模式的性能进行划分。
  9. 根据权利要求1-5任一项所述的方法,其中,所述高速缓存包括:具有缓存功能的缓存系统。
  10. 一种高速缓存性能评估装置,其中,所述装置包括:
    获取模块,用于响应于测试程序中的各访存请求的多次访存操作,获取所述测试程序的访存统计信息;所述访存统计信息至少包括各所述访存请求在待评估的高速缓存中的命中次数;
    第一评估模块,用于基于所述访存统计信息以及各所述访存请求的访存模式,对所述待评估的高速缓存的性能进行评估。
  11. 根据权利要求10所述的装置,其中,所述待评估的高速缓存的存储层次包含至少两个层级;所述获取模块用于:
    响应于测试程序中的各访存请求的多次访存操作,获取各所述访存请求在各个层级的高速缓存中的命中次数,作为所述访存统计信息。
  12. 根据权利要求10所述的装置,其中,所述获取模块,包括:
    设置子模块,用于为各所述访存请求设置多个命中计数器;不同命中计数器对应不同层级的高速缓存;
    封装子模块,用于对于各个所述访存请求的任一次访存操作,将所述访存请求封装为请求包,并在所述请求包中设置层级参数;
    参数子模块,用于在所述请求包从目标层级返回的过程中,每经过一个层级,则将所述请求包中的层级参数加1;所述目标层级为所述请求包命中的高速缓存所在的层级;
    确定子模块,用于基于所返回的请求包中的层级参数的数值确定所述访存请求的目标层级,并从所述访存请求所对应的命中计数器中,将所述目标层级对应的命中计数器加1;
    次数获取子模块,用于在满足评估条件的情况下,基于各访存请求对应的各个命中计数器的当前数值,获取各所述访存请求在各个层级的高速缓存中的命中次数。
  13. 根据权利要求10所述的装置,其中,所述访存统计信息中还包括各所述访存请求的索引值;所述装置还包括:
    输出模块,用于基于各所述访存请求的索引值,将各索引值对应的源代码输出至信息显示界面;
    接收模块,用于接收用户基于所述信息显示界面为各索引值对应的源代码所输入的模式信息,并将所述模式信息确定为各索引值对应的访存请求的访存模式。
  14. 根据权利要求10所述的装置,其中,所述第一评估模块,包括:
    参照获取子模块,用于针对各个所述访存请求,获取所述访存请求的访存模式所对应的参照命中次数;
    评估子模块,用于基于各所述访存请求的命中次数以及参照命中次数,对所述待评估的高速缓存的性能进行评估。
  15. 根据权利要求10至14任一项所述的装置,其中,所述装置还包括:
    第一优化模块,用于采用第一优化算法对所述待评估的高速缓存进行优化,并基于优化后的高速缓存重新执行所述获取所述测试程序的访存统计信息的操作,得到第二访存统计信息,以及,将优化前的高速缓存对应的访存统计信息作为第一访存统计信息;所述第二访存统计信息包括各所述访存请求在优化后的高速缓存中的命中次数以及各所述访存请求在命中所述优化后的高速缓存时,所述第一优化算法的优化状态;
    第二评估模块,用于基于所述第一访存统计信息、所述第二访存统计信息以及各访存请求的访存模式,对所述第一优化算法的优化效果进行评估。
  16. 根据权利要求15所述的装置,其中,所述装置还包括:
    第二优化模块,用于采用第二优化算法对所述优化后的高速缓存进行优化,并重新执行所述获取所述测试程序的访存统计信息的操作,得到第三访存统计信息;所述第三访存统计信息包括各所述访存请求在命中优化后的高速缓存时,所述第一优化算法的优化状态以及第二优化算法的优化状态;
    第三评估模块,用于基于所述第二访存统计信息以及所述第三访存统计信息,对所述第一优化算法以及所述第二优化算法的优化效果进行评估。
  17. 一种电子设备,其中,所述电子设备包括处理器、存储器、通信接口和通信总线,所述处理器、所述存储器和所述通信接口通过所述通信总线完成相互间的通信;所述存储器用于存放可执行指令,所述可执行指令使所述处理器执行如权利要求1至9中任一项所述的高速缓存性能评估方法。
  18. 一种可读存储介质,其中,当所述可读存储介质中的指令由电子设备的处理器执行时,使得所述处理器能够执行如权利要求1至9中任一项所述的高速缓存性能评估方法。
  19. 一种芯片,其中,所述芯片包括处理器和通信接口,所述通信接口和所述处理器耦合,所述处理器用于运行程序或指令,实现如权利要求1至9中任一项所述的高速缓存性能评估方法。
  20. 一种装置/设备,其中,所述装置/设备被配置成用于执行如权利要求1至9中任一项所述的高速缓存性能评估方法。
PCT/CN2025/095503 2024-05-21 2025-05-16 高速缓存性能评估方法、装置、电子设备及可读存储介质 Pending WO2025242009A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202410634793.6A CN118227446B (zh) 2024-05-21 2024-05-21 高速缓存性能评估方法、装置、电子设备及可读存储介质
CN202410634793.6 2024-05-21

Publications (1)

Publication Number Publication Date
WO2025242009A1 true WO2025242009A1 (zh) 2025-11-27


Country Status (2)

Country Link
CN (1) CN118227446B (zh)
WO (1) WO2025242009A1 (zh)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050268037A1 (en) * 2004-05-27 2005-12-01 International Business Machines Corporation Cache hit ratio estimating apparatus, cache hit ratio estimating method, program, and recording medium
US20090222625A1 (en) * 2005-09-13 2009-09-03 Mrinmoy Ghosh Cache miss detection in a data processing apparatus
CN114830101A (zh) * 2019-12-16 2022-07-29 超威半导体公司 基于访问类型优先级的高速缓存管理
CN115185803A (zh) * 2022-07-26 2022-10-14 Oppo广东移动通信有限公司 评测存储系统性能的方法、装置及电子设备
WO2023130316A1 (zh) * 2022-01-06 2023-07-13 中国科学院计算技术研究所 一种兼顾服务质量和利用率的缓存动态划分方法及系统
CN118227446A (zh) * 2024-05-21 2024-06-21 北京开源芯片研究院 高速缓存性能评估方法、装置、电子设备及可读存储介质



Also Published As

Publication number Publication date
CN118227446B (zh) 2024-08-02
CN118227446A (zh) 2024-06-21

