
CN120066987B - Data prefetcher, processor and device for graph computing - Google Patents

Data prefetcher, processor and device for graph computing

Info

Publication number
CN120066987B
CN120066987B (application CN202510527371.3A)
Authority
CN
China
Prior art keywords
data
graph
array
prefetch
offset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202510527371.3A
Other languages
Chinese (zh)
Other versions
CN120066987A (en)
Inventor
李晨
李宣佚
郭阳
鲁建壮
陈小文
袁珩洲
刘畅
汪志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202510527371.3A
Publication of CN120066987A
Application granted
Publication of CN120066987B
Active legal status (current)
Anticipated expiration legal status


Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 - Addressing or allocation; Relocation
    • G06F12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0862 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with prefetch
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 - Addressing or allocation; Relocation
    • G06F12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806 - Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0811 - Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/901 - Indexing; Data structures therefor; Storage structures
    • G06F16/9024 - Graphs; Linked lists
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098 - Register arrangements
    • G06F9/30101 - Special purpose registers
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 - Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3867 - Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract


The present invention discloses a data prefetcher, processor, and device for graph computing. The data prefetcher of the present invention includes a prefetch unit located between each processor core and its L2 cache. The prefetch unit includes: a configuration register set for storing data related to the graph computing data structure configured by software; a status register set for maintaining the intermediate index variables of the prefetch unit during the prefetch process; a control state machine for controlling the execution of the prefetch pipeline by reading the contents of the configuration register set and the status register set; a prefetch pipeline for executing each stage of the graph data access prefetch logic in a pipelined manner; and a data cache for storing data prefetched by the prefetch pipeline. The present invention aims to improve the timeliness and accuracy of prefetching for graph computing, thereby improving the efficiency of graph computing, alleviating the memory access bottleneck of graph computing, and improving the graph computing performance of the processor.

Description

Data prefetcher, processor and device for graph computing
Technical Field
The present invention relates to the field of processor technologies, and in particular, to a graph computation-oriented data prefetcher, a processor, and a device.
Background
Data prefetching is a technique that predicts memory access addresses before the data is requested, fetches the potentially needed data in advance, and stores it in a cache or temporary buffer, thereby hiding memory access latency and improving processor speed. Traditional data prefetching techniques predict future memory access addresses based on the memory access patterns of the application: they fetch data from memory or lower-level caches according to the predicted address, so that when a memory access request arrives the corresponding data is already resident at the corresponding cache level, reducing memory access latency. Depending on the implementation, data prefetching can be divided into hardware prefetching and software prefetching.
The most common hardware prefetcher is the stride prefetcher, which exploits fixed, repeatable address differences in an access sequence. Conventional data prefetchers such as GHB and AMPM prefetch by learning a stride history in the address stream, and can therefore only identify streaming memory access patterns with deterministic regularity. To further capture indirect access data streams, Yu et al. proposed IMP, a hardware mechanism supporting indirect data stream prefetching that first identifies the sequential data stream and, on that basis, recognizes the indirect access pattern. However, these prefetchers have difficulty capturing the four-level indirect access pattern D[C[j]] in graph computation, where i is in [0, n-1] and j is in [B[A[i]], B[A[i]+1]), and their implementation overhead is high because of the large amount of memory required to record the access history. Ainsworth proposed a graph-structure-aware data prefetcher, which monitors data requests in order to prefetch the data of subsequent nodes; however, it requires particular tuning of the prefetch timing, and because the prefetched data is placed in the L1 cache, the resource waste caused by cache replacement must also be considered.
Software prefetching is implemented by a programmer or compiler inserting prefetch instructions into the program; for example, Ainsworth proposed a compiler-based indirect-pattern prefetcher. Software prefetching can generally capture access patterns more accurately, but it has inherent limitations: it may not be portable across microarchitectures (e.g., ones with different cache hierarchies), and it incurs significant instruction overhead and increased power consumption, which can mask the benefit of the improved cache hit rate.
An ideal prefetching technique adds almost no extra overhead to memory accesses. In practice, however, prefetching is not always timely or accurate, and lagging or erroneous prefetches can hurt performance. A prefetching technique must therefore address three basic questions: what to prefetch (i.e., the address of the prefetched data), when to prefetch it, and where to place the prefetched data. Based on the data access pattern of graph computation, the present design can accurately prefetch all relevant data, and all prefetched data is stored in a dedicated prefetch cache, avoiding pollution of the ordinary cache.
Graphs in real-world scenarios are highly sparse: most node pairs are not connected by an edge, while a relatively small number of nodes in the graph structure are connected to a large number of others. To improve the storage efficiency of graph data, graph data is therefore usually stored in compressed sparse row (Compressed Sparse Row, CSR) format, which has high time and space utilization. As shown in FIG. 1, the CSR format uses four arrays to store the graph data: the offset array stores, for each node, the offset of its first outgoing edge within the edge array; the edge array stores the numbers (ids) of all outgoing-edge destination nodes in node order; the weight array stores the weights of the edges at the corresponding positions of the edge array; and the attribute (property) data of the nodes used by the graph computation (e.g., V0-V4) is stored in the attribute array.
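The CSR layout described above can be made concrete with a small example. The sketch below is illustrative only: the toy graph, the array contents, and the helper `out_edges` are assumptions, not data from the patent; only the four-array structure (offset, edge, weight, attribute) follows the text.

```python
# Toy CSR representation of a 5-node graph (values are assumptions for
# illustration; only the four-array structure follows the description).
offsetArray = [0, 2, 3, 5, 6, 6]    # offsetArray[i]: index of node i's first out-edge
edgeArray   = [1, 2, 3, 0, 4, 2]    # destination node ids, grouped by source node
weightArray = [5, 1, 2, 7, 3, 4]    # weight of the edge at the same edgeArray position
dataArray   = [10, 20, 30, 40, 50]  # per-node attribute values (V0..V4)

def out_edges(i):
    """Return all (destination, weight) pairs of node i's outgoing edges."""
    start, end = offsetArray[i], offsetArray[i + 1]
    return [(edgeArray[j], weightArray[j]) for j in range(start, end)]
```

For instance, node 0 has two outgoing edges (to nodes 1 and 2), while node 4 has none, which the format encodes as two equal consecutive offsets.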
When an edge's destination vertex is updated, the update needs to access several pieces of data, including the destination vertex id, the edge weight, and the vertex attribute, while the computation itself is usually only a reduction operation with a small amount of computation. When graph computation runs on a general-purpose processor platform it is therefore memory-access intensive and presents three challenges. 1) Weak locality: because the graph data structure is complex and irregular, the program's memory access behavior exhibits low locality. Although real-world graph data has been found to contain many redundant structures, current graph computing applications still have difficulty accurately predicting the locations of those structures at runtime, which limits optimization of data access locality. 2) High concurrency: the large volume of data accesses generated by traversing vertex or edge data, combined with simple reduction computations, causes the application to issue a large number of memory requests in a short time, and under such high concurrency the processor cannot obtain the required data in time. 3) Fine granularity: access and compute operations in graph computing applications typically involve only certain attributes of vertices or edges, which occupy little space, so the memory access behavior of the graph computing application exhibits fine-grained characteristics.
Conventional von Neumann general-purpose processor architectures use unified off-chip memory resources to store instructions and data. For memory-intensive applications such as graph computing, the large-scale, highly concurrent access requests mean that a general-purpose processor based on the von Neumann architecture obtains the data needed by its operation instructions only after consuming many idle wait cycles, so the instruction pipeline cannot be filled with enough compute instructions and instruction-level parallelism is not fully exploited.
Pipeline stalls are a direct cause of processor performance degradation; the main hazards that stall an out-of-order superscalar processor are structural, data, and control hazards. The data access bottleneck can be analyzed through the fraction of pipeline stall cycles caused by data dependences. FIG. 2 shows the stall-cycle fraction of memory access instructions measured by running the single-source shortest path (SSSP) algorithm on different data sets. As can be seen from FIG. 2, the average stall-cycle fraction of load instructions is 38.02%, while that of store instructions is only 19.77%, mainly because load instructions lie on the critical execution path and the subsequent compute instructions all depend on their results; optimizing load instructions therefore has a greater effect on overall performance. In practice, however, prefetching is not always timely or accurate, so how a graph-computation-oriented data prefetcher can alleviate the memory access bottleneck of graph computation and improve the timeliness and accuracy of prefetching to improve the efficiency of graph computation has become a key technical problem to be solved.
Disclosure of Invention
Aiming at the problems in the prior art, the present invention provides a data prefetcher, processor, and device for graph computing, with the aim of improving the timeliness and accuracy of prefetching for graph computation, thereby improving the efficiency of graph computation, alleviating its memory access bottleneck, and improving the graph computing performance of the processor.
In order to solve the technical problems, the invention adopts the following technical scheme:
A graph-computation-oriented data prefetcher, comprising a prefetch unit located between each processor core and its second-level cache L2, the prefetch unit comprising:
a configuration register set for storing software-configured data related to the graph computation data structure, including the work queue and the array addresses related to the compressed sparse row format;
a status register set for maintaining the intermediate index variables of the prefetch unit during the prefetch process and providing them to the control state machine to generate control signals, specifically including the work queue index, the out-edge offset index, and the data-cache-full flag;
a control state machine for controlling execution of the prefetch pipeline by reading the contents of the configuration register set and the status register set, including suspending and resuming the prefetch pipeline;
a prefetch pipeline for executing each stage of the graph data access prefetch logic in a pipelined manner, where the work of each stage includes calculating a prefetch address, sending a prefetch request, and receiving prefetched data;
and a data cache for storing the data prefetched by the prefetch pipeline; the data cache maintains prefetched data at edge granularity, each entry contains all the information related to one edge, and an edge's information is written into the cache at the last stage of the pipeline.
Optionally, the stages of executing the graph data access prefetch logic in a pipelined manner include: stage 1, accessing the work queue LISTARRAY to prefetch the pending node n; stage 2, accessing the attribute array DATAARRAY and the offset array offsetArray to prefetch the attribute and out-edge offsets of the source node; stage 3, accessing the edge array EDGEARRAY and the weight array WEIGHTARRAY to prefetch the edge nodes and weights; and stage 4, accessing the attribute array DATAARRAY to prefetch the attribute of the destination node.
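The four stages above amount to a chain of dependent array lookups per pending node. A minimal software model, under assumed array names and a toy graph (the function shape and the data are illustrative, not the patented hardware), might look like:

```python
def prefetch_node(idx, listArray, offsetArray, edgeArray, weightArray, dataArray):
    """Model the four pipeline stages for the idx-th work-queue entry."""
    entries = []
    n = listArray[idx]                                # stage 1: pending node n
    srcData = dataArray[n]                            # stage 2: source attribute...
    start, end = offsetArray[n], offsetArray[n + 1]   # ...and its out-edge offsets
    for j in range(start, end):
        dst = edgeArray[j]                            # stage 3: edge node id...
        w = weightArray[j]                            # ...and edge weight
        dstData = dataArray[dst]                      # stage 4: destination attribute
        entries.append((n, dst, srcData, dstData, w))
    return entries
```

Each resulting tuple carries exactly the five fields the data cache is described as storing: source node ID, destination node ID, source node attribute, destination node attribute, and out-edge weight.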
Optionally, the work queue and compressed-sparse-row-related array addresses saved in the configuration register set include the work queue starting address, the number of work queue elements, the offset array starting address, the edge array starting address, the edge weight array starting address, and the attribute array starting address.
Optionally, the intermediate index variables in the status register set include a column index idx, a current column index current_idx, a current row start offset start, a current row end offset end, and a data-cache-full flag full. The data-cache-full flag serves as the condition for the control state machine to suspend and resume the prefetch pipeline: if the flag full is true, the prefetch pipeline is suspended; otherwise, the prefetch pipeline is resumed.
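How the full flag gates the pipeline can be sketched as a tiny state-transition function. Everything below (the state names, the dictionary-based registers) is an assumption for illustration; only the pause-on-full / resume-on-space behavior and the register names follow the text.

```python
def next_state(state, regs, len_queue, cache_len, capacity):
    """Return the control state machine's next state from the status registers."""
    regs["full"] = cache_len >= capacity        # recompute the data-cache-full flag
    if regs["full"]:
        return "PAUSED"                         # full flag suspends the pipeline
    if state == "PAUSED":
        return "RUN"                            # space freed up: resume prefetching
    if regs["current_idx"] < regs["end"]:
        return "RUN"                            # keep streaming this row's edges
    return "RUN" if regs["idx"] < len_queue else "DONE"  # next node, or finish
```

In hardware the same decision would be combinational logic reading the status registers each cycle; the function form just makes the transitions explicit.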
Optionally, the field of each item of data in the data cache includes a source node ID, a destination node ID, a source node attribute, a destination node attribute, and an out-edge weight.
Optionally, the data prefetcher provides a configuration interface function config (addr, data), a fetch interface function fetch_edge () and a buffer empty state judgment interface function empty () for use by the software, wherein addr is an address of the data prefetcher, data is configuration data, and a manner in which the software uses the data prefetcher includes:
S1, configuring the data prefetcher by software through a call to the configuration interface function config(addr, data), wherein the configuration information comprises the starting addresses and sizes of the offset array offsetArray, the edge array EDGEARRAY, the weight array WEIGHTARRAY, and the attribute array DATAARRAY; the offset array offsetArray records the starting and ending offsets of the outgoing edges, the edge array EDGEARRAY records the ids of the outgoing-edge nodes, the weight array WEIGHTARRAY records the outgoing-edge weights, and the attribute array DATAARRAY records the state data;
S2, configuring a starting address of a work queue LISTARRAY and the number len of the work queue elements through software, enabling a data prefetcher to start prefetching graph data related to all nodes to be processed in the work queue LISTARRAY, wherein the graph data comprises a source node ID, a destination node ID, a source node attribute, a destination node attribute and an out-edge weight, and storing the related graph data into a data cache;
S3, obtaining the data in the data cache by software through a call to the data access interface function fetch_edge();
S4, executing corresponding graph calculation by software by using the prefetched data, and updating the edge node attribute;
S5, judging by software, through a call to the buffer empty state judgment interface function empty(), whether data to be processed still exists in the data buffer of the data prefetcher; if so, continuing to execute step S3; otherwise, ending and exiting.
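The S1-S5 flow can be exercised end-to-end against a software mock of the prefetcher. The `MockPrefetcher` below is purely illustrative: in the mock, `addr` stands in for a register name rather than a hardware address, and the SSSP-style relaxation in the loop is just one possible example of step S4; only the interface functions config/fetch_edge/empty and the five-field entries follow the text.

```python
class MockPrefetcher:
    """Software stand-in for the prefetch unit (illustrative only)."""
    def __init__(self):
        self.buf = []                      # models the data cache

    def config(self, addr, data):          # S1: configure one array "register"
        setattr(self, addr, data)

    def start(self, listArray):            # S2: prefetch for all pending nodes
        for n in listArray:
            s, e = self.offsetArray[n], self.offsetArray[n + 1]
            for j in range(s, e):
                self.buf.append((n, self.edgeArray[j], self.dataArray[n],
                                 self.dataArray[self.edgeArray[j]],
                                 self.weightArray[j]))

    def fetch_edge(self):                  # S3: hand one prefetched entry to software
        return self.buf.pop(0)

    def empty(self):                       # S5: nothing left to process?
        return not self.buf

# Example run: SSSP-style relaxation of node 0's outgoing edges.
pf = MockPrefetcher()
pf.config("offsetArray", [0, 2, 3, 5, 6, 6])
pf.config("edgeArray",   [1, 2, 3, 0, 4, 2])
pf.config("weightArray", [5, 1, 2, 7, 3, 4])
pf.config("dataArray",   [0, 99, 99, 99, 99])   # distances; source distance is 0
pf.start([0])                                   # S2: node 0 is the only pending node
while not pf.empty():                           # S5
    src, dst, sd, dd, w = pf.fetch_edge()       # S3
    pf.dataArray[dst] = min(dd, sd + w)         # S4: relax the edge
```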
Optionally, the data prefetcher in step S2 starts prefetching graph data related to all pending nodes in the work queue LISTARRAY, including:
S2.1, acquiring the input starting address of the work queue LISTARRAY and the number len of the work queue elements, initializing the column index idx to be 0;
S2.2, judging whether the column index idx is smaller than the number len of the work queue elements; if the column index idx is not smaller than the number len of the work queue elements, ending; otherwise, jumping to step S2.3;
S2.3, taking out the index number i of the node to be processed corresponding to the column index idx from the work queue LISTARRAY, incrementing the column index idx by 1, and proceeding to step S2.4 (steps S2.3 to S2.9 are repeated until the numbers of all the nodes to be processed have been generated);
S2.4, pre-obtaining node attributes srcData of the nodes to be processed according to the starting address of the attribute array DATAARRAY and the index number i of the nodes to be processed, and taking out the offset corresponding to the index number i of the nodes to be processed from the offset array offsetArray as a current line starting offset start, and taking the offset corresponding to the index number i+1 of the next node as a current line ending offset end;
S2.5, assigning the current row start offset start to the current column index current_idx;
S2.6, judging whether the current column index current_idx is smaller than the end offset end of the current line or not, if not, jumping to the step S2.2, otherwise, jumping to the next step;
S2.7, judging whether the data cache is full; if so, repeating step S2.7 until space is available; otherwise, proceeding to the next step;
S2.8, taking out an element corresponding to the current column index current_idx from the edge array EDGEARRAY as an edge node corresponding to the obtained node i to be processed, and taking out an element corresponding to the current column index current_idx from the weight array WEIGHTARRAY as an obtained edge weight;
S2.9, reading the attribute data of the outgoing-edge node corresponding to the node i to be processed from the attribute array DATAARRAY according to the node id, incrementing the current column index current_idx by 1, and jumping to step S2.6.
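Steps S2.1 to S2.9 translate almost line-for-line into a software loop. The sketch below is an assumption-laden rendering (in real hardware S2.7 stalls the pipeline until the consumer drains the cache; this model simply stops filling once the capacity is hit), with register names taken from the text:

```python
def prefetch_worklist(listArray, offsetArray, edgeArray, weightArray,
                      dataArray, cache, capacity):
    """Fill `cache` with (src, dst, srcData, dstData, weight) entries."""
    idx = 0                                              # S2.1
    lenq = len(listArray)
    while idx < lenq:                                    # S2.2
        i = listArray[idx]; idx += 1                     # S2.3: next pending node
        srcData = dataArray[i]                           # S2.4: source attribute and
        start, end = offsetArray[i], offsetArray[i + 1]  # row start/end offsets
        current_idx = start                              # S2.5
        while current_idx < end:                         # S2.6
            if len(cache) >= capacity:                   # S2.7: hardware would stall;
                return                                   # the model just stops
            dst = edgeArray[current_idx]                 # S2.8: edge node and
            w = weightArray[current_idx]                 # edge weight
            dstData = dataArray[dst]                     # S2.9: destination attribute
            cache.append((i, dst, srcData, dstData, w))
            current_idx += 1                             # S2.9: advance the column
```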
In addition, the invention also provides a processor, comprising a processor body having a plurality of processor cores, the processor body being provided with the above graph-computation-oriented data prefetcher.
In addition, the invention also provides a computer device, comprising a processor and a memory connected to each other, the processor being provided with any of the above graph-computation-oriented data prefetchers.
Compared with the prior art, the data prefetcher has the following main advantages. To alleviate the memory access bottleneck of graph computation, the data prefetcher comprises prefetch units located between each processor core and its second-level cache L2. Each prefetch unit comprises a configuration register set for storing software-configured data related to the graph computation data structure, a status register set for maintaining the prefetch unit's intermediate index variables during prefetching, a control state machine for controlling the execution of the prefetch pipeline by reading the contents of the configuration register set and the status register set, a prefetch pipeline for executing each stage of the graph data access prefetch logic in a pipelined manner, and a data cache for storing the data prefetched by the prefetch pipeline. The invention can improve the timeliness and accuracy of prefetching for graph computation, thereby improving the efficiency of graph computation, alleviating its memory access bottleneck, and improving the graph computing performance of the processor.
Drawings
Fig. 1 is a schematic diagram of the prior-art compressed sparse row (Compressed Sparse Row, CSR) format.
FIG. 2 shows the memory access instruction stall-cycle fraction of the prior-art single-source shortest path algorithm.
FIG. 3 is a flow chart of the data access of the graph in an embodiment of the invention.
FIG. 4 is a schematic diagram of a data prefetcher according to an embodiment of the invention.
FIG. 5 is a schematic diagram of a prefetch unit in a data prefetcher according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating the overall implementation of the collaboration mechanism in accordance with an embodiment of the present invention.
FIG. 7 is a schematic diagram of a workflow of a data prefetcher according to an embodiment of the invention.
FIG. 8 is a graph showing normalized execution times of different algorithms according to an embodiment of the present invention.
FIG. 9 shows the normalized memory access instruction stall-cycle fractions of different algorithms in an embodiment of the present invention.
Detailed Description
In order to enable those skilled in the art to better understand the technical solution of the present invention, the technical solution of the present invention will be further described in detail below with reference to the accompanying drawings in the embodiments of the present invention.
As shown in fig. 3, the steps of accessing the graph data by the data prefetcher in this embodiment include: S101, initializing the column index idx to 0 and the number len of work queue elements to the work queue size listArray.size(); S102, judging whether the column index idx is smaller than the number len of the work queue elements; ending if not, otherwise jumping to the next step; S103, taking out the node LISTARRAY[idx] corresponding to the column index idx from the work queue LISTARRAY as the node i to be processed, and adding 1 to the column index idx (idx += 1); S104, taking out the offset offsetArray[i] corresponding to the node i to be processed from the offset array offsetArray as the current row start offset start, and the offset offsetArray[i+1] corresponding to the next node i+1 as the current row end offset end; S105, assigning the current row start offset start to the current column index current_idx; S106, judging whether the current column index current_idx is smaller than the current row end offset end; jumping to step S102 if not, otherwise jumping to the next step; S107, taking out the element EDGEARRAY[current_idx] corresponding to the current column index current_idx from the edge array EDGEARRAY as an outgoing-edge node of the node i to be processed, and the element WEIGHTARRAY[current_idx] from the weight array WEIGHTARRAY as the corresponding outgoing-edge weight; S108, reading the attribute data of that outgoing-edge node from the attribute array DATAARRAY; S109, adding 1 to the current column index current_idx and jumping to step S106.
In order to alleviate the memory access bottleneck of graph computation, as shown in fig. 4, this embodiment provides a graph-computation-oriented data prefetcher, which includes a prefetch unit located between each processor core and its second-level cache L2; the prefetch unit is placed beside the L1 cache to reduce the latency of the processor's accesses to the prefetched data. As shown in fig. 5, the prefetch unit includes:
a configuration register set for storing software-configured data related to the graph computation data structure, including the work queue and the array addresses related to the compressed sparse row format;
a status register set for maintaining the intermediate index variables of the prefetch unit during the prefetch process and providing them to the control state machine to generate control signals, specifically including the work queue index, the out-edge offset index, and the data-cache-full flag;
a control state machine for controlling execution of the prefetch pipeline by reading the contents of the configuration register set and the status register set, including suspending and resuming the prefetch pipeline;
a prefetch pipeline for executing each stage of the graph data access prefetch logic in a pipelined manner, where the work of each stage includes calculating a prefetch address, sending a prefetch request, and receiving prefetched data;
and a data cache for storing the data prefetched by the prefetch pipeline; the data cache maintains prefetched data at edge granularity, each entry contains all the information related to one edge, and an edge's information is written into the cache at the last stage of the pipeline.
As shown in FIG. 5, the stages of executing the graph data access prefetch logic in a pipelined manner in this embodiment include: stage 1, accessing the work queue LISTARRAY to prefetch the pending node n; stage 2, accessing the attribute array DATAARRAY and the offset array offsetArray to prefetch the attribute and out-edge offsets of the source node; stage 3, accessing the edge array EDGEARRAY and the weight array WEIGHTARRAY to prefetch the edge nodes and weights; and stage 4, accessing the attribute array DATAARRAY to prefetch the attribute of the destination node.
As shown in FIG. 5, the work queue and compressed-sparse-row-related array addresses saved in the configuration register set in this embodiment include the work queue starting address, the number of work queue elements, the offset array starting address, the edge array starting address, the edge weight array starting address, and the attribute array starting address.
As shown in fig. 5, the intermediate index variables in the status register set in this embodiment include a column index idx, a current column index current_idx, a current row start offset start, a current row end offset end, and a data-cache-full flag full. The data-cache-full flag full serves as the condition for the control state machine to suspend and resume the prefetch pipeline: if the flag full is true, the prefetch pipeline is suspended; otherwise, the prefetch pipeline is resumed.
As shown in fig. 5, the fields of each item of data in the data cache in this embodiment include a source node ID, a destination node ID, a source node attribute, a destination node attribute, and an out-edge weight.
As shown in fig. 6, the data prefetcher in this embodiment provides a configuration interface function config (addr, data), a fetch interface function fetch_edge () and a buffer empty state judgment interface function empty () for use by software, where addr is an address of the data prefetcher, data is configuration data, and the manner in which the software uses the data prefetcher includes:
S1, configuring the data prefetcher by software through a call to the configuration interface function config(addr, data), wherein the configuration information comprises the starting addresses and sizes of the offset array offsetArray, the edge array EDGEARRAY, the weight array WEIGHTARRAY, and the attribute array DATAARRAY; the offset array offsetArray records the starting and ending offsets of the outgoing edges, the edge array EDGEARRAY records the ids of the outgoing-edge nodes, the weight array WEIGHTARRAY records the outgoing-edge weights, and the attribute array DATAARRAY records the state data;
S2, configuring a starting address of a work queue LISTARRAY and the number len of the work queue elements through software, enabling a data prefetcher to start prefetching graph data related to all nodes to be processed in the work queue LISTARRAY, wherein the graph data comprises a source node ID, a destination node ID, a source node attribute, a destination node attribute and an out-edge weight, and storing the related graph data into a data cache;
s3, the data in the data cache is obtained through software calling a data access interface function fetch_edge ();
S4, executing corresponding graph calculation by software by using the prefetched data, and updating the edge node attribute;
s5, judging whether data to be processed exist in a data buffer of the data prefetcher or not through a buffer empty state judging interface function empty () called by software, if so, continuing to execute the step S3, otherwise, ending and exiting.
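The software usage flow in steps S1 to S5 can be sketched as a driver loop against a toy stand-in for the hardware interface. The PrefetcherModel class below is an assumption made purely for illustration; only the three interface functions config, fetch_edge and empty come from this embodiment, and here the "hardware" prefetch is emulated by pre-filling the cache by hand:

```python
class PrefetcherModel:
    """Toy software stand-in for the hardware prefetcher interface (illustrative)."""
    def __init__(self):
        self.regs = {}    # configuration registers, addr -> data
        self.cache = []   # data cache, one entry per prefetched edge

    def config(self, addr, data):   # S1/S2: write one configuration register
        self.regs[addr] = data

    def fetch_edge(self):           # S3: pop one prefetched edge entry
        return self.cache.pop(0)

    def empty(self):                # S5: data cache empty state
        return not self.cache

pf = PrefetcherModel()
pf.config("offsetArray", [0, 2, 3])   # S1: CSR array addresses/sizes (illustrative values)
pf.config("edgeArray", [1, 2, 0])
pf.config("listArray", [0, 1])        # S2: work queue; would trigger hardware prefetch
pf.cache = [(0, 1), (0, 2), (1, 0)]   # pretend the hardware prefetched three edges

processed = []
while not pf.empty():                 # S5: loop while data remains
    processed.append(pf.fetch_edge()) # S3/S4: fetch an edge and compute on it
```

The loop structure (configure, then alternate fetch_edge and empty) is the essential contract; the actual computation in S4 depends on the graph algorithm being run.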
As shown in fig. 7, in step S2 of this embodiment, the data prefetcher prefetching the graph data of all pending nodes in the work queue LISTARRAY includes:
S2.1, obtaining the configured start address of the work queue LISTARRAY and the number of work queue elements len, and initializing the column index idx to 0;
S2.2, judging whether the column index idx is smaller than the number of work queue elements len; if not, ending; otherwise, jumping to step S2.3;
S2.3, taking the index number i of the pending node corresponding to the column index idx out of the work queue LISTARRAY, incrementing the column index idx by 1, and proceeding to step S2.4, steps S2.2 and S2.3 repeating until the index numbers of all pending nodes have been produced;
S2.4, prefetching the node attribute srcData of the pending node according to the start address of the attribute array DATAARRAY and the index number i of the pending node, taking the offset corresponding to the index number i of the pending node out of the offset array offsetArray as the current row start offset start, and taking the offset corresponding to the index number i+1 of the next node as the current row end offset end;
S2.5, assigning the current row start offset start to the current column index current_idx;
S2.6, judging whether the current column index current_idx is smaller than the current row end offset end; if not, jumping to step S2.2; otherwise, proceeding to the next step;
S2.7, judging whether the data cache is full; if so, repeating step S2.7; otherwise, proceeding to the next step;
S2.8, taking the element corresponding to the current column index current_idx out of the edge array EDGEARRAY as the destination node of an out-edge of the pending node i, and taking the element corresponding to the current column index current_idx out of the weight array WEIGHTARRAY as the corresponding out-edge weight;
S2.9, reading the attribute data of the destination node of the out-edge of the pending node i from the attribute array DATAARRAY, incrementing the current column index current_idx by 1, and jumping to step S2.6.
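The walk in steps S2.1 to S2.9 is an ordinary CSR traversal, and can be modeled in software as a generator over the four arrays. This is a behavioral sketch only: the function and variable names are illustrative, and the data-cache-full stall of step S2.7 is omitted since a Python consumer applies backpressure implicitly:

```python
def prefetch_walk(listArray, offsetArray, edgeArray, weightArray, dataArray):
    """Yield (src, dst, srcData, dstData, weight) for every out-edge of every
    pending node, following the order of steps S2.1-S2.9 (illustrative model)."""
    idx = 0                                       # S2.1: column index
    while idx < len(listArray):                   # S2.2: more pending nodes?
        i = listArray[idx]                        # S2.3: pending node index
        idx += 1
        srcData = dataArray[i]                    # S2.4: source node attribute
        start, end = offsetArray[i], offsetArray[i + 1]  # S2.4: row bounds
        current_idx = start                       # S2.5
        while current_idx < end:                  # S2.6: more out-edges?
            dst = edgeArray[current_idx]          # S2.8: destination node
            w = weightArray[current_idx]          # S2.8: out-edge weight
            dstData = dataArray[dst]              # S2.9: destination attribute
            yield (i, dst, srcData, dstData, w)
            current_idx += 1                      # S2.9: advance, back to S2.6

# Tiny 3-node CSR graph: node 0 -> {1, 2}, node 1 -> {2}, node 2 has no out-edges
offsetArray = [0, 2, 3, 3]
edgeArray   = [1, 2, 2]
weightArray = [5, 7, 1]
dataArray   = [10, 20, 30]
edges = list(prefetch_walk([0, 1], offsetArray, edgeArray, weightArray, dataArray))
# edges == [(0, 1, 10, 20, 5), (0, 2, 10, 30, 7), (1, 2, 20, 30, 1)]
```

The two nested loops correspond directly to the two loop levels of the flow chart: the outer loop over the work queue (S2.2/S2.3) and the inner loop over one node's out-edge range (S2.6 to S2.9).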
To verify the performance of the graph-computation-oriented data prefetcher in this embodiment, the prefetcher was evaluated on the ZSim simulator with a 4-core Skylake processor model, with the specific configuration shown in Table 1. The tests were performed on the 6 graph datasets shown in Table 2.
Table 1 simulator parameter configuration
Table 2 graph dataset
In this embodiment, 3 graph datasets from different fields, namely the Email-Eu-core, Wiki-Vote and Soc-Epinions datasets, were selected from the SNAP large-scale network dataset collection of Stanford University; each dataset has characteristics and complexity typical of its field, covering a variety of application scenarios and usage conditions. In addition, 3 automatically generated random graph datasets of different scales, namely the Graph1-Graph3 datasets, were designed to simulate various unknown or future graph structures, further strengthening the generality and robustness of the experiment.
To evaluate the performance of the data prefetcher on different graph algorithms, this embodiment uses 3 graph algorithms: the BFS (Breadth-First Search) algorithm, the SSSP (Single-Source Shortest Path) algorithm, and the SSWP (Single-Source Widest Path) algorithm. The BFS algorithm traverses the graph to obtain the depth of every node relative to the origin; it can be applied to friend relations and community discovery in social network analysis, and to state-space search in artificial intelligence, network link analysis, and other fields. The SSSP algorithm finds single-source shortest paths in a graph; in network routing, analyzing the topology to obtain the shortest paths from one node to the others enables efficient routing and data transmission, and in social network analysis the shortest paths obtained by the algorithm support relationship analysis and influence evaluation. The SSWP algorithm solves the single-source widest path problem; in network communication the widest path is the maximum-bandwidth path, so the algorithm can find the maximum-bandwidth path between two nodes, and in traffic planning, where width corresponds to the maximum traffic capacity of a road, it can find the maximum-capacity route between two places.
FIG. 8 shows the execution times of the 3 algorithms on the different datasets, normalized to a baseline system without an integrated data prefetcher; the smaller the value, the higher the performance. It can be seen that the data prefetcher in this embodiment improves performance by 31%, 41% and 36% on average for the BFS, SSSP and SSWP algorithms, respectively. Meanwhile, FIG. 9 shows the memory access instruction stall cycle ratios of the 3 algorithms on the different datasets, likewise normalized to the baseline system without an integrated prefetcher, where smaller is again better. The stall cycle ratios of all 3 algorithms are clearly reduced, with the memory access stalls of the BFS, SSSP and SSWP algorithms reduced by 95%, 66% and 43% on average, respectively. This shows that graph data accesses in graph computation severely stall pipeline execution and degrade overall performance, and that, because the data access pattern of graph computation is relatively fixed, a dedicated data prefetcher can be designed around that pattern. Compared with existing general-purpose data prefetching techniques, the graph-computation-oriented data prefetcher in this embodiment achieves better prefetching performance because its prefetching scheme is customized for graph data accesses. Addressing the shortcomings of existing data prefetching mechanisms, the data prefetcher in this embodiment is designed around the three key problems of data prefetching and, by combining software configuration with hardware prefetching, can be flexibly integrated into existing graph computation frameworks while remaining transparent to the user.
Existing graph computation is usually developed on top of an existing framework, and the added software configuration in this embodiment improves the efficiency of hardware prefetching without changing the original usage pattern. Experiments on various algorithms and datasets show that the graph-computation-oriented data prefetcher in this embodiment improves performance by 31%, 41% and 36% on average for the BFS, SSSP and SSWP algorithms, respectively.
In addition, this embodiment also provides a processor, comprising a processor body with a plurality of processor cores, wherein the processor body is provided with the above graph-computation-oriented data prefetcher. The processor may be a general-purpose processor or an acceleration processor for graph computation.
In addition, this embodiment also provides a computer device, comprising a processor and a memory connected to each other, wherein the processor is provided with the above graph-computation-oriented data prefetcher. The computer device may be a general-purpose computer device or a special-purpose computer device for graph computation.
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples; all technical solutions within the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations that do not depart from the principles of the present invention will occur to those skilled in the art, and such modifications are also to be regarded as within the protection scope of the present invention.

Claims (8)

1. A graph computation oriented data prefetcher, comprising a prefetch unit between each processor core and its level-2 cache L2, said prefetch unit comprising:
a configuration register set for storing software-configured data related to the graph computation data structures, including the work queue and the array addresses related to the compressed sparse row format;
a state register set for maintaining the intermediate index variables of the prefetch unit during the prefetch process and providing them to a control state machine to generate control signals, the intermediate index variables including a work queue index, an out-edge offset index, and a data cache full flag;
a control state machine for controlling execution of the prefetch pipeline, including suspending and resuming the prefetch pipeline, by reading the contents of the configuration register set and the state register set;
a prefetch pipeline for executing the stages of the graph data access prefetch logic in a pipelined manner, the operation of each stage comprising prefetch address calculation, prefetch request issue, and prefetch data reception;
a data cache for storing the data prefetched by the prefetch pipeline, the data cache maintaining the prefetched data at edge granularity, each entry containing all information related to one edge, the edge information being stored into the cache at the last stage of the pipeline;
wherein the pipelined execution of the graph data access prefetch logic comprises: stage 1, accessing the work queue LISTARRAY to prefetch a pending node n; stage 2, accessing the attribute array DATAARRAY and the offset array offsetArray to prefetch the source node attribute and offsets; stage 3, accessing the edge array EDGEARRAY and the weight array WEIGHTARRAY to prefetch the out-edge destination node and weight; and stage 4, accessing the attribute array DATAARRAY to prefetch the attribute data of the destination node.
2. The graph computation oriented data prefetcher of claim 1, wherein the work queue and compressed sparse row format related array addresses stored in the configuration register set include a work queue start address, a work queue element number, an offset data start address, an edge array start address, an edge weight array start address, and an attribute array start address.
3. The graph computation oriented data prefetcher of claim 2, wherein the work queue index, the out-edge offset index, and the data cache full flag include a column index idx, a current column index current_idx, a current row start offset start, a current row end offset end, and a data cache full flag full; the data cache full flag full serves as the condition for the control state machine to suspend and resume the prefetch pipeline, and if the data cache full flag full is true, the prefetch pipeline is suspended, otherwise the prefetch pipeline is resumed.
4. The graph computation oriented data prefetcher of claim 3, wherein the fields of each item of data in the data cache include a source node ID, a destination node ID, a source node attribute, a destination node attribute, and an out-edge weight.
5. The graph computation oriented data prefetcher of claim 4, wherein the data prefetcher provides a configuration interface function config(addr, data), a data fetch interface function fetch_edge(), and a cache empty state judgment interface function empty() for use by software, where addr is an address within the data prefetcher and data is the configuration data, and the software uses the data prefetcher as follows:
S1, the software configures the data prefetcher by calling the configuration interface function config(addr, data), where the configuration information includes the start addresses and sizes of the offset array offsetArray, the edge array EDGEARRAY, the weight array WEIGHTARRAY, and the attribute array DATAARRAY; the offset array offsetArray records the start and end offsets of each node's out-edges, the edge array EDGEARRAY records the IDs of out-edge destination nodes, the weight array WEIGHTARRAY records the out-edge weights, and the attribute array DATAARRAY records the node attribute data;
S2, the software configures the start address of the work queue LISTARRAY and the number of work queue elements len, whereupon the data prefetcher starts prefetching the graph data of all pending nodes in the work queue LISTARRAY, the graph data including a source node ID, a destination node ID, a source node attribute, a destination node attribute, and an out-edge weight, and stores the prefetched graph data into the data cache;
S3, the software obtains the data in the data cache by calling the data fetch interface function fetch_edge();
S4, the software performs the corresponding graph computation with the prefetched data and updates the destination node attribute;
S5, the software calls the cache empty state judgment interface function empty() to judge whether data remains to be processed in the data cache of the data prefetcher; if so, execution continues from step S3; otherwise, the procedure ends.
6. The graph computation oriented data prefetcher of claim 5, wherein the data prefetcher prefetching the graph data of all pending nodes in the work queue LISTARRAY in step S2 comprises:
S2.1, obtaining the configured start address of the work queue LISTARRAY and the number of work queue elements len, and initializing the column index idx to 0;
S2.2, judging whether the column index idx is smaller than the number of work queue elements len; if not, ending; otherwise, jumping to step S2.3;
S2.3, taking the index number i of the pending node corresponding to the column index idx out of the work queue LISTARRAY, incrementing the column index idx by 1, and proceeding to step S2.4, steps S2.2 and S2.3 repeating until the index numbers of all pending nodes have been produced;
S2.4, prefetching the node attribute srcData of the pending node according to the start address of the attribute array DATAARRAY and the index number i of the pending node, taking the offset corresponding to the index number i of the pending node out of the offset array offsetArray as the current row start offset start, and taking the offset corresponding to the index number i+1 of the next node as the current row end offset end;
S2.5, assigning the current row start offset start to the current column index current_idx;
S2.6, judging whether the current column index current_idx is smaller than the current row end offset end; if not, jumping to step S2.2; otherwise, proceeding to the next step;
S2.7, judging whether the data cache is full; if so, repeating step S2.7; otherwise, proceeding to the next step;
S2.8, taking the element corresponding to the current column index current_idx out of the edge array EDGEARRAY as the destination node of an out-edge of the pending node i, and taking the element corresponding to the current column index current_idx out of the weight array WEIGHTARRAY as the corresponding out-edge weight;
S2.9, reading the attribute data of the destination node of the out-edge of the pending node i from the attribute array DATAARRAY, incrementing the current column index current_idx by 1, and jumping to step S2.6.
7. A processor, comprising a processor body with a plurality of processor cores, wherein the processor body is provided with the graph computation oriented data prefetcher of any one of claims 1 to 6.
8. A computer device, comprising a processor and a memory connected to each other, wherein the processor is provided with the graph computation oriented data prefetcher of any one of claims 1 to 6.
CN202510527371.3A 2025-04-25 2025-04-25 Data prefetcher, processor and device for graph computing Active CN120066987B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510527371.3A CN120066987B (en) 2025-04-25 2025-04-25 Data prefetcher, processor and device for graph computing


Publications (2)

Publication Number Publication Date
CN120066987A CN120066987A (en) 2025-05-30
CN120066987B true CN120066987B (en) 2025-08-05

Family

ID=95789260

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510527371.3A Active CN120066987B (en) 2025-04-25 2025-04-25 Data prefetcher, processor and device for graph computing

Country Status (1)

Country Link
CN (1) CN120066987B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109461113A (en) * 2018-10-11 2019-03-12 中国人民解放军国防科技大学 A data structure-oriented graphics processor data prefetching method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6772179B2 (en) * 2001-12-28 2004-08-03 Lucent Technologies Inc. System and method for improving index performance through prefetching
US10157136B2 (en) * 2016-03-31 2018-12-18 Intel Corporation Pipelined prefetcher for parallel advancement of multiple data streams
US10909039B2 (en) * 2019-03-15 2021-02-02 Intel Corporation Data prefetching for graphics data processing
CN114756483A (en) * 2022-03-31 2022-07-15 深圳清华大学研究院 Subgraph segmentation optimization method based on inter-core storage access and application
CN118132464A (en) * 2024-01-29 2024-06-04 西安交通大学 Data prefetcher for coping with irregular access



Similar Documents

Publication Publication Date Title
Lin et al. Pagraph: Scaling gnn training on large graphs via computation-aware caching
Zhuo et al. Graphq: Scalable pim-based graph processing
Basak et al. Analysis and optimization of the memory hierarchy for graph processing workloads
Xiao et al. A load balancing inspired optimization framework for exascale multicore systems: A complex networks approach
Bai et al. Efficient data loader for fast sampling-based GNN training on large graphs
US10949200B2 (en) Methods and apparatus for executing data-dependent threads in parallel
US11520589B2 (en) Data structure-aware prefetching method and device on graphics processing unit
US20110066830A1 (en) Cache prefill on thread migration
CN110704360A (en) Graph calculation optimization method based on heterogeneous FPGA data flow
TW201631479A (en) Multiple data prefetchers that defer to one another based on prefetch effectiveness by memory access type
CN107844380B (en) Multi-core cache WCET analysis method supporting instruction prefetching
US11636122B2 (en) Method and apparatus for data mining from core traces
Zhou et al. Gas: A heterogeneous memory architecture for graph processing
Huber et al. Worst‐case execution time analysis‐driven object cache design
Shen et al. Statistical behavior guided block allocation in hybrid cache-based edge computing for cyber-physical-social systems
Falahati et al. Cross-core data sharing for energy-efficient GPUs
Li et al. PIM-WEAVER: A high energy-efficient, general-purpose acceleration architecture for string operations in big data processing
Subedi et al. Rise: Reducing i/o contention in staging-based extreme-scale in-situ workflows
WO2018076979A1 (en) Detection method and apparatus for data dependency between instructions
Keshtegar et al. Cluster‐based approach for improving graphics processing unit performance by inter streaming multiprocessors locality
Basak et al. Exploring core and cache hierarchy bottlenecks in graph processing workloads
Mao et al. PMGraph: Accelerating Concurrent Graph Queries over Streaming Graphs
CN108846248B (en) An application modeling and performance prediction method
Min Fine-grained memory access over I/O interconnect for efficient remote sparse data access

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant