
CN120066987B - Data prefetcher, processor and device for graph computing - Google Patents

Data prefetcher, processor and device for graph computing

Info

Publication number
CN120066987B
CN120066987B (application CN202510527371.3A)
Authority
CN
China
Prior art keywords
data
graph
array
prefetch
offset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202510527371.3A
Other languages
Chinese (zh)
Other versions
CN120066987A (en)
Inventor
李晨
李宣佚
郭阳
鲁建壮
陈小文
袁珩洲
刘畅
汪志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202510527371.3A
Publication of CN120066987A
Application granted
Publication of CN120066987B
Active legal status (current)
Anticipated expiration legal status


Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 - Addressing or allocation; Relocation
    • G06F12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0862 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with prefetch
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 - Addressing or allocation; Relocation
    • G06F12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806 - Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0811 - Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/901 - Indexing; Data structures therefor; Storage structures
    • G06F16/9024 - Graphs; Linked lists
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098 - Register arrangements
    • G06F9/30101 - Special purpose registers
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 - Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3867 - Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract


The present invention discloses a data prefetcher, processor, and device for graph computing. The data prefetcher of the present invention includes a prefetch unit located between each processor core and its L2 cache. The prefetch unit includes: a configuration register set for storing data related to the graph computing data structure configured by software; a status register set for maintaining the intermediate index variables of the prefetch unit during the prefetch process; a control state machine for controlling the execution of the prefetch pipeline by reading the contents of the configuration register set and the status register set; a prefetch pipeline for executing each stage of the graph data access prefetch logic in a pipelined manner; and a data cache for storing data prefetched by the prefetch pipeline. The present invention aims to improve the timeliness and accuracy of prefetching for graph computing, thereby improving the efficiency of graph computing, alleviating the memory access bottleneck of graph computing, and improving the graph computing performance of the processor.

Description

Data prefetcher, processor and device for graph computing
Technical Field
The present invention relates to the field of processor technologies, and in particular, to a graph computation-oriented data prefetcher, a processor, and a device.
Background
Data prefetching is a technique that predicts memory access addresses before the data is requested, fetches the potentially needed data in advance, and stores it in a cache or temporary buffer, thereby hiding memory access latency and improving processor speed. Traditional data prefetching techniques predict future memory access addresses based on the memory access patterns of the application: they fetch data from memory or lower-level caches according to the predicted address, so that when a memory access request arrives the corresponding data is already resident at the corresponding cache level, reducing memory access latency. Depending on the implementation, data prefetching can be divided into hardware prefetching and software prefetching.
The most common hardware prefetcher is the stride prefetcher, which exploits fixed, repeatable address differences in an access sequence. Conventional data prefetchers such as GHB and AMPM prefetch by learning a stride history in the address stream, and can therefore only identify streaming memory access patterns with deterministic regularity. To further capture indirect access data streams, Yu et al. proposed IMP, a hardware mechanism supporting indirect data stream prefetching that first identifies the sequential data stream and, on that basis, recognizes the indirect access pattern. However, these prefetchers have difficulty capturing the four-level indirect access pattern D[C[j]] in graph computation, where i is in [0, n-1] and j is in [B[A[i]], B[A[i]+1]), and their implementation overhead is high because of the large amount of memory required to record the access history. Ainsworth proposed a graph-structure-aware data prefetcher, which monitors data requests in order to prefetch the data of subsequent nodes; however, it requires particular tuning of the prefetch timing, and because the prefetched data is placed in the L1 cache, the resource waste caused by cache replacement must also be considered.
Software prefetching is implemented by a programmer or compiler inserting prefetch instructions into the program; for example, Ainsworth proposed a compiler-based indirect-pattern prefetcher. Software prefetching can generally capture access patterns more accurately, but it has inherent limitations: it may not be portable across microarchitectures (e.g., ones with different cache hierarchies), and it incurs significant instruction overhead and increased power consumption, which can mask the benefit of the improved cache hit rate.
An ideal prefetching technique adds almost no extra overhead to memory accesses. In practice, however, prefetching is not always timely or accurate, and lagging or erroneous prefetches can hurt performance. A prefetching technique must therefore address three basic questions: what to prefetch (i.e., the address of the prefetched data), when to prefetch it, and where to place the prefetched data. Based on the data access pattern of graph computation, the present design can accurately prefetch all relevant data, and all prefetched data is stored in a dedicated prefetch cache, avoiding pollution of the ordinary cache.
Graphs in real-world scenarios are highly sparse: most node pairs are not connected by an edge, while a relatively small number of nodes in the graph structure are connected to a large number of others. To improve the storage efficiency of graph data, graph data is therefore usually stored in compressed sparse row (Compressed Sparse Row, CSR) format, which has high time and space utilization. As shown in FIG. 1, the CSR format uses four arrays to store the graph data: the offset array stores, for each node, the offset of its first outgoing edge within the edge array; the edge array stores the numbers (ids) of all outgoing-edge destination nodes in node order; the weight array stores the weights of the edges at the corresponding positions of the edge array; and the attribute (property) data of the nodes used by the graph computation (e.g., V0-V4) is stored in the attribute array.
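The CSR layout described above can be made concrete with a small example. The sketch below is illustrative only: the toy graph, the array contents, and the helper `out_edges` are assumptions, not data from the patent; only the four-array structure (offset, edge, weight, attribute) follows the text.

```python
# Toy CSR representation of a 5-node graph (values are assumptions for
# illustration; only the four-array structure follows the description).
offsetArray = [0, 2, 3, 5, 6, 6]    # offsetArray[i]: index of node i's first out-edge
edgeArray   = [1, 2, 3, 0, 4, 2]    # destination node ids, grouped by source node
weightArray = [5, 1, 2, 7, 3, 4]    # weight of the edge at the same edgeArray position
dataArray   = [10, 20, 30, 40, 50]  # per-node attribute values (V0..V4)

def out_edges(i):
    """Return all (destination, weight) pairs of node i's outgoing edges."""
    start, end = offsetArray[i], offsetArray[i + 1]
    return [(edgeArray[j], weightArray[j]) for j in range(start, end)]
```

For instance, node 0 has two outgoing edges (to nodes 1 and 2), while node 4 has none, which the format encodes as two equal consecutive offsets.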
When an edge's destination vertex is updated, the update needs to access several pieces of data, including the destination vertex id, the edge weight, and the vertex attribute, while the computation itself is usually only a reduction operation with a small amount of computation. When graph computation runs on a general-purpose processor platform it is therefore memory-access intensive and presents three challenges. 1) Weak locality: because the graph data structure is complex and irregular, the program's memory access behavior exhibits low locality. Although real-world graph data has been found to contain many redundant structures, current graph computing applications still have difficulty accurately predicting the locations of those structures at runtime, which limits optimization of data access locality. 2) High concurrency: the large volume of data accesses generated by traversing vertex or edge data, combined with simple reduction computations, causes the application to issue a large number of memory requests in a short time, and under such high concurrency the processor cannot obtain the required data in time. 3) Fine granularity: access and compute operations in graph computing applications typically involve only certain attributes of vertices or edges, which occupy little space, so the memory access behavior of the graph computing application exhibits fine-grained characteristics.
Conventional von Neumann general-purpose processor architectures use unified off-chip memory resources to store instructions and data. For memory-intensive applications such as graph computing, the large-scale, highly concurrent access requests mean that a general-purpose processor based on the von Neumann architecture obtains the data needed by its operation instructions only after consuming many idle wait cycles, so the instruction pipeline cannot be filled with enough compute instructions and instruction-level parallelism is not fully exploited.
Pipeline stalls are a direct cause of processor performance degradation; the main hazards that stall an out-of-order superscalar processor are structural, data, and control hazards. The data access bottleneck can be analyzed through the fraction of pipeline stall cycles caused by data dependences. FIG. 2 shows the stall-cycle fraction of memory access instructions measured by running the single-source shortest path (SSSP) algorithm on different data sets. As can be seen from FIG. 2, the average stall-cycle fraction of load instructions is 38.02%, while that of store instructions is only 19.77%, mainly because load instructions lie on the critical execution path and the subsequent compute instructions all depend on their results; optimizing load instructions therefore has a greater effect on overall performance. In practice, however, prefetching is not always timely or accurate, so how a graph-computation-oriented data prefetcher can alleviate the memory access bottleneck of graph computation and improve the timeliness and accuracy of prefetching to improve the efficiency of graph computation has become a key technical problem to be solved.
Disclosure of Invention
Aiming at the problems in the prior art, the present invention provides a data prefetcher, processor, and device for graph computing, with the aim of improving the timeliness and accuracy of prefetching for graph computation, thereby improving the efficiency of graph computation, alleviating its memory access bottleneck, and improving the graph computing performance of the processor.
In order to solve the technical problems, the invention adopts the following technical scheme:
A graph-computation-oriented data prefetcher, comprising a prefetch unit located between each processor core and its second-level cache L2, the prefetch unit comprising:
a configuration register set for storing software-configured data related to the graph computation data structure, including the work queue and the array addresses related to the compressed sparse row format;
a status register set for maintaining the intermediate index variables of the prefetch unit during the prefetch process and providing them to the control state machine to generate control signals, specifically including the work queue index, the out-edge offset index, and the data-cache-full flag;
a control state machine for controlling execution of the prefetch pipeline by reading the contents of the configuration register set and the status register set, including suspending and resuming the prefetch pipeline;
a prefetch pipeline for executing each stage of the graph data access prefetch logic in a pipelined manner, where the work of each stage includes calculating a prefetch address, sending a prefetch request, and receiving prefetched data;
and a data cache for storing the data prefetched by the prefetch pipeline; the data cache maintains prefetched data at edge granularity, each entry contains all the information related to one edge, and an edge's information is written into the cache at the last stage of the pipeline.
Optionally, the stages of executing the graph data access prefetch logic in a pipelined manner include: stage 1, accessing the work queue LISTARRAY to prefetch the pending node n; stage 2, accessing the attribute array DATAARRAY and the offset array offsetArray to prefetch the attribute and out-edge offsets of the source node; stage 3, accessing the edge array EDGEARRAY and the weight array WEIGHTARRAY to prefetch the edge nodes and weights; and stage 4, accessing the attribute array DATAARRAY to prefetch the attribute of the destination node.
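The four stages above amount to a chain of dependent array lookups per pending node. A minimal software model, under assumed array names and a toy graph (the function shape and the data are illustrative, not the patented hardware), might look like:

```python
def prefetch_node(idx, listArray, offsetArray, edgeArray, weightArray, dataArray):
    """Model the four pipeline stages for the idx-th work-queue entry."""
    entries = []
    n = listArray[idx]                                # stage 1: pending node n
    srcData = dataArray[n]                            # stage 2: source attribute...
    start, end = offsetArray[n], offsetArray[n + 1]   # ...and its out-edge offsets
    for j in range(start, end):
        dst = edgeArray[j]                            # stage 3: edge node id...
        w = weightArray[j]                            # ...and edge weight
        dstData = dataArray[dst]                      # stage 4: destination attribute
        entries.append((n, dst, srcData, dstData, w))
    return entries
```

Each resulting tuple carries exactly the five fields the data cache is described as storing: source node ID, destination node ID, source node attribute, destination node attribute, and out-edge weight.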
Optionally, the work queue and compressed-sparse-row-related array addresses saved in the configuration register set include the work queue starting address, the number of work queue elements, the offset array starting address, the edge array starting address, the edge weight array starting address, and the attribute array starting address.
Optionally, the intermediate index variables in the status register set include a column index idx, a current column index current_idx, a current row start offset start, a current row end offset end, and a data-cache-full flag full. The data-cache-full flag serves as the condition for the control state machine to suspend and resume the prefetch pipeline: if the flag full is true, the prefetch pipeline is suspended; otherwise, the prefetch pipeline is resumed.
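How the full flag gates the pipeline can be sketched as a tiny state-transition function. Everything below (the state names, the dictionary-based registers) is an assumption for illustration; only the pause-on-full / resume-on-space behavior and the register names follow the text.

```python
def next_state(state, regs, len_queue, cache_len, capacity):
    """Return the control state machine's next state from the status registers."""
    regs["full"] = cache_len >= capacity        # recompute the data-cache-full flag
    if regs["full"]:
        return "PAUSED"                         # full flag suspends the pipeline
    if state == "PAUSED":
        return "RUN"                            # space freed up: resume prefetching
    if regs["current_idx"] < regs["end"]:
        return "RUN"                            # keep streaming this row's edges
    return "RUN" if regs["idx"] < len_queue else "DONE"  # next node, or finish
```

In hardware the same decision would be combinational logic reading the status registers each cycle; the function form just makes the transitions explicit.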
Optionally, the field of each item of data in the data cache includes a source node ID, a destination node ID, a source node attribute, a destination node attribute, and an out-edge weight.
Optionally, the data prefetcher provides a configuration interface function config (addr, data), a fetch interface function fetch_edge () and a buffer empty state judgment interface function empty () for use by the software, wherein addr is an address of the data prefetcher, data is configuration data, and a manner in which the software uses the data prefetcher includes:
S1, configuring the data prefetcher by software through a call to the configuration interface function config(addr, data), wherein the configuration information comprises the starting addresses and sizes of the offset array offsetArray, the edge array EDGEARRAY, the weight array WEIGHTARRAY, and the attribute array DATAARRAY; the offset array offsetArray records the starting and ending offsets of the outgoing edges, the edge array EDGEARRAY records the ids of the outgoing-edge nodes, the weight array WEIGHTARRAY records the outgoing-edge weights, and the attribute array DATAARRAY records the state data;
S2, configuring a starting address of a work queue LISTARRAY and the number len of the work queue elements through software, enabling a data prefetcher to start prefetching graph data related to all nodes to be processed in the work queue LISTARRAY, wherein the graph data comprises a source node ID, a destination node ID, a source node attribute, a destination node attribute and an out-edge weight, and storing the related graph data into a data cache;
S3, obtaining the data in the data cache by software through a call to the data access interface function fetch_edge();
S4, executing corresponding graph calculation by software by using the prefetched data, and updating the edge node attribute;
S5, judging by software, through a call to the buffer empty state judgment interface function empty(), whether data to be processed still exists in the data buffer of the data prefetcher; if so, continuing to execute step S3; otherwise, ending and exiting.
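The S1-S5 flow can be exercised end-to-end against a software mock of the prefetcher. The `MockPrefetcher` below is purely illustrative: in the mock, `addr` stands in for a register name rather than a hardware address, and the SSSP-style relaxation in the loop is just one possible example of step S4; only the interface functions config/fetch_edge/empty and the five-field entries follow the text.

```python
class MockPrefetcher:
    """Software stand-in for the prefetch unit (illustrative only)."""
    def __init__(self):
        self.buf = []                      # models the data cache

    def config(self, addr, data):          # S1: configure one array "register"
        setattr(self, addr, data)

    def start(self, listArray):            # S2: prefetch for all pending nodes
        for n in listArray:
            s, e = self.offsetArray[n], self.offsetArray[n + 1]
            for j in range(s, e):
                self.buf.append((n, self.edgeArray[j], self.dataArray[n],
                                 self.dataArray[self.edgeArray[j]],
                                 self.weightArray[j]))

    def fetch_edge(self):                  # S3: hand one prefetched entry to software
        return self.buf.pop(0)

    def empty(self):                       # S5: nothing left to process?
        return not self.buf

# Example run: SSSP-style relaxation of node 0's outgoing edges.
pf = MockPrefetcher()
pf.config("offsetArray", [0, 2, 3, 5, 6, 6])
pf.config("edgeArray",   [1, 2, 3, 0, 4, 2])
pf.config("weightArray", [5, 1, 2, 7, 3, 4])
pf.config("dataArray",   [0, 99, 99, 99, 99])   # distances; source distance is 0
pf.start([0])                                   # S2: node 0 is the only pending node
while not pf.empty():                           # S5
    src, dst, sd, dd, w = pf.fetch_edge()       # S3
    pf.dataArray[dst] = min(dd, sd + w)         # S4: relax the edge
```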
Optionally, the data prefetcher in step S2 starts prefetching graph data related to all pending nodes in the work queue LISTARRAY, including:
S2.1, acquiring the input starting address of the work queue LISTARRAY and the number len of the work queue elements, initializing the column index idx to be 0;
S2.2, judging whether the column index idx is smaller than the number len of the work queue elements; if the column index idx is not smaller than the number len of the work queue elements, ending; otherwise, jumping to step S2.3;
S2.3, taking out the index number i of the node to be processed corresponding to the column index idx from the work queue LISTARRAY, incrementing the column index idx by 1, and proceeding to step S2.4 (steps S2.3 to S2.9 are repeated until the numbers of all the nodes to be processed have been generated);
S2.4, pre-obtaining node attributes srcData of the nodes to be processed according to the starting address of the attribute array DATAARRAY and the index number i of the nodes to be processed, and taking out the offset corresponding to the index number i of the nodes to be processed from the offset array offsetArray as a current line starting offset start, and taking the offset corresponding to the index number i+1 of the next node as a current line ending offset end;
S2.5, assigning the current row start offset start to the current column index current_idx;
S2.6, judging whether the current column index current_idx is smaller than the end offset end of the current line or not, if not, jumping to the step S2.2, otherwise, jumping to the next step;
S2.7, judging whether the data cache is full; if so, repeating step S2.7 until space is available; otherwise, proceeding to the next step;
S2.8, taking out an element corresponding to the current column index current_idx from the edge array EDGEARRAY as an edge node corresponding to the obtained node i to be processed, and taking out an element corresponding to the current column index current_idx from the weight array WEIGHTARRAY as an obtained edge weight;
S2.9, reading the attribute data of the outgoing-edge node corresponding to the node i to be processed from the attribute array DATAARRAY according to the node id, incrementing the current column index current_idx by 1, and jumping to step S2.6.
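Steps S2.1 to S2.9 translate almost line-for-line into a software loop. The sketch below is an assumption-laden rendering (in real hardware S2.7 stalls the pipeline until the consumer drains the cache; this model simply stops filling once the capacity is hit), with register names taken from the text:

```python
def prefetch_worklist(listArray, offsetArray, edgeArray, weightArray,
                      dataArray, cache, capacity):
    """Fill `cache` with (src, dst, srcData, dstData, weight) entries."""
    idx = 0                                              # S2.1
    lenq = len(listArray)
    while idx < lenq:                                    # S2.2
        i = listArray[idx]; idx += 1                     # S2.3: next pending node
        srcData = dataArray[i]                           # S2.4: source attribute and
        start, end = offsetArray[i], offsetArray[i + 1]  # row start/end offsets
        current_idx = start                              # S2.5
        while current_idx < end:                         # S2.6
            if len(cache) >= capacity:                   # S2.7: hardware would stall;
                return                                   # the model just stops
            dst = edgeArray[current_idx]                 # S2.8: edge node and
            w = weightArray[current_idx]                 # edge weight
            dstData = dataArray[dst]                     # S2.9: destination attribute
            cache.append((i, dst, srcData, dstData, w))
            current_idx += 1                             # S2.9: advance the column
```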
In addition, the invention also provides a processor, comprising a processor body having a plurality of processor cores, the processor body being provided with the above graph-computation-oriented data prefetcher.
In addition, the invention also provides a computer device, comprising a processor and a memory connected to each other, the processor being provided with any of the above graph-computation-oriented data prefetchers.
Compared with the prior art, the data prefetcher has the following main advantages. To alleviate the memory access bottleneck of graph computation, the data prefetcher comprises prefetch units located between each processor core and its second-level cache L2. Each prefetch unit comprises a configuration register set for storing software-configured data related to the graph computation data structure, a status register set for maintaining the prefetch unit's intermediate index variables during prefetching, a control state machine for controlling the execution of the prefetch pipeline by reading the contents of the configuration register set and the status register set, a prefetch pipeline for executing each stage of the graph data access prefetch logic in a pipelined manner, and a data cache for storing the data prefetched by the prefetch pipeline. The invention can improve the timeliness and accuracy of prefetching for graph computation, thereby improving the efficiency of graph computation, alleviating its memory access bottleneck, and improving the graph computing performance of the processor.
Drawings
Fig. 1 is a schematic diagram of the prior-art compressed sparse row (Compressed Sparse Row, CSR) format.
FIG. 2 shows the memory access instruction stall-cycle fraction of the prior-art single-source shortest path algorithm.
FIG. 3 is a flow chart of the data access of the graph in an embodiment of the invention.
FIG. 4 is a schematic diagram of a data prefetcher according to an embodiment of the invention.
FIG. 5 is a schematic diagram of a prefetch unit in a data prefetcher according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating the overall implementation of the collaboration mechanism in accordance with an embodiment of the present invention.
FIG. 7 is a schematic diagram of a workflow of a data prefetcher according to an embodiment of the invention.
FIG. 8 is a graph showing normalized execution times of different algorithms according to an embodiment of the present invention.
FIG. 9 shows the normalized memory access instruction stall-cycle fractions of different algorithms in an embodiment of the present invention.
Detailed Description
In order to enable those skilled in the art to better understand the technical solution of the present invention, the technical solution of the present invention will be further described in detail below with reference to the accompanying drawings in the embodiments of the present invention.
As shown in fig. 3, the steps of accessing the graph data by the data prefetcher in this embodiment include: S101, initializing the column index idx to 0 and the number len of work queue elements to the work queue size listArray.size(); S102, judging whether the column index idx is smaller than the number len of the work queue elements; ending if not, otherwise jumping to the next step; S103, taking out the node LISTARRAY[idx] corresponding to the column index idx from the work queue LISTARRAY as the node i to be processed, and adding 1 to the column index idx (idx += 1); S104, taking out the offset offsetArray[i] corresponding to the node i to be processed from the offset array offsetArray as the current row start offset start, and the offset offsetArray[i+1] corresponding to the next node i+1 as the current row end offset end; S105, assigning the current row start offset start to the current column index current_idx; S106, judging whether the current column index current_idx is smaller than the current row end offset end; jumping to step S102 if not, otherwise jumping to the next step; S107, taking out the element EDGEARRAY[current_idx] corresponding to the current column index current_idx from the edge array EDGEARRAY as an outgoing-edge node of the node i to be processed, and the element WEIGHTARRAY[current_idx] from the weight array WEIGHTARRAY as the corresponding outgoing-edge weight; S108, reading the attribute data of that outgoing-edge node from the attribute array DATAARRAY; S109, adding 1 to the current column index current_idx and jumping to step S106.
In order to alleviate the memory access bottleneck of graph computation, as shown in fig. 4, this embodiment provides a graph-computation-oriented data prefetcher, which includes a prefetch unit located between each processor core and its second-level cache L2; the prefetch unit is placed beside the L1 cache to reduce the latency of the processor's accesses to the prefetched data. As shown in fig. 5, the prefetch unit includes:
a configuration register set for storing software-configured data related to the graph computation data structure, including the work queue and the array addresses related to the compressed sparse row format;
a status register set for maintaining the intermediate index variables of the prefetch unit during the prefetch process and providing them to the control state machine to generate control signals, specifically including the work queue index, the out-edge offset index, and the data-cache-full flag;
a control state machine for controlling execution of the prefetch pipeline by reading the contents of the configuration register set and the status register set, including suspending and resuming the prefetch pipeline;
a prefetch pipeline for executing each stage of the graph data access prefetch logic in a pipelined manner, where the work of each stage includes calculating a prefetch address, sending a prefetch request, and receiving prefetched data;
and a data cache for storing the data prefetched by the prefetch pipeline; the data cache maintains prefetched data at edge granularity, each entry contains all the information related to one edge, and an edge's information is written into the cache at the last stage of the pipeline.
As shown in FIG. 5, the stages of executing the graph data access prefetch logic in a pipelined manner in this embodiment include: stage 1, accessing the work queue LISTARRAY to prefetch the pending node n; stage 2, accessing the attribute array DATAARRAY and the offset array offsetArray to prefetch the attribute and out-edge offsets of the source node; stage 3, accessing the edge array EDGEARRAY and the weight array WEIGHTARRAY to prefetch the edge nodes and weights; and stage 4, accessing the attribute array DATAARRAY to prefetch the attribute of the destination node.
As shown in FIG. 5, the work queue and compressed-sparse-row-related array addresses saved in the configuration register set in this embodiment include the work queue starting address, the number of work queue elements, the offset array starting address, the edge array starting address, the edge weight array starting address, and the attribute array starting address.
As shown in fig. 5, the intermediate index variables in the status register set in this embodiment include a column index idx, a current column index current_idx, a current row start offset start, a current row end offset end, and a data-cache-full flag full. The data-cache-full flag full serves as the condition for the control state machine to suspend and resume the prefetch pipeline: if the flag full is true, the prefetch pipeline is suspended; otherwise, the prefetch pipeline is resumed.
As shown in fig. 5, the fields of each item of data in the data cache in this embodiment include a source node ID, a destination node ID, a source node attribute, a destination node attribute, and an out-edge weight.
As shown in fig. 6, the data prefetcher in this embodiment provides a configuration interface function config (addr, data), a fetch interface function fetch_edge () and a buffer empty state judgment interface function empty () for use by software, where addr is an address of the data prefetcher, data is configuration data, and the manner in which the software uses the data prefetcher includes:
S1, configuring the data prefetcher by software through a call to the configuration interface function config(addr, data), wherein the configuration information comprises the starting addresses and sizes of the offset array offsetArray, the edge array EDGEARRAY, the weight array WEIGHTARRAY, and the attribute array DATAARRAY; the offset array offsetArray records the starting and ending offsets of the outgoing edges, the edge array EDGEARRAY records the ids of the outgoing-edge nodes, the weight array WEIGHTARRAY records the outgoing-edge weights, and the attribute array DATAARRAY records the state data;
S2, configuring a starting address of a work queue LISTARRAY and the number len of the work queue elements through software, enabling a data prefetcher to start prefetching graph data related to all nodes to be processed in the work queue LISTARRAY, wherein the graph data comprises a source node ID, a destination node ID, a source node attribute, a destination node attribute and an out-edge weight, and storing the related graph data into a data cache;
s3, the data in the data cache is obtained through software calling a data access interface function fetch_edge ();
S4, executing corresponding graph calculation by software by using the prefetched data, and updating the edge node attribute;
s5, judging whether data to be processed exist in a data buffer of the data prefetcher or not through a buffer empty state judging interface function empty () called by software, if so, continuing to execute the step S3, otherwise, ending and exiting.
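The software usage flow in steps S1 to S5 can be sketched as a driver loop against a toy stand-in for the hardware interface. The PrefetcherModel class below is an assumption made purely for illustration; only the three interface functions config, fetch_edge and empty come from this embodiment, and here the "hardware" prefetch is emulated by pre-filling the cache by hand:

```python
class PrefetcherModel:
    """Toy software stand-in for the hardware prefetcher interface (illustrative)."""
    def __init__(self):
        self.regs = {}    # configuration registers, addr -> data
        self.cache = []   # data cache, one entry per prefetched edge

    def config(self, addr, data):   # S1/S2: write one configuration register
        self.regs[addr] = data

    def fetch_edge(self):           # S3: pop one prefetched edge entry
        return self.cache.pop(0)

    def empty(self):                # S5: data cache empty state
        return not self.cache

pf = PrefetcherModel()
pf.config("offsetArray", [0, 2, 3])   # S1: CSR array addresses/sizes (illustrative values)
pf.config("edgeArray", [1, 2, 0])
pf.config("listArray", [0, 1])        # S2: work queue; would trigger hardware prefetch
pf.cache = [(0, 1), (0, 2), (1, 0)]   # pretend the hardware prefetched three edges

processed = []
while not pf.empty():                 # S5: loop while data remains
    processed.append(pf.fetch_edge()) # S3/S4: fetch an edge and compute on it
```

The loop structure (configure, then alternate fetch_edge and empty) is the essential contract; the actual computation in S4 depends on the graph algorithm being run.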
As shown in fig. 7, in step S2 of this embodiment, the data prefetcher prefetching the graph data of all pending nodes in the work queue LISTARRAY includes:
S2.1, obtaining the configured start address of the work queue LISTARRAY and the number of work queue elements len, and initializing the column index idx to 0;
S2.2, judging whether the column index idx is smaller than the number of work queue elements len; if not, ending; otherwise, jumping to step S2.3;
S2.3, taking the index number i of the pending node corresponding to the column index idx out of the work queue LISTARRAY, incrementing the column index idx by 1, and proceeding to step S2.4, steps S2.2 and S2.3 repeating until the index numbers of all pending nodes have been produced;
S2.4, prefetching the node attribute srcData of the pending node according to the start address of the attribute array DATAARRAY and the index number i of the pending node, taking the offset corresponding to the index number i of the pending node out of the offset array offsetArray as the current row start offset start, and taking the offset corresponding to the index number i+1 of the next node as the current row end offset end;
S2.5, assigning the current row start offset start to the current column index current_idx;
S2.6, judging whether the current column index current_idx is smaller than the current row end offset end; if not, jumping to step S2.2; otherwise, proceeding to the next step;
S2.7, judging whether the data cache is full; if so, repeating step S2.7; otherwise, proceeding to the next step;
S2.8, taking the element corresponding to the current column index current_idx out of the edge array EDGEARRAY as the destination node of an out-edge of the pending node i, and taking the element corresponding to the current column index current_idx out of the weight array WEIGHTARRAY as the corresponding out-edge weight;
S2.9, reading the attribute data of the destination node of the out-edge of the pending node i from the attribute array DATAARRAY, incrementing the current column index current_idx by 1, and jumping to step S2.6.
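The walk in steps S2.1 to S2.9 is an ordinary CSR traversal, and can be modeled in software as a generator over the four arrays. This is a behavioral sketch only: the function and variable names are illustrative, and the data-cache-full stall of step S2.7 is omitted since a Python consumer applies backpressure implicitly:

```python
def prefetch_walk(listArray, offsetArray, edgeArray, weightArray, dataArray):
    """Yield (src, dst, srcData, dstData, weight) for every out-edge of every
    pending node, following the order of steps S2.1-S2.9 (illustrative model)."""
    idx = 0                                       # S2.1: column index
    while idx < len(listArray):                   # S2.2: more pending nodes?
        i = listArray[idx]                        # S2.3: pending node index
        idx += 1
        srcData = dataArray[i]                    # S2.4: source node attribute
        start, end = offsetArray[i], offsetArray[i + 1]  # S2.4: row bounds
        current_idx = start                       # S2.5
        while current_idx < end:                  # S2.6: more out-edges?
            dst = edgeArray[current_idx]          # S2.8: destination node
            w = weightArray[current_idx]          # S2.8: out-edge weight
            dstData = dataArray[dst]              # S2.9: destination attribute
            yield (i, dst, srcData, dstData, w)
            current_idx += 1                      # S2.9: advance, back to S2.6

# Tiny 3-node CSR graph: node 0 -> {1, 2}, node 1 -> {2}, node 2 has no out-edges
offsetArray = [0, 2, 3, 3]
edgeArray   = [1, 2, 2]
weightArray = [5, 7, 1]
dataArray   = [10, 20, 30]
edges = list(prefetch_walk([0, 1], offsetArray, edgeArray, weightArray, dataArray))
# edges == [(0, 1, 10, 20, 5), (0, 2, 10, 30, 7), (1, 2, 20, 30, 1)]
```

The two nested loops correspond directly to the two loop levels of the flow chart: the outer loop over the work queue (S2.2/S2.3) and the inner loop over one node's out-edge range (S2.6 to S2.9).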
To verify the performance of the graph-computation-oriented data prefetcher in this embodiment, the prefetcher was evaluated on the ZSim simulator with a 4-core Skylake processor model, with the specific configuration shown in Table 1. The tests were performed on the 6 graph datasets shown in Table 2.
Table 1 simulator parameter configuration
Table 2 graph dataset
In this embodiment, 3 graph datasets from different fields, namely the Email-Eu-core, Wiki-Vote and Soc-Epinions datasets, were selected from the SNAP large-scale network dataset collection of Stanford University; each dataset has characteristics and complexity typical of its field, covering a variety of application scenarios and usage conditions. In addition, 3 automatically generated random graph datasets of different scales, namely the Graph1-Graph3 datasets, were designed to simulate various unknown or future graph structures, further strengthening the generality and robustness of the experiment.
To evaluate the performance of the data prefetcher on different graph algorithms, this embodiment uses 3 graph algorithms: the BFS (Breadth-First Search) algorithm, the SSSP (Single-Source Shortest Path) algorithm, and the SSWP (Single-Source Widest Path) algorithm. The BFS algorithm traverses the graph to obtain the depth of every node relative to the origin; it can be applied to friend relations and community discovery in social network analysis, and to state-space search in artificial intelligence, network link analysis, and other fields. The SSSP algorithm finds single-source shortest paths in a graph; in network routing, analyzing the topology to obtain the shortest paths from one node to the others enables efficient routing and data transmission, and in social network analysis the shortest paths obtained by the algorithm support relationship analysis and influence evaluation. The SSWP algorithm solves the single-source widest path problem; in network communication the widest path is the maximum-bandwidth path, so the algorithm can find the maximum-bandwidth path between two nodes, and in traffic planning, where width corresponds to the maximum traffic capacity of a road, it can find the maximum-capacity route between two places.
FIG. 8 shows the execution times of the 3 algorithms on the different datasets, normalized to a baseline system without an integrated data prefetcher; the smaller the value, the higher the performance. It can be seen that the data prefetcher in this embodiment improves performance by 31%, 41% and 36% on average for the BFS, SSSP and SSWP algorithms, respectively. Meanwhile, FIG. 9 shows the memory access instruction stall cycle ratios of the 3 algorithms on the different datasets, likewise normalized to the baseline system without an integrated prefetcher, where smaller is again better. The stall cycle ratios of all 3 algorithms are clearly reduced, with the memory access stalls of the BFS, SSSP and SSWP algorithms reduced by 95%, 66% and 43% on average, respectively. This shows that graph data accesses in graph computation severely stall pipeline execution and degrade overall performance, and that, because the data access pattern of graph computation is relatively fixed, a dedicated data prefetcher can be designed around that pattern. Compared with existing general-purpose data prefetching techniques, the graph-computation-oriented data prefetcher in this embodiment achieves better prefetching performance because its prefetching scheme is customized for graph data accesses. Addressing the shortcomings of existing data prefetching mechanisms, the data prefetcher in this embodiment is designed around the three key problems of data prefetching and, by combining software configuration with hardware prefetching, can be flexibly integrated into existing graph computation frameworks while remaining transparent to the user.
Existing graph computation is usually developed on top of an existing framework, and the added software configuration in this embodiment improves the efficiency of hardware prefetching without changing the original usage pattern. Experiments on various algorithms and datasets show that the graph-computation-oriented data prefetcher in this embodiment improves performance by 31%, 41% and 36% on average for the BFS, SSSP and SSWP algorithms, respectively.
In addition, this embodiment also provides a processor, comprising a processor body with a plurality of processor cores, wherein the processor body is provided with the above graph-computation-oriented data prefetcher. The processor may be a general-purpose processor or an acceleration processor for graph computation.
In addition, this embodiment also provides a computer device, comprising a processor and a memory connected to each other, wherein the processor is provided with the above graph-computation-oriented data prefetcher. The computer device may be a general-purpose computer device or a special-purpose computer device for graph computation.
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples; all technical solutions within the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations that do not depart from the principles of the present invention will occur to those skilled in the art, and such modifications are also to be regarded as within the protection scope of the present invention.

Claims (8)

1. A graph computation oriented data prefetcher, comprising a prefetch unit between each processor core and its level-2 cache L2, said prefetch unit comprising:
a configuration register set for storing software-configured data related to the graph computation data structures, including the work queue and the array addresses related to the compressed sparse row format;
a state register set for maintaining the intermediate index variables of the prefetch unit during the prefetch process and providing them to a control state machine to generate control signals, the intermediate index variables including a work queue index, an out-edge offset index, and a data cache full flag;
a control state machine for controlling execution of the prefetch pipeline, including suspending and resuming the prefetch pipeline, by reading the contents of the configuration register set and the state register set;
a prefetch pipeline for executing the stages of the graph data access prefetch logic in a pipelined manner, the operation of each stage comprising prefetch address calculation, prefetch request issue, and prefetch data reception;
a data cache for storing the data prefetched by the prefetch pipeline, the data cache maintaining the prefetched data at edge granularity, each entry containing all information related to one edge, the edge information being stored into the cache at the last stage of the pipeline;
wherein the pipelined execution of the graph data access prefetch logic comprises: stage 1, accessing the work queue LISTARRAY to prefetch a pending node n; stage 2, accessing the attribute array DATAARRAY and the offset array offsetArray to prefetch the source node attribute and offsets; stage 3, accessing the edge array EDGEARRAY and the weight array WEIGHTARRAY to prefetch the out-edge destination node and weight; and stage 4, accessing the attribute array DATAARRAY to prefetch the attribute data of the destination node.
2. The graph computation oriented data prefetcher of claim 1, wherein the work queue and compressed sparse row format related array addresses stored in the configuration register set include a work queue start address, a work queue element number, an offset data start address, an edge array start address, an edge weight array start address, and an attribute array start address.
3. The graph computation oriented data prefetcher of claim 2, wherein the work queue index, the out-edge offset index, and the data cache full flag include a column index idx, a current column index current_idx, a current row start offset start, a current row end offset end, and a data cache full flag full; the data cache full flag full serves as the condition for the control state machine to suspend and resume the prefetch pipeline, and if the data cache full flag full is true, the prefetch pipeline is suspended, otherwise the prefetch pipeline is resumed.
4. The graph computation oriented data prefetcher of claim 3, wherein the fields of each item of data in the data cache include a source node ID, a destination node ID, a source node attribute, a destination node attribute, and an out-edge weight.
5. The graph computation oriented data prefetcher of claim 4, wherein the data prefetcher provides a configuration interface function config(addr, data), a data fetch interface function fetch_edge(), and a cache empty state judgment interface function empty() for use by software, where addr is an address within the data prefetcher and data is the configuration data, and the software uses the data prefetcher as follows:
S1, the software configures the data prefetcher by calling the configuration interface function config(addr, data), where the configuration information includes the start addresses and sizes of the offset array offsetArray, the edge array EDGEARRAY, the weight array WEIGHTARRAY, and the attribute array DATAARRAY; the offset array offsetArray records the start and end offsets of each node's out-edges, the edge array EDGEARRAY records the IDs of out-edge destination nodes, the weight array WEIGHTARRAY records the out-edge weights, and the attribute array DATAARRAY records the node attribute data;
S2, the software configures the start address of the work queue LISTARRAY and the number of work queue elements len, whereupon the data prefetcher starts prefetching the graph data of all pending nodes in the work queue LISTARRAY, the graph data including a source node ID, a destination node ID, a source node attribute, a destination node attribute, and an out-edge weight, and stores the prefetched graph data into the data cache;
S3, the software obtains the data in the data cache by calling the data fetch interface function fetch_edge();
S4, the software performs the corresponding graph computation with the prefetched data and updates the destination node attribute;
S5, the software calls the cache empty state judgment interface function empty() to judge whether data remains to be processed in the data cache of the data prefetcher; if so, execution continues from step S3; otherwise, the procedure ends.
6. The graph computation oriented data prefetcher of claim 5, wherein the data prefetcher prefetching the graph data of all pending nodes in the work queue LISTARRAY in step S2 comprises:
S2.1, obtaining the configured start address of the work queue LISTARRAY and the number of work queue elements len, and initializing the column index idx to 0;
S2.2, judging whether the column index idx is smaller than the number of work queue elements len; if not, ending; otherwise, jumping to step S2.3;
S2.3, taking the index number i of the pending node corresponding to the column index idx out of the work queue LISTARRAY, incrementing the column index idx by 1, and proceeding to step S2.4, steps S2.2 and S2.3 repeating until the index numbers of all pending nodes have been produced;
S2.4, prefetching the node attribute srcData of the pending node according to the start address of the attribute array DATAARRAY and the index number i of the pending node, taking the offset corresponding to the index number i of the pending node out of the offset array offsetArray as the current row start offset start, and taking the offset corresponding to the index number i+1 of the next node as the current row end offset end;
S2.5, assigning the current row start offset start to the current column index current_idx;
S2.6, judging whether the current column index current_idx is smaller than the current row end offset end; if not, jumping to step S2.2; otherwise, proceeding to the next step;
S2.7, judging whether the data cache is full; if so, repeating step S2.7; otherwise, proceeding to the next step;
S2.8, taking the element corresponding to the current column index current_idx out of the edge array EDGEARRAY as the destination node of an out-edge of the pending node i, and taking the element corresponding to the current column index current_idx out of the weight array WEIGHTARRAY as the corresponding out-edge weight;
S2.9, reading the attribute data of the destination node of the out-edge of the pending node i from the attribute array DATAARRAY, incrementing the current column index current_idx by 1, and jumping to step S2.6.
7. A processor, comprising a processor body with a plurality of processor cores, wherein the processor body is provided with the graph computation oriented data prefetcher of any one of claims 1 to 6.
8. A computer device, comprising a processor and a memory connected to each other, wherein the processor is provided with the graph computation oriented data prefetcher of any one of claims 1 to 6.
CN202510527371.3A 2025-04-25 2025-04-25 Data prefetcher, processor and device for graph computing Active CN120066987B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510527371.3A CN120066987B (en) 2025-04-25 2025-04-25 Data prefetcher, processor and device for graph computing


Publications (2)

Publication Number Publication Date
CN120066987A CN120066987A (en) 2025-05-30
CN120066987B true CN120066987B (en) 2025-08-05

Family

ID=95789260

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510527371.3A Active CN120066987B (en) 2025-04-25 2025-04-25 Data prefetcher, processor and device for graph computing

Country Status (1)

Country Link
CN (1) CN120066987B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109461113A (en) * 2018-10-11 2019-03-12 中国人民解放军国防科技大学 A data structure-oriented graphics processor data prefetching method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6772179B2 (en) * 2001-12-28 2004-08-03 Lucent Technologies Inc. System and method for improving index performance through prefetching
US10157136B2 (en) * 2016-03-31 2018-12-18 Intel Corporation Pipelined prefetcher for parallel advancement of multiple data streams
US10909039B2 (en) * 2019-03-15 2021-02-02 Intel Corporation Data prefetching for graphics data processing
CN114756483A (en) * 2022-03-31 2022-07-15 深圳清华大学研究院 Subgraph segmentation optimization method based on inter-core storage access and application
CN118132464A (en) * 2024-01-29 2024-06-04 西安交通大学 Data prefetcher for coping with irregular access



Similar Documents

Publication Publication Date Title
Lin et al. Pagraph: Scaling gnn training on large graphs via computation-aware caching
Zhuo et al. Graphq: Scalable pim-based graph processing
Basak et al. Analysis and optimization of the memory hierarchy for graph processing workloads
Xiao et al. A load balancing inspired optimization framework for exascale multicore systems: A complex networks approach
Bai et al. Efficient data loader for fast sampling-based GNN training on large graphs
US10949200B2 (en) Methods and apparatus for executing data-dependent threads in parallel
US11520589B2 (en) Data structure-aware prefetching method and device on graphics processing unit
US20110066830A1 (en) Cache prefill on thread migration
CN110704360A (en) Graph calculation optimization method based on heterogeneous FPGA data flow
TW201631479A (en) Multiple data prefetchers that defer to one another based on prefetch effectiveness by memory access type
CN107844380B (en) Multi-core cache WCET analysis method supporting instruction prefetching
US11636122B2 (en) Method and apparatus for data mining from core traces
Zhou et al. Gas: A heterogeneous memory architecture for graph processing
Huber et al. Worst‐case execution time analysis‐driven object cache design
Shen et al. Statistical behavior guided block allocation in hybrid cache-based edge computing for cyber-physical-social systems
Falahati et al. Cross-core data sharing for energy-efficient GPUs
Li et al. PIM-WEAVER: A high energy-efficient, general-purpose acceleration architecture for string operations in big data processing
Subedi et al. Rise: Reducing i/o contention in staging-based extreme-scale in-situ workflows
WO2018076979A1 (en) Detection method and apparatus for data dependency between instructions
Keshtegar et al. Cluster‐based approach for improving graphics processing unit performance by inter streaming multiprocessors locality
Basak et al. Exploring core and cache hierarchy bottlenecks in graph processing workloads
Mao et al. PMGraph: Accelerating Concurrent Graph Queries over Streaming Graphs
CN108846248B (en) An application modeling and performance prediction method
Min Fine-grained memory access over I/O interconnect for efficient remote sparse data access

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant