
CN119169111A - Method for compressing and processing large-scale graph data in resource-constrained environment - Google Patents

Method for compressing and processing large-scale graph data in resource-constrained environment

Info

Publication number
CN119169111A
Authority
CN
China
Prior art keywords
subgraphs
graph data
initial
scale
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411196733.7A
Other languages
Chinese (zh)
Inventor
沈游人
许骞
杨娟
张峰
杜小勇
杨珂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Haizhi Xingtu Technology Co ltd
Original Assignee
Beijing Haizhi Xingtu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Haizhi Xingtu Technology Co ltd filed Critical Beijing Haizhi Xingtu Technology Co ltd
Priority to CN202411196733.7A priority Critical patent/CN119169111A/en
Publication of CN119169111A publication Critical patent/CN119169111A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The application belongs to the technical field of graph data processing, and particularly relates to a method for compressing and processing large-scale graph data in a resource-constrained environment. The method comprises: obtaining large-scale graph data; dividing the large-scale graph data to obtain a plurality of initial subgraphs; applying a parallel compression algorithm to the plurality of initial subgraphs to obtain a plurality of compressed target subgraphs; and recovering the plurality of compressed target subgraphs. Through memory-aware adaptive graph slicing, parallel rule-based subgraph compression, and a graph-representation recovery module, the method can effectively reduce the storage space required by graph data and optimize the efficiency of graph data processing, particularly in resource-constrained environments. The application not only helps advance large-scale graph data processing technology, but also provides a new solution for graph data applications in resource-constrained environments.

Description

Method for compressing and processing large-scale graph data in a resource-constrained environment
Technical Field
The application belongs to the technical field of graph data processing, and particularly relates to a method for compressing and processing large-scale graph data in a resource-limited environment.
Background
Graph data, as a data structure for expressing entities and their interrelationships, plays an important role in many fields such as social network analysis, bioinformatics, and traffic network planning. With the ever-increasing scale of data, conventional graph data processing techniques face significant challenges; especially in resource-constrained environments, how to efficiently store, compress, and process large-scale graph data is a research hotspot.
Among these, the major challenges of graph data are its complex structure and large data volume. Graph data contains not only a large number of vertices (nodes) and edges (relationships connecting the nodes); the relationships between vertices and edges also constitute a complex network structure. This structure means that storing and processing graph data requires substantial memory and computing resources. Furthermore, the dynamics of graph data, i.e., the change of graph structure over time, further increases the complexity of graph data processing.
Currently, existing graph data processing techniques focus mainly on compression and efficient computation of graph data. Graph data compression aims to reduce the storage space requirement of graph data by identifying and removing redundant information through various algorithms. However, many compression algorithms sacrifice data access speed and processing efficiency, or cannot effectively support the processing of dynamic graph data. Furthermore, existing graph processing frameworks often require loading the entire graph into memory, which is impractical for large-scale graph data, especially in resource-constrained environments. Environments with limited resources, such as embedded systems, mobile devices, and constrained cloud computing resources, place higher demands on graph data processing technology: not only does limited storage space become a limiting factor, but computing power and energy consumption are also key considerations. Therefore, minimizing resource consumption while ensuring the compression rate and processing efficiency of graph data has become an important direction for the development of graph data processing technology.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method for compressing and processing large-scale graph data in a resource-constrained environment.
The first aspect of the present invention provides a method for compressing and processing large-scale graph data in a resource-constrained environment, the method comprising:
Acquiring large-scale graph data;
dividing the large-scale graph data to obtain a plurality of initial subgraphs;
carrying out a parallel compression algorithm on the plurality of initial subgraphs to obtain a plurality of compressed target subgraphs;
and recovering the compressed multiple target subgraphs.
In some embodiments, the step of dividing the large-scale graph data to obtain a plurality of initial subgraphs includes:
Acquiring parameters and utilization rate of a memory;
And according to the parameters and the utilization rate of the memory, the size of the initial subgraph is adjusted by predicting the scale of a binary rule which is possibly generated, so that the sizes of the initial subgraph and an intermediate result are within the range of the available memory, wherein the intermediate result represents the size of the temporary occupied memory.
In some embodiments, the step of performing a parallel compression algorithm on the plurality of initial subgraphs to obtain a plurality of compressed target subgraphs includes:
scanning the input initial subgraph, and distributing sequence numbers to all possible binary rules;
Distributing serial numbers according to all possible binary rules, and distributing different threads to obtain all binary rules;
and splitting the initial subgraph, distributing different split initial subgraphs to different threads, and carrying out rule replacement to obtain a plurality of compressed target subgraphs.
In some embodiments, the step of splitting the initial subgraph, and distributing the split different initial subgraphs to different threads to perform rule replacement to obtain a plurality of compressed target subgraphs includes:
and merging the initial subgraphs compressed by the parallel rules to eliminate repeated binary rules among different initial subgraphs so as to simplify the representation of a plurality of compressed target subgraphs.
In some embodiments, the step of recovering the compressed target subgraphs further includes:
Merging the target subgraphs to obtain the large-scale graph data;
and carrying out breadth-first search and reordering on the large-scale graph data, and ordering the large-scale graph data in the node according to the sequence numbers of neighbors, wherein the node represents the vertex in the graph data.
The second aspect of the present invention provides a system for compressing and processing large-scale graph data in a resource-constrained environment, which is applied to the aforementioned method for compressing and processing large-scale graph data in a resource-constrained environment, and the system comprises:
An acquisition unit configured to acquire large-scale graph data;
The dividing unit is used for dividing the large-scale graph data to obtain a plurality of initial subgraphs;
the compression unit is used for carrying out a parallel compression algorithm on the plurality of initial subgraphs to obtain a plurality of compressed target subgraphs;
and the recovery unit is used for recovering the compressed multiple target subgraphs.
A third aspect of the invention provides a computer device comprising a memory storing a computer program and a processor implementing the steps of the aforementioned method when the processor executes the computer program.
A fourth aspect of the invention provides a computer storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method as described above.
A fifth aspect of the invention provides a computer program which, when executed by a processor, implements the steps of the aforementioned method.
Advantageous effects
The method for compressing and processing large-scale graph data in a resource-constrained environment first obtains the large-scale graph data and divides it to obtain a plurality of initial subgraphs, then applies a parallel compression algorithm to the plurality of initial subgraphs to obtain a plurality of compressed target subgraphs, and then recovers the plurality of compressed target subgraphs. Through memory-aware adaptive graph slicing, parallel rule-based subgraph compression, and a graph-representation recovery module, the method can effectively reduce the storage space required by graph data and optimize the efficiency of graph data processing, particularly in resource-constrained environments. The invention not only helps advance large-scale graph data processing technology, but also provides a new solution for graph data applications in resource-constrained environments.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments or the conventional techniques of the present application, the drawings required for the descriptions of the embodiments or the conventional techniques will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to the drawings without inventive effort for those skilled in the art.
FIG. 1 is a flow diagram of a method for compressing and processing large-scale graph data in a resource-constrained environment, in one embodiment.
Detailed Description
In order that the application may be readily understood, a more complete description of the application will be rendered by reference to the appended drawings. Embodiments of the application are illustrated in the accompanying drawings. This application may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
It will be understood that the terms first, second, etc. as used herein may be used to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish one element from another element.
The following terms are used to explain the present application in order to understand the present application:
Binary rules refer to rules that involve a relationship between two variables. In data mining, machine learning, and logical reasoning, binary rules are used to describe associations, causal relationships, or conditional dependencies between two variables.
Breadth First Search (BFS), a graph algorithm, is used to traverse or search nodes in a graph. It starts from a specific node in the graph, accesses its neighbor nodes layer by layer, and then continues to access the neighbor nodes of the next layer according to the breadth until all nodes are accessed.
PageRank is an algorithm for evaluating the importance and ranking of web pages.
The application provides a method for compressing and processing large-scale graph data in a resource-constrained environment, which divides the large-scale graph data to be read into a plurality of subgraphs based on the available memory, where each subgraph is represented using CSR (Compressed Sparse Row). Compression of each subgraph is accelerated by parallel rule-based replacement. After all subgraphs are compressed, the rule-compression results are merged, and the merged graph is reordered and coding-compressed, achieving a small memory footprint. The present application aims to provide an efficient graph compression approach that effectively reduces storage space occupation in as little time as possible.
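The patent names CSR as the per-subgraph storage format but gives no listing. As an illustrative sketch (the function name and edge-list input format are assumptions, not from the patent), a CSR representation can be built like this:

```python
def to_csr(num_nodes, edges):
    """Build a CSR (Compressed Sparse Row) representation of a graph.

    Returns (offsets, neighbors): the neighbors of node v are
    neighbors[offsets[v]:offsets[v + 1]].
    """
    # First pass: out-degree of every node.
    degree = [0] * num_nodes
    for src, _ in edges:
        degree[src] += 1
    # Prefix sums give the row offsets.
    offsets = [0] * (num_nodes + 1)
    for v in range(num_nodes):
        offsets[v + 1] = offsets[v] + degree[v]
    # Second pass: scatter destinations into the neighbor array.
    neighbors = [0] * len(edges)
    cursor = offsets[:-1]  # next free slot per row (copy via slice)
    for src, dst in edges:
        neighbors[cursor[src]] = dst
        cursor[src] += 1
    return offsets, neighbors
```

CSR stores the graph in two flat arrays, which is what makes the later per-subgraph scans and the final encoding compact and cache-friendly.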
As shown in fig. 1, a method for compressing and processing large-scale graph data in a resource-constrained environment includes:
S100, acquiring large-scale graph data.
Large-scale graph data arises as the data scale of graph data continuously grows. Graph data includes nodes and edges: nodes are the vertices of the graph, and edges are the relationships connecting the nodes. The relationships between vertices and edges form a complex network structure, and this complexity in turn demands a significant amount of memory and computing resources for storage and processing.
And S200, dividing the large-scale graph data to obtain a plurality of initial subgraphs.
Specifically, obtaining the plurality of initial subgraphs includes the steps of:
Acquiring parameters and utilization rate of a memory;
And according to the parameters and the utilization rate of the memory, the size of the initial subgraph is adjusted by predicting the scale of a binary rule which is possibly generated, so that the sizes of the initial subgraph and an intermediate result are within the range of the available memory, wherein the intermediate result represents the size of the temporary occupied memory.
It should be noted that the memory parameters and utilization can be determined from the system configuration, i.e., they are known conditions. With the memory parameters and utilization determined, the binary-rule scale is predicted, where the binary-rule scale refers to the degree of complexity of the binary rules. The size of the initial subgraph is then adjusted using a binary rule so that the sizes of the initial subgraph and the intermediate result are within the available memory range. Illustratively, the binary rule involves two variables: one is the size of the initial subgraph, and the other is the intermediate result that temporarily occupies memory. That is, if the size of the initial subgraph is A and the intermediate result is B, this may be written symbolically as A→B, meaning the size of the initial subgraph corresponds to the intermediate result of temporary memory occupancy. In this way, the binary rule can be used to adjust the size of the initial subgraph according to the resulting temporary memory occupancy, and the size can be adjusted multiple times so that each subgraph fits within the current memory capacity.
It should also be noted that the large-scale graph data is divided into a plurality of initial subgraphs, each of which contains a portion of the data; a graph partitioning algorithm (e.g., METIS) may be used to perform the division.
The invention dynamically divides the large-scale graph data into a plurality of subgraphs through a self-adaptive algorithm (binary rule) of memory perception. The process takes into account the limitations of available memory resources, and aims to ensure that the size of each sub-graph can be adapted to the current memory capacity, thereby avoiding excessive memory requirements during subsequent compression and processing. By the method, the memory can be effectively balanced for use, and performance bottlenecks caused by memory overflow are reduced.
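The memory-aware slicing of step S200 can be sketched minimally as follows. This is an assumption-laden illustration, not the patent's algorithm: the patent predicts the intermediate (binary-rule) scale adaptively, whereas here it is approximated by a fixed `rule_overhead` factor, and `bytes_per_edge` is a hypothetical cost model.

```python
def partition_by_memory(edges, mem_budget_bytes, bytes_per_edge=8, rule_overhead=2.0):
    """Split an edge list into subgraphs that fit a memory budget.

    rule_overhead is a pessimistic multiplier for the intermediate
    results (candidate binary rules) produced while compressing one
    subgraph; the patent predicts this scale adaptively, here it is
    an assumed constant.
    """
    max_edges = max(1, int(mem_budget_bytes / (bytes_per_edge * rule_overhead)))
    subgraphs, current = [], []
    for e in edges:
        current.append(e)
        if len(current) >= max_edges:   # subgraph + intermediates fill the budget
            subgraphs.append(current)
            current = []
    if current:
        subgraphs.append(current)
    return subgraphs
```

In practice the split would follow partition quality (e.g., METIS cuts) rather than raw edge order; the point here is only that the subgraph size is derived from the available memory, not fixed in advance.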
S300, carrying out a parallel compression algorithm on the plurality of initial subgraphs to obtain a plurality of compressed target subgraphs.
Specifically, a plurality of compressed target subgraphs are obtained, comprising the following steps:
scanning the input initial subgraph, and distributing sequence numbers to all possible binary rules;
Distributing serial numbers according to all possible binary rules, and distributing different threads to obtain all binary rules;
and splitting the initial subgraph, distributing different split initial subgraphs to different threads, and carrying out rule replacement to obtain a plurality of compressed target subgraphs.
After the initial subgraphs are obtained in the foregoing step, each initial subgraph is scanned and all possible binary rules are assigned unique sequence numbers. Then, according to the allocation requirements, different binary rules are assigned to different threads and processed in parallel to generate all binary rules.
The initial sub-graph is divided into a plurality of sub-graphs, and the sub-graphs are distributed to different threads for parallel processing. By dividing the initial subgraph into a plurality of smaller subgraphs and processing the subgraphs in parallel, the resources of the multi-core processor or the computing cluster can be fully utilized, and the computing efficiency is greatly improved. Each thread processes a sub-graph, and a plurality of threads work simultaneously, so that the processing time can be remarkably shortened. In addition, if the initial subgraph is very large, a single thread processing the entire initial subgraph may result in memory starvation or inefficiency. By splitting the initial subgraph, each thread only needs to process smaller data blocks, so that the memory burden of a single thread is reduced, and the memory utilization rate is improved. Better load balancing can be achieved by splitting the initial subgraph, the workload of each thread can be ensured to be approximately the same by dynamically distributing the subgraph to different threads, and overload or idling of certain threads is avoided. This helps to fully utilize computing resources and improve overall processing efficiency.
Further, the steps of splitting the initial subgraph, distributing different split initial subgraphs to different threads, and replacing rules to obtain a plurality of compressed target subgraphs further comprise:
and merging the initial subgraphs compressed by the parallel rules to eliminate repeated binary rules among different initial subgraphs so as to simplify the representation of a plurality of compressed target subgraphs.
It should be noted that, in each thread, rule substitution is performed on the allocated subgraph using the generated binary rules. Rule substitution can reduce complexity in the graph: for example, if multiple edges can be reduced to one edge by some rule, the substitution simplifies the graph. As previously mentioned, certain nodes or edges may be merged according to binary rules; for example, if there are multiple edges between two nodes, those edges may be merged into one according to the rules, or the nodes themselves may be merged. Rule substitution can also eliminate redundant information in the graph: duplicate edges or nodes may be deleted, retaining a unique rule representation. Illustratively, assume there is a social network graph in which nodes represent users and edges represent relationships between users. Binary rules are generated, such as "if user A and user B are both friends of user C, this can be reduced to user A and user B also being friends". Based on such rules, rule replacement operations may be performed. The core idea of rule replacement is to simplify and optimize graph data using binary rules generated in advance, reducing complexity and redundancy and thereby improving processing and storage efficiency. Rule replacement is an effective optimization strategy in parallel computing and large-scale data processing that helps to better manage and analyze data. Accordingly, each initial subgraph is processed and rule-compressed to generate the binary rules of that subgraph, where rule compression can be realized by deleting redundant rules, merging similar rules, and so on.
Thanks to the sub-slices in the previous step, the invention applies a parallel rule-based compression algorithm to each sub-graph in this step. This algorithm reduces redundancy of data by identifying and replacing duplicate structures (e.g., frequently occurring vertex pairs or patterns) in the subgraph. The parallel processing not only accelerates the compression process, but also ensures the efficiency and effect of the compression process by precisely controlling the compression ratio and the quality of the compressed data. In addition, the step greatly reduces the size of the graph data by eliminating the data redundancy, thereby reducing the resource consumption in the subsequent processing process.
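A minimal sketch of step S300's sequence-number assignment and parallel rule replacement follows. This is an illustration under stated assumptions, not the patent's implementation: binary rules are modeled Re-Pair-style as frequent adjacent pairs in each subgraph's adjacency sequence, rule ids are assumed to start above any node id (`FIRST_RULE_ID`), and replacement is a single left-to-right pass.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

FIRST_RULE_ID = 1_000_000  # assumption: rule ids live above any node id

def assign_rules(subgraph_seqs, min_count=2):
    """Scan all subgraphs once and give every adjacent pair occurring
    at least min_count times (counted globally) a unique sequence number."""
    pairs = Counter()
    for seq in subgraph_seqs:
        pairs.update(zip(seq, seq[1:]))
    rules = {}
    for pair, cnt in pairs.most_common():
        if cnt < min_count:
            break
        rules[pair] = FIRST_RULE_ID + len(rules)
    return rules

def replace_rules(seq, rules):
    """Left-to-right single pass: replace each known pair with its rule id."""
    out, i = [], 0
    while i < len(seq):
        pair = tuple(seq[i:i + 2])
        if len(pair) == 2 and pair in rules:
            out.append(rules[pair])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def compress_all(subgraph_seqs):
    """Assign global sequence numbers, then replace in parallel threads.
    Because ids are assigned globally before the parallel phase, a pair
    shared by several subgraphs already has one id, so no cross-subgraph
    rule deduplication is needed afterwards."""
    rules = assign_rules(subgraph_seqs)
    with ThreadPoolExecutor() as pool:
        compressed = list(pool.map(lambda s: replace_rules(s, rules), subgraph_seqs))
    return compressed, rules
```

The design choice of a global scan before the parallel phase mirrors the claim's ordering: sequence numbers first, thread distribution second, replacement third.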
S400, recovering the compressed multiple target subgraphs.
Specifically, recovering the compressed plurality of target subgraphs includes the following steps:
Merging the target subgraphs to obtain the large-scale graph data;
and carrying out breadth-first search and reordering on the large-scale graph data, and ordering the large-scale graph data in the node according to the sequence numbers of neighbors, wherein the node represents the vertex in the graph data.
It should be noted that merging the target subgraphs refers to merging the multiple compressed and processed subgraphs into an overall, larger graph structure. This process can be understood as bringing together all nodes and edges to form a complete graph containing all of the original subgraph information. Next, the large graph data is reordered: the entire graph may be traversed using a Breadth-First Search (BFS) algorithm that begins at a starting node, accesses its neighbor nodes layer by layer, and orders nodes according to the order of access. Inside each node, sorting according to the sequence numbers of neighbors means sorting the node's neighbor list so that neighbors are arranged in a fixed rule or index order. Such ordering may follow the order of node connections, node attributes, or other rules, enabling more efficient access to nodes and their relationships during subsequent analysis or processing. Merging the target subgraphs recovers the large-scale graph data and integrates all relevant information, while breadth-first-search reordering optimizes the structure and access pattern of the data, so that the graph data can be processed and analyzed efficiently at large scale.
The compression process, while reducing the data size, may also affect the structure and expressive power of the graph data, and thus the subsequent graph analysis tasks. Thus, in step S400 of the present invention, graph representation restoration is introduced, aiming at optimizing and restoring structural representations of compressed graph data. This includes re-ordering vertices to improve locality of data, thereby improving the speed and efficiency of graph data processing without significantly increasing the size of the data. Through the step, the method and the device can ensure that the graph data can still effectively support various graph analysis algorithms and tasks even after compression processing.
Examples
For the uk-2007-05 dataset, steps S100 to S400 preserve the expressiveness of the graph while achieving a compression ratio of 8.23, so that the compressed graph outperforms the original graph when executing graph algorithms such as PageRank.
The uk-2007-05 dataset is a web graph dataset, where "uk" denotes the United Kingdom and "2007-05" denotes the collection time of the dataset, i.e., May 2007.
The algorithm pseudocode provided by the embodiment of the invention comprises the following steps:
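The pseudocode listing is absent from this text. The following is an illustrative reconstruction from steps S100–S400 only, not the patent's original listing; procedure names such as Partition, AssignSequenceNumbers, ReplaceRules, BFSReorder, and Encode are hypothetical labels for the operations described above.

```
procedure CompressLargeGraph(G, mem_budget):
    // S200: memory-aware adaptive graph slicing
    subgraphs <- Partition(G) such that, for each subgraph S,
                 size(S) + predicted binary-rule intermediate size <= mem_budget
    // S300: parallel rule-based compression
    rules <- AssignSequenceNumbers(all candidate binary rules over subgraphs)
    parallel for each subgraph S:
        S' <- ReplaceRules(S, rules)
    rules <- DeduplicateAcrossSubgraphs(rules)
    // S400: graph representation recovery
    G' <- Merge(all compressed S')
    G' <- BFSReorder(G'); sort each neighbor list by sequence number
    return Encode(G')
```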
In summary, the method for compressing and processing the large-scale graph data in the resource-limited environment has the following advantages:
1. Significantly improves the compression rate and data processing efficiency
According to the invention, redundant information in the graph data is effectively identified and removed by the self-adaptive graph slicing and parallel rule-based sub-graph compression technology of memory perception, so that high compression rate is realized, and the storage space requirement of the graph data is greatly reduced. In addition, the graph representation recovery technology optimizes the compressed graph structure and improves the locality of the data, thereby accelerating the processing process of the graph data. The combination of compression and optimization of processing ensures the integrity of data and simultaneously remarkably improves the performance and efficiency of graph data processing.
2. Effectively alleviates the challenges of resource constraints
For the application scenes of mobile equipment, edge computing equipment and cloud computing resource limitation, the method and the device effectively relieve the challenge of resource limitation by reducing the storage of graph data and the requirement of computing resources. The memory-aware graph slicing technology ensures that the memory usage in the compression and processing process is always in a controllable range, avoids performance degradation or system breakdown caused by resource overrun, and improves the stability and reliability of graph data processing.
3. Reduces energy consumption and extends device battery life
In processing large scale map data on mobile and edge computing devices, energy consumption is a non-negligible factor. The invention obviously reduces the energy consumption by improving the efficiency of data processing and reducing the data transmission quantity, and is beneficial to prolonging the endurance time of equipment. The method has the advantages that the method is particularly important for application scenes with energy sensitivity, can support data analysis and processing tasks for a longer time, and improves user experience.
4. Enhances the flexibility and adaptability of graph data processing
The memory-aware adaptive graph slicing technology can dynamically adjust the processing strategy of the graph data according to the available resources, and enhances the flexibility and adaptability of graph data processing. The method and the device are suitable for environments with limited resources, and can fully utilize available resources under the condition of sufficient resources so as to realize more efficient graph data processing. In addition, the method has good expansibility, can adapt to graph data of different scales and types, and supports various graph analysis algorithms and application requirements.
The second aspect of the present invention provides a system for compressing and processing large-scale graph data in a resource-constrained environment, which is applied to the aforementioned method for compressing and processing large-scale graph data in a resource-constrained environment, and the system comprises:
An acquisition unit configured to acquire large-scale graph data;
The dividing unit is used for dividing the large-scale graph data to obtain a plurality of initial subgraphs;
the compression unit is used for carrying out a parallel compression algorithm on the plurality of initial subgraphs to obtain a plurality of compressed target subgraphs;
and the recovery unit is used for recovering the compressed multiple target subgraphs.
A third aspect of the invention provides a computer device comprising a memory storing a computer program and a processor implementing the steps of the aforementioned method when the processor executes the computer program.
A fourth aspect of the invention provides a computer storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method as described above.
A fifth aspect of the invention provides a computer program which, when executed by a processor, implements the steps of the aforementioned method.
In the description of the embodiments of the present invention, those skilled in the art will appreciate that the embodiments of the present invention may be implemented as a method, an apparatus, an electronic device, and a computer-readable storage medium. Accordingly, embodiments of the present invention may be embodied in the form of hardware entirely, software entirely (including firmware, resident software, micro-code, etc.), combinations of hardware and software. Furthermore, in some embodiments, embodiments of the invention may also be implemented in the form of a computer program product in one or more computer-readable storage media having computer program code embodied therein.
Any combination of one or more computer-readable storage media may be employed by the computer-readable storage media described above. The computer readable storage medium includes an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer readable storage medium include a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only Memory (ROM), an erasable programmable read-only Memory (EPROM), a Flash Memory (Flash Memory), an optical fiber, a compact disc read-only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any combination thereof. In embodiments of the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, device.
The computer program code embodied in the computer-readable storage medium may be transmitted using any appropriate medium, including wireless, wireline, optical fiber cable, radio frequency (RF), or any suitable combination of the foregoing.
Computer program code for carrying out operations for embodiments of the present invention may be written in assembly instructions, instruction set architecture (ISA) instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, integrated circuit configuration data, or in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, or C++, and conventional procedural programming languages such as the C language or similar programming languages. The computer program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the remote-computer scenario, the remote computer may be connected to the user's computer or to an external computer through any type of network, including a local area network (LAN) or a wide area network (WAN).
The embodiments of the invention describe a method, an apparatus, and an electronic device with reference to flowcharts and/or block diagrams.
It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions. These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The terms "first", "second", and the like in the description and claims of embodiments of the invention are used to distinguish between different objects, not to describe a particular order of objects. For example, a first target object and a second target object are used to distinguish between different target objects, not to describe a particular order of target objects.
In embodiments of the invention, words such as "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "such as" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of such words is intended to present related concepts in a concrete fashion.
In the description of the embodiments of the present invention, unless otherwise indicated, the meaning of "a plurality" means two or more. For example, a plurality of processing units refers to two or more processing units, and a plurality of systems refers to two or more systems.
It should be noted that the above embodiments are merely intended to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art will understand that the technical solutions described therein may still be modified, or some or all of their technical features may be equivalently replaced, and such modifications or substitutions do not cause the essence of the corresponding technical solution to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A method for compressing and processing large-scale graph data in a resource-constrained environment, the method comprising:
acquiring large-scale graph data;
dividing the large-scale graph data to obtain a plurality of initial subgraphs;
applying a parallel compression algorithm to the plurality of initial subgraphs to obtain a plurality of compressed target subgraphs;
and recovering the plurality of compressed target subgraphs.
2. The method for compressing and processing large-scale graph data in a resource-constrained environment according to claim 1, wherein the step of dividing the large-scale graph data to obtain a plurality of initial subgraphs comprises:
acquiring parameters and a utilization rate of a memory;
and, according to the parameters and the utilization rate of the memory, adjusting the size of the initial subgraphs by predicting the scale of binary rules that may be generated, so that the initial subgraphs and intermediate results fit within the available memory, wherein an intermediate result represents memory that is temporarily occupied.
3. The method for compressing and processing large-scale graph data in a resource-constrained environment according to claim 1, wherein the step of applying a parallel compression algorithm to the plurality of initial subgraphs to obtain a plurality of compressed target subgraphs comprises:
scanning the input initial subgraphs and assigning sequence numbers to all possible binary rules;
allocating different threads according to the sequence numbers assigned to all possible binary rules, to obtain all binary rules;
and splitting the initial subgraphs, distributing the split initial subgraphs to different threads, and performing rule replacement to obtain a plurality of compressed target subgraphs.
4. The method for compressing and processing large-scale graph data in a resource-constrained environment according to claim 1, wherein the steps of splitting the initial subgraphs, distributing the split initial subgraphs to different threads, and performing rule replacement to obtain a plurality of compressed target subgraphs comprise:
merging the initial subgraphs after parallel rule compression to eliminate duplicate binary rules among different initial subgraphs, thereby simplifying the representation of the plurality of compressed target subgraphs.
5. The method for compressing and processing large-scale graph data in a resource-constrained environment according to claim 1, wherein the step of recovering the plurality of compressed target subgraphs further comprises:
merging the target subgraphs to obtain the large-scale graph data;
and performing breadth-first search and reordering on the large-scale graph data, sorting each node's neighbor list by the neighbors' sequence numbers, wherein a node represents a vertex in the graph data.
6. A system for compressing and processing large-scale graph data in a resource-constrained environment, the system comprising:
an acquisition unit configured to acquire large-scale graph data;
The dividing unit is used for dividing the large-scale graph data to obtain a plurality of initial subgraphs;
the compression unit is used for carrying out a parallel compression algorithm on the plurality of initial subgraphs to obtain a plurality of compressed target subgraphs;
and the recovery unit is used for recovering the compressed multiple target subgraphs.
7. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 5 when executing the computer program.
8. A computer storage medium having stored thereon a computer program, which when executed by a processor realizes the steps of the method according to any of claims 1 to 5.
9. A computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any one of claims 1 to 5.
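The pipeline recited in claims 1 to 5 can be sketched as follows. This is an illustrative Python sketch, not code from the patent: the function names, the range-based partitioning, and the single-rule Re-Pair-style replacement are all assumptions made for demonstration. The memory-aware sizing of claim 2 and the cross-subgraph rule deduplication of claim 4 are omitted for brevity.

```python
# Hypothetical sketch of the claimed pipeline:
# 1. partition a large graph (adjacency lists) into initial subgraphs,
# 2. compress each subgraph in parallel by replacing a frequent adjacent
#    vertex pair (a "binary rule") with a rule symbol, Re-Pair style,
# 3. recover the original adjacency lists and reorder them by BFS (claim 5).
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partition(adj, num_parts):
    """Split an adjacency-list graph into roughly equal vertex ranges."""
    vertices = sorted(adj)
    size = max(1, len(vertices) // num_parts)
    return [{v: adj[v] for v in vertices[i:i + size]}
            for i in range(0, len(vertices), size)]

def compress_subgraph(sub):
    """One round of rule replacement: the most frequent adjacent pair in
    the subgraph's neighbour lists becomes the rule symbol 'R0'."""
    pairs = Counter()
    for nbrs in sub.values():
        pairs.update(zip(nbrs, nbrs[1:]))
    rules = {}
    if pairs:
        (a, b), freq = pairs.most_common(1)[0]
        if freq > 1:  # a rule only pays off if the pair repeats
            rules['R0'] = (a, b)
    compressed = {}
    for v, nbrs in sub.items():
        out, i = [], 0
        while i < len(nbrs):
            if i + 1 < len(nbrs) and rules.get('R0') == (nbrs[i], nbrs[i + 1]):
                out.append('R0')
                i += 2
            else:
                out.append(nbrs[i])
                i += 1
        compressed[v] = out
    return compressed, rules

def decompress(compressed, rules):
    """Expand rule symbols back into the original neighbour lists."""
    recovered = {}
    for v, seq in compressed.items():
        out = []
        for s in seq:
            if s in rules:
                out.extend(rules[s])
            else:
                out.append(s)
        recovered[v] = out
    return recovered

def bfs_reorder(adj, start):
    """Relabel reachable vertices in BFS order and sort each neighbour
    list by the new sequence numbers."""
    order, seen, queue = {}, {start}, [start]
    while queue:
        v = queue.pop(0)
        order[v] = len(order)
        for n in adj.get(v, []):
            if n not in seen:
                seen.add(n)
                queue.append(n)
    return {order[v]: sorted(order[n] for n in nbrs if n in order)
            for v, nbrs in adj.items() if v in order}

# Driver: partition, compress subgraphs in parallel threads, recover.
adj = {0: [1, 2, 1, 2], 1: [0, 2], 2: [0, 1], 3: [0]}
with ThreadPoolExecutor() as pool:
    results = list(pool.map(compress_subgraph, partition(adj, 2)))
recovered = {}
for compressed, rules in results:
    recovered.update(decompress(compressed, rules))
assert recovered == adj  # the round trip is lossless
```

In this toy version each subgraph keeps its own rule table and is decompressed independently before merging; the patent instead merges the compressed subgraphs and deduplicates rules across them, which matters once many subgraphs discover the same frequent pair.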
CN202411196733.7A 2024-08-29 2024-08-29 Method for compressing and processing large-scale graph data in resource-constrained environment Pending CN119169111A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411196733.7A CN119169111A (en) 2024-08-29 2024-08-29 Method for compressing and processing large-scale graph data in resource-constrained environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411196733.7A CN119169111A (en) 2024-08-29 2024-08-29 Method for compressing and processing large-scale graph data in resource-constrained environment

Publications (1)

Publication Number Publication Date
CN119169111A true CN119169111A (en) 2024-12-20

Family

ID=93881940

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411196733.7A Pending CN119169111A (en) 2024-08-29 2024-08-29 Method for compressing and processing large-scale graph data in resource-constrained environment

Country Status (1)

Country Link
CN (1) CN119169111A (en)

Similar Documents

Publication Publication Date Title
KR102164427B1 (en) Determining the order of execution of neural networks
US9626224B2 (en) Optimizing available computing resources within a virtual environment
US10268741B2 (en) Multi-nodal compression techniques for an in-memory database
CN113836116B (en) Data migration method, device, electronic device and readable storage medium
US12045734B2 (en) Optimizing gradient boosting feature selection
KR102535666B1 (en) Partitioning graph data for large scale graph processing
Sha et al. Gpu-based graph traversal on compressed graphs
WO2021247172A1 (en) Context modeling of occupancy coding for point cloud coding
CN113127430A (en) Mirror image information processing method and device, computer readable medium and electronic equipment
CN119150994A (en) Large language model reasoning optimization method and device, electronic equipment and storage medium
JP2023503034A (en) Pattern-based cache block compression
CN119169111A (en) Method for compressing and processing large-scale graph data in resource-constrained environment
CN113407343A (en) Service processing method, device and equipment based on resource allocation
JP2018132948A (en) Loading program, loading method, and information processing device
US11940998B2 (en) Database compression oriented to combinations of record fields
CN118245227A (en) A computing cluster task scheduling and load balancing method based on decision tree within time window
CN111143033A (en) Operation execution method and device based on scalable operating system
US20140280247A1 (en) Techniques for data retrieval in a distributed computing environment
US11573899B1 (en) Transparent interleaving of compressed cache lines
KR101744017B1 (en) Method and apparatus for indexing data for real time search
CN115934354A (en) Online storage method and device
CN112100446B (en) Search method, readable storage medium, and electronic device
CN114442940A (en) Data processing method, device, medium and electronic equipment
KR102803682B1 (en) Coded Distributed Computing Design for Graph MapReduce System
CN115543528A (en) Data processing method, related device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination