WO2024101537A1

WO2024101537A1 - Method and computer device for node importance-based graph augmentation preserving graph core structure

Info

Publication number: WO2024101537A1
Application number: PCT/KR2023/000127
Authority: WO
Inventors: 양은호; 박준형; 심하진
Original assignee: Korea Advanced Institute of Science and Technology KAIST
Current assignee: Korea Advanced Institute of Science and Technology KAIST
Priority date: 2022-11-08
Filing date: 2023-01-04
Publication date: 2024-05-16
Anticipated expiration: 2025-05-08
Also published as: KR20240066576A

Abstract

Disclosed are a method and computer device for node importance-based graph augmentation preserving a graph core structure. The graph data augmentation method augments graph data for training a graph neural network (GNN) and comprises the steps of: extracting subgraphs from two different graphs, respectively; and mixing the subgraphs to generate virtual graph data.

Description

Method and computer apparatus for node importance-based graph augmentation that preserves the core structure of the graph

아래의 설명은 GNN(graph neural network) 학습을 위한 그래프 데이터를 증강하는 기술에 관한 것이다.The explanation below is about technology for augmenting graph data for GNN (graph neural network) learning.

그래프 분류(graph classification)는 화학에서 소셜 데이터 분석에 이르기까지 다양한 분야에서 중요한 역할을 한다. 최근 몇 년 동안 GNN을 사용하여 그래프 분류 작업을 해결하려는 시도가 주목받고 있다.Graph classification plays an important role in a variety of fields, from chemistry to social data analysis. In recent years, attempts to solve graph classification tasks using GNNs have attracted attention.

딥 모델은 컴퓨터 비전과 같은 다양한 영역의 비정형 데이터에서 성공적으로 학습이 이루어진다. 그러나, 문제는 대부분의 딥 모델과 마찬가지로 GNN 또한 그래프 데이터를 일반화하기 위해 종종 너무 많은 데이터가 요구되어 실용적이지 않다는 점이다.Deep models are successfully learned from unstructured data in various areas such as computer vision. However, the problem is that, like most deep models, GNNs often require too much data to generalize graph data, making them impractical.

이러한 데이터 집약적인 신경망을 비용 측면에서 보다 효율적으로 훈련하기 위해 데이터 증강 기술이 사용되고 있다([1] Zhao, Tong, et al. "Data augmentation for graph neural networks." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. No. 12. 2021.). 기존 그래프 증강 방법은 각 그래프 데이터의 의미론을 고려하지 않고 엣지 조작과 같은 구조적 노이즈를 주입하는 것에 크게 의존하여 문맥 불가지론적 특성으로 인해 네트워크를 오도할 수 있다. 또한, 그래프 증강 전략은 대부분 노드 분류 작업에 중점을 두며 그래프 분류에 직접 적용하기 어려운 한계가 있다.Data augmentation technology is being used to train these data-intensive neural networks more cost-effectively ([1] Zhao, Tong, et al. "Data augmentation for graph neural networks." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. No. 12. 2021). Existing graph augmentation methods rely heavily on injecting structural noise, such as edge manipulation, without considering the semantics of each graph data, which can mislead the network due to its context-agnostic nature. Additionally, most graph augmentation strategies focus on node classification tasks and have limitations that make them difficult to apply directly to graph classification.

또 다른 옵션으로 이미지 분류, 문장 분류, 모델 견고성 등에서 우수한 성능을 보여 표준 데이터 증강 중 하나가 된 믹스업(mixup) 기술을 고려할 수 있다. 그러나, 기존의 믹스업은 이미지 데이터에 다소 특이적이거나 데이터 인스턴스가 동일한 차원 공간에 있어야 하기 때문에 일반적으로 노드 관련성이 없는 그래프 데이터의 경우 적합하지 않으며, 노드 수도 그래프 인스턴스마다 다를 수 있다.As another option, you can consider mixup technology, which has become one of the standard data augmentation techniques due to its excellent performance in image classification, sentence classification, and model robustness. However, existing mixups are generally not suitable for graph data without node correlation because they are somewhat specific to image data or require data instances to be in the same dimensional space, and the number of nodes may also vary for each graph instance.

서로 다른 두 그래프 데이터의 부분 그래프(subgraph)를 합성하는(mixing) 그래프 데이터 증강 방법론인 그래프 이식(graph transplant) 기술을 제공할 수 있다.It is possible to provide graph transplant technology, a graph data augmentation methodology that mixes subgraphs of two different graph data.

컴퓨터 장치에서 수행되는 그래프 데이터 증강 방법에 있어서, 상기 컴퓨터 장치는 메모리에 포함된 컴퓨터 판독가능한 명령들을 실행하도록 구성된 적어도 하나의 프로세서를 포함하고, 상기 그래프 데이터 증강 방법은, GNN(graph neural network) 학습을 위한 그래프 데이터를 증강하는 것으로, 상기 적어도 하나의 프로세서의 의해, 서로 다른 두 그래프에서 부분 그래프(subgraph)를 각각 추출하는 단계; 및 상기 적어도 하나의 프로세서의 의해, 상기 부분 그래프를 혼합하여(mixing) 가상 그래프 데이터를 생성하는 단계를 포함하는 그래프 데이터 증강 방법을 제공한다.In a graph data augmentation method performed on a computer device, the computer device includes at least one processor configured to execute computer-readable instructions included in a memory, and the graph data augmentation method includes learning a graph neural network (GNN). Augmenting graph data for: extracting, by the at least one processor, subgraphs from two different graphs; and generating virtual graph data by mixing the partial graph, by the at least one processor.

일 측면에 따르면, 상기 추출하는 단계는, 각 그래프 상의 노드 중요도(node saliency)를 이용하여 상기 부분 그래프를 추출할 수 있다.According to one aspect, the extracting step may extract the partial graph using node saliency on each graph.

다른 측면에 따르면, 상기 추출하는 단계는, 각 그래프에서 노드 피처(node feature)와 기울기(gradient)를 기초로 노드 중요도를 계산하는 단계; 및 상기 노드 중요도를 이용하여 각 그래프에서 k-홉(hop)의 그래프를 추출하는 단계를 포함할 수 있다.According to another aspect, the extracting step includes calculating node importance based on node features and gradient in each graph; and extracting a k-hop graph from each graph using the node importance.

또 다른 측면에 따르면, 상기 생성하는 단계는, 대상 그래프(destination graph)에서 추출된 제1 부분 그래프를 상기 대상 그래프에서 제거하는 단계; 및 상기 제1 부분 그래프가 제거된 상기 대상 그래프에 소스 그래프(source graph)에서 추출된 제2 부분 그래프를 이식하는(transplant) 단계를 포함할 수 있다.According to another aspect, the generating step includes removing a first partial graph extracted from a destination graph from the destination graph; and transplanting a second partial graph extracted from a source graph to the target graph from which the first partial graph has been removed.

또 다른 측면에 따르면, 상기 이식하는 단계는, 상기 대상 그래프에서 상기 제1 부분 그래프가 제거되고 남은 노드와 상기 제2 부분 그래프에서 샘플링된 특정 노드 사이에 새로운 엣지(edge)를 추가하는 단계를 포함할 수 있다.According to another aspect, the transplanting step includes adding a new edge between a node remaining after the first subgraph is removed from the target graph and a specific node sampled from the second subgraph. can do.

또 다른 측면에 따르면, 상기 특정 노드는 원본 그래프에서 노드의 예상되는 정도(expected degree)를 보존하는 DP(degree preservation)를 통해 상기 제2 부분 그래프에서 무작위로 샘플링될 수 있다.According to another aspect, the specific node may be randomly sampled from the second subgraph through degree preservation (DP), which preserves the expected degree of the node in the original graph.

또 다른 측면에 따르면, 상기 특정 노드는 노드 쌍의 피처(feature)를 이용하여 엣지 존재 확률(edge existence probability)을 예측하는 엣지 예측(edge prediction)을 통해 상기 제2 부분 그래프에서 샘플링될 수 있다.According to another aspect, the specific node may be sampled from the second partial graph through edge prediction, which predicts edge existence probability using features of node pairs.

또 다른 측면에 따르면, 상기 엣지 예측은 원본 그래프를 사용한 지도 학습(supervised learning)과 가상 그래프를 사용한 종단 간 방식(end-to-end manner)을 통해 훈련되는 미분 가능한 엣지 예측기(differentiable edge predictor)를 사용할 수 있다.According to another aspect, the edge prediction uses a differentiable edge predictor that is trained through supervised learning using the original graph and an end-to-end manner using a virtual graph. You can use it.

또 다른 측면에 따르면, 상기 그래프 데이터 증강 방법은, 상기 적어도 하나의 프로세서의 의해, 상기 제1 부분 그래프와 상기 제2 부분 그래프의 노드 중요도를 기초로 상기 가상 그래프 데이터의 레이블(label)을 결정하는 단계를 더 포함할 수 있다.According to another aspect, the graph data augmentation method determines, by the at least one processor, a label of the virtual graph data based on node importance of the first partial graph and the second partial graph. Additional steps may be included.

또 다른 측면에 따르면, 상기 결정하는 단계는, 상기 대상 그래프에서 상기 제1 부분 그래프가 제거되고 남은 노드와 상기 제2 부분 그래프에 포함된 노드 간의 상대적 중요도에 따라 상기 가상 그래프 데이터에 대한 레이블 가중치를 도출하는 단계를 포함할 수 있다.According to another aspect, the determining step includes determining a label weight for the virtual graph data according to the relative importance between the nodes remaining after the first partial graph is removed from the target graph and the nodes included in the second partial graph. It may include deriving steps.

컴퓨터 장치로 구현되는 그래프 데이터 증강 시스템에 있어서, 메모리에 포함된 컴퓨터 판독가능한 명령들을 실행하도록 구성된 적어도 하나의 프로세서를 포함하고, 상기 적어도 하나의 프로세서는, GNN(graph neural network) 학습을 위한 그래프 데이터를 증강하는 것으로, 서로 다른 두 그래프에서 부분 그래프(subgraph)를 각각 추출하는 과정; 및 상기 부분 그래프를 혼합하여(mixing) 가상 그래프 데이터를 생성하는 과정을 처리하는 그래프 데이터 증강 시스템을 제공한다.A graph data augmentation system implemented by a computer device, comprising at least one processor configured to execute computer-readable instructions included in a memory, wherein the at least one processor is configured to store graph data for learning a graph neural network (GNN). augmenting, the process of extracting subgraphs from two different graphs; and a graph data augmentation system that processes the process of generating virtual graph data by mixing the partial graph.

본 발명의 실시예에 따르면, 서로 다른 두 그래프 데이터의 부분 그래프를 합성하는 그래프 데이터 증강 기술을 통해 가상의 그래프 데이터를 생성함으로써 그래프 데이터를 효과적으로 증강시킬 수 있으며, 이를 통해 그래프 기반의 인공지능 기술을 획기적인 성능 향상을 기대할 수 있다.According to an embodiment of the present invention, graph data can be effectively augmented by generating virtual graph data through graph data augmentation technology that synthesizes partial graphs of two different graph data, and through this, graph-based artificial intelligence technology. You can expect groundbreaking performance improvements.

도 1은 본 발명의 일실시예에 있어서 컴퓨터 장치의 내부 구성의 일례를 설명하기 위한 블록도이다.1 is a block diagram for explaining an example of the internal configuration of a computer device according to an embodiment of the present invention.

도 2는 본 발명의 일실시예에 있어서 그래프 데이터 증강 방법을 도시한 순서도이다.Figure 2 is a flowchart showing a graph data augmentation method in one embodiment of the present invention.

도 3은 본 발명의 일실시예에 있어서 그래프 이식 과정의 예시를 도시한 것이다.Figure 3 shows an example of a graph transplant process in one embodiment of the present invention.

이하, 본 발명의 실시예를 첨부된 도면을 참조하여 상세하게 설명한다.Hereinafter, embodiments of the present invention will be described in detail with reference to the attached drawings.

본 발명의 실시예들은 GNN 학습을 위한 그래프 데이터를 증강하는 기술에 관한 것이다.Embodiments of the present invention relate to technology for augmenting graph data for GNN learning.

본 명세서에서 구체적으로 개시되는 것들을 포함하는 실시예들은 서로 다른 그래프 데이터를 이용하여 가상 그래프 데이터를 증강하는 기술을 제공할 수 있다.Embodiments including those specifically disclosed in this specification may provide technology for augmenting virtual graph data using different graph data.

GNN은 그래프 관련 문제에서 뛰어난 성능을 보여주고 있지만, GNN 학습 과정 또한 다른 도메인에서와 마찬가지로 충분한 레이블 데이터(labeled data)를 요구한다. 그래프 데이터의 레이블을 수집하는 과정은 많은 비용이 소요되고 현실적으로 어려워 최근 그래프 데이터 증강 기법이 주목받고 있다. 그래프 분류 문제의 데이터 증강 방법론들은 대부분 도메인 지식(domain knowledge)에 의존하여 하나의 그래프에 인위적으로 변형을 가하는 접근법에 의존하고 있다. 최근 믹스업 기반의 데이터 증강기법이 여러 도메인에서 놀라운 성능을 보여주고 있지만, 그래프 구조 데이터(graph-structured data)는 서로 다른 크기와 연결성으로 인해 믹스업 증강 기법을 적용하는 것에 어려움이 있다. 이러한 어려움을 해결하기 위해 본 발명에서는 서로 다른 두 그래프 데이터의 부분 그래프를 혼합하는 데이터 증강 기술을 제안한다.GNN shows excellent performance in graph-related problems, but the GNN learning process also requires sufficient labeled data as in other domains. The process of collecting labels for graph data is expensive and realistically difficult, and graph data augmentation techniques have recently been attracting attention. Most data augmentation methodologies for graph classification problems rely on approaches that artificially modify a single graph by relying on domain knowledge. Recently, mixup-based data augmentation techniques have shown remarkable performance in several domains, but it is difficult to apply mixup augmentation techniques to graph-structured data due to their different sizes and connectivity. To solve these difficulties, the present invention proposes a data augmentation technology that mixes partial graphs of two different graph data.

본 발명의 실시예들에 따른 그래프 데이터 증강 시스템은 적어도 하나의 컴퓨터 장치에 의해 구현될 수 있으며, 본 발명의 실시예들에 따른 그래프 데이터 증강 방법은 그래프 데이터 증강 시스템에 포함되는 적어도 하나의 컴퓨터 장치를 통해 수행될 수 있다. 이때, 컴퓨터 장치에는 본 발명의 일실시예에 따른 컴퓨터 프로그램이 설치 및 구동될 수 있고, 컴퓨터 장치는 구동된 컴퓨터 프로그램의 제어에 따라 본 발명의 실시예들에 따른 그래프 데이터 증강 방법을 수행할 수 있다. 상술한 컴퓨터 프로그램은 컴퓨터 장치와 결합되어 그래프 데이터 증강 방법을 컴퓨터에 실행시키기 위해 컴퓨터 판독 가능한 기록매체에 저장될 수 있다.The graph data enhancement system according to embodiments of the present invention may be implemented by at least one computer device, and the graph data enhancement method according to embodiments of the present invention may be implemented by at least one computer device included in the graph data enhancement system. It can be performed through . At this time, the computer program according to an embodiment of the present invention may be installed and driven in the computer device, and the computer device may perform the graph data augmentation method according to the embodiments of the present invention under the control of the driven computer program. there is. The above-described computer program can be combined with a computer device and stored in a computer-readable recording medium to execute the graph data augmentation method on the computer.

도 1은 본 발명의 일실시예에 따른 컴퓨터 장치의 예를 도시한 블록도이다. 예를 들어, 본 발명의 실시예들에 따른 그래프 데이터 증강 시스템은 도 1을 통해 도시된 컴퓨터 장치(100)에 의해 구현될 수 있다.1 is a block diagram showing an example of a computer device according to an embodiment of the present invention. For example, a graph data enhancement system according to embodiments of the present invention may be implemented by the computer device 100 shown in FIG. 1.

도 1에 도시된 바와 같이 컴퓨터 장치(100)는 본 발명의 실시예들에 따른 그래프 데이터 증강 방법을 실행하기 위한 구성요소로서, 메모리(110), 프로세서(120), 통신 인터페이스(130) 그리고 입출력 인터페이스(140)를 포함할 수 있다.As shown in FIG. 1, the computer device 100 is a component for executing the graph data augmentation method according to embodiments of the present invention, including a memory 110, a processor 120, a communication interface 130, and input/output. It may include an interface 140.

메모리(110)는 컴퓨터에서 판독 가능한 기록매체로서, RAM(random access memory), ROM(read only memory) 및 디스크 드라이브와 같은 비소멸성 대용량 기록장치(permanent mass storage device)를 포함할 수 있다. 여기서 ROM과 디스크 드라이브와 같은 비소멸성 대용량 기록장치는 메모리(110)와는 구분되는 별도의 영구 저장 장치로서 컴퓨터 장치(100)에 포함될 수도 있다. 또한, 메모리(110)에는 운영체제와 적어도 하나의 프로그램 코드가 저장될 수 있다. 이러한 소프트웨어 구성요소들은 메모리(110)와는 별도의 컴퓨터에서 판독 가능한 기록매체로부터 메모리(110)로 로딩될 수 있다. 이러한 별도의 컴퓨터에서 판독 가능한 기록매체는 플로피 드라이브, 디스크, 테이프, DVD/CD-ROM 드라이브, 메모리 카드 등의 컴퓨터에서 판독 가능한 기록매체를 포함할 수 있다. 다른 실시예에서 소프트웨어 구성요소들은 컴퓨터에서 판독 가능한 기록매체가 아닌 통신 인터페이스(130)를 통해 메모리(110)에 로딩될 수도 있다. 예를 들어, 소프트웨어 구성요소들은 네트워크(160)를 통해 수신되는 파일들에 의해 설치되는 컴퓨터 프로그램에 기반하여 컴퓨터 장치(100)의 메모리(110)에 로딩될 수 있다.The memory 110 is a computer-readable recording medium and may include a non-permanent mass storage device such as random access memory (RAM), read only memory (ROM), and a disk drive. Here, non-perishable large-capacity recording devices such as ROM and disk drives may be included in the computer device 100 as a separate permanent storage device that is distinct from the memory 110. Additionally, an operating system and at least one program code may be stored in the memory 110. These software components may be loaded into the memory 110 from a computer-readable recording medium separate from the memory 110. Such separate computer-readable recording media may include computer-readable recording media such as floppy drives, disks, tapes, DVD/CD-ROM drives, and memory cards. In another embodiment, software components may be loaded into the memory 110 through the communication interface 130 rather than a computer-readable recording medium. For example, software components may be loaded into memory 110 of computer device 100 based on computer programs installed by files received over network 160.

프로세서(120)는 기본적인 산술, 로직 및 입출력 연산을 수행함으로써, 컴퓨터 프로그램의 명령을 처리하도록 구성될 수 있다. 명령은 메모리(110) 또는 통신 인터페이스(130)에 의해 프로세서(120)로 제공될 수 있다. 예를 들어 프로세서(120)는 메모리(110)와 같은 기록 장치에 저장된 프로그램 코드에 따라 수신되는 명령을 실행하도록 구성될 수 있다.The processor 120 may be configured to process instructions of a computer program by performing basic arithmetic, logic, and input/output operations. Commands may be provided to the processor 120 by the memory 110 or the communication interface 130. For example, the processor 120 may be configured to execute received instructions according to program codes stored in a recording device such as memory 110.

통신 인터페이스(130)은 네트워크(160)를 통해 컴퓨터 장치(100)가 다른 장치와 서로 통신하기 위한 기능을 제공할 수 있다. 일례로, 컴퓨터 장치(100)의 프로세서(120)가 메모리(110)와 같은 기록 장치에 저장된 프로그램 코드에 따라 생성한 요청이나 명령, 데이터, 파일 등이 통신 인터페이스(130)의 제어에 따라 네트워크(160)를 통해 다른 장치들로 전달될 수 있다. 역으로, 다른 장치로부터의 신호나 명령, 데이터, 파일 등이 네트워크(160)를 거쳐 컴퓨터 장치(100)의 통신 인터페이스(130)를 통해 컴퓨터 장치(100)로 수신될 수 있다. 통신 인터페이스(130)를 통해 수신된 신호나 명령, 데이터 등은 프로세서(120)나 메모리(110)로 전달될 수 있고, 파일 등은 컴퓨터 장치(100)가 더 포함할 수 있는 저장 매체(상술한 영구 저장 장치)로 저장될 수 있다.The communication interface 130 may provide a function for the computer device 100 to communicate with other devices through the network 160. For example, a request, command, data, file, etc. generated by the processor 120 of the computer device 100 according to a program code stored in a recording device such as memory 110 is transmitted to the network ( 160) and can be transmitted to other devices. Conversely, signals, commands, data, files, etc. from other devices may be received by the computer device 100 through the communication interface 130 of the computer device 100 via the network 160. Signals, commands, data, etc. received through the communication interface 130 may be transmitted to the processor 120 or memory 110, and files, etc. may be stored in a storage medium (as described above) that the computer device 100 may further include. It can be stored as a permanent storage device).

통신 방식은 제한되지 않으며, 네트워크(160)가 포함할 수 있는 통신망(일례로, 이동통신망, 유선 인터넷, 무선 인터넷, 방송망)을 활용하는 통신 방식뿐만 아니라 기기들 간의 근거리 유선/무선 통신 역시 포함될 수 있다. 예를 들어, 네트워크(160)는, PAN(personal area network), LAN(local area network), CAN(campus area network), MAN(metropolitan area network), WAN(wide area network), BBN(broadband network), 인터넷 등의 네트워크 중 하나 이상의 임의의 네트워크를 포함할 수 있다. 또한, 네트워크(160)는 버스 네트워크, 스타 네트워크, 링 네트워크, 메쉬 네트워크, 스타-버스 네트워크, 트리 또는 계층적(hierarchical) 네트워크 등을 포함하는 네트워크 토폴로지 중 임의의 하나 이상을 포함할 수 있으나, 이에 제한되지 않는다.The communication method is not limited, and may include not only a communication method utilizing communication networks that the network 160 may include (e.g., mobile communication network, wired Internet, wireless Internet, and broadcasting network), but also short-distance wired/wireless communication between devices. there is. For example, the network 160 may include a personal area network (PAN), a local area network (LAN), a campus area network (CAN), a metropolitan area network (MAN), a wide area network (WAN), and a broadband network (BBN). , may include one or more arbitrary networks such as the Internet. Additionally, the network 160 may include any one or more of network topologies including a bus network, star network, ring network, mesh network, star-bus network, tree or hierarchical network, etc. Not limited.

입출력 인터페이스(140)는 입출력 장치(150)와의 인터페이스를 위한 수단일 수 있다. 예를 들어, 입력 장치는 마이크, 키보드, 카메라 또는 마우스 등의 장치를, 그리고 출력 장치는 디스플레이, 스피커와 같은 장치를 포함할 수 있다. 다른 예로 입출력 인터페이스(140)는 터치스크린과 같이 입력과 출력을 위한 기능이 하나로 통합된 장치와의 인터페이스를 위한 수단일 수도 있다. 입출력 장치(150)는 컴퓨터 장치(100)와 하나의 장치로 구성될 수도 있다.The input/output interface 140 may be a means for interfacing with the input/output device 150. For example, input devices may include devices such as a microphone, keyboard, camera, or mouse, and output devices may include devices such as displays and speakers. As another example, the input/output interface 140 may be a means for interfacing with a device that integrates input and output functions, such as a touch screen. The input/output device 150 may be configured as a single device with the computer device 100.

또한, 다른 실시예들에서 컴퓨터 장치(100)는 도 1의 구성요소들보다 더 적은 혹은 더 많은 구성요소들을 포함할 수도 있다. 그러나, 대부분의 종래기술적 구성요소들을 명확하게 도시할 필요성은 없다. 예를 들어, 컴퓨터 장치(100)는 상술한 입출력 장치(150) 중 적어도 일부를 포함하도록 구현되거나 또는 트랜시버(transceiver), 카메라, 각종 센서, 데이터베이스 등과 같은 다른 구성요소들을 더 포함할 수도 있다.Additionally, in other embodiments, computer device 100 may include fewer or more components than those of FIG. 1 . However, there is no need to clearly show most prior art components. For example, the computer device 100 may be implemented to include at least a portion of the input/output device 150 described above, or may further include other components such as a transceiver, a camera, various sensors, and a database.

이하에서는 그래프 핵심 구조를 보존하는 노드 중요도 기반 그래프 증강 기술의 구체적인 실시예를 설명하기로 한다.Below, we will describe a specific example of a node importance-based graph augmentation technology that preserves the core structure of the graph.

본 실시예에 따른 컴퓨터 장치(100)에는 컴퓨터로 구현된 그래프 데이터 증강 시스템이 구성될 수 있다. 컴퓨터 장치(100)의 프로세서(120)는 이하의 그래프 데이터 증강 방법을 수행하기 위한 구성요소로 구현될 수 있다. 실시예에 따라 프로세서(120)의 구성요소들은 선택적으로 프로세서(120)에 포함되거나 제외될 수도 있다. 또한, 실시예에 따라 프로세서(120)의 구성요소들은 프로세서(120)의 기능의 표현을 위해 분리 또는 병합될 수도 있다.The computer device 100 according to this embodiment may be configured with a graph data augmentation system implemented on a computer. The processor 120 of the computer device 100 may be implemented as a component for performing the following graph data augmentation method. Depending on the embodiment, components of the processor 120 may be selectively included in or excluded from the processor 120. Additionally, depending on the embodiment, components of the processor 120 may be separated or merged to express the functions of the processor 120.

이러한 프로세서(120) 및 프로세서(120)의 구성요소들은 이하의 그래프 데이터 증강 방법이 포함하는 단계들을 수행하도록 컴퓨터 장치(100)를 제어할 수 있다. 예를 들어, 프로세서(120) 및 프로세서(120)의 구성요소들은 메모리(110)가 포함하는 운영체제의 코드와 적어도 하나의 프로그램의 코드에 따른 명령(instruction)을 실행하도록 구현될 수 있다.The processor 120 and the components of the processor 120 can control the computer device 100 to perform the steps included in the graph data augmentation method below. For example, the processor 120 and its components may be implemented to execute instructions according to the code of an operating system included in the memory 110 and the code of at least one program.

여기서, 프로세서(120)의 구성요소들은 컴퓨터 장치(100)에 저장된 프로그램 코드가 제공하는 명령에 따라 프로세서(120)에 의해 수행되는 서로 다른 기능들(different functions)의 표현들일 수 있다.Here, the components of the processor 120 may be expressions of different functions performed by the processor 120 according to instructions provided by program codes stored in the computer device 100.

프로세서(120)는 컴퓨터 장치(100)의 제어와 관련된 명령이 로딩된 메모리(110)로부터 필요한 명령을 읽어들일 수 있다. 이 경우, 상기 읽어들인 명령은 프로세서(120)가 이후 설명될 그래프 데이터 증강 방법을 실행하도록 제어하기 위한 명령을 포함할 수 있다.The processor 120 may read necessary instructions from the memory 110 where instructions related to controlling the computer device 100 are loaded. In this case, the read command may include an command for controlling the processor 120 to execute a graph data augmentation method that will be described later.

본 실시예들은 그래프 기하학에 관계없이 다중 도메인 데이터셋에 사용할 수 있는 그래프 분류 작업을 위한 새로운 믹스업 기반 데이터 증강 기술(이하, '그래프 이식 기술'이라 칭함)에 관한 것이다. 노드 관련성이 없고 노드 수가 다른 그래프 인스턴스를 혼합하는 것을 이해하기 위해, 그래프 이식 기술은 각 대상 및 소스 인스턴스에서 부분 그래프를 선택하고 부분 그래프가 제거된 대상에 소스의 부분 그래프를 이식한다. 부분 그래프를 한 인스턴스에서 다른 인스턴스로 이식하면 부분 그래프의 원본 내용(original charge)이 레이블에 반영되지 않을 경우 최종 제품이 해당 보간 레이블과 올바르게 정렬되지 않을 수 있다. 이 문제를 해결하기 위해, 본 실시예에서는 부분 그래프를 선택하고 레이블을 할당하기 위해 노드 중요도(node saliency)를 활용할 수 있다. 특히, 노드 중요도를 그래프 컨볼루션 계층의 출력에 대한 분류 손실의 기울기(gradient)의 'l2-놈(norm)'으로 정의한다. 이는 모델이 그래프를 분류할 때 노드의 기여도를 측정하는 척도로 볼 수 있다. 또한, 혼합 그래프의 레이블은 소스 부분 그래프와 대상 부분 그래프 사이의 중요도 비율로 결정될 수 있다. 부분 그래프에서 노드 수의 비율을 순수하게 사용하는 대신, 중요도 기반 레이블 혼합은 노드의 중요도에 따라 노드에 가중치를 부여하여 혼합 그래프를 적절하게 설명한다. 그래프를 혼합하는 데 있어 또 다른 중요한 과제는 두 개의 다른 인스턴스에서 얻은 부분 그래프 사이의 연결을 결정하는 것이다. 이를 위해, 1) 원본 노드 정도에 대한 제약 하에 무작위 연결, 2) 노드 피처를 기반으로 한 엣지 예측이라는 두 가지 변형을 적용할 수 있다.These embodiments relate to a new mixup-based data augmentation technology (hereinafter referred to as 'graph transplant technology') for graph classification tasks that can be used on multi-domain datasets regardless of graph geometry. To understand the mixing of graph instances with unrelated nodes and different numbers of nodes, the graph transplant technique selects a subgraph from each target and source instance and transplants the subgraph from the source to the target from which the subgraph has been removed. When porting a subgraph from one instance to another, the final product may not align correctly with the corresponding interpolated labels if the original charge of the subgraph is not reflected in the labels. To solve this problem, this embodiment can utilize node saliency to select subgraphs and assign labels. In particular, node importance is defined as the 'l2-norm' of the gradient of the classification loss for the output of the graph convolution layer. This can be seen as a measure of the contribution of a node when the model classifies the graph. Additionally, the label of a mixed graph can be determined by the importance ratio between the source subgraph and the target subgraph. Instead of purely using the ratio of the number of nodes in a subgraph, importance-based label mixing appropriately describes mixed graphs by weighting nodes according to their importance. Another important challenge in blending graphs is determining the connections between subgraphs obtained from two different instances. For this purpose, two transformations can be applied: 1) random connection with constraints on the original node degree, and 2) edge prediction based on node features.

다시 말해, 본 발명의 주요 내용은 다음과 같다.In other words, the main content of the present invention is as follows.

(1) 로컬 구조를 유지하면서 대상 부분 그래프를 소스 부분 그래프로 대체하여 두 개의 서로 다른 구조의 그래프를 혼합할 수 있는 믹스업 기반 그래프 증강 방법을 사용한다.(1) We use a mixup-based graph augmentation method that can mix graphs with two different structures by replacing the target subgraph with the source subgraph while maintaining the local structure.

(2) 노이즈가 많은 샘플을 생성할 위험이 있는 무작위 증강과 대조적으로, 본 실시예에서는 노드 중요도를 활용하여 해당 레이블을 준수하는 적응적으로 할당된 레이블을 사용하여 적절히 혼합된 그래프를 생성하는 새로운 전략을 사용한다.(2) In contrast to random augmentation, which runs the risk of generating noisy samples, in this embodiment a novel method that leverages node importance to generate a properly mixed graph using adaptively assigned labels that adhere to those labels. Use strategy.

도 2는 본 발명의 일실시예에 있어서 그래프 데이터 증강 방법의 일례를 설명하기 위한 순서도이다.Figure 2 is a flowchart for explaining an example of a graph data augmentation method in an embodiment of the present invention.

도 2를 참조하면, 단계(S210)에서 프로세서(120)는 서로 다른 두 그래프를 대상으로 노드 중요도를 이용하여 각 그래프에서 부분 그래프를 추출할 수 있다. 본 실시예에서는 그래프의 유의미한 정보(class-specific information)를 담고 있는 부분 그래프들을 추출하기 위해 노드의 중요성을 나타내는 'node saliency'를 손실에 대한 각 노드의 기울기로 정의할 수 있다. 프로세서(120)는 노드 중요도를 활용하여 각 그래프에서 중요한 k-홉(hop)의 부분 그래프를 찾아 추출할 수 있다.Referring to FIG. 2, in step S210, the processor 120 may extract a partial graph from each graph using node importance for two different graphs. In this embodiment, in order to extract partial graphs containing meaningful information (class-specific information) of the graph, 'node saliency', which indicates the importance of a node, can be defined as the slope of each node with respect to loss. The processor 120 can use node importance to find and extract a subgraph of important k-hops from each graph.

단계(S220)에서 프로세서(120)는 서로 다른 두 그래프에서 추출된 부분 그래프를 연결시켜 가상 그래프 데이터로 생성할 수 있다. 서로 다른 두 그래프에서 각각 추출된 부분 그래프들은 학습 가능한 예측기로서 미분 가능한 엣지 예측기(differentiable edge predictor)에 의해 연결되어 새로운 믹스 그래프로 합성될 수 있다.In step S220, the processor 120 may generate virtual graph data by connecting partial graphs extracted from two different graphs. Partial graphs extracted from two different graphs can be connected by a differentiable edge predictor as a learnable predictor and synthesized into a new mix graph.

단계(S230)에서 프로세서(120)는 단계(S220)에서 생성된 가상 그래프 데이터에 대해 해당 부분 그래프의 중요도를 기초로 레이블을 결정하여 GNN 학습을 위한 레이블 데이터로 생성할 수 있다. 가상 그래프 데이터의 레이블은 각 부분 그래프가 담고 있는 상대적 중요도에 의해 결정될 수 있으며, 이는 레이블과 가상 그래프 데이터의 의미적 불일치를 효과적으로 방지할 수 있다.In step S230, the processor 120 may determine a label for the virtual graph data generated in step S220 based on the importance of the corresponding partial graph and generate label data for GNN learning. The label of virtual graph data can be determined by the relative importance contained in each subgraph, which can effectively prevent semantic inconsistency between the label and virtual graph data.

먼저, 프릴리미너리(preliminary)를 설명한다.First, the preliminary is explained.

그래프 분류(Graph classification)Graph classification

G=(V,E)를 무방향 그래프로 정의하자. 여기서, V는 노드 세트를 의미하고, E는 엣지 세트를 의미한다.

는 노드 i에 대한 d차원 노드 피처 벡터가 i번째 항목인 피처 행렬이다. N(v)는 엣지로 v에 연결된 노드 집합

이다. 그래프 C-클래스 분류 작업의 경우 각 데이터 포인트(G,y)는 그래프 G와 해당 레이블 y∈{1,…,C}로 구성된다. 목표는 그래프 G를 가져와서 각 클래스에 대한 확률을 예측하는 GNN 모델을 훈련시키는 것이다.Let G=(V,E) be defined as an undirected graph. Here, V means a node set and E means an edge set.

is a feature matrix where the d-dimensional node feature vector for node i is the i-th item. N(v) is the set of nodes connected to v by edges

am. For a graph C-class classification task, each data point (G,y) is a graph G and its label y∈{1,… It consists of ,C}. The goal is to take a graph G and train a GNN model that predicts probabilities for each class.

그래프 분류를 의한 GNN(graph neural networks for graph classification)GNN (graph neural networks for graph classification)

본 실시예에서 그래프 분류의 그래프 신경망 분류에 대한 그래프 신경망은 다음과 같은 구성 요소를 고려해야 한다: 1) 메시지 함수 m_l, 2) 요소별 평균, 합계나 최대와 같은 순열 불변 메시지 집계 함수 Φ_l, 3) 노드 업데이트 함수 h_l, 4) 순열-불변과 노드 임베딩을 함께 결합하여 끝에서 그래프 표현을 얻는 읽기 함수

.In this embodiment, the graph neural network for graph classification must consider the following components: 1) message function m _l , 2) permutation invariant message aggregation function Φ _l , such as mean, sum or maximum for each element, 3) node update function h _l , 4) read function that combines permutation-invariant and node embedding together to obtain graph representation at the end

.

이 계층 l에서 노드 v의 잠재 벡터라고 하자. GNN의 재귀적 정의를 사용하여 표기법을 단순화하기 위해

을 사용하여 입력 노드 피처를 나타낸다. 각 GNN 계층에서 노드 피처는

로 업데이트되고, 마지막 L번째 계층 이후에는 모든 노드 피처 벡터가 읽기 함수

로 결합된다. x_G를 분류기 f에 전달함으로써 그래프 G에 대한 예측

를 얻는다. 여기서,

이다. 대표적인 예로, GCN(graph convolutional network) 계층은

로 정의된다. 여기서, 엣지 가중치가 e_v,u이고

이다.

Let be the latent vector of node v in this layer l. To simplify the notation we use the recursive definition of GNN.

Use to represent input node features. In each GNN layer, the node features are

, and after the last Lth layer, all node feature vectors are updated with the read function

are combined with Prediction on graph G by passing x _G to classifier f

get here,

am. As a representative example, the GCN (graph convolutional network) layer is

It is defined as Here, the edge weight is e _v,u and

am.

믹스업과 그 변형(Mixup and its variants)Mixup and its variants

믹스업은 분류 문제에 대한 예제 쌍과 레이블을 결합하여 가상 데이터를 생성하는 보간 기반 정규화 기술이다. x∈X를 입력 벡터로, y∈Y를 데이터 분포 D에서의 원핫 인코딩된 C-클래스 레이블이라 하자. 혼합 기반 증강 방법은 혼합 비율 분포 q가 주어지면 다음과 같이 손실 L을 최적화한다.Mixup is an interpolation-based normalization technique that generates virtual data by combining example pairs and labels for classification problems. Let x∈X be the input vector and y∈Y be the one-hot encoded C-class label in the data distribution D. The mixture-based augmentation method optimizes the loss L as follows, given the mixture ratio distribution q:

[수학식 1][Equation 1]

여기서, g와 l은 각각 데이터 및 레이블 믹스업 함수이다. 이러한 함수는 베타 분포 Beta(α,α)에서 샘플링된 혼합 파라미터와 함께

와

로 정의된다. 또 다른 예로, 혼합 함수를

로 정의하며, 여기서

는 이진 마스크

에 대한 요소별 아다마르(Hadamard) 곱을 나타내며(

),

이고, 여기서 W와 H는 이미지의 폭과 높이이다.Here, g and l are the data and label mixup functions, respectively. These functions have mixing parameters sampled from the beta distribution Beta(α,α)

and

It is defined as As another example, a mixed function

It is defined as, where

is the binary mask

Represents the element-wise Hadamard product for (

),

, where W and H are the width and height of the image.

이하, 그래프 이식 기술을 구체적으로 설명하기로 한다.Hereinafter, the graph transplant technology will be described in detail.

그래프 이식 기술은 그래프 인스턴스의 맥락을 고려하여 의미 있는 혼합 그래프를 생성하는 그래프 분류를 위한 효과적인 증강 기술이다. 도 3은 본 발명에 따른 그래프 이식 과정의 예시를 도시한 것이다.Graph transplantation technology is an effective augmentation technique for graph classification that creates meaningful mixed graphs by considering the context of graph instances. Figure 3 shows an example of a graph transplant process according to the present invention.

도 3을 참조하면, 먼저, 프로세서(120)는 소스 그래프 G_π와 대상 그래프 G의 노드 중요도 S_π, S를 계산한다.Referring to FIG. 3, first, the processor 120 calculates the node importance S _π and S of the source graph G _π and the target graph G.

다음, 프로세서(120)는 주요 노드

와 임의 노드

에 의해 고정된(anchored) 각 그래프에서 k-홉의 부분 그래프를 추출할 수 있다.Next, the processor 120 is the main node

and random nodes

A partial graph of k-hops can be extracted from each graph anchored by .

다음, 프로세서(120)는 노드 중요도 S_π, S를 기초로 혼합 레이블 y'을 결정할 수 있다.Next, the processor 120 may determine the mixed label y' based on the node importance S _π and S.

마지막으로, 프로세서(120)는 소스 측 부분 그래프

를 대상 부분 그래프로 이식할 수 있다.Finally, the processor 120 generates the source-side partial graph

can be transplanted to the target subgraph.

단계 별 세부 과정은 다음과 같다.The detailed process for each step is as follows.

GNN을 위한 노드 중요도(Node Saliency for GNNs)Node Saliency for GNNs

그래프 이식은 부분 그래프의 중요성을 계산하는 특정 방법에 의존하지 않으며 다양한 접근 방식과 결합할 수 있다. 훈련된 모델이 주어진 중요한 영역을 찾는 설명 가능성의 일반적인 목적과 달리, 본 실시예에서는 GNN이 훈련되는 동안 동적으로 부분 그래프를 찾고 혼합하는 것이다. 따라서, 노드 피처와 기울기를 사용하여 노드의 중요성을 간단하고 빠르게 찾을 수 있는 접근 방식을 우선적으로 고려한다.Graph porting does not depend on any specific method of calculating the importance of a subgraph and can be combined with a variety of approaches. Unlike the general goal of explainability, which is finding important regions given a trained model, in this embodiment the goal is to dynamically find and blend subgraphs while the GNN is being trained. Therefore, we prioritize an approach that can simply and quickly find the importance of nodes using node features and gradients.

GNN l번째 계층에서 얻은 잠재 노드 피처 행렬 X^(l)을 고려하여, 먼저 X^(l),

에 대한 분류 손실

의 기울기를 계산한다. 기울기는 원본 그래프에 대한 훈련 단계의 부산물이기 때문에 거의 비용 없이 얻을 수 있다. 그렇다면, 노드 v의 중요도 s_v는 노드

에 해당하는 기울기의 l2-놈으로 도출된다. l번째 계층에서 컴퓨팅 기울기를 선택하는 것은 각 노드에 대한 l-홉 이웃의 높은 수준의 공간 정보를 캡처하는 것이다. 따라서, 최종 노드 중요도 벡터

를 네트워크 아키텍처 수정 없이 얻을 수 있다.Considering the latent node feature matrix X ^(l) ^obtained from the GNN lth layer, first

Classification loss for

Calculate the slope of . Since the gradient is a by-product of the training step on the original graph, it can be obtained at almost no cost. Then, the importance s _v of node v is node

It is derived as the l2-norm of the slope corresponding to . Choosing a computing gradient at the lth layer is to capture high-level spatial information of the l-hop neighbors for each node. Therefore, the final node importance vector

can be obtained without modifying the network architecture.

그래프 혼합(Graph Mixing)Graph Mixing

주요 부분 그래프 선택(Salient subgraph selection): 모든 반복에 대해 N 그래프 배치(G₁,…,G_N)를 고려한다. 서로 다른 두 그래프 G_πi와 G_i를 쌍으로 구성하는데, 여기서, π는 range(N) = [1,…,N]의 셔플 인덱스이다. 설명의 간결성을 위해 지수 i를 제외하고, G_π와 G는 각각 소스 그래프와 대상 그래프를 의미한다.Salient subgraph selection: Consider N graph arrangements (G ₁ ,…,G _N ) for every iteration. Two different graphs G _πi and G _i are formed as a pair, where π is range(N) = [1,… ,N] is the shuffle index. Excluding the index i for brevity of explanation, G _π and G refer to the source graph and target graph, respectively.

소스 G_π에서 부분 그래프

를 구성하려면, 먼저 노드

의 중요도 R%를 선택하여 클래스 의미론을 포함하는 중요한 부분을 찾는다. 노드 중요도는 정의에 의해 l-홉 부분 그래프의 중요성을 나타낸다. 전체 훈련 단계에서 동일한 노드를 반복적으로 선택하는 것을 피하기 위해 상위-2R% 주요 노드에서 무작위로 균일하게

를 확률적으로 샘플링한다. 선택된 앵커

의 집합이 주어지면, 각 이동 단계에 대해 인접 이웃의 p%를 무작위로 취함으로써 부분 K-홉 부분 그래프

를 추출한다. p는 각 훈련 반복에 대해 Beta(α,α)*100에서 샘플링된다(알고리즘 1 참조).Subgraph from source G _π

To configure, first node

Select the importance R% of to find the important parts containing the class semantics. Node importance, by definition, represents the importance of an l-hop subgraph. Randomly and uniformly from the top-2R% major nodes to avoid repeatedly selecting the same nodes throughout the entire training phase.

Probabilistically sample. selected anchor

Given a set of , a partial K-hop subgraph is obtained by randomly taking p% of its immediate neighbors for each move step.

Extract . p is sampled from Beta(α,α)*100 for each training iteration (see Algorithm 1).

[표 1][Table 1]

일부 부분 그래프를 선택하는 주된 이유는 그래프가 밀도가 높을 때 K-홉 부분 그래프가 그래프 G 전체와 일치할 수 있기 때문이다. 각 반복에 대해 K는 이산 공간 K에서 확률적으로 샘플링된다.The main reason for choosing some subgraphs is that when the graph is dense, a K-hop subgraph can match the entire graph G. For each iteration, K is stochastically sampled from a discrete space K.

대상 측에서는 원본 그래프에서 부분 그래프를 제거하여 소스 측 부분 그래프

를 이식한다. 알고리즘 1에 의해 얻어진 노드 V는 대상 그래프 G에서 제거된다. 소스 측과 달리 앵커 노드

는 증강 그래프의 다양성을 위해 대상 그래프 G에서 R%로 무작위로 선택된다. 이 노드들이 떨어지는 곳은 소스

의 부분 그래프가 이식될 곳이다. v＼

에 의해 유도된 나머지 대상 부분 그래프는

=(v＼

)로 정의된다.On the target side, a subgraph is removed from the original graph to create a subgraph on the source side.

transplant. The node V obtained by Algorithm 1 is removed from the target graph G. Unlike the source side, the anchor node

is randomly selected as R% from the target graph G for the diversity of the augmented graph. Where these nodes fall is the source

This is where the subgraph of will be transplanted. v＼

The remaining target subgraph derived by is

=(v\

) is defined as.

이식(transplant): 그래프 이식 알고리즘은 다음과 같다.Transplant: The graph transplant algorithm is as follows.

[표 2][Table 2]

부분 그래프

와

를 추출한 후,

와 v＼

사이에 새로운 엣지를 추가하여

를

에 이식한다. 공식적으로,

v＼

로 결합된 그래프 G'=(V',E')을 구성하고

와 v＼

사이의 연결을 결정한다. 본 실시예에서는 로컬 정보와 내부 구조를 온전하게 보존하기 위해 부분 그래프 추출에 의해 노드가 엣지를 잃는다고 할 수 있다. 이를 위해 그래프 G에서 deg_G(v)를 노드 v의 정도(degree)로 정의하자. 그렇다면, 인위적 엣지가 부착될 수 있는 소스 및 대상 그래프의 후보 노드 집합은 각각

,

이며, 노드 정도의 총 변화량은

,

이다.subgraph

and

After extracting,

Wow v＼

By adding a new edge between

cast

transplanted to officially,

v＼

Construct a graph G'=(V',E') joined by

Wow v＼

determine the connections between In this embodiment, it can be said that nodes lose edges by partial graph extraction in order to preserve local information and internal structure intact. For this purpose, let us define deg _G (v) as the degree of node v in graph G. Then, the sets of candidate nodes in the source and target graphs to which artificial edges can be attached are respectively

,

, and the total amount of change in node degree is

,

am.

와,

사이의 연결을 결정하기 위해 다음의 두 가지 접근 방식을 사용한다. 첫 번째는

노드 쌍

를 무작위로 균일하게 샘플링하는 것이다. DP(degree preservation)라고 부르는 이 간단하고 효율적인 방법은 원본 그래프에서 노드의 예상되는 정도를 보존한다. 이를 통해 생성된 그래프의 분포를 원본 그래프에서 너무 벗어나지 않게 다양한 샘플을 생성할 수 있다. 다른 하나는, 연결을 위한 노드 쌍의 피처를 고려하여 훈련 단계에서 미분 가능한 엣지 예측 모듈을 도입하는 것이다. 이를 엣지 예측(edge prediction)이라고 명명한다. 엣지 예측 모듈

은 연결된 노드 피처 쌍

을 취하여 엣지 존재 확률

을 예측한다. 표기법으로

를 사용하여 노드 쌍 (u,v)의 표현을 소비하는 엣지 예측 변수를 나타낸다. 무방향 그래프의 경우 대칭 형태를 고려하고 평균

를 취한다.

의 입력에 대해 l번째 GNN 계층 이후 잠재 노드 피처

및

를 활용하여 로컬 정보를 함께 활용한다. 베르누이(Bernoulli)(

)에서 새로운 엣지 가중치

를 샘플링한다. 샘플링된 엣지 가중치

를 미분 가능하도록 훈련 단계에서 직통 검벨-소프트맥스(Gumbel-Softmax) 추정기에 의존한다. 마지막으로, 새 엣지 집합 E'은

로 설정된다.

and,

The following two approaches are used to determine the connection between: at first

node pair

is randomly and uniformly sampled. This simple and efficient method, called degree preservation (DP), preserves the expected degree of a node in the original graph. Through this, various samples can be generated so that the distribution of the generated graph does not deviate too much from the original graph. The other is to introduce a differentiable edge prediction module in the training stage by considering the features of node pairs for connection. This is called edge prediction. Edge prediction module

is a pair of connected node features

By taking the edge existence probability

predict. by notation

Use to denote an edge predictor that consumes the representation of the node pair (u,v). For undirected graphs, consider symmetric shapes and average

Take .

Latent node features after the lth GNN layer for the input of

and

Use local information together. Bernoulli (

) new edge weights in

Sample . Sampled edge weights

It relies on a direct Gumbel-Softmax estimator in the training stage to make it differentiable. Finally, the new edge set E' is

is set to .

그래프 이식에 대한 학습 알고리즘은 다음과 같다.The learning algorithm for graph transplantation is as follows.

[표 3][Table 3]

엣지 예측 모듈은 원본 그래프를 사용한 지도 학습(supervised learning)과 가상 그래프를 사용한 종단 간 방식으로 훈련된다(알고리즘 3의 4-5, 13행 참조).The edge prediction module is trained using supervised learning using the original graph and an end-to-end method using the virtual graph (see lines 4-5 and 13 of Algorithm 3).

적응형 레이블 혼합(Adaptive Label Mixing)Adaptive Label Mixing

본 발명에서는 원본 로컬 정보를 유지하면서 그래프 의미론을 캡처하는 가상 그래프 데이터를 생성한다. 생성된 그래프에 적절한 레이블을 할당하는 것은 네트워크가 노이즈가 많은 데이터로부터 부적절한 지도(supervision)에 의해 오도되는 것을 방지하기 위해 필요하다. 예를 들어, 대상 그래프 G의 속성을 주로 결정하는 코어 노드가 작은 영역에 응축된다고 가정하자. 위에서 설명한 부분 그래프 선택 방법을 통해 이 작은 영역의 노드를 제거하면 G의 원본 속성이 대부분 손실되므로 노드 수에 비례하여 레이블을 할당하는 것만으로는 생성된 그래프를 제대로 설명할 수 없다. 이 문제를 해결하기 위해 그래프 이식을 위한 간단하지만 효과적인 레이블 혼합 전략을 제안한다.In the present invention, we generate virtual graph data that captures graph semantics while maintaining original local information. Assigning appropriate labels to the generated graphs is necessary to prevent the network from being misled by inadequate supervision from noisy data. For example, assume that the core nodes that mainly determine the properties of the target graph G are condensed into a small area. When nodes in this small area are removed through the subgraph selection method described above, most of the original properties of G are lost, so simply assigning labels proportional to the number of nodes cannot properly describe the generated graph. To solve this problem, we propose a simple but effective label mixing strategy for graph porting.

소스 부분 그래프

(및 대상 그래프

의 나머지 부분)의 중요성을 원본 전체 그래프에서 부분 그래프를 구성하는 노드의 전체 중요도로 정의한다. 이 적응적 중요도 함수 I(S,V)는

로 공식화될 수 있다. 생성된 그래프 G'에 대한 레이블 가중치

은 소스

와 대상

에서 결합되는 부분의 상대적 중요도에 따라 도출되며 혼합 인스턴스

에 대한 레이블은

로 정의된다. 여기서

이다.Source Subgraph

(and target graph

The importance of the remaining part) is defined as the overall importance of the nodes that make up the partial graph in the original entire graph. This adaptive importance function I(S,V) is

It can be formalized as Label weights for the generated graph G'

silver sauce

and target

It is derived according to the relative importance of the parts being combined in mixed instances.

The label for is

It is defined as here

am.

본 발명에서의 최종 목표는 수학식 2와 같다.The final goal in the present invention is as shown in Equation 2.

[수학식 2][Equation 2]

여기서,

은

로 정의된다. 그래프 이식에 대한 전체 학습 프로세스는 알고리즘 3과 같다.here,

silver

It is defined as The overall learning process for graph transplantation is the same as Algorithm 3.

본 발명에 따른 그래프 데이터 증강 방법은 크기가 서로 다르고 연결성이 불규칙한 그래프 데이터에 적용될 수 있으며, 부분 그래프의 의미적 중요도를 고려하여 의미 있는 가상 그래프를 생성할 수 있다.The graph data augmentation method according to the present invention can be applied to graph data of different sizes and irregular connectivity, and can generate a meaningful virtual graph by considering the semantic importance of the partial graph.

본 발명에 따른 그래프 데이터 증강 방법은 그래프 분류 문제, 분자구조의 물성 예측문제, 신약개발 위한 그래프의 표현학습, 제한된 레이블 정보만을 활용한 그래프 특징 예측 문제 등에 적용될 수 있다.The graph data augmentation method according to the present invention can be applied to graph classification problems, property prediction problems of molecular structures, graph representation learning for new drug development, and graph feature prediction problems using only limited label information.

이처럼 본 발명의 실시예들에 따르면, 서로 다른 두 그래프 데이터의 부분 그래프를 합성하는 그래프 데이터 증강 기술을 통해 가상의 그래프 데이터를 생성함으로써 그래프 데이터를 효과적으로 증강시킬 수 있으며, 이를 통해 그래프 기반의 인공지능 기술을 획기적인 성능 향상을 기대할 수 있다.As such, according to embodiments of the present invention, graph data can be effectively augmented by generating virtual graph data through graph data augmentation technology that synthesizes partial graphs of two different graph data, and through this, graph-based artificial intelligence. The technology can be expected to dramatically improve performance.

이상에서 설명된 장치는 하드웨어 구성요소, 소프트웨어 구성요소, 및/또는 하드웨어 구성요소 및 소프트웨어 구성요소의 조합으로 구현될 수 있다. 예를 들어, 실시예들에서 설명된 장치 및 구성요소는, 프로세서, 콘트롤러, ALU(arithmetic logic unit), 디지털 신호 프로세서(digital signal processor), 마이크로컴퓨터, FPGA(field programmable gate array), PLU(programmable logic unit), 마이크로프로세서, 또는 명령(instruction)을 실행하고 응답할 수 있는 다른 어떠한 장치와 같이, 하나 이상의 범용 컴퓨터 또는 특수 목적 컴퓨터를 이용하여 구현될 수 있다. 처리 장치는 운영 체제(OS) 및 상기 운영 체제 상에서 수행되는 하나 이상의 소프트웨어 어플리케이션을 수행할 수 있다. 또한, 처리 장치는 소프트웨어의 실행에 응답하여, 데이터를 접근, 저장, 조작, 처리 및 생성할 수도 있다. 이해의 편의를 위하여, 처리 장치는 하나가 사용되는 것으로 설명된 경우도 있지만, 해당 기술분야에서 통상의 지식을 가진 자는, 처리 장치가 복수 개의 처리 요소(processing element) 및/또는 복수 유형의 처리 요소를 포함할 수 있음을 알 수 있다. 예를 들어, 처리 장치는 복수 개의 프로세서 또는 하나의 프로세서 및 하나의 콘트롤러를 포함할 수 있다. 또한, 병렬 프로세서(parallel processor)와 같은, 다른 처리 구성(processing configuration)도 가능하다.The device described above may be implemented with hardware components, software components, and/or a combination of hardware components and software components. For example, the devices and components described in the embodiments include a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), and a programmable logic unit (PLU). It may be implemented using one or more general-purpose or special-purpose computers, such as a logic unit, microprocessor, or any other device capable of executing and responding to instructions. The processing device may execute an operating system (OS) and one or more software applications running on the operating system. Additionally, a processing device may access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device may be described as being used; however, those skilled in the art will understand that a processing device includes multiple processing elements and/or multiple types of processing elements. It can be seen that it may include. For example, a processing device may include a plurality of processors or one processor and one controller. Additionally, other processing configurations, such as parallel processors, are possible.

소프트웨어는 컴퓨터 프로그램(computer program), 코드(code), 명령(instruction), 또는 이들 중 하나 이상의 조합을 포함할 수 있으며, 원하는 대로 동작하도록 처리 장치를 구성하거나 독립적으로 또는 결합적으로(collectively) 처리 장치를 명령할 수 있다. 소프트웨어 및/또는 데이터는, 처리 장치에 의하여 해석되거나 처리 장치에 명령 또는 데이터를 제공하기 위하여, 어떤 유형의 기계, 구성요소(component), 물리적 장치, 컴퓨터 저장 매체 또는 장치에 구체화(embody)될 수 있다. 소프트웨어는 네트워크로 연결된 컴퓨터 시스템 상에 분산되어서, 분산된 방법으로 저장되거나 실행될 수도 있다. 소프트웨어 및 데이터는 하나 이상의 컴퓨터 판독 가능 기록 매체에 저장될 수 있다.Software may include a computer program, code, instructions, or a combination of one or more of these, which may configure a processing unit to operate as desired, or may be processed independently or collectively. You can command the device. The software and/or data may be embodied in any type of machine, component, physical device, computer storage medium or device for the purpose of being interpreted by or providing instructions or data to the processing device. there is. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

실시예에 따른 방법은 다양한 컴퓨터 수단을 통하여 수행될 수 있는 프로그램 명령 형태로 구현되어 컴퓨터 판독 가능 매체에 기록될 수 있다. 이때, 매체는 컴퓨터로 실행 가능한 프로그램을 계속 저장하거나, 실행 또는 다운로드를 위해 임시 저장하는 것일 수도 있다. 또한, 매체는 단일 또는 수 개의 하드웨어가 결합된 형태의 다양한 기록수단 또는 저장수단일 수 있는데, 어떤 컴퓨터 시스템에 직접 접속되는 매체에 한정되지 않고, 네트워크 상에 분산 존재하는 것일 수도 있다. 매체의 예시로는, 하드 디스크, 플로피 디스크 및 자기 테이프와 같은 자기 매체, CD-ROM 및 DVD와 같은 광기록 매체, 플롭티컬 디스크(floptical disk)와 같은 자기-광 매체(magneto-optical medium), 및 ROM, RAM, 플래시 메모리 등을 포함하여 프로그램 명령어가 저장되도록 구성된 것이 있을 수 있다. 또한, 다른 매체의 예시로, 어플리케이션을 유통하는 앱 스토어나 기타 다양한 소프트웨어를 공급 내지 유통하는 사이트, 서버 등에서 관리하는 기록매체 내지 저장매체도 들 수 있다.The method according to the embodiment may be implemented in the form of program instructions that can be executed through various computer means and recorded on a computer-readable medium. At this time, the medium may continuously store a computer-executable program, or temporarily store it for execution or download. In addition, the medium may be a variety of recording or storage means in the form of a single or several pieces of hardware combined. It is not limited to a medium directly connected to a computer system and may be distributed over a network. Examples of media include magnetic media such as hard disks, floppy disks, and magnetic tapes, optical recording media such as CD-ROMs and DVDs, magneto-optical media such as floptical disks, And there may be something configured to store program instructions, including ROM, RAM, flash memory, etc. Additionally, examples of other media include recording or storage media managed by app stores that distribute applications, sites or servers that supply or distribute various other software, etc.

이상과 같이 실시예들이 비록 한정된 실시예와 도면에 의해 설명되었으나, 해당 기술분야에서 통상의 지식을 가진 자라면 상기의 기재로부터 다양한 수정 및 변형이 가능하다. 예를 들어, 설명된 기술들이 설명된 방법과 다른 순서로 수행되거나, 및/또는 설명된 시스템, 구조, 장치, 회로 등의 구성요소들이 설명된 방법과 다른 형태로 결합 또는 조합되거나, 다른 구성요소 또는 균등물에 의하여 대치되거나 치환되더라도 적절한 결과가 달성될 수 있다.As described above, although the embodiments have been described with limited examples and drawings, various modifications and variations can be made by those skilled in the art from the above description. For example, the described techniques are performed in a different order than the described method, and/or components of the described system, structure, device, circuit, etc. are combined or combined in a different form than the described method, or other components are used. Alternatively, appropriate results may be achieved even if substituted or substituted by an equivalent.

그러므로, 다른 구현들, 다른 실시예들 및 특허청구범위와 균등한 것들도 후술하는 특허청구범위의 범위에 속한다.Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims described below.

Claims

In a graph data augmentation method performed on a computer device,

The computer device includes at least one processor configured to execute computer-readable instructions contained in a memory,

The graph data augmentation method is,

By augmenting graph data for GNN (graph neural network) learning,

extracting subgraphs from two different graphs, respectively, by the at least one processor; and

generating virtual graph data by mixing the partial graph, by the at least one processor

Graph data augmentation method including.

According to paragraph 1,

The extracting step is,

Extracting the subgraph using node saliency on each graph

A graph data augmentation method characterized by

According to paragraph 1,

The extraction step is,

Calculating node importance based on node features and gradient in each graph; and

Extracting a k-hop graph from each graph using the node importance

Graph data augmentation method including.

According to paragraph 1,

The generating step is,

removing a first partial graph extracted from a destination graph from the destination graph; and

Transplanting a second partial graph extracted from a source graph to the target graph from which the first partial graph has been removed.

Graph data augmentation method including.

According to clause 4,

The transplanting step is,

Adding a new edge between a node remaining after the first subgraph is removed from the target graph and a specific node sampled from the second subgraph

Graph data augmentation method including.

According to clause 5,

The specific node is randomly sampled from the second partial graph through degree preservation (DP), which preserves the expected degree of the node in the original graph.

A graph data augmentation method characterized by

According to clause 5,

The specific node is sampled from the second partial graph through edge prediction, which predicts edge existence probability using features of node pairs.

A graph data augmentation method characterized by

In clause 7,

The edge prediction uses a differentiable edge predictor that is trained through supervised learning using the original graph and an end-to-end manner using a virtual graph.

A graph data augmentation method characterized by

According to clause 4,

The graph data augmentation method is,

Determining, by the at least one processor, a label of the virtual graph data based on node importance of the first partial graph and the second partial graph.

A graph data augmentation method further comprising:

According to clause 9,

The determining step is,

Deriving a label weight for the virtual graph data according to the relative importance between the nodes remaining after the first partial graph is removed from the target graph and the nodes included in the second partial graph.

Graph data augmentation method including.

In a graph data augmentation system implemented with a computer device,

At least one processor configured to execute computer readable instructions contained in memory

Including,

The at least one processor,

By augmenting graph data for GNN (graph neural network) learning,

The process of extracting subgraphs from two different graphs; and

Process of generating virtual graph data by mixing the partial graph

A graph data augmentation system that processes .

According to clause 11,

The at least one processor,

In each graph, node saliency is calculated based on node features and gradient,

Extracting a k-hop graph from each graph using the node importance

A graph data augmentation system featuring.

According to clause 11,

The at least one processor,

Remove the first subgraph extracted from the destination graph from the destination graph,

A graph data augmentation system featuring.

According to clause 13,

The at least one processor,

Adding a new edge between a node remaining after the first subgraph is removed from the target graph and a specific node sampled from the second subgraph,

A graph data augmentation system featuring.

According to clause 13,

The at least one processor,

Determining the label of the virtual graph data based on node importance of the first subgraph and the second subgraph,

Deriving a label weight for the virtual graph data according to the relative importance between the nodes remaining after the first subgraph is removed from the target graph and the nodes included in the second subgraph.

A graph data augmentation system featuring.