WO2025183242A1 - Synthetic medical data generation system and method for predicting postoperative complications

Info

Publication number: WO2025183242A1
Application number: PCT/KR2024/002632
Authority: WO
Inventors: 최윤재; 조은별; 이민재; 이하정; 권소이; 조정민; 김이삭
Original assignee: Korea Advanced Institute of Science and Technology KAIST; Seoul National University Hospital
Current assignee: Korea Advanced Institute of Science and Technology KAIST; Seoul National University Hospital
Priority date: 2024-02-29
Filing date: 2024-02-29
Publication date: 2025-09-04
Anticipated expiration: 2026-08-29

Abstract

The present invention relates to a synthetic medical data generation system and method for predicting postoperative complications. According to the present invention, the synthetic medical data generation system is provided, the system comprising: a medical data acquisition unit for acquiring an input vector in a form in which structured medical data related to clinical information of a patient and unstructured medical data in a text form recorded for the patient are arranged according to configuration rules; an embedding unit for embedding, in consideration of data type, the structured medical data and the unstructured medical data included in the input vector; and a synthetic medical data generation unit, which integrally learns the embedded input vector through a pre-trained deep learning-based synthetic data generation model, thereby generating structured synthetic medical data having the same form as the structured medical data.

Description

System and method for generating synthetic medical data for predicting postoperative complications

The present invention relates to a system and method for generating synthetic medical data for predicting postoperative complications, and more specifically to a system and method that synthesize high-quality structured medical data by using both structured and unstructured data.

Reported postoperative complications include bleeding, respiratory complications such as pneumonia, sepsis, and circulatory complications. The incidence varies by country but is reported at approximately 6-33%. When one or more postoperative complications occur, patient mortality reaches approximately 12-20%, and hospital stays and medical costs increase.

Among postoperative complications, acute kidney injury is associated with the co-occurrence of other postoperative complications and with progression to chronic kidney disease, which in turn increases mortality. It is therefore essential to identify patients at high risk of acute kidney injury before surgery, stratify the risk of complications, and allocate medical resources proactively.

In addition, applying the deep learning and machine learning techniques widely used in medical data analysis to clinical practice requires a substantial amount of medical and patient data.

In recent years, the computerization and digitization of hospital medical data, along with technologies for data collection, production, processing, and security, have advanced rapidly, but the linkage and exchange needed to make use of that data remain inadequate. The privacy sensitivity of hospital medical information and its structural complexity limit the linkage and exchange of large datasets.

Therefore, making use of large volumes of medical data requires both efforts to link hospitals through standardized and structured medical information and a reliable way to protect sensitive personal information. Various research methodologies have been proposed for protecting sensitive personal information, but obtaining high-quality, large-scale data remains difficult in practice because of degraded data quality, the risk of re-identification, and the lack of standardization.

Synthetic medical data is emerging as a way to overcome these difficulties. It can substitute for real medical data that contains sensitive information such as personal details, and because it is free from pseudonymization/anonymization and data-linkage restrictions, it can maximize data utility.

Moreover, because researchers can work only with synthetic data rather than with actual patient medical data, they can use and process the data without legal restrictions and share it freely with external researchers.

Synthetic medical data can be produced in any desired quantity from original data that has undergone preprocessing, standardization, and labeling, which sharply reduces the time and cost of building datasets for training artificial intelligence models.

Therefore, there is a need for synthetic medical data generation technology that can build large volumes of clinical data while protecting sensitive personal information and preventing individuals from being identified.

The background art of the present invention is disclosed in Korean Patent No. 10-2482262 (published December 28, 2022).

An object of the present invention is to provide a synthetic medical data generation system and method in which an artificial intelligence model synthesizes structured medical data by referring not only to structured medical data related to a patient's clinical information but also to unstructured medical data, namely text recorded about the patient, so that the generated synthetic medical data is of high quality and reflects the diversity and complexity of medical data.

The present invention provides a synthetic medical data generation system comprising: a medical data acquisition unit that acquires an input vector in which structured medical data related to a patient's clinical information and unstructured medical data in the form of text recorded about the patient are arranged according to a set rule; an embedding unit that embeds the structured medical data and the unstructured medical data included in the input vector according to their data types; and a synthetic medical data generation unit that jointly learns the embedded input vector through a pre-trained deep learning-based synthetic data generation model to generate structured synthetic medical data in the same format as the structured medical data.

In addition, the structured medical data may consist of categorical data, which is clinical information of the patient that is classified by category, and numerical data, which is clinical information of the patient that is expressed numerically.

In addition, the unstructured medical data may be text data from a surgical record written during surgery.

In addition, the input vector may be formed by concatenating, in order, a first vector consisting of a preset set of categorical data items for the patient, a second vector consisting of a preset set of numerical data items, and a third vector consisting of tokenized text data extracted from the surgical record.

In addition, for the structured medical data, the embedding unit may embed the categorical features corresponding to the categorical data and apply a linear embedding to the continuous numerical features corresponding to the numerical data; for the unstructured medical data, it may use a text embedding method to convert each token of the tokenized text into a fixed-size vector.

In addition, the embedding unit may perform token type embedding, which lets the synthetic data generation model identify whether a piece of medical data is structured; positional embedding, which distinguishes the individual columns of the structured medical data and the individual tokens of the unstructured medical data; and null type embedding, which indicates whether each value in the input vector is a null (blank) value.

In addition, the synthetic data generation model may include a transformer that jointly learns the features of the structured and unstructured medical data in the embedded input vector to generate output data, and a multi-head that uses the output data of the transformer to generate structured synthetic medical data consisting of categorical data and numerical data and provides it as an output vector.

In addition, the multi-head may include a classifier that generates, from the output data of the transformer, the categorical features that make up the categorical data in the synthetic medical data; a Gaussian head that generates the continuous features that make up the numerical data in the synthetic medical data; and a null classifier that determines whether each value in the output vector is null.

In addition, the generated synthetic medical data may be used as training data for building a model that predicts postoperative complications.

The present invention also provides a synthetic medical data generation method performed by a synthetic medical data generation system, the method comprising: acquiring an input vector in which structured medical data related to a patient's clinical information and unstructured medical data in the form of text recorded about the patient are arranged according to a set rule; embedding the structured medical data and the unstructured medical data included in the input vector according to their data types; and jointly learning the embedded input vector through a pre-trained deep learning-based synthetic data generation model to generate structured synthetic medical data in the same format as the structured medical data.

According to the present invention, a synthetic data generation model that captures the complexity and diversity of medical data can be provided by jointly considering structured data related to a patient's clinical information and unstructured text data such as surgical records.

Because the present invention quickly and efficiently generates the required amount of data from original data that has been refined through preprocessing, standardization, and labeling, it can greatly reduce the time and cost of building datasets for training artificial intelligence models.

Thus, according to the present invention, high-quality synthetic medical data can be generated efficiently by synthesizing structured medical data with reference to both structured information and unstructured text, so that the complexity and diversity of the data are captured in depth.

In addition, the high-quality synthetic medical data can replace actual medical data as a dataset for training a complication risk prediction model, and can play an important role in various medical tasks.

FIG. 1 is a diagram showing the configuration of a synthetic medical data generation system according to an embodiment of the present invention.

FIG. 2 is a diagram illustrating in detail the structure of a synthetic data generation model according to an embodiment of the present invention.

FIG. 3 is a diagram illustrating a synthetic medical data generation method according to an embodiment of the present invention.

FIG. 4 is a diagram showing a specific example of how synthetic medical data is used.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art can readily practice the invention. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted for clarity, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case where the parts are "directly connected" but also the case where they are "electrically connected" with another element in between. Likewise, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless stated otherwise.

FIG. 1 is a diagram showing the configuration of a synthetic medical data generation system according to an embodiment of the present invention, and FIG. 2 is a diagram illustrating in detail the structure of a synthetic data generation model according to an embodiment of the present invention.

As shown in FIG. 1, a synthetic medical data generation system (100) according to an embodiment of the present invention includes a medical data acquisition unit (110), an embedding unit (120), and a synthetic medical data generation unit (130). The operation of each unit (110 to 130) and the data flow between the units may be controlled by a control unit (not shown).

The synthetic medical data generation system (100) may be implemented as a physical computer device that includes a processor, memory, user interface input/output devices, a storage device, a network input/output unit, and the like, or as an application program running on a computer device or user terminal.

The synthetic medical data generation system (100) according to an embodiment of the present invention can generate synthetic medical data by inputting actual patient medical data into a pre-trained synthetic data generation model. Synthetic medical data can be generated in whatever quantity is needed and can replace actual medical data. For example, the medical data of 10,000 patients can each be input into the synthetic data generation model to produce 10,000 new synthetic medical data records.

As shown in FIGS. 1 and 2, the medical data acquisition unit (110) can acquire an input vector (10) in which structured medical data related to the patient's clinical information and unstructured medical data in the form of text recorded about the patient are arranged according to a set rule.

The medical data acquisition unit (110) can collect per-patient medical data or input vectors through a user interface input/output device, a user terminal connected by wire or wirelessly, or a storage device. It may receive per-patient medical data from an external source and process it into input vectors itself, or it may directly receive per-patient input vectors that were prepared in advance.

As shown in FIG. 2, the input vector (10) is broadly divided into structured medical data and unstructured medical data, and the structured data can be further divided into categorical data and numerical data.

Accordingly, the input vector (10) consists of three types of data in total: categorical data, numerical data, and text data.

Structured data may correspond to medical data related to a patient's clinical information (e.g., sex, age, blood pressure, blood type, heart rate, past surgical history, past medical history), and unstructured medical data may correspond to medical data in text form recorded in relation to the patient.

Within the structured data, categorical data may correspond to clinical information that is classified by category into a value of 0 or 1 (e.g., sex (male/female), history of surgery (yes/no)), and numerical data may correspond to clinical information that is expressed numerically (e.g., blood pressure, heart rate, blood sugar, age).

Of the three data types, text data is the only unstructured medical data, and it may correspond to the text of the surgical record written during the patient's surgery. The surgical record is text written by medical staff during surgery and can comprehensively capture events, notable findings, physiological responses, and other information arising during the operation.

The embodiment of the present invention proposes a technique for synthesizing structured medical data for predicting postoperative complications, and the text data among the patient's medical data may accordingly include the text of the surgical record.

The input vector (10) may also arrange the three data types described above according to a set rule. The set rule may specify the data types that make up the input vector, the kinds and number of individual data items within each type, and the order in which the data are arranged.

For example, as shown in FIG. 2, the input vector (10) may be formed by concatenating, in order, a first vector consisting of a preset set of categorical data items for the patient, a second vector consisting of a preset set of numerical data items, and a third vector consisting of tokenized text data extracted from the surgical record.
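The ordering described above can be sketched in a few lines of Python. This is a minimal illustration only: the column names, the `Slot` record, and the whitespace tokenizer are hypothetical stand-ins, not details taken from the specification.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical set rule: which columns exist and in what order they appear.
CATEGORICAL_COLUMNS = ["sex", "prior_surgery"]
NUMERICAL_COLUMNS = ["age", "systolic_bp", "heart_rate"]


@dataclass
class Slot:
    kind: str                # "categorical", "numerical", or "text"
    name: str                # column name, or "note" for text tokens
    value: Optional[object]  # None marks a null (missing) value


def build_input_vector(record: dict, note: str) -> List[Slot]:
    """Concatenate categorical, numerical, then text slots in a fixed order."""
    first = [Slot("categorical", c, record.get(c)) for c in CATEGORICAL_COLUMNS]
    second = [Slot("numerical", c, record.get(c)) for c in NUMERICAL_COLUMNS]
    # Whitespace splitting stands in for a real tokenizer (assumption).
    third = [Slot("text", "note", tok) for tok in note.lower().split()]
    return first + second + third


# heart_rate is absent from this record, so its slot carries a null value.
record = {"sex": 1, "prior_surgery": 0, "age": 63, "systolic_bp": 128}
vector = build_input_vector(record, "Minimal bleeding during resection")
```

Because every patient's vector follows the same rule, a given structured column always sits at the same position, and only the text portion varies in length.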

The embedding unit (120) receives an input vector (10) of this form, embeds it, and provides the result to the synthetic medical data generation unit (130).

The embedding unit (120) can embed the structured medical data and the unstructured medical data included in the input vector according to their data types.

Specifically, for structured medical data, the embedding unit (120) can embed the categorical features corresponding to the categorical data and apply a linear embedding to the continuous numerical features corresponding to the numerical data. For unstructured medical data, the embedding unit (120) can use a commonly known text embedding method to convert each token of the tokenized text into a fixed-size vector.
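As a rough sketch of the three type-specific embeddings, assuming a lookup table for categories and tokens and a per-column weight and bias for numerical values (the specification names the embedding kinds but not their parameterization), the following pure-Python example uses random values in place of learned parameters. All names and the width `DIM` are hypothetical.

```python
import random

DIM = 4  # hypothetical embedding width
rng = random.Random(0)


def random_vector():
    # Stand-in for a learned parameter vector.
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


# Categorical: one vector per (column, category) pair.
categorical_table = {("sex", 0): random_vector(), ("sex", 1): random_vector()}
# Numerical: one weight and one bias vector per column.
numerical_params = {"age": (random_vector(), random_vector())}
# Text: one vector per vocabulary token, with a fallback for unknown tokens.
token_table = {"bleeding": random_vector(), "[UNK]": random_vector()}


def embed_categorical(column, category):
    return categorical_table[(column, category)]


def embed_numerical(column, value):
    # Linear embedding: scale a weight vector by the value and add a bias.
    w, b = numerical_params[column]
    return [value * wi + bi for wi, bi in zip(w, b)]


def embed_token(token):
    return token_table.get(token, token_table["[UNK]"])
```

Whatever the data type, each position ends up as a vector of the same width, which is what allows all three kinds of data to share one sequence.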

In addition, the embedding unit (120) can perform token type embedding, which lets the synthetic data generation model identify whether a piece of medical data is structured; positional embedding, which distinguishes the individual columns of the structured medical data and the individual tokens of the unstructured medical data; and null type embedding, which indicates whether each value in the input vector is a null (blank) value.
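One common way to combine such auxiliary embeddings is to add them element-wise to the value embedding, as BERT-style models do. The specification does not state how they are combined, so the summation below is an assumption, as are the table names, the width `DIM`, and the maximum length of 16.

```python
import random

DIM = 4
rng = random.Random(1)


def vec():
    # Stand-in for a learned parameter vector.
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


TOKEN_TYPES = {"structured": vec(), "unstructured": vec()}
NULL_TYPES = {False: vec(), True: vec()}
POSITIONS = [vec() for _ in range(16)]  # assumed maximum sequence length


def add_auxiliary(value_emb, position, is_structured, is_null):
    """Sum the value embedding with token-type, positional, and null-type embeddings."""
    type_key = "structured" if is_structured else "unstructured"
    parts = [value_emb, TOKEN_TYPES[type_key], POSITIONS[position], NULL_TYPES[is_null]]
    return [sum(components) for components in zip(*parts)]
```

Under this assumption, two positions holding the same value still receive different final embeddings if they differ in column, modality, or null status.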

Null values may occur at different positions for each patient. For example, for a patient with no recorded heart rate, the value at the heart rate position in the input vector is represented as a null value.

In this way, the embedding unit (120) applies several kinds of embedding so that structured and unstructured data are both modeled appropriately, and passes the embedded input data to the synthetic medical data generation unit (130).

The synthetic medical data generation unit (130) jointly learns the embedded input vector through a pre-trained deep learning-based synthetic data generation model and can thereby generate structured synthetic medical data (20) in the same format as the structured medical data.

Here, the synthetic medical data generation unit (130) applies the embedded input vector to the pre-trained synthetic data generation model, which jointly learns the features of the structured and unstructured medical data in the input vector, and produces as its final result synthetic medical data (20) with the same structure as the structured medical data.

As shown in the upper part of FIG. 2, the synthetic medical data (20) consists of categorical data and numerical data, like the structured medical data used as model input, and may have the same data size as that input. Null values in the synthetic medical data (20) may also occur at the same positions as the null values in the structured medical data used as model input.

Thus, the synthetic medical data (20) has null values at the same positions as the structured data that served as the original input, and some or all of the remaining values may differ from the original.

In an embodiment of the present invention, as shown in FIG. 2, the synthetic data generation model may include a transformer (131), an architecture known for leading performance on a variety of natural language processing tasks, and a multi-head (132) connected to it.

The transformer (131) receives the embedded input vector and jointly learns the characteristics of the structured and unstructured medical data in it to generate output data. Among the available training approaches, the transformer (131) can use masking-based generation: part of the input data is masked (hidden) and the model is trained to predict the masked part, which helps the model understand the internal structure of the data more deeply and learn higher-level patterns.
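The masking step of such a scheme can be sketched as follows. The mask ratio, the `[MASK]` placeholder, and the uniform random choice of positions are illustrative assumptions; the specification does not give these details.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for hidden positions


def mask_sequence(values, mask_ratio, rng):
    """Hide a random subset of positions and return the originals as prediction targets."""
    n_mask = max(1, round(len(values) * mask_ratio))
    positions = sorted(rng.sample(range(len(values)), n_mask))
    masked = list(values)
    targets = {}
    for p in positions:
        targets[p] = masked[p]  # what the model should reconstruct
        masked[p] = MASK        # what the model actually sees
    return masked, targets


values = [1, 0, 63, 128, 72]
masked, targets = mask_sequence(values, 0.4, random.Random(0))
```

During training the model would see `masked` and be scored on how well it recovers `targets`; at generation time the same mechanism can be used to fill in positions that are deliberately hidden.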

In the embodiment of the present invention, the self-attention performed by the transformer (131) spans both the structured and the unstructured medical data in the input vector, so data of different modalities can be modeled simultaneously.
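To illustrate what it means for attention to span both modalities, the sketch below computes standard scaled dot-product attention weights over a toy sequence in which the first two positions stand for structured columns and the last two for text tokens. The vectors are made up, and a real model would first apply learned query and key projections.

```python
import math


def attention_weights(queries, keys):
    """Scaled dot-product attention weights; each row is a softmax and sums to 1."""
    scale = math.sqrt(len(keys[0]))
    weights = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Positions 0-1: structured columns; positions 2-3: text tokens (toy values).
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
weights = attention_weights(sequence, sequence)
```

Every row of `weights` assigns nonzero weight to every position, so the representation of a structured column can draw on the surgical-record tokens and vice versa.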

In an embodiment of the present invention, the transformer (131) may be implemented with a transformer encoder alone. An encoder-decoder structure may also be used where needed.

The multi-head (132) uses the output data of the transformer (131) to generate structured synthetic medical data consisting of categorical data and numerical data, and provides it as an output vector.

More specifically, the multi-head (132) can generate structured medical data by mapping the high-dimensional output data of the transformer to categorical or numerical data.

As shown in FIG. 2, the multi-head (132) may include a classifier that generates, from the output data of the transformer (131), the categorical features that make up the categorical data in the synthetic medical data; a Gaussian head that generates the continuous features that make up the numerical data in the synthetic medical data; and a null classifier that determines whether each value in the output vector is null.

Specifically, the classifier determines which categorical class each transformer (131) output corresponding to a categorical feature belongs to. The Gaussian head uses the transformer (131) outputs corresponding to continuous features to predict a mean and variance for each column and thereby models a normal distribution; sampling from this estimated distribution yields probabilistic generation of numerical data. The null classifier uses the transformer (131) output to determine whether each column is a missing (null) value, so that values are synthesized only for columns that are not missing.
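The division of labor among the three heads can be sketched as below. The dictionary layout of `head_out`, the log-variance parameterization, the argmax decision for categories, and the 0.5 null threshold are all assumptions made for illustration; the specification states only that a class is determined, that a mean and variance are predicted and sampled from, and that null columns are not synthesized.

```python
import math
import random


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits):
    """Classifier head: pick the most probable category index."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


def gaussian_sample(mean, log_var, rng):
    """Gaussian head: draw a value from N(mean, exp(log_var))."""
    std = math.exp(0.5 * log_var)
    return rng.gauss(mean, std)


def is_null(null_logit):
    """Null classifier: sigmoid probability above 0.5 means the column is missing."""
    return 1.0 / (1.0 + math.exp(-null_logit)) > 0.5


def generate_column(kind, head_out, rng):
    # Columns judged to be null are left empty rather than synthesized.
    if is_null(head_out["null_logit"]):
        return None
    if kind == "categorical":
        return classify(head_out["class_logits"])
    return gaussian_sample(head_out["mean"], head_out["log_var"], rng)
```

For instance, `generate_column("numerical", {"null_logit": -2.0, "mean": 120.0, "log_var": 2.0}, random.Random(0))` would return a value scattered around 120, while a large positive `null_logit` would return `None` for that column.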

In this way, the multi-head (132) diversifies the generator located at the top of the synthetic data generation model so that data can be synthesized across the various data types present in the structured data.

The right side of FIG. 2 shows, by way of example, a data set that accumulates the synthetic medical data generated by the method described above. According to the present invention, N new synthetic medical records can be generated from the real medical data of N patients, and the synthetic medical data obtained in this way can replace real data in various deep learning-based analysis/prediction models and related research.

Accordingly, for example, the medical data of one thousand patients can be applied to the artificial intelligence model to generate one thousand new synthetic medical records as big data, and the synthetic medical data generated in this way can address privacy issues in big data-based medical data analysis.

First, the synthetic medical data generation system (100) acquires an input vector in which structured medical data related to the patient's clinical information and unstructured medical data in text form recorded for the patient are arranged according to a set rule (S310).

Next, the synthetic medical data generation system (100) embeds the structured medical data and the unstructured medical data included in the input vector, taking the data type into account (S320).

Then, the synthetic medical data generation system (100) inputs the embedded input vector into the synthetic data generation model (S330), which integrally learns the features of the structured medical data and the unstructured medical data within the input vector and thereby generates synthetic medical data in the same format as the structured medical data (S340).
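The acquisition step (S310) and the bookkeeping that the embedding step (S320) relies on can be sketched as follows. This is an illustrative sketch only: the column names (`sex`, `asa_class`, `age`), the whitespace tokenizer, the 0/1 token type identifiers, and the single running position index are assumptions made for the example and are not specified in this description.

```python
def build_input_sequence(categorical, numerical, note_text):
    """Arrange structured values and note tokens in one ordered sequence."""
    sequence = []
    # Categorical columns first, then numerical columns, then text tokens.
    for name, value in categorical.items():
        sequence.append({"type": "categorical", "column": name, "value": value})
    for name, value in numerical.items():
        sequence.append({"type": "numerical", "column": name, "value": value})
    # Whitespace tokenization is a stand-in for a real text tokenizer.
    for token in note_text.lower().split():
        sequence.append({"type": "text", "column": None, "value": token})
    for position, item in enumerate(sequence):
        # Position index: separates columns from each other and tokens from each other.
        item["position"] = position
        # Token type: 0 for structured entries, 1 for unstructured text (assumed ids).
        item["token_type"] = 1 if item["type"] == "text" else 0
        # Null flag: marks blank values so they can receive a null type embedding.
        item["is_null"] = item["value"] is None
    return sequence


example = build_input_sequence(
    {"sex": "F", "asa_class": None},
    {"age": 63.0},
    "Open nephrectomy performed",
)
for entry in example:
    print(entry)
```

Each entry carries the three pieces of side information (token type, position, null flag) that an embedding layer would need in order to treat every value according to its data type.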

According to the present invention, structured medical data is synthesized by referring not only to the structured medical data related to the patient's clinical information but also to the unstructured text medical data recorded for the patient, so that the generated synthetic medical data reflects the diversity and complexity of medical data.

In addition, whereas existing medical data synthesis models mainly focus on synthesizing a single modality such as medical images, the present invention considers structured medical information and unstructured medical text information together, and can therefore provide a differentiated synthesis model capable of generating high-quality synthetic medical data that reflects this complexity and diversity.

High-quality data synthesized in this way can play an important role in various medical tasks, including predicting the risk of complications. Furthermore, such a synthesis model can be widely applied not only in the medical field but also to other data synthesis tasks in which structured and textual information coexist.

FIG. 4 is a diagram illustrating a specific example of the use of synthetic medical data. As shown in FIG. 4, the synthetic medical data (20) generated by the present invention can be used as a training data set for building a postoperative complication prediction model. For example, it can be used to build an artificial intelligence-based deep learning/machine learning model that predicts whether a patient will develop acute kidney injury within one week after surgery, or the probability of such development.

In this way, introducing an algorithm that automatically predicts the risk of postoperative complications based on synthetic medical data can enable efficient use of medical resources and costs according to pre- and post-operative risk assessment.

Furthermore, synthetic medical data has significant value as a substitute for the personal and sensitive information contained in real medical data. Using it can therefore circumvent data pseudonymization and anonymization issues and greatly improve data usability and applicability. Researchers can use such synthetic medical data without legal restrictions and, if necessary, share it with other researchers.

In this way, the use of synthetic data greatly helps to address the privacy issues of medical big data and can thereby help build social consensus on expanding the use of medical big data. Moreover, developing a complication prediction model using synthetic data can reduce medical costs and improve patient health outcomes, and introducing this system can enable the early detection of complications even in areas with a shortage of medical professionals.

The following describes the results of an experiment on the classification performance of a prediction model (postoperative complication prediction model) trained on synthetic medical data generated according to the technique of the present invention.

Acute kidney injury (AKI) was considered as the postoperative complication. In Table 1 below, AKI=0 denotes data for the group with the postoperative complication, and AKI=1 denotes data for the group without the postoperative complication.

The statistical test was performed as follows. The chi-square test was used for categorical data and the t-test was used for numerical data. The number listed for each item is the number of columns for which the distributions of the synthetic data and the real data could be judged identical. The real data used for the model performance test was divided into the AKI=0 group and the AKI=1 group, and the statistical tests were performed on each group separately.

The modality item in Table 1 indicates the type of input data used for data synthesis. The lower row (Table+Text) in Table 1 is the multi-modality case, corresponding to the present invention, in which synthetic medical data was generated by considering both the table (structured data) and the text (unstructured data). The upper row (Table) is the single-modality case, included for comparison with the present invention, in which synthetic medical data was generated using only the table information.

In Table 1, the sampling and mask scheduler items indicate factors related to the synthesis level of the artificial intelligence model used.

The performance indicators considered for the postoperative complication prediction model (hereinafter, the prediction model) were the AUROC (Area Under the Receiver Operating Characteristic Curve), the ratio of the most frequent class, and the mean value.

When the prediction model trained on real data (the comparison target) was validated on a test data set of real patients (patients whose postoperative complication status is known), the AUROC value was 0.853. If the synthetic data has been generated well by the technique of the present invention, the AUROC value obtained when the prediction model trained on that synthetic data is applied to the test data set of real patients should also be close to 0.853.
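For reference, the AUROC used in this comparison can be computed from the true labels of the test patients and the scores output by a prediction model. The sketch below uses the pairwise ranking definition (the fraction of positive/negative pairs that the model orders correctly, with ties counted as one half); the function name and the toy labels and scores are hypothetical and are not taken from the experiment.

```python
def auroc(labels, scores):
    """AUROC as the probability that a positive case outscores a negative one."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # correctly ordered pair
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Toy example: four test patients, two with the positive label.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Under this protocol, the same real test set is scored twice, once by the model trained on real data and once by the model trained on synthetic data, and the two AUROC values are compared.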

As shown in the results of Table 1, the AUROC value of the prediction model trained on the synthetic medical data according to the present invention, which considers both table and text information, ranged from 0.783 to 0.799. This is within about 0.05 to 0.07 of the AUROC value of 0.853 obtained by the comparison prediction model trained on real patient data.

In addition, in all cases the synthetic data generated with multi-modality (Table+Text) according to the present invention showed higher quality than the synthetic data generated with a single modality (Table). That is, the prediction models trained on synthetic data generated with reference to the unstructured text consistently outperformed those trained on synthetic data generated without the text, which indicates that the synthetic data produced by the present invention is of high quality.

Next, the ratio of the most frequent class is obtained by dividing the count of the most frequent class in each categorical column by the total number of samples, and measures how much this ratio differs from that of the real data. A lower value indicates a smaller difference and therefore better performance.

The mean value is the difference between the mean of each numerical column in the real data and in the generated data, summed over all numerical columns. Here too, a smaller value indicates better performance.
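The two fidelity indicators described above can be sketched as follows. This is one possible reading, not the exact evaluation code: the description does not state whether absolute differences are used, how the ratio differences are aggregated across columns, or whether the most frequent class is determined per data set, so this sketch assumes absolute differences, a sum over columns, and the most frequent class of each data set taken separately. The column names and values are made up for illustration.

```python
from collections import Counter
from statistics import fmean


def most_frequent_class_ratio(values):
    """Share of samples that fall in the most frequent class of one column."""
    return Counter(values).most_common(1)[0][1] / len(values)


def ratio_gap(real_columns, synth_columns):
    """Sum over categorical columns of the absolute ratio difference (assumed aggregation)."""
    return sum(
        abs(most_frequent_class_ratio(real_columns[c]) - most_frequent_class_ratio(synth_columns[c]))
        for c in real_columns
    )


def mean_gap(real_columns, synth_columns):
    """Sum over numerical columns of the absolute difference between column means."""
    return sum(abs(fmean(real_columns[c]) - fmean(synth_columns[c])) for c in real_columns)


real_cat = {"sex": ["F", "F", "M", "M", "F"]}
synth_cat = {"sex": ["F", "M", "M", "M", "M"]}
real_num = {"age": [60, 70], "creatinine": [1.0, 1.2]}
synth_num = {"age": [62, 70], "creatinine": [1.1, 1.3]}

print(ratio_gap(real_cat, synth_cat))
print(mean_gap(real_num, synth_num))
```

Because the column means are summed without normalization in this reading, columns measured on larger scales dominate the mean indicator.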

In all cases, the ratio and mean values obtained with the multi-modality (Table+Text) of the present invention were lower than those obtained with the single modality (Table). The synthetic data produced by the present invention therefore shows better quality than the single-modality case, which does not refer to the unstructured text data.

In addition, the possibility of personal information leakage from the synthetic medical data generated by the proposed invention was tested through a privacy evaluation known as membership inference, and the possibility of leakage was confirmed to be low.

As described above, according to the present invention, a synthetic data generation model that reflects the complexity and diversity of medical data can be provided by considering together the structured data related to a patient's clinical information and unstructured text data such as surgical records.

In addition, by synthesizing structured medical data with reference to unstructured text as well as structured information, and thereby capturing the complexity and diversity of the data in depth, high-quality synthetic medical data can be produced.

While the present invention has been described with reference to the embodiments illustrated in the drawings, these are merely exemplary, and those skilled in the art will understand that various modifications and other equivalent embodiments are possible. Therefore, the true scope of technical protection of the present invention should be determined by the technical spirit of the appended claims.

Claims

A synthetic medical data generation system comprising:

a medical data acquisition unit that acquires an input vector in which structured medical data related to a patient's clinical information and unstructured medical data in text form recorded for the patient are arranged according to a set rule;

an embedding unit that embeds the structured medical data and the unstructured medical data included in the input vector, taking the data type into account; and

a synthetic medical data generation unit that integrally learns the embedded input vector through a pre-trained deep learning-based synthetic data generation model to generate structured synthetic medical data in the same format as the structured medical data.

In claim 1,

The above structured medical data is,

A synthetic medical data generation system comprising categorical data having information types classified by category among the clinical information of the above patient and numerical data having information types expressed numerically.

In claim 1,

The above unstructured medical data is,

A synthetic medical data generation system, which is text data from surgical records recorded during surgery.

In claim 2,

The above input vector is,

A synthetic medical data generation system in which a first vector composed of the patient's categorical data, a second vector composed of the patient's numerical data, and a third vector composed of tokenized text data extracted from the above surgical records are sequentially connected.

In claim 2,

The above embedding unit,

A synthetic medical data generation system that, in the case of the above structured medical data, embeds categorical features corresponding to the above categorical data and embeds continuous numerical features corresponding to the above numerical data using linear embedding, and, in the case of the above unstructured medical data, converts individual tokens of tokenized text into fixed-size vectors using a text embedding method.

In claim 2,

The above embedding unit,

A synthetic medical data generation system that performs token type embedding to enable the synthetic data generation model to identify whether medical data is structured, positional embedding to distinguish each column in the structured medical data and each token in the unstructured medical data, and null type embedding to distinguish whether each value in the input vector is a null value (blank value).

In claim 2,

The above synthetic data generation model is,

A transformer that generates output data by comprehensively learning the features of structured medical data and unstructured medical data within the above-mentioned embedded input vector,

A synthetic medical data generation system comprising a multi-head that generates structured synthetic medical data composed of categorical data and numerical data using the output data of the transformer and provides the structured synthetic medical data as an output vector.

In claim 7,

The above multi-head,

A synthetic medical data generation system comprising a classifier that generates categorical features constituting categorical data in the synthetic medical data from the output data of the transformer, a Gaussian head that generates continuous features constituting numerical data in the synthetic medical data, and a null classifier that determines the presence or absence of a null value in the output vector.

In claim 1,

The synthetic medical data generated above is:

A synthetic medical data generation system used as learning data for building a predictive model for predicting post-surgical complications.

A synthetic medical data generation method performed by a synthetic medical data generation system, the method comprising:

a step of acquiring an input vector in which structured medical data related to a patient's clinical information and unstructured medical data in text form recorded for the patient are arranged according to a set rule;

a step of embedding the structured medical data and the unstructured medical data included in the input vector, taking the data type into account; and

a step of generating structured synthetic medical data in the same format as the structured medical data by integrally learning the embedded input vector through a pre-trained deep learning-based synthetic data generation model.

In claim 10,

The above structured medical data is,

A method for generating synthetic medical data consisting of categorical data having an information type classified by category among the clinical information of the above patient and numerical data having an information type expressed numerically.

In claim 10,

The above unstructured medical data is,

A method for generating synthetic medical data, which is text data from surgical records recorded during surgery.

In claim 11,

The above embedding step is,

A method for generating synthetic medical data, which performs token type embedding to enable the synthetic data generation model to identify whether medical data is structured, positional embedding to distinguish each column in the structured medical data and each token in the unstructured medical data, and null type embedding to distinguish whether each value in the input vector is a null value (blank value).

In claim 11,

The step of generating the above synthetic medical data is:

A step of generating output data by comprehensively learning the features of structured medical data and unstructured medical data within the above embedded input vector in a transformer; and

A method for generating synthetic medical data, comprising a step of generating structured synthetic medical data composed of categorical data and numerical data using output data of the transformer by a multi-head connected to the transformer and providing the structured synthetic medical data as an output vector.