KR102059743B1

KR102059743B1 - Method and system for providing biomedical passage retrieval using deep-learning based knowledge structure construction

Info

Publication number: KR102059743B1
Application number: KR1020180041996A
Authority: KR
Inventors: 이문용; 한기준
Original assignee: Korea Advanced Institute of Science and Technology (한국과학기술원)
Priority date: 2018-04-11
Filing date: 2018-04-11
Publication date: 2019-12-26
Anticipated expiration: 2038-04-11
Also published as: KR20190118744A

Abstract

A method and system for retrieving medical literature passages using a deep-learning based knowledge structure generation method are presented. A medical literature passage retrieval method according to one embodiment may include: extracting paragraph-level passages from medical literature; indexing the extracted passages and searching for passages that match an input initial query to obtain an initial passage search result; extracting key concepts from a preset number of top-ranked passages in the initial passage search result through natural language processing, and then extracting association information between the key concepts; generating a knowledge structure using the association information between the key concepts; searching for a core concept of the constructed knowledge structure; adding the concepts lying between the found core concept in the knowledge structure and the initial query to a query set; and performing the medical passage search again using the expanded query set.

Description

METHOD AND SYSTEM FOR PROVIDING BIOMEDICAL PASSAGE RETRIEVAL USING DEEP-LEARNING BASED KNOWLEDGE STRUCTURE CONSTRUCTION

The embodiments below relate to a method and system for retrieving medical literature passages using a deep-learning based knowledge structure generation method, and more particularly to a method and system for retrieving medical literature passages that use a deep-learning based knowledge structure generation technique together with association information between the concepts in the knowledge structure.

In general, when a user enters a query, a search website providing a search service returns search results corresponding to that query (for example, websites containing the query, articles containing the query, or images whose file names contain the query).

In particular, unlike general queries, queries used in the medical field make heavy use of medical terminology and abbreviations (for example, alpha fetoprotein is variously written as AFP, AFP-L3, and so on), and they contain many words that are interpreted differently depending on where they are used (for example, serum means something different when used together with albumin than when used as a test item). A retrieval technique is therefore needed that can include a document or passage in the search results in such cases even when it does not contain the query term itself.

Examples of the medical passage sources described below include the medical papers provided by PubMed Central and the various medical textbooks that practitioners use in actual practice. As of 2018, PubMed Central is an online medical database holding more than 780,000 papers, and it serves as a place that provides healthcare workers with the information they need for diagnosis and research.

However, most medical practitioners must perform many diagnoses within a limited time. To serve as an efficient decision support system under this time constraint, a retrieval system needs to extract only the most essential passages from the retrieved documents and provide them to the user.

Existing passage retrieval methods have either extracted related words from external resources or ontologies, or mechanically extracted related words from the top-ranked retrieved documents.

A knowledge structure generation method may be given as one example of an existing passage retrieval method. Korean Patent Registration No. 10-1538998 relates to a method and apparatus for providing a search service based on such a knowledge structure. It extracts the main keywords in a document based on BoW (Bag-of-Words) and word frequency, and measures the association between words from the co-occurrence information of two keywords. This made it possible to extract a knowledge structure from a single document, but it has the drawback that building one integrated knowledge structure from multiple documents is difficult.

In addition, the existing method of using a knowledge structure for retrieval builds a knowledge structure for each of the initially retrieved contents, measures the distance between the initial query terms within each knowledge structure, defines that information as proximity, and re-ranks the initial search results accordingly. However, it has the drawback that it cannot retrieve hidden documents other than those initially retrieved.

As for services that use a knowledge structure, there are recommendation services that recommend various categories to users and education services that automatically generate questions based on a knowledge structure built from Wikipedia. However, these lack the functionality needed to provide a service that uses a knowledge structure in the medical field.

Korean Patent Registration No. 10-1538998

The embodiments describe a method and system for retrieving medical literature passages using a deep-learning based knowledge structure generation method. More specifically, they provide a technique that automatically generates an integrated knowledge structure from the multiple medical passages ranked highest for an initial query, automatically extracts keywords for query expansion from the generated knowledge structure, and thereby significantly improves passage retrieval performance for medical documents.

The embodiments aim to provide a method and system for retrieving medical literature passages using a deep-learning based knowledge structure generation method that automatically builds an integrated knowledge structure from multiple resources through a deep-learning based text analysis technique, so that the statistical methods otherwise needed to build and merge a separate knowledge structure for each resource can be omitted, and that implicitly identifies words related to the initial query so that documents missed by the initial search can be retrieved.

A medical literature passage retrieval method according to one embodiment may include: extracting paragraph-level passages from medical literature; indexing the extracted passages and searching for passages that match an input initial query to obtain an initial passage search result; extracting key concepts from a preset number of top-ranked passages in the initial passage search result through natural language processing, and then extracting association information between the key concepts; generating a knowledge structure using the association information between the key concepts; searching for a core concept of the constructed knowledge structure; adding the concepts lying between the found core concept in the knowledge structure and the initial query to a query set; and performing the medical passage search again using the expanded query set.

In the step of obtaining the initial passage search result, passages that match the initial query entered by the user may be retrieved using the TF-IDF, BM25, and LM (language model) techniques, and the results may then be combined to obtain the search result.
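The combination of the three rankers can be sketched as follows. This is a minimal illustration, not the patented implementation: the text does not say how the TF-IDF, BM25, and LM results are combined, so reciprocal rank fusion is used here purely as an assumption, and the function names, the whitespace tokenizer, and the parameters `k1`, `b`, `mu`, and `k` are hypothetical choices.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real system would use a biomedical tokenizer.
    return text.lower().split()


def tfidf_score(query, doc, docs):
    tf, n = Counter(doc), len(docs)
    score = 0.0
    for term in query:
        df = sum(1 for d in docs if term in d)
        if df:
            score += tf[term] * math.log(1 + n / df)
    return score


def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    tf, n = Counter(doc), len(docs)
    avgdl = sum(len(d) for d in docs) / n
    score = 0.0
    for term in query:
        df = sum(1 for d in docs if term in d)
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[term] * (k1 + 1) / norm
    return score


def lm_score(query, doc, docs, mu=2000.0):
    # Query likelihood with Dirichlet smoothing (one common LM ranker).
    collection = Counter(t for d in docs for t in d)
    total = sum(collection.values())
    tf = Counter(doc)
    score = 0.0
    for term in query:
        p_coll = collection[term] / total
        if p_coll == 0:
            continue  # term unseen in the whole collection: skip it
        score += math.log((tf[term] + mu * p_coll) / (len(doc) + mu))
    return score


def fuse(rankings, k=60):
    # Reciprocal rank fusion: an ASSUMED way to "combine" the three result lists.
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return [doc_id for doc_id, _ in fused.most_common()]


def initial_search(query_text, passages, top_n=3):
    docs = [tokenize(p) for p in passages]
    query = tokenize(query_text)
    rankings = []
    for scorer in (tfidf_score, bm25_score, lm_score):
        scores = [scorer(query, d, docs) for d in docs]
        order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
        rankings.append(order)
    return fuse(rankings)[:top_n]  # indices of the top-ranked passages
```

Calling `initial_search("afp liver", passages)` returns passage indices ordered by the fused rank; the top `top_n` of these would feed the concept extraction step.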

The step of extracting the association information between the key concepts may include: extracting key concepts from the preset number of top-ranked passages in the initial passage search result through natural language processing; and performing an embedding operation that converts each key concept into a vector by running the deep-learning based Word2Vec algorithm on the extracted key concepts.
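To make the embedding step concrete, the following is a toy skip-gram trainer with negative sampling, written in plain Python so that it runs without dependencies. It is only a sketch of the idea behind Word2Vec: a production system would more likely use a library implementation (for example gensim's `Word2Vec`), and the function name `train_skipgram` and all hyperparameters (`dim`, `window`, `negatives`, `epochs`, `lr`) are assumptions, not values from the source.

```python
import math
import random


def train_skipgram(sentences, dim=8, window=2, negatives=3,
                   epochs=50, lr=0.05, seed=0):
    """Toy skip-gram with negative sampling; returns {concept: vector}."""
    rng = random.Random(seed)
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    # Input (concept) vectors start small and random, output vectors at zero.
    w_in = [[rng.uniform(-0.5, 0.5) / dim for _ in range(dim)] for _ in vocab]
    w_out = [[0.0] * dim for _ in vocab]

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, x))))

    def update(center, context, label):
        # One SGD step on the logistic loss for a (center, context) pair.
        v, u = w_in[center], w_out[context]
        grad = lr * (label - sigmoid(sum(a * b for a, b in zip(v, u))))
        for k in range(dim):
            v_k = v[k]
            v[k] += grad * u[k]
            u[k] += grad * v_k

    for _epoch in range(epochs):
        for sentence in sentences:
            ids = [index[w] for w in sentence]
            for pos, center in enumerate(ids):
                lo = max(0, pos - window)
                hi = min(len(ids), pos + window + 1)
                for ctx_pos in range(lo, hi):
                    if ctx_pos == pos:
                        continue
                    update(center, ids[ctx_pos], 1.0)  # observed pair
                    for _ in range(negatives):
                        # Uniform negative sampling; may occasionally hit a
                        # true context word, which is tolerable in a toy.
                        update(center, rng.randrange(len(vocab)), 0.0)
    return {w: w_in[i] for w, i in index.items()}
```

Here each "sentence" is the list of key concepts extracted from one top-ranked passage, so concepts that share contexts across passages end up with nearby vectors.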

The step of generating the knowledge structure using the association information between the key concepts may include: building a distance matrix between the key concepts extracted from the preset number of top-ranked passages and generating an association matrix that uses the similarity obtained from each distance as the values of each row; and generating the knowledge structure using the association matrix.
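One way to read this step is sketched below, assuming cosine similarity between concept vectors as the association measure and a similarity threshold for deciding which concept pairs become edges. Both choices, along with the names `association_matrix`, `knowledge_structure`, and `threshold`, are illustrative assumptions; the source does not specify them.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def association_matrix(vectors):
    """Return (concepts, matrix) where matrix[i][j] is the similarity
    between concepts[i] and concepts[j]."""
    concepts = sorted(vectors)
    matrix = [[cosine(vectors[a], vectors[b]) for b in concepts]
              for a in concepts]
    return concepts, matrix


def knowledge_structure(concepts, matrix, threshold=0.5):
    """Keep concept pairs whose similarity reaches the (assumed) threshold
    as weighted, undirected edges of the knowledge structure."""
    edges = {}
    for i, a in enumerate(concepts):
        for j in range(i + 1, len(concepts)):
            if matrix[i][j] >= threshold:
                edges[(a, concepts[j])] = matrix[i][j]
    return edges
```

The resulting edge dictionary is a single graph built from all top-ranked passages at once, which is the "integrated" structure the text contrasts with per-document structures.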

In the step of searching for the core concept of the knowledge structure, the PageRank algorithm may be used to find the most central keyword in the knowledge structure.
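A minimal sketch of this step is a weighted PageRank by power iteration over the undirected concept graph. The edge format (`{(a, b): weight}` with positive weights), the damping factor, and the iteration count are assumptions made for illustration; a graph library's PageRank routine could be substituted.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Weighted PageRank on an undirected graph given as {(a, b): weight}.
    Assumes every weight is positive, so no node is left without out-weight."""
    neighbors = {}
    for (a, b), weight in edges.items():
        neighbors.setdefault(a, {})[b] = weight
        neighbors.setdefault(b, {})[a] = weight
    nodes = sorted(neighbors)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            # Each neighbor m passes on its rank in proportion to edge weight.
            incoming = sum(
                rank[m] * weight / sum(neighbors[m].values())
                for m, weight in neighbors[n].items()
            )
            new_rank[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


def core_concept(edges):
    rank = pagerank(edges)
    return max(rank, key=rank.get)
```

`core_concept(edges)` returns the highest-ranked node, which plays the role of the core concept that query expansion then works toward.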

In the step of adding the concepts between the core concept in the knowledge structure and the initial query to the query set, the concept words lying between the core concept in the knowledge structure and the initial query may be added to the query set.
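The notion of concepts "lying between" the query and the core concept can be illustrated with a shortest-path search. The sketch below assumes that "between" means the intermediate nodes on a shortest (fewest-hops) path in the knowledge structure and that the core concept itself is not added; both are interpretations, and the names `concepts_between` and `expand_query` are hypothetical.

```python
from collections import deque


def concepts_between(edges, query_term, core_concept):
    """Intermediate nodes on a shortest path from query_term to core_concept.
    `edges` is any iterable of (a, b) pairs, e.g. the keys of an edge dict."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    if query_term not in graph or core_concept not in graph:
        return []
    parent = {query_term: None}
    queue = deque([query_term])
    while queue:  # breadth-first search
        node = queue.popleft()
        if node == core_concept:
            break
        for nxt in sorted(graph[node]):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    if core_concept not in parent:
        return []  # the two concepts are not connected
    path = []
    node = parent[core_concept]
    while node is not None and node != query_term:
        path.append(node)
        node = parent[node]
    return path[::-1]


def expand_query(initial_terms, edges, core_concept):
    """Append the in-between concepts for every initial term, without duplicates."""
    expanded = list(initial_terms)
    for term in initial_terms:
        for concept in concepts_between(edges, term, core_concept):
            if concept not in expanded:
                expanded.append(concept)
    return expanded
```

The expanded list is what the second retrieval pass would use in place of the original query terms.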

The integrated knowledge structure may be generated automatically from the multiple medical passages ranked highest for the initial query, and keywords for query expansion may be extracted automatically from the generated knowledge structure.

A medical literature passage retrieval system according to another embodiment may include: a passage extraction unit that extracts paragraph-level passages from medical literature; a passage search unit that indexes the extracted passages and searches for passages that match an input initial query to obtain an initial passage search result; an association information extraction unit that extracts key concepts from a preset number of top-ranked passages in the initial passage search result through natural language processing, and then extracts association information between the key concepts; a knowledge structure generation unit that generates a knowledge structure using the association information between the key concepts; a core concept search unit that searches for a core concept of the constructed knowledge structure; and a query expansion unit that adds the concepts lying between the found core concept in the knowledge structure and the initial query to a query set. The system may perform the medical passage search again using the expanded query set.

The passage search unit may retrieve passages that match the initial query entered by the user using the TF-IDF, BM25, and LM (language model) techniques, and then combine the results to obtain the search result.

The association information extraction unit may extract key concepts from the preset number of top-ranked passages in the initial passage search result through natural language processing, and then perform an embedding operation that converts each key concept into a vector by running the deep-learning based Word2Vec algorithm.

The core concept search unit may use the PageRank algorithm to find the most central keyword in the knowledge structure.

The query expansion unit may add the concept words lying between the core concept in the knowledge structure and the initial query to the query set.

The integrated knowledge structure may be generated automatically from the multiple medical passages ranked highest for the initial query, and keywords for query expansion may be extracted automatically from the generated knowledge structure.

A passage retrieval method according to yet another embodiment may include: extracting paragraph-level passages from a plurality of documents; indexing the extracted passages and searching for passages that match an input initial query to obtain an initial passage search result; extracting key concepts from a preset number of top-ranked passages in the initial passage search result through natural language processing, and then extracting association information between the key concepts; generating a knowledge structure using the association information between the key concepts; searching for a core concept of the constructed knowledge structure; adding the concepts lying between the found core concept in the knowledge structure and the initial query to a query set; and performing the passage search again using the expanded query set.

The integrated knowledge structure may be generated automatically from the multiple related passages ranked highest for the initial query, and keywords for query expansion may be extracted automatically from the generated knowledge structure.

According to the embodiments, it is possible to provide a method and system for retrieving medical literature passages using a deep-learning based knowledge structure generation method in which an integrated knowledge structure is built automatically from multiple resources through a deep-learning based text analysis technique, so that the statistical methods otherwise needed to build and merge a separate knowledge structure for each resource can be omitted.

According to the embodiments, implicitly identifying the words related to the initial query makes it possible to retrieve documents missed by the initial search and to achieve improved retrieval performance. This can be used to provide a search service suited to active use in the medical field, where technical terms and many forms of abbreviations and synonyms exist.

FIG. 1 is a block diagram schematically illustrating a medical literature passage retrieval system using a deep-learning based knowledge structure generation method according to one embodiment.
FIG. 2 is a flowchart illustrating a medical literature passage retrieval method using a deep-learning based knowledge structure generation method according to one embodiment.
FIG. 3 is a diagram illustrating an example of passage search results related to AST according to one embodiment.
FIG. 4 is a diagram illustrating an example of query expansion from a constructed knowledge structure according to one embodiment.

Hereinafter, embodiments are described with reference to the accompanying drawings. The described embodiments may, however, be modified into various other forms, and the scope of the present invention is not limited to the embodiments described below. The embodiments are provided to explain the present invention more completely to those having ordinary skill in the art. The shapes and sizes of elements in the drawings may be exaggerated for clarity.

The present embodiments relate to a method and system that automatically generate an integrated knowledge structure from the multiple medical passages ranked highest for an initial query (for example, sentences or paragraphs related to the query in a medical textbook or medical journal), automatically extract keywords for query expansion from the generated knowledge structure, and thereby significantly improve passage retrieval performance for medical documents.

Unlike existing methods that extract related words from external resources or ontologies, or that mechanically extract related words from the top-ranked retrieved documents, the passage retrieval method proposed in this embodiment builds a knowledge structure from the top-ranked retrieved documents and uses the words associated with the query within that knowledge structure as expanded query terms to improve retrieval performance. This is where it differs from existing query expansion methods.

The method proposed in this embodiment also differs from existing methods that use a knowledge structure in the following respects.

The existing knowledge structure generation method extracts the main keywords in a document based on BoW (Bag-of-Words) and word frequency, and measures the association between words from the co-occurrence information of two keywords. This made it possible to extract a knowledge structure from a single document, but it has the drawback that building one integrated knowledge structure from multiple documents is difficult. In this embodiment, an integrated knowledge structure can be built automatically from multiple documents by using Word2Vec, a CNN (Convolutional Neural Network) based deep learning technique for natural language processing. The knowledge structure generation method proposed in this embodiment therefore has the advantage over the prior art that a knowledge structure based on multiple documents or passages can be generated automatically.

In addition, the existing method of using a knowledge structure for retrieval builds a knowledge structure for each of the initially retrieved contents, measures the distance between the initial query terms within each knowledge structure, defines that information as proximity, and re-ranks the initial search results accordingly. However, it has the drawback that it cannot retrieve hidden documents other than those initially retrieved. This embodiment builds an integrated knowledge structure from the initially retrieved passages and uses it for query expansion rather than for re-ranking documents, so compared with the existing patent it has the advantage that it can be used to retrieve documents that the initial search missed.

Existing services that use a knowledge structure include recommendation services that recommend various categories to users and education services that automatically generate questions based on a knowledge structure built from Wikipedia. However, they do not provide a service that uses a knowledge structure in the medical field. This embodiment aims to provide such a service through a retrieval service that supports decision making in the medical field.

An object of the present invention is to build a knowledge structure by using a deep-learning based algorithm to compute the key concept words, and the relationships between them, from medical literature passages written in natural language; to perform query expansion that expands the concept words associated with the query from the constructed knowledge structure; and to provide a medical passage retrieval service that uses the expanded query to support decision making.

FIG. 1 is a block diagram schematically illustrating a medical literature passage retrieval system using a deep-learning based knowledge structure generation method according to one embodiment.

Referring to FIG. 1, a medical literature passage retrieval system (100) using a deep-learning based knowledge structure generation method according to one embodiment may include a passage extraction unit (110), a passage search unit (120), an association information extraction unit (130), a knowledge structure generation unit (140), a core concept search unit (150), and a query expansion unit (160). Depending on the embodiment, it may further include a screen configuration unit (170) that composes a search page and provides it to the user, and a database (180) that stores various information.

More specifically, the medical literature passage retrieval system using a deep-learning based knowledge structure generation method according to one embodiment includes: a passage extraction unit (110) that extracts paragraph-level passages from medical literature; a passage search unit (120) that indexes the extracted passages and searches for passages that match an input initial query to obtain an initial passage search result; an association information extraction unit (130) that extracts key concepts from a preset number of top-ranked passages in the initial passage search result through natural language processing, and then extracts association information between the key concepts; a knowledge structure generation unit (140) that generates a knowledge structure using the association information between the key concepts; a core concept search unit (150) that searches for a core concept of the constructed knowledge structure; and a query expansion unit (160) that adds the concepts lying between the found core concept in the knowledge structure and the initial query to a query set. The medical passage search may then be performed again with the expanded query set, either by the passage search unit (120) or by a separate passage re-search unit.

Accordingly, a user can use a terminal to connect over a network to the medical literature passage retrieval system (or server) using a deep-learning based knowledge structure generation method according to one embodiment and search for the various medical passages the user wants. Medical passage retrieval is described here as one example, but the system is not limited to medical passages and may be configured to retrieve general passages or passages from a specific field.

The medical literature passage retrieval system using a deep-learning based knowledge structure generation method according to one embodiment is described in more detail below by way of an example.

FIG. 2 is a flowchart illustrating a medical literature passage retrieval method using a deep-learning based knowledge structure generation method according to one embodiment.

Referring to FIG. 2, the medical literature passage retrieval method using a deep-learning based knowledge structure generation method according to one embodiment is a medical passage retrieval method that uses a deep-learning based knowledge structure generation technique together with association information between the concepts in the knowledge structure, and it may be performed as follows.

The medical literature passage retrieval method using a deep-learning based knowledge structure generation method according to one embodiment may include: extracting paragraph-level passages from medical literature (210); indexing the extracted passages and searching for passages that match an input initial query to obtain an initial passage search result (220); extracting key concepts from a preset number of top-ranked passages in the initial passage search result through natural language processing, and then extracting association information between the key concepts (230); generating a knowledge structure using the association information between the key concepts (240); searching for a core concept of the constructed knowledge structure (250); adding the concepts lying between the found core concept in the knowledge structure and the initial query to a query set (260); and performing the medical passage search again using the expanded query set (270).
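The order of steps 210 to 270 can be summarized as a small orchestration function. This is only a structural sketch: the `steps` dictionary, its keys, and the `top_k` parameter are hypothetical, and each callable stands in for whatever concrete component an implementation provides.

```python
def retrieve_with_expansion(documents, query_terms, steps, top_k=10):
    """Run the two-pass retrieval flow; `steps` maps step names to callables."""
    passages = steps["extract"](documents)                    # step 210
    initial = steps["search"](query_terms, passages)          # step 220
    vectors = steps["embed"](initial[:top_k])                 # step 230
    structure = steps["build"](vectors)                       # step 240
    core = steps["find_core"](structure)                      # step 250
    expanded = steps["expand"](query_terms, structure, core)  # step 260
    return steps["search"](expanded, passages)                # step 270
```

Passing the components in as callables keeps the flow independent of any particular ranker, embedding model, or graph algorithm, and reusing the same `search` callable in steps 220 and 270 mirrors the option of re-running the search with the passage search unit.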

Unlike existing knowledge structure construction methods, the present embodiments can automatically build an integrated knowledge structure from multiple resources through a deep-learning based text analysis technique. The statistical methods that would otherwise be needed to build and merge a separate knowledge structure for each resource can therefore be omitted.

또한, 지식 구조를 통한 질의어 확장 방법은 초기 질의어와 관련된 단어들을 묵시적으로 파악하여 제공할 수 있게 하여, 초기 검색되지 못한 문서의 검색을 가능하게 함과 더불어 보다 향상된 검색 성능을 확보할 수 있다. In addition, the query expansion method through the knowledge structure enables to implicitly identify and provide words related to the initial query, thereby enabling the retrieval of a document that was not initially searched, and further improving search performance.

이는, 특히 전문어 및 다양한 형태의 축약어 및 유의어가 존재하는 의료 분야에서 활발하게 사용될 수 있는 검색 서비스를 제공하는 데에 사용될 수 있다. This may be used to provide a search service that can be actively used in the medical field, in particular for nomenclature and various forms of abbreviations and synonyms.

아래에서 일 실시예에 따른 딥러닝 기반의 지식 구조 생성 방법을 활용한 의료 문헌 구절 검색 방법을 보다 구체적으로 설명하기로 한다. Hereinafter, a method of searching for a medical document phrase using a deep learning based knowledge structure generation method according to an embodiment will be described in more detail.

일 실시예에 따른 딥러닝 기반의 지식 구조 생성 방법을 활용한 의료 문헌 구절 검색 방법은 도 1에서 설명한 일 실시예에 따른 딥러닝 기반의 지식 구조 생성 방법을 활용한 의료 문헌 구절 검색 시스템을 이용하여 구체적으로 설명할 수 있다. 여기서, 일 실시예에 따른 딥러닝 기반의 지식 구조 생성 방법을 활용한 의료 문헌 구절 검색 시스템(100)은 구절 추출부(110), 구절 검색부(120), 연관성 정보 추출부(130), 지식 구조 생성부(140), 핵심 개념 탐색부(150) 및 질의어 확장부(160)를 포함하여 이루어질 수 있다. According to an embodiment of the present invention, a medical document phrase search method using a deep learning based knowledge structure generation method may be performed by using a medical document phrase search system using the deep learning based knowledge structure generation method described in FIG. 1. It can be explained concretely. Here, the medical document phrase search system 100 using the deep learning-based knowledge structure generation method according to an embodiment of the phrase extraction unit 110, phrase search unit 120, association information extraction unit 130, knowledge It may include a structure generator 140, a core concept search unit 150, and a query expansion unit 160.

단계(210)에서, 구절 추출부(110)는 의료 문헌으로부터 문단 단위의 구절을 추출할 수 있다. 예를 들어, 구절 추출부(110)는 온라인에서 수집된 복수의 의료 문헌으로부터 문단 단위의 구절을 추출할 수 있다. In operation 210, the phrase extraction unit 110 may extract a paragraph unit of paragraph from the medical document. For example, the phrase extraction unit 110 may extract a paragraph unit paragraph from a plurality of medical documents collected online.

Paragraph-level passages may be extracted from medical literature as follows.

For e-book file format conversion, .azw3 and .azw files may be converted to .pdf, .txt, or .doc.

For text and image extraction, text and images may be extracted using the PDFBox library.

For text pre-processing, sentence and paragraph segmentation may be performed using the OpenNLP library.

For reference extraction, the references at the end of a book may be extracted.

For reference web crawling, the titles and/or abstracts of the references may be crawled from the web.

Through the above process, paragraphs are extracted from the text obtained from the literature, and each paragraph is defined as one passage.
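The final step of this process, turning extracted text into paragraph-level passages, can be sketched as follows. This is a minimal illustration in Python, not the PDFBox/OpenNLP pipeline named above; the function name `split_into_passages`, the blank-line paragraph rule, and the `min_chars` cutoff are assumptions made for the example.

```python
import re


def split_into_passages(text: str, min_chars: int = 20) -> list[str]:
    """Split raw extracted text into paragraph-level passages.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    # Normalize line endings so the blank-line rule works on any platform.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    blocks = re.split(r"\n\s*\n", normalized)
    passages = []
    for block in blocks:
        # Join hard-wrapped lines inside one paragraph into a single line.
        passage = " ".join(
            line.strip() for line in block.split("\n") if line.strip()
        )
        # Drop very short fragments such as stray page numbers.
        if len(passage) >= min_chars:
            passages.append(passage)
    return passages
```

Each returned string would then be indexed as one passage in step 220.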

In step 220, the passage retrieval unit 120 may index the extracted passages and retrieve passages that match the input initial query to obtain an initial passage retrieval result. For example, the passage retrieval unit 120 may obtain the initial passage retrieval result using a bag-of-words (BoW) based search engine.

The passage retrieval unit 120 indexes the paragraph-level passages obtained above, retrieves passages that match the user's initial query using the TF-IDF, BM25, and LM (language model) techniques, and combines the results to return a retrieval result. The final score of a passage retrieved by these techniques may be calculated as in the following equation.

[Equation 1]

Here, S_{j,k} is the score of passage j obtained with IR model F_k, N is the total number of IR models used, and M is the total number of retrieved passages. In this embodiment, N is set to 3 and M to 100.
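Equation 1 itself is not reproduced in this text, so its exact form cannot be shown here. As a rough illustration of how scores from N IR models over M passages can be combined, the sketch below min-max normalizes each model's scores and averages them. This fusion rule and the name `fuse_scores` are assumptions for the example and may differ from Equation 1.

```python
def fuse_scores(scores_by_model: list[list[float]]) -> list[float]:
    """Average min-max-normalized scores over IR models.

    scores_by_model holds N lists (one per IR model), each with the
    scores of the same M passages in the same order.
    """
    n_models = len(scores_by_model)
    m = len(scores_by_model[0])
    fused = [0.0] * m
    for scores in scores_by_model:
        lo, hi = min(scores), max(scores)
        span = hi - lo
        for j, s in enumerate(scores):
            # A model that scores every passage identically contributes 0.
            normalized = (s - lo) / span if span else 0.0
            fused[j] += normalized / n_models
    return fused
```

With N = 3, `scores_by_model` would hold the TF-IDF, BM25, and LM score lists for the same passages.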

FIG. 3 shows an example of passage retrieval results related to AST according to an embodiment.

FIG. 3 shows the AST-related passages obtained through Equation 1 above, together with the images contained in those passages.
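Of the three retrieval techniques named for step 220, BM25 can be sketched in a few lines. The following is a minimal Okapi BM25 scorer over pre-tokenized passages; the parameter values `k1=1.2` and `b=0.75` are common defaults, not values stated in this document, and the function name is hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, passages, k1=1.2, b=0.75):
    """Score tokenized passages against a tokenized query with Okapi BM25."""
    if not passages:
        return []
    n = len(passages)
    avg_len = sum(len(p) for p in passages) / n
    # Document frequency: number of passages that contain each term.
    df = Counter(term for p in passages for term in set(p))
    scores = []
    for p in passages:
        tf = Counter(p)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer passages need more matches.
            norm = tf[term] + k1 * (1 - b + b * len(p) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Sorting the passages by these scores and keeping the top M gives one of the ranked lists that step 220 combines.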

In step 230, the association information extraction unit 130 may extract core concepts from a predetermined number of top-ranked passages in the initial passage retrieval result through natural language processing, and then extract association information between the core concepts.

For example, the association information extraction unit 130 may extract core concepts from the top M passages through natural language processing and then perform a deep-learning-based embedding that converts each core concept into a vector space representation.

More specifically, the association information extraction unit 130 may extract core concepts from a predetermined number of top-ranked passages in the initial passage retrieval result through natural language processing. It may then run the deep-learning-based Word2Vec algorithm on the extracted core concepts to embed each core concept as a vector.

Word2Vec (Mikolov et al., 2013) is a deep learning method for natural language processing that takes a corpus as input and learns embeddings that represent the words of the corpus as vectors. That is, when nouns are extracted from the top M retrieved passages through natural language processing and fed to Word2Vec, the associations between the core concepts in that corpus are learned and expressed as vectors.

Table 1 shows an example from the Word2Vec model implemented for an embodiment: the words closest, that is, most similar, to Albumin.

[Table 1]

Table 1 lists the ten concepts most similar to Albumin.
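A list like Table 1 can be produced by ranking concept vectors by cosine similarity to the target concept. The sketch below assumes the embeddings are already available as plain Python lists; the three-dimensional vectors in `toy_vectors` are invented for illustration and are not output of the Word2Vec model described here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(target, vectors, top_n=10):
    """Return the top_n (concept, similarity) pairs closest to target."""
    ranked = sorted(
        (
            (cosine(vectors[target], vec), name)
            for name, vec in vectors.items()
            if name != target
        ),
        reverse=True,
    )
    return [(name, sim) for sim, name in ranked[:top_n]]


# Toy vectors, invented for this example only.
toy_vectors = {
    "albumin": [1.0, 0.0, 0.0],
    "globulin": [0.9, 0.1, 0.0],
    "bilirubin": [0.5, 0.5, 0.0],
    "fracture": [0.0, 0.0, 1.0],
}
```

Calling `most_similar("albumin", toy_vectors)` ranks the remaining toy concepts by their closeness to "albumin".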

In step 240, the knowledge structure generation unit 140 may generate a knowledge structure using the association information between the core concepts.

For example, the knowledge structure generation unit 140 may generate a knowledge structure for the top M passages by connecting similar concepts using the generated word-vector matrix.

More specifically, the knowledge structure generation unit 140 may build a distance matrix between the core concepts extracted from the predetermined number of top-ranked passages, generate an association matrix by using the resulting pairwise similarities as the values of each row, and generate the knowledge structure from the association matrix.
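As a rough sketch of step 240, the code below builds an association matrix of pairwise cosine similarities and then links concepts whose similarity reaches a cutoff. The cutoff rule, the `threshold` parameter, and the function names are assumptions for the example; the text above only states that similar concepts are connected.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def association_matrix(concepts, vectors):
    """Row i holds the similarity of concept i to every concept."""
    return [[cosine(vectors[a], vectors[b]) for b in concepts] for a in concepts]


def build_graph(concepts, matrix, threshold=0.6):
    """Connect two concepts when their similarity reaches the threshold.

    The threshold is a hypothetical parameter chosen for illustration.
    """
    graph = {c: [] for c in concepts}
    for i, a in enumerate(concepts):
        for j in range(i + 1, len(concepts)):
            if matrix[i][j] >= threshold:
                graph[a].append(concepts[j])
                graph[concepts[j]].append(a)
    return graph
```

The resulting adjacency lists are one possible in-memory form of the graph-shaped knowledge structure used in the later steps.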

The method proposed in the present invention differs clearly from existing knowledge structure generation methods in the following respects.

1) Association matrix generation

The existing technique measures how often core nouns appear together at the paragraph level. The PCS (Paragraph Co-occurrence Cosine Similarity) method, which uses the equation below, can quantify how often the nouns extracted above co-occur within the same paragraph.

[Equation 2]
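Equation 2 is not reproduced in this text. One common way to express paragraph co-occurrence as a cosine similarity is to treat each term as a binary vector over paragraphs, which gives the number of shared paragraphs divided by the square root of the product of the two paragraph counts. The sketch below implements that formulation as an assumption; it may not match Equation 2 term for term.

```python
import math


def paragraph_cooccurrence_cosine(term_a, term_b, paragraphs):
    """Cosine similarity of two terms' binary paragraph-occurrence vectors.

    paragraphs is a list of sets, each holding the nouns of one paragraph.
    """
    count_a = sum(1 for p in paragraphs if term_a in p)
    count_b = sum(1 for p in paragraphs if term_b in p)
    count_both = sum(1 for p in paragraphs if term_a in p and term_b in p)
    if count_a == 0 or count_b == 0:
        return 0.0
    return count_both / math.sqrt(count_a * count_b)
```

A value of 1 means the two nouns always appear in the same paragraphs, and 0 means they never do.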

In contrast, the method according to an embodiment builds a distance matrix between the concepts extracted from the top M passages and uses the pairwise similarities obtained in step 230 as the values of each row.

2) Multidimensional scaling

The association information calculated above is converted into distance values from 1 to 7, where 1 indicates a strong association between two words and 7 indicates a weak one. In the existing method, the Pathfinder (Schvaneveldt, 1989) algorithm is then used to scale the dimension of the association matrix, removing noise and sharpening the proximity information between keywords, and the resulting matrix is used to generate a visualized knowledge structure.
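The exact rule for converting association values into the 1 to 7 distance scale is not given here. A simple linear mapping, shown below as an assumption, illustrates the idea: a similarity of 1 maps to distance 1 (strong association) and a similarity of 0 maps to distance 7 (weak association).

```python
def similarity_to_distance(similarity: float, levels: int = 7) -> int:
    """Map a similarity in [0, 1] to an integer distance in 1..levels.

    The linear mapping is an illustrative assumption.
    """
    # Clamp out-of-range inputs, for example negative cosine values.
    clamped = min(1.0, max(0.0, similarity))
    return 1 + round((levels - 1) * (1.0 - clamped))
```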

Unlike the existing method, the method according to an embodiment does not use the Pathfinder algorithm. This prevents an algorithm designed to remove noise from also deleting the implicit relationships between two concepts.

In step 250, the core concept search unit 150 may search for the core concept of the constructed knowledge structure.

In particular, the core concept search unit 150 may use the PageRank algorithm to find the most central keyword in the knowledge structure.

The knowledge structure constructed in step 240 is a graph, so applying the PageRank algorithm reveals which keyword is the most central in the knowledge structure. In this embodiment, PageRank may be modified to suit the form of the knowledge structure, as in the following equation.

[Equation 3]

Here, B_i is the set of vertices pointing to v_i, L_j is the number of outgoing links from vertex v_j, and O_i and O_k are the numbers of outgoing edges of vertices v_i and v_k, respectively. The PageRank used here is a weighted version of the original PageRank, which does not consider edge weights.
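Equation 3 is not reproduced in this text, so the modified form used in this embodiment cannot be shown. As a baseline for comparison, the sketch below implements only the standard, unweighted PageRank iteration, in which each vertex v_j passes its rank to its neighbors in equal shares of 1/L_j. The damping factor 0.85 and the iteration count are conventional choices assumed for the example.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Standard PageRank by power iteration.

    graph maps each node to the list of nodes it links to; every linked
    node must also appear as a key.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v, out_links in graph.items():
            if out_links:
                share = damping * rank[v] / len(out_links)
                for w in out_links:
                    new_rank[w] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for w in nodes:
                    new_rank[w] += damping * rank[v] / n
        rank = new_rank
    return rank
```

The node with the highest rank, for example `max(rank, key=rank.get)`, would be taken as the core concept of the graph.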

In step 260, the query expansion unit 160 may add the concepts between the found core concept in the knowledge structure and the initial query to a query set.

More specifically, the query expansion unit 160 may add the concept words that lie between the core concept (or core keyword) in the knowledge structure and the initial query to the query set. Before these concepts are finally added, an external ontology may be used to check whether they are related to medical terminology.

FIG. 4 shows an example of query expansion from a constructed knowledge structure according to an embodiment.

Referring to FIG. 4, as an example of query expansion from the constructed knowledge structure 400, part of an actual knowledge structure with Albumin 410 as its core concept is shown. When Aspartate 420 is entered as the initial query, the query can be expanded by adding 'Albumin (410)', 'GOT (431)', 'Glutamic (432)', 'Activity (433)', and 'Mitochondrial (434)', which lie on the path between Albumin 410 and Aspartate 420, to the initial query set.
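Collecting the concepts on the path between the initial query and the core concept can be sketched as a breadth-first search over the knowledge structure graph. Shortest-path search is an assumption here, since the text does not say how the path is chosen, and the function names are hypothetical.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Return the node list of a shortest path, or [] if none exists."""
    if start not in graph or goal not in graph:
        return []
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return []


def expand_query(graph, query, core_concept):
    """Concepts on the path from the query to the core concept.

    The query itself is excluded; the core concept is included.
    """
    path = shortest_path(graph, query, core_concept)
    return path[1:] if path else []
```

Applied to a graph shaped like the FIG. 4 excerpt, `expand_query(graph, "aspartate", "albumin")` would return the intermediate concepts followed by the core concept.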

However, including concepts that are unrelated to the medical field or ambiguous in the query set can introduce noise into the search and lower overall retrieval performance. To prevent this, before the concepts are finally added, an external ontology (BioPortal, UMLS, MeSH, etc.) may be used to check whether each concept corresponds to a medical term that has a UMLS code.

For example, an external ontology lookup shows that GOT 431 in the constructed knowledge structure is not the past tense of "get" but a medical term with UMLS code C0528721. In contrast, there is no particular basis for judging Activity 433, added in FIG. 4, to be a medical term, so it may be excluded from the final query expansion list.

That is, for the initial query Aspartate 420, query expansion through the knowledge structure proposed in this embodiment adds the words 'Albumin (410)', 'GOT (431)', 'Glutamic (432)', and 'Mitochondrial (434)' to the initial query set.
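The ontology check can be sketched as a filter that keeps only candidates recognized by a lookup function. In the example below, `toy_lookup` and the set `KNOWN_MEDICAL_TERMS` are stand-ins invented for illustration; a real system would query an external ontology service, whose API is not described in this document.

```python
from typing import Callable, Iterable


def filter_medical_terms(
    candidates: Iterable[str], is_medical_term: Callable[[str], bool]
) -> list[str]:
    """Keep only the candidates that the lookup recognizes as medical terms."""
    return [term for term in candidates if is_medical_term(term)]


# Stand-in for an external ontology lookup (hypothetical). It mirrors
# the outcome described above: "Activity" is not recognized.
KNOWN_MEDICAL_TERMS = {"albumin", "got", "glutamic", "mitochondrial"}


def toy_lookup(term: str) -> bool:
    return term.lower() in KNOWN_MEDICAL_TERMS
```

Passing the five path concepts from FIG. 4 through `filter_medical_terms` with `toy_lookup` drops 'Activity' and keeps the other four.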

In step 270, the passage retrieval unit 120 may re-run the medical passage retrieval using the expanded query set. Alternatively, a separate passage re-retrieval unit, instead of the passage retrieval unit 120, may re-run the medical passage retrieval using the expanded query set.

That is, the medical literature passage retrieval can be performed again with the added query terms. This makes it possible to find additional passages that the initial search did not detect. Adding the words that lie on the path also allows the proximity between the query and the core keyword of the knowledge structure to be taken into account, which further improves passage retrieval.
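The effect of re-running retrieval with an expanded query set can be illustrated with a deliberately simple term-overlap ranker. This is a toy stand-in for the TF-IDF, BM25, and LM retrieval of step 220, and the passages in the usage note are invented for the example.

```python
def retrieve(query_terms, passages):
    """Return indices of passages sharing a term with the query, best first."""
    query = {term.lower() for term in query_terms}
    scored = []
    for index, text in enumerate(passages):
        tokens = set(text.lower().split())
        overlap = len(query & tokens)
        if overlap:
            scored.append((overlap, index))
    # Higher overlap first; ties keep the original passage order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [index for _, index in scored]
```

For example, with the passages "aspartate aminotransferase is measured in serum" and "got levels rise after liver injury", the query ["aspartate"] finds only the first, while the expanded query ["aspartate", "got"] finds both.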

Accordingly, an integrated knowledge structure can be generated automatically from the plurality of medical passages ranked highest for the initial query, and keywords for query expansion can be extracted automatically from the generated knowledge structure.

Meanwhile, this passage retrieval method can be used to retrieve many kinds of passages, not only passages from medical literature.

A passage retrieval method according to another embodiment may include: extracting paragraph-level passages from a plurality of documents; indexing the extracted passages and retrieving passages that match an input initial query to obtain an initial passage retrieval result; extracting core concepts from a predetermined number of top-ranked passages in the initial passage retrieval result through natural language processing, and then extracting association information between the core concepts; generating a knowledge structure using the association information between the core concepts; searching for the core concept of the constructed knowledge structure; adding the concepts between the found core concept and the initial query in the knowledge structure to a query set; and re-running the passage retrieval using the expanded query set.

An integrated knowledge structure can be generated automatically from the plurality of related passages ranked highest for the initial query, and keywords for query expansion can be extracted automatically from the generated knowledge structure.

The passage retrieval method according to another embodiment uses general passages or passages from a specific field instead of medical literature passages. Because its configuration is similar to that of the medical literature passage retrieval method using deep-learning-based knowledge structure generation described above, duplicate description is omitted.

In addition, a passage retrieval system according to another embodiment may include: a passage extraction unit that extracts paragraph-level passages from documents; a passage retrieval unit that indexes the extracted passages and retrieves passages that match an input initial query to obtain an initial passage retrieval result; an association information extraction unit that extracts core concepts from a predetermined number of top-ranked passages in the initial passage retrieval result through natural language processing and then extracts association information between the core concepts; a knowledge structure generation unit that generates a knowledge structure using the association information between the core concepts; a core concept search unit that searches for the core concept of the constructed knowledge structure; and a query expansion unit that adds the concepts between the found core concept and the initial query in the knowledge structure to a query set. The system may re-run the passage retrieval using the expanded query set.

The passage retrieval system according to another embodiment uses general passages or passages from a specific field instead of medical literature passages. Because its configuration is similar to that of the medical literature passage retrieval system using deep-learning-based knowledge structure generation described above, duplicate description is omitted.

As described above, the embodiments differ from existing methods that extract related words from external resources or ontologies, and from methods that mechanically extract related words from top-ranked documents. According to the embodiments, a knowledge structure is built from the top-ranked documents, and the words associated with the query within that knowledge structure are used as expanded query terms, which can improve retrieval performance.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatuses and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, a single processing device is sometimes described, but a person of ordinary skill in the art will recognize that a processing device may include a plurality of processing elements and/or multiple types of processing elements. For example, a processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, or computer storage medium or device, in order to be interpreted by a processing device or to provide instructions or data to a processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to an embodiment may be implemented in the form of program instructions that can be executed by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code, such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described with reference to a limited number of embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in a different order from the described method, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

A medical literature passage retrieval method comprising:
extracting, by a passage extraction unit, paragraph-level passages from medical literature;
indexing, by a passage retrieval unit, the extracted passages and retrieving passages that match an input initial query to obtain an initial passage retrieval result;
extracting, by an association information extraction unit, core concepts from a predetermined number of top-ranked passages in the initial passage retrieval result through natural language processing, and then extracting association information between the core concepts;
generating, by a knowledge structure generation unit, a knowledge structure using the association information between the core concepts;
searching, by a core concept search unit, for a core concept of the constructed knowledge structure;
adding, by a query expansion unit, concepts between the found core concept in the knowledge structure and the initial query to a query set; and
re-running, by the passage retrieval unit, medical passage retrieval using the expanded query set,
wherein the extracting of the paragraph-level passages from the medical literature comprises
converting the file format of a plurality of medical documents, extracting text and images using the PDFBox library, performing text pre-processing that segments sentences and paragraphs using the OpenNLP library, extracting paragraphs from the extracted text, and defining each paragraph as one passage, thereby extracting the paragraph-level passages,
wherein the generating of the knowledge structure using the association information between the core concepts comprises:
building a distance matrix between the core concepts extracted from the predetermined number of top-ranked passages and generating an association matrix by using the resulting pairwise similarities as the values of each row; and
generating the knowledge structure using the association matrix, and
wherein the adding of the concepts between the core concept in the knowledge structure and the initial query to the query set comprises
adding concept words that lie between the core concept in the knowledge structure and the initial query to the query set, and, before the query expansion, using an external ontology to check whether each concept word is a medical term having a UMLS code, and not adding the concept word to the query set if it is not determined to be a medical term.

The method of claim 1,
wherein the obtaining of the initial passage retrieval result comprises
retrieving passages that match the user's initial query using the TF-IDF, BM25, and LM techniques to obtain the retrieval result.

The method of claim 1,
wherein the extracting of the association information between the core concepts comprises:
extracting the core concepts from the predetermined number of top-ranked passages in the initial passage retrieval result through natural language processing; and
running the deep-learning-based Word2Vec algorithm on the extracted core concepts to embed each core concept as a vector.

delete

The method of claim 1,
wherein the searching for the core concept of the knowledge structure comprises
using the PageRank algorithm to find the most central keyword in the knowledge structure.

delete

The method of claim 1,
wherein the knowledge structure is generated automatically as an integrated structure from the plurality of medical passages ranked highest for the initial query, and keywords for the query expansion are extracted automatically from the generated knowledge structure.

A medical literature passage search system comprising:
a phrase extraction unit that extracts paragraph-unit phrases from medical documents;
a phrase search unit that indexes the extracted phrases and searches for phrases matching an input initial query to obtain an initial phrase search result;
an association information extraction unit that extracts core concepts, through natural language processing, from a predetermined number of top-ranked phrases in the initial phrase search result, and then extracts association information between the core concepts;
a knowledge structure generation unit that generates a knowledge structure using the association information between the core concepts;
a core concept search unit that searches for a core concept of the constructed knowledge structure; and
a query expansion unit that adds a concept located between the core concept found in the knowledge structure and the initial query to a query set,
wherein the medical phrase search is performed again using the expanded query set,
wherein the phrase extraction unit, after converting the file format of a plurality of medical documents, extracts text and images using the PDFBox library, performs text preprocessing that separates the text into sentences and paragraphs using the OpenNLP library, and extracts paragraph-unit phrases by defining each paragraph extracted from the text as one phrase,
wherein the knowledge structure generation unit constructs a distance matrix between the core concepts extracted from the predetermined number of top-ranked phrases, generates an association matrix using the similarity between the obtained distances as the numerical values of each row, and generates the knowledge structure using the association matrix, and
wherein the query expansion unit adds to the query set a concept word located between the core concept in the knowledge structure and the initial query, and, before the query expansion, uses an external ontology to determine whether the concept word is a medical term having a UMLS code, and does not add the concept word to the query set if it is not determined to be a medical term.
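
The knowledge structure generation described in this claim (distance matrix, then association matrix, then graph) can be sketched as follows. The claim does not state which distance or similarity measure is used, so this sketch assumes cosine distance between concept vectors and takes the association value to be one minus that distance. The function names, the edge threshold of 0.5, and the three-dimensional sample vectors are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def build_knowledge_structure(vectors, threshold=0.5):
    """Return (distance matrix, association matrix, graph) for the given concept vectors."""
    concepts = list(vectors)
    distance = [[cosine_distance(vectors[a], vectors[b]) for b in concepts]
                for a in concepts]
    # Assumption: association is the similarity implied by each distance.
    association = [[1.0 - d for d in row] for row in distance]
    graph = {c: {} for c in concepts}
    for i, a in enumerate(concepts):
        for j, b in enumerate(concepts):
            # Keep an edge only when the association is strong enough.
            if i != j and association[i][j] >= threshold:
                graph[a][b] = association[i][j]
    return distance, association, graph

# Hypothetical concept embeddings (e.g. the output of a word2vec model).
vectors = {
    "aspirin": [0.9, 0.1, 0.0],
    "platelet": [0.8, 0.3, 0.1],
    "insulin": [0.0, 0.2, 0.9],
}
distance, association, graph = build_knowledge_structure(vectors)
```

In this example only "aspirin" and "platelet" end up connected, because their vectors point in nearly the same direction while "insulin" is far from both.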

The system of claim 8,
wherein the phrase search unit
obtains the search result by searching for phrases that match the user's initial query using TF-IDF, BM25, and LM techniques.
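
The claim does not expand the abbreviation LM. Assuming it refers to language-model (query-likelihood) retrieval, as the term is commonly used in information retrieval, a minimal sketch with Dirichlet smoothing could look like this. The function name, the smoothing parameter `mu`, and the sample passages are assumptions for illustration only.

```python
import math
from collections import Counter

def query_likelihood(query_terms, passage, collection_tf, collection_len, mu=2000.0):
    """Log P(query | passage) under a unigram model with Dirichlet smoothing."""
    tf = Counter(passage)
    score = 0.0
    for term in query_terms:
        p_collection = collection_tf[term] / collection_len
        if p_collection == 0.0:
            continue  # term never seen in the collection: skipped in this sketch
        score += math.log((tf[term] + mu * p_collection) / (len(passage) + mu))
    return score

# Hypothetical passages, already split into tokens.
passages = [
    "aspirin reduces the risk of myocardial infarction".split(),
    "insulin therapy for type 2 diabetes".split(),
    "aspirin dosage and bleeding risk".split(),
]
collection_tf = Counter(token for p in passages for token in p)
collection_len = sum(len(p) for p in passages)
# A small mu is used here because the toy collection is tiny.
lm_scores = [
    query_likelihood(["aspirin", "risk"], p, collection_tf, collection_len, mu=10.0)
    for p in passages
]
```

Higher (less negative) scores indicate passages whose word distribution makes the query more likely.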

The system of claim 8,
wherein the association information extraction unit
extracts the core concepts through natural language processing of a predetermined number of top-ranked phrases from the initial phrase search result, and then performs deep-learning-based embedding with the Word2Vec algorithm to convert each core concept into a vector.

The system of claim 8,
wherein the core concept search unit
uses the PageRank algorithm to find the most important keywords in the knowledge structure.

delete

The system of claim 8,
wherein the knowledge structure is automatically generated by integrating the plurality of top-ranked medical phrases retrieved with the initial query, and keywords for query expansion are automatically extracted from the generated knowledge structure.

A phrase search method comprising:
extracting, by a phrase extraction unit, paragraph-unit phrases from a plurality of documents;
indexing, by a phrase search unit, the extracted phrases and searching for phrases matching an input initial query to obtain an initial phrase search result;
extracting, by an association information extraction unit, core concepts from a predetermined number of top-ranked phrases in the initial phrase search result through natural language processing, and then extracting association information between the core concepts;
generating, by a knowledge structure generation unit, a knowledge structure using the association information between the core concepts;
searching, by a core concept search unit, for a core concept of the constructed knowledge structure;
adding, by a query expansion unit, a concept located between the core concept found in the knowledge structure and the initial query to a query set; and
re-executing, by the phrase search unit, the phrase search using the expanded query set,
wherein extracting the paragraph-unit phrases from the plurality of documents comprises, after converting the file format of the plurality of documents in a specific field, extracting text and images using the PDFBox library, performing text preprocessing that separates the text into sentences and paragraphs using the OpenNLP library, and extracting paragraph-unit phrases by defining each extracted paragraph as one phrase,
wherein generating the knowledge structure using the association information between the core concepts comprises:
constructing a distance matrix between the core concepts extracted from the predetermined number of top-ranked phrases and generating an association matrix using the similarity between the obtained distances as the numerical values of each row; and
generating the knowledge structure using the association matrix, and
wherein adding the concept located between the core concept in the knowledge structure and the initial query to the query set comprises adding to the query set a concept word located between the core concept in the knowledge structure and the initial query, and, before the query expansion, using an external ontology to determine whether the concept word is a term of the specific field, and not adding the concept word to the query set if it is not determined to be a term of the specific field.
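
The query expansion step of this claim can be sketched as a path search between each query term and the core concept, followed by an ontology filter. This is only one possible reading: the claim does not say how "located between" is computed, so the sketch assumes a breadth-first shortest path in the knowledge structure. The set `ontology_terms` is a hypothetical stand-in for a lookup in an external ontology, and all names and sample data are illustrative.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search; returns the node list from start to goal, or [] if unreachable."""
    if start not in graph or goal not in graph:
        return []
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph[node]:
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return []

def expand_query(query_terms, core_concept, graph, ontology_terms):
    """Add concepts lying between each query term and the core concept, if the ontology knows them."""
    expanded = list(query_terms)
    for term in query_terms:
        path = shortest_path(graph, term, core_concept)
        for concept in path[1:-1]:  # intermediate nodes only
            if concept in ontology_terms and concept not in expanded:
                expanded.append(concept)
    return expanded

# Hypothetical knowledge structure and ontology vocabulary.
graph = {
    "headache": {"nsaid": 0.7},
    "nsaid": {"headache": 0.7, "study": 0.6},
    "study": {"nsaid": 0.6, "aspirin": 0.8},
    "aspirin": {"study": 0.8},
}
ontology_terms = {"headache", "nsaid", "aspirin"}
expanded = expand_query(["headache"], "aspirin", graph, ontology_terms)
```

Here "nsaid" is added because it lies on the path and is found in the ontology, while "study" is dropped because it is not recognized as a term of the field.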

The method of claim 14,
wherein the knowledge structure is automatically generated by integrating a plurality of top-ranked related phrases retrieved with the initial query, and keywords for query expansion are automatically extracted from the generated knowledge structure.