KR20240168118A

KR20240168118A - Document Classification System and Method Using Entity Group Scores

Info

Publication number: KR20240168118A
Application number: KR1020230065717A
Authority: KR
Inventors: 이성수; 김연수
Original assignee: 주식회사 엘지화학
Priority date: 2023-05-22
Filing date: 2023-05-22
Publication date: 2024-11-29

Abstract

The present invention relates to a method and a system for classifying documents using an existing named entity recognition model. According to the present invention, a named entity recognition model generated in advance is used to recognize the named entities in each document, an entity group score indicating the importance of the named entities is calculated, a representation vector for each document is obtained from the calculated entity group scores, and the documents are classified by analyzing the document representation vectors.

Description

Document Classification System and Method Using Entity Group Scores

The present invention relates to a document classification system and method based on named entity recognition and scoring of unique named entity groups, and more particularly, to a document classification system and method for accurately classifying documents that contain multiple named entities.

Text understanding comprises various subtasks such as named entity recognition, document classification, document summarization, relation analysis, and question answering (Q&A). Performing these tasks with supervised learning requires investing time and effort in generating suitable training data for each subtask, which can delay the construction of a comprehensive document analysis system.

In most cases, researchers try to reduce the effort of data preparation by using open data (such as benchmark datasets) suited to each subtask. For research literature in science and technology, however, open general-purpose data that fits the characteristics of the field is often unavailable, because the scarcity and difficulty of these fields mean that few people are able to work on them. For many researchers in science and technology, this has long been an unavoidable situation.

In this case, if the result of one task can be reused to perform another task, the time required for natural language processing work can be saved.

In addition, document classification is an important task in various fields such as information retrieval, knowledge management, and natural language processing. As the volume of digital documents increases rapidly, automatic classification technology is essential for efficient document management and retrieval.

In this regard, various techniques have been proposed for document classification, including keyword-based methods, topic modeling, and deep learning. However, these techniques often do not consider the relationships between named entities in a document, and are therefore limited in accurately classifying documents that contain multiple entities.

Patent Document 1: US Registered Patent No. US 9760634 B1
Patent Document 2: Republic of Korea Registered Patent No. KR 2006646 B1

Accordingly, the present invention aims to provide a method and device for classifying documents that use the named entity recognition model already employed in named entity recognition, an existing natural language processing subtask.

In order to solve the above-described problem, the present invention provides a document classification system comprising:
a. a control unit;
b. a document input unit configured to receive document data for a plurality of documents included in a predetermined document pool;
c. a named entity recognition unit that recognizes predetermined named entities included in the document data of the plurality of documents;
d. a document entity group score calculation unit that, for each document, groups the named entities based on the named entity recognition result and calculates a score for each entity group; and
e. a document classification unit that classifies the plurality of documents based on the entity group scores calculated by the document entity group score calculation unit.

Meanwhile, the entity group score Sn for the n-th entity group is calculated by Equation 1 below.

(Equation 1)

Here, ecp denotes the number of named entities included in the entire document pool, ecd denotes the number of named entities in the document for which the entity group scores are being calculated, dfe denotes the frequency of documents containing the specific entity group, and ld denotes the length of the target document.

Here, the document classification unit classifies the plurality of documents by using the calculated scores for each entity group as features that form the document representation of each document.

In addition, the document classification system according to the present invention further includes a vector space analysis unit that maps the calculated entity group scores to the entity groups and, on that basis, performs dimensionality reduction and cluster analysis to generate a representation vector and cluster information for each document. The document classification unit identifies and classifies documents with similar characteristics based on the representation vectors and cluster information generated by the vector space analysis unit.

In addition, the present invention provides a document classification method comprising:
a. a document receiving step of receiving document data for a plurality of documents included in a predetermined document pool;
b. a named entity recognition step of recognizing predetermined named entities included in the document data of the plurality of documents;
c. a score calculation step of, for each document, grouping the named entities based on the named entity recognition result and calculating a score for each entity group; and
d. a document classification step of classifying the plurality of documents based on the scores for each entity group.

According to the present invention, reusing a pre-trained named entity recognition model for document classification saves computing resources. Furthermore, entity group scores that indicate the importance of each named entity within the document pool and within individual documents are calculated from the named entity recognition results and used for classification, which provides a document classification method and system that are more effective than conventional approaches.

The following drawings attached to this specification illustrate preferred embodiments of the present invention and, together with the detailed description above, serve to further the understanding of the technical idea of the present invention. The present invention should therefore not be construed as being limited to the matters shown in these drawings.

FIG. 1 is a block diagram showing a document classification system according to an embodiment of the present invention. FIG. 2 is a flowchart showing a document classification method according to an embodiment of the present invention. FIG. 3 is a table showing an example of named entities recognized in given documents and the entity group scores calculated for them. FIG. 4 shows the results of vector dimensionality reduction analysis. FIG. 5 is an example diagram explaining the vector space analysis and document classification process. FIG. 6 is a block diagram of a computer system that performs the document classification method according to the present invention.

FIG. 1 is a block diagram showing a document classification system according to an embodiment of the present invention. The document classification system includes a control unit 110, a document input unit 120, a named entity recognition unit 130, a document entity group score calculation unit 140, a vector space analysis unit 150, and a document classification unit 160. The system may also include a memory device 170 that stores a named entity recognition model and a document classification model.

The document input unit 120 is configured to receive document data for a plurality of documents included in a predetermined document pool. The document data may include text, metadata, or other related information.

The named entity recognition unit 130 is configured to recognize predetermined named entities included in the document data of the plurality of documents. A named entity is specific information related to a document and may include a name, date, location, organization, or other specific information.

The document entity group score calculation unit 140 calculates an entity group score for each entity group based on the named entity recognition result of each document. An "entity group" is a set of named entities (nouns or phrases) in a document that relate to a specific category. An entity group includes a set of keywords found in contexts related to that category and the set of words surrounding those keywords. For example, if the category of an entity group is 'medicine', the group may consist of words and phrases related to medicine (e.g., disease names, treatments, and drugs).
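The grouping described above can be sketched as follows. This is a minimal illustration, assuming the named entity recognition model returns each mention together with its entity group label. The mention texts and the `group_entities` helper are hypothetical; only the group label names appear elsewhere in this specification.

```python
from collections import defaultdict

# Hypothetical NER output for one document: (mention text, entity group label).
# The mention texts are illustrative and are not taken from the specification.
recognized = [
    ("PLGA", "biodegrad_poly"),
    ("polylactic acid", "biodegrad_poly"),
    ("tensile strength", "mechanical_prop"),
    ("PLGA", "biodegrad_poly"),
]


def group_entities(mentions):
    """Collect recognized mentions under their entity group label."""
    groups = defaultdict(list)
    for text, label in mentions:
        groups[label].append(text)
    return dict(groups)


grouped = group_entities(recognized)
for label, texts in grouped.items():
    print(label, len(texts), texts)
```

Repeated mentions are kept rather than deduplicated, since the scores described later depend on how often entities are mentioned.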

The present invention is characterized by classifying documents using these entity group scores. The present invention analyzes the relationship between a set of keywords related to a specific category and the set of words appearing around them, and uses this analysis to determine whether a new document is related to that category.

In one embodiment, the entity group score of a given entity group n is calculated according to Equation 1 below.

(Sn is the entity group score of a given entity group n according to the first aspect of the present invention, ecp is the number of named entities in the entire document pool, ecd is the number of named entities in the document for which the entity group score is being calculated, dfe is the frequency of documents containing the specific entity group, and ld is the length of the target document.)
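The four quantities defined above can be gathered from named entity recognition results as in the sketch below. It deliberately stops short of combining them into Sn, because the form of Equation 1 is not reproduced in this text. Several choices are assumptions made only for illustration: counts are taken per entity group, dfe is counted as a number of documents, ld is measured in whitespace-separated tokens, and the pool contents and the `equation1_inputs` helper are hypothetical.

```python
def equation1_inputs(pool, doc_id, group):
    """Return (ecp, ecd, dfe, ld) for one target document and one entity group.

    pool maps a document id to {"tokens": [...], "mentions": [(text, group), ...]}.
    Assumption: entity counts are restricted to the given group.
    """
    target = pool[doc_id]
    # ecp: entity mentions of this group across the whole document pool
    ecp = sum(1 for doc in pool.values() for _, g in doc["mentions"] if g == group)
    # ecd: entity mentions of this group in the target document
    ecd = sum(1 for _, g in target["mentions"] if g == group)
    # dfe: number of documents that contain this entity group at least once
    dfe = sum(1 for doc in pool.values()
              if any(g == group for _, g in doc["mentions"]))
    # ld: length of the target document (assumed here to be a token count)
    ld = len(target["tokens"])
    return ecp, ecd, dfe, ld


# Hypothetical three-document pool.
pool = {
    "doc1": {
        "tokens": "PLGA nanoparticles release PLGA slowly modulus".split(),
        "mentions": [("PLGA", "biodegrad_poly"), ("PLGA", "biodegrad_poly"),
                     ("modulus", "mechanical_prop")],
    },
    "doc2": {
        "tokens": "PLA films degrade quickly".split(),
        "mentions": [("PLA", "biodegrad_poly")],
    },
    "doc3": {
        "tokens": "melt viscosity rises".split(),
        "mentions": [("viscosity", "rheological_prop")],
    },
}

inputs = equation1_inputs(pool, "doc1", "biodegrad_poly")
print(inputs)
```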

In another embodiment, the entity group score may be calculated according to Equation 2 below.

(Sn' is the score of the n-th entity group according to the second aspect of the present invention; frequency = sum(number of word mentions per named entity) / total document length; weight is a value set in advance by the user for each named entity.)

An example of the weight values in Equation 2 is as follows.

biodegrad_poly=1, biodegrad_prop=1, mechanical_prop=0.7, poly_struc=0.7, rheological_prop=0.7
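One possible reading of the second scoring variant is sketched below. The frequency term follows the definition given above and the weights are the example values listed. The way the two are combined, as the product weight × frequency, is an assumption, since Equation 2 itself is not reproduced here. The mention counts, the document length, and the `weighted_group_scores` helper are hypothetical.

```python
# Example weights from the specification, keyed by entity group.
WEIGHTS = {
    "biodegrad_poly": 1.0,
    "biodegrad_prop": 1.0,
    "mechanical_prop": 0.7,
    "poly_struc": 0.7,
    "rheological_prop": 0.7,
}


def weighted_group_scores(mention_counts, doc_length, weights=WEIGHTS):
    """Score each entity group as weight * frequency (assumed combination).

    mention_counts maps group -> {entity text: number of mentions}.
    frequency = sum of mention counts in the group / total document length.
    """
    if doc_length <= 0:
        raise ValueError("doc_length must be positive")
    scores = {}
    for group, weight in weights.items():
        frequency = sum(mention_counts.get(group, {}).values()) / doc_length
        scores[group] = weight * frequency
    return scores


# Hypothetical counts for a 100-token document.
counts = {
    "biodegrad_poly": {"PLGA": 3, "PLA": 1},
    "mechanical_prop": {"tensile strength": 2},
}
scores = weighted_group_scores(counts, doc_length=100)
for group, score in scores.items():
    print(f"{group}: {score:.4f}")
```

Groups with no mentions receive a score of zero, so every document yields a score for all five groups.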

Next, the vector space analysis unit 150 maps the calculated entity group scores to the corresponding entity groups, generates a representation vector for each document based on this mapping, and then performs dimensionality reduction and clustering analysis on these vectors.
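The mapping from entity group scores to a representation vector can be sketched as below, assuming each document's vector simply lists its group scores in a fixed group order, with zero for groups that do not occur. The `representation_vector` helper and the score values are hypothetical, apart from the biodegrad_poly score of 11 for title1, which is the example given for FIG. 3.

```python
# Fixed ordering of entity groups; each position is one vector dimension.
GROUPS = ["biodegrad_poly", "poly_struc", "rheological_prop",
          "biodegrad_prop", "mechanical_prop"]


def representation_vector(group_scores, groups=GROUPS):
    """Turn a {group: score} mapping into a fixed-length list of floats."""
    return [float(group_scores.get(group, 0.0)) for group in groups]


# Hypothetical scores per document.
scores_by_document = {
    "title1": {"biodegrad_poly": 11, "mechanical_prop": 2.5},
    "title2": {"poly_struc": 4, "rheological_prop": 1},
}
vectors = {doc: representation_vector(s) for doc, s in scores_by_document.items()}
print(vectors["title1"])
```

Keeping the group order fixed ensures that the same dimension refers to the same entity group in every document, which the later dimensionality reduction and clustering steps rely on.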

The document classification unit 160 classifies the plurality of documents based on the calculated entity group scores, the representation vectors, and the cluster information. In one embodiment, the document classification unit 160 classifies the documents using the entity group scores as features of the document representation.

The document classification unit 160 may classify documents either by outputting the results of the dimensionality reduction and cluster analysis performed by the vector space analysis unit 150 as visual graphic data, or by tagging each document's data with a document classification category based on the results of the vector analysis.

Tagging each document with a document classification category based on the results of the vector analysis may be performed by assigning a cluster's tag value to the documents whose representation vectors are projected within a predetermined vector distance from the center point of that cluster, as determined by the cluster analysis.
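The distance-based tagging described above might look like the following sketch. It assumes Euclidean distance and leaves a document untagged (`None`) when it lies farther than the threshold from every center point. The `tag_documents` helper, the tag names, the two-dimensional vectors, and the threshold value are hypothetical.

```python
import math


def tag_documents(vectors, centroids, max_distance):
    """Assign each document the tag of its nearest cluster center,
    but only if that center lies within max_distance; otherwise None."""
    tags = {}
    for doc_id, vec in vectors.items():
        nearest = min(centroids, key=lambda tag: math.dist(vec, centroids[tag]))
        within = math.dist(vec, centroids[nearest]) <= max_distance
        tags[doc_id] = nearest if within else None
    return tags


# Hypothetical two-dimensional representation vectors and cluster centers.
centroids = {"cluster_0": [0.0, 0.0], "cluster_1": [10.0, 10.0]}
doc_vectors = {
    "title1": [0.5, 0.5],
    "title2": [9.0, 10.0],
    "title3": [5.0, 5.0],
}
tags = tag_documents(doc_vectors, centroids, max_distance=2.0)
print(tags)
```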

The memory device 170 stores a named entity recognition model configured to recognize named entities and a document classification model configured to classify documents using document representation feature values. The named entity recognition unit 130 and the document classification unit 160 access the memory device 170 to use the named entity recognition model and the document classification model, respectively.

FIG. 2 is a flowchart illustrating a document classification method according to an embodiment of the present invention. The method includes a document receiving step S210, a named entity recognition step S220, a score calculation step S230, a vector space analysis step S240, and a document classification step S250.

In the document receiving step S210, document data for a plurality of documents included in a predetermined document pool is received. The document data may include text, metadata, or other related information.

In the named entity recognition step S220, predetermined named entities included in the document data of the plurality of documents are recognized. A named entity may include a name, date, location, organization, or other specific information related to the document.

In the score calculation step S230, the score of each entity group is calculated based on the named entity recognition result of each document. This has been described in detail above with reference to FIG. 1 and the document entity group score calculation unit 140.

In the vector space analysis step S240, the calculated entity group scores are mapped to the entity groups, a representation vector for each document is generated based on this mapping, the representation vector of each document is projected into a vector space, and dimensionality reduction and/or cluster analysis is performed.

In the document classification step S250, the plurality of documents are classified based on the calculated entity group scores, the representation vectors, and the cluster information. In one embodiment, the document classification step S250 classifies the documents using the entity group scores as features of the document representation.

FIG. 3 shows tables containing the named entity recognition results for given documents and the entity group scores calculated for each entity group of each document from the recognized named entities.

FIG. 3(a) shows the named entities recognized in a document titled "How the composition and manufacturing parameters affect insulin release from polymeric nanoparticles".

FIG. 3(b) shows the case in which five entity groups are defined: biodegrad_poly, poly_struc, rheological_prop, biodegrad_prop, and mechanical_prop. For the named entities of FIG. 3(a), entity group scores are obtained according to Equation 1, each named entity is mapped to its entity group, and the resulting score of each entity group is shown. The 'title' column of the table identifies each document by its title. For example, the calculated entity group score of the biodegrad_poly entity group for the title1 document is 11.

FIGS. 4 and 5 are example diagrams showing the vector space analysis and document classification process based on the calculated entity group scores. In this example, the entity group scores are mapped to the entity groups, a representation vector for each document is generated based on this mapping, the representation vector of each document is projected into a vector space, and dimensionality reduction and cluster analysis are performed.

In the present invention, vector analyses such as dimensionality reduction and clustering apply known techniques. The distinguishing feature of the present invention is not the analysis technique itself but the use, as the input to clustering analysis when classifying documents, of representation vectors based on the entity group scores of the entity groups contained in each document.

In FIGS. 4 and 5, the representation vector for each document is generated based on the entity group scores. Cluster information is also generated to identify clusters of documents with similar characteristics. FIG. 4 shows the results of reducing the representation vectors of the documents, which are based on the calculated entity group scores, to one, two, and three dimensions. FIG. 5 shows the results of classifying the documents into cluster 0 through cluster 5 using the representation vectors based on the calculated entity group scores.
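The clustering step can be illustrated with a small pure-Python K-means sketch. It is a simplified stand-in for whatever known clustering implementation is actually used. The initialization (taking the first k vectors as starting centers), the two-dimensional vectors, and the `kmeans` helper are assumptions made for illustration; the example in FIG. 5 uses six clusters, while two are used here to keep the data small.

```python
import math


def kmeans(vectors, k, max_iterations=100):
    """Plain K-means with Euclidean distance.

    Assumption: the first k vectors serve as initial centers, which keeps
    the sketch deterministic. Returns (labels, centroids).
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: attach each vector to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids


# Hypothetical representation vectors (two dimensions for brevity).
vectors = [[0, 0], [10, 10], [0, 1], [10, 11], [1, 0], [11, 10]]
labels, centroids = kmeans(vectors, k=2)
print(labels)
print(centroids)
```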

In the graphs of FIGS. 4 and 5, "PCA" indicates that principal component analysis (PCA) was used for dimensionality reduction, and "Clustering KMeans" indicates that the classification was obtained with the K-means clustering technique. The axis labels pca_vec1, pca_vec2, and pca_vec3 denote the three principal components that explain the most variance in the data among the axes of the reduced space. This follows the known principal component analysis technique: the first principal component is the direction in which the variance of the data is greatest, the second is the direction orthogonal to the first in which the variance is next greatest, and the third is the direction orthogonal to both the first and the second in which the variance is greatest. Because the principal components extracted in this way explain much of the variance in the data, they are commonly used to reduce its dimensionality, and projecting the data onto the three principal components for visualization makes it possible to identify patterns in the document data.
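The extraction of principal components described above can be sketched without external libraries using the covariance matrix and power iteration with deflation. This is an illustrative stand-in for a standard PCA routine, not the implementation behind the figures. The `principal_components` helper, the starting vector, the tolerance, and the data rows are assumptions; at least two rows are required.

```python
import math


def principal_components(rows, n_components=3, iterations=500):
    """Return (components, projections) for a list of equal-length rows.

    Each component is a unit vector. Components are found one at a time by
    power iteration on the covariance matrix, which is then deflated so the
    next component is orthogonal to those already found.
    """
    n, dim = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    centered = [[r[j] - means[j] for j in range(dim)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    components = []
    for _ in range(min(n_components, dim)):
        # Arbitrary non-zero start; a start exactly orthogonal to the
        # leading direction would fail, which this sketch does not handle.
        vec = [1.0 + 0.1 * i for i in range(dim)]
        has_variance = True
        for _step in range(iterations):
            nxt = [sum(cov[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:  # no variance left to explain
                has_variance = False
                break
            vec = [x / norm for x in nxt]
        if not has_variance:
            break
        cov_vec = [sum(cov[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
        eigenvalue = sum(v * cv for v, cv in zip(vec, cov_vec))
        components.append(vec)
        # Deflation: remove the variance explained by this component.
        for i in range(dim):
            for j in range(dim):
                cov[i][j] -= eigenvalue * vec[i] * vec[j]
    projections = [[sum(c[j] * comp[j] for j in range(dim)) for comp in components]
                   for c in centered]
    return components, projections


# Hypothetical data lying on a line, so only one component carries variance.
rows = [[1, 2, 0], [2, 4, 0], [3, 6, 0], [4, 8, 0]]
components, projections = principal_components(rows)
print(components)
print(projections)
```

For data with variance in several directions, the same call returns up to three components, corresponding to the axes labeled pca_vec1, pca_vec2, and pca_vec3.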

The table in FIG. 5(b) identifies a representative document for each of the document groups clustered into clusters 0 through 5 by the vector analysis. Known techniques exist for identifying representative objects in clustering analysis. In this example the K-means clustering technique is applied, so the document whose representation vector serves as the center point of a cluster is taken as the representative document of that cluster. Identifying and displaying representative documents in this way also makes it possible to grasp the character of each cluster.
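A representative document per cluster could be selected as in the sketch below. Because a K-means center point is a mean and need not coincide with any document's vector, the sketch assumes the representative is the document whose representation vector lies closest to the center point. That selection rule, the `representative_documents` helper, and all values shown are assumptions for illustration.

```python
import math


def representative_documents(vectors, labels, centroids):
    """For each cluster, return the id of the document nearest its center.

    vectors: {doc_id: vector}, labels: {doc_id: cluster index},
    centroids: list of center vectors indexed by cluster.
    """
    best = {}
    for doc_id, vec in vectors.items():
        cluster = labels[doc_id]
        distance = math.dist(vec, centroids[cluster])
        if cluster not in best or distance < best[cluster][1]:
            best[cluster] = (doc_id, distance)
    return {cluster: doc_id for cluster, (doc_id, _) in best.items()}


# Hypothetical clustered documents.
doc_vectors = {
    "title1": [0.0, 0.0],
    "title2": [1.0, 1.0],
    "title3": [10.0, 10.0],
    "title4": [12.0, 12.0],
}
doc_labels = {"title1": 0, "title2": 0, "title3": 1, "title4": 1}
centers = [[0.8, 0.8], [10.5, 10.5]]
representatives = representative_documents(doc_vectors, doc_labels, centers)
print(representatives)
```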

In summary, the present invention provides a document classification system and method that classify documents effectively by recognizing the named entities in each document, calculating the entity group scores of each document from the recognized named entities, and using the entity group scores as features that represent each document. The vector space analysis module and the document classification module further improve the classification process by generating representation vectors and cluster information from the entity group scores. This enables efficient and accurate classification of documents with similar characteristics.

FIG. 6 is an exemplary block diagram showing a computer system for executing the document classification method according to the present invention. The computer system includes a processor 610, a memory 620, a document input module 630, a named entity recognition module 640, an entity group score calculation module 650, a vector space analysis module 660, and a document classification module 670.

The processor 610 is configured to execute instructions stored in the memory 620, which may include the named entity recognition model and the document classification model. The processor 610 controls the overall operation of the computer system and executes the document classification method according to the present invention.

The memory 620 stores the data and instructions that the processor 610 needs to execute the document classification method. The memory 620 may include various types of memory devices, such as RAM, ROM, flash memory, and other non-volatile memory.

The document input module 630 is configured to receive the document data of the documents included in the document pool. The document data may be input from an external memory device through an input unit of the computer system, or from an external device through a communication unit (not shown). The document data may include text, images, or other types of data related to the documents.

The named entity recognition module 640 recognizes predefined named entities included in the document data of the documents. The named entity recognition module 640 may identify and recognize the predefined named entities using the named entity recognition model stored in the memory 620.

The entity group score calculation module 650 calculates the entity group scores for each document based on the named entity recognition results. The entity group scores may be calculated using Equation 1 or Equation 2, as described above.

The vector space analysis module 660 maps the calculated entity group scores to the entity groups and performs dimensionality reduction and cluster analysis based on this mapping. The vector space analysis module 660 generates a representation vector and cluster information for each document.

The document classification module 670 classifies the documents based on the representation vectors and cluster information generated by the vector space analysis module 660. The document classification module 670 identifies and classifies documents with similar characteristics, and may generate output data so that the classification results can be output externally.

Each of the modules constituting the computer system of FIG. 6 may be a set of computer-executable algorithms or programs that are stored in the memory 620 and read by the processor 610 to perform the operations described above.

앞서 설명한 실시 예 외에도, 본 발명은 문서 분류 방법을 수행하는 컴퓨터 알고리즘이 저장된 컴퓨터 판독 가능한 기록 매체로 구현될 수 있다. 컴퓨터 판독 가능한 기록 매체는 광디스크, 자기 디스크, 플래시 메모리 장치 및 기타 비휘발성 메모리 장치와 같은 다양한 유형의 저장 매체를 포함할 수 있다. In addition to the embodiments described above, the present invention can be implemented as a computer-readable recording medium storing a computer algorithm for performing a document classification method. The computer-readable recording medium can include various types of storage media such as optical disks, magnetic disks, flash memory devices, and other nonvolatile memory devices.

The present invention can also be implemented as a computer program product including a computer-readable storage medium in which computer-readable program code is embedded. The computer-readable program code can be executed so that a processor performs the document classification method according to the present invention.

The present invention can be implemented as a computer system including a processor and a memory configured to perform the document classification method described above. The processor executes instructions stored in the memory to control the overall operation of the computer system and to perform the document classification method. The present invention can also be implemented as a method performed by a processor that executes instructions for performing the document classification method described above.

The present invention is described in more detail below with reference to examples.

<Example 1>

Example 1 of the present invention provides a method that saves human and material resources by reusing a previously built named entity recognition model for document classification. The method includes the following steps.

First, for the field of interest "biodegradable polymer materials", a named entity recognition model was created so that technical terms in the literature could be recognized as belonging to named entity groups covering five concepts: ① 'biodegradable polymer', ② 'polymer structure', ③ 'rheological properties', ④ 'biodegradation characteristics', and ⑤ 'mechanical properties'. A known artificial intelligence model was used as the named entity recognition model.

A large number of related documents were applied to the named entity recognition model, and the technical terms recognized as named entities in each document were assigned to the corresponding named entity groups. The named entity group scores were then calculated using Equation 1, and an abstracted "representation vector" corresponding to each document was obtained from the named entity group scores.

This is done through the following process.

a. The named entity group score is calculated for each named entity group. For example, a named entity group score is calculated over all named entities belonging to named entity group ① 'biodegradable polymer', another over all named entities belonging to named entity group ② 'polymer structure', and so on for the remaining groups.

b. An abstracted "representation vector" is generated from the named entity group scores calculated for each named entity group. The representation vector expresses, in vector form, the named entity group score of each named entity group in a given document. For example, if there are five named entity groups in total, the representation vector is a five-dimensional vector, and each dimension holds the score of the corresponding named entity group.
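Step b can be sketched as follows. This is a minimal illustration, assuming the group scores have already been computed with Equation 1. The group identifiers, the default of 0.0 for a group with no recognized entities, and the sample scores are all assumptions made for the sketch and do not come from the source.

```python
# Hypothetical identifiers for the five concepts named in Example 1.
GROUPS = [
    "biodegradable_polymer",
    "polymer_structure",
    "rheological_properties",
    "biodegradation_characteristics",
    "mechanical_properties",
]


def representation_vector(group_scores, groups=GROUPS):
    """Return one score per group, always in the same group order.

    group_scores maps a group name to its already computed group score.
    A group with no recognized entity in the document defaults to 0.0
    (an assumption made for this sketch).
    """
    return [float(group_scores.get(g, 0.0)) for g in groups]


# Illustrative scores only, not taken from the patent.
doc_scores = {"biodegradable_polymer": 0.8, "mechanical_properties": 0.3}
vector = representation_vector(doc_scores)
```

Keeping the group order fixed is what makes vectors from different documents comparable dimension by dimension.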

Thereafter, the representation vector of each document is projected into a vector space, and the vector space is analyzed to identify clusters of documents with similar characteristics. The characteristics of each cluster are then derived from the document located at the center of the vector space, and the documents can be classified accordingly.
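The clustering step can be illustrated with a small sketch. The source does not name a clustering algorithm, so plain k-means is swapped in here as an assumption, and "the document located at the center" is read as the document closest to a cluster centroid. The function names, the number of clusters, and the sample vectors are hypothetical.

```python
import math


def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(vectors, k, iterations=20):
    """Plain k-means; the first k vectors seed the centroids (sketch only)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each document to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Recompute each centroid; keep the old one if a cluster is empty.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return labels, centroids


def central_documents(vectors, labels, centroids):
    """For each cluster, the index of the document closest to its centroid."""
    central = {}
    for c, centroid in enumerate(centroids):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            central[c] = min(
                members, key=lambda i: math.dist(vectors[i], centroid)
            )
    return central


# Illustrative 5-dimensional representation vectors, not data from the patent.
docs = [
    [0.9, 0.1, 0.0, 0.8, 0.1],
    [0.1, 0.2, 0.9, 0.0, 0.8],
    [0.8, 0.2, 0.1, 0.9, 0.0],
    [0.0, 0.1, 0.8, 0.1, 0.9],
    [0.85, 0.15, 0.05, 0.85, 0.05],
]
labels, centroids = kmeans(docs, k=2)
central = central_documents(docs, labels, centroids)
```

In this toy data the first cluster groups the documents that score high on the biodegradable polymer and biodegradation dimensions, and its central document can be read to characterize the cluster.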

<Example 2>

Example 2 is the same as Example 1 except that the named entity unique score is calculated using Equation 2. In Equation 2, the weight for each term was set as follows: biodegrad_poly=1, biodegrad_prop=1, mechanical_prop=0.7, poly_struc=0.7, rheological_prop=0.7.
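The weights quoted above can be held in a simple lookup table, as in the sketch below. Equation 2 itself is not reproduced in this section, so the sketch only assumes that each weight acts as a multiplicative factor on an already computed group score; the actual equation may combine the terms differently. The function name and the sample scores are hypothetical.

```python
# Weights quoted in Example 2, keyed by the identifiers used there.
WEIGHTS = {
    "biodegrad_poly": 1.0,
    "biodegrad_prop": 1.0,
    "mechanical_prop": 0.7,
    "poly_struc": 0.7,
    "rheological_prop": 0.7,
}


def apply_weights(group_scores, weights=WEIGHTS):
    """Scale each group score by its weight.

    ASSUMPTION: the weight is a plain multiplicative factor. Equation 2
    is not reproduced here and may combine the terms differently.
    """
    return {g: s * weights[g] for g, s in group_scores.items()}


# Illustrative unweighted scores, not taken from the patent.
scores = {"biodegrad_poly": 0.5, "mechanical_prop": 1.0, "poly_struc": 2.0}
weighted = apply_weights(scores)
```

Under this reading, the two biodegradation-related groups keep their full score while the structural, mechanical, and rheological groups are damped to 70%.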

<Comparative Example 1>

In Comparative Example 1, a classification model capable of classifying documents into the desired classes is additionally built, regardless of whether an existing named entity recognition model is available.

Comparative Example 1 proceeds as follows.

a. Manually select the documents to be used for training the classification model.

b. Assign the desired class labels to the selected documents to create training, validation, and test data.

c. Train and test the classification model.

d. Input actual document data into the trained classification model to perform document classification.
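Step b of this procedure, in which labeled documents are divided into training, validation, and test data, can be sketched as below. The source does not state how the data are divided, so the 80/10/10 ratio, the fixed random seed, the function name, and the sample labels are all assumptions made for illustration.

```python
import random


def split_dataset(labeled_docs, train=0.8, valid=0.1, seed=0):
    """Shuffle labeled documents and split them into train/valid/test.

    The 80/10/10 ratio and the fixed seed are assumptions for this sketch;
    the source does not state how the data are divided.
    """
    items = list(labeled_docs)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train)
    n_valid = int(len(items) * valid)
    return (
        items[:n_train],
        items[n_train:n_train + n_valid],
        items[n_train + n_valid:],
    )


# Hypothetical (document id, class label) pairs produced by steps a and b.
labeled = [(f"doc{i}", "class_a" if i % 2 else "class_b") for i in range(20)]
train_set, valid_set, test_set = split_dataset(labeled)
```

Every document in this workflow has to be selected and labeled by hand before the split, which is the cost that Examples 1 and 2 avoid by reusing an existing named entity recognition model.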

The method of Comparative Example 1 can classify documents according to the desired classification criteria regardless of whether an existing named entity recognition model is available. However, because a named entity recognition model must additionally be created and a classification model constructed, additional resources must be invested.

110 Control unit
120 Document input unit
130 Named entity recognition unit
140 Document named entity group score calculation unit
150 Vector space analysis unit
160 Document classification unit
170 Memory device
610 Processor
620 Memory
630 Document input module
640 Named entity recognition module
650 Named entity group score calculation module
660 Vector space analysis module
670 Document classification module

Claims

A document classification system, comprising:
a. a control unit;
b. a document input unit configured to receive document data for a plurality of documents included in a specified document pool;
c. a named entity recognition unit configured to recognize predetermined named entities included in the document data of the plurality of documents, thereby recognizing the named entities within each document;
d. a document named entity group score calculation unit configured to group, for each document, the named entities based on the named entity recognition results and to calculate a named entity group score for each named entity group; and
e. a document classification unit configured to classify the plurality of documents based on the calculated named entity group score for each named entity group.

The document classification system according to claim 1,
wherein the named entity group score Sn of the named entity n is calculated by Equation 1 below.
(Equation 1)

(Here, ecp represents the number of named entities included in the entire document pool, ecd represents the number of named entities in the target document for which the named entity group score is calculated, dfe represents the frequency of documents containing a specific named entity group, and ld represents the length of the target document.)

The document classification system according to claim 1,
wherein the document classification unit
classifies the plurality of documents by using the calculated named entity group scores as feature values representing the document representation of each document.

The document classification system according to claim 1,
further comprising a vector space analysis unit configured to map the calculated named entity group scores to the named entity groups and to perform dimension reduction and cluster analysis based on the mapping, thereby generating a representation vector and cluster information for each document.

The document classification system according to claim 4,
wherein the document classification unit
identifies and classifies documents with similar characteristics based on the representation vectors and cluster information generated by the vector space analysis unit.

The document classification system according to claim 5,
further comprising a memory device storing a named entity recognition model configured to recognize named entities and a document classification model configured to classify documents by using features representing the document representation of each document,
wherein the named entity recognition unit and the document classification unit
are configured to access the memory device and to use the named entity recognition model and the document classification model, respectively.

A document classification method, comprising:
a. a document receiving step of receiving document data for a plurality of documents included in a specified document pool;
b. a named entity recognition step of recognizing predetermined named entities included in the document data of the plurality of documents, thereby recognizing the named entities within each document;
c. a score calculation step of grouping, for each document, the named entities based on the named entity recognition results and calculating a named entity group score for each named entity group; and
d. a document classification step of classifying the plurality of documents based on the named entity group scores.

The document classification method according to claim 7,
wherein the named entity group score is calculated by Equation 2 below.
(Equation 2)

(Here, ecp represents the number of named entities included in the entire document pool, ecd represents the number of named entities in the target document for which the named entity group score is calculated, dfe represents the frequency of documents containing a specific named entity group, and ld represents the length of the target document.)

The document classification method according to claim 7,
wherein the document classification step
uses the calculated named entity group scores as feature values representing the document representation of each document.

The document classification method according to claim 7,
further comprising a vector space analysis step of mapping the calculated named entity group scores to the named entity groups and performing dimension reduction and cluster analysis based on the mapping, thereby generating a representation vector and cluster information for each document.

The document classification method according to claim 10,
wherein the document classification step
identifies and classifies documents having similar characteristics based on the representation vectors and cluster information generated in the vector space analysis step.

A computer system performing the method of any one of claims 7 to 11.

A processor performing the method of any one of claims 7 to 11.

A recording medium having recorded thereon a computer algorithm for performing the method of any one of claims 7 to 11.