KR20180129001A

KR20180129001A - Method and System for Entity summarization based on multilingual projected entity space

Info

Publication number: KR20180129001A
Application number: KR1020170063884A
Authority: KR
Inventors: 최기선; 김은경
Original assignee: 한국과학기술원
Priority date: 2017-05-24
Filing date: 2017-05-24
Publication date: 2018-12-05
Anticipated expiration: 2037-05-24
Also published as: KR102046692B1

Abstract

The present invention provides a method and a system for generating an entity summary based on a multilingual property projected entity space. According to the present invention, the method for generating an entity summary based on a multilingual property projected entity space comprises: a step of extracting a triple indicating a category in a multilingual knowledge base to integrate information of identical entity units; a step of extracting a three-term relation indicating a category in the multilingual knowledge base to perform entity grouping; a step of finding a main description relation and a main entity-object correlation for each group based on entity groups constructed by a group construction unit, and calculating a weight of a triple of the multilingual knowledge base; a step of repeating analysis for all three-term relations, and aligning a summary in accordance with importance order for all three-term relations based on calculated weights; and a step of minimizing redundancy in accordance with a request of a user for the aligned summary, and retrieving the aligned summary from the highest of the importance order.

Description

Field of the Invention

The present invention relates to a method and system for generating entity summaries. Features of multiple languages are projected into a single space from a large-scale knowledge base written as RDF (Resource Description Framework) triples; entities are then clustered on the basis of the integrated category tags, and the importance of each triple is calculated and ranked accordingly.

Entity summarization is a technique that selects the key information about each entity from a large, entity-centric knowledge base and reorganizes it to fit an appropriate summary length. It is a core technology that is widely used in natural language processing applications such as large-scale data search, information extraction, and question answering.

As research on the Semantic Web and on linked data environments, in which the openness and connectivity of web data keep increasing, has become more active, the amount of information organically linked to a single entity on the web has grown enormous. Entity summarization is therefore widely studied as an essential technique for quickly and accurately retrieving only the important information from a very large knowledge base. Existing entity summarization systems partition a knowledge base built from information acquired in a single-language environment and then generate summaries. However, they are limited in expressing the inherent characteristics of entities, and thus in modeling the boundaries of the knowledge base, and they rely on an external lexical resource (WordNet) to expand sparse features. Because expansion is impossible for entities that are not registered in the external lexical resource, the methods and applicability of existing entity summarization systems are restricted.

Entity summarization was first defined by Gong Cheng, Thanh Tran, and Yuzhong Qu in the paper "RELIN: Relatedness and Informativeness-Based Centrality for Entity Summarization", presented at the International Semantic Web Conference (ISWC) in 2011. It is a technique that extracts a small, entity-level set of data so that information about a particular entity can be accessed quickly and conveniently from the RDF triple data in the large data space that keeps growing through Linked Open Data.

Applied as an additional service of a search system, entity summarization can deliver information about an entity quickly: from the information gathered from various data sources about the entity used in a search query, it provides the key points needed to describe that entity.

Google currently offers a similar service under the name Knowledge Graph, but it is not an automated technique.

From the integrated information of the various sources being released by big data companies and government agencies, the technique can provide the main basic information about many kinds of entities and support information retrieval about those entities.

In addition, when applied to a smartphone-based knowledge visualization service, where only part of a large volume of data can be shown first on a small screen, the technique can provide the key points by exposing the essential information first.

As for future commercialization, the interpretation of data and knowledge bases accessible as open data could be developed into a business. Related research is actively under way at companies working on knowledge provision and search systems, such as the Google Knowledge Graph mentioned above. The content elements that are essential for an entity can also be extracted and organized for use in e-learning curricula on concepts across various topics.

However, the prior art generates a summary from the relative importance between the entity appearing in an entity description and its property value (another entity), so information that is not essential for describing the given central entity may be included in the summary. In addition to the limitation of relying on external resources, the prior art has the weakness that summary performance can degrade when a single facet contains many pieces of important information describing the characteristics of the entity. Furthermore, the external lexical resource WordNet has been used to expand the features of long word sequences that can be inferred from an entity, but it cannot be used for entity names that are not registered in the dictionary or for language data for which no dictionary is defined.

The technical problem addressed by the present invention is to provide a method and system for generating a summary of a given entity. Taking as input a knowledge base, the entity to be summarized, and the summary length, the invention integrates the multilingual knowledge bases that describe the entity, identifies the key information of each entity cluster into which the knowledge base is divided, ranks the triples of the entity description, and returns the ranked results in priority order up to the length the user requests.

In one aspect, the proposed method for generating an entity summary based on a multilingual feature-projected entity space comprises: extracting triples that indicate categories from a multilingual knowledge base and integrating the information per identical entity; extracting the three-term relations that indicate categories from the multilingual knowledge base and forming entity clusters; finding, for each cluster formed by the cluster construction unit, the main description relations and the main entity-object correlations, and calculating the weights of the triples of the multilingual knowledge base; repeating the analysis for all three-term relations and ranking the summary in order of importance based on the calculated weights; and minimizing redundancy in the ranked summary according to the user's request and retrieving items from the ranked summary starting with the most important.

The step of extracting the category-indicating triples from the multilingual knowledge base and integrating the information per identical entity links the triples written in a plurality of languages about the same entity, derives the features used in common across the languages and the features used independently in each language, and integrates the category characteristics of the entity that are produced by the plurality of language communities.

The step of extracting the category-indicating three-term relations from the multilingual knowledge base and forming entity clusters derives features for clustering entities from the triples that describe the category system in the multilingual knowledge base, and clusters the entities that have similar derived features.

The step of finding the main description relations and the main entity-object correlations for each cluster and calculating the weights of the triples of the multilingual knowledge base is based on (i) the main property types within an entity cluster, scored by combining the frequency of the property types used in the cluster with the inverse group frequency, and (ii) the entity-property value co-occurrence within each entity cluster.

The step of repeating the analysis for all three-term relations and ranking the summary in order of importance based on the calculated weights derives the main property types of the triples for each entity cluster, derives the entity-property value correlations, and ranks the summary in order of importance using a combination of the main property types and the entity-property value correlations.

The step of minimizing redundancy in the ranked summary according to the user's request and retrieving items starting with the most important minimizes duplication of the property types and property values used in the entity description and generates a summary of the length requested by the user.

In another aspect, the proposed system for generating an entity summary based on a multilingual feature-projected entity space comprises: a multilingual feature projection module that extracts category-indicating triples from a multilingual knowledge base and integrates the information per identical entity; an entity clustering module that extracts the category-indicating three-term relations from the multilingual knowledge base and forms entity clusters; an entity description ranking module that, based on the entity clusters formed by the cluster construction unit, finds the main description relations for each cluster and calculates the weights of the triples of the multilingual knowledge base, and likewise finds the main entity-object correlations for each cluster and calculates the weights of the triples; and an entity summary generation module that repeats the analysis for all three-term relations through the description relation analysis unit and the entity-object analysis unit, ranks the summary in order of importance based on the calculated weights, minimizes redundancy in the ranked summary according to the user's request, and retrieves items from the ranked summary starting with the most important.

The entity clustering module links the triples written in a plurality of languages about the same entity, derives the features used in common across the languages and the features used independently in each language, and integrates the category characteristics of the entity that are produced by the plurality of language communities.

The entity clustering module derives features for clustering entities from the triples that describe the category system in the multilingual knowledge base, and clusters the entities that have similar derived features.

The entity description ranking module is based on the main property types within an entity cluster, scored by combining the frequency of the property types used in the cluster with the inverse group frequency, and on the entity-property value co-occurrence within each entity cluster.

The entity summary generation module derives the main property types of the triples for each entity cluster, derives the entity-property value correlations, and ranks the summary in order of importance using a combination of the main property types and the entity-property value correlations.

The entity summary generation module minimizes duplication of the property types and property values used in the entity description and generates a summary of the length requested by the user.

According to embodiments of the present invention, entities are clustered through the projection of language-specific features, and the summaries reproduce as closely as possible the way an expert summarizes by including the items essential for describing an entity. The invention can therefore be used to provide efficient entity-level information retrieval and fast query processing over a large-scale knowledge base.

FIG. 1 is a flowchart illustrating a method for generating an entity summary based on a multilingual feature-projected entity space according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the projection into a single space of the category tags found for an entity in three different languages according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating the configuration of a system for generating an entity summary based on a multilingual feature-projected entity space according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating the category tags for an entity that exist in the Korean community according to an embodiment of the present invention.
FIG. 5 is a diagram illustrating the category tags for an entity that exist in the English community according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating the vectorization of the word roots found from category words according to an embodiment of the present invention.
FIG. 7 is a diagram illustrating the addition of weights to the vectorization of the word roots found from category words according to an embodiment of the present invention.
FIG. 8 is a diagram illustrating a comparison of the triple sets of two entities in one cluster according to an embodiment of the present invention.
FIG. 9 is a diagram illustrating duplicated property types for one entity according to an embodiment of the present invention.
FIG. 10 is a diagram comparing final summaries depending on whether duplicated property types are permitted according to an embodiment of the present invention.
FIG. 11 is a diagram comparing final summaries depending on whether duplicated property values are permitted according to an embodiment of the present invention.

The proposed entity summary generation system, which is based on a category system with projected multilingual features, exploits the fact that the knowledge base is publicly available in more than 120 languages. It integrates the characteristics of the entity-level knowledge obtained from sources that are distributed differently across languages in order to estimate entity clusters, and then generates an entity summary by applying a triple importance calculation within the knowledge base boundaries of the entity clusters so obtained. According to the proposed method, entity clustering with projected multilingual features performs better than clustering in a space that models only one language. Based on this clustering, triples that describe the inherent characteristics of an entity receive higher importance, so the essential triples that must be included in the entity summary can be selected and a high-quality summary can be generated.

In the detailed description of the present invention, the term 'resource' means any object in the RDF data model that can be identified by a URI, regardless of its form. One resource can have multiple property types and property values.

In the detailed description of the present invention, the term 'entity' means a resource that is a contiguous string of text capable of bearing a name, for example a person name, an organization name, or a place name.

In the detailed description of the present invention, the term 'property type' means an attribute of a resource expressed with an appropriate name, such as 'author' or 'title'.

In the detailed description of the present invention, the term 'property value' means the value corresponding to a property type. It may be described in detail in natural language, such as a character string or a number, or the property value may itself be a resource with its own properties.

In the detailed description of the present invention, the term 'triple' means the combination of a resource, a property type, and a property value.

In the detailed description of the present invention, the term 'entity description' means the set of triples that refer to the same entity as their resource. Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.
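The terms defined above can be illustrated with a small data structure. The following is a minimal sketch, not the patented implementation: the `Triple` type, the `entity_descriptions` helper, and the `ex:` identifiers are hypothetical names introduced only for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One triple: a resource, a property type, and a property value."""
    resource: str        # URI of the described resource (hypothetical ex: prefix)
    property_type: str   # name of the attribute
    value: str           # literal value or another resource


def entity_descriptions(triples):
    """Group triples by resource; each group is one entity description."""
    grouped = defaultdict(list)
    for triple in triples:
        grouped[triple.resource].append(triple)
    return dict(grouped)


# Illustrative data only; these triples are not taken from the patent.
triples = [
    Triple("ex:Jejudo", "ex:country", "ex:South_Korea"),
    Triple("ex:Jejudo", "ex:category", "Biosphere Reserve"),
    Triple("ex:Seoul", "ex:country", "ex:South_Korea"),
]
descriptions = entity_descriptions(triples)
```

Here `descriptions["ex:Jejudo"]` holds the two triples that make up the entity description of that resource.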

FIG. 1 is a flowchart illustrating a method for generating an entity summary based on a multilingual feature-projected entity space according to an embodiment of the present invention.

The proposed method for generating an entity summary based on a multilingual feature-projected entity space comprises: a step (110) of extracting triples that indicate categories from a multilingual knowledge base and integrating the information per identical entity; a step (120) of extracting the three-term relations that indicate categories from the multilingual knowledge base and forming entity clusters; a step (130) of finding, for each cluster formed by the cluster construction unit, the main description relations and the main entity-object correlations, and calculating the weights of the triples of the multilingual knowledge base; a step (140) of repeating the analysis for all three-term relations and ranking the summary in order of importance based on the calculated weights; and a step (150) of minimizing redundancy in the ranked summary according to the user's request and retrieving items from the ranked summary starting with the most important.

In step 110, triples that indicate categories are extracted from the multilingual knowledge base and the information is integrated per identical entity. First, the entity-level features from multiple knowledge bases written in various languages are arranged in a single space. The triples written in a plurality of languages about the same entity are linked in order to derive the features used in common across the languages and the features used independently in each language. The category characteristics of the entity that are produced by the plurality of language communities are then integrated.
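The integration in step 110 can be sketched as a set operation over per-language category tags. This is a minimal sketch under stated assumptions: the tags are assumed to be already mapped to a shared label, the function name `project_category_tags` is hypothetical, and the language keys `lang_a`, `lang_b`, `lang_c` are placeholders because the assignment of tags to specific languages is not given here.

```python
def project_category_tags(tags_by_language):
    """Merge the category tag sets of one entity into a single space.

    Returns (projected, common, specific):
      projected: tag -> sorted list of languages in which the tag occurs
      common:    tags found in every language
      specific:  tag -> the only language in which it occurs
    """
    all_tags = set().union(*tags_by_language.values())
    projected = {
        tag: sorted(lang for lang, tags in tags_by_language.items() if tag in tags)
        for tag in all_tags
    }
    n_languages = len(tags_by_language)
    common = {tag for tag, langs in projected.items() if len(langs) == n_languages}
    specific = {tag: langs[0] for tag, langs in projected.items() if len(langs) == 1}
    return projected, common, specific


# Placeholder language keys; the tag-to-language assignment is an assumption.
example_tags = {
    "lang_a": {"Biosphere Reserve", "Volcanic Island"},
    "lang_b": {"Biosphere Reserve", "Geography"},
    "lang_c": {"Biosphere Reserve", "Geopark"},
}
projected, common, specific = project_category_tags(example_tags)
```

Keeping the list of languages per tag, rather than only the merged set, preserves the information needed to distinguish features shared across languages from features that are specific to one language.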

In step 120, the three-term relations that indicate categories are extracted from the multilingual knowledge base and entity clusters are formed. Entity clustering means finding what the entities have in common and dividing them into sets accordingly. Features for clustering entities are derived from the triples that describe the category system in the multilingual knowledge base, and the entities that have similar derived features are clustered together. In the clustering process, the commonalities among entities can be obtained from the triples that use specific property types expressing the categorical characteristics of an entity in the knowledge base.

For example, when selecting features for entity clustering, the common prefix is removed, the boundaries of the words that make up the remaining noun phrase are identified, and each word is reduced to its root (stemming).
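The feature selection described above can be sketched as follows. This is only an illustration under assumptions: the prefix `Category:`, the label format with underscores, and the very simple plural-stripping rule in `naive_stem` are all hypothetical stand-ins; the stemming procedure actually used is not specified here.

```python
import re


def naive_stem(word):
    """Very rough stand-in for a stemmer: lowercases and strips simple plurals."""
    word = word.lower()
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def category_features(label, prefix="Category:"):
    """Strip the shared prefix, split the noun phrase into words, stem each word."""
    if label.startswith(prefix):
        label = label[len(prefix):]
    words = re.findall(r"[A-Za-z]+", label)  # underscores and spaces act as boundaries
    return [naive_stem(word) for word in words]


# Hypothetical category labels used only to exercise the functions.
features_a = category_features("Category:Volcanic_islands")
features_b = category_features("Category:Biosphere_reserves")
```

A production system would likely replace `naive_stem` with a language-appropriate stemmer or morphological analyzer, since the rule above only handles a few English plural forms.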

The given entities are divided into several clusters on the basis of the selected features, using one of the partitioning methods widely used in computing. Clusters are formed by taking as the cost function the sum of squared distances between each cluster center and the entities in that cluster, and minimizing it. In this process the similarity among entities in the same cluster increases and the similarity to entities in other clusters decreases. This step can be replaced by using the category system of an existing ontology. The present invention uses the k-means algorithm among the partitioning methods, but is not limited thereto.
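The partitioning step can be sketched with a plain k-means loop whose cost is the sum of squared distances described above. This is a minimal sketch, not the patented implementation: the feature vectors, the deterministic initialization from the first k points, and the fixed iteration count are assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=20):
    """Lloyd's algorithm; returns (assignment, centers, cost)."""
    centers = [list(p) for p in points[:k]]  # assumption: first k points as seeds
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest center.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Move every center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    # Cost function: sum of squared distances to the assigned cluster center.
    cost = sum(squared_distance(p, centers[a]) for p, a in zip(points, assignment))
    return assignment, centers, cost


# Toy two-dimensional feature vectors, purely illustrative.
points = [(1.0, 0.0), (0.0, 1.0), (0.9, 0.1), (0.1, 0.9)]
assignment, centers, cost = k_means(points, k=2)
```

In practice the vectors would be built from the stemmed category words, and the number of clusters and the initialization strategy would need to be chosen for the data at hand.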

In step 130, based on the entity clusters formed by the cluster construction unit, the main description relations and the main entity-object correlations are found for each cluster, and the weights of the triples of the multilingual knowledge base are calculated. In other words, the most important property types are derived for each entity cluster produced in step 120. Importance is defined as a statistical value that indicates how important a property type is within a particular cluster when there are several clusters. The weight calculation is based on the main property types within an entity cluster, scored by combining the frequency of the property types used in the cluster with the inverse group frequency, and on the entity-property value co-occurrence within each entity cluster.

The weight of a property type is defined as a combination of the following two features: the frequency of the property type word in the cluster (Property Frequency) and the inverse group frequency (Inverse Group Frequency).

The property frequency is the total number of occurrences of the property type word within the cluster. The inverse group frequency indicates how commonly a property type word appears across the whole set of clusters; it is obtained by dividing the total number of clusters by the number of clusters that contain the property type word and taking the logarithm.

The higher the frequency of a property type word within a particular cluster, and the fewer the clusters that contain that word, the higher the weight of the description relation. This filters out property type words that appear commonly in all clusters, so that the property types that are meaningfully important within a cluster can be identified.
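The two features can be combined in the same way as TF-IDF. The sketch below multiplies the property frequency by the inverse group frequency; the product is an assumption made by analogy, since the text only states that the weight is a combination of the two. The function name, cluster names, and property names are hypothetical.

```python
import math
from collections import Counter


def property_weights(clusters):
    """Score property types per cluster.

    clusters: cluster id -> list of the property types of all triples in it.
    Returns:  cluster id -> {property type: PF * IGF}, where
      PF  = occurrences of the property type in the cluster
      IGF = log(number of clusters / number of clusters containing the type)
    """
    n_clusters = len(clusters)
    clusters_containing = Counter()
    for property_types in clusters.values():
        clusters_containing.update(set(property_types))

    weights = {}
    for cluster_id, property_types in clusters.items():
        frequency = Counter(property_types)
        weights[cluster_id] = {
            p: pf * math.log(n_clusters / clusters_containing[p])
            for p, pf in frequency.items()
        }
    return weights


# Illustrative clusters; the names are not taken from the patent.
example_clusters = {
    "islands": ["label", "area", "area", "population"],
    "people": ["label", "birthDate", "birthDate"],
}
weights = property_weights(example_clusters)
```

In this toy input `label` occurs in every cluster, so its inverse group frequency is log(1) = 0 and it is filtered out, whereas `area` is frequent in one cluster only and receives the highest weight there.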

In step 140, the analysis is repeated for all three-term relations, and the summary is ranked in order of importance based on the calculated weights. The main property types of the triples are derived for each entity cluster, the entity-property value correlations are derived, and the summary is ranked in order of importance using a combination of the main property types and the entity-property value correlations.

In step 150, redundancy in the ranked summary is minimized according to the user's request, and items are retrieved from the ranked summary starting with the most important. Duplication of the property types and property values used in the entity description is minimized, and a summary of the length requested by the user is generated.

In other words, part of the summary is taken and returned according to the user's request. If the user requests a summary of length n, n triples are taken from the final result and returned. Minimizing redundancy is an essential function of the final summary. When the requested length n is extremely small (n=5), duplication among the triples included in the final result is therefore restricted as follows:

Final summary = property type duplication not allowed ∧ property value duplication not allowed

When the length n requested by the user is larger (n=10), duplication among the triples included in the final summary is relaxed as follows:

Final summary = property type duplication allowed ∧ property value duplication allowed

FIG. 2 is a diagram illustrating the projection into a single space of category tags found in three different languages for an entity, according to an embodiment of the present invention.

In the present invention, in order to generate entity-level summaries from a knowledge base composed of triples (entity, property, object) that describe entities, the entities in the knowledge base are clustered using the characteristics of the knowledge base's classification scheme as features, and the resulting clusters are used by the entity summarization system. To better model the taxonomic characteristics of entities, knowledge bases in multiple languages are integrated, which adds a knowledge extraction method and a knowledge extension device for entities.

Existing entity summarization techniques have all focused on entity information collected in a single language (for example, English). However, when collecting information about an entity from the vast amount of open data on the web, one of the most important considerations is that the information about an entity may differ depending on its source. FIG. 2 shows the semantic classification schemes 210, 220, and 230 that exist in three different languages for the entity Jejudo. Some classification tags, such as "Biosphere Reserve", are found redundantly across several languages, whereas others, such as "Volcanic Island", "Geography", and "Geopark", are found only in a particular language's database. This indicates that globally well-known semantic classifications tend to appear in multiple languages, and it illustrates the inconsistency that arises from lesser-known facts about an entity or from differences in cultural perspective across language communities. Therefore, if the semantic tags for a single entity from multiple languages can be integrated automatically (240) and their importance analyzed in a single space, unbiased information about the intrinsic properties of the entity can be collected. A further aim is to provide a global, common summary of the entity.

The present invention identifies optimized entity clusters based on a single space onto which multilingual features are projected, and provides an entity summarization system and method that include the essential entity-level information while minimizing redundancy. It thereby attempts a summary based on the concept of identifying the intrinsic properties of an entity, which has not been attempted in other methods. An intrinsic property of an entity is information that must be included in order to describe the entity and that distinguishes it from other entities. Specific embodiments of the present invention are described below with reference to the drawings. These are merely examples, and the present invention is not limited to them.

FIG. 3 is a diagram illustrating the configuration of an entity summary generation system based on a multilingual feature projected entity space, according to an embodiment of the present invention.

The proposed entity summary generation system 300 includes a multilingual feature projection module 311, an entity clustering module 312, an entity description ranking module 313, and an entity summary generation module 314.

The entity summary generation system 300 according to the present embodiment may include a processor 310, a bus 320, a network interface 330, a memory 340, and a database 350. The memory 340 may include an operating system 341 and an entity summary generation routine 342. The processor 310 may include the multilingual feature projection module 311, the entity clustering module 312, the entity description ranking module 313, and the entity summary generation module 314. In other embodiments, the entity summary generation system 300 may include more components than those shown in FIG. 3. However, most conventional components need not be shown explicitly. For example, the entity summary generation system 300 may also include other components such as a display or a transceiver.

The memory 340 is a computer-readable recording medium and may include random access memory (RAM), read only memory (ROM), and a permanent mass storage device such as a disk drive. The memory 340 may also store program code for the operating system 341 and the entity summary generation routine 342. These software components may be loaded from a computer-readable recording medium separate from the memory 340 using a drive mechanism (not shown). Such a separate computer-readable recording medium (not shown) may include a floppy drive, a disk, a tape, a DVD/CD-ROM drive, or a memory card. In other embodiments, the software components may be loaded into the memory 340 through the network interface 330 rather than from a computer-readable recording medium.

The bus 320 enables communication and data transfer between the components of the entity summary generation system 300. The bus 320 may be implemented using a high-speed serial bus, a parallel bus, a Storage Area Network (SAN), and/or other suitable communication technology.

The network interface 330 may be a computer hardware component for connecting the entity summary generation system 300 to a computer network. The network interface 330 may connect the entity summary generation system 300 to the computer network through a wireless or wired connection.

The database 350 stores and maintains all the information needed to generate entity summaries. Although FIG. 3 shows the database 350 as built inside the entity summary generation system 300, the invention is not limited to this arrangement. Depending on the system implementation or environment, the database may be omitted, or all or part of it may exist as an external database built on a separate system.

The processor 310 may be configured to process the instructions of a computer program by performing the basic arithmetic, logic, and input/output operations of the entity summary generation system 300. The instructions may be provided to the processor 310 by the memory 340 or the network interface 330 via the bus 320. The processor 310 may be configured to execute program code for the multilingual feature projection module 311, the entity clustering module 312, the entity description ranking module 313, and the entity summary generation module 314. Such program code may be stored in a recording device such as the memory 340.

The multilingual feature projection module 311, the entity clustering module 312, the entity description ranking module 313, and the entity summary generation module 314 may be configured to perform steps 110 to 150 of FIG. 1.

The entity summary generation system 300 may include the multilingual feature projection module 311, the entity clustering module 312, the entity description ranking module 313, and the entity summary generation module 314.

The multilingual feature projection module 311 extracts the triples that denote classification schemes from the multilingual knowledge base and integrates the information for each entity. First, entity-level features from multiple knowledge bases written in various languages are composed into a single space. Triples written in multiple languages about the same entity are linked in order to derive the features that are used in common across the languages and the features that are used independently in each language. The classification characteristics of the entity produced by the different language communities are then integrated.
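As a minimal sketch of this integration step, the following Python function merges the category tags that several language editions assign to one entity and separates the tags shared by every language from those found in only one. The function name, the language keys, and the assignment of tags to languages are illustrative assumptions; only the tag names come from the Jejudo example of FIG. 2.

```python
from collections import Counter

def project_tags(tags_by_language):
    """Merge per-language category tags for one entity into a single space."""
    counts = Counter()
    for tags in tags_by_language.values():
        counts.update(set(tags))  # count each tag at most once per language
    n_languages = len(tags_by_language)
    shared = {t for t, c in counts.items() if c == n_languages}
    language_specific = {t for t, c in counts.items() if c == 1}
    return counts, shared, language_specific

# Hypothetical input: which language carries which tag is assumed here,
# and the tags are assumed to be already mapped to English labels.
jejudo = {
    "lang_a": ["Biosphere Reserve", "Volcanic Island"],
    "lang_b": ["Biosphere Reserve", "Geography"],
    "lang_c": ["Biosphere Reserve", "Geopark"],
}
counts, shared, language_specific = project_tags(jejudo)
print(shared)
print(sorted(language_specific))
```

Counting each tag once per language keeps a tag that is repeated within one edition from looking as if it were shared across editions.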

The entity clustering module 312 extracts the triples that denote classification schemes from the multilingual knowledge base and performs entity clustering. Entity clustering means finding what entities have in common and grouping them accordingly. Features for clustering are derived from the triples that describe the classification schemes in the multilingual knowledge base, and entities with similar features are clustered together. In the clustering process, the commonalities between entities are obtained from triples that use specific property types representing the taxonomic characteristics of entities in the knowledge base.

The entity description ranking module 313 finds the main descriptive relations and the main entity-object correlations for each cluster, based on the entity clusters constructed by the clustering unit, and calculates the weights of the triples in the multilingual knowledge base. In other words, the most important property types are derived for each of the entity clusters generated as described above. This importance is defined as a statistical measure of how much a property type matters within a particular cluster when there are several clusters. The weight calculation is based on the main property types within the entity cluster, given by a combination of the property type frequency in the cluster and the inverse cluster frequency, and on the entity-property value co-occurrence information for each entity cluster.

The entity summary generation module 314 repeats the analysis for all triples and sorts the summary in order of importance over all triples based on the calculated weights. The main property types of the triples are derived according to the entity cluster, the entity-property value correlations are derived, and the summary is sorted by importance using the combination of the two.

Redundancy in the sorted summary is then minimized according to the user's request, and triples are retrieved from the sorted summary in descending order of importance. Duplication of the property types and property values used in the entity descriptions is minimized, and a summary of the length requested by the user is generated.

Final summary = property type duplication allowed ∧ property value duplication allowed

The method and system for generating an entity summary based on a multilingual feature projected entity space are described in more detail below with reference to FIGS. 4 to 10.

FIG. 4 is a diagram illustrating the category tags for an entity in the Korean-language community, according to an embodiment of the present invention.

The multilingual feature projection module described with reference to FIG. 3 collects all the entities in the knowledge base and, from the set of Wikipedia documents about those entities written in one or more languages, extracts the words used in the classification schemes.

The Wikipedia document set is a collection of documents written in multiple languages. Each document contains information about one or more classifications that express the background knowledge and opinions of its authors and editors about a particular entity, as well as their cultural background. For example, for the entity "Usain Bolt", the classification "People from Trelawny Parish" in English Wikipedia does not exist in Korean Wikipedia; it was used only when the English Wikipedia document was written and therefore cannot be found in Korean Wikipedia. The present invention integrates the classification schemes of several languages into a single vector space, so that an overall statistics-based score can be calculated for the words used.

The classification scheme is an important resource for processes such as information extraction and retrieval, and it is extracted from Wikipedia category tags. FIG. 4 shows the category tags that exist for the entity "Usain_Bolt" in Korean Wikipedia. These tags can be seen as a form of collaborative tagging among the authors of the Wikipedia documents about the entity, and data quality is maintained through collective intelligence.

FIG. 5 is a diagram illustrating the category tags for an entity in the English-language community, according to an embodiment of the present invention.

Algorithms for extracting classification schemes from Wikipedia documents are well known in the art and are not described here. The classification data extracted from different languages do not match in the number and coverage of the entities they include. Therefore, English is set as the pivot language, and all the words used in the entity classification schemes that exist in English are used to generate the classification pivot vector. FIG. 5 shows the category tags used for the entity "Usain Bolt" in the English Wikipedia document set, English being the pivot language. Compared with FIG. 4, which shows the category tags of the same entity in Korean Wikipedia, the number of tags differs, and so do the word sequences that make up the classifications.

FIG. 6 is a diagram illustrating the vectorization of the word roots found in classification scheme words, according to an embodiment of the present invention.

The classification pivot vector is built by identifying the boundaries of the words in the noun phrases that make up the classification schemes, finding the root of each word (stemming), and then generating a vector. The pivot vector has one dimension per word root, and for each root it holds a score (610) based on the number of times the corresponding word was found in the classification schemes of the English Wikipedia document set. Next, the classification schemes extracted from languages other than the pivot language, English, are integrated into the pivot vector.
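The construction of the pivot vector can be sketched as follows. This is an illustration under stated assumptions: `naive_stem` is a toy stand-in for a real stemmer, `build_pivot_vector` is a hypothetical name, and the category strings are made up for the example.

```python
import re
from collections import Counter

def naive_stem(word):
    """Toy stemmer (assumption): lowercase and strip a trailing plural 's'."""
    w = word.lower()
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return w

def build_pivot_vector(english_categories):
    """Count how often each word root occurs in the English category tags."""
    stems = Counter()
    for category in english_categories:
        for token in re.findall(r"[A-Za-z]+", category):  # word boundaries
            stems[naive_stem(token)] += 1
    return stems

# Hypothetical English category tags for one entity.
categories = ["Jamaican sprinters", "Olympic sprinters", "Jamaican athletes"]
pivot = build_pivot_vector(categories)
print(pivot)
```

A production system would likely use an established stemmer and drop function words such as "from" or "in"; both are omitted here to keep the sketch self-contained.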

FIG. 7 is a diagram illustrating the addition of weights to the vector of word roots found in classification scheme words, according to an embodiment of the present invention.

Correspondences between entities in different languages can be found by querying the interlanguage links in Wikipedia with SPARQL, which has the same effect as translating between the two languages. Processing SPARQL queries against a knowledge base is well known in the art and is not described here. Through this process, the weights of the English words that correspond to words extracted from the other-language Wikipedia document sets are added to the existing pivot vector: the weight of each word root already in the pivot vector increases by the number of times it is found. FIG. 7 shows an example in which the weights (710) for the words "Jamaican" (721) and "sprinters" (722) have increased by +2 and +1, respectively, over the weights in FIG. 6. Newly found words that are not in the pivot vector are not considered in the present invention.
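The weight update can be sketched in a few lines. The function name and the input values are hypothetical; the sketch assumes the other-language words have already been mapped to English roots through the interlanguage links.

```python
def add_foreign_counts(pivot, translated_stems):
    """Return a copy of the pivot vector with other-language counts added."""
    merged = dict(pivot)
    for stem in translated_stems:
        if stem in merged:  # roots absent from the pivot vector are ignored
            merged[stem] += 1
    return merged

# Hypothetical English pivot counts and hypothetical roots from another language.
pivot = {"jamaican": 3, "sprinter": 4, "olympic": 2}
foreign_stems = ["jamaican", "jamaican", "sprinter", "bronze"]
merged = add_foreign_counts(pivot, foreign_stems)
print(merged)
```

With these inputs the counts for "jamaican" and "sprinter" rise by 2 and 1, mirroring the kind of change shown in FIG. 7, while "bronze" is dropped because it has no dimension in the pivot vector.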

Using the word vectors extracted from the selected classification schemes as features, the given entities are divided into several clusters with a partitioning algorithm of the kind widely used in computing. The premise of the present invention is that the <property, value> pairs an entity shares with the members (neighbors) of its own cluster are more important for determining the entity's intrinsic properties than those it shares with entities outside the cluster. For example, given the two clusters A = {"Usain Bolt", "Carl Lewis", "Michael Johnson"} and B = {"Babe Ruth", "Hyun-jin Ryu"}, a summary for "Usain Bolt" in cluster A treats properties such as "sport event" or "medal information" as essential, whereas a summary for "Babe Ruth" places more emphasis on "position" or "team".

The entities are partitioned into clusters by defining the cost function as the sum of squared distances between each cluster's center and the entities in that cluster, and minimizing it. In this process, the similarity between entities in the same cluster increases, and the similarity to entities in other clusters decreases. The present invention uses the k-means algorithm among the partitioning methods, but the proposed technique is not limited to it.
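The following is a minimal, dependency-free sketch of k-means with the sum-of-squared-distances cost described above. The deterministic initialization (the first k points), the fixed iteration count, and the sample vectors are assumptions made for illustration, not details taken from the invention.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, iterations=20):
    """Plain k-means; returns (assignment, centroids, cost)."""
    centroids = [list(p) for p in points[:k]]  # assumption: first k points as seeds
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each entity vector to its nearest cluster center.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Cost: sum of squared distances between entities and their cluster center.
    cost = sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))
    return assignment, centroids, cost

# Hypothetical word-root count vectors for four entities.
points = [[3, 2, 0, 0], [2, 3, 0, 0], [0, 0, 2, 3], [0, 0, 3, 2]]
assignment, centroids, cost = k_means(points, k=2)
print(assignment, cost)
```

In practice the initial centers are usually chosen at random or with a seeding heuristic, and the loop stops once the assignment no longer changes.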

FIG. 8 is a diagram illustrating a comparison of the triple sets of two entities in one cluster, according to an embodiment of the present invention.

The importance of a property type is defined as a statistical measure of how much it matters within a particular cluster when there are several clusters. For example, as shown in FIG. 8, the property types (810) of "Usain_Bolt" and the property types (820) of "Michael_Johnson_(sprinter)", two entities belonging to the same cluster, have dbo:birthPlace, dbo:sport, and dbo:event in common, and these are clearly important descriptive relations for describing the two entities. By contrast, the property types dbo:honorificSuffix and dbo:collegeteam, which are not shared, cannot be regarded as essential for describing the core nature of the entities.
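The comparison in FIG. 8 amounts to a set intersection, as the short sketch below shows. Which of the two entities carries dbo:honorificSuffix and which carries dbo:collegeteam is an assumption made for the example; the text only states that these two property types are not shared.

```python
# Property types per entity; the placement of the two unshared types is assumed.
usain_bolt = {"dbo:birthPlace", "dbo:sport", "dbo:event", "dbo:honorificSuffix"}
michael_johnson = {"dbo:birthPlace", "dbo:sport", "dbo:event", "dbo:collegeteam"}

shared = usain_bolt & michael_johnson      # candidate key descriptive relations
not_shared = usain_bolt ^ michael_johnson  # used by only one of the two entities

print(sorted(shared))
print(sorted(not_shared))
```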

Accordingly, the weight of a property type is defined by combining two features: the frequency of the property type word in the cluster (Property Frequency) and the inverse cluster frequency (Inverse Group Frequency).

Property Frequency: the total number of occurrences of the property type word within the cluster, as given by Equation (1).

Equation (1)

Inverse Group Frequency: indicates how common a property type word is across the whole set of clusters. It is obtained by dividing the total number of clusters by the number of clusters that contain the property type word and taking the logarithm, as given by Equation (2).

Equation (2)

FIG. 9 is a diagram illustrating a duplicated property type for a single entity, according to an embodiment of the present invention.

Within a given cluster, the weight of a descriptive relation (property type) increases as the frequency of the property type word in that cluster increases and as the number of clusters containing that word decreases. This filters out property type words that are common to all clusters, so that the property types that are meaningfully important within a cluster can be identified.
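A minimal sketch of this weighting, written by analogy with TF-IDF, is shown below. Combining the two features by multiplication is an assumption, as are the function name, the cluster labels, and the sample data; the Property Frequency and Inverse Group Frequency follow the prose definitions given above.

```python
import math
from collections import Counter

def property_weights(clusters):
    """clusters maps a cluster id to the property types of all its triples."""
    n_clusters = len(clusters)
    containing = Counter()  # number of clusters in which each property type occurs
    for props in clusters.values():
        containing.update(set(props))
    weights = {}
    for cluster_id, props in clusters.items():
        pf = Counter(props)  # Property Frequency within this cluster
        weights[cluster_id] = {
            p: pf[p] * math.log(n_clusters / containing[p])  # PF x IGF (assumed product)
            for p in pf
        }
    return weights

# Hypothetical clusters and property type occurrences.
clusters = {
    "sprinters": ["dbo:birthPlace", "dbo:event", "dbo:event", "dbo:sport"],
    "baseball": ["dbo:birthPlace", "dbo:team", "dbo:position"],
}
weights = property_weights(clusters)
print(weights["sprinters"])
```

Here dbo:birthPlace occurs in every cluster, so its weight is zero, while dbo:event, which is frequent in one cluster only, receives the highest weight in that cluster.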

Equation (3)

In Equation (3), e denotes the subject of a triple, that is, the given entity, and v denotes the property value of the triple. (s, p, o) denotes a triple in the knowledge base, and E(e) denotes the cluster to which the given entity e belongs. |x| denotes the number of elements in the set x. At this stage, the most important subject-object (that is, entity-property value) relations are derived for each entity cluster. This determines, for each entity, its main counterpart entities; an example is shown in FIG. 9.

Here, for the two objects dbr:Spanish_Town (910) and dbr:Jamaica (920), which are defined for a single entity by the same property type, the correlation of each with the given entity Usain_Bolt is scored, and a weight is calculated to determine which of the two triples carries the relatively more important property value. The equation used is as follows.

Equation (4)

In Equation (4), v denotes the object of the triple. The more often the two entities whose connection weight is sought are found together in triples within the cluster, the higher the score. That is, the weight v-score is calculated from the co-occurrence information of the two entities that make up the triple. Specifically, the first two terms, which are joined by addition, are correlation-based scores for the two entities (the subject and the object) and are normalized to the range [0, 1].
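Since the exact form of Equation (4) is not reproduced in this text, the sketch below only illustrates the stated idea: a value scores higher the more often it appears as an object in the triples of the entity's cluster, and the score is normalized to [0, 1]. The counting rule, the max-normalization, the function name, and every triple other than the two birthPlace triples of FIG. 9 are assumptions.

```python
from collections import Counter

def value_scores(cluster_triples, candidate_values):
    """Score candidate values by how often they occur as objects in the cluster."""
    counts = Counter(o for (_s, _p, o) in cluster_triples)
    top = max((counts[v] for v in candidate_values), default=0)
    return {v: (counts[v] / top if top else 0.0) for v in candidate_values}

# Triples of the cluster E(e); the ex: entities are hypothetical cluster members.
cluster_triples = [
    ("dbr:Usain_Bolt", "dbo:birthPlace", "dbr:Spanish_Town"),
    ("dbr:Usain_Bolt", "dbo:birthPlace", "dbr:Jamaica"),
    ("ex:sprinter_a", "dbo:nationality", "dbr:Jamaica"),
    ("ex:sprinter_b", "dbo:birthPlace", "dbr:Jamaica"),
]
scores = value_scores(cluster_triples, ["dbr:Spanish_Town", "dbr:Jamaica"])
print(scores)
```

Under these assumptions dbr:Jamaica, which recurs across the cluster, outranks dbr:Spanish_Town, which is the kind of preference the comparison in FIG. 9 is meant to resolve.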

In the next step, all the triples of an entity (that is, its entity descriptions) are sorted using the two weights calculated above. The score used for the final ordering of the triples is defined as the sum of the "property type weight" on its own, the "subject-property value weight" on its own, and the product of the two, which captures their joint contribution. The equation is as follows.

Equation (5)
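Based on the prose description of the final score, the ranking can be sketched as a sum of the two weights and their product. Giving the three terms equal coefficients is an assumption, as are the function name and all the numeric weights in the example.

```python
def rank_triples(triples, p_score, v_score):
    """Sort triples by p + v + p * v, highest first (equal coefficients assumed)."""
    def score(triple):
        _s, p, o = triple
        ps = p_score.get(p, 0.0)
        vs = v_score.get(o, 0.0)
        return ps + vs + ps * vs
    return sorted(triples, key=score, reverse=True)

# Hypothetical weights for illustration only.
p_score = {"dbo:event": 0.9, "dbo:birthPlace": 0.2}
v_score = {"ex:100_metres": 0.8, "dbr:Jamaica": 1.0, "dbr:Spanish_Town": 0.3}
triples = [
    ("dbr:Usain_Bolt", "dbo:birthPlace", "dbr:Spanish_Town"),
    ("dbr:Usain_Bolt", "dbo:event", "ex:100_metres"),
    ("dbr:Usain_Bolt", "dbo:birthPlace", "dbr:Jamaica"),
]
ranked = rank_triples(triples, p_score, v_score)
for t in ranked:
    print(t)
```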

FIG. 10 is a diagram illustrating a comparison of final summaries according to whether duplicated property types are permitted, according to an embodiment of the present invention.

In the next step, a portion of the summary is taken and returned according to the length requested by the user. That is, when the user requests a summary of length n, the top n triples are taken from the final result and returned to the user. Minimizing redundancy is an essential function of the final summary, so when the requested length n is very small (n=5), duplication among the triples included in the final result is restricted as follows.

- Final summary = property type duplication not allowed ∧ property value duplication not allowed

That is, each property type used in the triples of the final summary appears only once, and each property value likewise cannot appear more than once across property types. FIG. 10 shows an example of an acceptable final summary (1010) and an unacceptable final summary (1020).

FIG. 11 is a diagram illustrating a comparison of final summaries according to whether duplicated property values are permitted, according to an embodiment of the present invention.

In contrast to the case described with reference to FIG. 10, when the length n requested by the user is larger (n=10), duplication among the triples included in the final summary is relaxed as follows.

- Final summary = property type duplication allowed ∧ property value duplication allowed

That is, an attribute type used in the triples included in the final summary may appear multiple times with different attribute values, and an object may likewise be used multiple times with multiple attribute types. Examples of a permissible final summary state (1110) and an impermissible final summary state (1120) are shown in FIG. 11.
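The two redundancy regimes described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the claimed implementation: the function name `select_summary`, the `strict` flag, and the sample triples are hypothetical, and the choice to still drop an exactly repeated attribute-value pair in the relaxed regime is an assumption made for the sketch.

```python
from typing import List, Tuple

# (subject, attribute type, attribute value); hypothetical representation
Triple = Tuple[str, str, str]

def select_summary(ranked: List[Triple], n: int, strict: bool) -> List[Triple]:
    """Take up to n triples from a list already sorted by descending importance.

    strict=True  : an attribute type and an attribute value may each appear once
                   (the small-n regime, e.g. n = 5).
    strict=False : attribute types and values may repeat; only an exact repeat of
                   the same (type, value) pair is skipped (assumption of this sketch).
    """
    seen_types, seen_values, seen_pairs = set(), set(), set()
    summary: List[Triple] = []
    for s, p, o in ranked:
        if len(summary) == n:
            break
        if strict and (p in seen_types or o in seen_values):
            continue  # would duplicate an attribute type or an attribute value
        if (p, o) in seen_pairs:
            continue  # identical statement already in the summary
        summary.append((s, p, o))
        seen_types.add(p)
        seen_values.add(o)
        seen_pairs.add((p, o))
    return summary

# Hypothetical ranked triples for one entity, most important first.
ranked = [
    ("Marie_Curie", "birthPlace", "Warsaw"),
    ("Marie_Curie", "field", "Physics"),
    ("Marie_Curie", "field", "Chemistry"),
    ("Marie_Curie", "knownFor", "Physics"),
    ("Marie_Curie", "spouse", "Pierre_Curie"),
    ("Marie_Curie", "field", "Physics"),
]

short_summary = select_summary(ranked, n=3, strict=True)
long_summary = select_summary(ranked, n=10, strict=False)
```

In the strict call, the second `field` triple and the `knownFor` triple are skipped because their attribute type or value is already present, so the summary reaches further down the ranking to fill its three slots.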

Experimental results for the method and system for generating an entity summary based on a multilingual feature projected entity space are described below.

To verify the performance of the proposed technique, a comparative performance evaluation was carried out using the same experimental data used by the existing state-of-the-art technique. The experimental data is gold-standard data in which 15 independent users were each given the triple sets for a total of 50 DBpedia entities and selected the important triples of each entity as a Top5 and a Top10; the Top10 important triples include all of the Top5. The performance of the system can be measured against the gold-standard data as the quality defined in Equation (6).

Equation (6)

In Equation (6), Summ(e) denotes the summary generated by the system for entity e, and Summ_i^I(e) denotes the summary that the i-th user in the gold-standard data selected for the given entity e. The performance of the system is calculated as the average over all users included in the gold-standard data. Because the triples included in a summary differ from one user to another, a target value is given for the quality that the system can ideally reach; this target value is determined by Equation (7). For reference, the target values, which indicate the agreement among the gold-standard data produced by the multiple users, are on average 1.9596 (n = 5) and 4.6770 (n = 10).

Equation (7)
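The bodies of Equations (6) and (7) are not reproduced in this text, so the following Python sketch only illustrates one reading that is consistent with the surrounding description. It assumes that quality is the number of triples shared between the system summary and each user's summary, averaged over users, and that the target value is the same overlap count averaged over all pairs of users. The function names `quality` and `agreement` and the sample data are hypothetical.

```python
from itertools import combinations
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

def quality(system: Set[Triple], gold: List[Set[Triple]]) -> float:
    """Average overlap between the system summary and each user's summary (assumed form)."""
    return sum(len(system & g) for g in gold) / len(gold)

def agreement(gold: List[Set[Triple]]) -> float:
    """Average overlap over all unordered pairs of user summaries (assumed form)."""
    pairs = list(combinations(gold, 2))
    return sum(len(a & b) for a, b in pairs) / len(pairs)

# Hypothetical triples and three hypothetical user summaries for one entity.
t1 = ("e", "birthPlace", "Warsaw")
t2 = ("e", "field", "Physics")
t3 = ("e", "spouse", "Pierre_Curie")
gold = [{t1, t2}, {t2, t3}, {t1, t2}]
system = {t2, t3}

q = quality(system, gold)   # overlaps 1, 2, 1 averaged over 3 users
a = agreement(gold)         # overlaps 1, 2, 1 averaged over 3 user pairs
```

Under this reading, both scores are bounded by the summary length n, which matches the reported target values lying below 5 and 10 respectively.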

Table 1 shows the performance of the existing state-of-the-art technique (FACES) and the proposed technique for summary lengths of 5 and 10. In this experiment, two comparison systems were added to analyze the effectiveness of the entity-cluster-based approach. Specifically, to compare a single-language environment with an environment into which multiple languages are projected, the proposed technique applied to a single-language environment was added (comparison group 1), and comparison group 2 was added to verify the advantage of knowledge base partitioning through entity clustering over a similar technique.

- FACES: prior art

- Multi-EGS: proposed technique, based on a multilingual feature projected classification system

- EGS: proposed technique (comparison group 1), based on a single-language classification system

- Typed: proposed technique (comparison group 2), based on single-language ontology types

Table 1 confirms that all of the systems that use entity clustering (Multi-EGS, EGS, and Typed) outperform FACES, which uses the existing knowledge base partitioning technique. It also confirms that performing entity clustering with classification-system tags is superior to the similar technique of using types predefined in an ontology, and that entity clustering based on integrated multilingual classification systems (Multi-EGS) yields a further performance improvement.

<Table 1>

Table 2 shows a detailed comparison between the prior art and the proposed technique: a comparative analysis of the results for the entities on which the proposed method showed the highest and the lowest quality among the entity summaries of the evaluation data. Although the quality score of the existing state-of-the-art technique is markedly higher than that of the proposed technique, its average quality score is high because increased duplication of objects makes its output score as similar to the answers of many users; since it contains redundant content, it is not suitable as a summary. In contrast, a summary produced by the proposed method not only includes the main features of a given entity but also minimizes redundancy within the summary, and can thus express the representativeness of the entity.

<Table 2>

In the prior art, in an environment where entity-centric data grows rapidly, the set of triples of the form <entity-attribute-object> became so large that it was difficult to identify key information quickly. The entity clustering proposed in the present invention makes topic-based classification of entities possible, and by integrating the relative individual features arising from various language resources, it improves entity clustering performance over the condition in which only a single language resource is used. By reproducing as closely as possible the summarization approach of an expert, who includes the items essential to describing an entity, the present invention can be expected to yield a system that produces more effective summaries, and it can be usefully applied to provide efficient entity-level information retrieval and fast query processing over a large-scale knowledge base.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware components and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications executed on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, the processing device is sometimes described as a single device, but those of ordinary skill in the art will appreciate that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as a parallel processor, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. The software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, computer storage medium, or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. The software and data may be stored on one or more computer-readable recording media.

The method according to an embodiment may be implemented in the form of program instructions executable through various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described above with reference to limited embodiments and drawings, those of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

A method for generating an entity summary, the method comprising:
extracting triples indicating a classification system from a multilingual knowledge base, and integrating information of identical entity units;
extracting ternary relations indicating the classification system from the multilingual knowledge base, and constructing entity clusters;
finding a main description relation and a main entity-object correlation for each cluster based on the entity clusters constructed by a cluster construction unit, and calculating weights of the triples of the multilingual knowledge base;
repeating the analysis for all ternary relations, and sorting the summary in order of importance for all ternary relations based on the calculated weights; and
minimizing redundancy in the sorted summary according to a request of a user, and retrieving from the sorted summary in descending order of importance.

The method for generating an entity summary according to claim 1,
wherein the step of extracting triples indicating a classification system from a multilingual knowledge base and integrating information of identical entity units comprises
linking triples written in a plurality of languages for the same entity to derive features commonly used across the plurality of languages, deriving features used independently in each of the plurality of languages, and integrating the classification-system features of the entity.

The method for generating an entity summary according to claim 1,
wherein the step of extracting ternary relations indicating the classification system from the multilingual knowledge base and constructing entity clusters comprises
deriving features for clustering entities from the triples that describe the classification systems present in the multilingual knowledge base, and clustering entities having similar features.

The method for generating an entity summary according to claim 1,
wherein the step of finding a main description relation and a main entity-object correlation for each cluster based on the entity clusters constructed by the cluster construction unit and calculating weights of the triples of the multilingual knowledge base
is performed based on the main attribute type in the entity cluster, which consists of a combination of the frequency of the attribute type used in the entity cluster and a score indicating the inverse cluster frequency, and based on the entity-attribute value co-

The method for generating an entity summary according to claim 1,
wherein the step of repeating the analysis for all ternary relations and sorting the summary in order of importance for all ternary relations based on the calculated weights comprises
deriving the main attribute types of the triples for each entity cluster, deriving the entity-attribute value correlations, and sorting the summary in order of importance using a combination of the main attribute type and the entity-attribute value correlation.

The method for generating an entity summary according to claim 1,
wherein the step of minimizing redundancy in the sorted summary according to a request of a user and retrieving from the sorted summary in descending order of importance comprises
minimizing duplication of the attribute types and attribute values used in the entity description statements, and generating a summary of the length requested by the user.

A system for generating an entity summary, the system comprising:
a multilingual feature projection module that extracts triples indicating a classification system from a multilingual knowledge base and integrates information of identical entity units;
an entity clustering module that extracts ternary relations from the multilingual knowledge base and constructs entity clusters;
an entity descriptor ranking module that, based on the entity clusters constructed by a cluster construction unit, finds a main description relation for each cluster and calculates weights of the triples of the multilingual knowledge base, and finds a main entity-object correlation for each cluster and calculates weights of the triples of the multilingual knowledge base; and
an entity summary generation module that repeats the analysis of all ternary relations through a description relation analysis unit and an entity-object analysis unit, sorts the summary in order of importance for all ternary relations based on the calculated weights, minimizes redundancy in the sorted summary according to a request of a user, and retrieves from the sorted summary in descending order of importance.

The system for generating an entity summary according to claim 7,
wherein the entity clustering module
links triples written in a plurality of languages for the same entity to derive features commonly used across the plurality of languages, derives features used independently in each of the plurality of languages, and integrates the classification-system features of the entity.

The system for generating an entity summary according to claim 7,
wherein the entity clustering module
derives features for clustering entities from the triples that describe the classification systems present in the multilingual knowledge base, and clusters entities having similar features.

The system for generating an entity summary according to claim 7,
wherein the entity knowledge ranking module
operates based on the main attribute type in the entity cluster, which consists of a combination of the frequency of the attribute type used in the entity cluster and a score indicating the inverse cluster frequency, and based on the entity-attribute value co-

The system for generating an entity summary according to claim 7,
wherein the entity summary generation module
derives the main attribute types of the triples for each entity cluster, derives the entity-attribute value correlations, and sorts the summary in order of importance using a combination of the main attribute type and the entity-attribute value correlation.

The system for generating an entity summary according to claim 7,
wherein the entity summary generation module
minimizes duplication of the attribute types and attribute values used in the entity description statements, and generates a summary of the length requested by the user.