KR101522049B1

KR101522049B1 - Coreference resolution in an ambiguity-sensitive natural language processing system

Info

Publication number: KR101522049B1
Application number: KR1020107006475A
Authority: KR
Inventors: Martin van den Berg; Richard Crouch; Franco Salvetti; Giovanni Lorenzo Thione; David Ahn
Original assignee: Microsoft Corporation
Priority date: 2007-08-31
Filing date: 2008-08-29
Publication date: 2015-05-20
Anticipated expiration: 2028-08-29
Also published as: RU2010107148A; JP2014238865A; JP2010538374A; EP2183684A4; BRPI0815826A2; WO2009029903A2; AU2008292779B2; MX2010002349A; CN101796508A; KR20100075451A; RU2480822C2; EP2183684A2; CA2698054A1; AU2008292779A1; CN101796508B; WO2009029903A3; ZA201001259B; CA2698054C

Abstract

Technologies are described herein for coreference resolution in an ambiguity-sensitive natural language processing system. Techniques for integrating reference resolution into a natural language processing system can process documents to be indexed within an information search and retrieval system. Ambiguity-awareness features, as well as ambiguity resolution functionality, can operate in coordination with coreference resolution. Annotation of coreferent entities, as well as of ambiguous interpretations, can be supported by in-line markup within the text content or by external entity maps. Information expressed within documents can be formally organized in terms of facts, that is, relationships between entities in the text. Expansion can support applying multiple aliases, or ambiguities, to an entity being indexed so that all possible references to, or interpretations of, that entity are captured in the index. Alternative stored descriptions can support retrieval of a fact by either the original description or a coreferential description.

Description

COREFERENCE RESOLUTION IN AN AMBIGUITY-SENSITIVE NATURAL LANGUAGE PROCESSING SYSTEM

In natural language, it is not uncommon to refer to entities by different descriptions. For example, pronouns are commonly used to stand in for nouns. Various other descriptions, or different forms of reference, may also be used to refer to an entity. As an example, consider the following portions of text:

"Pablo Picasso was born in Malaga."

"The Spanish painter became famous for his varied styles."

"Among his paintings is the large-scale Guernica."

"He painted this disturbing masterpiece during the Spanish Civil War."

"Picasso died in 1973."

Several kinds of linguistic variation are encountered here. For example, two different names are used, "Pablo Picasso" and "Picasso". The definite description "the Spanish painter" and the two pronouns "his" and "he" are all used to refer to Picasso. Two different expressions are used to refer to the painting: the name of the work, "Guernica", and the demonstrative description "this disturbing masterpiece".

Two linguistic expressions can be said to be coreferential if they have the same referent; in other words, if they refer to the same entity. A second phrase may be an anaphor that is anaphoric to a first phrase, in which case the first phrase is the antecedent of the second. Knowledge of the referent of the antecedent may be needed to determine the referent of the anaphor. The general task of finding coreferential expressions, anaphors, and their antecedents within a document may be called coreference resolution. Coreference resolution is the process of establishing that two expressions refer to the same referent, without necessarily establishing what that referent is. Reference resolution is the process of establishing what the referent is.

Expressions in a cluster of coreferential expressions, regardless of their anaphoric relationships, may be called aliases of one another. In the example above, the expressions "Pablo Picasso", "the Spanish painter", "his", "he", and "Picasso" form an alias cluster that refers to Picasso.
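As an illustration only, an alias cluster can be modeled as a mapping from an entity identifier to the mention strings that refer to it. The sketch below assumes a hypothetical resolver has already paired each mention in the Picasso passage with an identifier; the identifiers `E1` and `E2` and the function name `build_alias_clusters` are invented for this example and do not come from the disclosure.

```python
from collections import defaultdict

# Hypothetical resolver output for the Picasso passage: each mention is
# paired with an identifier (E1, E2) for the entity it is taken to denote.
MENTIONS = [
    ("Pablo Picasso", "E1"),
    ("the Spanish painter", "E1"),
    ("his", "E1"),
    ("his", "E1"),
    ("Guernica", "E2"),
    ("He", "E1"),
    ("this disturbing masterpiece", "E2"),
    ("Picasso", "E1"),
]


def build_alias_clusters(mentions):
    """Group mention strings by entity id.

    Mentions keep first-seen order; repeats are dropped case-insensitively.
    """
    clusters = defaultdict(list)
    seen = defaultdict(set)
    for text, entity_id in mentions:
        key = text.lower()
        if key not in seen[entity_id]:
            seen[entity_id].add(key)
            clusters[entity_id].append(text)
    return dict(clusters)


ALIAS_CLUSTERS = build_alias_clusters(MENTIONS)
```

Note that the cluster records only that its members corefer. Deciding what `E1` actually denotes is the separate task of reference resolution.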

Natural language expressions often exhibit ambiguity. Ambiguity arises when an expression can be interpreted in more than one way. For example, the sentence "The duck is ready to eat" can be interpreted as asserting either that the duck has been properly cooked or that the duck is hungry and needs to be fed.

Coreference resolution and ambiguity resolution are two examples of natural language processing operations that can be used to support the mechanical handling of language as it is commonly expressed by human users. Information processing systems, such as text indexing and querying in support of information retrieval, can benefit from increased application of natural language processing.

It is with respect to these considerations and others that the disclosure made herein is presented.

Technologies are described herein for coreference resolution in an ambiguity-sensitive natural language processing system. In particular, techniques are described for integrating coreference resolution functionality into a system that processes documents to be indexed within an information search and retrieval system. This integration can enhance indexing with information that supports coreference resolution, and ambiguous meanings, within natural language documents.

According to one aspect presented herein, information provided by a coreference resolution system can be integrated into a natural language processing system to improve that system's performance. One example of such a system is a document indexing and retrieval system.

According to another aspect presented herein, ambiguity-awareness features, as well as ambiguity resolution functionality, can operate in coordination with coreference resolution within a natural language processing system. Annotation of coreferent entities, as well as of ambiguous interpretations, can be supported by in-line markup within textual representations or, alternatively, by external entity maps.
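The two annotation styles can be contrasted with a small sketch. It assumes an external entity map is a list of character spans, each carrying one or more candidate entity identifiers (more than one identifier keeps an ambiguity open), and it renders the same information as in-line markup. The `<ref ids="...">` tag, the field names, and the `|` separator are hypothetical choices for this example; the disclosure does not specify a markup syntax.

```python
TEXT = "Picasso died in 1973. He painted Guernica."

# External entity map (hypothetical layout): the text itself is untouched and
# annotations live beside it as character spans with candidate entity ids.
ENTITY_MAP = [
    {"start": 0, "end": 7, "entities": ["E1"]},
    {"start": 22, "end": 24, "entities": ["E1"]},
    {"start": 33, "end": 41, "entities": ["E2"]},
]


def to_inline_markup(text, entity_map):
    """Render an external entity map as in-line markup.

    Assumes the spans do not overlap. Several ids on one span are joined
    with '|' so that ambiguous interpretations are all preserved.
    """
    pieces = []
    cursor = 0
    for span in sorted(entity_map, key=lambda s: s["start"]):
        pieces.append(text[cursor:span["start"]])
        ids = "|".join(span["entities"])
        mention = text[span["start"]:span["end"]]
        pieces.append(f'<ref ids="{ids}">{mention}</ref>')
        cursor = span["end"]
    pieces.append(text[cursor:])
    return "".join(pieces)


INLINE = to_inline_markup(TEXT, ENTITY_MAP)
```

An external map leaves the source text byte-for-byte intact, which can matter when several analyses annotate the same document; in-line markup keeps text and annotation together in a single stream.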

According to yet another aspect presented herein, facts can be extracted from text to be indexed. Information expressed within the text can be formally organized in terms of facts. Used in this sense, a fact can be any information contained in the text and need not be true. A fact can be represented as a relationship between entities, and can be stored in a semantic index as a relationship between entities stored in that index. In a fact-based retrieval system, a document can be retrieved if it contains a fact that matches a fact determined through analysis of the query.
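A minimal sketch of fact-based retrieval follows, assuming a fact is a (subject, relation, object) triple and that matching is exact equality. The `Fact` type, the normalized relation names, and the document identifiers are hypothetical; a real system would obtain them from parsing and semantic mapping rather than by hand.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    """A relationship between two entities; it need not be true."""
    subject: str
    relation: str
    obj: str


# Hypothetical facts extracted from two documents.
DOC_FACTS = {
    "doc1": [Fact("picasso", "paint", "guernica")],
    "doc2": [Fact("picasso", "die_in", "1973")],
}


def retrieve(query_fact, doc_facts):
    """Return the ids of documents containing a fact equal to the query fact."""
    return sorted(
        doc_id for doc_id, facts in doc_facts.items() if query_fact in facts
    )


HITS = retrieve(Fact("picasso", "paint", "guernica"), DOC_FACTS)
```

Because matching is on the whole relationship, a document that merely mentions both "picasso" and "guernica" without the painting relation would not be returned.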

According to yet another aspect presented herein, a process of expansion can support applying multiple aliases, or ambiguities, to an entity being indexed. Such expansion can support additional possible references to, or interpretations of, a given entity being captured in the semantic index. Alternative stored descriptions can support retrieval of a fact by either the original description or a coreferential description.
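Expansion can be pictured as indexing one extracted fact under every combination of its participants' aliases. The sketch below assumes facts are plain (subject, relation, object) tuples and that the alias lists were produced upstream by coreference resolution; the function names and the dictionary-based index are hypothetical simplifications.

```python
from itertools import product


def expand_fact(subject_aliases, relation, object_aliases):
    """Return every (subject, relation, object) combination over the aliases."""
    return set(product(subject_aliases, [relation], object_aliases))


def index_fact(index, doc_id, subject_aliases, relation, object_aliases):
    """Record the document id under each expanded form of the fact."""
    for fact in expand_fact(subject_aliases, relation, object_aliases):
        index.setdefault(fact, set()).add(doc_id)


# "He painted this disturbing masterpiece ..." with aliases supplied by a
# hypothetical coreference step: he -> picasso, masterpiece -> guernica.
INDEX = {}
index_fact(
    INDEX,
    "doc1",
    ["he", "picasso"],
    "paint",
    ["this disturbing masterpiece", "guernica"],
)
```

With this index, a lookup on `("picasso", "paint", "guernica")` finds `doc1` even though the sentence contains neither name. The cost is that the number of stored entries grows with the product of the alias counts.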

It should be appreciated that the above-described subject matter may also be implemented as a computer-controlled apparatus, a computer process, a computing system, or an article of manufacture such as a computer-readable medium. These and various other features will be apparent from a reading of the following detailed description and a review of the associated drawings.

This summary is provided to introduce, in simplified form, a selection of concepts that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.

FIG. 1 is a network architecture diagram illustrating an information search system according to aspects of an embodiment presented herein.
FIG. 2 is a functional block diagram illustrating various components of a natural language index and query system according to aspects of an embodiment presented herein.
FIG. 3 is a functional block diagram illustrating coreference resolution and ambiguity resolution within a natural language processing system according to aspects of an embodiment presented herein.
FIG. 4 is a logical flow diagram illustrating aspects of processes for ambiguity-sensitive indexing with coreference resolution according to aspects of an embodiment presented herein.
FIG. 5 is a computer architecture diagram showing an illustrative computer hardware and software architecture for a computing system capable of implementing aspects of an embodiment presented herein.

The following detailed description is directed to technologies for coreference resolution in an ambiguity-sensitive natural language processing system. Through the use of the technologies and concepts presented herein, coreference resolution functionality can be integrated into a natural language processing system that processes documents to be indexed for use in an information search and retrieval system. This integration can enhance the index with information that supports coreference resolution for the natural language documents being indexed.

While the subject matter described herein is presented in the general context of program modules that execute in conjunction with the execution of an operating system and application programs on a computer system, those skilled in the art will recognize that other implementations may be performed in combination with other types of program modules. Generally, program modules include routines, programs, components, data structures, and other types of structures that perform particular tasks or implement particular abstract data types. Moreover, those skilled in the art will appreciate that the subject matter described herein may be practiced with other computer system configurations, including hand-held devices, microprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like.

In the following detailed description, references are made to the accompanying drawings, which form a part hereof and which show, by way of illustration, specific embodiments or examples. Referring now to the drawings, in which like numerals represent like elements throughout the several figures, aspects of a computing system and methodology for coreference resolution in an ambiguity-sensitive natural language processing system are described.

Turning now to FIG. 1, details are provided regarding an illustrative operating environment for the implementations presented herein. In particular, a network architecture diagram 100 illustrates an information search system according to aspects of an embodiment presented herein. Client computers 110A-110D can interface through a network 140 to a server 120 to obtain information associated with a natural language engine 130. While four client computers 110A-110D are illustrated, it should be appreciated that any number of client computers 110A-110D may be in use. The client computers 110A-110D may be geographically distributed across the network 140, collocated, or any combination thereof. While a single server 120 is illustrated, it should be appreciated that the functionality of the server 120 may be distributed over any number of multiple servers 120. Such multiple servers 120 may be collocated, geographically distributed across the network 140, or any combination thereof.

According to one or more embodiments, the natural language engine 130 may support search engine functionality. In a search engine scenario, a user query may be issued from a client computer 110A-110D through the network 140 to the server 120. The user query may be in a natural language format. At the server 120, the natural language engine 130 may process the natural language query to support a search based on the syntax and semantics extracted from that query. Results of such a search may be provided from the server 120 through the network 140 back to the client computers 110A-110D.

One or more search indexes may be stored at, or in association with, the server 120. Information in a search index may be populated from a set of source information, or a corpus. For example, in a web search implementation, content may be collected and indexed from various web sites on various web servers (not illustrated) across the network 140. Such collection and indexing may be performed by software executing on the server 120 or on another computer (not illustrated). The collection may be performed by web crawlers or spider applications. The natural language engine 130 may be applied to the collected information so that natural language content collected from the corpus can be indexed based on the syntax and semantics extracted by the natural language engine 130. Indexing and searching are discussed in further detail with respect to FIG. 2.

The client computers 110A-110D may act as terminal clients, hypertext browser clients, graphical display clients, or other networked clients to the server 120. For example, a web browser application at the client computers 110A-110D may support interfacing with a web server application at the server 120. Such a browser may use controls, plug-ins, or applets to support interfacing to the server 120. The client computers 110A-110D can also use other customized programs, applications, or modules to interface with the server 120. The client computers 110A-110D can be desktop computers, laptops, handhelds, mobile terminals, mobile telephones, television set-top boxes, kiosks, servers, terminals, thin clients, or any other computerized devices.

The network 140 may be any communications network capable of supporting communications between the client computers 110A-110D and the server 120. The network 140 may be wired, wireless, optical, radio, packet switched, circuit switched, or any combination thereof. The network 140 may use any topology, and links of the network 140 may support any networking technology, protocol, or bandwidth, such as Ethernet, DSL, cable modem, ATM, SONET, MPLS, PSTN, POTS modem, PONS, HFC, satellite, ISDN, WiFi, WiMax, mobile cellular, any combination thereof, or any other data interconnection or networking mechanism. The network 140 may be an intranet, an internet, the Internet, the World Wide Web, a LAN, a WAN, a MAN, or any other network for interconnecting computer systems.

It should be appreciated that, in addition to the illustrated network environment, the natural language engine 130 can be operated locally. For example, a server 120 and a client computer 110A-110D may be combined onto a single computing device. Such a combined system can support search indexes stored locally or remotely.

Referring now to FIG. 2, a functional block diagram illustrates various components of a natural language engine 130 according to one exemplary embodiment. As discussed above, the natural language engine 130 can support information searches. To support such searches, a content acquisition process 200 is performed. Operations related to content acquisition 200 extract information from documents provided as text content 210. This information can be stored in a semantic index 250 that can be used for searching. Operations related to a user search 205 can support processing of a user-entered search query. The user query can take the form of a natural language question 260. The natural language engine 130 can analyze the user input to translate the query into a representation to be compared with information represented within the semantic index 250. The content and structure of the information in the semantic index 250 can support rapid matching and retrieval of documents, or portions of documents, that are relevant to the meaning of the query or natural language question 260.

The text content 210 may comprise documents in a very general sense. Examples of such documents include web pages, textual documents, scanned documents, databases, information listings, other Internet content, or any other information source. This text content 210 can provide a corpus of information to be searched. Processing the text content 210 can occur in two stages: syntactic parsing 215 and semantic mapping 225. Preliminary language processing steps may occur before, or at the beginning of, parsing 215. For example, the text content 210 may be separated at sentence boundaries. Proper nouns may be identified as the names of particular people, places, objects, or events. The grammatical properties of meaningful word endings may also be determined. For example, in English, a noun ending in "s" is likely to be a plural noun, while a verb ending in "s" may be a third person singular verb.
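Two of the preliminary steps mentioned above, sentence separation and reading grammatical hints from word endings, can be sketched as follows. Both functions are deliberately naive and purely illustrative: the regular expression ignores abbreviations, the part-of-speech label is assumed to be supplied by some other component, and the function names are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def suffix_hint(word, part_of_speech):
    """Return the grammatical hint suggested by a final 's', or None.

    The hint is only a likelihood: 'bus' is a singular noun, for instance.
    """
    if not word.lower().endswith("s"):
        return None
    if part_of_speech == "noun":
        return "likely plural"
    if part_of_speech == "verb":
        return "possibly third person singular"
    return None


SENTENCES = split_sentences("Picasso died in 1973. He painted Guernica.")
```

The hedged return values reflect the wording of the description: an ending makes a property likely, and later stages such as the parser decide.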

Parsing 215 may be performed by a syntactic analysis system such as the Xerox Linguistic Environment (XLE), which is offered here only as a general example and not to limit possible implementations of this description. The parser 215 can convert sentences to representations that make explicit the syntactic relations among words. The parser 215 can apply a grammar 220 associated with the specific language in use. For example, the parser 215 can apply a grammar 220 for English. The grammar 220 may be formalized, for example, as a lexical functional grammar (LFG) or with another suitable parsing mechanism, such as those based on Head-Driven Phrase Structure Grammar (HPSG), Combinatory Categorial Grammar (CCG), Probabilistic Context-Free Grammar (PCFG), or any other grammar formalism. The grammar 220 can specify possible ways of constructing meaningful sentences in a given language. The parser 215 may apply the rules of the grammar 220 to the strings of the text content 210.

A grammar 220 may be provided for various languages. For example, LFG grammars have been created for English, French, German, Chinese, and Japanese. Other grammars may be provided as well. A grammar 220 may be developed by manual acquisition, in which grammatical rules are defined by a linguist or dictionary writer. Alternatively, machine learning acquisition can involve the automated observation and analysis of many examples of text from a large corpus to automatically determine grammatical rules. A combination of manual definition and machine learning may also be used to acquire the rules of a grammar 220.

The parser 215 can apply the grammar 220 to the text content 210 to determine syntactic structures. In the case of LFG-based parsing, the syntactic structures consist of constituent structures (c-structures) and functional structures (f-structures). The c-structure can represent a hierarchy of constituent phrases and words. The f-structure can encode roles and relationships between the various constituents of the c-structure. The f-structure can also represent information derived from the forms of the words. For example, the plurality of a noun or the tense of a verb may be specified in the f-structure.
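To make the distinction concrete, the sketch below hand-writes a c-structure and an f-structure for the sentence "Picasso painted paintings". It is not the output of XLE or of any parser; the nested-tuple and dictionary encodings, and attribute names such as `PRED`, `TENSE`, and `NUM`, follow common LFG textbook notation and are assumptions made for this example.

```python
# c-structure: a hierarchy of constituents, written as nested
# (label, child, ...) tuples whose innermost children are the words.
C_STRUCTURE = (
    "S",
    ("NP", ("N", "Picasso")),
    ("VP", ("V", "painted"), ("NP", ("N", "paintings"))),
)

# f-structure: an attribute-value matrix recording grammatical functions
# (SUBJ, OBJ) plus information read off word forms (tense, number).
F_STRUCTURE = {
    "PRED": "paint<SUBJ, OBJ>",
    "TENSE": "past",
    "SUBJ": {"PRED": "Picasso", "NUM": "sg"},
    "OBJ": {"PRED": "painting", "NUM": "pl"},
}


def leaves(node):
    """Return the words of a c-structure in left-to-right order."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:  # node[0] is the category label
        words.extend(leaves(child))
    return words


WORDS = leaves(C_STRUCTURE)
```

The c-structure preserves word order and phrase grouping, while the f-structure abstracts away from order: the past tense of "painted" and the plural of "paintings" appear as the attributes `TENSE` and `NUM`.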

During a semantic mapping process 225 that follows the parsing process 215, information can be extracted from the syntactic structures and combined with information about the meanings of the words in the sentence. A semantic map, or semantic representation, of a sentence can be provided as content semantics 240. Semantic mapping 225 can augment the syntactic relationships provided by the parser 215 with conceptual properties of individual words. The results can be transformed into representations of the meaning of sentences from the text content 210. Semantic mapping 225 can determine the roles played by words in a sentence: for example, the subject performing an action, something used to carry out the action, or something affected by the action. For the purposes of search indexing, words can be stored in a semantic index 250 along with their roles. Thus, retrieval from the semantic index 250 can depend not merely on a word in isolation, but also on the meaning of the word in the sentences in which it appears within the text content 210. Semantic mapping 225 can support disambiguation of terms, determination of antecedent relationships, and expansion of terms by synonym, hypernym, or hyponym.

Semantic mapping 225 can apply knowledge resources 230 as rules and techniques for extracting semantics from sentences. The knowledge resources 230 can be acquired through both manual definition and machine learning, as discussed with respect to acquisition of grammars 220. The semantic mapping 225 process can provide content semantics 240 in a semantic extensible markup language (semantic XML or semxml) representation. Any suitable representation language, such as expressions written in PROLOG, LISP, JSON, YAML, or others, may also be used. Content semantics 240 can specify roles played by words in the sentences of the text content 210. The content semantics 240 can be provided to an indexing process 245.

An index can support representing a large corpus of information so that the locations of words and phrases can be rapidly identified within the index. A traditional search engine may use keywords as search terms, such that the index maps from keywords specified by a user to the articles or documents in which those keywords appear. The semantic index 250 can represent the semantic meanings of words in addition to the words themselves. Semantic relationships can be assigned to words during both content acquisition 200 and user search 205. Queries against the semantic index 250 can be based not only on words, but on words in specific roles. The roles are those played by the word in the sentence or phrase as stored in the semantic index 250. The semantic index 250 can be considered an inverted index: a rapidly searchable database whose entries are semantic words (that is, words in a given role) with pointers to the documents, or web pages, in which those words occur. The semantic index 250 can support hybrid indexing. Such hybrid indexing can combine features and functions of both keyword indexing and semantic indexing.
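The idea of an inverted index over "words in a given role" can be sketched with two posting tables, one keyed by the bare word and one keyed by the (word, role) pair. The class name, the role labels `agent` and `patient`, and the toy sentences are hypothetical; the sketch only illustrates why a role-aware lookup is more selective than a keyword lookup.

```python
from collections import defaultdict


class SemanticIndex:
    """Toy hybrid index: keyword postings plus (word, role) postings."""

    def __init__(self):
        self._by_word = defaultdict(set)       # word -> document ids
        self._by_word_role = defaultdict(set)  # (word, role) -> document ids

    def add(self, doc_id, word, role):
        """Index one word occurrence together with the role it plays."""
        self._by_word[word].add(doc_id)
        self._by_word_role[(word, role)].add(doc_id)

    def search_keyword(self, word):
        """Keyword lookup: every document containing the word in any role."""
        return set(self._by_word.get(word, ()))

    def search_semantic(self, word, role):
        """Semantic lookup: only documents where the word plays the role."""
        return set(self._by_word_role.get((word, role), ()))


INDEX = SemanticIndex()
# doc1: "The dog bit the man."   doc2: "The man bit the dog."
INDEX.add("doc1", "dog", "agent")
INDEX.add("doc1", "man", "patient")
INDEX.add("doc2", "man", "agent")
INDEX.add("doc2", "dog", "patient")
```

A keyword search for "dog" returns both documents, whereas a semantic search for "dog" as agent returns only `doc1`. Keeping both tables side by side is one simple way to picture hybrid indexing.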

User queries may be entered in the form of natural language questions 260. A query may be analyzed through a natural language pipeline similar or identical to the one used in content acquisition 200. That is, the natural language question 260 may be processed by a parser 265 to extract syntactic structure. Following syntactic parsing 265, the natural language question 260 may be processed for semantic mapping 270. Semantic mapping 270 may provide question semantics 275 to be used in a search process 280 against the semantic index 250, as discussed above. The search process 280 may support hybrid index queries, in which keyword index search and semantic index search may each be used alone or in combination.

In response to a user query, the results of search 280 from the semantic index 250, together with the question semantics 275, may inform a ranking process 285. Ranking may use both keyword and semantic information. During ranking 285, the results obtained by search 280 may be ordered by various metrics in an attempt to place the most desirable results closer to the top of the retrieved information provided to the user as a result presentation 290.

Referring now to FIG. 3, a functional block diagram illustrates coreference resolution and ambiguity resolution within a natural language processing system 300 according to aspects of an embodiment presented herein. As an example application, the natural language processing system 300 may support an information retrieval engine for document indexing and search. Such a natural-language-enabled search engine may expand the information stored in its index on the basis of linguistic analysis. The system may also support discovery of the intent behind a user query by analyzing the query linguistically. The coreference resolution and ambiguity resolution features discussed herein may operate in conjunction with syntactic parsing 215, semantic mapping 225, and semantic indexing 245, as discussed with respect to FIG. 2. Coreference resolution may be performed directly on the text content 210, or may use information from the parsing 215 or semantic mapping 225 operations.

As illustrated, coreference resolution 320, 370 may be performed both directly on the segmented document and as part of semantic mapping 225. These two occurrences of coreference resolution 320, 370 may be merged, or their information outputs may be merged. It should be appreciated that coreference resolution may also occur between syntactic parsing 215 and semantic mapping 225, or at any other stage of the natural language processing pipeline. There may be one, two, or more coreference resolution components, or stages, at various positions within the natural language processing system. Text content 210 may be analyzed for information to be stored in the semantic index 250. A search may include querying the semantic index 250 for desired information.

Content segmentation 310 may be performed on the documents that make up the text content 210. The documents may be segmented for more efficient, and potentially more accurate, coreference resolution 320. Coreference resolution 320 may consider potential reference relationships across an entire document. For long documents, a great deal of time may be spent comparing expressions that are far apart. Where processing speed is a concern, content segmentation 310 of documents before coreference resolution 320 may substantially reduce processing time, because it effectively reduces the amount of text content 210 explored in each resolution attempt.

Content segmentation 310 may provide the semantic coreference resolution 370 with information indicating when a new document segment begins. Such information may be provided as a segmentation signal 312 or by inserting markup into the content document segment. An external file containing meta-information, or other mechanisms, may also be used.

The structure of a document may be used to identify segment boundaries that reference relationships are unlikely to cross. Document structure may be inferred from explicit markup such as paragraph boundaries, chapters, or section headings. Document structure may also be discovered through linguistic processing. Segments that exceed a specified length may be further subdivided. The desired subdivision length may be expressed, for example, as a number of sentences or a number of words.

Where no reliable document structure is available, heuristic or statistical criteria may be applied. Such criteria may be designed to tend to keep coreferent expressions together while limiting the size of a segment to a given maximum. Various other approaches to segmenting the documents of text content 210 may also be applied. Content segmentation 310 may also designate an entire document as a single segment.
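
The structure-based segmentation with a length cap described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: blank lines are taken as paragraph boundaries, sentences are split naively on terminal punctuation, and the function name `segment_document` and the parameter `max_sentences` are hypothetical.

```python
import re


def segment_document(text, max_sentences=3):
    """Split text at paragraph boundaries, then cap each segment's length."""
    segments = []
    for paragraph in re.split(r"\n\s*\n", text):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", paragraph)
        # Subdivide any paragraph that exceeds the maximum segment length.
        for start in range(0, len(sentences), max_sentences):
            segments.append(sentences[start:start + max_sentences])
    return segments


sample = "A one. A two. A three. A four.\n\nB one."
segments = segment_document(sample, max_sentences=3)
```

With a cap of three sentences, the four-sentence paragraph is split into two segments and the second paragraph forms a third. Setting `max_sentences` very high approximates treating each paragraph as one segment.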

Coreference resolution 320, 370 may be used to identify coreferences and aliases within the text content 210. For example, when indexing the sentence "He painted Guernica", it may be critical to determine that "he" refers to Picasso. This is particularly so when fact-based retrieval is in use. Resolving the pronoun as an alias for Picasso supports indexing the fact that Picasso painted Guernica, rather than the less useful fact that some male individual, "he", painted Guernica. Without this ability to identify and index the referent of a pronoun, it may be difficult to retrieve the document in response to the query "Picasso painted" using a fact-based retrieval method. The recall of the system improves when a document relevant to a query is returned that would otherwise not have been returned.

Annotation 330 may be applied to the text content 210 to support tracking entities and possible coreference relationships. Confidence values for resolution decisions may also be annotated or marked up within the text content 210. Resolution decisions may be recorded by adding explicit annotation marks to the text. For example, given the text "John visited Mary. He met her in 2003.", annotation 330 may be applied as "[E1:0.9 John] visited [E2:0.8 Mary]. [E1:0.9 He] met [E2:0.8 her] in 2003." Here the words "John" and "He" are associated as entity 1 (E1) with a confidence value of 0.9. Likewise, the words "Mary" and "her" are associated as entity 2 (E2) with a confidence value of 0.8. A confidence value may represent a measure of confidence in the coreference resolution 320 decision. The annotation may encode the coreference decisions directly, or it may serve as identifiers that link the relevant terms in the annotated text to additional information in a stand-aside annotation 325.
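
To make the inline annotation format concrete, here is a small Python sketch that reads the bracketed "[E1:0.9 John]" notation shown above and groups mentions by entity. The regular expression and the function names `parse_annotations` and `strip_annotations` are assumptions made for this illustration; the specification does not prescribe a parser.

```python
import re
from collections import defaultdict

# Matches "[E<number>:<confidence> <surface text>]".
MENTION = re.compile(r"\[(E\d+):(\d*\.?\d+) ([^\]]+)\]")


def parse_annotations(annotated):
    """Group annotated mentions by entity id as (surface, confidence) pairs."""
    entities = defaultdict(list)
    for entity_id, confidence, surface in MENTION.findall(annotated):
        entities[entity_id].append((surface, float(confidence)))
    return dict(entities)


def strip_annotations(annotated):
    """Recover the plain text by removing the annotation marks."""
    return MENTION.sub(lambda match: match.group(3), annotated)


annotated = "[E1:0.9 John] visited [E2:0.8 Mary]. [E1:0.9 He] met [E2:0.8 her] in 2003."
entities = parse_annotations(annotated)
plain = strip_annotations(annotated)
```

Here `entities` maps E1 to the mentions "John" and "He" and E2 to "Mary" and "her", and `plain` is the original unannotated text.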

Coreference resolution 320 decisions may be used as part of the process of constructing the semantic mapping 225. The referring expressions used by coreference resolution 320 may be integrated into the input representation for semantic mapping 225 through inline annotations in the text content 210. The references may also be provided separately in an external stand-aside entity map 325.

Within a large document collection of text content 210, such as the World Wide Web, the same sentence may appear many times in different contexts. These different contexts may provide different candidates for coreference resolution 320. Because syntactic parsing 215 may be computationally expensive, it may be beneficial to store the parsing results for sentences in a cache. Such a caching mechanism 350 may support rapid retrieval of parsing information when a sentence is encountered again.

If coreference resolution 320 is applied to a single sentence that appears in different contexts, it may identify different coreference relationships for the same referring expressions, because coreference can depend on context. Different entity identifiers may therefore be inserted inline into the text. For example, the text "He is smart" appearing in two different documents may be annotated with two different identifiers: "[E21 He] is smart." and "[E78 He] is smart." Here the word "He" in the first document refers to a different person than the word "He" in the second document.
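
The interplay between a shared parse cache and context-dependent entity identifiers might look like the following Python sketch. It uses the standard-library `functools.lru_cache` as a stand-in for the caching mechanism; the "parser" is a trivial tokenizer, and the names `parse_sentence`, `annotate`, and `parse_calls` are hypothetical.

```python
from functools import lru_cache

parse_calls = 0  # counts how often the (expensive) parser actually runs


@lru_cache(maxsize=None)
def parse_sentence(sentence):
    """Stand-in for an expensive syntactic parse; results are cached per sentence."""
    global parse_calls
    parse_calls += 1
    return tuple(sentence.rstrip(".").split())


def annotate(sentence, pronoun_entity):
    """Reuse the cached parse but attach a context-specific entity id to 'He'."""
    tokens = parse_sentence(sentence)
    marked = [f"[{pronoun_entity} {token}]" if token == "He" else token for token in tokens]
    return " ".join(marked) + "."


first = annotate("He is smart.", "E21")   # sentence as it occurs in one document
second = annotate("He is smart.", "E78")  # same sentence in another document
```

The sentence is parsed once, yet the two documents receive different inline identifiers, which is the separation of concerns the passage describes.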

There may be different sources of information for shallow coreference resolution 320. For example, in addition to the expression detection performed during coreference resolution 320, there may be a system dedicated to finding proper names in the text content 210. These different sources may produce conflicting resolution information; a conflict may arise, for instance, where boundaries cross. For example, two systems may have identified the following conflicting referring expressions:

"[John] told [George Washington] [Irving] was a great writer."

"[John] told [George] [Washington Irving] was a great writer."

Consider the crossing-boundary conflicts here: [George Washington] in the first string conflicts with [George] in the second string, and also with [Washington Irving] in the second string. Depending on confidence information or contextual factors, different strategies may be applied iteratively to resolve such a conflict or to preserve it. In a "drop" strategy, two or more conflicting boundaries are resolved by dropping the one with the lowest confidence. In a "merge" strategy, where two or more boundaries are equally plausible in compatible contexts, the boundaries are moved accordingly. For example, "[Mr. John] Smith" and "Mr. [John Smith]" may be merged to give "[Mr. John Smith]". In a "preserve" strategy, where the configuration of the boundaries and their confidence values support neither merging nor dropping, the multiple boundaries are retained as ambiguous output. For example, "[Alexander the Great]" and "[Alexander] [the Great]" may be provided as alternative ambiguous analyses.
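
The three strategies can be sketched over token-offset spans as below. This is only an illustration: the `Span` type, the confidence `margin` threshold, the `mergeable` flag (standing in for the contextual judgment that two boundaries are compatible), and the choice to give a merged span the lower of the two confidences are all assumptions, not details from the specification.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int         # index of the first token
    end: int           # index one past the last token
    confidence: float


def overlaps(a, b):
    return a.start < b.end and b.start < a.end


def drop(a, b):
    """Keep only the higher-confidence span."""
    return [a if a.confidence >= b.confidence else b]


def merge(a, b):
    """Move the boundaries outward so that one span covers both."""
    return [Span(min(a.start, b.start), max(a.end, b.end), min(a.confidence, b.confidence))]


def preserve(a, b):
    """Keep both spans as alternative, ambiguous analyses."""
    return [a, b]


def resolve(a, b, margin=0.2, mergeable=False):
    if not overlaps(a, b):
        return [a, b]                 # no conflict to resolve
    if abs(a.confidence - b.confidence) >= margin:
        return drop(a, b)             # one reading is clearly weaker
    if mergeable:
        return merge(a, b)            # equally plausible and compatible
    return preserve(a, b)             # neither drop nor merge is supported


# Tokens: Mr.(0) John(1) Smith(2)
merged = resolve(Span(0, 2, 0.7), Span(1, 3, 0.7), mergeable=True)
dropped = resolve(Span(0, 2, 0.9), Span(1, 3, 0.5))
# Tokens: Alexander(0) the(1) Great(2)
kept_both = resolve(Span(0, 3, 0.6), Span(0, 1, 0.6))
```

In this sketch the "Mr. John Smith" pair collapses into one span covering all three tokens, the unequal pair keeps only the 0.9 span, and the "Alexander the Great" pair is retained as two alternatives.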

The parsing component 215 may be an ambiguity-aware parser that supports direct parsing of ambiguous input, so that the syntactic parse 355 can preserve the ambiguity. Alternatively, the ambiguous input analyses may need to be parsed separately, and the multiple output structures may be passed separately to the semantic component 225. Semantic processing 225 may be applied multiple times, once to each output of the syntactic parser 215, as discussed in more detail below. Different semantic outputs may thus be generated for different syntactic inputs. Alternatively, semantic mapping 225 may combine the various inputs and process them together.

Semantic mapping 225 may be accompanied by semantic normalization 360. Multiple ambiguous syntactic parse 355 outputs for a sentence may share a meaning while having different forms. This can occur, for example, in the normalization of passive constructions. In "John gave Mary a present", the word "John" is the subject and "Mary" is the indirect object. In "a present was given to Mary by John", the subject is "Mary" and "John" is the object. Normalization 360 may provide outputs that represent both examples identically, with "John" as the semantic subject and "Mary" as the semantic indirect object. Alternatively, "John" may be identified as the agent of the action and "Mary" as the recipient. Likewise, identical representations may be provided for "Rome's destruction of Carthage" and "Rome destroyed Carthage".

Semantic normalization may also add information about the individual words of a parsed sentence. For example, words may be identified in a lexicon and associated with their synonyms, hypernyms, possible aliases, and other lexical information.

Semantics-based coreference resolution 370 may resolve expressions on the basis of syntactic and semantic information. For example, in "John saw Bill. He greeted him.", "he" may be resolved to "John" and "him" to "Bill". This resolution may be assigned because "he" and "John" are both subjects, while "him" and "Bill" are both objects.

Shallow coreference resolution 320 may operate by examining the document segment in which terms occur. In contrast, semantic coreference resolution 370, or deep coreference resolution, may process one sentence at a time. Possible antecedents from each sentence may be placed in an antecedent store 375 so that semantic coreference resolution 370 of later sentences can access elements introduced earlier. Antecedents may be stored together with their grammatical functions and roles in the sentence, information about their distance in the text, information about their relationships to other antecedents, and various other information.
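
A minimal Python sketch of an antecedent store follows. It keeps only the sentence index, the word, and its grammatical role, and resolves a pronoun to the most recent earlier antecedent with the same role, mirroring the subject-to-subject and object-to-object matching of the "John saw Bill. He greeted him." example. The class `AntecedentStore` and its methods are hypothetical, and role parallelism is just one of many possible heuristics.

```python
class AntecedentStore:
    """Holds candidate antecedents from earlier sentences with their roles."""

    def __init__(self):
        self._entries = []  # (sentence_index, word, grammatical_role)

    def add(self, sentence_index, word, role):
        self._entries.append((sentence_index, word, role))

    def resolve(self, pronoun_role, sentence_index):
        """Return the most recent earlier antecedent sharing the pronoun's role."""
        for index, word, role in reversed(self._entries):
            if index < sentence_index and role == pronoun_role:
                return word
        return None


store = AntecedentStore()
# Sentence 0: "John saw Bill."
store.add(0, "John", "subject")
store.add(0, "Bill", "object")
# Sentence 1: "He greeted him."
he = store.resolve("subject", sentence_index=1)
him = store.resolve("object", sentence_index=1)
```

Because the store outlives a single sentence, a resolver that processes one sentence at a time can still reach entities introduced earlier in the segment.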

Expression merging 380 may combine the expressions from shallow coreference resolution 320, the stand-aside annotation 325, and the information from semantic coreference resolution 370. The terms to be combined may be identified using string alignment or annotations 330. Other mechanisms for combining two annotations of the same text may also be used.

Syntactic parsing 215 may optionally be a natural point of integration for detected referring expressions. A parser may support inferring structure within sentences, such as constituents, or grammatical relationships such as subject and object. An ambiguity-enabled syntactic parser 215 may identify multiple alternative structural representations of a sentence. In one example, information from coreference resolution 320 may be used to filter the output of the syntactic parser 215, retaining only those representations in which the left boundary of each referring expression coincides with the start of a compatible constituent in the parse. For example, coreference resolution may establish coreferents as in "[E0 John] told [E1 George] [E2 Washington Irving] was a great writer." The syntactic parser 215 may separately provide four parsing possibilities:

1. [John] and [George] and [Washington Irving]

2. [John] and [George] and [Washington] and [Irving]

3. [John] and [George Washington] and [Irving]

4. [John] and [George Washington Irving]

Parser possibilities 3 and 4 may be filtered out because they are incompatible with the left boundary of entity E2, "Washington Irving", provided by coreference resolution 320.
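
The left-boundary filter can be sketched in Python as follows. To keep the example short, only the four name tokens (John, George, Washington, Irving) are modeled, numbered 0 to 3, so the mentions E0, E1, and E2 start at offsets 0, 1, and 2. The function names `left_boundaries` and `filter_parses` are hypothetical.

```python
def left_boundaries(constituents):
    """Token offsets at which each constituent of a parse starts."""
    starts, offset = set(), 0
    for constituent in constituents:
        starts.add(offset)
        offset += len(constituent.split())
    return starts


def filter_parses(parses, mention_starts):
    """Keep parses in which every mention begins where some constituent begins."""
    return [parse for parse in parses if mention_starts <= left_boundaries(parse)]


parses = [
    ["John", "George", "Washington Irving"],
    ["John", "George", "Washington", "Irving"],
    ["John", "George Washington", "Irving"],
    ["John", "George Washington Irving"],
]
mention_starts = {0, 1, 2}  # E0 John, E1 George, E2 Washington Irving
kept = filter_parses(parses, mention_starts)
```

Only the first two parses survive, because no constituent starts at "Washington" in the third and fourth. Note that the check looks at left boundaries only, so the second parse, which splits "Washington" from "Irving", is still retained.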

The process of expansion 385 may add further information to the representation. For example, for "John sold a car to Bill", expansion 385 may additionally output a representation for "Bill bought a car from John". Likewise, for "John killed Bill", expansion 385 may additionally output a representation for "Bill died".
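
Expansion can be pictured as a small set of rewrite rules over extracted facts. The Python sketch below hard-codes two such rules, reading the first example as John selling a car to Bill. The dictionary keys (`verb`, `agent`, `theme`, `recipient`, `source`) and the function `expand` are illustrative assumptions; the specification does not define a fact schema.

```python
def expand(fact):
    """Return the original fact plus any facts it implies under two toy rules."""
    expanded = [fact]
    if fact["verb"] == "sell":
        # X sold Y to Z  implies  Z bought Y from X
        expanded.append({
            "verb": "buy",
            "agent": fact["recipient"],
            "theme": fact["theme"],
            "source": fact["agent"],
        })
    elif fact["verb"] == "kill":
        # X killed Y  implies  Y died
        expanded.append({"verb": "die", "agent": fact["theme"]})
    return expanded


sale = {"verb": "sell", "agent": "John", "theme": "car", "recipient": "Bill"}
killing = {"verb": "kill", "agent": "John", "theme": "Bill"}
sale_facts = expand(sale)
killing_facts = expand(killing)
```

Indexing the implied facts alongside the stated ones lets a query such as "Bill bought" retrieve a document that only ever says "John sold".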

Traditional search engines may retrieve documents in response to user queries on the basis of matching keywords or terms. In these traditional systems, documents may be ranked according to factors such as how many of the query terms appear in a document, how often the terms appear, or how close together the terms appear.

Consider the example query "Picasso painted" with a first example document containing "Picasso was born in Malaga. He painted Guernica." and a second example document containing "Picasso's friend Matisse painted prolifically." All else being equal, a traditional system may rank the second document above the first, because the words "Picasso" and "painted" are closer together in the second document. In contrast, a system able to resolve that the word "He" in the first document refers to Picasso can correctly rank the first document higher on the basis of that knowledge. Assuming that the query "Picasso painted" reflects the user's intent to find out what Picasso painted, the first document is clearly the more relevant result.

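The contrast between proximity-based and fact-based ranking for the query "Picasso painted" can be reproduced with a few lines of Python. The documents are labeled by content (`guernica_doc`, `matisse_doc`) rather than by ordinal. The facts are written by hand to stand in for what a coreference-aware pipeline might extract; the tokenizer, the `proximity` function, and the triple format are all assumptions for illustration.

```python
import re


def tokens(text):
    """Lowercase alphabetic tokens; "Picasso's" becomes "picasso", "s"."""
    return re.findall(r"[a-z]+", text.lower())


def proximity(text, first_word, second_word):
    """Token distance between the first occurrences of two words."""
    toks = tokens(text)
    return abs(toks.index(first_word) - toks.index(second_word))


docs = {
    "guernica_doc": "Picasso was born in Malaga. He painted Guernica.",
    "matisse_doc": "Picasso's friend Matisse painted prolifically.",
}

# Hand-written facts, with "He" already resolved to Picasso.
facts = {
    "guernica_doc": [("picasso", "born_in", "malaga"), ("picasso", "painted", "guernica")],
    "matisse_doc": [("matisse", "painted", None)],
}

# Keyword proximity prefers the document where the two words sit closest.
by_proximity = min(docs, key=lambda name: proximity(docs[name], "picasso", "painted"))

# Fact matching keeps only documents asserting that Picasso painted something.
by_facts = [
    name for name, triples in facts.items()
    if any(subject == "picasso" and verb == "painted" for subject, verb, _ in triples)
]
```

Proximity alone favors the Matisse document (distance 4 versus 6), while fact matching returns only the Guernica document.
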
The natural language processing system 300 may have different architectures. In one embodiment, a pipeline may be provided in which information from one stage of language processing is passed as input to later stages. It should be understood that these approaches may also be implemented with any other architecture operable to extract facts to be indexed from natural language text content 210.

Referring now to FIG. 4, additional details will be provided regarding the embodiments presented herein for coreference resolution in an ambiguity-sensitive natural language processing system. In particular, FIG. 4 is a flow diagram illustrating aspects of a process 400 for ambiguity-sensitive indexing with coreference resolution according to aspects of an embodiment presented herein.

It should be understood that the logical operations described herein are implemented (1) as a sequence of computer-implemented acts or program modules running on a computing system and/or (2) as interconnected machine logic circuits or circuit modules within the computing system. The implementation is a matter of choice dependent on the performance and other requirements of the computing system. Accordingly, the logical operations described herein are referred to variously as state operations, structural devices, acts, or modules. These operations, structural devices, acts, and modules may be implemented in software, in hardware, in special-purpose digital logic, or in any combination thereof. It should also be understood that more or fewer operations may be performed than are shown in the figures and described herein. These operations may also be performed sequentially, in parallel, or in a different order than described herein.

The routine 400 begins at operation 410, where a portion of the text content 210 may be retrieved for analysis and indexing. At operation 420, the text content 210 may be segmented to bound the regions of text that resolution processing must search and analyze. This segmentation may be based on structure within the text, such as sentences, paragraphs, pages, chapters, or sections. Segmentation may also be based on the number of words, the number of sentences, or other metrics of space or complexity.

At operation 430, coreferences may be resolved within the text content 210. Coreferences may be identified and matched within the boundaries established at operation 420. Alias groups may be established. Surface structure may be used to provide a "shallow" resolution. Ambiguities arising during coreference resolution may be annotated. Such annotation 340 may be provided as markup within the text content 210 or through the use of an external entity map. Similar annotation may also be used to label references and referents with entity numbers, and to indicate the confidence level of the coreference resolutions established.

At operation 440, syntactic parsing may convert sentences into representations that make explicit the syntactic relationships between words. The parser 215 may apply a grammar 220 associated with a particular language to provide syntactic parse 355 information.

At operation 450, semantic representations may be extracted from the text content 210. The information expressed in a document of the text content 210 may be formally organized in terms of representations of relationships between entities in the text. These relationships may be referred to as facts in a general sense.

At operation 455, the syntactic parse 355 information output by the syntactic parser 215 may be used to support deep coreference resolution 370. The semantic representations generated during operation 450 may also be used.

At operation 460, the expressions from the shallow coreference resolution operation 430 may be integrated with information from the deep coreference resolution operation 455. An ambiguity-enabled syntactic parser 215 may identify multiple alternative structural representations of a sentence. Information from coreference resolution may be used to filter the output of the syntactic parser 215.

At operation 470, the semantics of the text content 210 may be expanded to include selected implied representations. At operation 475, facts may be extracted from the semantic representations, which express relationships between entities, events, and states of affairs in the content text. At operation 480, the facts and entities may be stored in the semantic index 250.

The routine 400 may terminate after operation 480. It should be understood, however, that the routine 400 may be applied repeatedly or continuously to retrieve further portions of text content 210 to be applied to the semantic index 250.

Referring now to FIG. 5, an illustrative computer architecture 500 can execute the software components described herein for coreference resolution in an ambiguity-sensitive natural language processing system. The computer architecture shown in FIG. 5 illustrates a conventional desktop, laptop, or server computer and may be used to execute any aspect of the software components presented herein. It should be understood, however, that the described software components may also be executed in other example computing environments, such as mobile devices, televisions, set-top boxes, kiosks, vehicle information systems, mobile telephones, embedded systems, and the like. Any one or more of the client computers 110A-110D or the server computers 120 may be implemented as the computer system 500 according to embodiments.

The computer architecture illustrated in FIG. 5 can include a central processing unit 10 (CPU), a system memory 13 including a random access memory 14 (RAM) and a read-only memory 16 (ROM), and a system bus 11 that can couple the system memory 13 to the CPU 10. A basic input/output system containing the basic routines that help to transfer information between elements within the computer 500, such as during startup, can be stored in the ROM 16. The computer 500 may further include a mass storage device 15 for storing an operating system 18, software, data, and various program modules, such as those associated with the natural language engine 130. The natural language engine 130 can execute portions of the software components described herein. The semantic index 250 associated with the natural language engine 130 may be stored within the mass storage device 15.

The mass storage device 15 can be connected to the CPU 10 through a mass storage controller (not shown) connected to the bus 11. The mass storage device 15 and its associated computer-readable media can provide non-volatile storage for the computer 500. Although the description of computer-readable media contained herein refers to a mass storage device, such as a hard disk or CD-ROM drive, those skilled in the art will appreciate that computer-readable media can be any available computer storage media that can be accessed by the computer 500.

By way of example, and not limitation, computer-readable media may include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. For example, computer-readable media includes, but is not limited to, RAM, ROM, EPROM, EEPROM, flash memory or other solid-state memory technology, CD-ROM, digital versatile disks (DVD), HD-DVD, BLU-RAY, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer 500.

According to various embodiments, the computer 500 may operate in a networked environment using logical connections to remote computers through a network such as the network 140. The computer 500 may connect to the network 140 through a network interface unit 19 connected to the bus 11. It should be appreciated that the network interface unit 19 may also be utilized to connect to other types of networks and remote computer systems. The computer 500 may also include an input/output controller 12 for receiving and processing input from a number of other devices, including a keyboard, mouse, or electronic stylus (not shown). Similarly, the input/output controller 12 may provide output to a video display, a printer, or other type of output device (also not shown).

As mentioned briefly above, a number of program modules and data files may be stored in the mass storage device 15 and RAM 14 of the computer 500, including an operating system 18 suitable for controlling the operation of a networked desktop, laptop, server computer, or other computing environment. The mass storage device 15, ROM 16, and RAM 14 may also store one or more program modules. In particular, the mass storage device 15, the ROM 16, and the RAM 14 may store the natural language engine 130 for execution by the CPU 10. The natural language engine 130 can include software components for implementing portions of the processes discussed in detail with respect to FIGS. 2-4. The mass storage device 15, the ROM 16, and the RAM 14 may also store other types of program modules. The mass storage device 15, the ROM 16, and the RAM 14 can also store the semantic index 250 associated with the natural language engine 130.

Based on the foregoing, it should be appreciated that technologies for coreference resolution in an ambiguity-sensitive natural language processing system are provided herein. Although the subject matter presented herein has been described in language specific to computer structural features, methodological acts, and computer-readable media, it is to be understood that the invention defined in the appended claims is not necessarily limited to the specific features, acts, or media described herein. Rather, the specific features, steps, and media are disclosed as example forms of implementing the claims.

The subject matter described above is provided by way of illustration only and should not be construed as limiting. Various modifications and changes may be made to the subject matter described herein without following the example embodiments and applications illustrated and described, and without departing from the true spirit and scope of the present invention, which is set forth in the following claims.

Claims

What is claimed is:
A method for integrating coreference resolution mechanisms, the method comprising:
retrieving a portion of text using a natural language engine of a server;
identifying a coreference within the portion of text using the natural language engine of the server;
extracting a fact from the portion of text using the natural language engine of the server, the fact having a meaning;
identifying an ambiguity within the portion of text using the natural language engine of the server; and
expanding the fact into an expanded fact using the natural language engine of the server, wherein the expanded fact includes a coreferent meaning that differs from the meaning and is based on the identified coreference, and an ambiguous meaning based on the identified ambiguity.

The method of claim 1, wherein identifying the coreference within the portion of text comprises identifying the coreference within the portion of text using, at least in part, syntactic parsing.

The method of claim 1, wherein identifying the coreference within the portion of text comprises identifying the coreference within the portion of text using, at least in part, semantic mapping.

The method of claim 1, wherein identifying the coreference comprises identifying a coreference having ambiguity.

(deleted)

The method of claim 1, further comprising storing the expanded fact, using the natural language engine of the server, in an index operable to support information retrieval.

The method of claim 7, further comprising retrieving the expanded fact from the index, using the natural language engine of the server, in response to a search query.

The method of claim 1, further comprising annotating the identified coreference within the portion of text using the natural language engine of the server.

The method of claim 2, further comprising caching information from the syntactic parsing using the natural language engine of the server.

A computer storage medium having computer-executable instructions stored thereon which, when executed by a computer, cause the computer to:
retrieve a portion of text;
identify a coreference within the portion of text;
extract a fact from the portion of text, the fact having a meaning;
identify an ambiguity within the portion of text; and
expand the fact to have a coreferent meaning that differs from the meaning of the fact and is based on the identified coreference, and an ambiguous meaning based on the identified ambiguity.

The computer storage medium of claim 11, wherein identifying the coreference comprises identifying the coreference within the portion of text using, at least in part, syntactic parsing.

The computer storage medium of claim 11, wherein identifying the coreference comprises identifying the coreference within the portion of text using, at least in part, semantic mapping.

The computer storage medium of claim 11, wherein identifying the coreference comprises identifying a coreference having ambiguity.

(deleted)

The computer storage medium of claim 11, further comprising instructions that cause the computer to store the expanded fact in an index operable to support information retrieval.

The computer storage medium of claim 17, further comprising instructions that cause the computer to retrieve the expanded fact from the index in response to a search query.

The computer storage medium of claim 11, further comprising instructions that cause the computer to annotate the identified coreference within the portion of text.

A method for integrating coreference resolution mechanisms, the method comprising:
retrieving a portion of text using a natural language engine of a server computer;
identifying a coreference within the portion of text using the natural language engine of the server computer;
identifying an ambiguity within the portion of text using the natural language engine of the server computer;
extracting a fact from the portion of text using the natural language engine of the server computer, the fact having a meaning;
expanding the fact using the natural language engine of the server computer, wherein the expanded fact includes a coreferent meaning that differs from the meaning and is based on the identified coreference, and an ambiguous meaning based on the identified ambiguity;
storing the expanded fact in an index operable to support information retrieval; and
retrieving the expanded fact from the index in response to a search query.