JP2017504105A - System and method for in-memory database search

Info

Publication number: JP2017504105A
Application number: JP2016536900A
Authority: JP
Inventors: スコットライトナー; フランツウェックザー; ラケシュデイヴ; サンジェイボッヅ; ジョーゼフベックネル; ビラリハキズワミ
Original assignee: キューベースリミテッドライアビリティカンパニー
Priority date: 2013-12-02
Filing date: 2014-12-02
Publication date: 2017-02-02
Also published as: CA2932401A1; KR20160124079A; EP3077918A4; WO2015084759A1; CN106164889A; EP3077918A1

Abstract

Systems and methods for identifying related entities using an entity co-occurrence knowledge base are disclosed. In embodiments, entities identified in a search query are extracted using an entity co-occurrence knowledge base of entities extracted from an entity-indexed corpus, and the search results are presented as related entities. Embodiments that generate search suggestions using fuzzy score matching together with an entity co-occurrence knowledge base are also disclosed. In further embodiments, partial entities are extracted from a search query, a matching algorithm is executed based on the type of the extracted entity, and a search is performed against the entity co-occurrence knowledge base. Embodiments that generate search suggestions for related entities based on co-occurrence and/or fuzzy score matching are also disclosed; in those embodiments, a partial search query is processed and suggestions for the complete query are presented, which are then used as a new search query. Also disclosed are embodiments that generate search suggestions using entity co-occurrence by extracting entities from a search query using an entity and trend co-occurrence knowledge base, and embodiments that enable geographic and named-entity-based search capabilities in a content management system.
[Selected drawing] Figure 1

Description

The present invention relates generally to methods and systems for information retrieval, and more particularly to a method for searching for related entities using entity co-occurrence. The present invention relates generally to query enhancement, and more particularly to search suggestions using fuzzy score matching and entity co-occurrence in a knowledge base. The present invention relates generally to computer query processing, and more particularly to electronic search suggestions for related entities based on co-occurrence and/or fuzzy score matching. The present invention relates generally to methods and systems for information retrieval, and more particularly to a method for obtaining search suggestions. The present invention relates generally to search engines and content management, and more particularly to extending the search engine technology of a content management system to enable geotagging of digital content and named-entity enrichment.

In a commercial context, well-known search engines parse a set of search terms and return a list of items (web pages, in a typical search) sorted in some manner. Most known approaches to performing a search rely on the historical references of other users to build a search query database, which is ultimately used to generate a keyword-based index. A user's search query includes one or more entities identified by a name or by attributes associated with the entity. Entities may include organizations, people, places, and/or times. In a typical search, when a user searches for information related to two particular organizations, the search engine returns mixed results covering different entities that share the same or similar names. With this approach, the user ends up with a large number of documents unrelated to what the user is actually interested in.

Accordingly, there is a need for a related-entity search method that allows a user to find related entities of interest.

Users often use search engines to locate information of interest, either on the Internet or in any database system. A search engine typically operates by receiving a search query from a user and returning search results to the user. The search results are usually ordered by the search engine according to the relevance of each returned result to the search query. The quality of the search query is therefore highly important to the quality of the search results. In most cases, however, search queries from users are written incompletely or only partially (for example, the query does not contain enough words to produce a focused set of relevant results and instead produces many irrelevant ones), and sometimes they are misspelled (for example, Bill Smith is mistakenly spelled Bill Smitth).

One common approach to improving the quality of search results is to enhance the search query. One way to enhance a search query is to generate possible suggestions based on the user's input. To this end, one approach proposes a method for identifying candidate query refinements for a given query from past queries submitted by one or more users. However, this approach is based on query logs, which sometimes lead the user to results that are not of interest. Other approaches use different techniques but are not sufficiently accurate. Accordingly, there remains a need for a method of improving or enhancing search queries from users to obtain more accurate results.

One common approach to improving the quality of search results is to enhance the search query. One way to enhance a search query is to generate possible suggestions based on the user's input. To this end, one approach proposes a method for identifying candidate query refinements for a given query from past queries submitted by one or more users. However, this approach is based on query logs, which sometimes lead the user to results that are not of interest. Other approaches use different techniques but are not sufficiently accurate. Accordingly, there remains a need for a method of improving or enhancing search queries from users to obtain more accurate results, and of providing the user with useful related entities of interest as the user types a search query.

Search engines include several features for predicting user queries. Such predictions include query auto-completion and search suggestions. Today, such prediction methods are based on historical keyword references. Such historical references can be inaccurate because a single keyword may refer to multiple topics within a single text.

In addition, a user's search query includes one or more entities identified by a name or by attributes associated with the entity. Entities may include organizations, people, places, events, dates, and/or times. In a typical search, when a user searches for information related to two particular organizations, the search engine returns mixed results covering different entities that share the same or similar names. With this approach, the user ends up with a large number of documents unrelated to what the user is actually interested in.

Accordingly, there is a need for a method of obtaining search suggestions more quickly and more accurately.

Content management and document management systems for document versioning and collaborative object management are known. One non-limiting example is the Microsoft Sharepoint 2013® set of software and application tools. Microsoft Sharepoint 2013® is a family of software products developed by Microsoft for collaboration, file sharing, and web publishing. Sharepoint 2013® gives users access to an enormous amount of content and information, which makes it difficult for a user to find the information most relevant to a particular situation. To mitigate this problem, Sharepoint 2013® provides a search engine that helps users find the content they need. The user enters a keyword-based search query, and the Sharepoint 2013® search engine returns to the user a list of the most relevant results found within the context of the Sharepoint 2013® platform as of the time the content was indexed.

Sometimes a user wishes to find content in Sharepoint 2013® that relates to a geographic entity, or to another type of entity referenced in documents, such as an organization or a person. Sharepoint 2013® does not provide out-of-the-box functionality for automatically extracting entities from documents. In particular, it does not support geotagging of content, that is, extracting geographic entities and resolving them to geographic locations. Nor does Sharepoint 2013® support entity tagging to identify, disambiguate, and extract named entities such as organizations or people in documents. However, Sharepoint 2013® search can be extended to enable effective geographic search and other entity-related search, including entity-based search facets. Versions prior to Sharepoint 2013® included "FAST Search" for Sharepoint, in which the content processing pipeline could be extended through sandboxed applications, but this was slow and limited in the information it could access.

Sharepoint 2013® introduces a very open API that allows specialized linguistics to be added, such as concept extraction, relationship extraction, geotagging, summarization, and sophisticated text analysis. Accordingly, there is an opportunity to extend the capabilities of the Sharepoint 2013® search engine to enable geographic and other entity-based search.

A method for searching for related entities using entity co-occurrence is disclosed. In one aspect of this disclosure, the method is used in a search system that includes a client/server type architecture. In one embodiment, the search system includes a user interface for a search engine that communicates with one or more server devices over a network connection. The server device includes an entity-indexed corpus of electronic data, an entity co-occurrence knowledge base database, and an entity extraction computer module. The knowledge base is built as an in-memory database and also includes other components, such as one or more search controllers, multiple search nodes, collections of compressed data, and a disambiguation module. A search controller is selectively associated with one or more search nodes. Each search node can independently perform a fuzzy key search through a collection of compressed data and return a scored set of results to its associated search controller.
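The controller and node arrangement described above can be sketched roughly as follows. This is a minimal illustration only, not the disclosed implementation: the class names `SearchNode` and `SearchController`, the 0.6 threshold, and the use of `difflib.SequenceMatcher` as the fuzzy scorer are all assumptions, and the compressed data collections are replaced by plain in-memory lists.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Tuple


@dataclass
class SearchNode:
    """Hypothetical search node holding one partition of the records."""
    records: List[str]

    def fuzzy_search(self, key: str, threshold: float = 0.6) -> List[Tuple[float, str]]:
        # Score every record in this partition against the key and keep
        # only those at or above the (assumed) similarity threshold.
        scored = []
        for record in self.records:
            score = SequenceMatcher(None, key.lower(), record.lower()).ratio()
            if score >= threshold:
                scored.append((score, record))
        return scored


@dataclass
class SearchController:
    """Hypothetical controller that fans a query out to its nodes."""
    nodes: List[SearchNode]

    def search(self, key: str, limit: int = 3) -> List[Tuple[float, str]]:
        # Each node searches independently; the controller merges the
        # scored result sets and returns the best matches first.
        merged: List[Tuple[float, str]] = []
        for node in self.nodes:
            merged.extend(node.fuzzy_search(key))
        merged.sort(key=lambda pair: pair[0], reverse=True)
        return merged[:limit]


controller = SearchController(
    nodes=[
        SearchNode(["Bill Smith", "Bob Jones"]),
        SearchNode(["Bill Smyth", "Acme Corp"]),
    ]
)
results = controller.search("Bill Smitth")
```

In this sketch the nodes are queried in a simple loop; a real deployment would presumably query them in parallel, which the independence of the per-node search makes straightforward.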

In one embodiment, a computer-implemented method comprises: receiving, by an entity extraction computer, from a client computer, a search query comprising one or more entities; comparing, by the entity extraction computer, each entity with one or more co-occurrences of that entity in a co-occurrence database; extracting, by the entity extraction computer, a subset of the one or more entities from the search query in response to determining that each entity of the subset exceeds a confidence score of the co-occurrence database, based on the likelihood, according to the co-occurrence database, that the entity co-occurs with one or more related entities in an electronic data corpus; assigning, by the entity extraction computer, an index identifier (index ID) to each of the plurality of extracted entities; saving, by the entity extraction computer, the index ID for each of the plurality of extracted entities in the electronic data corpus, the electronic data corpus being indexed by an index ID corresponding to each of the one or more related entities; searching, by a search server computer, the entity-indexed electronic data corpus to locate the plurality of extracted entities and to identify the index IDs of data records in which at least two of the plurality of extracted entities co-occur; and building, by the search server computer, a search result list having the data records that correspond to the identified index IDs.
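The steps of this method can be illustrated with a small sketch. Everything concrete here is hypothetical: the toy `CO_OCCURRENCE` and `CORPUS` data, the substring-based entity detection, the 0.5 confidence threshold, and the `E1`, `E2` index ID format are assumptions made for illustration, not details taken from the disclosure.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical co-occurrence database: entity -> {related entity: confidence}.
CO_OCCURRENCE: Dict[str, Dict[str, float]] = {
    "acme corp": {"globex inc": 0.9, "john doe": 0.4},
    "globex inc": {"acme corp": 0.9},
    "john doe": {"acme corp": 0.4},
}

# Hypothetical entity-indexed corpus: record ID -> entities in the record.
CORPUS: Dict[str, Set[str]] = {
    "doc-1": {"acme corp", "globex inc"},
    "doc-2": {"acme corp"},
    "doc-3": {"globex inc", "john doe"},
}


def extract_entities(query: str, threshold: float = 0.5) -> List[str]:
    """Keep entities found in the query whose best co-occurrence
    confidence exceeds the (assumed) threshold."""
    text = query.lower()
    extracted = []
    for entity, related in CO_OCCURRENCE.items():
        if entity in text and max(related.values(), default=0.0) > threshold:
            extracted.append(entity)
    return extracted


def search(query: str) -> Tuple[Dict[str, str], List[str]]:
    """Return assigned index IDs and the records where at least two
    extracted entities co-occur."""
    entities = extract_entities(query)
    # Assign an index ID to each extracted entity.
    index_ids = {entity: f"E{n}" for n, entity in enumerate(entities, start=1)}
    results = []
    for record_id, record_entities in sorted(CORPUS.items()):
        hits = [e for e in entities if e in record_entities]
        if len(hits) >= 2:
            results.append(record_id)
    return index_ids, results


index_ids, result_list = search("Acme Corp and Globex Inc merger")
```

Here only `doc-1` qualifies, because it is the only record in which both extracted entities appear together; a query that yields fewer than two extracted entities returns an empty result list under this reading of the co-occurrence condition.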

In one embodiment, a system comprises one or more server computers having one or more processors that execute computer-readable instructions for a plurality of computer modules, including an entity extraction module configured to receive user input of search query parameters. The entity extraction module is further configured to: extract a plurality of entities from the search query parameters by comparing each entity of the plurality of extracted entities with an entity co-occurrence database that includes a confidence score representing the likelihood that the extracted entity co-occurs with one or more related entities in an electronic data corpus; assign an index identifier (index ID) to each entity of the plurality of extracted entities; and save the index ID for each of the plurality of extracted entities in the electronic data corpus, the electronic data corpus being indexed by an index ID corresponding to each of the one or more related entities. The system further comprises a search server module configured to search the entity-indexed electronic data corpus to locate the plurality of extracted entities and to identify the index IDs of data records in which at least two of the plurality of extracted entities co-occur, the search server module being further configured to build a search result list having the data records that correspond to the identified index IDs.

In another embodiment, a non-transitory computer-readable medium stores computer-executable instructions comprising: receiving, by an entity extraction computer, user input of search query parameters; extracting, by the entity extraction computer, a plurality of entities from the search query parameters by comparing each entity of the plurality of extracted entities with an entity co-occurrence database that includes a confidence score representing the likelihood that the extracted entity co-occurs with one or more related entities in an electronic data corpus; assigning, by the entity extraction computer, an index identifier (index ID) to each entity of the plurality of extracted entities; saving, by the entity extraction computer, the index ID for each of the plurality of extracted entities in the electronic data corpus, the electronic data corpus being indexed by an index ID corresponding to each of the one or more related entities; searching, by a search server computer, the entity-indexed electronic data corpus to locate the plurality of extracted entities and to identify the index IDs of data records in which at least two of the plurality of extracted entities co-occur; and building, by the search server computer, a search result list having the data records that correspond to the identified index IDs.

A method for generating search suggestions using fuzzy score matching and entity co-occurrence in a knowledge base is disclosed. In one aspect of this disclosure, the method is used in a search system that includes a client/server type architecture. In one embodiment, the search system includes a user interface for a search engine that communicates with one or more server devices over a network connection. The server device includes an entity extraction computer module, a fuzzy score matching computer module, and an entity co-occurrence knowledge base database. The knowledge base is built as an in-memory database and also includes other hardware and/or software components, such as one or more search controllers, multiple search nodes, collections of compressed data, and a disambiguation computer module. A search controller is selectively associated with one or more search nodes. Each search node can independently perform a fuzzy key search through a collection of compressed data and return a scored set of results to its associated search controller.

In another aspect of this disclosure, the method includes an entity extraction module that performs partial entity extraction on a given search query to identify whether the search query refers to an entity and, if so, what type of entity it refers to. The method further includes a fuzzy score matching module that spawns an algorithm based on the type of the extracted entity and performs a search against the entity co-occurrence knowledge base. In addition, portions of the query text that are not detected as corresponding to an entity are treated as conceptual features, such as topics, facets, and key phrases, that can be used to search the entity co-occurrence knowledge base. In one embodiment, the entity co-occurrence knowledge base includes a repository in which entities are indexed as, among others, entity-to-entity, entity-to-topic, or entity-to-facet, which facilitates returning fast and accurate suggestions to the user for completing the search query.

In one embodiment, a method is disclosed. The method comprises: receiving, by an entity extraction computer, user input of search query parameters from a user interface; extracting, by the entity extraction computer, one or more entities from the search query parameters by comparing the search query parameters with an entity co-occurrence database having instances of co-occurrence of one or more entities in an electronic data corpus, and identifying at least one entity type corresponding to the one or more entities in the search query parameters; and selecting, by a fuzzy score matching computer, a fuzzy matching algorithm for searching the entity co-occurrence database to identify one or more records associated with the search query parameters, the fuzzy matching algorithm corresponding to the at least one identified entity type. The method further comprises: searching, by the fuzzy score matching computer, the entity co-occurrence database using the selected fuzzy matching algorithm, and forming one or more suggested search query parameters from the one or more records based on the search; and presenting, by the fuzzy score matching computer, the one or more suggested search query parameters via the user interface.
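One way to picture the selection of a fuzzy matching algorithm by entity type is a dispatch table keyed on the identified type. The sketch below is an assumption-laden illustration: the two matchers, the `LEGAL_SUFFIXES` set, the toy `KNOWLEDGE_BASE`, and the 0.6 threshold are hypothetical, and `difflib.SequenceMatcher` stands in for whatever scoring the disclosed system actually uses.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple

# Assumed list of legal suffixes ignored when comparing organization names.
LEGAL_SUFFIXES = {"inc", "corp", "llc", "ltd"}


def _ratio(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()


def match_person(query: str, candidate: str) -> float:
    # Compare names with tokens sorted, so "Smith Bill" matches "Bill Smith".
    def normalize(name: str) -> str:
        return " ".join(sorted(name.lower().split()))
    return _ratio(normalize(query), normalize(candidate))


def match_organization(query: str, candidate: str) -> float:
    # Drop punctuation and legal suffixes before comparing.
    def normalize(name: str) -> str:
        tokens = name.lower().replace(".", "").replace(",", "").split()
        return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)
    return _ratio(normalize(query), normalize(candidate))


# The fuzzy matching algorithm is selected by the identified entity type.
MATCHERS: Dict[str, Callable[[str, str], float]] = {
    "person": match_person,
    "organization": match_organization,
}

# Hypothetical knowledge-base records as (entity type, name) pairs.
KNOWLEDGE_BASE: List[Tuple[str, str]] = [
    ("person", "Bill Smith"),
    ("person", "Jane Roe"),
    ("organization", "Acme Corp"),
    ("organization", "Globex Inc"),
]


def suggest(query: str, entity_type: str, threshold: float = 0.6) -> List[str]:
    """Return suggested names of the given type, best match first."""
    matcher = MATCHERS[entity_type]
    scored = [
        (matcher(query, name), name)
        for kind, name in KNOWLEDGE_BASE
        if kind == entity_type
    ]
    scored = [pair for pair in scored if pair[0] >= threshold]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]
```

For example, `suggest("Smitth Bill", "person")` would use the person matcher and tolerate both the swapped word order and the misspelling, while `suggest("Acme", "organization")` would use the organization matcher and ignore the missing legal suffix.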

In another embodiment, a system is provided. The system comprises one or more server computers having one or more processors that execute computer-readable instructions for a plurality of computer modules, including an entity extraction module configured to receive user input of search query parameters from a user interface. The entity extraction module is further configured to extract one or more entities from the search query parameters by comparing the search query parameters with an entity co-occurrence database having instances of co-occurrence of one or more entities in an electronic data corpus, and identifying at least one entity type corresponding to the one or more entities in the search query parameters. The system further comprises a fuzzy score matching module configured to select a fuzzy matching algorithm for searching the entity co-occurrence database to identify one or more records associated with the search query parameters, the fuzzy matching algorithm corresponding to the at least one identified entity type. The fuzzy score matching module is further configured to search the entity co-occurrence database using the selected fuzzy matching algorithm, form one or more suggested search query parameters from the one or more records based on the search, and present the one or more suggested search query parameters via the user interface.

A method for generating search suggestions for related entities based on co-occurrence and/or fuzzy score matching is disclosed. In one aspect of this disclosure, the method is used in a computer search system that includes a client/server type architecture. In one embodiment, the search system includes a user interface for a search engine that communicates with one or more server devices over a network connection. The server device includes one or more processors that execute instructions for a plurality of special-purpose computer modules, including an entity extraction module, a fuzzy score matching module, and an entity co-occurrence knowledge base database. The knowledge base is built as an in-memory database and also includes other components, such as one or more search controllers, multiple search nodes, collections of compressed data, and a disambiguation module. A search controller is selectively associated with one or more search nodes. Each search node can independently perform a fuzzy key search through a collection of compressed data and return a scored set of results to its associated search controller.

In another aspect of this disclosure, the method includes performing, by the entity extraction module, partial entity extraction on a given search query to identify whether the search query refers to an entity and, if so, to determine the type of the entity. The method further includes generating, by the fuzzy score matching module, an algorithm corresponding to the type of the extracted entity and performing a search against the entity co-occurrence knowledge base. In addition, portions of the query text that are not detected as entities are treated as conceptual features, such as topics, facets, and key phrases, that can be used to search the entity co-occurrence knowledge base. The entity co-occurrence knowledge base, which already has a repository in which entities are indexed as, among others, entity-to-entity, entity-to-topic, or entity-to-facet, returns fast and accurate suggestions to the user for completing the search query.

In yet another aspect of this disclosure, the completed search query is used as a new search query. The search system processes the new search query, performs entity extraction, finds the related entities with the highest scores in the entity co-occurrence knowledge base, and presents those related entities to the user in a drop-down list.
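The second stage, in which a completed query is used to look up the highest-scoring related entities, might look like the following sketch. The `CO_OCCURRENCE` scores, the `related_suggestions` function name, and the exact-match lookup of the completed query are hypothetical simplifications; a fuller version would run entity extraction on the completed query first.

```python
from typing import Dict, List

# Hypothetical co-occurrence scores: entity -> {related entity: score}.
CO_OCCURRENCE: Dict[str, Dict[str, float]] = {
    "bill smith": {"acme corp": 0.92, "springfield": 0.71, "jane roe": 0.35},
}


def related_suggestions(completed_query: str, limit: int = 2) -> List[str]:
    """Return the highest-scoring related entities for a completed query,
    in the order they would appear in a drop-down list."""
    entity = completed_query.lower().strip()
    related = CO_OCCURRENCE.get(entity, {})
    ranked = sorted(related.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:limit]]


dropdown = related_suggestions("Bill Smith")
```

Capping the list with `limit` reflects the idea that only the top-scoring related entities are shown; the cutoff value itself is an assumption.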

In one embodiment, a method is disclosed. The method comprises: receiving, by an entity extraction computer, user input of a partial search query parameter from a user interface, the partial search query parameter having at least one incomplete search query parameter; extracting, by the entity extraction computer, one or more first entities from the partial search query parameter by comparing the partial search query parameter with an entity co-occurrence database having instances of co-occurrence of the one or more first entities in an electronic data corpus, and by identifying at least one entity type corresponding to the one or more first entities in the partial search query parameter; and selecting, by a fuzzy score matching computer, a fuzzy matching algorithm for searching the entity co-occurrence database to identify one or more records associated with the partial search query parameter, the fuzzy matching algorithm corresponding to the at least one identified entity type. The method further comprises: searching, by the fuzzy score matching computer, the entity co-occurrence database using the selected fuzzy matching algorithm and forming one or more first suggested search query parameters from the one or more records based on the search; presenting, by the fuzzy score matching computer, the one or more first suggested search query parameters via the user interface; receiving, by the entity extraction computer, a user selection of the one or more first suggested search query parameters so as to form a completed search query parameter; and extracting, by the entity extraction computer, one or more second entities from the completed search query parameter. The method further comprises: searching, by the entity extraction computer, the entity co-occurrence database to identify one or more entities associated with the one or more second entities so as to form one or more second suggested search query parameters; and presenting, by the entity extraction computer, the one or more second suggested search query parameters via the user interface.
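The two-stage suggestion flow of this embodiment can be sketched as follows. This is a minimal illustration only, not the disclosed implementation: the toy database `CO_OCCURRENCE_DB`, the two matchers, the mapping `MATCHERS` from entity type to matching algorithm, and the use of Python's `difflib` as the fuzzy comparison are all assumptions introduced for the example.

```python
import difflib

# Hypothetical toy co-occurrence database:
# entity name -> (entity type, {co-occurring entity: score}).
CO_OCCURRENCE_DB = {
    "Apple": ("organization", {"Jobs": 1.0, "Steve Jobs": 0.8, "Organization A": 0.3}),
    "Organization A": ("organization", {"Apple": 0.3}),
    "Steve Jobs": ("person", {"Apple": 0.8}),
}


def similarity(a, b):
    """Case-insensitive similarity ratio between two strings (0.0 to 1.0)."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


def prefix_matcher(partial, names):
    """Assumed matcher for organizations: case-insensitive prefix match."""
    return [n for n in names if n.lower().startswith(partial.lower())]


def misspelling_matcher(partial, names, cutoff=0.6):
    """Assumed matcher for persons: tolerate misspellings, best match first."""
    scored = sorted(((similarity(partial, n), n) for n in names), reverse=True)
    return [n for score, n in scored if score >= cutoff]


# The matching algorithm is selected according to the identified entity type.
MATCHERS = {"organization": prefix_matcher, "person": misspelling_matcher}


def identify_entity_type(partial):
    """Compare the partial query with the database and return the closest record's type."""
    closest = max(CO_OCCURRENCE_DB, key=lambda name: similarity(partial, name))
    return CO_OCCURRENCE_DB[closest][0]


def first_suggestions(partial):
    """First suggestions: complete the partial query with the type-specific matcher."""
    entity_type = identify_entity_type(partial)
    matcher = MATCHERS[entity_type]
    candidates = [n for n, (t, _) in CO_OCCURRENCE_DB.items() if t == entity_type]
    return matcher(partial, candidates)


def second_suggestions(completed):
    """Second suggestions: entities co-occurring with the completed query, best first."""
    _, related = CO_OCCURRENCE_DB[completed]
    return sorted(related, key=related.get, reverse=True)
```

In this sketch a user typing "App" would be offered "Apple", and selecting it would trigger the second lookup for co-occurring entities. A production system would replace the dictionary with the in-memory knowledge base and use matchers tuned per entity type.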

In another embodiment, a system is disclosed. The system comprises one or more server computers having one or more processors executing computer-readable instructions for a plurality of computer modules, including an entity extraction module configured to receive user input of a partial search query parameter from a user interface, the partial search query parameter having at least one incomplete search query parameter. The entity extraction module is further configured to extract one or more first entities from the partial search query parameter by comparing the partial search query parameter with an entity co-occurrence database having instances of co-occurrence of the one or more first entities in an electronic data corpus, and by identifying at least one entity type corresponding to the one or more first entities in the partial search query parameter. The system further includes a fuzzy score matching module configured to select a fuzzy matching algorithm for searching the entity co-occurrence database to identify one or more records associated with the partial search query parameter, the fuzzy matching algorithm corresponding to the at least one identified entity type. The fuzzy score matching module is further configured to search the entity co-occurrence database using the selected fuzzy matching algorithm, form one or more first suggested search query parameters from the one or more records based on the search, and present the one or more first suggested search query parameters via the user interface. In addition, the entity extraction module is further configured to receive a user selection of the one or more first suggested search query parameters so as to form a completed search query parameter, extract one or more second entities from the completed search query parameter, search the entity co-occurrence database to identify one or more entities associated with the one or more second entities so as to form one or more second suggested search query parameters, and present the one or more second suggested search query parameters via the user interface.

A method for obtaining search suggestions related to entities using entity and feature co-occurrence is disclosed. In one aspect of this disclosure, the method is used in a search system that includes a client/server architecture.

A search system using the method relies on entities stored on one or more servers, which provide an entity database and a trend database. The entities in these databases have scores so that they can be indexed with the highest scores first. The method for obtaining search suggestions combines the information stored in both databases to generate a single list of search suggestions. The trend database provides previous search queries from one or more users on a local network and/or the Internet. The entity database provides search suggestions based on entity extraction from a plurality of data available on the local network and/or the Internet. The combined list gives the user a more accurate set of suggestions more quickly.

In one embodiment, a computer-implemented method comprises: receiving, by a computer, a search query comprising one or more data strings from a search engine; identifying, by the computer, one or more entities in the one or more data strings based on comparing the one or more entities against an entity database and a trend database, each entity corresponding to a subset of the one or more strings; identifying, by the computer, one or more features in the one or more data strings that are not identified as corresponding to at least one entity; assigning, by the computer, each of the one or more features to at least one of the one or more entities based on a matching algorithm; assigning, by the computer, an extraction score to each entity based on the score assigned to each feature assigned to that entity; receiving, by the computer, from the entity database a first search list containing one or more entities having a score within a threshold distance of the extraction score of each entity; receiving, by the computer, from the trend database a second search list containing one or more entities having a score within a threshold distance of the extraction score of each entity; generating, by the computer, an aggregate list comprising the first search list and the second search list, the entities of the aggregate list being ranked according to their scores in the aggregate list; and providing, by the computer, suggested searches according to the aggregate list.
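The final steps of this method, filtering each database by a threshold distance from the extraction score and merging the two lists into one ranked list, can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names, the dictionary representation of each database, and the rule of keeping the higher of the two scores when a suggestion appears in both databases are hypothetical choices, not details taken from the disclosure.

```python
def within_threshold(database, extraction_score, threshold):
    """Return (suggestion, score) pairs whose score lies within the threshold distance."""
    return [(name, score) for name, score in database.items()
            if abs(score - extraction_score) <= threshold]


def aggregate_suggestions(entity_db, trend_db, extraction_score, threshold):
    """Merge the entity and trend lists into one list ranked by score, highest first."""
    first_list = within_threshold(entity_db, extraction_score, threshold)
    second_list = within_threshold(trend_db, extraction_score, threshold)
    merged = {}
    for name, score in first_list + second_list:
        # Assumption: a suggestion found in both databases keeps its higher score.
        merged[name] = max(score, merged.get(name, 0.0))
    return sorted(merged.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical scores for illustration only.
entity_db = {"query A": 0.9, "query B": 0.6, "query C": 0.2}
trend_db = {"query B": 0.7, "query D": 0.95, "query E": 0.1}
ranked = aggregate_suggestions(entity_db, trend_db, extraction_score=0.8, threshold=0.25)
```

Summing the two scores instead of taking the maximum would be an equally plausible way to rank suggestions that appear in both databases.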

Disclosed herein are systems and methods that enable geographic entity-based search in content management systems such as Microsoft SharePoint 2013®. The method described in the embodiments includes extending the SharePoint 2013® search architecture by adding a geographic tagging web service. The system comprises a computer processor operatively associated with computer memory and one or more I/O devices, where the processor and memory are configured to run one or more SharePoint 2013® processes. The system also comprises another computer processor operatively associated with computer memory and one or more I/O devices, where that processor and memory are configured to host a geotagging web service and provide its processing. The SharePoint 2013® system includes a crawling component, a content processing component, and a search index component to enable searching of content. The content processing component in SharePoint 2013® search can be extended by using the Content Enrichment Web Service (CEWS) feature.

The method includes crawling content from different sources to obtain an array of crawled properties that is sent for content processing. During content processing, a trigger condition determines whether the crawled properties would benefit from additional processing that enriches the original content with additional geographic metadata properties. If the crawled properties would not benefit from additional processing, they are mapped to managed properties and sent to the search index. If the crawled properties would benefit from external web service processing, CEWS makes a Simple Object Access Protocol (SOAP) request to a configurable endpoint using the Hypertext Transfer Protocol (HTTP) or another web service call method. The entity enrichment service determines the type of the content. If the content is in an image format, its metadata, such as the file location, is sent to an optical character recognition (OCR) engine; the original document is retrieved, processed asynchronously, converted to text, and returned to the crawl component to be re-crawled in text format. If the content is in text format, the geotagging web service identifies geographic metadata and associates it with the content as managed properties. After being geotagged, the content is sent to the index component.
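The routing logic described above can be sketched as follows. This is a simplified stand-in, not SharePoint code and not a real CEWS client: the `CrawledItem` type, the `trigger` rule, the one-entry `GAZETTEER`, and the canned `OCR_OUTPUT` that replaces a real OCR engine are all hypothetical, and the SOAP call is reduced to a local function call.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins: a one-entry gazetteer and canned OCR output keyed by file location.
GAZETTEER = {"Paris": (48.8566, 2.3522)}
OCR_OUTPUT = {"scans/memo.png": "Meeting held in Paris."}


@dataclass
class CrawledItem:
    location: str
    fmt: str                      # "text", "image", or any other format
    text: str = ""
    managed: dict = field(default_factory=dict)


def trigger(item):
    """Assumed trigger condition: only text and image items benefit from enrichment."""
    return item.fmt in ("text", "image")


def process(item, index):
    """Route one crawled item through enrichment and append it to the index."""
    if not trigger(item):
        index.append(item)        # no enrichment: indexed as-is
        return
    if item.fmt == "image":
        # The OCR engine receives the file location, converts the document to text,
        # and the result is re-crawled in text format.
        recrawled = CrawledItem(item.location, "text", OCR_OUTPUT.get(item.location, ""))
        process(recrawled, index)
        return
    # Text content: attach geographic metadata as a managed property, then index.
    item.managed["geo"] = {name: coords for name, coords in GAZETTEER.items()
                           if name in item.text}
    index.append(item)
```

Calling `process(CrawledItem("scans/memo.png", "image"), index)` in this sketch indexes a text item whose `managed["geo"]` entry holds the coordinates found for "Paris". A real deployment would perform the OCR step asynchronously, as the description notes.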

An additional search user interface (UI) is added by using SharePoint 2013® web parts or by modifying the standard layout of SharePoint 2013® search with standard web development tools such as HTML, HTML5, JavaScript®, and CSS, among others. The search UI helps the user perform geographic search queries or display geographic search results using digital geographic features such as, but not limited to, a digital map. The search UI can also be enhanced to perform faceted search using additional enriched entities or their associated metadata.

Numerous other aspects, features, and benefits of this disclosure will become apparent from the following detailed description.

The present disclosure can be better understood with reference to the accompanying drawings. The components in the drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the disclosure. In the figures, reference numerals designate corresponding parts throughout the different views.

FIG. 1 is a block diagram illustrating an exemplary environment of a computer system in which an embodiment of the invention operates.
FIG. 2 is a flowchart illustrating a method for searching using entity co-occurrence, according to one embodiment.
FIG. 3 is a flowchart illustrating an embodiment of a simple search in which the search results returned by the system include related entities of interest.
FIG. 4 is a block diagram illustrating an exemplary system environment in which an embodiment of the invention operates.
FIG. 5 is a flowchart illustrating a method for providing search suggestions using fuzzy score matching and entity co-occurrence in a knowledge base, according to one embodiment.
FIG. 6 illustrates an example of a user interface that generates search suggestions using fuzzy matching and entity co-occurrence in the knowledge base of FIGS. 4-6.
FIG. 7 is a block diagram illustrating an exemplary system environment in which an embodiment of the invention operates.
FIG. 8 is a flowchart illustrating a method for generating search suggestions for related entities based on co-occurrence and/or fuzzy score matching, according to one embodiment.
FIG. 9 is an exemplary embodiment of a user interface associated with the method shown in FIG. 8.
FIG. 10 is a block diagram illustrating a method for obtaining search suggestions based on entity and trend databases.
FIG. 11 is a block diagram illustrating a method for obtaining search suggestions based on entity and trend databases by generating a list of suggestions based on the individual scores of the search suggestions in each database.
FIG. 12 is a block diagram illustrating a method for obtaining search suggestions based on entity and trend databases by generating a list of suggestions based on the total scores of the search suggestions in both databases.
FIG. 13 is a system architecture for content tagging and entity enrichment in a content management system.
FIG. 14 illustrates a process for tagging and indexing content for named and geographic entity search.

Definitions
As used herein, the following terms have the following definitions.

"Entity extraction" refers to an information processing method for extracting information such as names, places, and organizations.

"Corpus" refers to a collection of one or more documents.

"Features" is any information that is at least partially derived from a document.

"Event concept store" refers to a database of event template models.

"Event" refers to one or more features characterized by at least the occurrence of those features in real time.

"Event model" refers to a collection of data used to compare against and identify a specific type of event.

"Module" refers to a computer or software component suitable for carrying out at least one or more tasks.

"Feature attribute" refers to metadata associated with a feature, for example the location of the feature in a document or a confidence score, among others.

"Fact" refers to an objective relationship between features.

"Entity knowledge base" refers to a computer database containing features/entities.

"Query" refers to a computer-generated request to retrieve information from one or more suitable databases.

"Topic" refers to a set of semantic information that is at least partially derived from a corpus.

"Geotagging" refers to the process of extracting geographic entities from unstructured text files. Geotagging includes disambiguating an entity to a specific geographic location and its accompanying geographic metadata, such as geographic coordinates, geographic feature type, and other metadata.

"Entity tagging" refers to the process of extracting named entities from unstructured text. Entity tagging includes entity disambiguation, entity name normalization, and accompanying entity metadata.

"Named entity" refers to a person, an organization, or a topic.

"Geographic entity" refers to a geographic location or place.

"Crawled properties" refers to content management system metadata obtained from inspecting documents during a crawl.

Detailed Description
The preferred embodiments, each of which is illustrated in the accompanying drawings, are described in detail below. The embodiments described above are merely examples. Those skilled in the art will recognize that numerous alternative components and embodiments may be substituted for the particular examples described herein while remaining within the scope of the invention. Other embodiments may be used and/or other changes may be made without departing from the spirit or scope of the invention. The illustrative embodiments described in the detailed description are not meant to limit the subject matter presented herein.

It will nevertheless be understood that no limitation of the scope of the invention is intended. Alterations and further modifications of the inventive features illustrated herein, and additional applications of the principles of the invention as illustrated herein, which would occur to one skilled in the relevant art having possession of this disclosure, are to be considered within the scope of the invention.

This disclosure describes systems and methods for detecting, extracting, and validating events from multiple sources. The sources include news sources, social media websites, and/or any source that contains data about events.

Various embodiments of the systems and methods disclosed herein collect data from different sources in order to identify independent events.

FIG. 1 is a block diagram of a search system 100 according to the present invention. Search system 100 includes one or more client computing devices comprising a processor that executes software modules associated with search system 100, including a graphical user interface 102 that accesses a search engine 104 and communicates search queries in the form of binary data with a server device 106 over a network 108. In the exemplary embodiment, search system 100 is implemented in a client/server computing architecture. However, search system 100 may be implemented using other computer architectures (e.g., a stand-alone computer, a mainframe system with terminals, an application service provider (ASP) model, a peer-to-peer model, etc.). Network 108 comprises any suitable hardware and software capable of communicating digital data between computing devices, such as a local area network, a wide area network, the Internet, a wireless network, a mobile phone network, and the like. It will therefore be apparent that system 100 may be implemented over a single network 108 or using multiple networks 108.

The user's computing device 102 accesses search engine 104, which includes software capable of submitting search queries. A search query is a parameter provided to search engine 104 to indicate the information that is to be retrieved. The search query is provided by a user or another software application in any suitable data format (e.g., integer, string, complex object) that is compatible with the parsing and processing routines of search engine 104. In some embodiments, search engine 104 is a web-based tool that is accessible through a browser or other software application on the user's computing device 102 and that enables the user or software application to locate information on the World Wide Web. In some embodiments, search engine 104 is an application software module native to system 100 that enables a user or application to locate information in the databases of system 100.

Server device 106, which may be implemented as a single server device 106 or in a distributed architecture across multiple server computers, includes an entity extraction module 110, an entity co-occurrence knowledge base 112, and an entity-indexed corpus 114. Entity extraction module 110 is a computer software and/or hardware module capable of extracting and disambiguating independent entities from a given set of queries, such as query strings, structured data, and the like. Entities are, for example, people, organizations, geographic locations, dates, and/or times. During extraction, one or more feature recognition and extraction algorithms are used. In addition, each extracted feature is assigned a score indicating the level of certainty that the feature was correctly extracted with the correct attributes. The relative weight or relevance of each feature is determined taking its feature attributes into account. Further, the relevance of the association between features is determined using a weighted scoring model.

According to various embodiments, entity co-occurrence knowledge base 112 is built as, but is not limited to, an in-memory computer database (not shown) and includes other components (not shown), such as one or more search controllers, multiple search nodes, collections of compressed data, and a disambiguation computer module. A search controller is selectively associated with one or more search nodes. Each search node is capable of independently performing a fuzzy key search through a collection of compressed data and returning a set of scored results to its associated search controller.
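The controller and node arrangement described above can be sketched as follows. This is a minimal illustration under stated assumptions: the class names `SearchNode` and `SearchController`, the use of `zlib` for the compressed collection, and the use of `difflib` similarity as the fuzzy key comparison are choices made for the example and are not specified by the disclosure.

```python
import difflib
import json
import zlib


class SearchNode:
    """Holds one compressed collection of keys and searches it independently."""

    def __init__(self, keys):
        # The collection is kept compressed in memory.
        self._blob = zlib.compress(json.dumps(keys).encode("utf-8"))

    def fuzzy_search(self, key, cutoff=0.6):
        """Return (matching key, score) pairs whose similarity meets the cutoff."""
        keys = json.loads(zlib.decompress(self._blob).decode("utf-8"))
        results = []
        for candidate in keys:
            score = difflib.SequenceMatcher(None, key.lower(), candidate.lower()).ratio()
            if score >= cutoff:
                results.append((candidate, score))
        return results


class SearchController:
    """Fans a query out to its associated nodes and merges their scored results."""

    def __init__(self, nodes):
        self.nodes = nodes

    def search(self, key):
        merged = [hit for node in self.nodes for hit in node.fuzzy_search(key)]
        return sorted(merged, key=lambda hit: hit[1], reverse=True)


controller = SearchController([
    SearchNode(["Apple", "Organization A"]),
    SearchNode(["Steve Jobs", "Jobs"]),
])
hits = controller.search("Steve Jbs")
```

Because each node owns its own collection, the per-node searches in this sketch could be run in parallel without sharing state, which is one reason such a partitioned layout suits an in-memory database.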

Entity co-occurrence knowledge base 112 includes related entities that are based on features and ranked by confidence score. Various methods for linking the features may be used, essentially applying a weighted model to determine which entity types are most important and which carry more weight and, based on the confidence scores, to determine how confidently the correct features were extracted. Entity-indexed corpus 114 includes data from multiple sources, such as the Internet, having a massive corpus or a live corpus.

FIG. 2 is a flowchart illustrating a method 200 for searching for related entities using entity co-occurrence, implemented in a search system 100 such as that shown in FIG. 1. According to various embodiments, before method 200 begins, an entity-indexed corpus 114 similar to that shown in FIG. 1 has been fed with data from multiple sources (e.g., the Internet, websites, blogs, word processing files, plain text files), such as a massive corpus or a live corpus of electronic data. Entity-indexed corpus 114 includes a plurality of indexed entities that are constantly updated as new data is discovered.

In one embodiment, method 200 starts at step 202, when a user or software application of computing device 102 provides search engine 104 with one or more search queries containing one or more entities. The search queries provided at step 202 are processed by search system 100 one at a time, from 1 to n. A search query at step 202 is a combination of keywords in, for example, a string, structured data, or another suitable data format. In the exemplary embodiment of FIG. 2, the keywords of the search query are entities representing people, organizations, geographic locations, dates, and/or times.

The search queries from step 202 are then processed for entity extraction at step 204. In this step, entity extraction module 110 processes the search queries from step 202 as entities and compares them all against entity co-occurrence knowledge base 112 in order to extract and disambiguate as many entities as possible. During extraction, one or more feature recognition and extraction algorithms are used. In addition, each extracted feature is assigned a score indicating the level of certainty that the feature was correctly extracted with the correct attributes. The relative weight or relevance of each feature is determined taking its feature attributes into account. Further, the relevance of the association between features is determined using a weighted scoring model.

In addition, various methods for linking the features may be used, essentially applying a weighted model to determine which entity types are most important and which carry more weight and, based on the confidence scores, to determine how confidently the correct features were extracted. Once the entities have been extracted and ranked based on confidence score, an index ID, which in some cases is a number, is assigned to each extracted entity at step 206.

Next, at step 208, a search is performed based on the entity index IDs assigned at step 206. In search step 208, the extracted entities are located within entity-indexed corpus 114 using standard indexing methods. Once the extracted entities are located, the method continues to entity association step 210. In entity association step 210, all data, such as documents, videos, pictures, and files, in which at least two of the extracted entities overlap is retrieved from entity-indexed corpus 114. Finally, at step 212, a list of potential results is built, sorted by relevance, and presented to the user as search results. The list of results then shows only links to data in which the user can find the related entities of interest.
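Steps 208 through 212 can be sketched as a lookup in an inverted index keyed by entity index ID. This is a toy illustration: the contents of `ENTITY_INDEX`, the function name `associate`, and the choice to rank results by the number of extracted entities they contain are assumptions made for the example, not details of the disclosed system.

```python
from collections import Counter

# Hypothetical entity-indexed corpus: entity index ID -> IDs of documents containing it.
ENTITY_INDEX = {
    1: {"doc1", "doc2", "doc3"},
    2: {"doc2", "doc3"},
    3: {"doc3", "doc4"},
}


def associate(entity_ids, index=ENTITY_INDEX):
    """Return documents containing at least two of the given entities, most overlap first."""
    counts = Counter(doc for entity_id in entity_ids for doc in index.get(entity_id, ()))
    hits = [(doc, n) for doc, n in counts.items() if n >= 2]
    # Assumed relevance: more co-occurring entities ranks higher; ties broken by document ID.
    return [doc for doc, n in sorted(hits, key=lambda hit: (-hit[1], hit[0]))]


results = associate([1, 2, 3])
```

With the toy index above, "doc3" contains all three entities and "doc2" contains two, so both are returned, while "doc1" and "doc4" contain only one entity each and are left out.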

FIG. 3 is a specific example of a method 300 for searching for related entities using entity co-occurrence, as described above in connection with FIG. 2. As described for FIG. 2, according to various embodiments, before method 300 begins, an entity-indexed corpus 114 similar to that described in FIG. 1 has been fed with data from multiple sources (the Internet), such as a massive corpus or a live corpus. Entity-indexed corpus 114 includes a plurality of indexed entities that are constantly updated as new data is discovered.

In this exemplary embodiment, the user is looking for information about "Jobs" at the company "Apple". To do so, the user enters one or more entities (i.e., the search query at step 302) through user interface 102, which is, but is not limited to, an interface with a search engine 104 as described for FIG. 1. By way of example and not limitation, the user enters a combination of entities such as "Apple + Jobs". Search engine 104 then generates search queries at step 302 and sends them to server device 106 for processing. At server device 106, entity extraction module 110 performs the entity extraction of step 304 on the search queries entered at step 302.

The entity extraction module 110 then treats the terms of the search query entered in step 302, for example "Apple" and "Jobs", as entities and compares all of them against the entity co-occurrence knowledge base 112 in order to extract and disambiguate as many entities as possible. During extraction, one or more feature recognition and extraction algorithms are used. A score is also assigned to each extracted feature, indicating the level of certainty that the feature was correctly extracted with the correct attributes. The relative weight or relevance of each feature is determined by taking the feature attributes into account. In addition, a weighted scoring model is used to determine the relevance of the associations between features.

Further, various methods of linking features may be used. These essentially use a weighted model to determine which entity types are most important and which carry more weight and, based on confidence scores, to determine how confidently the correct features were extracted. As a result, a table 306 containing entities and co-occurrences is generated. Table 306 shows the entity "Apple" and its co-occurrences, in this case Apple and Jobs, and Apple and Steve Jobs. Table 306 also includes Apple and Organization A, which is found to be relevant because Organization A does business with Apple and generates "jobs" at Organization A. Other co-occurrences are found with lower importance. Accordingly, Apple and Jobs have the highest score (1) and are listed first; Apple and Steve Jobs have the second-highest score (0.8); and Apple and Organization A have the lowest score (0.3) and are listed last.

Once the entities have been extracted and ranked by confidence score, an index ID, which in some cases may be a number, is assigned to each extracted entity in step 308. Table 310 shows the index IDs assigned to the extracted entities: "Apple" has index ID 1, "Jobs" has index ID 2, "Steve Jobs" has index ID 3, and "Organization A" has index ID 4.
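The ranking and ID assignment described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the row layout, the function names `rank_cooccurrences` and `assign_index_ids`, and the choice of sequential integers in first-seen order are all assumptions. Only the three scores and the resulting IDs come from the description of tables 306 and 310.

```python
# Hypothetical co-occurrence rows: (entity, co-occurring entity, confidence score).
# The scores mirror the example of table 306; the tuple layout is an assumption.
cooccurrences = [
    ("Apple", "Organization A", 0.3),
    ("Apple", "Jobs", 1.0),
    ("Apple", "Steve Jobs", 0.8),
]


def rank_cooccurrences(rows):
    """Sort co-occurrence rows from highest to lowest confidence score."""
    return sorted(rows, key=lambda row: row[2], reverse=True)


def assign_index_ids(ranked_rows):
    """Give each distinct entity a numeric ID in the order it is first seen."""
    index_ids = {}
    for entity, co_entity, _score in ranked_rows:
        for name in (entity, co_entity):
            if name not in index_ids:
                index_ids[name] = len(index_ids) + 1
    return index_ids


ranked = rank_cooccurrences(cooccurrences)
index_ids = assign_index_ids(ranked)
```

With this input, `index_ids` maps "Apple" to 1, "Jobs" to 2, "Steve Jobs" to 3, and "Organization A" to 4, matching the IDs listed for table 310.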

Next, a search step 312 based on the entity index IDs (308) is performed. In search step 312, the extracted entities, such as "Apple", "Jobs", "Steve Jobs", and "Organization A", are located within the entity-indexed corpus 114 using standard indexing methods.

After the extracted entities have been located within the entity-indexed corpus 114, the method proceeds to entity association step 314. In entity association step 314, all data (documents, videos, pictures, files, and the like) in which at least two extracted entities co-occur is retrieved from the entity-indexed corpus 114, and a list of links is built as the search results (step 318). By way of example and not limitation, table 316 shows how the extracted entities are associated with the data in the entity-indexed corpus 114. In table 316, documents 1, 4, 5, 7, 8, and 10 each show two extracted entities co-occurring, so the links to those documents are presented as search results in step 318.
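The association step can be pictured as counting, per document, how many of the extracted entities appear in it and keeping the documents where the count is at least two. The sketch below assumes a simple inverted index from index ID to document IDs. The posting lists are invented for illustration (the text does not reproduce the contents of table 316) and were chosen so that the result matches the documents named above.

```python
def documents_with_cooccurrence(postings, min_entities=2):
    """Return sorted document IDs containing at least `min_entities` entities.

    `postings` maps an entity index ID to the documents that mention it.
    """
    counts = {}
    for _entity_id, doc_ids in postings.items():
        for doc_id in set(doc_ids):  # count each entity once per document
            counts[doc_id] = counts.get(doc_id, 0) + 1
    return sorted(doc for doc, n in counts.items() if n >= min_entities)


# Hypothetical posting lists keyed by the index IDs of table 310.
postings = {
    1: [1, 2, 4, 5, 7, 8, 10],  # Apple
    2: [1, 4, 6],               # Jobs
    3: [5, 7],                  # Steve Jobs
    4: [3, 8, 10],              # Organization A
}

result_docs = documents_with_cooccurrence(postings)
```

Here `result_docs` is `[1, 4, 5, 7, 8, 10]`. Documents 2, 3, and 6 are dropped because only one extracted entity appears in each.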

FIG. 4 is a block diagram of a search computer system 400 according to the present invention. The search system 400 includes one or more user interfaces 402 to a search engine 404 that communicates with a server device 406 over a network 408. In this embodiment, the search system 400 is implemented in one or more special-purpose computers and computer modules described below, including by way of a client/server architecture. However, the search system 400 may also be implemented using other computer architectures (for example, a stand-alone computer, a mainframe system with terminals, an ASP model, or a peer-to-peer model). In one embodiment, the search computer system 400 includes a plurality of networks, such as a local area network, a wide area network, the Internet, a wireless network, and a mobile telephone network.

The search engine 404 includes a user interface, such as a web-based tool, that allows users to locate information on the World Wide Web. The search engine 404 also includes user interface tools that allow users to locate information within an internal database system. The server device 406, which may be implemented in a single server device 406 or in a distributed architecture across multiple server computers, includes an entity extraction module 410, a fuzzy score matching module 412, and an entity co-occurrence knowledge base database 414.

The entity extraction module 410 is a hardware and/or software module configured to extract and disambiguate independent entities on the fly from a given set of queries, such as query strings, partial queries, and structured data. Entities are, for example, people, organizations, geographic locations, dates, and/or times. During extraction, one or more feature recognition and extraction algorithms are used. A score is also assigned to each extracted feature, indicating the level of certainty that the feature was correctly extracted with the correct attributes. The relative weight or relevance of each feature is determined by taking the feature attributes into account. In addition, a weighted scoring model is used to determine the relevance of the associations between features.

The fuzzy score matching module 412 includes a plurality of algorithms that are selected according to the type of entity extracted from a given search query. The function of the algorithms is to determine whether a given search query received through user input and other searched strings identified by the algorithm are similar to each other, or approximately match a given pattern string. Fuzzy matching is also known as fuzzy string matching, inexact matching, and approximate matching. The entity extraction module 410 and the fuzzy score matching module 412 work in conjunction with the entity co-occurrence knowledge base 414 to generate search suggestions for the user.

According to various embodiments, the entity co-occurrence knowledge base 414 is built as, without limitation, an in-memory computer database and includes components such as one or more search controllers, a plurality of search nodes, a collection of compressed data, and a disambiguation module. A given search controller is selectively associated with one or more search nodes. Each search node can independently perform a fuzzy key search through the collection of compressed data and return a set of scored results to its associated search controller.
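One way to picture the controller and node arrangement is sketched below. It is a toy model under stated assumptions, not the architecture of the actual database: the class names `SearchNode` and `SearchController`, the use of `zlib` for the compressed partition, and the use of `difflib.SequenceMatcher` as the fuzzy score are all choices made for this illustration.

```python
import zlib
from difflib import SequenceMatcher


class SearchNode:
    """Holds one compressed partition of records and searches it on its own."""

    def __init__(self, records):
        # Assumption: records are newline-free strings kept compressed in memory.
        self._blob = zlib.compress("\n".join(records).encode("utf-8"))

    def fuzzy_search(self, key, threshold=0.6):
        """Return (record, score) pairs whose similarity to `key` meets the threshold."""
        records = zlib.decompress(self._blob).decode("utf-8").split("\n")
        scored = []
        for record in records:
            score = SequenceMatcher(None, key.lower(), record.lower()).ratio()
            if score >= threshold:
                scored.append((record, score))
        return scored


class SearchController:
    """Fans a query out to its associated nodes and merges the scored results."""

    def __init__(self, nodes):
        self.nodes = nodes

    def search(self, key, threshold=0.6):
        results = []
        for node in self.nodes:
            results.extend(node.fuzzy_search(key, threshold))
        # Highest score first; ties broken alphabetically for stable output.
        return sorted(results, key=lambda item: (-item[1], item[0]))


controller = SearchController([
    SearchNode(["Michael Jordan", "Apple"]),
    SearchNode(["Michael Jackson"]),
])
hits = controller.search("Michael Jordan")
```

In a real deployment the nodes would run in parallel on separate processes or machines. The sequential loop here only shows the division of responsibility between nodes, which score their own partition, and the controller, which merges.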

The entity co-occurrence knowledge base 414 contains related entities that are based on features and ranked by confidence score. Various methods of linking features may be used. These essentially use a weighted model to determine which entity types are most important and which carry more weight and, based on confidence scores, to determine how confidently the correct features were extracted.

FIG. 5 is a flowchart illustrating a method 500 for generating search suggestions using fuzzy score matching and entity co-occurrence in a knowledge base. Method 500 is implemented in a search system 400 similar to that shown in FIG. 4.

In one embodiment, method 500 begins in step 502 when a user starts typing a search query into the search engine interface 402 shown in FIG. 4. As the search query is typed in step 502, the search system 400 performs an on-the-fly process. According to various embodiments, the search query input of step 502 may be complete or partial, and correctly spelled or misspelled. The search system 400 then performs a partial entity extraction step 504 on the search query input of step 502. Partial entity extraction step 504 runs a quick search against the entity co-occurrence knowledge base 414 to identify whether the search query entered in step 502 is an entity and, if so, what type of entity it is. According to various embodiments, the search query input of step 502 may refer to, among others, a person, an organization, a location or place, or a date. Once the entity type of the search query input is identified, the fuzzy score matching module 412 selects the corresponding fuzzy matching algorithm in step 506. For example, if the search query is identified as an entity referring to a person, the fuzzy score matching module 412 selects a string matching algorithm for persons, for instance one that extracts the different components of a person's name, including first name, middle name, last name, and title. In another embodiment, if the search query is identified as an entity referring to an organization, the fuzzy score matching module 412 selects a string matching algorithm for organizations that handles identifying terms such as school, university, corporation, and company. In this way, the fuzzy score matching module 412 selects the string matching algorithm that corresponds to the type of entity identified in the search query input, which yields a better search. Once the string matching algorithm has been adjusted to the identified entity type, fuzzy score matching step 508 is performed.
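The type-dependent selection of step 506 can be sketched as a dispatch table. Everything named here is hypothetical: the parser functions, the title list, and the organization term list are assumptions for illustration, since the description names the name components and identifying terms but does not specify the algorithms.

```python
# Hypothetical vocabularies; the source lists only examples of such terms.
PERSON_TITLES = {"mr", "mrs", "ms", "dr", "prof"}
ORG_TERMS = {"school", "university", "corporation", "company", "inc"}


def _tokens(text):
    """Lowercase the text and strip trailing punctuation from each token."""
    cleaned = [t.strip(".,").lower() for t in text.split()]
    return [t for t in cleaned if t]


def parse_person(text):
    """Split a person's name into title, first, middle, and last components."""
    tokens = _tokens(text)
    parts = {"title": None, "first": None, "middle": None, "last": None}
    if tokens and tokens[0] in PERSON_TITLES:
        parts["title"] = tokens.pop(0)
    if tokens:
        parts["first"] = tokens.pop(0)
    if tokens:
        parts["last"] = tokens.pop()
    if tokens:
        parts["middle"] = " ".join(tokens)
    return parts


def parse_organization(text):
    """Separate identifying terms (e.g. 'university') from the core name."""
    tokens = _tokens(text)
    return {
        "core": " ".join(t for t in tokens if t not in ORG_TERMS),
        "terms": [t for t in tokens if t in ORG_TERMS],
    }


MATCHERS = {"person": parse_person, "organization": parse_organization}


def select_matcher(entity_type):
    """Pick the parser for the identified entity type, with a generic fallback."""
    return MATCHERS.get(entity_type, lambda text: {"raw": text.lower()})
```

For example, `select_matcher("person")("Dr. Michael J. Fox")` yields the components `dr`, `michael`, `j`, and `fox`, which a downstream matcher could then compare field by field.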

In fuzzy score matching step 508, the extracted entity or entities and any non-entities are selected and compared against the entity co-occurrence knowledge base 414. The extracted entities may include incomplete names of persons (for example, a first name and the first letter of the last name), abbreviations of organizations (for example, "UN" for "United Nations"), short forms, and nicknames, among others. The entity co-occurrence knowledge base 414 already holds a plurality of records indexed as structured data, such as entity-to-entity, entity-to-topic, and entity-to-fact records, among others. This allows the fuzzy score matching of step 508 to run very quickly. The fuzzy score matching of step 508 uses common string metrics such as, without limitation, Levenshtein distance, strcmp95, and ITF scoring. The Levenshtein distance between two words is the minimum number of single-character edits required to change one word into the other.
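As a concrete reference for the definition above, the following is a standard dynamic-programming sketch of Levenshtein distance, counting insertions, deletions, and substitutions. The function name and the two-row layout are choices made for this illustration; the source does not say how the metric is computed in the system.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits that turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]
```

For instance, `levenshtein("kitten", "sitting")` is 3: two substitutions and one insertion.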

Finally, once fuzzy score matching step 508 has finished comparing and searching the search query against all records of the entity co-occurrence knowledge base 414, the record that best matches or comes closest to the given pattern string (that is, the search query input of step 502) is selected as the first candidate for the search suggestions of step 510. Other records that match the given pattern string less closely are placed below the first candidate in decreasing order of closeness. The search suggestions of step 510 are presented to the user in a drop-down list of possible matches, which the user may or may not ignore.
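The ordering of candidates by closeness can be tried out with Python's standard `difflib.get_close_matches`, which returns the best matches sorted from most to least similar. Note that this swaps in the similarity ratio of `difflib` rather than Levenshtein distance, and the record list, the `n` limit, and the `cutoff` value are assumptions for illustration.

```python
from difflib import get_close_matches

# Hypothetical knowledge-base records.
records = ["Michael Jordan", "Michael Jackson", "Michael Dell", "Apple"]


def suggest(query, candidates, limit=3, cutoff=0.5):
    """Return up to `limit` candidates, closest match first."""
    return get_close_matches(query, candidates, n=limit, cutoff=cutoff)


# A misspelled query still surfaces the intended record as first candidate.
suggestions = suggest("Micheal Jordan", records)
```

The first element of `suggestions` plays the role of the first candidate of step 510, and the remaining elements follow in decreasing similarity.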

FIG. 6 shows an exemplary user interface 600 based on the method for generating search suggestions using fuzzy score matching and the entity co-occurrence knowledge base described with respect to FIGS. 4 and 5. In this example, the user enters a partial query 604 into a search box 606 through a search engine interface similar to that shown in FIG. 4. By way of example and not limitation, the partial query 604 is an incomplete name of a person, such as "Michael J", as shown in FIG. 6. It is considered a partial query 604 because the user has not yet selected the search button 608 or otherwise submitted the partial query 604 to the search system 400 to perform an actual search and obtain results.

Following method 500 (FIG. 5), as the user types "Michael J", the entity extraction module 410 performs an on-the-fly quick search of the first word (Michael) against the entity co-occurrence knowledge base 414 to identify the type of entity; in this example, the entity refers to a person's name. As a result, the fuzzy score matching module 412 selects a string matching algorithm tailored to names of persons. A person's name may be found in databases written in different forms, for example using only initials (short form), or the first name and the first letter of the last name, or the first name, middle initial, and last name, or a combination thereof. The fuzzy score matching module 412 uses a common string metric, such as Levenshtein distance, to determine and assign a score to each entity, topic, or fact in the entity co-occurrence knowledge base 414 that matches the entity "Michael". In this example, Michael matches a very large number of records bearing that name. However, when the user types the following letter "J", the fuzzy score matching module 412 performs another comparison, based on Levenshtein distance, against all co-occurrences with Michael in the entity co-occurrence knowledge base 414. The entity co-occurrence knowledge base 414 then selects all possible matches with the highest scores for "Michael J". For example, the fuzzy score matching module 412 returns to the user search suggestions 610 such as "Michael Jackson", "Michael Jordan", "Michael J. Fox", or in some cases "Michael Dell". The user can then select one of the suggested persons from the drop-down list to complete the search query. Extending the example, a query such as "Michael the basketball player" leads to the suggestion "Michael Jordan", based on the results returned by searching the entity co-occurrence knowledge base for "Michael" among person-entity name variants and for "the basketball player" among co-occurrence features such as key phrases, facts, and topics. As another example, "Alexander the actor" leads to the suggestion "Alexander Polinsky". As those skilled in the art will appreciate, existing platforms cannot generate suggestions in the manner described above.
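The two behaviours in this example, narrowing on a partial name and resolving a descriptive phrase through co-occurring features, can be sketched together. The tiny knowledge base, the stop-word list, and the rule that every query token must match either a name-token prefix or a feature are assumptions for this illustration. For brevity it uses exact prefix and feature matching rather than a fuzzy metric.

```python
# Hypothetical knowledge base: entity name -> co-occurring key phrases.
KNOWLEDGE_BASE = {
    "Michael Jackson": {"singer", "pop"},
    "Michael Jordan": {"basketball", "player"},
    "Michael J. Fox": {"actor"},
    "Alexander Polinsky": {"actor"},
}
STOPWORDS = {"the", "a", "an"}


def suggest_entities(query):
    """Suggest entities whose name or co-occurring features cover the query."""
    tokens = [t.lower() for t in query.split() if t.lower() not in STOPWORDS]
    if not tokens:
        return []
    matches = []
    for name, features in KNOWLEDGE_BASE.items():
        name_tokens = [t.strip(".").lower() for t in name.split()]
        # Each query token must be a prefix of a name token or a known feature.
        covered = all(
            any(nt.startswith(tok) for nt in name_tokens) or tok in features
            for tok in tokens
        )
        if covered:
            matches.append(name)
    return sorted(matches)
```

With this data, `suggest_entities("Michael J")` returns the three persons whose names contain a token starting with "j", while `suggest_entities("Michael the basketball player")` returns only "Michael Jordan".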

FIG. 7 is a block diagram of a search system 700 according to the present invention. The search system 700 includes one or more user interfaces 702 to a search engine 704 that communicates with a server device 706 over a network 708. In this embodiment, the search system 700 is implemented in a client/server architecture, but the search system 700 may also be implemented using other computer architectures (for example, a stand-alone computer, a mainframe system with terminals, an ASP model, or a peer-to-peer model) and a plurality of networks, such as a local area network, a wide area network, the Internet, a wireless network, and a mobile telephone network.

The search engine 704 includes, without limitation, an interface provided through a web-based tool that allows users to locate information on the World Wide Web. The search engine 704 also includes tools that allow users to locate information within an internal database system. The server device 706, which may be implemented in a single server device 706 or in a distributed architecture across multiple server computers, includes an entity extraction module 710, a fuzzy score matching module 712, and an entity co-occurrence knowledge base database 714.

The entity extraction module 710 is a hardware and/or software module that can extract and disambiguate independent entities on the fly from a given set of queries, such as query strings, partial queries, and structured data. Entities are, for example, people, organizations, geographic locations, dates, and/or times. During extraction, one or more feature recognition and extraction algorithms are used. A score is also assigned to each extracted feature, indicating the level of certainty that the feature was correctly extracted with the correct attributes. The relative weight or relevance of each feature is determined by taking the feature attributes into account. In addition, a weighted scoring model is used to determine the relevance of the associations between features.

The fuzzy score matching module 712 includes a plurality of algorithms that are adjusted or selected according to the type of entity extracted from a given search query. The function of the algorithms is to determine whether a given search query (the input) and the searched and suggested strings are similar to each other, or approximately match a given pattern string. Fuzzy matching is also known as fuzzy string matching, inexact matching, and approximate matching. The entity extraction module 710 and the fuzzy score matching module 712 work in conjunction with the entity co-occurrence knowledge base 714 to generate search suggestions for the user.

According to various embodiments, the entity co-occurrence knowledge base 714 is built as, without limitation, an in-memory computer database and includes components such as one or more search controllers, a plurality of search nodes, a collection of compressed data, and a disambiguation module. A given search controller is selectively associated with one or more search nodes. Each search node can independently perform a fuzzy key search through the collection of compressed data and return a set of scored results to its associated search controller.

The entity co-occurrence knowledge base 714 contains related entities that are based on features and ranked by confidence score. Various methods of linking features may be used. These essentially use a weighted model to determine which entity types are most important and which carry more weight and, based on confidence scores, to determine how confidently the correct features were extracted.

FIG. 8 is a flowchart illustrating one embodiment of a method 800 for generating search suggestions of related entities based on co-occurrence and/or fuzzy score matching. Method 800 is implemented in a search system 700 similar to that described with respect to FIG. 7.

In one embodiment, method 800 begins when a user types a search query in step 802 into the search engine 704 described above with respect to FIG. 7. As the search query is typed, the search system 700 performs an on-the-fly process. According to various embodiments, the search query may be complete and/or partial, and correctly spelled and/or misspelled. A partial entity extraction step 804 is then performed on the search query. Partial entity extraction step 804 runs a quick search against the entity co-occurrence knowledge base 714 to identify whether the search query contains an entity and, if so, the type of the entity. According to various embodiments, the search query entity may refer to, among others, a person, an organization, a location or place, or a date. Once the entity type is known, the fuzzy score matching module 712 selects the corresponding fuzzy matching algorithm in step 806. For example, if the search query is identified as an entity referring to a person, the fuzzy score matching module 712 adjusts or selects a string matching algorithm for persons that can extract the different components of a person's name, including first name, middle name, last name, and title. In another embodiment, if the search query is identified as an entity referring to an organization, the fuzzy score matching module 712 adjusts or selects a string matching algorithm for organizations that handles identifying terms such as school, university, corporation, and company. The fuzzy score matching module 712 thus adjusts or selects the string matching algorithm for the type of entity in order to facilitate the search. Once the string matching algorithm has been adjusted or selected to correspond to the type of entity, the fuzzy score matching step is performed in step 808.

In fuzzy score matching step 808, the extracted entity or entities and any non-entities are selected and compared against the entity co-occurrence knowledge base 714. The extracted entities may include incomplete names of persons (for example, a first name and the first letter of the last name), abbreviations of organizations (for example, "UN" for "United Nations"), short forms, and nicknames, among others. The entity co-occurrence knowledge base 714 already holds a plurality of records indexed as structured data, such as entity-to-entity, entity-to-topic, and entity-to-fact records, among others. This allows the fuzzy score matching of step 808 to run quickly. Fuzzy score matching uses common string metrics such as, without limitation, Levenshtein distance, strcmp95, and ITF scoring. The Levenshtein distance between two words is the minimum number of single-character edits required to change one word into the other.

Once the fuzzy score matching of step 808 has finished comparing and searching the search query against all records of the entity co-occurrence knowledge base 714, the record that best matches or comes closest to the given pattern string of the search query input is selected in step 810 as the first candidate for the search suggestions. Other records that match the given pattern string of the search query input less closely are placed below the first candidate in decreasing order of closeness. The search suggestions of step 810 are presented to the user in a drop-down list of possible matches, from which the user selects in order to complete the query.

In another embodiment, after the user selects a match of interest, the search system 700 takes that selection as a new search query in step 812. An entity extraction step 814 is then performed on the new search query. During extraction, one or more feature recognition and extraction algorithms are used. A score is also assigned to each extracted feature, indicating the level of certainty that the feature was correctly extracted with the correct attributes. The relative weight or relevance of each feature is determined by taking the feature attributes into account. In addition, a weighted scoring model is used to determine the relevance of the associations between features. The entity extraction module 710 then runs a search against the entity co-occurrence knowledge base 714 to find related entities based on the co-occurrences with the highest scores (step 816). Finally, in step 818, a drop-down list of search suggestions containing the related entities is presented to the user before an actual data search is performed on the electronic document corpus.
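Steps 812 to 818 amount to a lookup of the highest-scoring co-occurrences for the selected entity. A minimal sketch follows; the pair-keyed score table, its values, and the function name `related_entities` are invented for illustration and are not taken from the source.

```python
# Hypothetical co-occurrence scores keyed by unordered entity pairs.
COOCCURRENCE_SCORES = {
    ("Michael Jordan", "Chicago Bulls"): 0.9,
    ("Michael Jordan", "Basketball"): 0.8,
    ("Michael Jordan", "Organization A"): 0.2,
    ("Michael Jackson", "Pop music"): 0.9,
}


def related_entities(selected, scores, top_n=2):
    """Return the `top_n` entities that co-occur most strongly with `selected`."""
    related = []
    for (left, right), score in scores.items():
        if left == selected:
            related.append((right, score))
        elif right == selected:
            related.append((left, score))
    # Highest score first; ties broken alphabetically for stable output.
    related.sort(key=lambda item: (-item[1], item[0]))
    return [name for name, _score in related[:top_n]]


suggested = related_entities("Michael Jordan", COOCCURRENCE_SCORES)
```

The resulting list would populate the drop-down of step 818 before any search of the document corpus is run.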

FIG. 9 is an exemplary embodiment of a user interface 900 associated with the method 800 for generating search suggestions of related entities based on co-occurrence and/or fuzzy score matching. In this example, the user enters a partial query 904 into a search box 906 through a search engine interface 902 similar to that shown in FIG. 7. By way of example and not limitation, the partial query 904 is an incomplete personal name such as “Michael J”, as shown in FIG. 9. It is considered a partial query 904 because the user has not yet selected the search button 908 or otherwise submitted the partial query 904 to the search system 700 to perform an actual search and obtain results.

Following the method 800, as the user types “Michael J”, the entity extraction module 710 performs an on-the-fly quick search of the first word (“Michael”) against the entity co-occurrence knowledge base 714 to identify the type of the entity; in this example, the entity refers to a person's name. Consequently, the fuzzy score matching module 712 selects a string matching algorithm tailored to personal names. Personal names may be found in databases written in different forms, for example using initials only (short form), the first name and the first character of the last name, the first name with middle initial and last name, or a combination of these. The fuzzy score matching module 712 uses a common string metric, such as Levenshtein distance, to determine and assign scores to the entities, topics or facts in the entity co-occurrence knowledge base 714 that match the entity “Michael”. In this example, “Michael” matches a very large number of records containing that name. When the user types the next character, “J”, however, the fuzzy score matching module 712 performs another Levenshtein-distance comparison against all co-occurrences with “Michael” in the entity co-occurrence knowledge base 714. The entity co-occurrence knowledge base 714 then selects all possible matches with the highest scores for “Michael J”. For example, the fuzzy score matching module 712 returns to the user search suggestions 910 such as “Michael Jackson”, “Michael Jordan”, “Michael J. Fox” or, in some cases, “Michael Dell”. The user can then select one of the suggested persons from the drop-down list, or ignore the suggestions and continue typing. Extending this example, a query such as “Michael the basketball player” leads to the suggestion “Michael Jordan”, based on the results returned by searching the entity co-occurrence knowledge base for “Michael” among person-entity name variations and for “the basketball player” among co-occurrence features such as key phrases, facts and topics. As another example, “Alexander the actor” leads to the suggestion “Alexander Polinsky”. As will be apparent to those skilled in the art, existing search platforms cannot provide suggestions generated in this manner.
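The prefix matching described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the name list, the `suggest` function and the choice to compare the partial query against a same-length prefix of each record are all assumptions made for the example. Only the use of Levenshtein distance comes from the text.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Hypothetical person-name records standing in for knowledge base 714.
KNOWN_NAMES = ["Michael Dell", "Michael J. Fox", "Michael Jackson", "Michael Jordan"]


def suggest(partial: str, names: list[str], limit: int = 3) -> list[str]:
    """Rank names by edit distance between the partial query and the
    equally long prefix of each name (assumed scoring rule)."""
    p = partial.lower()

    def key(name: str) -> tuple[int, str]:
        return (levenshtein(p, name.lower()[:len(p)]), name)

    return sorted(names, key=key)[:limit]


if __name__ == "__main__":
    print(suggest("Michael J", KNOWN_NAMES, limit=4))
```

Under this scoring rule the three names beginning with “Michael J” tie at distance 0 and “Michael Dell” follows at distance 1, which mirrors the way a near match can still appear lower in the suggestion list.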

In this embodiment, the user selects “Michael Jordan” from the drop-down list, as shown in FIG. 9, to complete the partial query 904. The selection is then processed by the search system 700 as a new search query 912, and entity extraction from the new search query 912 is performed. During extraction, one or more feature recognition and extraction algorithms are used. Each extracted feature is assigned a score indicating the level of confidence that the feature was extracted correctly and with the correct attributes. The relative weight or relevance of each feature is determined by taking the feature attributes into account. In addition, a weighted scoring model is used to determine the relevance of the associations between features. The entity extraction module 710 then searches the entity co-occurrence knowledge base 714 for “Michael Jordan” to find related entities based on the highest-scoring co-occurrences. Finally, a drop-down list of search suggestions 914 containing the related entities is presented to the user before the actual data search is performed by clicking the search button 908. The systems and methods described with respect to FIGS. 7-9 are quick and convenient because they allow the user to discover useful relationships.

FIG. 10 is a block diagram of a search system 1000 according to the present invention. The search system 1000 includes a search engine 1002, which provides one or more user interfaces that accept data input from a user, for example user queries.

The search system 1000 includes one or more databases, among them an entity database 1004 and a trend database 1006. The databases are stored on a local server or a web-based server. The search system 1000 is thus implemented in a client/server architecture, but it may also be implemented using other computer architectures (for example, a stand-alone computer, a mainframe system with terminals, an ASP model, or a peer-to-peer model) and multiple networks (for example, a local area network, a wide area network, the Internet, a wireless network, or a mobile telephone network).

The search engine 1002 includes, but is not limited to, web-based tools that enable users to locate information on the World Wide Web. The search engine 1002 also includes tools that enable users to locate information within internal database systems.

The entity database 1004 is implemented as a single server or in a distributed architecture across multiple servers. The entity database 1004 accepts a set of entity queries such as query strings, structured data, and the like. This set of entity queries is extracted in advance from multiple corpora available on the Internet and/or a local network. The entity queries are indexed and scored. Entities include, for example, people, organizations, geographic locations, dates and/or times. During extraction, one or more feature recognition and extraction algorithms are used. Each extracted feature is assigned a score indicating the level of confidence that the feature was extracted correctly and with the correct attributes. The relative weight or relevance of each feature is determined by taking the feature attributes into account. In addition, a weighted scoring model is used to determine the relevance of the associations between features.

The trend database 1006 is implemented as a single server or in a distributed architecture across multiple servers. The trend database 1006 accepts a set of entity queries such as query strings, structured data, and the like. This set of entity queries is extracted in advance from historical queries performed by a user and/or multiple users on the Internet and/or a local network. The entity queries are indexed and scored. Entities include, for example, people, organizations, geographic locations, dates and/or times. During extraction, one or more feature recognition and extraction algorithms are used. Each extracted feature is assigned a score indicating the level of confidence that the feature was extracted correctly and with the correct attributes. The relative weight or relevance of each feature is determined by taking the feature attributes into account. In addition, a weighted scoring model is used to determine the relevance of the associations between features.

The entity database 1004 and the trend database 1006 comprise an entity co-occurrence knowledge base, which may be built as, but is not limited to, an in-memory database (not shown) and which includes other components (not shown) such as one or more search controllers, multiple search nodes, collections of compressed data, and a disambiguation module. A search controller is selectively associated with one or more search nodes. Each search node can independently perform a fuzzy key search through a collection of compressed data and return a set of scored results to its associated search controller.
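The controller and node arrangement described above can be sketched as follows. This is an illustrative sketch only: the class names, the use of `zlib` for the compressed collections and of `difflib.SequenceMatcher` as the fuzzy score are assumptions, since the text does not specify the compression scheme or the matching function.

```python
import difflib
import heapq
import zlib


class SearchNode:
    """Holds one compressed collection and searches it independently."""

    def __init__(self, records: list[str]) -> None:
        # Assumed storage format: newline-joined records, zlib-compressed.
        self._blob = zlib.compress("\n".join(records).encode("utf-8"))

    def fuzzy_search(self, key: str) -> list[tuple[float, str]]:
        records = zlib.decompress(self._blob).decode("utf-8").split("\n")
        return [
            (difflib.SequenceMatcher(None, key.lower(), r.lower()).ratio(), r)
            for r in records
        ]


class SearchController:
    """Fans a query out to its associated nodes and merges scored results."""

    def __init__(self, nodes: list[SearchNode]) -> None:
        self._nodes = list(nodes)

    def search(self, key: str, limit: int = 3) -> list[tuple[float, str]]:
        scored = [hit for node in self._nodes for hit in node.fuzzy_search(key)]
        return heapq.nlargest(limit, scored)


if __name__ == "__main__":
    controller = SearchController([
        SearchNode(["Michael Jordan", "Michael Dell"]),
        SearchNode(["Michael Jackson", "Michael J. Fox"]),
    ])
    for score, record in controller.search("Michael Jordon"):
        print(f"{score:.2f}  {record}")
```

Each node returns its own scored list, so the nodes could run in parallel; the controller only needs to merge and truncate.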

The co-occurrence knowledge base contains related entities that are based on features and ranked by confidence score. Various methods of linking features may be used. These essentially use a weighted model to determine which entity types are most important and which carry greater weight and, based on the confidence scores, how reliably the correct features were extracted.

The search system 1000 compares a user query entered in the search engine 1002 against the entity database 1004 and the trend database 1006. The auto-complete mode of the search engine 1002 is enabled from both databases, that is, the entity database 1004 and the trend database 1006. The search system 1000 presents a list of search suggestions 1008 to the user; the list is generated and indexed based on the fuzzy score assigned to each entity suggestion in the databases. The score of each entity suggestion is assigned automatically by the search system 1000 and/or manually by a system supervisor. Entity suggestions are ordered from highest to lowest relevance according to the score obtained by each entity. In addition, scores in the trend database 1006 are assigned using trends and query frequencies from one or more users on the local network and/or the Internet.

The entity suggestions from each database are compared with one another and then indexed and ordered by the rank given by their scores, so that the user is shown a list of search suggestions 1008 that combines the entity suggestions from both databases, that is, the entity database 1004 and the trend database 1006. When the user selects a suggestion from the list, or selects a different result from the suggestion list, the search system 1000 saves that information to the trend database 1006. This yields a self-learning system that increases the reliability and accuracy of the search system 1000. In summary, the trend co-occurrence knowledge base is continuously updated with features extracted from user queries and selected suggestions, providing a means of on-the-fly learning that improves the relevance and accuracy of searches. Furthermore, the trend co-occurrence knowledge base can be populated by the different users of the system and by automated methods such as a trend detection module.
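The feedback loop described above, in which each selected suggestion is saved back to the trend database, might look like the following sketch. The `TrendStore` class and its frequency-based score are hypothetical; the text says only that trend scores use trends and query frequencies, not how they are computed.

```python
from collections import Counter


class TrendStore:
    """Toy stand-in for the trend database: counts selected suggestions."""

    def __init__(self) -> None:
        self._counts: Counter[str] = Counter()

    def record_selection(self, suggestion: str) -> None:
        # Called each time a user picks a suggestion from the list.
        self._counts[suggestion] += 1

    def score(self, suggestion: str) -> float:
        # Assumed scoring rule: share of all recorded selections.
        total = sum(self._counts.values())
        return self._counts[suggestion] / total if total else 0.0


if __name__ == "__main__":
    trends = TrendStore()
    for choice in ["Michael Jordan", "Michael Jordan", "Michael Jackson", "Michael Jordan"]:
        trends.record_selection(choice)
    print(trends.score("Michael Jordan"))
```

Because the counts change with every selection, later suggestion lists reflect what users actually chose, which is the on-the-fly learning the paragraph describes.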

FIG. 11 is a block diagram of a search system 1100 according to the present invention. The search system 1100 includes a search engine 1102, which provides one or more user interfaces that accept data input from a user, such as user queries.

The search system 1100 includes one or more databases, among them an entity database 1104 and a trend database 1106. The databases are stored on a local server or a web-based server. The search system 1100 is thus implemented in a client/server architecture, but it may also be implemented using other computer architectures (for example, a stand-alone computer, a mainframe system with terminals, an ASP model, or a peer-to-peer model) and multiple networks (for example, a local area network, a wide area network, the Internet, a wireless network, or a mobile telephone network).

In one embodiment, operation of the search system 1100 starts when a user enters one or more entities (in a search query) through the user interface of the search engine 1102. A search query is, for example, a combination of keywords in string format, structured data, or the like. These keywords are entities representing people, organizations, geographic locations, dates and/or times. In this embodiment, “Indiana Na” is used as the search query.

“Indiana Na” is then processed for entity extraction. The entity extraction model treats search queries such as “Indiana Na” as entities and compares all of them against the entity co-occurrence knowledge base in the entity database 1104 and the trend database 1106, in order to extract and disambiguate as many entities as possible. Portions of the query text that are not detected as entities (for example, persons, organizations or locations) are treated as conceptual features (for example, topics, facts or key phrases) that can be used to search the entity co-occurrence knowledge base (for example, the entity and trend databases). During extraction, one or more feature recognition and extraction algorithms are used. Each extracted feature is assigned a score indicating the level of confidence that the feature was extracted correctly and with the correct attributes. The relative weight or relevance of each feature is determined by taking the feature attributes into account. In addition, a weighted scoring model is used to determine the relevance of the associations between features.

In this embodiment, the entity database 1104 provides its search suggestions as an indexed and ranked entity suggestion list 1108. The trend database 1106 provides its search suggestions as an indexed and ranked trend-based suggestion list 1110. The search system 1100 then builds a search suggestion list 1112 from what the entity database 1104 and the trend database 1106 provide. The search suggestion list 1112 is indexed and ranked by the individual score of each entity suggestion in each database, so that the most relevant result is shown first, followed by results of lower relevance.

An exemplary use of the search system 1100 to obtain search suggestions is as follows. The search suggestion list 1112 shows suggestions based on the user query “Indiana Na”. “Indiana Name” appears first, based on its individual score of 0.9; “Indiana Nasca” is shown next, with an individual score of 0.8; and “Indiana Nashville” is shown last, with an individual score of 0.7. The individual scores are compared across the entity suggestion list 1108 and the trend-based suggestion list 1110 without combining entities that may be repeated in both lists.
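Ranking by individual score can be sketched as below. The three suggestions and their scores (0.9, 0.8, 0.7) come from the text; how they are split between the two lists is an assumption made for the example, as is the choice to keep the higher score when a suggestion appears in both lists.

```python
# Assumed contents of the entity suggestion list and the trend-based list.
entity_suggestions = {"Indiana Name": 0.9, "Indiana Nasca": 0.8}
trend_suggestions = {"Indiana Nasca": 0.6, "Indiana Nashville": 0.7}


def rank_by_individual_score(*lists: dict[str, float]) -> list[tuple[str, float]]:
    """Merge suggestion lists, keeping each suggestion's best single score
    (scores for repeated suggestions are not added together)."""
    best: dict[str, float] = {}
    for scores in lists:
        for name, score in scores.items():
            best[name] = max(score, best.get(name, 0.0))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_by_individual_score(entity_suggestions, trend_suggestions):
        print(f"{score:.1f}  {name}")
```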

FIG. 12 is a block diagram of a search system 1200 according to the present invention. The search system 1200 includes a search engine 1202, which provides one or more user interfaces that accept data input from a user, such as user queries.

The search system 1200 includes one or more databases, among them an entity database 1204 and a trend database 1206. The databases are stored on a local server or a web-based server. The search system 1200 is thus implemented in a client/server architecture, but it may also be implemented using other computer architectures (for example, a stand-alone computer, a mainframe system with terminals, an ASP model, or a peer-to-peer model) and multiple networks (for example, a local area network, a wide area network, the Internet, a wireless network, or a mobile telephone network).

In one embodiment, operation of the search system 1200 starts when a user enters one or more entities (in a search query) through the user interface of the search engine 1202. A search query is, for example, a combination of keywords in a string, structured data, or the like. These keywords are entities representing people, organizations, geographic locations, dates and/or times. In this embodiment, “Indiana Na” is used as the search query.

“Indiana Na” is then processed for entity extraction. The entity extraction model treats search queries such as “Indiana Na” as entities and compares all of them against the entity co-occurrence knowledge base in the entity database 1204 and the trend database 1206, in order to extract and disambiguate as many entities as possible. Portions of the query text that are not detected as entities (for example, persons, organizations or locations) are treated as conceptual features (for example, topics, facts or key phrases) that can be used to search the entity co-occurrence knowledge base (for example, the entity and trend databases). During extraction, one or more feature recognition and extraction algorithms are used. Each extracted feature is assigned a score indicating the level of confidence that the feature was extracted correctly and with the correct attributes. The relative weight or relevance of each feature is determined by taking the feature attributes into account. In addition, a weighted scoring model is used to determine the relevance of the associations between features.

In this embodiment, the entity database 1204 provides its search suggestions as a pre-indexed and pre-ranked entity suggestion list 1208. Similarly, the trend database 1206 provides its search suggestions as a pre-indexed and pre-ranked trend-based suggestion list 1210. The search system 1200 then builds a search suggestion list 1212 from what the entity database 1204 and the trend database 1206 provide. The search suggestion list 1212 is indexed and ranked by the overall score of each entity suggestion across both databases, so that the most relevant result is shown first, followed by results of lower relevance.

An exemplary use of the search system 1200 to obtain search suggestions is as follows. The search suggestion list 1212 shows suggestions based on the user query “Indiana Na”. “Indiana Nasca” appears first, based on an overall score of 1.4 obtained by adding its score of 0.8 in the entity suggestion list 1208 to its score of 0.6 in the trend-based suggestion list 1210. “Indiana Name” is shown next, with an overall score of 0.9, and “Indiana Nashville” is shown last, with an overall score of 0.7.
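Ranking by overall score can be sketched as below. The scores for “Indiana Nasca” (0.8 and 0.6) and the overall scores of the other two suggestions come from the text; which list the other two suggestions come from is an assumption made for the example.

```python
from collections import defaultdict

# "Indiana Nasca" scores are from the text; the placement of the other
# two suggestions in one list or the other is assumed.
entity_suggestions = {"Indiana Name": 0.9, "Indiana Nasca": 0.8}
trend_suggestions = {"Indiana Nasca": 0.6, "Indiana Nashville": 0.7}


def rank_by_overall_score(*lists: dict[str, float]) -> list[tuple[str, float]]:
    """Add up each suggestion's scores across all lists, then sort descending."""
    totals: defaultdict[str, float] = defaultdict(float)
    for scores in lists:
        for name, score in scores.items():
            totals[name] += score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_by_overall_score(entity_suggestions, trend_suggestions):
        print(f"{score:.1f}  {name}")
```

With these inputs the order is “Indiana Nasca”, “Indiana Name”, “Indiana Nashville”: a suggestion that appears in both lists can overtake one with a higher single score.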

FIG. 13 shows a system architecture 1300 for geotagging content in Sharepoint 2013®. The search index 1324 is one of several key components that enable search in Sharepoint 1302. Another key part of enabling search in Sharepoint 2013® 1302 is content capture, which gathers content for indexing.

The crawler 1304 crawls through different content sources 1306 and adds a list of metadata properties to each item of content. Content sources include, without limitation, Sharepoint content, network file shares, and user or intranet content. The crawler 1304 is configured to connect securely to the content sources 1306 and to associate documents from those sources with their metadata as crawled properties. The crawler 1304 is configured to perform either a full crawl or an incremental crawl of the content. Crawled properties include, without limitation, author, title and creation date, among others.

Sharepoint 2013® includes a content processing component 1308. The content processing component 1308 takes content from the crawler 1304 and prepares it for indexing. Content processing 1308 includes, among others, the stages of linguistic processing (language detection), parsing, entity extraction management, content-based file format detection, content processing error reporting, natural language processing, and mapping of crawled properties to managed properties.

Content processing 1308 is extended by the content enrichment web service (CEWS 1310). CEWS 1310 enriches content processing 1308 by allowing a web service callout 1312 to call an external web service that performs additional actions and enriches the crawled data properties. The web service callout 1312 is a standard Simple Object Access Protocol (SOAP) request, or another web service call method, used to exchange structured information about the crawled data with the entity enrichment service 1314. The web service callout 1312 includes a trigger condition, set in the content enrichment configuration object, that controls when the external web service is called for enrichment processing. The entity enrichment service 1314 also determines the document type of the crawled data in order to identify content that arrives in the form of images (scanned documents, pictures, etc.). When content in image form is found, the entity enrichment service 1314 sends the location of the crawled document to an OCR processing engine 1316, such as, without limitation, an optical character recognition component or another image processing component. The OCR processing engine 1316 then retrieves and processes the image file and converts it asynchronously to a text file. The OCR-processed file 1318 is then fed back to the crawler 1304, crawled as a text file, and returned to content processing 1308 to go through the rest of the workflow.

The system architecture 1300 includes an external geotagger web service 1320 and a named entity tagger service 1322. The geotagger web service 1320 and the named entity tagger service 1322 are both software modules configured to act as web service application providers and to respond to the web service callout 1312. The geotagger web service 1320 uses natural language processing entity extraction techniques, machine learning models and other techniques to identify and disambiguate geographic entities in the crawled content. For example, the geotagger web service 1320 disambiguates geographic entities by analyzing the statistical co-occurrence of entities found in a gazetteer. The geotagger web service 1320 includes a database of statistically co-occurring entities that is linked against the content found by the crawler 1304. Following the same technique, the named entity tagger service 1322 is used to extract additional entities or text features such as organizations, people or topics.

The geotagger web service 1320 analyzes the managed properties sent by CEWS 1310 as input properties and identifies the geographic entities referenced in the text. Non-limiting examples of input properties include FileType, IsDocument, OriginalPath and body, among others. The geotagger web service 1320 then geotags the text by creating or modifying managed properties that reference each geographic entity found. The geotagger web service 1320 sends the modified or new managed properties to the entity enrichment service 1314, where a transformation maps the modified managed properties and returns them to CEWS 1310 as output properties. The same process is used to interact with the named entity tagger service 1322 for extraction and entity tagging of other entities or features such as organizations, people or topics.

After the enriched managed properties are returned by the entity enrichment service 1314, they are merged with the managed properties of the crawled file and sent to the search index 1324.

Once geographic and other entity tags are associated with the content and indexed, search queries can be performed using geographic or named entity features. The search UI 1326 in Sharepoint 2013® includes specific displays that help the user perform geography-based searches and that support enhanced display of faceted search results. The search UI 1326 may be a custom web part, or it may be built by modifying the standard layout of Sharepoint 2013® search with standard tools such as HTML, HTML5, JavaScript®, and CSS.
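To illustrate how indexed entity tags can drive an entity-based query and a faceted result display, consider the following sketch. The `INDEX` records, field names, and both helper functions are hypothetical and stand in for the search index and UI, whose actual interfaces the text does not specify.

```python
from collections import Counter

# Hypothetical indexed items, each carrying entity tags as properties.
INDEX = [
    {"title": "Q3 report", "places": ["Austin"], "organizations": ["Acme"]},
    {"title": "Site survey", "places": ["Austin", "Reston"], "organizations": []},
    {"title": "Press release", "places": ["Reston"], "organizations": ["Acme"]},
]


def search(index, field, value):
    """Return the items whose entity field contains the given value."""
    return [item for item in index if value in item.get(field, [])]


def facet_counts(items, field):
    """Count entity values across a result set, for display as facets."""
    return Counter(v for item in items for v in item.get(field, []))
```

A query such as `search(INDEX, "places", "Austin")` returns two items, and `facet_counts` over those hits gives the per-entity counts a UI could show as refinement links.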

FIG. 14 is a flowchart 1400 illustrating the process steps for tagging content for Sharepoint 2013® search. The process begins when the crawler component of Sharepoint 2013® crawls the content (step 1402). In one embodiment the crawl is a full crawl, and in another embodiment it is an incremental crawl. The crawler component then supplies the crawled properties and metadata to content processing (step 1404). A determination is made as to whether the crawled content contains geographic or named entities, for example, and without limitation, by using a trigger condition. The trigger condition comprises a set of programming logic or rules that determine whether the content would benefit from geotagging or entity tagging. If the trigger condition evaluates to false, the crawled component is associated with managed properties (step 1406) and passed to the search index component (step 1408). If the trigger condition evaluates to true, the CEWS sends a web service callout to the entity enrichment service (step 1410). The entity enrichment service analyzes the content it receives to determine whether the content is in an image format (scanned document, picture, etc.). Content found in an image format is processed asynchronously by the OCR engine and returned to be crawled as a text file by the crawling component (step 1412). If the content is not in an image format, it is processed by the geotagger web service or the named entity tagger service (step 1414). The web service extracts and disambiguates the geographic or named entities referenced in the content and enriches them with entity metadata. The identified entities and their metadata are returned as managed properties to the content processing component and associated with the content (step 1416). The associated metadata is then sent to the search index component (step 1406).
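The branching in flowchart 1400 can be sketched as a single routing function. This is only an illustration under stated assumptions: the `trigger` rule, the `IMAGE_TYPES` set, the `Entities` property name, and the use of plain lists for the OCR queue and search index are all hypothetical; the step numbers in the comments follow the text above.

```python
IMAGE_TYPES = {"png", "jpg", "tiff"}  # hypothetical set of image formats


def trigger(props):
    """Hypothetical trigger rule: only documents are sent for enrichment."""
    return bool(props.get("IsDocument"))


def process(props, tagger, ocr_queue, search_index):
    """Route one crawled item through the branches of flowchart 1400."""
    if not trigger(props):
        search_index.append(props)      # steps 1406-1408: index unchanged
        return "indexed"
    if props.get("FileType") in IMAGE_TYPES:
        ocr_queue.append(props)         # step 1412: OCR now, re-crawl later
        return "queued_for_ocr"
    # Steps 1414-1416: tag entities and attach them as a managed property.
    enriched = {**props, "Entities": tagger(props.get("body", ""))}
    search_index.append(enriched)
    return "enriched"
```

Passing the tagger in as a callable keeps the routing logic independent of whether the geotagger or the named entity tagger is used.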

While various aspects and embodiments have been disclosed, other aspects and embodiments are contemplated. The various aspects and embodiments disclosed herein are for purposes of illustration and are not intended to be limiting, with the true scope and spirit being indicated by the following claims.

The foregoing method descriptions and process flow diagrams are provided merely as illustrative examples and are not intended to require or imply that the steps of the various embodiments must be performed in the order presented. As will be appreciated by one of skill in the art, the steps in the foregoing embodiments may be performed in any order. Words such as "then" and "next" are not intended to limit the order of the steps; they are simply used to guide the reader through the description of the methods. Although the process flow diagrams show the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be rearranged. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination corresponds to a return of the function to the calling function or the main function.

The various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as departing from the scope of the present invention.

Embodiments implemented in computer software may be implemented in software, firmware, middleware, microcode, hardware description languages, or any combination thereof. A code segment or machine-executable instructions may represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a class, or any combination of instructions, data structures, or program statements. A code segment may be coupled to another code segment or a hardware circuit by passing and/or receiving information, data, arguments, parameters, or memory contents. Information, arguments, parameters, data, etc. may be passed, forwarded, or transmitted via any suitable means, including memory sharing, message passing, token passing, network transmission, etc.

The actual software code or specialized control hardware used to implement these systems and methods is not limiting of the invention. Thus, the operation and behavior of the systems and methods were described without reference to specific software code, it being understood that software and control hardware can be designed to implement the systems and methods based on the description herein.

When implemented in software, the functions may be stored as one or more instructions or code on a non-transitory computer-readable or processor-readable storage medium. The steps of a method or algorithm disclosed herein may be embodied in a processor-executable software module that resides on a computer-readable or processor-readable storage medium. Non-transitory computer-readable or processor-readable media include both computer storage media and tangible storage media that facilitate transfer of a computer program from one place to another. A non-transitory processor-readable storage medium may be any available medium that can be accessed by a computer. By way of example, and not limitation, such non-transitory processor-readable media may comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other tangible storage medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer or processor. Disk and disc, as used herein, include compact disc (CD), laser disc (registered trademark), optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above are also included within the scope of computer-readable media. Additionally, the operations of a method or algorithm may reside as one or any combination or set of codes and/or instructions on a non-transitory processor-readable medium and/or computer-readable medium, which may be incorporated into a computer program product.

It is to be appreciated that the various components of the technology can be located at distant portions of a distributed network and/or the Internet, or within a dedicated secure, unsecured, and/or encrypted system. Thus, the components of the system can be combined into one or more devices or co-located on a particular node of a distributed network, such as a telecommunications network. As will be appreciated from the foregoing description, and for reasons of computational efficiency, the components of the system can be arranged at any location within a distributed network without affecting the operation of the system. Moreover, the components can be embedded in a dedicated machine.

Furthermore, it should be appreciated that the various links connecting the elements can be wired or wireless links, or any combination thereof, or any other known or later developed elements capable of supplying and/or communicating data to and from the connected elements. The term module, as used herein, refers to any known or later developed hardware, software, firmware, or combination thereof that is capable of performing the functionality associated with that element. The terms determine, calculate, and compute, and variations thereof, are used interchangeably herein and include any type of methodology, process, mathematical operation, or technique.

The preceding description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the following claims and the principles and novel features disclosed herein.

The embodiments described above are intended to be exemplary. One skilled in the art will recognize numerous alternative components and embodiments that may be substituted for the particular examples described herein and still fall within the scope of the invention.

Description of Symbols

100: Search system
102: Graphical user interface
104: Search engine
106: Server device
108: Network connection
110: Entity extraction module
112: Entity co-occurrence knowledge base
114: Entity indexed corpus
400: Search computer system
402: User interface
404: Search engine
406: Server device
408: Network connection
410: Entity extraction module
412: Fuzzy score matching module
414: Entity co-occurrence knowledge base database
700: Search system
702: User interface
704: Search engine
706: Server device
708: Network connection
710: Entity extraction module
712: Fuzzy score matching module
714: Entity co-occurrence knowledge base database
1000: Search system
1002: Search engine
1004: Entity database
1006: Trend database
1008: Search suggestion
1300: System architecture
1302: Sharepoint
1304: Crawler
1306: Content source
1308: Content processing
1312: Web service callout
1314: Entity enrichment service
1316: OCR processing engine
1318: OCR file
1320: Geotagger web service
1322: Named entity tagger service
1324: Search indexer
1326: Search UI

Claims

A computer-implemented method comprising:
receiving, by an entity extraction computer, a search query including one or more entities from a client computer;
comparing, by the entity extraction computer, each entity with one or more co-occurrences of that entity in a co-occurrence database;
extracting, by the entity extraction computer, a subset of the one or more entities from the search query in response to determining that each entity in the subset exceeds a confidence score of the co-occurrence database, the confidence score being based on the certainty of co-occurrence of that entity with one or more related entities in an electronic data corpus according to the co-occurrence database;
assigning, by the entity extraction computer, an index identifier (index ID) to each entity in the plurality of extracted entities;
saving, by the entity extraction computer, the index ID for each of the plurality of extracted entities in the electronic data corpus, the electronic data corpus being indexed by an index ID corresponding to each of the one or more related entities;
searching, by a search server computer, the entity-indexed electronic data corpus to locate the plurality of extracted entities and to identify an index ID of a data record in which at least two of the plurality of extracted entities co-occur; and
building, by the search server computer, a search result list having data records corresponding to the identified index ID.

The method of claim 1, further comprising: sorting, by the search server computer, the search result list by relevance based on the confidence score; and forwarding, by the search server computer, the sorted search result list to a user device.

The method of claim 1, wherein the plurality of extracted entities are ranked based on a confidence score.

The method of claim 1, wherein the entity extraction computer associates the extracted entities with one or more co-occurrence entities in an entity indexed electronic data corpus.

The method of claim 4, wherein the associated entities are ranked by a confidence score.

The method of claim 1, wherein each of the plurality of entities is selected from the group consisting of an individual, an organization, a geographical location, a date, and a time.

A system comprising one or more server computers having one or more processors executing computer-readable instructions for a plurality of computer modules, including:
an entity extraction module configured to receive user input of search query parameters, the entity extraction module being further configured to:
extract a plurality of entities from the search query parameters by comparing each entity in the plurality of extracted entities with an entity co-occurrence database that includes a confidence score representing the certainty of co-occurrence of the extracted entity with one or more related entities in an electronic data corpus;
assign an index identifier (index ID) to each entity in the plurality of extracted entities; and
save the index ID for each of the plurality of extracted entities in the electronic data corpus, the electronic data corpus being indexed by an index ID corresponding to each of the one or more related entities; and
a search server module configured to search the entity-indexed electronic data corpus to locate the plurality of extracted entities and to identify an index ID of a data record in which at least two of the plurality of extracted entities co-occur, the search server module being further configured to build a search result list having data records corresponding to the identified index ID.

The system of claim 7, wherein the search server module is further configured to sort the search result list by relevance based on the confidence score and to forward the sorted search result list to a user device.

The system of claim 7, wherein the plurality of extracted entities are ranked based on a confidence score.

The system of claim 7, wherein the entity extraction module is configured to associate the extracted entities with one or more co-occurrence entities in an entity indexed electronic data corpus.

The system of claim 10, wherein the associated entities are ranked by a confidence score.

The system of claim 7, wherein each of the plurality of entities is selected from the group consisting of an individual, an organization, a geographical location, a date, and a time.

A non-transitory computer-readable medium having stored thereon computer-executable instructions comprising:
receiving, by an entity extraction computer, user input of search query parameters;
extracting, by the entity extraction computer, a plurality of entities from the search query parameters by comparing each entity in the plurality of extracted entities with an entity co-occurrence database that includes a confidence score representing the certainty of co-occurrence of the extracted entity with one or more related entities in an electronic data corpus;
assigning, by the entity extraction computer, an index identifier (index ID) to each entity in the plurality of extracted entities;
saving, by the entity extraction computer, the index ID for each of the plurality of extracted entities in the electronic data corpus, the electronic data corpus being indexed by an index ID corresponding to each of the one or more related entities;
searching, by a search server computer, the entity-indexed electronic data corpus to locate the plurality of extracted entities and to identify an index ID of a data record in which at least two of the plurality of extracted entities co-occur; and
building, by the search server computer, a search result list having data records corresponding to the identified index ID.

The computer-readable medium of claim 13, wherein the instructions further comprise: sorting, by the search server computer, the search result list by relevance based on the confidence score; and forwarding, by the search server computer, the sorted search result list to a user device.

The computer-readable medium of claim 13, wherein the plurality of extracted entities are ranked based on a confidence score.

The computer-readable medium of claim 13, wherein the instructions further comprise associating, by the entity extraction computer, the extracted entities with one or more co-occurring entities in the entity-indexed electronic data corpus.

The computer-readable medium of claim 16, wherein the associated entities are ranked by a confidence score.

The computer-readable medium of claim 13, wherein each of the plurality of entities is selected from the group consisting of an individual, an organization, a geographical location, a date, and a time.

A method comprising:
receiving, by an entity extraction computer, user input of search query parameters from a user interface;
extracting, by the entity extraction computer, one or more entities from the search query parameters by comparing the search query parameters with an entity co-occurrence database having instances of co-occurrence of the one or more entities in an electronic data corpus and identifying at least one entity type corresponding to the one or more entities in the search query parameters;
selecting, by a fuzzy score matching computer, a fuzzy matching algorithm for searching the entity co-occurrence database to identify one or more records associated with the search query parameters, the fuzzy matching algorithm corresponding to the at least one identified entity type;
searching, by the fuzzy score matching computer, the entity co-occurrence database using the selected fuzzy matching algorithm and forming one or more suggested search query parameters from the one or more records based on the search; and
presenting, by the fuzzy score matching computer, the one or more suggested search query parameters via the user interface.

The method of claim 19, further comprising searching, by the fuzzy score matching computer, the entity co-occurrence database using the selected fuzzy matching algorithm before the user input is complete.

The method of claim 19, wherein the one or more records associated with the search query parameters include conceptual features.

The method of claim 19, wherein the one or more suggested search query parameters include a plurality of suggested search query parameters, the method further comprising sorting, by the fuzzy score matching computer, the plurality of suggested search query parameters in descending order based on the closeness of their match to the search query parameters in the user input.

The method of claim 22, wherein the fuzzy score matching computer presents the sorted plurality of suggested search query parameters in a drop-down list via the user interface.

The method of claim 19, wherein the entity co-occurrence database is indexed.

The method of claim 19, wherein the entity co-occurrence database includes an entity-to-entity index.

The method of claim 19, wherein the entity co-occurrence database includes an entity-to-topic index.

The method of claim 19, wherein the entity co-occurrence database includes an entity-to-fact index.

A system comprising one or more server computers having one or more processors executing computer-readable instructions for a plurality of computer modules, including:
an entity extraction module configured to receive user input of search query parameters from a user interface, the entity extraction module being further configured to extract one or more entities from the search query parameters by comparing the search query parameters with an entity co-occurrence database having instances of co-occurrence of the one or more entities in an electronic data corpus and identifying at least one entity type corresponding to the one or more entities in the search query parameters; and
a fuzzy score matching module configured to select a fuzzy matching algorithm for searching the entity co-occurrence database to identify one or more records associated with the search query parameters, the fuzzy matching algorithm corresponding to the at least one identified entity type, the fuzzy score matching module being further configured to:
search the entity co-occurrence database using the selected fuzzy matching algorithm and form one or more suggested search query parameters from the one or more records based on the search; and
present the one or more suggested search query parameters via the user interface.

The system of claim 28, wherein the fuzzy score matching module is further configured to search the entity co-occurrence database using the selected fuzzy matching algorithm before the user input is complete.

The system of claim 28, wherein the one or more records associated with the search query parameters include conceptual features.

The system of claim 28, wherein the one or more suggested search query parameters include a plurality of suggested search query parameters, and the fuzzy score matching computer is further configured to sort the plurality of suggested search query parameters in descending order based on the closeness of their match to the search query parameters in the user input.

The system of claim 32, wherein the fuzzy score matching computer is configured to present the sorted plurality of suggested search query parameters in a drop-down list via the user interface.

The system of claim 28, wherein the entity co-occurrence database is indexed.

The system of claim 28, wherein the entity co-occurrence database includes an entity-to-entity index.

The system of claim 28, wherein the entity co-occurrence database includes an entity-to-topic index.

The system of claim 28, wherein the entity co-occurrence database includes an entity-to-fact index.

A method comprising:
receiving, by an entity extraction computer, user input of partial search query parameters from a user interface, the partial search query parameters having at least one incomplete search query parameter;
extracting, by the entity extraction computer, one or more first entities from the partial search query parameters by comparing the partial search query parameters with an entity co-occurrence database having instances of co-occurrence of the one or more first entities in an electronic data corpus and identifying at least one entity type corresponding to the one or more first entities in the partial search query parameters;
selecting, by a fuzzy score matching computer, a fuzzy matching algorithm for searching the entity co-occurrence database to identify one or more records associated with the partial search query parameters, the fuzzy matching algorithm corresponding to the at least one identified entity type;
searching, by the fuzzy score matching computer, the entity co-occurrence database using the selected fuzzy matching algorithm and forming one or more first suggested search query parameters from the one or more records based on the search;
presenting, by the fuzzy score matching computer, the one or more first suggested search query parameters via the user interface;
receiving, by the entity extraction computer, a user selection of the one or more first suggested search query parameters so as to form completed search query parameters;
extracting, by the entity extraction computer, one or more second entities from the completed search query parameters;
searching, by the entity extraction computer, the entity co-occurrence database to identify one or more entities related to the one or more second entities so as to form one or more second suggested search query parameters; and
presenting, by the entity extraction computer, the one or more second suggested search query parameters via the user interface.

The method of claim 37, further comprising searching, by the fuzzy score matching computer, the entity co-occurrence database using the selected fuzzy matching algorithm before the user input is complete.

The method of claim 37, wherein the one or more records associated with the partial search query parameters include conceptual features.

The method of claim 37, wherein the one or more first suggested search query parameters include a plurality of first suggested search query parameters, the method further comprising sorting, by the fuzzy score matching computer, the plurality of first suggested search query parameters in descending order based on the closeness of their match to the partial search query parameters in the user input.

The method of claim 40, wherein the fuzzy score matching computer presents the sorted plurality of first suggested search query parameters in a drop-down list via the user interface.

The method of claim 39, wherein the entity co-occurrence database is indexed.

The method of claim 37, wherein the entity co-occurrence database includes an entity-to-entity index.

The method of claim 37, wherein the entity co-occurrence database includes an entity-to-topic index.

The method of claim 37, wherein the entity co-occurrence database includes an entity-to-fact index.

46. A system comprising one or more server computers having one or more processors executing computer-readable instructions for a plurality of computer modules, the modules comprising:
an entity extraction module configured to receive, from a user interface, user input of partial search query parameters, the partial search query parameters having at least one incomplete search query parameter, the entity extraction module being further configured to:
compare the partial search query parameters to an entity co-occurrence database having instances of co-occurrence of one or more first entities in an electronic data corpus, and identify at least one entity type corresponding to the one or more first entities in the partial search query parameters, thereby extracting the one or more first entities from the partial search query parameters; and
a fuzzy score matching module configured to select a fuzzy matching algorithm for searching the entity co-occurrence database to identify one or more records associated with the partial search query parameters, the fuzzy matching algorithm corresponding to the at least one identified entity type, the fuzzy score matching module being further configured to:
search the entity co-occurrence database using the selected fuzzy matching algorithm and form one or more first suggested search query parameters from the one or more records based on the search; and
present the one or more first suggested search query parameters via the user interface;
wherein the entity extraction module is further configured to:
receive a user selection of the one or more first suggested search query parameters so as to form completed search query parameters;
extract one or more second entities from the completed search query parameters;
search the entity co-occurrence database to identify one or more entities associated with the one or more second entities and form one or more second suggested search query parameters; and
present the one or more second suggested search query parameters via the user interface.
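The flow recited in the system claim above (identify an entity type from partial input, select a fuzzy matching algorithm for that type, rank first suggestions, then derive second suggestions from co-occurrence) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the toy `KNOWLEDGE_BASE`, its counts, the `MATCHERS` table, and the use of `difflib.SequenceMatcher` as the fuzzy score. The claims do not specify any of these.

```python
from difflib import SequenceMatcher

# Hypothetical toy knowledge base: entity -> (entity type, co-occurrence counts).
# Names and counts are illustrative only, not taken from the patent.
KNOWLEDGE_BASE = {
    "Barack Obama": ("person", {"White House": 12, "Michelle Obama": 9}),
    "Barbara Walters": ("person", {"ABC News": 6, "The View": 4}),
    "Barcelona": ("location", {"Spain": 15, "Catalonia": 7}),
    "Bank of America": ("organization", {"Charlotte": 5, "Merrill Lynch": 4}),
}


def prefix_similarity(partial, candidate):
    """Fuzzy score of the partial input against the candidate's leading characters."""
    head = candidate[: len(partial)]
    return SequenceMatcher(None, partial.lower(), head.lower()).ratio()


def whole_similarity(partial, candidate):
    """Fuzzy score of the partial input against the whole candidate string."""
    return SequenceMatcher(None, partial.lower(), candidate.lower()).ratio()


# Assumed mapping from identified entity type to a fuzzy matching algorithm.
MATCHERS = {
    "person": prefix_similarity,
    "location": prefix_similarity,
    "organization": whole_similarity,
}


def identify_entity_type(partial):
    """Take the type of the closest knowledge-base entry as the entity type."""
    best = max(KNOWLEDGE_BASE, key=lambda name: prefix_similarity(partial, name))
    return KNOWLEDGE_BASE[best][0]


def first_suggestions(partial, limit=3):
    """Rank records of the identified type in descending order of match closeness."""
    entity_type = identify_entity_type(partial)
    matcher = MATCHERS[entity_type]
    scored = [
        (matcher(partial, name), name)
        for name, (etype, _) in KNOWLEDGE_BASE.items()
        if etype == entity_type
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:limit]]


def second_suggestions(completed, limit=3):
    """Suggest entities that co-occur most often with the selected entity."""
    _, related = KNOWLEDGE_BASE[completed]
    return sorted(related, key=related.get, reverse=True)[:limit]


if __name__ == "__main__":
    options = first_suggestions("Barak Ob")  # misspelled, unfinished input
    print(options)
    print(second_suggestions(options[0]))    # user selects the top option
```

In this sketch the first list corresponds to the drop-down of completed query suggestions, and the second list corresponds to the related-entity suggestions formed after the user makes a selection.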

47. The system of claim 46, wherein the fuzzy score matching module is further configured to search the entity co-occurrence database using the selected fuzzy matching algorithm before the user input is complete.

48. The system of claim 46, wherein the one or more records associated with the partial search query parameter include conceptual features.

49. The system of claim 46, wherein the one or more first suggested search query parameters include a plurality of first suggested search query parameters, and the fuzzy score matching computer is further configured to classify the plurality of first suggested search query parameters in descending order based on the closeness of their match to the partial search query parameters in the user input.

50. The system of claim 49, wherein the fuzzy score matching computer is configured to present the classified first suggested search query parameters in a drop-down list via a user interface.

51. The system of claim 46, wherein the entity co-occurrence database is indexed.

52. The system of claim 46, wherein the entity co-occurrence database includes an entity-to-entity index.

53. The system of claim 46, wherein the entity co-occurrence database includes an entity-to-topic index.

54. The system of claim 46, wherein the entity co-occurrence database includes an entity-to-fact index.

A computer-implemented method comprising:
receiving, by a computer, a search query comprising one or more data strings from a search engine;
identifying, by the computer, one or more entities in the one or more data strings based on comparing the one or more entities against an entity database and a trend database, each entity corresponding to a subset of the one or more data strings;
identifying, by the computer, one or more features in the one or more data strings that are not identified as corresponding to at least one entity;
assigning, by the computer, each of the one or more features to at least one of the one or more entities based on a matching algorithm;
assigning, by the computer, an extraction score to each entity based on a score assigned to each feature assigned to that entity;
receiving, by the computer, from the entity database a first search list that includes one or more entities having a score within a threshold distance of the extraction score of each entity;
receiving, by the computer, from the trend database a second search list that includes one or more entities having a score within a threshold distance of the extraction score of each entity;
generating, by the computer, an aggregated list comprising the first search list and the second search list, wherein the entities of the aggregated list are ranked according to their respective scores; and
providing, by the computer, suggested searches according to the aggregated list.
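The list-aggregation steps of the method above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: both databases are modelled as plain dictionaries mapping a hypothetical entity name to a score, "threshold distance" is taken to be the absolute difference between scores, and an entity found in both lists keeps its higher score. None of these choices is specified in the claim.

```python
def within_threshold(database, extraction_score, threshold):
    """Return (entity, score) pairs whose score lies within the threshold
    distance of the extraction score (absolute difference is an assumption)."""
    return [
        (name, score)
        for name, score in database.items()
        if abs(score - extraction_score) <= threshold
    ]


def aggregated_suggestions(entity_db, trend_db, extraction_score, threshold):
    """Merge the entity-database list and the trend-database list, then rank
    the merged entities in descending order of score."""
    first_list = within_threshold(entity_db, extraction_score, threshold)
    second_list = within_threshold(trend_db, extraction_score, threshold)

    best = {}
    for name, score in first_list + second_list:
        # Assumption: an entity present in both lists keeps its higher score.
        best[name] = max(score, best.get(name, score))

    return sorted(best, key=best.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    entity_db = {"Obama": 0.9, "Oman": 0.4, "Ohio": 0.75}
    trend_db = {"Obamacare": 0.85, "Ohio": 0.8, "Oscars": 0.3}
    print(aggregated_suggestions(entity_db, trend_db, 0.8, 0.15))
```

With these sample values, "Oman" and "Oscars" fall outside the threshold and are excluded, and the remaining entities are ordered by score before being offered as suggested searches.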

A computer-implemented method comprising:
receiving, by a computer, a plurality of data streams respectively associated with a plurality of data sources;
generating, by the computer, an array of properties associated with each data stream;
in response to the computer detecting a trigger condition associated with data in a data stream, generating, by the computer, geographic data related to the data of that data stream;
in response to the computer not detecting the trigger condition for a data source, mapping, by the computer, the array of properties for the data source to a set of managed properties associated with a search index; and
in response to determining that the content type of the data source is image data, executing, by the computer, an optical character recognition routine on metadata associated with the data received from the data source, and receiving, by the computer, an updated data stream of the data source from a web service identified by the metadata, the data source being associated with the web service identified in the metadata.