JP2006018829A - Automated classification generation

Info

Publication number: JP2006018829A
Application number: JP2005184985A
Authority: JP
Inventors: Christopher B Weare; ビー．ウェアークリストファー
Original assignee: Microsoft Corp
Current assignee: Microsoft Corp
Priority date: 2004-06-30
Filing date: 2005-06-24
Publication date: 2006-01-19
Anticipated expiration: 2025-06-24
Also published as: MXPA05007136A; CN1716256A; US7266548B2; CA2510761A1; EP1612701A2; EP1612701A3; US20060004747A1; BRPI0502591A; KR20060048583A; JP4141460B2

Abstract

PROBLEM TO BE SOLVED: To structure the categories of information in a hierarchical classification of documents as a binary tree whose nodes contain information relevant to a search.

SOLUTION: The binary tree is trained, or formed, by examining a training set of documents and separating those documents into two child nodes. Each of those sets of documents is then further split into two nodes to create a binary tree data structure. The nodes are generated to maximize the likelihood that all of the training documents are in either or both of the two child nodes. In one example, each node of the binary tree may be associated with a list of terms, and each term in each list is associated with a probability of that term appearing in a document given that node. New documents may be categorized by the nodes of the tree. For example, a new document may be assigned to a particular node based upon the statistical similarity between that document and the associated node.

COPYRIGHT: (C) 2006, JPO & NCIPI

Description

This application is directed to classification (taxonomy) generation, and more specifically to the automated generation of classifications for documents.

To find a specific document of interest, a computer user can run an electronic search through a query engine over a collection of documents. However, some collections, such as web pages on the Internet or document databases, may return a large number of documents in response to the query terms the user supplies. To cope with the spread of retrieved documents, the results, or links to the documents, can be further sorted or filtered by date, popularity, or similarity to the search terms, and/or categorized according to a manually derived hierarchical classification (hierarchical taxonomy). Additionally or alternatively, the user can select a particular category and restrict the search to documents within that category.

In general, a hierarchical classification (or text classification) is generated by manually defining a set of rules that encode expert knowledge of how to classify documents into a predetermined set of categories. Machine-augmented taxonomy generation has generally relied on manually maintaining a controlled dictionary and sorting documents based on assigned keywords or metadata that are associated with each document and found in the controlled dictionary.

Non-Patent Document 1: Hofmann, "Probabilistic Latent Semantic Indexing," Proceedings of the 22nd Int'l SIGIR Conference on Research and Development in Information Retrieval, pp. 50-57, August 15-19, 1999, Berkeley, CA
Non-Patent Document 2: Zhai et al., "A study of smoothing methods for language models applied to information retrieval," ACM Transactions on Information Systems, Vol. 22, No. 2, April 2004, pp. 179-214
Non-Patent Document 3: Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Trans. Information Theory, IT-13, pp. 260-269, 1967

Because creating and maintaining the categories and the controlled dictionaries requires manpower, manual or machine-augmented classifications are expensive to create and maintain. In addition, the nature of the content being sorted, or the content itself, can change very frequently, so adapting a classification by hand is impractical even when it is augmented with a controlled dictionary.

To provide the reader with a basic understanding, a brief summary of the present disclosure follows. This summary is not an exhaustive or limiting overview of the disclosure. It is not provided to identify key and/or critical elements of the invention, to delineate the scope of the invention, or to limit the scope of the invention in any way. Its sole purpose is to present some of the disclosed concepts in a simplified form as a prelude to the more detailed description presented later.

To generate the structure of a hierarchical or text classification automatically, documents can be classified without any external knowledge, that is, based only on knowledge extracted from the documents themselves. In the hierarchical classification described below, the related categories of information can be structured as a binary tree whose nodes contain information relevant to a search. The binary tree can be "trained", or formed, by examining a set of training documents and splitting those documents between two child nodes. Each of those sets of documents can then be split further into two nodes to create a binary tree data structure. The nodes can be generated to maximize the likelihood that all of the training documents are in either or both of the two nodes. In one example, each node of the binary tree can be associated with a list of terms, and each term in each list is associated with the probability that the term appears in a document given that node. As new documents arrive, they can be assigned to a specific node based on the statistical similarity between the document and that node.

Documents associated with a particular node can be retrieved based on the node assignment, for example by finding the nodes that match specified query terms and retrieving the documents of those nodes. In some cases, a search engine may use a typical inverted index to return selected documents in response to a user's query. To address the spread of documents within the search results, the query engine can sort, cluster, and/or filter the selected documents based on their associated nodes. To broaden the search, additional documents from related nodes can be returned.

Many of the above aspects and attendant advantages of the invention will be more readily appreciated and better understood when the following detailed description is read in conjunction with the accompanying drawings.

A branch/node classification depicted as a binary tree is one kind of hierarchical classification. FIG. 1 shows a binary tree 150. Subject node 154 represents the node of interest. In the context of an Internet search engine, subject node 154 may represent a category that is sufficiently similar to the user's query, or may be the location of a document that matches the query terms. Parent node 153 is one level higher (or one category broader) than subject node 154, and grandparent node 151 is two levels higher (or two categories broader). Child nodes 156, 158 are one level lower than subject node 154, and grandchild nodes 157, 159, 160, 161 are two levels lower. Sibling node 155 is at the same level as subject node 154 and is associated with the same parent node. Further levels of "great" nodes (not shown) may exist in both directions (great-grandparent, great-great-grandchild, and so on). As shown in FIG. 1, grandparent node 151 is the root node, that is, the highest-level node in binary tree 150. The binary tree may be balanced or unbalanced; however, by the nature of the binary tree, each node must have either exactly two children or no children.
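
The node relationships described for FIG. 1 can be sketched as a small data structure. This is only an illustration: the class name `TaxonomyNode`, its fields, and its methods are hypothetical and do not come from the disclosure. The sketch keeps the stated invariant that a node has either exactly two children or none.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class TaxonomyNode:
    """One node of a binary taxonomy tree (hypothetical layout, for illustration)."""
    term_probs: Dict[str, float] = field(default_factory=dict)  # P(term | node)
    documents: List[str] = field(default_factory=list)          # ids of assigned documents
    parent: Optional["TaxonomyNode"] = None
    left: Optional["TaxonomyNode"] = None
    right: Optional["TaxonomyNode"] = None

    def split(self) -> None:
        """Attach exactly two empty children, so a node has zero or two children."""
        self.left = TaxonomyNode(parent=self)
        self.right = TaxonomyNode(parent=self)

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None

    def sibling(self) -> Optional["TaxonomyNode"]:
        """The other child of this node's parent; None for the root."""
        if self.parent is None:
            return None
        return self.parent.right if self.parent.left is self else self.parent.left

    def depth(self) -> int:
        """Number of levels below the root (the root has depth 0)."""
        return 0 if self.parent is None else 1 + self.parent.depth()
```

In this sketch the root plays the role of grandparent node 151, and a node two levels below it sits at the depth of subject node 154.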

The documents in the training set can be selected from any suitable source. For example, a batch of documents may need to be categorized. To train the tree, at least a portion of the documents to be categorized can be selected as the set of training documents. Additional or alternative training documents can be selected from benchmark collections, including the Reuters® collection of news documents, the OHSUMED™ collection of medical documents, the 20 Newsgroups™ collection of posted newsgroup messages, and the AP™ collection of news documents.

As shown in FIG. 2, the set of training documents 210 is input to a tree generator 220, which generates a binary hierarchical classification tree based on information from the set of training documents, such as the terms in each document. The training documents can therefore be examined to determine a set of training terms based on the terms in all of the training documents.

The terms used to train the tree can be selected from the chosen training documents using any suitable method. FIG. 3 shows an example of the tree generator 220 of FIG. 2. Tree generator 220 may include a term generator 310 that determines a vector or list of training terms 320 to use in training the tree. For example, under the naive Bayes assumption, each document is treated as a collection of statistically independent terms, so each training document can be treated as a list or vector of terms.

The terms used to train the tree can be selected from all of the terms appearing in all of the documents, based on the cumulative number of occurrences of each term. A training term may appear in many documents and/or appear frequently within a particular document. In addition, the term generator may consult a predetermined list of exclusion terms so that terms known to be less effective for training are not selected. For example, prepositions, articles, and/or pronouns appear frequently in most documents but may not be good terms for training a classification tree. The exclusion list may be taken from an available stop list. Exclusion terms can be generated by any suitable method, including heuristics, the past performance of training terms, and whether a term occurs at substantially the same rate in every document of the training set.

In some cases it can be beneficial, for computational efficiency, to limit the number of terms used to train the system. In general, the top N terms according to some utility measure are selected as the training terms, where N ranges from 10,000 to 100,000 depending on the nature of the training corpus. The two simplest measures are the number of times a word is used in the corpus (the term count) and the number of documents that contain the word (the document count). Another useful measure combines the two. For example, the utility measure of a given term can be the square of the term count divided by the document count.

As shown in FIG. 3, term generator 310 receives the set of training documents 210, counts the number of occurrences of each term in each document, and accumulates those counts over all documents in the training set that contain the term. If the square of a term's occurrence count (term count) divided by the number of documents containing the term (document count) is large, the term is used frequently in the training documents. Conversely, if that ratio is small, either the term is used only occasionally or, if it is used in many documents, it appears only a few times in each. Other methods of selecting training terms are also suitable, including various ways of computing relative frequency, and/or multiple words may be tokenized into a phrase that is counted as a single term. The selected terms can be stored in a data store as a term vector 320, as shown in FIG. 3. Term vector 320 can be associated in the data store with the current node of the binary tree (the root node in the first iteration).
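
The term-selection measure described above (term count squared divided by document count) can be sketched as follows. The function name `select_training_terms`, its parameters, and the tie-breaking by alphabetical order are assumptions made for this illustration; documents are assumed to be already tokenized into lists of terms.

```python
from collections import Counter
from typing import AbstractSet, Iterable, List


def select_training_terms(docs: Iterable[List[str]], top_n: int,
                          stop_terms: AbstractSet[str] = frozenset()) -> List[str]:
    """Rank terms by (term count)^2 / (document count) and keep the top N."""
    term_count: Counter = Counter()  # total occurrences of each term in the corpus
    doc_count: Counter = Counter()   # number of documents containing each term
    for tokens in docs:
        kept = [t for t in tokens if t not in stop_terms]  # drop exclusion terms
        term_count.update(kept)
        doc_count.update(set(kept))
    utility = {t: term_count[t] ** 2 / doc_count[t] for t in term_count}
    # Highest utility first; ties broken alphabetically (an arbitrary choice).
    ranked = sorted(utility, key=lambda t: (-utility[t], t))
    return ranked[:top_n]
```

For example, a term that occurs 3 times across 2 documents scores 9 / 2 = 4.5, while a term that occurs 2 times across 2 documents scores 4 / 2 = 2.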

As shown in FIG. 3, term generator 310 passes term vector 320 to node generator 330. Node generator 330 can generate two child nodes of the current node, each associated with the term list or vector 320 of selected training terms. To form the two child nodes, each term in term vector 320 can be associated with the probability of that term appearing in a document, in other words, the probability that the word is chosen to appear in a document. The probabilities associated with the first child node can be stored in the data store as a set of term probabilities in vector 340, and those associated with the second child node as a set of term probabilities in vector 350, as shown in FIG. 3. Because each child node is associated with one term probability vector, two term probability vectors 340, 350 are generated, one for each of the two child nodes.

To develop each term probability vector 340, 350, the probability of each term appearing in a document can be initialized. The probabilities can be initialized by any suitable method, for example by generating them randomly with a random number generator, or by adjusting or perturbing the term occurrence counts in the set of training documents. In some cases it may be appropriate to initialize the probability of a term to a different value in each term probability vector. More specifically, it may be appropriate to ensure that the two term probability vectors 340, 350 are not identical.
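
A minimal sketch of random initialization, assuming each term probability vector is normalized to sum to 1 (the text does not state a normalization). The function name `init_term_probs` and the `seed` parameter are hypothetical.

```python
import random
from typing import Dict, List, Optional, Tuple


def init_term_probs(terms: List[str],
                    seed: Optional[int] = None) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return two different random probability vectors over the same terms."""
    if len(terms) < 2:
        raise ValueError("need at least two terms to build two distinct vectors")
    rng = random.Random(seed)

    def random_distribution() -> Dict[str, float]:
        # Small offset keeps every probability strictly positive.
        weights = [rng.random() + 1e-6 for _ in terms]
        total = sum(weights)
        return {t: w / total for t, w in zip(terms, weights)}

    first = random_distribution()
    second = random_distribution()
    while second == first:  # redraw in the unlikely case of identical vectors
        second = random_distribution()
    return first, second
```

Starting the two vectors from different values matters because identical vectors would give an optimizer no basis for separating the two child nodes.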

Node generator 330 can then optimize the term probabilities in the term probability vectors 340, 350 associated with the two child nodes. The term probabilities can be optimized using any suitable method, such as expectation maximization, genetic algorithms, neural networks, or simulated annealing. For example, node generator 330 can optimize the term probabilities to maximize the likelihood that each training document could be formed from the list of terms associated with both sibling nodes. More specifically, the term probabilities of each vector can be optimized over the entire training corpus by computing the probability that each training document is produced by the terms (term vector 320) associated with the first child node, based on the initialized term probabilities (term probability vector 340), and the probability that the same document is produced by the terms (term vector 320) associated with the second child node, based on the initialized term probabilities (term probability vector 350).

Using expectation maximization, node generator 330 shown in FIG. 3 can maximize the log-likelihood that all of the training documents are generated by the terms in each of the two sibling nodes. That log-likelihood is given by

$$L = \sum_{i,k} \sum_{j} n(d_i, w_{jk}) \log P(d_i, w_{jk})$$

where $n(d_i, w_{jk})$ is the number of occurrences of term $w_j$ in document $d_i$ at node $k$, and $P(d_i, w_{jk})$ is the probability of term $w_j$ of node $k$ appearing in document $d_i$, based on the probability of the term appearing in any document. The term probabilities associated with each node can then be adjusted iteratively to maximize the log-likelihood. The maximum may be an absolute maximum or a relative maximum. The resulting term probabilities are stored in vectors 340, 350 of FIG. 3 and associated with their respective child nodes in the data store. In this way, each of the two child nodes (or parent nodes 152, 153 of FIG. 1) is associated with a list of training terms (term vector 320) and with the respective probability of each term appearing in a document (term probability vectors 340, 350), optimized to maximize the log-likelihood that the set of training documents is formed from the terms of each child node.
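
The iterative adjustment can be illustrated with a generic expectation-maximization loop. This sketch swaps in a simple two-component mixture of unigram models, which is not necessarily the formulation of Non-Patent Document 1 and does not evaluate the log-likelihood expression above directly. The function name `em_two_nodes`, the equal starting priors, and the `floor` constant are assumptions; both initial vectors are assumed to cover the same terms with strictly positive probabilities.

```python
import math
from typing import Dict, List, Tuple

Counts = Dict[str, int]   # term -> occurrences in one document
Probs = Dict[str, float]  # term -> P(term | node)


def em_two_nodes(docs: List[Counts], init: Tuple[Probs, Probs],
                 iterations: int = 25, floor: float = 1e-9):
    """Fit two term-probability vectors; return them with per-document responsibilities."""
    probs = [dict(init[0]), dict(init[1])]  # copy so the caller's vectors are untouched
    terms = list(probs[0])
    priors = [0.5, 0.5]
    resp: List[List[float]] = []
    for _ in range(iterations):
        # E-step: how strongly each node explains each document (log space for stability).
        resp = []
        for counts in docs:
            scores = [
                math.log(priors[k])
                + sum(n * math.log(probs[k][t]) for t, n in counts.items() if t in probs[k])
                for k in range(2)
            ]
            top = max(scores)
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            resp.append([w / total for w in weights])
        # M-step: re-estimate term probabilities from responsibility-weighted counts.
        for k in range(2):
            expected = {t: floor for t in terms}  # floor keeps probabilities positive
            for counts, r in zip(docs, resp):
                for t, n in counts.items():
                    if t in expected:
                        expected[t] += r[k] * n
            norm = sum(expected.values())
            probs[k] = {t: c / norm for t, c in expected.items()}
            mass = sum(r[k] for r in resp)
            priors[k] = (mass + floor) / (len(docs) + 2 * floor)
    return probs, resp
```

Like any EM procedure, this converges to a local maximum that depends on the starting vectors, which matches the remark that the maximum may be relative rather than absolute.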

In one example, the problem can be formalized so that the word and document probabilities are solved using expectation maximization. Various versions of expectation maximization may be suitable; one representative example is described in Non-Patent Document 1, which is incorporated herein by reference. In some cases it may be appropriate to follow the expectation maximization approach as described by Hofmann but, rather than retraining the document probabilities in the expectation maximization process, to use a distance measure such as the KL divergence between the document probabilities and the word probabilities, which reduces the adjustment of model parameters for new documents.

To form the sets of training documents for the lower levels of the binary tree, the set of training documents 210 is assigned to at least one of the two child nodes. In this way, the documents associated with the first child node are used to generate two grandchild nodes, and the documents associated with the second child node are used to generate two further grandchild nodes, forming the binary tree 150 of FIG. 1.

As shown in FIG. 3, tree generator 220 may include a document assigner 360 that assigns each document in the set of training documents 210 to at least one of the two child nodes or to a null set. If a document is determined to be unsuitable for training, document assigner 360 can assign it to the null set. In this way, as shown in FIG. 3, three sets of documents can be formed: document set 362 associated with the first child node, document set 364 associated with the second child node, and document set 366, the null set of documents removed from the training set.

Document assigner 360 of FIG. 3 can associate each document of training set 210 with one or both of the two child nodes, or with the null set, using any suitable method such as an entropy or distance measure. For example, document assigner 360 can determine the KL divergence between each document and each of the two child nodes using the optimized term probability vectors 340, 350 associated with the respective child nodes. In one representative example, the KL divergence can be determined using

$$S_j = \sum_{i} P(w_i) \log \frac{P(w_i)}{Z_j(w_i)}$$

where $S_j$ is the KL divergence, $P(w_i)$ is the probability that term $w_i$ is found in the given document, and $Z_j(w_i)$ is the probability that term $w_i$ is found at node $j$. Other suitably defined distance or similarity measures are also appropriate, including a symmetric version of the above equation.
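
The divergence between a document and a node can be computed directly from the two probability vectors. In this sketch the function name `kl_divergence` is hypothetical, and every term with non-zero probability in the document is assumed to have a strictly positive probability at the node.

```python
import math
from typing import Dict


def kl_divergence(doc_probs: Dict[str, float], node_probs: Dict[str, float]) -> float:
    """Sum over terms of P(w) * log(P(w) / Z(w)); terms with P(w) = 0 contribute nothing."""
    return sum(p * math.log(p / node_probs[t])
               for t, p in doc_probs.items() if p > 0.0)
```

The result is zero when the two distributions agree and grows as they diverge. It is not symmetric: swapping the arguments generally gives a different value.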

In general, a document contains only a subset of the terms found at a given node. Smoothed word probabilities can therefore be used to keep the KL divergence bounded. The term probabilities can be smoothed using any suitable method. Practitioners in the field of textual information retrieval are familiar with several methods of smoothing word probabilities, including but not limited to simplified Jelinek-Mercer, Dirichlet prior, and absolute discounting. One representative example is described in Non-Patent Document 2, which is incorporated herein by reference. In this way, the probability of a term appearing in a document is never zero, because the document corpus as a whole supplies system knowledge that accounts for system error, and a new document is treated statistically as just one possible occurrence or combination of terms. Those skilled in the art will appreciate that other statistical measures of distance or similarity can be used, including Jensen-Shannon divergence, Pearson's chi-squared test, and the like.
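
As one illustration of smoothing, the sketch below applies Jelinek-Mercer interpolation, which mixes a document's observed term frequencies with corpus-wide term probabilities. The function name `jelinek_mercer` and the default weight `lam=0.1` are assumptions, not values from the disclosure.

```python
from typing import Dict


def jelinek_mercer(doc_counts: Dict[str, int],
                   collection_probs: Dict[str, float],
                   lam: float = 0.1) -> Dict[str, float]:
    """Return (1 - lam) * P(w | document) + lam * P(w | collection) for each term."""
    length = sum(doc_counts.values())
    smoothed = {}
    for term, p_collection in collection_probs.items():
        p_doc = doc_counts.get(term, 0) / length if length else 0.0
        smoothed[term] = (1.0 - lam) * p_doc + lam * p_collection
    return smoothed
```

A term that never occurs in the document still receives the share `lam * P(w | collection)`, so no smoothed probability is zero as long as the collection probability is positive.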

In one example, each document can be assigned to the node with the lowest KL divergence. Additionally or alternatively, a document can be assigned to a node if its KL divergence falls below a predetermined threshold. In some cases, the KL divergence to the first node and the KL divergence to the second node may be approximately equal or similar; the document can then be associated with both nodes. In other cases, the KL divergence to both nodes may be large relative to a predetermined threshold; the document is then assigned to the null set, for example because it is not suitable for use as a training document.
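
The assignment rules above can be combined into a single decision function. The order in which the rules are checked, the function name `assign_document`, and the parameters `max_divergence` and `tie_margin` are assumptions for this sketch; the text does not fix a precedence among the rules or give threshold values.

```python
from typing import List


def assign_document(kl_first: float, kl_second: float,
                    max_divergence: float, tie_margin: float) -> List[str]:
    """Return the set(s) a document joins: 'first', 'second', both, or 'null'."""
    # Far from both nodes: unsuitable as a training document.
    if min(kl_first, kl_second) > max_divergence:
        return ["null"]
    # Roughly equally close to both nodes: associate with both.
    if abs(kl_first - kl_second) <= tie_margin:
        return ["first", "second"]
    # Otherwise choose the node with the lower divergence.
    return ["first"] if kl_first < kl_second else ["second"]
```

Because a document can land in both sets, document sets 362 and 364 may overlap.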

The above steps are repeated recursively each time a new level of the binary tree is generated, and the process can stop when a cut (stopping) condition is met. As shown in FIG. 3, the tree generator may include a tree manager 370 that determines whether the cut condition has been met. The cut condition can be any suitable parameter or metric, such as: a minimum number of documents in a node (for example, the number of documents associated with a particular node falls below a threshold); the KL divergence from the two new nodes to the set of training documents being similar to the KL divergence between the set of training documents and the parent node (for example, the difference between the divergence to the parent node and the divergence to the child nodes is below a predetermined threshold); the depth of the tree along a given branch reaching a predetermined limit (for example, the number of layers in the tree exceeds a predetermined threshold); or the KL divergence between the two nodes falling below a predetermined threshold (for example, the difference between the first node and the second node is below a predetermined threshold).
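
The four example cut conditions can be checked together as sketched below. The function name `cut_reasons`, the reason labels, and all default thresholds are hypothetical; the text leaves the actual threshold values open.

```python
from typing import List


def cut_reasons(num_docs: int, depth: int, parent_kl: float, children_kl: float,
                sibling_kl: float, *, min_docs: int = 20, max_depth: int = 12,
                min_gain: float = 0.01, min_separation: float = 0.05) -> List[str]:
    """List every cut condition that holds; an empty list means keep splitting."""
    reasons = []
    if num_docs < min_docs:
        reasons.append("too_few_documents")
    if abs(parent_kl - children_kl) < min_gain:
        reasons.append("children_no_better_than_parent")
    if depth >= max_depth:
        reasons.append("depth_limit")
    if sibling_kl < min_separation:
        reasons.append("children_too_similar")
    return reasons
```

Returning the reasons rather than a single flag makes it easy to see which condition ended a branch, since branches of an unbalanced tree may stop for different reasons.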

Once at least some of the documents in the training set have been assigned to at least one of the two child nodes or to the null set, each child node is associated with a subset of the original set of training documents (document set 362 or document set 364). Tree manager 370 can then forward each of these document sets as a new set of training documents to generate a new list of training terms. More specifically, tree manager 370 can send document set 362 to term generator 310 for use as a set of training documents, to generate the set of training terms 320 associated with the two grandchild nodes of the first child node. Similarly, the tree manager can send document set 364 to term generator 310 for use as a set of training documents, to generate the set of training terms 320 associated with the two grandchild nodes of the second child node.

Each new set of training terms can be used by node generator 330 to generate and optimize an associated term probability vector for each grandchild node. As described above, a term probability vector can be initialized by generating the term probabilities randomly. Alternatively, the term probabilities from the previous level (the child node) can be adjusted to initialize the term probability vector associated with each grandchild node. For example, term probability vector 340 can be randomly adjusted to values of about 90% to about 110% of the original term probability values of the previous node, and term probability vector 350 can likewise be randomly adjusted to values of about 90% to about 110% of the original term probability values of the previous node.
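
Initialization by perturbing the previous level's probabilities can be sketched as follows. The function name `perturb_term_probs` is hypothetical, and the final renormalization to a sum of 1 is an assumption not stated in the text; after renormalizing, an individual value can fall slightly outside the 90% to 110% band.

```python
import random
from typing import Dict, Optional


def perturb_term_probs(parent_probs: Dict[str, float], seed: Optional[int] = None,
                       low: float = 0.9, high: float = 1.1) -> Dict[str, float]:
    """Scale each parent probability by a random factor in [low, high], then renormalize."""
    rng = random.Random(seed)
    scaled = {t: p * rng.uniform(low, high) for t, p in parent_probs.items()}
    total = sum(scaled.values())
    return {t: v / total for t, v in scaled.items()}
```

Calling the function twice with different seeds yields two distinct starting vectors, one for each grandchild node, that both remain close to the parent's distribution.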

The node generator 330 can then optimize the term probability values for each grandchild node. These optimized term probabilities are then associated with the respective two new grandchild nodes and can be used to further assign the documents to at least one of the four new grandchild nodes or to the null set. More specifically, each document in document set 362 can be associated with the null set or with at least one of the two grandchild nodes of the first child node, and each document in document set 364 can be associated with the null set or with at least one of the two grandchild nodes of the second child node. The associations between documents and nodes can be stored in a data store. As a result, as shown in FIGS. 2 and 3, a binary tree data structure 230 comprising a plurality of nodes is formed, in which each node is associated with a vector of terms (term vector 320) together with the associated probability of each term appearing in a document (term probability vector 340 or 350).

FIG. 4 illustrates an example method 400 of operation of the tree generator 220 of FIG. 2. At 410, the tree generator receives a set of training documents. As described above, under the Naive Bayes assumption, each document is represented as a list of the terms that appear in it. At 412, the tree generator counts the frequency of occurrence of each term in each document. Based on the list of terms appearing in the documents, the tree generator, via the term generator, selects at 414 a first set of training terms, represented as a term vector. For each training term in the term vector, the tree generator, via the node generator, generates at 416 a first probability that the training term appears in a given document; the set of first probabilities is represented as a first term probability vector. At 418, the first term probability vector is associated with the first child node. At 420, the term generator also generates, for each term in the term vector, a second probability that the term appears in a given document; the set of second probabilities is represented as a second term probability vector. At 422, the second term probability vector is associated with the second child node. As described above, the node generator initializes the term probability vectors with random values and optimizes those probabilities by expectation maximization, maximizing the log likelihood that the training documents are generated by the term probabilities associated with each of the first and second child nodes. At 424, the tree generator, via the document assigner, associates each training document (treated as a list of terms) with at least one of the first child node, the second child node, and a null set of documents not suited to training. At 426, the node generator, via the term generator, forms a second set of training terms, or second term vector, based on at least some of the terms appearing in the set of training documents associated with the first child node. Again via the node generator, the tree generator generates at 428, for each training term in the second term vector, a third probability that the training term appears in a document given that node, and at 430 associates the resulting third term probability vector with the first grandchild node. Similarly, the tree generator generates at 432, for each training term in the second term vector, a fourth probability that the training term appears in a document given that node, and at 434 associates the resulting fourth term probability vector with the second grandchild node. Based on the third and fourth term probability vectors, the tree generator, via the document assigner, associates at 436 each document associated with the first child node with at least one of the first grandchild node, the second grandchild node, and the null set. The process of FIG. 4, or any portion of it, can be repeated as necessary until a specified cut condition is reached.
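The repeated split-and-recurse control flow of method 400 can be outlined with the following Python sketch. It is only a skeleton under stated assumptions: `toy_split` is a hypothetical stand-in for the term selection, expectation-maximization training, and document assignment steps (414 to 424); the cut condition (`min_docs`, `max_depth`) is invented for illustration; and, for simplicity, each document goes to exactly one child, whereas the description also allows a document to be assigned to both children or to the null set.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    documents: List[List[str]]          # each document is treated as a list of terms
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def toy_split(docs):
    """Hypothetical stand-in for training two child nodes and assigning documents."""
    for term in sorted({t for d in docs for t in d}):
        with_term = [d for d in docs if term in d]
        without_term = [d for d in docs if term not in d]
        if with_term and without_term:
            return with_term, without_term
    return docs, []

def build_tree(docs, split=toy_split, min_docs=2, max_depth=3, depth=0):
    node = Node(documents=docs)
    # Assumed cut condition: too few documents or depth limit reached.
    if len(docs) < min_docs or depth >= max_depth:
        return node
    left_docs, right_docs = split(docs)
    if not left_docs or not right_docs:
        return node  # the split could not separate the documents
    # Each child's document set becomes the training set for the next level.
    node.left = build_tree(left_docs, split, min_docs, max_depth, depth + 1)
    node.right = build_tree(right_docs, split, min_docs, max_depth, depth + 1)
    return node

def count_leaves(node):
    if node.left is None and node.right is None:
        return 1
    return count_leaves(node.left) + count_leaves(node.right)

training_docs = [["guitar", "music"], ["music", "piano"],
                 ["tax", "income"], ["tax", "audit"]]
root = build_tree(training_docs)
leaf_count = count_leaves(root)
```

In a fuller implementation, the `split` argument would be replaced by a function that trains the two term probability vectors and assigns documents by a distance measure; the recursion itself would stay the same.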

Each training document set associated with a node can remain associated with that node in the resulting classification tree data structure when the training documents are a subset of the documents to be categorized by the binary classification tree. In one example, each document set remains assigned to its respective node regardless of that node's level in the tree, so that a parent node is associated with all of the documents of each of its child nodes. In another example, only the document sets associated with the leaf nodes of the resulting tree data structure are retained in the document association data store. Alternatively, if the set of training documents is not part of the set of documents to be categorized, the document associations can be ignored or deleted. In this way, the training documents can be used solely to train the classification tree.

New documents can be classified by associating each new document with a node of the binary tree, forming a hierarchical classification tree in which documents are associated with the nodes of binary tree data structure 250, as shown in FIG. 2. As shown in FIG. 2, the document sorter 240 receives a new document 242 and associates that document with at least one node of the tree 230. The node association of each document can be stored in a data store. The document sorter 240 can be identical to the document assigner 360 of the tree generator shown in FIG. 3, and the association can be based on an entropy or distance measure such as the Kullback-Leibler (KL) divergence. Unlike the training process, however, the list of terms and the associated term probabilities at each node are not adjusted. As a result, assigning a new document amounts to "walking" the nodes of the binary tree along a path determined by selecting, at each level, the next-level node that has the smallest KL divergence and/or whose KL divergence is below a predetermined threshold. Because the tree is walked separately by each document being assigned, the assignment process can be carried out with parallel computation.

Because a new document may contain terms that are not in the set of training documents, much of a document's probability may rest on the smoothing of term probabilities for terms that do not actually appear in the document. As described above, term probabilities can be smoothed using any suitable method, including simplified Jelinek-Mercer, Dirichlet prior, and absolute discounting. In this way, the entire corpus of documents provides system knowledge that accounts for system error, and a new document is treated statistically as only one possible occurrence or combination of terms, so that the probability of a term appearing in a document is never zero.
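As one illustration of the smoothing mentioned above, the following Python sketch applies Jelinek-Mercer interpolation, which mixes a node's term probabilities with corpus-wide term probabilities. The function name, the dictionary representation, the weight `lam`, and the sample numbers are assumptions for illustration; the description does not fix a particular formula or parameter value, and the sketch assumes every node term also appears in the corpus vocabulary.

```python
def jelinek_mercer(node_probs, corpus_probs, lam=0.1):
    """Interpolate node-level and corpus-level term probabilities.

    smoothed(t) = (1 - lam) * p(t | node) + lam * p(t | corpus)
    """
    return {
        term: (1.0 - lam) * node_probs.get(term, 0.0) + lam * corpus_p
        for term, corpus_p in corpus_probs.items()
    }

# Made-up probabilities: "tax" never occurs in this node's training documents.
node_probs = {"music": 0.6, "guitar": 0.4}
corpus_probs = {"music": 0.3, "guitar": 0.2, "tax": 0.5}
smoothed = jelinek_mercer(node_probs, corpus_probs, lam=0.1)
```

With these numbers the unseen term "tax" receives a small nonzero probability (0.1 × 0.5), so a new document containing it does not drive the node's likelihood to zero.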

FIG. 5 illustrates an example method 500 of operation of the document sorter 240 of FIG. 2. At 510, the document sorter accesses a new document to be associated with the binary tree classification data structure. At 512, the document sorter determines a first distance value between the new document and the first child node, and at 514 it determines a second distance value between the new document and the second child node. As described above, a distance measure such as the KL divergence is based on the probabilities of a list of terms appearing in a document, and each child node may have its own associated probability for each term.

At 516, the document sorter determines whether a cut condition is met. As described above, the cut condition can be any suitable condition, such as the KL divergence between the child nodes exceeding a given threshold, or the parent node being a leaf node of the binary tree. If the cut condition is met, the document is associated with the parent node of the two child nodes. If the cut condition is not met, the document sorter determines at 520 whether either of the determined distance values is below a distance threshold. The distance threshold can be predetermined and constant within the document sorter. In that case, if both distance values are below the distance threshold, the document can follow both nodes. Alternatively, the distance threshold can be a dynamic value based on the document being sorted. For example, the distance threshold can be the larger of the two computed distance values. If one of the distance values is below the distance threshold, the document sorter determines at 522 whether two child nodes extend from the child node associated with that distance value (that is, two grandchild nodes of the parent through that child node). For example, if the first distance value is below the threshold, the document sorter determines whether the first child node itself has two child nodes, i.e., whether the tree extends below the first child node. If two grandchild nodes exist, the document sorter determines a third distance value between the new document and the first grandchild node and a fourth distance value between the new document and the second grandchild node, as described above for determining the first and second distance values at 512 and 514. The document sorter continues to walk the binary tree until the cut condition is met and the document is associated with at least one node of the binary tree.
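The walk described for method 500 can be sketched in Python as follows. This is a simplified reading, not the patented procedure itself: the `Node` class, the node names, the probability floor used inside `kl_divergence`, the fixed threshold, and the sample probabilities are all assumptions. The cut condition is taken to be "the current node is a leaf, or neither child is within the threshold", in which case the document stays at the current node.

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Node:
    name: str
    probs: Dict[str, float]
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def kl_divergence(doc_probs, node_probs, floor=1e-6):
    """KL(doc || node); unseen node terms are floored instead of properly smoothed."""
    return sum(p * math.log(p / max(node_probs.get(t, 0.0), floor))
               for t, p in doc_probs.items() if p > 0)

def sort_document(doc_probs, node, threshold):
    """Return the names of the nodes the document ends up associated with."""
    if node.left is None or node.right is None:
        return [node.name]  # leaf reached: cut condition met
    assigned = []
    for child in (node.left, node.right):
        # A document may follow both children if both are close enough.
        if kl_divergence(doc_probs, child.probs) < threshold:
            assigned.extend(sort_document(doc_probs, child, threshold))
    # Neither child is close enough: keep the document at the current node.
    return assigned or [node.name]

tree = Node("root", {},
            left=Node("0", {"music": 0.6, "guitar": 0.4},
                      left=Node("00", {"guitar": 0.9, "music": 0.1}),
                      right=Node("01", {"music": 0.9, "guitar": 0.1})),
            right=Node("1", {"tax": 0.7, "audit": 0.3}))

strict = sort_document({"guitar": 1.0}, tree, threshold=1.0)
loose = sort_document({"guitar": 1.0}, tree, threshold=3.0)
```

With the stricter threshold the document follows a single branch; with the looser one it follows both grandchildren under node "0", matching the statement that a document can follow both nodes when both distances fall below the threshold.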

Rather than assigning a document to a single node based on the KL divergence at that node, the document sorter 240 can use a process different from that of the document assigner 360 to associate a new document with a node of the binary classification tree. In one example, the document sorter 240 assigns the document based on the minimum KL divergence over the document's entire path from the root node. More specifically, as described above, the document "walks" the tree based on the KL divergence computed between the document and the two sibling nodes at the next lower level. However, rather than associating the document with whichever of a given node's two alternatives has the smaller KL divergence value, the document's KL divergence values can be accumulated, or combined, into a total KL divergence value for the entire path the document walks through the tree. The document can then be assigned to the path whose combined KL divergence is below a predetermined threshold and/or is the lowest. The combined KL divergence value can be determined using any suitable method, including compound decision theory such as the Viterbi algorithm. The Viterbi algorithm can find, in the maximum a posteriori sense, the most likely node sequence or path through the binary tree, which can be regarded as a finite-node, discrete-time process. One representative example is described in Non-Patent Document 3, which is incorporated herein by reference.
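The difference between choosing the nearer child at each level and minimizing the accumulated divergence over a whole path can be seen in a short Python sketch. It uses a plain exhaustive recursion over root-to-leaf paths as a stand-in for Viterbi-style decoding; the `Node` class, the floor inside `kl_divergence`, the "0"/"1" path labels, and the sample probabilities are assumptions for illustration.

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Node:
    probs: Dict[str, float]
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def kl_divergence(doc_probs, node_probs, floor=1e-6):
    """KL(doc || node); unseen node terms are floored instead of properly smoothed."""
    return sum(p * math.log(p / max(node_probs.get(t, 0.0), floor))
               for t, p in doc_probs.items() if p > 0)

def best_path(doc_probs, node, path=(), total=0.0):
    """Return (summed KL, path) for the root-to-leaf path with the smallest sum."""
    if node.left is None or node.right is None:
        return total, path
    candidates = []
    for bit, child in (("0", node.left), ("1", node.right)):
        cost = kl_divergence(doc_probs, child.probs)
        candidates.append(best_path(doc_probs, child, path + (bit,), total + cost))
    return min(candidates)

tree = Node({},
            left=Node({"guitar": 0.4, "music": 0.6},
                      left=Node({"guitar": 0.9}), right=Node({"guitar": 0.1})),
            right=Node({"guitar": 0.5, "tax": 0.5},
                       left=Node({"guitar": 0.05}), right=Node({"guitar": 0.02})))

total, path = best_path({"guitar": 1.0}, tree)
```

In this made-up tree, a level-by-level choice would step right first (the right child is nearer at the first level), yet the path through the left child has the smaller accumulated divergence, so the cumulative criterion selects the left-left path.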

Associations between documents and nodes of the binary tree structure can be stored in a data store. The associations can be stored in any suitable format and/or index, such as an association data store, a table, a vector, or as part of a document's metadata. For example, each node of the tree can be addressed by its path in the hierarchical classification. As shown in FIG. 1, this path can be created by traversing the branches that connect the subject node 154 with its upper nodes (e.g., parent and grandparent nodes) and lower nodes (children and grandchildren). This path is called a node path or category path and is stored in the form "grandparent/parent/subject node/child". It should be understood that any suitable indication of a node's position within the tree structure may be used. For example, a binary string can indicate the path to a node, with "0" denoting a traversal to the left child and "1" a traversal to the right child. In another example, the nodes can be numbered, e.g., the grandparent node is 1 and the parent nodes are numbered 2 and 3, respectively. In one example, the document sorter 240 stores a string indicating the path of the associated node in a data store, such as a database, an index, or a portion of the document metadata.
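The two addressing schemes mentioned above, binary strings and node numbers, can be related to each other with a small Python sketch. The description only gives the first three numbers (1 for the top node, 2 and 3 for its children); extending that to a level-order numbering in which the children of node n are 2n and 2n+1 is an assumption made here, as are the function names.

```python
def path_to_number(bits: str) -> int:
    """Map a path such as "01" (left, then right) to a level-order node number."""
    if any(b not in "01" for b in bits):
        raise ValueError("path must contain only '0' and '1'")
    # Prefixing "1" for the root makes the root 1 and its children 2 and 3.
    return int("1" + bits, 2)

def number_to_path(number: int) -> str:
    """Inverse mapping: recover the left/right path from a node number."""
    if number < 1:
        raise ValueError("node numbers start at 1")
    return bin(number)[3:]  # drop the "0b" prefix and the leading root bit

root_number = path_to_number("")      # the root itself
left_child = path_to_number("0")
right_child = path_to_number("1")
deep_node = path_to_number("01")      # left child, then its right child
```

Either form can be stored as the node association for a document; the string form makes it easy to test whether one node is an ancestor of another by checking for a common prefix.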

As shown in FIG. 2, the binary classification tree 250, with its associated documents, is sent to an information retrieval system 260 to be used for retrieving, clustering, sorting, and/or filtering documents as needed. For example, documents associated with a particular node can be retrieved based on the node assignment, such as by finding a node that matches specified query terms and retrieving the documents in that node. In some cases, a conventional inverted index may be used by a search engine to return selected documents in response to a user's query. To address the problem of variation among the documents in the search results, the query engine can sort or cluster the selected documents based on their associated nodes. Additionally or alternatively, a hierarchical tree specific to the documents selected by the query engine can be formed. In this way, at least a portion of the retrieved documents can be used to generate or train a binary tree specific to those documents, and the documents can then be sorted or clustered according to their respective nodes so as to present hierarchical search results to the computer user. The hierarchical classification tree can also be used to filter documents so that only documents matching the user's preferences are returned. In addition, the classification tree can return indications of additional documents that are similar to, or associated with, the selected documents. For example, the query engine can retrieve documents based on the query terms, and each retrieved document can be associated with a particular node of the binary classification tree. The query engine can then return not only the retrieved documents but also a list of documents associated with the same node and/or adjacent nodes, expanding the search beyond the query terms presented by the user. Additionally or alternatively, the labels associated with adjacent nodes can be returned to the user along with the retrieved documents to suggest where to search further for the desired documents. Categorized documents can also be searched based on their node associations, limiting the search to only a portion of the available documents. It should be understood that any suitable information retrieval method and use may be based on the binary tree described above.
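Two of the uses described above, clustering search results by node and expanding a result with documents from the same or an adjacent node, can be sketched in Python. The document identifiers, the `node_of` mapping from document to binary path string, the function names, and the reading of "adjacent node" as the sibling node are all assumptions for illustration.

```python
from collections import defaultdict

def cluster_by_node(doc_ids, node_of):
    """Group retrieved documents by the tree node each one is associated with."""
    clusters = defaultdict(list)
    for doc_id in doc_ids:
        clusters[node_of[doc_id]].append(doc_id)
    return dict(clusters)

def sibling_path(path):
    """Return the path of the sibling node, or None for the root."""
    if not path:
        return None
    flipped = "1" if path[-1] == "0" else "0"
    return path[:-1] + flipped

def related_documents(doc_id, node_of):
    """Other documents in the same node or in the sibling node."""
    wanted = {node_of[doc_id], sibling_path(node_of[doc_id])}
    return sorted(d for d, p in node_of.items() if p in wanted and d != doc_id)

# Hypothetical stored associations: document id -> binary path of its node.
node_of = {"d1": "00", "d2": "00", "d3": "01", "d4": "1"}

clusters = cluster_by_node(["d1", "d3", "d4"], node_of)
related = related_documents("d1", node_of)
```

Here a query that returned d1, d3, and d4 is presented as three clusters, and a request for documents related to d1 also surfaces d2 and d3, which the query itself may not have matched.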

FIG. 6 illustrates an example of a suitable computing system environment 900 in which any combination of the tree generator 220 and the document sorter 240 can be implemented. The computing system environment 900 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computing environment 900 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the example operating environment 900.

The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, handheld or laptop devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.

The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules can be located in both local and remote computer storage media, including memory storage devices.

With reference to FIG. 6, an example system for implementing the invention includes a general purpose computing device in the form of a computer 20. The components of the computer 20 include, but are not limited to, a processing unit 21, a system memory 22, and a system bus 23 that couples various system components, including the system memory, to the processing unit 21. The system bus 23 may be any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus, also known as the Mezzanine bus.

Computer 20 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by computer 20 and include both volatile and nonvolatile media and both removable and non-removable media. By way of example, and not limitation, computer-readable media can include computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by computer 20. Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media include wired media, such as a wired network or direct-wired connection, and wireless media, such as acoustic, RF, infrared, and other wireless media. Combinations of any of the above should also be included within the scope of computer-readable media.

The system memory 22 includes computer storage media in the form of volatile and/or nonvolatile memory, such as read only memory (ROM) 24 and random access memory (RAM) 25. A basic input/output system 26 (BIOS), containing the basic routines that help to transfer information between elements within the computer 20, such as during start-up, is typically stored in ROM 24. RAM 25 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by the processing unit 21. By way of example, and not limitation, FIG. 6 illustrates an operating system 35, application programs 36, other program modules 37, and program data 38.

The computer 20 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, FIG. 6 illustrates a hard disk drive 27 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 28 that reads from or writes to a removable, nonvolatile magnetic disk 29, and an optical disk drive 30 that reads from or writes to a removable, nonvolatile optical disk 31, such as a CD-ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the example operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, and solid state ROM. The hard disk drive 27 is typically connected to the system bus 23 through a non-removable memory interface such as interface 32, and the magnetic disk drive 28 and optical disk drive 30 are typically connected to the system bus 23 by a removable memory interface such as interface 33.

The drives and their associated computer storage media described above and illustrated in FIG. 6 provide storage of computer-readable instructions, data structures, program modules, and other data for the computer 20. In FIG. 6, for example, hard disk drive 27 is illustrated as storing operating system 35, application programs 36, other program modules 37, and program data 38. Note that these components can either be the same as or different from the operating system 35, application programs 36, other program modules 37, and program data 38 held in system memory; at a minimum, they are different copies. A user may enter commands and information into the computer 20 through input devices such as a keyboard 40 and a pointing device 42, commonly referred to as a mouse, trackball, or touch pad. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 21 through a user input interface 46 that is coupled to the system bus, but they may be connected by other interface and bus structures, such as a parallel port, game port, or universal serial bus (USB). A monitor 47 or other type of display device is also connected to the system bus 23 via an interface, such as a video interface 58. In addition to the monitor, the computer may also include other peripheral output devices, such as speakers and a printer, which may be connected through an output peripheral interface.

The computer 20 can operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 49. The remote computer 49 may be a personal computer, a server, a router, a network PC, a peer device, or other common network node, and typically includes many or all of the elements described above relative to the computer 20, although only a memory storage device 50 has been illustrated in FIG. 6. The logical connections depicted in FIG. 6 include a local area network (LAN) 51 and a wide area network (WAN) 52, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets, and the Internet.

When used in a LAN networking environment, the computer 20 is connected to the LAN 51 through a network interface or adapter 53. When used in a WAN networking environment, the computer 20 typically includes a modem 54 or other means for establishing communications over the WAN 52, such as the Internet. The modem 54 may be internal or external and can be connected to the system bus 23 via the user input interface 46 or another suitable mechanism. In a networked environment, program modules shown in connection with the computer 20, or portions thereof, may be stored in a remote memory storage device. By way of example, and not limitation, FIG. 6 shows remote application programs 36 as residing on memory device 50. It will be appreciated that the network connections shown are exemplary and that other means of establishing a communications link between the computers may be used.

While the preferred embodiment of the invention has been illustrated and described, it will be appreciated that various changes can be made therein without departing from the spirit and scope of the invention.

The embodiments of the invention in which an exclusive property or privilege is claimed are defined in the claims.

FIG. 1 is a diagram illustrating an example hierarchical binary tree in one embodiment.
FIG. 2 is an example schematic diagram illustrating a binary tree classification process suitable for forming and using the binary tree of FIG. 1 in one embodiment.
FIG. 3 is a schematic diagram illustrating an example tree generation of the classification process of FIG. 2 in one embodiment.
FIG. 4 is a flowchart illustrating an example method for generating a classification binary tree in one embodiment.
FIG. 5 is a flowchart illustrating an example method for assigning a document to a binary tree in one embodiment.
FIG. 6 is a block diagram illustrating an example system useful for implementing an embodiment of the present invention.

Explanation of symbols

151-161 Nodes
210 Training documents
220 Tree generator
240 Document sorter
242 Document
260 Information retrieval system
310 Term generator
320 Term vector
330 Node generator
360 Document assigner
362, 364, 366 Document sets
370 Tree manager

Claims

A computer-readable medium having computer-executable components comprising:
(A) a node generator configured to receive a list of training terms based on a set of training documents, to generate a first sibling node including a first set of probabilities, and to generate a second sibling node including a second set of probabilities, wherein the first set of probabilities includes, for each term in the list of training terms, a probability that the term appears in a document, and the second set of probabilities includes, for each term in the list of training terms, a probability that the term appears in a document;
(B) a document assigner configured to associate, based on the first and second sets of probabilities, each document of the set of training documents with at least one selected from the group consisting of the first sibling node, the second sibling node, and a null set, wherein the documents associated with the first sibling node form a first document set and the documents associated with the second sibling node form a second document set; and
(C) a tree manager configured to pass at least one of the first document set and the second document set to the node generator, and to create a binary tree data structure including a hierarchy of a plurality of sibling nodes based on recursive performance of the node generator and the document assigner.

The computer-readable medium of claim 1, further comprising a document sorter configured to associate a new document with at least one of the plurality of sibling nodes based on the generated sets of probabilities.

The computer-readable medium of claim 2, wherein the document sorter compares statistical distances between the new document and each of the first and second sibling nodes.

The computer-readable medium of claim 1, further comprising a term generator configured to receive the set of training documents and to generate the list of training terms based on terms appearing in at least a portion of the documents in the set of training documents.

The computer-readable medium of claim 4, wherein the term generator generates the list of training terms based on the frequency of occurrence of the terms appearing in at least a portion of the documents.

The computer-readable medium of claim 4, wherein the term generator takes into account a predetermined list of excluded terms.

The computer-readable medium of claim 1, wherein the node generator is configured to determine the first and second sets of probabilities so as to maximize the likelihood that all of the training documents are associated with the first and second nodes based on the first and second sets of probabilities.

The computer-readable medium of claim 7, wherein the node generator maximizes the likelihood based on an expectation maximization algorithm.
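As an illustration of the expectation maximization step named in the preceding claim, the following is a minimal Python sketch that fits two sibling nodes to a set of training documents. It rests on assumptions not stated in the claims: each node is modeled as a multinomial distribution over the training terms, documents are given as term-count dictionaries, the nodes are seeded from the first and last documents, and the names `em_two_nodes`, `smoothing`, and `iterations` are hypothetical.

```python
import math

def em_two_nodes(docs, vocabulary, iterations=50, smoothing=0.01):
    """Fit two sibling nodes to docs (a list of {term: count} dictionaries).

    Returns (nodes, resp): nodes[k][term] is the probability of the term
    given node k, and resp[i][k] is the posterior weight of node k for
    document i. The multinomial-mixture model is an assumption.
    """
    def normalise(counts):
        total = sum(counts.get(t, 0) + smoothing for t in vocabulary)
        return {t: (counts.get(t, 0) + smoothing) / total for t in vocabulary}

    # Assumed initialisation: seed the nodes from the first and last documents.
    nodes = [normalise(docs[0]), normalise(docs[-1])]
    priors = [0.5, 0.5]
    resp = []
    for _ in range(iterations):
        # E-step: posterior probability of each node for each document.
        resp = []
        for doc in docs:
            logs = [
                math.log(priors[k])
                + sum(n * math.log(nodes[k][t]) for t, n in doc.items() if t in nodes[k])
                for k in range(2)
            ]
            peak = max(logs)
            weights = [math.exp(v - peak) for v in logs]
            resp.append([w / sum(weights) for w in weights])
        # M-step: re-estimate term probabilities and priors from weighted counts.
        for k in range(2):
            weighted = {
                t: sum(r[k] * doc.get(t, 0) for r, doc in zip(resp, docs))
                for t in vocabulary
            }
            nodes[k] = normalise(weighted)
            priors[k] = max(sum(r[k] for r in resp) / len(docs), 1e-12)
    return nodes, resp
```

The posterior weights in `resp` could then feed a document assigner, for example by thresholding them or by computing a distance to each node.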

The computer-readable medium, wherein the document assigner determines a statistical distance value between each document of the set of training documents and each of the first node and the second node.

The computer-readable medium of claim 9, wherein the document assigner associates a document of the set of training documents with the first node when the determined distance value between the document and the first node is below a predetermined threshold.

The computer-readable medium of claim 9, wherein the distance value is a KL divergence value.
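The distance-based assignment in the claims above can be sketched in Python as follows. This is an illustrative sketch only: it assumes the document is represented by its smoothed relative term frequencies over the training terms, that every node assigns a nonzero probability to every term, and that the names `term_distribution`, `kl_divergence`, `assign`, and `smoothing` are hypothetical.

```python
import math
from collections import Counter

def term_distribution(tokens, vocabulary, smoothing=1e-6):
    """Smoothed relative frequency of each vocabulary term in a document."""
    counts = Counter(t for t in tokens if t in vocabulary)
    total = sum(counts.values()) + smoothing * len(vocabulary)
    return {term: (counts[term] + smoothing) / total for term in vocabulary}

def kl_divergence(p, q):
    """D_KL(p || q) = sum over terms of p(t) * log(p(t) / q(t)), in nats."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0)

def assign(tokens, nodes, threshold):
    """Names of the nodes whose divergence from the document is below threshold.

    nodes maps a node name to {term: probability of the term given the node}.
    An empty result corresponds to the null set.
    """
    vocabulary = sorted(next(iter(nodes.values())))
    p = term_distribution(tokens, vocabulary)
    return [name for name, q in nodes.items() if kl_divergence(p, q) < threshold]
```

Because each node is tested against the threshold independently, a document may be associated with the first node, the second node, both, or neither.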

A computer-readable medium storing a binary tree data structure comprising:
(A) a root node stored in at least one region of the computer-readable medium and associated with a first list of probabilities assigned to individual terms detected in a set of training documents;
(B) a first child node stored in at least one region of the computer-readable medium, associated with the root node in a parent-child relationship, and associated with a second list of probabilities assigned to the individual terms detected in the set of training documents; and
(C) a second child node stored in at least one region of the computer-readable medium, associated with the root node in a parent-child relationship, and associated with a third list of probabilities assigned to the individual terms detected in the set of training documents.

A computer-readable medium storing a document, the document comprising:
(A) a plurality of terms appearing in the document; and
(B) metadata including a node indicator indicating which nodes of a binary classification tree are associated with the document, each node of the binary classification tree being associated with a term list and a term probability list.

The computer-readable medium of claim 13, wherein the metadata comprises a text string.

The computer-readable medium of claim 14, wherein the text string includes a binary indication of the path to the associated node through the binary classification tree.
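One way to picture the text-string node indicator described above is a string of binary digits read from the root. The Python sketch below assumes, purely for illustration, that '0' selects the first child and '1' the second child, and that nodes are plain dictionaries with a `children` list; the claims do not fix either convention, and `encode_path` and `decode_path` are hypothetical names.

```python
def encode_path(turns):
    """Encode a root-to-node path as text: '0' = first child, '1' = second child."""
    return "".join("1" if turn else "0" for turn in turns)

def decode_path(root, path):
    """Follow a binary path string from the root and return the node it names."""
    node = root
    for bit in path:
        node = node["children"][int(bit)]
    return node
```

Under these assumptions the empty string names the root, and a document's metadata could carry a value such as "01" to indicate the second child of the root's first child.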

A method comprising:
(A) creating a binary classification tree based on a set of training documents, wherein each node of the binary classification tree is associated with a list of terms, and each term in the list of terms is associated with a probability of that term appearing in a document given that node; and
(B) associating a new document with at least one node of the binary classification tree based on a distance value between the new document and the node.

The method of claim 16, wherein creating the binary classification tree comprises determining, based on an expectation maximization algorithm, each probability of a term appearing in a document so as to maximize the likelihood that each document in the set of training documents is generated by the lists of terms associated with each of two sibling nodes of the binary classification tree.

The method of claim 16, wherein the distance value is determined based on KL divergence.

The method of claim 18, wherein the new document is associated with a node having a KL divergence below a distance threshold.

The method of claim 18, wherein associating the new document includes associating the new document with the node whose path has the smallest KL divergence across the path.

The method of claim 16, wherein creating the binary classification tree includes determining each list of terms associated with a node based on the list of terms associated with the parent node of that node.

The method of claim 16, wherein creating the binary classification tree includes associating at least a portion of the set of training documents with at least one of a first child node, a second child node, and a null set.

The method of claim 22, wherein associating at least a portion of the set of training documents is based on the probability associated with each term for the first child node and the probability associated with each term for the second child node.

A computer-readable medium having computer-executable instructions for performing steps comprising:
(A) accessing a document;
(B) determining a first distance value between the document and a first of two sibling nodes based on a first probability of a set of training terms appearing in the document;
(C) determining a second distance value between the document and a second of the two sibling nodes based on a second probability of the set of training terms appearing in the document;
(D) if the first distance value is below a distance threshold, determining whether two child nodes are associated with the first of the two sibling nodes;
(E) if two child nodes are associated with the first of the two sibling nodes, determining a third distance value between the document and a first of the two child nodes and determining a fourth distance value between the document and a second of the two child nodes; and
(F) if two child nodes are associated with the first of the two sibling nodes, associating the document with at least one of the two child nodes based on the third distance value and the fourth distance value.

The computer-readable medium of claim 24, wherein determining the first distance value includes determining a first KL divergence between the document and the first of the two sibling nodes, and determining the second distance value includes determining a second KL divergence between the document and the second of the two sibling nodes.

The computer-readable medium of claim 24, wherein the distance threshold is the second distance value.

The computer-readable medium of claim 24, wherein the distance threshold is a predetermined entropy value.

The computer-readable medium of claim 24, the steps further comprising determining whether the second distance value is below the distance threshold and determining whether two other child nodes are associated with the second of the two sibling nodes.

The computer-readable medium of claim 28, the steps further comprising, if the two other child nodes are associated with the second of the two sibling nodes, determining a fifth distance value between the document and a first of the two other child nodes and determining a sixth distance value between the document and a second of the two other child nodes.

The computer-readable medium of claim 28, the steps further comprising associating the document with the second of the two sibling nodes if the two other child nodes are not associated with the second of the two sibling nodes.

The computer-readable medium of claim 24, the steps further comprising associating the document with the first of the two sibling nodes if two child nodes are not associated with the first of the two sibling nodes.

The computer-readable medium of claim 24, the steps further comprising associating the document with a parent node of the first and second of the two sibling nodes if neither the first distance value nor the second distance value is below the distance threshold.
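The descent through sibling and child nodes recited in the claims above can be sketched as a short recursive walk. The Python sketch below is a sketch under stated assumptions rather than the claimed implementation: it uses KL divergence as the distance value, a single fixed distance threshold at every level, and hypothetical names (`Node`, `distance`, `walk`). It also assumes every node assigns a nonzero probability to every term that occurs in the document.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    term_probs: dict                               # term -> probability given this node
    children: list = field(default_factory=list)   # empty, or exactly two siblings

def distance(doc_probs, node):
    """KL divergence from the document's term distribution to the node's."""
    return sum(p * math.log(p / node.term_probs[t])
               for t, p in doc_probs.items() if p > 0)

def walk(doc_probs, parent, threshold):
    """Return the nodes the document ends up associated with."""
    if not parent.children:
        return [parent]                    # no child nodes: stay at this node
    assigned = []
    for sibling in parent.children:
        if distance(doc_probs, sibling) < threshold:
            # Descend only into siblings that are close enough to the document.
            assigned.extend(walk(doc_probs, sibling, threshold))
    # If neither sibling is below the threshold, fall back to the parent node.
    return assigned or [parent]
```

A document may follow both branches when both siblings fall below the threshold, which is why `walk` returns a list of nodes.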

A method comprising:
(A) receiving a set of training documents each including a list of terms;
(B) selecting a first set of training terms from at least a portion of the terms listed in the lists of terms;
(C) generating, for each training term, a first probability that the training term appears in a document and associating the first probability with a first node;
(D) generating, for each training term, a second probability that the training term appears in a document and associating the second probability with a second node;
(E) associating, based on the first and second probabilities for each training term, each list of terms with at least one of the group consisting of the first node, the second node, and a null set;
(F) forming a second set of training terms from at least a portion of the terms listed in the lists of terms associated with the first node;
(G) generating, for each training term in the second set of training terms, a third probability that the training term appears in a document and associating the third probability with a third node;
(H) generating, for each training term in the second set of training terms, a fourth probability that the training term appears in a document and associating the fourth probability with a fourth node; and
(I) associating, based on the third and fourth probabilities for each training term, each list of terms with at least one of the group consisting of the third node, the fourth node, and the null set.

The method of claim 33, wherein generating the probabilities of the terms includes maximizing the probability that each list of terms is in at least one of a first node and a second node of a layer of the binary tree.

The method of claim 33, further comprising assigning a new document to a node of the binary tree.

The method of claim 35, wherein assigning the new document includes generating a new list of terms that appear in the new document and walking the tree based on the probability that each term is associated with each node of the tree.