JP2008146424A

JP2008146424A - Method for calculating conformity of XML document, program thereof, and information processing apparatus

Info

Publication number: JP2008146424A
Application number: JP2006333993A
Authority: JP
Inventors: Masaki Hyodo; 正樹兵藤; Toshibumi Enomoto; 俊文榎本; Hiroki Akama; 浩樹赤間; Masashi Yamamuro; 雅司山室
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: NTT Inc
Priority date: 2006-12-12
Filing date: 2006-12-12
Publication date: 2008-06-26
Anticipated expiration: 2026-12-12
Also published as: JP4839195B2

Abstract

[Problem] To perform processing at high speed when a path and a search word are input to calculate the degree of conformity for stored XML documents.
[Solution] In an information processing apparatus that stores a plurality of XML documents and ranks them by a path and a search word, a storage unit stores path information, node information, and word information. The path information stores, as per-path information, path statistical information, which is statistical information about the set of partial documents under each path.
The processing unit updates the path information, including the path statistical information, before the path and the search word are input. The path statistical information therefore does not need to be calculated when the degree of conformity is calculated, and the search can be processed at high speed.
[Selected drawing] Figure 1

Description

The present invention relates to a technique for calculating, for a plurality of stored XML (eXtensible Markup Language) documents, the degree of conformity to a search query that consists of a path and a search word.

In recent years, documents created on computers are often written in XML, one of the markup languages, because an XML document (hereinafter also simply called a "document") has many advantages.
For example, an XML document is easy to use because the user can define the tags that express and manage the document structure, and its hierarchical structure makes it well suited to organizing data. In addition, because the data is text rather than binary, the user can easily inspect the content, and because XML is recognized as a worldwide standard, it is compatible with many other applications.

As the number of XML documents grows, so does the need to improve the accuracy and speed of searches over a plurality of stored XML documents. Hereinafter, ordering documents by how well they match a search query is called ranking. At search time, a score is calculated from some statistic about the occurrences of the search word in each document, and the ranking is determined by comparing the scores.

Many ranking techniques for stored documents statistically analyze the content of the documents (see, for example, Non-Patent Document 1).
The statistics used to calculate the score generally include "statistics of the entire document set" (such as the average length of all documents, hereinafter called the "average length"), "statistics of each document" (such as the length of each document), and "statistics for the search word" (such as the frequency of the search word in each document).

To search XML documents efficiently, it must be possible to extract the content needed for the search from the XML document, which is structured data, at high speed. An index that records the structure and content of the XML documents (a structure index) is therefore generally built in advance. There are various ways to build a structure index, but in many cases a path index is built that associates every path in the XML documents (a character string indicating where data is located, such as "/book/chapter/section") with the content corresponding to that path (see, for example, Patent Document 1).
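As a rough illustration of the path index idea, the following Python sketch maps each element path to the text found directly at that path. The function name `build_path_index` and the sample document are assumptions made for this example; they do not come from the patent or from Patent Document 1.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_path_index(xml_text):
    """Map each element path (e.g. /book/chapter/section) to the texts found there."""
    index = defaultdict(list)

    def walk(elem, parent_path):
        path = f"{parent_path}/{elem.tag}"
        entries = index[path]  # registers the path even when it has no direct text
        text = (elem.text or "").strip()
        if text:
            entries.append(text)
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return dict(index)

# Hypothetical sample document, loosely modeled on the book example.
SAMPLE = (
    "<book><title>Space</title>"
    "<chapter><title>Earth</title>"
    "<section><text>The earth is round.</text></section>"
    "</chapter></book>"
)
path_index = build_path_index(SAMPLE)
```

With such an index, the content for a path like "/book/title" can be looked up directly instead of re-parsing every document.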

Because the structure of an XML document carries meaning, it is desirable to rank XML documents using that structure, since doing so can be expected to give more accurate search results. In other words, when ranking XML documents, it is preferable to calculate the score per partial document (a part of an XML document) rather than per document, as conventional ranking does.

That is, as the statistics used to calculate ranking scores for XML documents, it is desirable to use "statistics of the entire partial document set" instead of "statistics of the entire document set", and "statistics of each partial document" instead of "statistics of each document". For the "statistics for the search word" as well, it is better, for example, to identify where the word occurs at a finer granularity than the document.
[Non-Patent Document 1] Kazuaki Kishida and two others, "Methods and Practice of Retrieval Experiments: An Attempt at the NTCIR Workshop", Pre-meeting Lecture at the NTCIR-3 Workshop, October 8, 2002, pp. 9-10
[Patent Document 1] JP 2006-228155 A

However, the "statistics of the entire partial document set", "statistics of each partial document", and "statistics for the search word" described above change with the content of the search query, because the path in the query determines which partial documents are in scope.
Consequently, with the conventional techniques, including Non-Patent Document 1 and Patent Document 1, these statistics must be calculated for every search (ranking). When a large number of XML documents are stored, the calculation cost becomes enormous and processing becomes slow (details are given at the beginning of "Best Mode for Carrying Out the Invention").

The present invention was made in view of this problem, and its object is to perform processing at high speed when the degree of conformity is calculated for stored XML documents by specifying a path and a search word.

To solve the above problem, in the invention according to claims 1 and 6, the information processing apparatus includes: a storage unit that stores path information holding per-path information about a plurality of XML documents, node information holding, for each XML document, relationship information between the nodes in that document, and word information holding occurrence position information for each word used in the plurality of XML documents; a path index unit that stores the per-path information about the plurality of XML documents in the path information; a node management unit that analyzes the relationships between nodes in each XML document, including parent-child relationships, and stores the relationship information in the node information; a text index unit that stores each word used in any of the plurality of XML documents in the word information, in association with occurrence position information that includes the document ID of the XML document in which the word is used; and a ranking unit that, when a path and a search word are input, calculates, for the plurality of XML documents, the degree of conformity of the search word in the partial documents, which are the documents under the input path.
The path information stores, as per-path information, path statistical information, which is statistical information about the set of partial documents under each path. The path index unit updates the path information, including the path statistical information, for the stored XML documents before the path and the search word are input. After the path and the search word are input, the ranking unit refers to the path information including the path statistical information, the node information, and the word information, and calculates the degree of conformity for the plurality of XML documents based on the path and the search word.

According to this invention, the path statistical information, which is the statistic of the entire partial document set, can be calculated and stored before the path and the search word are input, and using it allows the degree of conformity of XML documents to be calculated at high speed.

In the invention according to claims 2 and 7, the information processing apparatus includes: a storage unit that stores path information holding per-path information about a plurality of XML documents, node information holding, for each XML document, relationship information between the nodes in that document, and word information holding occurrence position information for each word used in the plurality of XML documents; a path index unit that stores the per-path information about the plurality of XML documents in the path information; a node management unit that analyzes the relationships between nodes in each XML document, including parent-child relationships, and stores the relationship information in the node information; a text index unit that stores each word used in any of the plurality of XML documents in the word information, in association with occurrence position information that includes the document ID of the XML document in which the word is used; and a ranking unit that, when a path and a search word are input, calculates, for the plurality of XML documents, the degree of conformity of the search word in the partial documents, which are the documents under the input path.
The node information stores, as per-node information, node statistical information, which is statistical information about the partial document under each node. The node management unit updates the node information, including the node statistical information, before the path and the search word are input. After the path and the search word are input, the ranking unit refers to the path information, the node information including the node statistical information, and the word information, and calculates the degree of conformity for the plurality of XML documents based on the path and the search word.

According to this invention, the node statistical information, which is the statistic of each partial document, can be calculated and stored before the path and the search word are input, and using it allows the degree of conformity of XML documents to be calculated at high speed.
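One way to picture precomputed node statistical information is a table that records, for every node, the text length of the partial document rooted at that node. The sketch below is only an illustration under stated assumptions: length is measured in characters, text following a child element (tail text) is ignored, and the hierarchical node IDs ("n1", "n11", "n12", ...) and the function name `node_lengths` are hypothetical. The ID scheme shown would become ambiguous for nodes with more than nine children.

```python
import xml.etree.ElementTree as ET

def node_lengths(root):
    """Return {node_id: text length of the partial document rooted at that node}."""
    lengths = {}

    def walk(elem, node_id):
        # Length of this node's own text (tail text is ignored in this sketch).
        total = len((elem.text or "").strip())
        for i, child in enumerate(elem, start=1):
            total += walk(child, f"{node_id}{i}")
        lengths[node_id] = total
        return total

    walk(root, "n1")
    return lengths

# Hypothetical sample document.
root = ET.fromstring(
    "<book><title>Space</title>"
    "<chapter><title>Earth</title><text>Blue planet</text></chapter></book>"
)
lengths = node_lengths(root)
```

Because each length is computed once when the document is stored, a later query only has to read the values for the nodes that its path selects.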

In the invention according to claims 3 and 8, the information processing apparatus includes: a storage unit that stores path information holding per-path information about a plurality of XML documents, node information holding, for each XML document, relationship information between the nodes in that document, and word information holding occurrence position information for each word used in the plurality of XML documents; a path index unit that stores the per-path information about the plurality of XML documents in the path information; a node management unit that analyzes the relationships between nodes in each XML document, including parent-child relationships, and stores the relationship information in the node information; a text index unit that stores each word used in any of the plurality of XML documents in the word information, in association with occurrence position information that includes the document ID of the XML document in which the word is used; and a ranking unit that, when a path and a search word are input, calculates, for the plurality of XML documents, the degree of conformity of the search word in the partial documents, which are the documents under the input path.
The word information stores, as the occurrence position information, the document ID of the XML document together with node position information indicating the node in which the word occurs. The text index unit updates the word information, including the node position information, before the path and the search word are input. After the path and the search word are input, the ranking unit refers to the path information, the node information, and the word information including the node position information, and calculates the degree of conformity for the plurality of XML documents based on the path and the search word.

According to this invention, the occurrence position information, which is the statistic for the search word, can be calculated and stored before the path and the search word are input, and using it allows the degree of conformity of XML documents to be calculated at high speed.

In the invention according to claims 4 and 9, in the information processing apparatus, the node management unit assigns to each node, as the relationship information between nodes, a start label value and an end label value such that the start label value of a parent node is smaller than that of its child node and the end label value of the parent node is larger than that of its child node, and stores them in the node information. The text index unit assigns the start label value and end label value from the node information as the node position information of each piece of word information. When a path and a search word are input, the ranking unit determines the range of the partial documents corresponding to the path based on the start and end label values of each node in the node information, and calculates the degree of conformity of the search word for those partial documents.

According to this invention, the range of a partial document can be determined from the start label values and end label values in the node information and the word information.
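The start/end labeling can be sketched with a single counter that is incremented on entering and on leaving each node during a depth-first traversal. This is one common way to satisfy the stated inequality, offered here as an assumption rather than as the numbering used in the patent; the names `assign_range_labels` and `contains` are hypothetical.

```python
import xml.etree.ElementTree as ET

def assign_range_labels(root):
    """Return (path, start, end) for every node, using one depth-first counter."""
    labels = []
    counter = 0

    def walk(elem, parent_path):
        nonlocal counter
        counter += 1                 # label taken when the node is entered
        start = counter
        path = f"{parent_path}/{elem.tag}"
        for child in elem:
            walk(child, path)
        counter += 1                 # label taken when the node is left
        labels.append((path, start, counter))

    walk(root, "")
    return labels

def contains(outer, inner):
    """True when the (start, end) pair `inner` lies strictly inside `outer`."""
    return outer[0] < inner[0] and inner[1] < outer[1]

# Hypothetical sample structure; every path is unique here, so a dict is safe.
root = ET.fromstring("<book><title/><chapter><section/></chapter></book>")
ranges = {path: (start, end) for path, start, end in assign_range_labels(root)}
```

With labels of this kind, deciding whether a word occurrence belongs to a partial document reduces to two integer comparisons, with no need to walk the tree at query time.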

In the invention according to claim 5, in the information processing apparatus, the storage unit further stores text information that holds the text data of each node having text data in the XML documents in association with identification information of that text data. The node management unit associates, for each node in the node information, the identification information of the text data under that node. When a path and a search word are input, the ranking unit determines the range of the partial documents corresponding to the path based on the identification information of the text data in the node information, and calculates the degree of conformity of the search word for those partial documents.

According to this invention, the range of a partial document can be determined from the identification information of the text data.

The invention according to claim 10 is a program that causes a computer to execute the method for calculating the conformity of XML documents according to any one of claims 1 to 5.

According to this invention, a computer can be made to execute the method for calculating the conformity of XML documents.

According to the present invention, processing can be performed at high speed when the degree of conformity is calculated for stored XML documents by specifying a path and a search word.

The best mode for carrying out the information processing apparatus, the method for calculating the conformity of XML documents, and the program according to the present invention (hereinafter called the embodiments) is described below with reference to the drawings as appropriate. Drawings other than the one referenced at each point are also referred to as needed.
Before that, to make the description easier to follow, a comparative example (the prior art) and terminology are explained with reference to FIGS. 21 to 24.

FIG. 21 shows an example of ranking using the statistical information of the comparative example: (a) shows example documents, (b) illustrates calculation example 1 of the degree of conformity (ranking), and (c) illustrates calculation example 2.
As shown in FIG. 21(a), three documents (documents 01 to 03) are searched for the word "patent". The documents here may or may not be XML documents. The number of occurrences of the search word in each of the three documents and the text length of each document are as illustrated.

FIG. 21(b) shows the result of ranking by the number of occurrences of the search word, and FIG. 21(c) shows the result of ranking by the frequency of the search word (number of occurrences / text length).
In the comparative example, an index (text index) is generally built that records such statistics (the number of occurrences and the frequency of each word) together with the positions where the word occurs (identifiers of the documents in which it occurs). At search time, the text index is used to locate the occurrences of the search word, which allows the score to be calculated quickly.
When the search string contains multiple words, a score can be calculated for each word and the scores combined by a predetermined formula.
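The difference between ranking by occurrence count and ranking by frequency can be seen in a small Python sketch. The counts and text lengths below are made-up values chosen for illustration; they are not the values shown in FIG. 21.

```python
# Hypothetical statistics for three documents (not the figures from FIG. 21).
docs = {
    "doc01": {"count": 3, "length": 100},
    "doc02": {"count": 5, "length": 500},
    "doc03": {"count": 1, "length": 10},
}

# Ranking by the number of occurrences of the search word.
by_count = sorted(docs, key=lambda d: docs[d]["count"], reverse=True)

# Ranking by frequency = number of occurrences / text length.
by_frequency = sorted(
    docs, key=lambda d: docs[d]["count"] / docs[d]["length"], reverse=True
)
```

In this made-up data the two orderings are exactly reversed, which shows how strongly the choice of statistic can affect the ranking.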

Next, the structure of an XML document, which is structured data, is described with reference to FIG. 22. In FIG. 22, (a) is a simplified example of the source code of an XML document, and (b) shows the structure (tree structure) of the XML document.
As illustrated in FIG. 22(a), the marks used in the source code of XML document 001 to identify the components of the document (such as "<book>") are called tags.

As shown in FIG. 22(b), an XML document has a tree structure similar to a directory structure, and each element (node) is expressed by a path (such as "/book/chapter"; the same notation is used below). The nodes are numbered hierarchically: the child nodes of node "n1" are "n11" to "n13", the child nodes of node "n12" are "n121" to "n124", and so on (the same applies to the other figures).

Because tags give each piece of content in an XML document a meaning (attribute) separate from the content itself, searching with a path in addition to a word makes it possible to extract just the required part from one large XML document. As noted above, a part of a document extracted from one XML document is called a partial document.
In the example of FIG. 22, XML document 001 represents the data of a book. To extract only the title of the book, specify the path "/book/title".

As shown in FIG. 22(b), XML document 001 stores the content of the book chapter by chapter under "/book". To extract the partial documents of XML document 001 that discuss the word "earth" in particular detail, chapter by chapter, search the "text" under "/book/chapter" for the word "earth". To examine the book section by section, which is finer than by chapter, search the "text" under "/book/chapter/section" for the word "earth". In either case, partial documents with higher scores rank higher.

Next, how the partial documents differ with the path specification (the content of the path) is described with reference to FIG. 23. FIGS. 23(a) and 23(b) show the ranges of the partial documents selected by each path for the structure of FIG. 22(b).
As shown in FIG. 23(a), if the path "/book/chapter" is specified, the nodes at and below each "chapter" (n12, n13, and so on) form the partial documents. As shown in FIG. 23(b), if the path "/book/chapter/section" is specified, the nodes at and below each "section" (n124 and so on) form the partial documents. The range of the partial documents thus depends on the content of the path.

Next, another example of how the partial documents differ with the path specification is described with reference to FIG. 24. FIGS. 24(a) and 24(b) show the ranges of the partial documents selected by each path for the example of FIG. 22(b).
As shown in FIG. 24(a), if the path "/book/title" is specified, only the one location shown by the broken line is a matching partial document. As shown in FIG. 24(b), if the path is specified as "any title", the three (or more) locations shown by broken lines are matching partial documents.
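The effect of the path on the set of partial documents can be reproduced with Python's standard `xml.etree.ElementTree` module, whose `findall` accepts a limited XPath subset. The sample document below is an assumption made for illustration and is not the document of FIG. 22; paths are written relative to the root element `<book>`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (not the one shown in FIG. 22).
SAMPLE = """<book>
  <title>Space</title>
  <chapter>
    <title>Earth</title>
    <section><title>Oceans</title><text>Water covers most of the surface.</text></section>
  </chapter>
  <chapter><title>Mars</title></chapter>
</book>"""

root = ET.fromstring(SAMPLE)  # root is the <book> element

chapters = root.findall("chapter")          # corresponds to /book/chapter
sections = root.findall("chapter/section")  # corresponds to /book/chapter/section
book_title = root.findall("title")          # corresponds to /book/title
any_title = root.findall(".//title")        # "any title" at any depth
```

Here "/book/title" selects one element while "any title" selects four, so the number of partial documents, and with it every statistic derived from them, changes with the path alone.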

Because the matching partial documents change with the path in the search query, the "statistics of the entire partial document set", "statistics of each partial document", and "statistics for the search word" also change every time the path changes. In the comparative example, these statistics therefore had to be calculated for every search, and the cost of calculating them grew as the number of stored XML documents increased.

Next, the information processing apparatus, the method for calculating the conformity of XML documents, and the program of each embodiment of the present invention are described with reference to FIGS. 1 to 20.
First, BM25 (see "3.3.1 Okapi" in Non-Patent Document 1), the score formula used in the ranking method of the embodiments, is described. With BM25, the score S_qki of word i in document k is given by formula (1), which is not reproduced in this text.

Here, the symbols have the following meanings.
L: average length over the entire (partial) document set (a statistic of the entire partial document set; path statistical information)
N: total number of (partial) documents (a statistic of the entire partial document set; path statistical information)
l_k: length of (partial) document k (a statistic of the partial document; node statistical information)
tf_ki: number of occurrences of word i in (partial) document k (a word statistic)
n_i: number of (partial) documents containing word i (a word statistic)
l_q: length of the search string q
tf_qi: number of occurrences of search word i in the search string q
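Since formula (1) itself does not appear in this text, the following Python sketch uses the widely cited textbook form of Okapi BM25 instead. It should be read as an assumption, not as the patent's exact formula: the constants `k1`, `b`, and `k3` and their default values are conventional choices, the function name `bm25_term_score` is hypothetical, and `l_q` is not used in this form.

```python
import math

def bm25_term_score(tf_ki, n_i, l_k, N, L, tf_qi=1, k1=1.2, b=0.75, k3=7.0):
    """Textbook Okapi BM25 score of one word i for one (partial) document k.

    tf_ki: occurrences of word i in document k
    n_i:   number of documents containing word i
    l_k:   length of document k
    N:     total number of documents
    L:     average document length
    tf_qi: occurrences of word i in the search string
    k1, b, k3: conventional tuning constants (assumed values)
    """
    idf = math.log((N - n_i + 0.5) / (n_i + 0.5))
    doc_part = tf_ki * (k1 + 1) / (tf_ki + k1 * (1 - b + b * l_k / L))
    query_part = tf_qi * (k3 + 1) / (tf_qi + k3)
    return idf * doc_part * query_part

# Same term count in a short and a long partial document (made-up numbers).
short_doc = bm25_term_score(tf_ki=2, n_i=1, l_k=5, N=10, L=10)
long_doc = bm25_term_score(tf_ki=2, n_i=1, l_k=20, N=10, L=10)
```

The inputs `N` and `L` depend only on the set of partial documents selected by the path, and `l_k` only on the individual partial document, which is why those values are candidates for being stored ahead of the query.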

(First embodiment)
FIG. 1 is a configuration diagram of the information processing apparatus according to the first embodiment of the present invention. As shown in FIG. 1, the information processing apparatus 1 of the first embodiment is a computer and includes an input unit 2, an output unit 3, a memory 4, a storage unit 5, and a processing unit 6.

The input unit 2 is used to input data and is, for example, a keyboard, a mouse, or a communication interface. A user of the information processing apparatus 1 (hereinafter simply called the "user") can use the input unit 2 to input XML documents and search queries (a path and a search string). In this embodiment, the user is assumed to know the structure and content of the XML documents to some extent and to specify a path as well as a character string in the search query.

The output unit 3 outputs data and is, for example, a display or a speaker. The output unit 3 outputs, among other things, the ranking results of searches over the XML documents.
The memory 4 is the work area of the processing unit 6 and is, for example, a RAM (Random Access Memory).

記憶部５は、データを記憶するものであり、たとえば、ハードディスクである。記憶部５は、たとえば、１件のデータを複数の項目(フィールド)の集合として表現して、データの集合を表（テーブル）で表す、いわゆるリレーショナルデータベースである。記憶部５は、入力部２から入力されたＸＭＬ文書を格納するＸＭＬ文書群５１、パステーブルＰＴ（Path Table。パス情報。詳細は図４（ｂ）で後記）、ノードテーブルＮＴ（Node Table。ノード情報。詳細は図４（ａ）で後記）および転置表ＩＴ（Inverted Table。単語情報。詳細は図５で後記）を記憶している。
また、記憶部５は、図示を省略しているが、後記するＸＭＬ文書の適合度の算出方法を記述したプログラムを記憶している。 The storage unit 5 stores data and is, for example, a hard disk. The storage unit 5 is, for example, a so-called relational database in which one piece of data is expressed as a set of a plurality of items (fields) and the set of data is expressed as a table. The storage unit 5 stores an XML document group 51 holding the XML documents input from the input unit 2, a path table PT (Path Table; path information; details are described later with reference to FIG. 4B), a node table NT (Node Table; node information; details are described later with reference to FIG. 4A), and an inverted table IT (Inverted Table; word information; details are described later with reference to FIG. 5).
Although not shown in the figure, the storage unit 5 stores a program describing a method for calculating the conformity of an XML document to be described later.

処理部６は、各種演算処理を行うものであり、たとえば、ＣＰＵ（Central Processing Unit）である。処理部６は、その機能として、データ格納部６１、パスインデックス部６２、範囲ラベル部６３（ノード管理部）、テキストインデックス部６４およびランキング部６５を備えている。なお、以下において、処理部６がこれらの機能以外の機能を果たす場合は、動作主体を処理部６として記載する。 The processing unit 6 performs various arithmetic processes, and is, for example, a CPU (Central Processing Unit). The processing unit 6 includes a data storage unit 61, a path index unit 62, a range label unit 63 (node management unit), a text index unit 64, and a ranking unit 65 as its functions. In the following, when the processing unit 6 fulfills functions other than these functions, the operation subject is described as the processing unit 6.

データ格納部６１は、入力部２から入力されたＸＭＬ文書をＸＭＬ文書群５１に格納する。
パスインデックス部６２は、入力部２から入力されたＸＭＬ文書の情報に基づき、パステーブルＰＴ（図４（ｂ）参照）において、各パスの統計量（部分文書集合全体の平均長さ（Ｌ）、部分文書数（Ｎ））を更新する。 The data storage unit 61 stores the XML document input from the input unit 2 in the XML document group 51.
Based on the information of the XML document input from the input unit 2, the path index unit 62 updates the statistics of each path (the average length (L) of the entire partial document set and the number of partial documents (N)) in the path table PT (see FIG. 4B).

範囲ラベル部６３は、各ノードに範囲ラベル（開始ラベル（値）「pre」と終了ラベル（値）「post」の２値のＩＤ（IDentification）。関係情報。出現位置情報。ノード位置情報）を付与し、ノードテーブルＮＴ（図４（ａ）参照）に記録する。各ノードに対して、子ノードの範囲ラベルが親ノードの範囲ラベルの「pre」と「post」の間の値になるようにラベル付けすることで、各ノード間の上下（親子）関係がわかる（特許文献１参照）。 The range label unit 63 assigns to each node a range label (a two-value ID (IDentification) consisting of a start label (value) “pre” and an end label (value) “post”; relation information; appearance position information; node position information) and records it in the node table NT (see FIG. 4A). By labeling each node so that the range label of a child node lies between “pre” and “post” of the range label of its parent node, the vertical (parent-child) relationship between nodes can be determined (see Patent Document 1).

ここで、図２および図３を参照しながら、範囲ラベルについて説明する。図２は、ＸＭＬ文書のソースコードの例を示した図であり、（ａ）が図２２と同様のＸＭＬ文書００１に関する図であり、（ｂ）がその他の例としてのＸＭＬ文書００２に関する図である。図２の（ａ）と（ｂ）に示すように、いずれのＸＭＬ文書も本（book）に関するデータである。 Here, the range label will be described with reference to FIGS. 2 and 3. FIG. 2 shows examples of the source code of XML documents: (a) relates to an XML document 001 similar to that of FIG. 22, and (b) relates to an XML document 002 as another example. As shown in FIGS. 2A and 2B, both XML documents are data relating to a book.

図３は、（ａ）がＸＭＬ文書００１に対して範囲ラベルを付与した状態を示す図であり、（ｂ）がＸＭＬ文書００２に対して範囲ラベルを付与した状態を示す図である。
図３（ａ）に示すように、ＸＭＬ文書００１において、ノードｎ１（book）は範囲ラベルが（１，９９）（開始ラベル「pre」が「１」で、終了ラベル「post」が「９９」。以下同様）で、範囲ラベルが（２，５）のノードｎ１１の親であることがわかる。また、ノードｎ１１は範囲ラベルが（２，５）で、範囲ラベルが（６，４７）のノードｎ１２とは上下（親子）関係にないことがわかる。図３（ｂ）に示したＸＭＬ文書００２についても、同様に、各ノードに対して範囲ラベルが付与されている。 FIG. 3A is a diagram showing a state where a range label is assigned to the XML document 001, and FIG. 3B is a diagram showing a state where a range label is assigned to the XML document 002.
As shown in FIG. 3A, in the XML document 001, the node n1 (book) has the range label (1, 99) (the start label “pre” is “1” and the end label “post” is “99”; the same notation applies hereinafter), so it can be seen that it is the parent of the node n11, whose range label is (2, 5). It can also be seen that the node n11, with the range label (2, 5), has no vertical (parent-child) relationship with the node n12, whose range label is (6, 47). Range labels are likewise assigned to each node of the XML document 002 shown in FIG. 3B.
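The containment rule for range labels can be illustrated with a minimal Python sketch. The names `RangeLabel` and `is_ancestor` are hypothetical and do not appear in the specification; the label values are the ones quoted above for nodes n1, n11 and n12. Note that interval containment by itself identifies an ancestor-descendant relationship, of which the parent-child relationship is a special case.

```python
from typing import NamedTuple


class RangeLabel(NamedTuple):
    """A (pre, post) range label; a hypothetical helper type for illustration."""
    pre: int
    post: int


def is_ancestor(a: RangeLabel, d: RangeLabel) -> bool:
    """Return True if the range of `d` lies strictly inside the range of `a`."""
    return a.pre < d.pre and d.post < a.post


# Label values quoted in the text for XML document 001.
n1 = RangeLabel(1, 99)    # book
n11 = RangeLabel(2, 5)
n12 = RangeLabel(6, 47)

print(is_ancestor(n1, n11))   # n1 encloses n11
print(is_ancestor(n11, n12))  # n11 and n12 do not enclose each other
```

The check needs only two integer comparisons per pair of nodes, which is presumably why the labels are stored directly in the node table.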

図１に戻って、テキストインデックス部６４は、入力部２から入力されたＸＭＬ文書に出現する全ての単語の出現位置（文書ＩＤと範囲ラベル）を、転置表ＩＴに記録する。
ランキング部６５は、検索時に、パステーブルＰＴ、ノードテーブルＮＴおよび転置表ＩＴから統計情報などを取り出し、各部分文書ごとのスコアを算出する。 Returning to FIG. 1, the text index unit 64 records the appearance positions (document IDs and range labels) of all words appearing in the XML document input from the input unit 2 in the transposition table IT.
The ranking unit 65 extracts statistical information and the like from the path table PT, the node table NT, and the transposition table IT at the time of search, and calculates a score for each partial document.

次に、図４を参照しながら、ノードテーブルとパステーブルについて説明する。図４は、（ａ）がノードテーブル、（ｂ）がパステーブルを例示した図である。
図４（ａ）において、（ａ１）はＸＭＬ文書００１に関するノードテーブル００１ＮＴであり、（ａ２）はＸＭＬ文書００２に関するノードテーブル００２ＮＴである。いずれのノードテーブルＮＴも、左から順に、ＸＭＬ文書の識別子を表す文書ＩＤ（docid）、範囲ラベル（「pre」と「post」）、各ノードに与えられているタグ（tag）、パスの識別子を表すパスＩＤ（pathid）、そのノードがパス指定されたときの部分文書の長さを表すｌｋ（ｌ_ｋ）、および、そのノードのテキストデータであるテキスト（text）のカラムから構成されている。 Next, the node table and the path table will be described with reference to FIG. 4A is a diagram illustrating a node table, and FIG. 4B is a diagram illustrating a path table.
In FIG. 4A, (a1) is a node table 001NT related to the XML document 001, and (a2) is a node table 002NT related to the XML document 002. Each node table NT consists of the following columns, in order from the left: a document ID (docid) representing the identifier of the XML document, a range label (“pre” and “post”), the tag (tag) given to each node, a path ID (pathid) representing the identifier of the path, lk (l_k) representing the length of the partial document when that node is specified by a path, and text (text), which is the text data of the node.

図４（ｂ）に示すように、パステーブルＰＴは、左から順に、パスＩＤ（pathid）、パス（pathexp）、Ｌ（部分文書集合全体の平均長さ）、Ｎ（部分文書の総数）のカラムから構成されている。 As shown in FIG. 4B, the path table PT includes, in order from the left, path ID (pathid), path (pathexp), L (average length of the entire partial document set), and N (total number of partial documents). It consists of columns.

続いて、図５を参照しながら、転置表について説明する。図５は、転置表の例を示した図である。
図５に示すように、転置表ＩＴは、単語（term）と出現位置（position）のカラムから構成されている。出現位置の()内は、左から順に、文書ＩＤ（docid）、開始ラベル（pre）、終了ラベル（post）を意味している。たとえば、転置表ＩＴにおいて、単語「宇宙」に対応する出現位置が（００１，３，４）であれば、文書ＩＤ（docid）が「００１」であるＸＭＬ文書００１における開始ラベル（pre）が「３」で終了ラベル（post）が「４」のノード、つまり、図３（ａ）におけるノードｎ１１１に単語「宇宙」が存在していることがわかる。 Next, the transposition table will be described with reference to FIG. FIG. 5 is a diagram showing an example of a transposition table.
As shown in FIG. 5, the transposition table IT is composed of columns for the word (term) and the appearance position (position). The values in parentheses in the appearance position are, in order from the left, the document ID (docid), the start label (pre), and the end label (post). For example, if the appearance position corresponding to the word “universe” in the transposition table IT is (001, 3, 4), it can be seen that the word “universe” is present at the node whose start label (pre) is “3” and whose end label (post) is “4” in the XML document 001 whose document ID (docid) is “001”, that is, at the node n111 in FIG. 3A.

次に、図６を参照しながら、構造インデックス（パステーブルＰＴおよびノードテーブルＮＴ）の構築処理について説明する。図６は、構造インデックスの構築処理を示すフローチャートである。
まず、使用者が、入力部２を介して、新たに蓄積したいＸＭＬ文書を情報処理装置１に投入する。そうすると、処理部６のデータ格納部６１が記憶部５のＸＭＬ文書群５１にそのＸＭＬ文書を格納し、また、そのとき、パスインデックス部６２と範囲ラベル部６３が以下の処理により、構造インデックスを構築する。 Next, the construction processing of the structure index (path table PT and node table NT) will be described with reference to FIG. FIG. 6 is a flowchart showing a structure index construction process.
First, a user inputs an XML document to be newly accumulated into the information processing apparatus 1 via the input unit 2. Then, the data storage unit 61 of the processing unit 6 stores the XML document in the XML document group 51 of the storage unit 5, and at that time, the path index unit 62 and the range label unit 63 construct the structure index by the following processing.

投入されたＸＭＬ文書に関して、パスインデックス部６２は、１つのパスを取り出す（ステップＳ６０１）。パスインデックス部６２は、パステーブルＰＴを参照し、そのパスがすでにパステーブルＰＴに含まれている（存在している）か否かを判断する（ステップＳ６０２）。 For the input XML document, the path index unit 62 takes out one path (step S601). The path index unit 62 refers to the path table PT and determines whether or not the path is already included (exists) in the path table PT (step S602).

パスがパステーブルＰＴに含まれていない場合（ステップＳ６０２でＮｏ）、パスインデックス部６２は、新しいパスＩＤを発行し、新たにそのパスをパステーブルＰＴに加え（ステップＳ６０３）、さらに、そのパスに関するＬとＮの値（部分文書集合の統計量）を計算する、すなわち、そのパスの配下の連結テキスト長をＬの値とし、Ｎの値を「１」として、それぞれ、パステーブルＰＴの該当箇所に格納する（ステップＳ６０４）。 If the path is not included in the path table PT (No in step S602), the path index unit 62 issues a new path ID, adds the path to the path table PT (step S603), and then calculates the L and N values (statistics of the partial document set) for the path; that is, it sets L to the length of the concatenated text under the path and N to “1”, and stores each in the corresponding location of the path table PT (step S604).

パスがすでにパステーブルＰＴに含まれている場合（ステップＳ６０２でＹｅｓ）、パスインデックス部６２は、そのパスに関するＬとＮの値（部分文書集合の統計量）を計算する、すなわち、そのパスの配下の連結テキスト長の平均をＬの値とし、Ｎの値をインクリメント（１つ増加）して、それぞれ、パステーブルＰＴの該当箇所に格納する（ステップＳ６０４）。 If the path is already included in the path table PT (Yes in step S602), the path index unit 62 calculates the L and N values (statistics of the partial document set) for the path; that is, it sets L to the average length of the concatenated text under the path, increments N (increases it by one), and stores each in the corresponding location of the path table PT (step S604).

その後、パスインデックス部６２は、そのＸＭＬ文書に関する全てのパス分の処理を終了したか否かを判断し（ステップＳ６０５）、終了していなければ（Ｎｏ）ステップＳ６０１に戻って処理を繰り返し、終了していれば（Ｙｅｓ）ステップＳ６０６に進む。 Thereafter, the path index unit 62 determines whether or not the processing for all the paths related to the XML document has been completed (step S605), and if not completed (No), the process returns to step S601 and repeats the processing. If yes (Yes), the process proceeds to step S606.

このようにして、ステップＳ６０１〜Ｓ６０５の処理により、検索に必要な３つの統計量（「部分文書集合全体の統計量」、「部分文書ごとの統計量」および「検索単語に対する統計量」）のうち、「部分文書集合全体の統計量」（パステーブルＰＴの「Ｌ」と「Ｎ」）をＸＭＬ文書の投入直後（検索クエリの入力前）に算出および格納することができる。そして、これにより、検索（適合度の算出）時に「部分文書集合全体の統計量」を算出する必要がなくなり、検索処理を高速化することができるようになる。 In this way, of the three statistics required for the search (“statistics for the entire partial document set”, “statistics for each partial document”, and “statistics for the search word”), the processing in steps S601 to S605 allows the “statistics for the entire partial document set” (“L” and “N” in the path table PT) to be calculated and stored immediately after the XML document is input (before the search query is input). As a result, there is no need to calculate the “statistics for the entire partial document set” at search time (when the fitness is calculated), and the search process can be sped up.
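The path table update of steps S602 to S604 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the names `PathStats` and `update_path_table` are hypothetical, path IDs are issued as simple sequence numbers, and the average is maintained incrementally from the stored L and N, which is one possible way to obtain "the average of the concatenated text lengths".

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PathStats:
    """One row of a path table: hypothetical stand-in for (pathid, L, N)."""
    path_id: int
    avg_len: float  # L: average length of the partial documents under the path
    count: int      # N: number of partial documents under the path


def update_path_table(table: Dict[str, PathStats], path: str, text_len: int) -> PathStats:
    """Register one partial document of length `text_len` under `path`."""
    entry = table.get(path)
    if entry is None:
        # Path not yet in the table: issue a new ID, set L to the length, N to 1.
        entry = PathStats(path_id=len(table) + 1, avg_len=float(text_len), count=1)
        table[path] = entry
    else:
        # Path already present: increment N and recompute the running average L.
        total = entry.avg_len * entry.count + text_len
        entry.count += 1
        entry.avg_len = total / entry.count
    return entry


table: Dict[str, PathStats] = {}
update_path_table(table, "/book/title", 10)
update_path_table(table, "/book/title", 20)
print(table["/book/title"])
```

Because L and N are kept current at insertion time, a later query only needs to read one row per matching path.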

続いて、投入されたＸＭＬ文書に関し、範囲ラベル部６３は、１つのノードを取り出す（ステップＳ６０６）。
その後、範囲ラベル部６３は、パステーブルＰＴを参照し、そのノードに対応するパスＩＤを取り出し（ステップＳ６０７）、前記した規則性にしたがって範囲ラベルを付与し（ステップＳ６０８）、ノードテーブルＮＴに各値（カラムの情報）を格納する（ステップＳ６０９）。 Subsequently, regarding the input XML document, the range label unit 63 takes out one node (step S606).
Thereafter, the range label unit 63 refers to the path table PT, extracts the path ID corresponding to the node (step S607), assigns a range label according to the rule described above (step S608), and stores each value (column information) in the node table NT (step S609).

つまり、ステップＳ６０９において、範囲ラベル部６３は、文書ＩＤ（docid）、範囲ラベル（「pre」と「post」）、タグ（tag）およびパスＩＤ（pathid）だけでなく、そのノード配下の部分文書の長さを表すｌｋと、そのノードのテキストデータであるテキスト（text）に関する情報もノードテーブルＮＴに格納する。 That is, in step S609, the range label unit 63 stores in the node table NT not only the document ID (docid), the range label (“pre” and “post”), the tag (tag), and the path ID (pathid), but also lk, which represents the length of the partial document under the node, and the text (text), which is the text data of the node.

その後、範囲ラベル部６３は、そのＸＭＬ文書に関する全てのノード分の処理を終了したか否かを判断し（ステップＳ６１０）、終了していなければ（Ｎｏ）ステップＳ６０６に戻って処理を繰り返し、終了していれば（Ｙｅｓ）ステップＳ６１１に進む。 Thereafter, the range label unit 63 determines whether or not the processing for all the nodes related to the XML document has been completed (step S610), and if not completed (No), the processing returns to step S606 and repeats the processing. If yes (Yes), the process proceeds to step S611.

このようにして、ステップＳ６０６〜Ｓ６１０の処理により、検索に必要な３つの統計量（「部分文書集合全体の統計量」、「部分文書ごとの統計量」および「検索単語に対する統計量」）のうち、「部分文書ごとの統計量」（ノードテーブルＮＴの「ｌｋ」）をＸＭＬ文書の投入直後（検索クエリの入力前）に算出および格納することができる。そして、これにより、検索（適合度の算出）時に「部分文書ごとの統計量」を算出する必要がなくなり、検索処理を高速化することができる。 In this way, of the three statistics required for the search (“statistics for the entire partial document set”, “statistics for each partial document”, and “statistics for the search word”), the processing in steps S606 to S610 allows the “statistics for each partial document” (“lk” in the node table NT) to be calculated and stored immediately after the XML document is input (before the search query is input). As a result, there is no need to calculate the “statistics for each partial document” at search time (when the fitness is calculated), and the search process can be sped up.
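The node table construction of steps S606 to S610 can be sketched with a depth-first traversal. The following is only an illustration under several assumptions that are not taken from the specification: the function name `label_nodes` is hypothetical, labels are drawn from a single consecutive counter (the figures use values such as (1, 99), so the actual numbering scheme may differ), only element nodes are labelled (the embodiment also labels text nodes), the path string is stored in place of a path ID, and l_k is taken to be the character count of the concatenated text under the node.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def label_nodes(xml_text: str, doc_id: str) -> List[Dict]:
    """Return node-table-like rows with (pre, post) labels for every element."""
    root = ET.fromstring(xml_text)
    rows: List[Dict] = []
    counter = 0

    def visit(elem: ET.Element, path: str) -> None:
        nonlocal counter
        counter += 1                      # entering the node: assign "pre"
        row = {"docid": doc_id, "pre": counter, "post": None,
               "tag": elem.tag, "path": path, "l_k": 0}
        rows.append(row)
        for child in elem:
            visit(child, f"{path}/{child.tag}")
        counter += 1                      # leaving the node: assign "post"
        row["post"] = counter
        # Length of all text under this node (assumed definition of l_k).
        row["l_k"] = len("".join(elem.itertext()))

    visit(root, "/" + root.tag)
    return rows


sample = "<book><title>abc</title><chapter><text>de</text></chapter></book>"
for r in label_nodes(sample, "001"):
    print(r)
```

Every child receives its labels between the "pre" and "post" assignments of its parent, so the containment property described for range labels holds by construction.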

処理部６は、投入した全てのＸＭＬ文書分の処理を終了したか否かを判断し（ステップＳ６１１）、終了していなければ（Ｎｏ）ステップＳ６０１に戻って処理を繰り返し、終了していれば（Ｙｅｓ）処理を終了する。
このようにして、図６のフローチャートの処理によれば、投入した全てのＸＭＬ文書に関して、その投入直後に、「部分文書集合全体の統計量」および「部分文書ごとの統計量」を算出および格納することができる。 The processing unit 6 determines whether or not the processing for all the input XML documents has been completed (step S611); if not (No), the processing returns to step S601 and repeats, and if so (Yes), the processing ends.
In this manner, according to the processing of the flowchart of FIG. 6, the “statistics for the entire partial document set” and the “statistics for each partial document” can be calculated and stored for every input XML document immediately after it is input.

なお、図６のフローチャートでは、説明を簡単にするため、パスインデックス部６２による処理と範囲ラベル部６３による処理を分離したが、それらの処理を並列的に行うようにしてもよい。 In the flowchart of FIG. 6, for the sake of simplicity, the processing by the path index unit 62 and the processing by the range label unit 63 are separated. However, these processing may be performed in parallel.

続いて、図７を参照しながら、テキストインデックス（転置表ＩＴ）の構築処理について説明する。図７は、テキストインデックスの構築処理を示すフローチャートである。
図６に示したフローチャートの処理によって、投入されたＸＭＬ文書に関して、パステーブルＰＴとノードテーブルＮＴが更新された後、処理部６のテキストインデックス部６４は、ノードテーブルＮＴから１レコード（１ノード分のデータ）を取り出す（ステップＳ７０１）。 Next, the text index (transposition table IT) construction process will be described with reference to FIG. FIG. 7 is a flowchart showing text index construction processing.
After the path table PT and the node table NT have been updated for the input XML document by the processing of the flowchart shown in FIG. 6, the text index unit 64 of the processing unit 6 extracts one record (the data for one node) from the node table NT (step S701).

続いて、テキストインデックス部６４は、取り出したレコードにおける記述内容（図４（ａ）のノードテーブルＮＴの「text」のカラムのデータ）に関して、形態素解析（計算機を用いた自然言語処理の基礎技術の１つ）の手法を用いて単語に分ける（ステップＳ７０２）。 Subsequently, the text index unit 64 divides the description content of the extracted record (the data in the “text” column of the node table NT in FIG. 4A) into words using morphological analysis (one of the basic techniques of computer-based natural language processing) (step S702).

その後、テキストインデックス部６４は、分けられたうちの１つの単語が転置表ＩＴに含まれているか否かを判断する（ステップＳ７０３）。
その単語が転置表ＩＴに含まれていなかった場合（ステップＳ７０３でＮｏ）、テキストインデックス部６４は、その単語を新たに転置表ＩＴに登録し（ステップＳ７０４）、ステップＳ７０５に進む。その単語が転置表ＩＴに含まれていた場合（ステップＳ７０３でＹｅｓ）、テキストインデックス部６４は、そのままステップＳ７０５に進む。 Thereafter, the text index unit 64 determines whether or not one of the divided words is included in the transposition table IT (step S703).
If the word is not included in the transposition table IT (No in step S703), the text index unit 64 newly registers the word in the transposition table IT (step S704), and proceeds to step S705. If the word is included in the transposition table IT (Yes in step S703), the text index unit 64 proceeds to step S705 as it is.

ステップＳ７０５において、テキストインデックス部６４は、その単語の出現位置である文書ＩＤ（docid）と範囲ラベル（「pre」と「post」）を転置表ＩＴに格納する。たとえば、単語「宇宙」がＸＭＬ文書００１のノードｎ１１１に存在していれば（図３（ａ）参照）、ＸＭＬ文書００１の文書ＩＤ「００１」、ノードｎ１１１の開始ラベル（pre）「３」および終了ラベル（post）「４」を表す（００１，３，４）を、転置表ＩＴの「宇宙」に対応する「position」のカラムに格納する。 In step S705, the text index unit 64 stores the document ID (docid) and the range labels (“pre” and “post”) that are the appearance positions of the word in the transposition table IT. For example, if the word “universe” exists in the node n111 of the XML document 001 (see FIG. 3A), the document ID “001” of the XML document 001, the start label (pre) “3” of the node n111, and (001, 3, 4) representing the end label (post) “4” is stored in the “position” column corresponding to “universe” in the transposition table IT.

その後、テキストインデックス部６４は、そのレコードに関する全ての単語分の処理を終了したか否かを判断し（ステップＳ７０６）、終了していなければ（Ｎｏ）ステップＳ７０３に戻って処理を繰り返し、終了していれば（Ｙｅｓ）ステップＳ７０７に進む。
また、テキストインデックス部６４は、そのＸＭＬ文書に関する全てのレコード分の処理を終了したか否かを判断し（ステップＳ７０７）、終了していなければ（Ｎｏ）ステップＳ７０１に戻って処理を繰り返し、終了していれば（Ｙｅｓ）処理を終了する。 Thereafter, the text index unit 64 determines whether or not the processing for all the words in the record has been completed (step S706); if not (No), the process returns to step S703 and repeats, and if so (Yes), the process proceeds to step S707.
The text index unit 64 also determines whether or not the processing for all the records of the XML document has been completed (step S707); if not (No), the process returns to step S701 and repeats, and if so (Yes), the process ends.

このようにして、図７に示したフローチャートの処理により、検索に必要な３つの統計量（「部分文書集合全体の統計量」、「部分文書ごとの統計量」および「検索単語に対する統計量」）のうち、「検索単語に対する統計量」（転置表ＩＴの「position」）をＸＭＬ文書の投入直後（検索クエリの入力前）に算出および格納することができる。そして、これにより、検索（適合度の算出）時に「検索単語に対する統計量」を算出する必要がなくなり、検索処理を高速化することができる。 In this way, of the three statistics required for the search (“statistics for the entire partial document set”, “statistics for each partial document”, and “statistics for the search word”), the processing of the flowchart shown in FIG. 7 allows the “statistics for the search word” (“position” in the transposition table IT) to be calculated and stored immediately after the XML document is input (before the search query is input). As a result, there is no need to calculate the “statistics for the search word” at search time (when the fitness is calculated), and the search process can be sped up.
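The text index construction of FIG. 7 can be sketched as below. This is a simplified illustration with assumptions labelled: the function name `build_inverted_table` is hypothetical, whitespace splitting stands in for the morphological analysis of step S702, and the sample texts are invented (only the positions (001, 3, 4) for "universe" and (001, 14, 15) for "earth" follow values quoted in the description).

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Position = Tuple[str, int, int]  # (docid, pre, post)


def build_inverted_table(
    node_rows: Iterable[Dict],
    tokenize: Callable[[str], List[str]] = str.split,  # stand-in for morphological analysis
) -> Dict[str, List[Position]]:
    """Map each term to the list of node positions where it occurs."""
    table: Dict[str, List[Position]] = defaultdict(list)
    for row in node_rows:                      # one node table record at a time (S701)
        for term in tokenize(row.get("text") or ""):   # split the text into words (S702)
            # Register the term if new (S703, S704) and append its position (S705).
            table[term].append((row["docid"], row["pre"], row["post"]))
    return dict(table)


# Hypothetical node table records; the text values are invented for illustration.
rows = [
    {"docid": "001", "pre": 3, "post": 4, "text": "universe travel"},
    {"docid": "001", "pre": 14, "post": 15, "text": "earth orbit"},
]
inverted = build_inverted_table(rows)
print(inverted["universe"])
print(inverted["earth"])
```

A term that occurs twice in the same node is recorded twice in this sketch, so the per-document term frequency can later be obtained by counting positions.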

次に、図８を参照しながら、ランキング処理について説明する。図８は、ランキング処理を示すフローチャートである。
記憶部５のＸＭＬ文書群５１に蓄積された複数のＸＭＬ文書に関して、検索によるランキングを行う場合、まず、使用者が入力部２を用いて検索クエリ（パスと検索文字列）を入力する。 Next, the ranking process will be described with reference to FIG. FIG. 8 is a flowchart showing the ranking process.
When ranking by search is performed on a plurality of XML documents stored in the XML document group 51 of the storage unit 5, first, the user inputs a search query (path and search character string) using the input unit 2.

そうすると、処理部６のランキング部６５は、検索クエリからパス、検索単語（検索文字列から抽出）、ｌ_ｑ（検索文字列ｑの長さ）、および、ｔｆ_ｑｉ（検索文字列ｑ中の検索単語ｉの出現回数）を取り出す（算出する）（ステップＳ８０１）。
続いて、ランキング部６５は、パステーブルＰＴを参照し、検索クエリ中のパスに対応するレコードにおける「Ｌ」、「Ｎ」およびパスＩＤ（pathid）の値を取り出す（ステップＳ８０２）。 Then, the ranking unit 65 of the processing unit 6 extracts (calculates) from the search query the path, the search words (extracted from the search character string), l_q (the length of the search character string q), and tf_qi (the number of occurrences of the search word i in the search character string q) (step S801).
Subsequently, the ranking unit 65 refers to the path table PT and extracts the values of “L”, “N” and the path ID (pathid) in the record corresponding to the path in the search query (step S802).

ランキング部６５は、ノードテーブルＮＴを参照し、ステップＳ８０２で取り出したパスＩＤ（pathid）に対応するレコードにおける部分文書の位置（文書ＩＤ（docid）と範囲ラベル（「pre」と「post」））および「ｌｋ」の値を取り出す（ステップＳ８０３）。なお、取り出した部分文書の集まりを部分文書群と呼ぶ。 The ranking unit 65 refers to the node table NT and, from the records corresponding to the path ID (pathid) extracted in step S802, extracts the positions of the partial documents (document ID (docid) and range label (“pre” and “post”)) and the values of “lk” (step S803). The collection of extracted partial documents is called a partial document group.

ランキング部６５は、転置表ＩＴから、検索単語の出現位置（文書ＩＤ（docid）、範囲ラベル（「pre」と「post」））の値を取り出す（ステップＳ８０４）。なお、ここで取り出された出現位置のノードの集まりを単語出現文書群と呼ぶ。 The ranking unit 65 extracts the values of the search word appearance position (document ID (docid), range labels (“pre” and “post”)) from the transposition table IT (step S804). The collection of nodes at the appearance positions extracted here is called a word appearance document group.

続いて、ランキング部６５は、転置表ＩＴから取り出した出現位置の値を用いて検索単語の出現する部分文書を絞り、ｔｆ_ｋｉ（部分文書ｋ中の単語ｉの出現回数）とｎ_ｉ（該当する単語ｉを含む部分文書数）を算出する（ステップＳ８０５）。すなわち、ランキング部６５は、部分文書群と単語出現文書群の位置情報を用いて、部分文書群から検索単語の含まれる部分文書を選別し、各部分文書における単語の統計量であるｔｆ_ｋｉとｎ_ｉを算出する。ここで、ステップＳ８０５の具体例について、図９を参照しながら説明する。 Subsequently, the ranking unit 65 uses the appearance positions extracted from the transposition table IT to narrow down the partial documents in which the search word appears, and calculates tf_ki (the number of occurrences of the word i in the partial document k) and n_i (the number of partial documents containing the word i) (step S805). That is, the ranking unit 65 uses the position information of the partial document group and the word appearance document group to select, from the partial document group, the partial documents that contain the search word, and calculates tf_ki and n_i, which are the word statistics for each partial document. A specific example of step S805 is described below with reference to FIG. 9.

図９は、ステップＳ８０５の具体例、すなわち、部分文書の選別と単語の統計量の算出の例を示した図である。なお、この図９における例は、図２〜図５の具体例とは関係ない独立した例である。
図９（ａ）に示した部分文書群と図９（ｂ）に示した単語出現文書群に基づき、たとえば、まず、部分文書群のおける部分文書（００１，１０，２０）（図示した４つのうち最上位の部分文書）に含まれる単語出現文書が単語出現文書群にあるか探す。つまり、単語出現文書群において、文書ＩＤが「００１」で、範囲ラベルが「１０」〜「２０」の間に含まれている単語出現文書を探せばよい。ここでは、単語出現文書として、（００１，１２，１３）と（００１，１５，１６）が該当する。 FIG. 9 is a diagram showing a specific example of step S805, that is, an example of selection of partial documents and calculation of word statistics. The example in FIG. 9 is an independent example that is not related to the specific examples in FIGS.
Based on the partial document group shown in FIG. 9A and the word appearance document group shown in FIG. 9B, for example, the word appearance document group is first searched for word appearance documents contained in the partial document (001, 10, 20) of the partial document group (the topmost of the four partial documents shown). That is, it suffices to search the word appearance document group for word appearance documents whose document ID is “001” and whose range label falls between “10” and “20”. Here, (001, 12, 13) and (001, 15, 16) are the matching word appearance documents.

以下、同様にして、図９（ｃ）に示すように、部分文書が３つに絞られ（文書ＩＤが「００３」の文書は検索単語を１つも含まないため、はずれている）、統計量のｔｆ_ｋｉ（部分文書ｋ中の単語ｉの出現回数）は上から「２」、「１」および「１」であり、そして、ｎ_ｉ（該当する単語ｉを含む部分文書数）は「３」であると算出することができる。 Proceeding in the same way, as shown in FIG. 9C, the partial documents are narrowed down to three (the document whose document ID is “003” is excluded because it contains no search word), the statistic tf_ki (the number of occurrences of the word i in the partial document k) is calculated as “2”, “1”, and “1” from the top, and n_i (the number of partial documents containing the word i) is calculated as “3”.
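The selection and counting of step S805 can be sketched as follows. The function name `word_statistics` is hypothetical. Of the sample data, only the partial document (001, 10, 20), the occurrences (001, 12, 13) and (001, 15, 16), the exclusion of document 003, and the resulting counts follow the description; the remaining positions are invented so that the outcome matches it. The comparison is inclusive at both ends, which is an assumption based on the later search example, where the occurrence (001, 14, 15) coincides with the partial document (001, 14, 15).

```python
from collections import Counter
from typing import Dict, List, Tuple

Position = Tuple[str, int, int]  # (docid, pre, post)


def word_statistics(
    partial_docs: List[Position], occurrences: List[Position]
) -> Tuple[Dict[Position, int], int]:
    """Return (tf_ki per matching partial document, n_i) for one search word."""
    tf: Counter = Counter()
    for docid, pre, post in occurrences:
        for part in partial_docs:
            p_doc, p_pre, p_post = part
            # The occurrence belongs to the partial document if it is in the same
            # XML document and its range lies within the partial document's range.
            if docid == p_doc and p_pre <= pre and post <= p_post:
                tf[part] += 1
    return dict(tf), len(tf)


partial_docs = [("001", 10, 20), ("002", 5, 9), ("003", 4, 8), ("004", 30, 40)]
occurrences = [("001", 12, 13), ("001", 15, 16), ("002", 6, 7), ("004", 31, 32)]

tf_ki, n_i = word_statistics(partial_docs, occurrences)
print(tf_ki)
print(n_i)
```

The nested loop is quadratic and is used here only for clarity; since both lists can be kept sorted by document ID and "pre", a merge-style scan would be one way to reduce the cost.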

図８に戻って、ランキング部６５は、各統計量と前記した式（１）を用いて、該当する部分文書のスコアを算出する（ステップＳ８０６）。
そして、ランキング部６５は、ステップＳ８０６で算出したスコアの高い順に部分文書をソートすることで、ランキングを行う（ステップＳ８０７）。 Returning to FIG. 8, the ranking unit 65 calculates the score of the corresponding partial document by using each statistic and the above-described equation (1) (step S806).
Then, the ranking unit 65 performs ranking by sorting the partial documents in descending order of the score calculated in step S806 (step S807).

このように、情報処理装置１は、予め算出および格納してある３つの統計量（「部分文書集合全体の統計量」、「部分文書ごとの統計量」および「検索単語に対する統計量」）を用いて検索（適合度の算出、ランキング）を行うことにより、検索を高速に処理することができる。 In this way, the information processing apparatus 1 performs the search (fitness calculation and ranking) using the three statistics (“statistics for the entire partial document set”, “statistics for each partial document”, and “statistics for the search word”) that have been calculated and stored in advance, so the search can be processed at high speed.

（検索例）
次に、図２〜図５の具体例に対して、図８のフローチャートによる検索（ランキング）を行った場合の例について説明する。ここでは、検索クエリのうち、パスが「/book/chapter/text/text()」であり、検索文字列が「地球」であるものとする。
まず、ｌ_ｑ（検索文字列ｑの長さ）は「２」であり、ｔｆ_ｑｉ（検索文字列ｑ中の単語ｉ（地球）の出現回数）は「１」であると算出できる（図８のステップＳ８０１。以下、「図８の」を省略）。 (Search example)
Next, an example in which the search (ranking) according to the flowchart of FIG. 8 is performed on the specific examples of FIGS. 2 to 5 will be described. Here, it is assumed that, in the search query, the path is “/book/chapter/text/text()” and the search character string is “Earth”.
First, l_q (the length of the search character string q) is calculated as “2” (the original search character string 「地球」 consists of two characters), and tf_qi (the number of appearances of the word i (Earth) in the search character string q) is calculated as “1” (step S801 in FIG. 8; “in FIG. 8” is omitted hereinafter).

また、パステーブルＰＴ（図４参照）におけるパスが「/book/chapter/text/text()」のレコード（「pathid」が「０９」のレコード）から、「Ｌ」は「２３７９」であり、「Ｎ」は「１３」であることがわかる（ステップＳ８０２）。
さらに、ノードテーブル００１ＮＴ（図４（ａ１）参照）における「pathid」が「０９」のレコードから、その部分文書の出現位置が（００１，１４，１５）であり、ｌｋ（部分文書ｋの長さ）が「３１９」であることがわかる（ステップＳ８０３）。 Further, from the record in which the path in the path table PT (see FIG. 4) is “/ book / chapter / text / text ()” (the record whose “pathid” is “09”), “L” is “2379”, It can be seen that “N” is “13” (step S802).
Furthermore, from the record whose “pathid” is “09” in the node table 001NT (see FIG. 4 (a1)), it can be seen that the appearance position of the partial document is (001, 14, 15) and that lk (the length of the partial document k) is “319” (step S803).

また、転置表ＩＴ（図５参照）における「term」が「地球」のレコードから、単語「地球」の出現する全ての位置が、（００１，１４，１５）および（００１，１９，２０）であるとわかる（ステップＳ８０４）。
しかし、部分文書（００１，１９，２０）は、ステップＳ８０３で取り出した部分文書（００１，１４，１５）に含まれないため、部分文書（００１，１４，１５）だけが該当する部分文書として絞り込まれる（ステップＳ８０５）。つまり、部分文書（００１，１９，２０）は、使用者によって検索クエリで指定されたパスに対応する部分文書に含まれないため、検索の対象外となる。 Also, from the record in which “term” is “Earth” in the transposition table IT (see FIG. 5), all the positions where the word “Earth” appears are (001, 14, 15) and (001, 19, 20). It can be seen that there is (step S804).
However, since the partial documents (001, 19, 20) are not included in the partial documents (001, 14, 15) extracted in step S803, only the partial documents (001, 14, 15) are narrowed down as corresponding partial documents. (Step S805). That is, the partial document (001, 19, 20) is not included in the search because it is not included in the partial document corresponding to the path specified by the search query by the user.

また、この部分文書（００１，１４，１５）に関して、ｔｆ_ｋｉ（部分文書ｋ中の単語ｉの出現回数）は「１」であり、ｎ_ｉ（該当する単語ｉを含む部分文書数）も「１」であると算出できる（ステップＳ８０５）。 Further, for this partial document (001, 14, 15), tf_ki (the number of occurrences of the word i in the partial document k) is calculated as “1”, and n_i (the number of partial documents containing the word i) is also calculated as “1” (step S805).

以上の７つの値を前記した式（１）に代入してスコアを算出した式を、次の式（２）に示す。

このようにして、検索クエリに該当した部分文書に関するスコアを算出することができる。 The following formula (2) shows a score calculated by substituting the above seven values into the above formula (1).

In this way, the score regarding the partial document corresponding to the search query can be calculated.

なお、ステップＳ８０２において、検索クエリのパスの内容によっては、「Ｌ」と「Ｎ」の値が複数存在する場合もありえる。その場合の「Ｌ」と「Ｎ」の値の扱いについて、図１０を参照しながら説明する。図１０は、「Ｌ」と「Ｎ」の値が複数存在する場合における「Ｌ」と「Ｎ」の値の算出の説明図であり、（ａ）が図４（ｂ）と同様のパステーブルＰＴなど、（ｂ）が「Ｌ」と「Ｎ」の値の算出式、をそれぞれ表している。 In step S802, depending on the path in the search query, there may be multiple values of “L” and “N”. The handling of “L” and “N” in that case is described with reference to FIG. 10. FIG. 10 is an explanatory diagram of the calculation of “L” and “N” when multiple values of “L” and “N” exist; (a) shows, among other things, a path table PT similar to that of FIG. 4B, and (b) shows the formulas for calculating “L” and “N”.

たとえば、検索クエリのパスが「任意の位置にあるtitle（//title）」である場合、図１０（ａ）に示すように、複数のパスが該当する（パスＩＤ（pathid）が「０２」と「０７」のパス）。この場合、たとえば、図１０（ｂ）の算出式に示すように、「Ｎ」の値は複数の「Ｎ（Ｎ_１，Ｎ_２，・・・）」の値を足したもの、「Ｌ」の値は複数の「Ｌ（Ｌ_１，Ｌ_２，・・・）」のそれぞれに関して、対応する「Ｎ」の値による加重平均をとったもの、として計算すればよい。具体的な算出例は、図１０（ａ）の下半分に示した通りである。
このようにして、「Ｌ」と「Ｎ」の値が複数存在する場合でも、支障なく適合度の算出やランキングを行うことができる。 For example, when the path of the search query is “title at an arbitrary position (//title)”, multiple paths match, as shown in FIG. 10A (the paths whose path IDs (pathid) are “02” and “07”). In this case, for example, as shown in the calculation formula of FIG. 10B, “N” may be calculated as the sum of the individual values N_1, N_2, ..., and “L” as the average of the individual values L_1, L_2, ..., weighted by the corresponding “N” values. A specific calculation example is shown in the lower half of FIG. 10A.
In this way, even when there are a plurality of values of “L” and “N”, it is possible to calculate the fitness level and perform ranking without any trouble.
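The combination rule described for FIG. 10B, namely N as the sum of the individual N values and L as the N-weighted average of the individual L values, can be written as a short Python sketch. The function name `combine_path_stats` and the sample numbers are hypothetical; the actual values of FIG. 10A are not reproduced here.

```python
from typing import Iterable, Tuple


def combine_path_stats(stats: Iterable[Tuple[float, int]]) -> Tuple[float, int]:
    """Combine (L_j, N_j) pairs of all matching paths into a single (L, N)."""
    pairs = list(stats)
    n_total = sum(n for _, n in pairs)               # N = N_1 + N_2 + ...
    if n_total == 0:
        raise ValueError("no partial documents under the matching paths")
    # L = (L_1 * N_1 + L_2 * N_2 + ...) / (N_1 + N_2 + ...)
    l_avg = sum(l * n for l, n in pairs) / n_total
    return l_avg, n_total


# Two matching paths with invented statistics: (L=10, N=2) and (L=40, N=1).
print(combine_path_stats([(10.0, 2), (40.0, 1)]))
```

Weighting by N makes the combined L equal to the average length over all partial documents under the matching paths, which is the same quantity the path table stores for a single path.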

（第２実施形態）
次に、図１１Ａ〜図２０を参照しながら、本発明の第２実施形態について説明する。図１１Ａは、第２実施形態の情報処理装置の構成図である。なお、図１の情報処理装置１と同様の構成については同じ符号を付し、説明を適宜省略する。図１１Ａの情報処理装置１ａは、図１の情報処理装置１と比べて、記憶部５ａと処理部６ａの構成が変更されている。 (Second Embodiment)
Next, a second embodiment of the present invention will be described with reference to FIGS. 11A to 20. FIG. 11A is a configuration diagram of an information processing apparatus according to the second embodiment. Components similar to those of the information processing apparatus 1 of FIG. 1 are given the same reference numerals, and their description is omitted as appropriate. In the information processing apparatus 1a of FIG. 11A, the configurations of the storage unit 5a and the processing unit 6a differ from those of the information processing apparatus 1 of FIG. 1.

図１１Ａに示すように、情報処理装置１ａの処理部６ａは、図１の範囲ラベル部６３の代わりにタグインデックス部６６（ノード管理部）を備えている。タグインデックス部６６は、ノードテーブルＮＴａにおいてノードＩＤを使用して各ノードに関する情報を管理する。
また、記憶部５ａにおいて、ノードテーブルＮＴａと転置表ＩＴａは、図１の情報処理装置１において相当するそれぞれの構成と比べて、構成要素が一部変更されている（詳細は図１２と図１４で後記）。さらに、記憶部５ａは、図１の記憶部５と比べて、新たにテキストテーブルＴＴ（テキスト情報）を有している（詳細は図１３で後記）。 As illustrated in FIG. 11A, the processing unit 6a of the information processing device 1a includes a tag index unit 66 (node management unit) instead of the range label unit 63 of FIG. The tag index unit 66 manages information regarding each node using the node ID in the node table NTa.
Further, in the storage unit 5a, some constituent elements of the node table NTa and the transposition table ITa are changed from the corresponding structures in the information processing apparatus 1 of FIG. 1 (details are described later with reference to FIGS. 12 and 14). In addition, unlike the storage unit 5 of FIG. 1, the storage unit 5a has a text table TT (text information) (details are described later with reference to FIG. 13).

ここで、図１１Ｂを参照しながら、ＸＭＬ文書にノードＩＤを付与した状態について説明する。図１１Ｂは、（ａ）が図２（ａ）のＸＭＬ文書００１に対してノードＩＤを付与した状態を示す図であり、（ｂ）が図２（ｂ）のＸＭＬ文書００２に対してノードＩＤを付与した状態を示す図である。
図１１Ｂ（ａ）に示すように、ＸＭＬ文書００１において、ノードｎ１（book）はノードＩＤが「００１」で、ノードＩＤが「００２」のノードｎ１１の親である。同様に、全てのノードに識別子として異なるノードＩＤを付与する。図１１Ｂ（ｂ）についても同様である。 Here, a state in which node IDs are assigned to an XML document will be described with reference to FIG. 11B. In FIG. 11B, (a) shows the state in which node IDs are assigned to the XML document 001 of FIG. 2A, and (b) shows the state in which node IDs are assigned to the XML document 002 of FIG. 2B.
As shown in FIG. 11B (a), in the XML document 001, the node n1 (book) has the node ID “001” and is the parent of the node n11, whose node ID is “002”. In the same way, every node is assigned a different node ID as its identifier. The same applies to FIG. 11B (b).

Next, the node table NTa is described with reference to FIG. 12. In FIG. 12(a), (a1) is the node table 001NTa for the XML document 001 and (a2) is the node table 002NTa for the XML document 002. Each node table NTa consists of the following columns, from left to right: the node ID (nodeid); the parent node ID (parent), which is the node ID of the nearest parent node; the tag (tag) given to the node; the path ID (pathid), which identifies the path; lk, the length of the partial document obtained when the node is designated by a path; and the text ID (textid) (relationship information, node position information), which identifies the text data contained in the partial document obtained when the node is designated by a path.
The path table PT in FIG. 12(b) is the same as that in FIG. 4(b).
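The column layout of the node table NTa can be pictured as a simple record type. The following is a minimal sketch only: the class name `NodeRecord`, the helper `records_for_path`, and the Python types are illustrative assumptions, not part of the described apparatus.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NodeRecord:
    """One row of a node table NTa (column names follow the description)."""
    nodeid: int   # node ID
    parent: int   # node ID of the nearest parent node (0 for the root node)
    tag: str      # tag given to the node
    pathid: int   # identifier of the path in the path table PT
    lk: int       # length of the partial document rooted at this node
    textid: List[int] = field(default_factory=list)  # text IDs under this node


def records_for_path(node_table: List[NodeRecord], pathid: int) -> List[NodeRecord]:
    """Hypothetical helper: select the rows whose path ID matches."""
    return [record for record in node_table if record.pathid == pathid]
```

Selecting rows by `pathid` in this way corresponds to how a path designation picks out the partial documents whose `lk` and `textid` values are later used for ranking.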

Next, the text table is described with reference to FIG. 13. In FIG. 13, (a) is the same node table NTa as in FIG. 12(a), and (b) shows the structure of the text table TT.
As shown in FIG. 13(b), the text table TT stores the text data (text) of each text node in association with a text ID (textid).

Next, the transposition table ITa is described with reference to FIG. 14, which shows its structure.
As shown in FIG. 14, the transposition table ITa consists of a word (term) column and a text ID (textid) column. The text ID (textid) corresponds to the text ID (textid) in the text table TT of FIG. 13(b).

Next, the process of constructing the structure index (the path table PT and the node table NTa) is described with reference to FIG. 15, a flowchart outlining that process.
First, the user submits an XML document to be newly stored to the information processing apparatus 1 via the input unit 2. The data storage unit 61 of the processing unit 6 then stores the XML document in the XML document group 51 of the storage unit 5. Next, for every submitted XML document (that is, until Yes is selected in step S1503), the path index unit 62 constructs the path index (updates the path table PT) (step S1501; see FIG. 16 for details) and the tag index unit 66 constructs the tag index (updates the node table NTa and the text table TT) (step S1502; see FIG. 17 for details). The structure index is built in this way.

FIG. 16 is a flowchart of the path index construction process performed by the path index unit 62. Steps S1601 to S1605 are the same as steps S601 to S605 in FIG. 6, so their description is omitted.

FIG. 17 is a flowchart of the process of constructing the tag index (the node table NTa and the text table TT); it is performed after the process of the flowchart of FIG. 16.
First, the tag index unit 66 takes one node from the submitted XML document (step S1701).
Next, the tag index unit 66 refers to the path table PT and retrieves the path ID corresponding to that node (step S1702).
If the node is a text node (a node having text data), the tag index unit 66 stores the text data in the text table TT (step S1703) and retrieves the corresponding text ID (textid) value from the text table TT (step S1704). If the node is not a text node, the tag index unit 66 skips steps S1703 and S1704.

The tag index unit 66 then determines whether the node is the root node (the topmost node) (step S1705). If it is (Yes), the tag index unit 66 stores "0" in "parent" for that node in the node table NTa (step S1710) and proceeds to step S1711.
If the node is not the root node (No in step S1705), the tag index unit 66 stores the node ID of the nearest parent node in "parent" for that node in the node table NTa (that is, the parent node's node ID becomes the parent value) (step S1706).

After step S1706, the tag index unit 66 repeats the following until it reaches the root node (that is, until the parent value of the record is 0 and step S1709 yields Yes): it follows the parent value to the row (record) of the next parent node up in the node table NTa (step S1707) and adds the text ID value retrieved in step S1704 to the text ID (textid) of that record (step S1708).

In other words, through steps S1707 to S1709, each node in the node table NTa accumulates the text IDs (textid) of all text data beneath it. For example, in FIG. 12(a1), the root node (the node whose node ID (nodeid) is "001") accumulates the text IDs (textid) of all text data in the XML document 001.
If the node taken in step S1701 is not a text node, the tag index unit 66 skips steps S1706 to S1709.

Next, the tag index unit 66 appends a record for the node itself to the end of the node table NTa (step S1711). At this point, the tag index unit 66 stores each value (column), including "lk", in the node table NTa.
The tag index unit 66 then determines whether all nodes have been processed (step S1712). If not (No), it returns to step S1701 and repeats the process; if so (Yes), it ends the process.
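The tag index construction of FIG. 17 can be sketched roughly as follows. This is a simplified illustration under several assumptions that are not taken from the description: the function name `build_tag_index` is hypothetical, the path table is built on the fly instead of being read from an existing path table PT, "lk" is taken to be the number of whitespace-separated words under the node, and text following a child element (mixed content) is ignored when text IDs are assigned.

```python
import xml.etree.ElementTree as ET


def build_tag_index(xml_text):
    """Return (node_table, text_table, path_table) for one XML document."""
    path_table = {}   # path string -> pathid (stand-in for the path table PT)
    text_table = {}   # textid -> text data (text table TT)
    node_table = []   # NTa records; the record for nodeid n is node_table[n - 1]

    def visit(elem, parent_id, parent_path):
        path = parent_path + "/" + elem.tag
        pathid = path_table.setdefault(path, len(path_table) + 1)  # S1702
        nodeid = len(node_table) + 1
        textids = []
        text = (elem.text or "").strip()
        if text:                                   # text node
            textid = len(text_table) + 1
            text_table[textid] = text              # S1703
            textids.append(textid)                 # S1704
            ancestor = parent_id
            while ancestor != 0:                   # S1707 to S1709
                record = node_table[ancestor - 1]
                record["textid"].append(textid)    # S1708
                ancestor = record["parent"]
        # Assumed definition of lk: number of words under this node.
        lk = len(" ".join(elem.itertext()).split())
        node_table.append({                        # S1711
            "nodeid": nodeid,
            "parent": parent_id,                   # 0 for the root (S1710)
            "tag": elem.tag,
            "pathid": pathid,
            "lk": lk,
            "textid": textids,
        })
        for child in elem:                         # document order
            visit(child, nodeid, path)

    visit(ET.fromstring(xml_text), 0, "")
    return node_table, text_table, path_table
```

Because nodes are visited in document order, every ancestor record already exists when a text node is reached, so walking up the `parent` values is enough to collect all descendant text IDs at each node.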

In this way, as in steps S606 to S610 of FIG. 6, the process of the flowchart of FIG. 17 calculates and stores the "statistic for each partial document" ("lk" in the node table NTa) when the node table NTa is updated. This statistic therefore does not have to be calculated at search time (when the conformity is calculated), which speeds up the search process.

Next, the process of constructing the text index (the transposition table ITa) is described with reference to FIG. 18, a flowchart of that process.
After the path table PT and the node table NTa have been updated for the submitted XML document by the processes of the flowcharts shown in FIGS. 15 to 17, the text index unit 64 takes one record from the node table NTa (step S1801).

The text index unit 64 then uses morphological analysis to split into words the content described by the record, namely the data in the "text" column of the text table TT of FIG. 13(b) that corresponds to "textid" in the node table NTa of FIG. 13(a) (step S1802).

The text index unit 64 then determines whether one of the resulting words is already contained in the transposition table ITa (step S1803).
If the word is not contained in the transposition table ITa (No in step S1803), the text index unit 64 newly registers the word in the transposition table ITa (step S1804).

If the result of step S1803 is Yes, or after step S1804, the text index unit 64 stores the "textid" of the word in the transposition table ITa in step S1805. For example, if the word "universe" appears in the node n111 of the XML document 001 (see FIG. 11B(a)), the "textid" of the text data of the node n111, namely "01" (see FIG. 13(b)), is stored in the "textid" column for "universe" in the transposition table ITa (see FIG. 14).

The text index unit 64 then determines whether all words of the record have been processed (step S1806). If not (No), it returns to step S1803 and repeats the process; if so (Yes), it proceeds to step S1807.
The text index unit 64 also determines whether all records of the XML document have been processed (step S1807). If not (No), it returns to step S1801 and repeats the process; if so (Yes), it ends the process.
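The loop of FIG. 18 can be illustrated with a short sketch. The names `tokenize` and `build_text_index` are hypothetical, whitespace splitting merely stands in for morphological analysis, and the sketch iterates over the text table directly (an assumption made here so that each piece of text data is indexed exactly once) instead of going through node table records.

```python
from collections import defaultdict


def tokenize(text):
    """Stand-in for morphological analysis: split on whitespace."""
    return text.lower().split()


def build_text_index(text_table):
    """Build a term -> list of textid mapping (cf. transposition table ITa)."""
    inverted = defaultdict(list)
    for textid, text in text_table.items():   # one piece of text data (cf. S1801)
        for word in tokenize(text):            # S1802
            # A missing word is registered on first access (S1803, S1804),
            # then the textid is stored for that word (S1805).
            inverted[word].append(textid)
    return dict(inverted)
```

A word that occurs twice in the same text data yields the same `textid` twice in its list, so occurrence counts can later be recovered from the list alone.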

In this way, of the three statistics needed for a search (the "statistic for the whole set of partial documents", the "statistic for each partial document", and the "statistic for the search word"), the process of the flowchart shown in FIG. 18 calculates and stores the "statistic for the search word" (the information in the transposition table ITa) immediately after an XML document is submitted, before any search query is entered. This statistic therefore does not have to be calculated at search time (when the conformity is calculated), which speeds up the search process.

Next, the ranking process is described with reference to FIG. 19, a flowchart of that process.
To rank by search the XML documents stored in the XML document group 51 of the storage unit 5a, the user first enters a search query (a path and a search string) using the input unit 2.

The ranking unit 65 of the processing unit 6 then extracts (calculates) from the search query the path, the search words (extracted from the search string), l_q (the length of the search string q), and tf_qi (the number of occurrences of the search word i in the search string q) (step S1901).
Next, the ranking unit 65 refers to the path table PT and retrieves the values of "L", "N", and the path ID (pathid) from the record corresponding to the path in the search query (step S1902).
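The query-side quantities of step S1901 could be derived along the following lines. This is only a sketch: the function name `query_statistics` is hypothetical, whitespace splitting stands in for word extraction, and l_q is assumed here to be the number of words in the search string, which is one possible reading of "length".

```python
from collections import Counter


def query_statistics(path, search_string):
    """Return the path, search words, l_q and tf_qi for one search query."""
    words = search_string.lower().split()   # stand-in for word extraction
    tf_q = Counter(words)                   # tf_qi: occurrences of word i in q
    return {
        "path": path,
        "words": sorted(tf_q),
        "l_q": len(words),                  # assumed: length counted in words
        "tf_q": dict(tf_q),
    }
```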

The ranking unit 65 refers to the node table NTa and retrieves the text ID (textid) and "lk" values from the records corresponding to the path ID (pathid) retrieved in step S1902 (step S1903).
The ranking unit 65 retrieves the appearance positions (textid) of the search word from the transposition table ITa (step S1904).

The ranking unit 65 then uses the text ID (textid) values retrieved from the transposition table ITa to narrow down the partial documents in which the search word appears, and calculates tf_ki (the number of occurrences of the word i in the partial document k) and n_i (the number of partial documents containing the word i) (step S1905). A concrete example of step S1905 is described with reference to FIG. 20.

FIG. 20 shows a concrete example of step S1905, that is, an example of selecting partial documents and calculating the word statistics. The example in FIG. 20 is independent of the concrete examples in FIGS. 2 to 5 and elsewhere.
Starting from the partial document group shown in FIG. 20(a) and the word appearance document group shown in FIG. 20(b), the text IDs of the first partial document listed in (a) (node ID "004", text IDs "02, 03, 04, 05, 06, 07") are first looked up in the word appearance document group (b). Here, the text IDs "03" and "07" are found in the word appearance document group (b).

Proceeding in the same way, as shown in FIG. 20(c), the partial documents are narrowed down to three; the statistic tf_ki (the number of occurrences of the word i in the partial document k) is "2", "1", and "1" from the top, and n_i (the number of partial documents containing the word i) is calculated to be "3".
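The selection in step S1905 amounts to matching each partial document's text IDs against the text IDs at which the search word appears. The sketch below is illustrative only: the function name `word_statistics` is hypothetical, and apart from the first partial document (node ID 004 with text IDs 02 to 07, matching at 03 and 07), the sample rows are invented so that the result mirrors the counts stated above.

```python
from collections import Counter


def word_statistics(partial_docs, word_textids):
    """Return (tf_k, n_i) for one search word.

    partial_docs: node ID -> text IDs of that partial document
    word_textids: text IDs at which the search word appears
    """
    occurrences = Counter(word_textids)
    tf_k = {}
    for nodeid, textids in partial_docs.items():
        hits = sum(occurrences[t] for t in textids)
        if hits:                      # keep only documents containing the word
            tf_k[nodeid] = hits
    return tf_k, len(tf_k)


# Only the first row follows the description; the others are hypothetical.
partial_docs = {4: [2, 3, 4, 5, 6, 7], 12: [9, 10], 20: [15], 30: [21, 22]}
word_textids = [3, 7, 10, 15]
tf_k, n_i = word_statistics(partial_docs, word_textids)
print(tf_k, n_i)  # {4: 2, 12: 1, 20: 1} 3
```

The partial document with node ID 30 contains none of the word's text IDs and is dropped, which leaves three partial documents with tf_ki values 2, 1, and 1.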

Returning to FIG. 19, the ranking unit 65 calculates the score of each relevant partial document using the statistics and equation (1) given above (step S1906).
The ranking unit 65 then performs the ranking by sorting the partial documents in descending order of the score calculated in step S1906 (step S1907).
Steps S1906 and S1907 are the same as steps S806 and S807 in FIG. 8, so a detailed description is omitted.

In this way, the information processing apparatus 1a performs searches (conformity calculation and ranking) using the three statistics that were calculated and stored in advance (the "statistic for the whole set of partial documents", the "statistic for each partial document", and the "statistic for the search word"), and can therefore process searches at high speed.

The method for calculating the conformity of XML documents in each embodiment can be realized on a computer (apparatus) by creating a program that executes the flowcharts described above. Such programs can be stored on a recording medium such as a hard disk, flash memory, CD-ROM (Compact Disk Read Only Memory), or DVD (Digital Versatile Disk).

This concludes the description of the embodiments, but aspects of the present invention are not limited to them.
For example, the embodiments use morphological analysis to extract words from a character string, but another technique such as N-gram may be used instead.
The specific configuration may also be modified as appropriate without departing from the spirit of the present invention.

Block diagram of the information processing apparatus of the first embodiment.
Examples of XML document source code: (a) relates to the XML document 001, which is the same as in FIG. 22, and (b) relates to the XML document 002 as another example.
(a) shows the XML document 001 with range labels assigned, and (b) shows the XML document 002 with range labels assigned.
Example tables: (a) is a node table and (b) is a path table.
An example of the transposition table.
Flowchart of the structure index construction process.
Flowchart of the text index construction process.
Flowchart of the ranking process.
An example of selecting partial documents and calculating word statistics.
Explanatory diagram of the calculation of "L" and "N" when multiple "L" and "N" values exist: (a) shows, among other things, the same path table PT as in FIG. 4(b), and (b) shows the formulas for calculating "L" and "N".
Block diagram of the information processing apparatus of the second embodiment.
(a) shows the XML document 001 of FIG. 2(a) with node IDs assigned, and (b) shows the XML document 002 of FIG. 2(b) with node IDs assigned.
Example tables: (a) is a node table and (b) is a path table.
(a) is the same node table NTa as in FIG. 12(a), and (b) shows the structure of the text table TT.
An example of the transposition table ITa.
Flowchart outlining the structure index construction process.
Flowchart of the path index construction process performed by the path index unit 62.
Flowchart of the tag index construction process.
Flowchart of the text index construction process.
Flowchart of the ranking process.
An example of selecting partial documents and calculating word statistics.
A ranking example using the statistical information of a comparative example: (a) is a document example, (b) illustrates conformity (ranking) calculation example 1, and (c) illustrates conformity calculation example 2.
(a) is a simplified example of XML document source code, and (b) shows the tree structure of the XML document.
(a) and (b) show, for the example of FIG. 2(b), the range of the partial document for each path.
(a) and (b) show, for the example of FIG. 2(b), the range of the partial document for each path.

Explanation of symbols

1, 1a Information processing apparatus
2 Input unit
3 Output unit
4 Memory
5, 5a Storage unit
6, 6a Processing unit
51 XML document group
61 Data storage unit
62 Path index unit
63 Range label unit
64 Text index unit
65 Ranking unit
66 Tag index unit
IT, ITa Transposition table
NT, NTa Node table
PT, PTa Path table
TT Text table

Claims

Path information for storing information in units of paths for a plurality of XML documents, node information for storing relation information between nodes in each XML document for each XML document, and each word used in the plurality of XML documents A storage unit for storing word information for storing appearance position information;
A path index unit for storing path unit information in the path information for the plurality of XML documents;
A node management unit that analyzes a relationship including a parent-child relationship between nodes in each XML document and stores the relationship information in the node information;
A text index part for storing a word used in any of the plurality of XML documents in the word information in association with appearance position information including a document ID (IDentification) of the XML document in which the word is used;
When a path and a search word are input, with respect to the plurality of XML documents, a ranking unit that calculates a degree of matching of the search word in a partial document that is a document under the input path;
An information processing apparatus that stores a plurality of XML documents that are structured data, and calculates, for each of the plurality of XML documents, a matching degree of the search word in a partial document under the input path. A method of calculating the fitness,
The path information stores path statistical information that is statistical information on a set of partial documents under each path as information on a path basis.
The path index unit updates path information including the path statistical information with respect to the plurality of stored XML documents before the path and a search word are input.
The ranking unit refers to the path information including the path statistical information, the node information, and the word information after the path and the search word are input, and based on the path and the search word, A method for calculating the degree of conformity of an XML document, comprising: calculating the degree of conformity relating to an XML document.

Path information for storing information in units of paths for a plurality of XML documents, node information for storing relation information between nodes in each XML document for each XML document, and each word used in the plurality of XML documents A storage unit for storing word information for storing appearance position information;
A path index unit for storing path unit information in the path information for the plurality of XML documents;
A node management unit that analyzes a relationship including a parent-child relationship between nodes in each XML document and stores the relationship information in the node information;
A text index part for storing a word used in any of the plurality of XML documents in the word information in association with appearance position information including a document ID (IDentification) of the XML document in which the word is used;
When a path and a search word are input, with respect to the plurality of XML documents, a ranking unit that calculates a degree of matching of the search word in a partial document that is a document under the input path;
An information processing apparatus that stores a plurality of XML documents that are structured data, and calculates, for each of the plurality of XML documents, a matching degree of the search word in a partial document under the input path. A method of calculating the fitness,
The node information stores, as information for each node, node statistical information that is statistical information regarding partial documents under each node,
The node management unit updates the node information including the node statistical information before the path and the search word are input,
The ranking unit refers to the path information, node information including the node statistical information, and the word information after the path and the search word are input, and based on the path and the search word, A method for calculating the degree of conformity of an XML document, comprising: calculating the degree of conformity relating to an XML document.

Path information for storing information in units of paths for a plurality of XML documents, node information for storing relation information between nodes in each XML document for each XML document, and each word used in the plurality of XML documents A storage unit for storing word information for storing appearance position information;
A path index unit for storing path unit information in the path information for the plurality of XML documents;
A node management unit that analyzes a relationship including a parent-child relationship between nodes in each XML document and stores the relationship information in the node information;
A text index part for storing a word used in any of the plurality of XML documents in the word information in association with appearance position information including a document ID (IDentification) of the XML document in which the word is used;
When a path and a search word are input, with respect to the plurality of XML documents, a ranking unit that calculates a degree of matching of the search word in a partial document that is a document under the input path;
An information processing apparatus that stores a plurality of XML documents that are structured data, and calculates, for each of the plurality of XML documents, a matching degree of the search word in a partial document under the input path. A method of calculating the fitness,
The word information stores, as the appearance position information, node position information where the word appears together with the document ID of the XML document,
The text index unit updates the word information including the node position information before the path and search word are input,
The ranking unit refers to word information including the path information, the node information, and the node position information after the path and the search word are input, and based on the path and the search word, A method for calculating the conformity of an XML document, comprising: calculating the conformity of an XML document.

As the relationship information between the nodes, the node management unit sets a start label value and an end label value for each node, a start label value of the parent node is smaller than a start label value of the child node, and an end label value of the parent node is It is given to satisfy the relationship that it is larger than the end label value of the child node, and stored in the node information,
The text index unit assigns a start label value and an end label value in the node information as node position information of each word information,
The ranking unit determines a range of partial documents corresponding to the path based on a start label value and an end label value of each node in the node information when the path and a search word are input, and the partial document The method of calculating the fitness of an XML document according to claim 3, wherein the fitness of the search word is calculated with respect to.

The storage unit further stores text information for storing text data of a node having text data in the XML document in association with identification information of the text data,
The node management unit associates identification information of text data under each node in the node information,
When the path and the search word are input, the ranking unit determines a range of the partial document corresponding to the path based on the identification information of the text data in the node information, and the search word of the search word is related to the partial document. The method for calculating the conformity of an XML document according to claim 3, wherein the conformance is calculated.

Path information for storing information in units of paths for a plurality of XML documents, node information for storing relation information between nodes in each XML document for each XML document, and each word used in the plurality of XML documents A storage unit for storing word information for storing appearance position information;
A path index unit for storing path unit information in the path information for the plurality of XML documents;
A node management unit that analyzes a relationship including a parent-child relationship between nodes in each XML document and stores the relationship information in the node information;
A text index part for storing a word used in any of the plurality of XML documents in the word information in association with appearance position information including a document ID (IDentification) of the XML document in which the word is used;
When a path and a search word are input, with respect to the plurality of XML documents, a ranking unit that calculates a degree of matching of the search word in a partial document that is a document under the input path;
A plurality of XML documents that are structured data, and for the plurality of XML documents, an information processing apparatus that calculates a degree of matching of the search word in a partial document under the input path,
The path information stores path statistical information that is statistical information on a set of partial documents under each path as information on a path basis.
The path index unit updates path information including the path statistical information with respect to the plurality of stored XML documents before the path and a search word are input.
The ranking unit refers to the path information including the path statistical information, the node information, and the word information after the path and the search word are input, and based on the path and the search word, An information processing apparatus that calculates a degree of conformity with respect to an XML document.

An information processing apparatus that accumulates a plurality of XML documents, which are structured data, and calculates, for the plurality of XML documents, a degree of conformity of a search word in a partial document under an input path, the apparatus comprising:
a storage unit that stores path information holding per-path information for the plurality of XML documents, node information holding, for each XML document, relationship information between the nodes in that XML document, and word information holding appearance position information for each word used in the plurality of XML documents;
a path index unit that stores the per-path information for the plurality of XML documents in the path information;
a node management unit that analyzes relationships, including parent-child relationships, between the nodes in each XML document and stores the resulting relationship information in the node information;
a text index unit that stores each word used in any of the plurality of XML documents in the word information in association with appearance position information including the document ID (identification) of the XML document in which the word is used; and
a ranking unit that, when the path and the search word are input, calculates, for the plurality of XML documents, the degree of conformity of the search word in the partial document, which is a document under the input path,
wherein
the node information stores, as per-node information, node statistical information, which is statistical information on the partial document under each node,
the node management unit updates the node information, including the node statistical information, before the path and the search word are input, and
the ranking unit, after the path and the search word are input, refers to the path information, the node information including the node statistical information, and the word information, and calculates the degree of conformity for the XML documents based on the path and the search word.
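
A minimal sketch of node statistical information computed at storage time is given below. The statistic chosen (the word count of the partial document under each node) and the record layout returned by the hypothetical function `build_node_info` are assumptions made for illustration; the claim does not fix either.

```python
import xml.etree.ElementTree as ET


def build_node_info(xml_text):
    """Return one record per node, in document order, with a subtree word count."""
    records = []

    def visit(elem, parent_id):
        node_id = len(records)
        record = {"id": node_id, "parent": parent_id, "tag": elem.tag, "word_count": 0}
        records.append(record)
        # Words directly inside this element, before its first child.
        total = len((elem.text or "").split())
        for child in elem:
            total += visit(child, node_id)             # words under the child node
            total += len((child.tail or "").split())   # text following the child
        record["word_count"] = total                   # assumed node statistic
        return total

    visit(ET.fromstring(xml_text), None)
    return records
```

Because each record already carries the count for its own subtree, a ranking step can read the size of any partial document directly from the node information.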

An information processing apparatus that accumulates a plurality of XML documents, which are structured data, and calculates, for the plurality of XML documents, a degree of conformity of a search word in a partial document under an input path, the apparatus comprising:
a storage unit that stores path information holding per-path information for the plurality of XML documents, node information holding, for each XML document, relationship information between the nodes in that XML document, and word information holding appearance position information for each word used in the plurality of XML documents;
a path index unit that stores the per-path information for the plurality of XML documents in the path information;
a node management unit that analyzes relationships, including parent-child relationships, between the nodes in each XML document and stores the resulting relationship information in the node information;
a text index unit that stores each word used in any of the plurality of XML documents in the word information in association with appearance position information including the document ID (identification) of the XML document in which the word is used; and
a ranking unit that, when the path and the search word are input, calculates, for the plurality of XML documents, the degree of conformity of the search word in the partial document, which is a document under the input path,
wherein
the word information stores, as the appearance position information, node position information indicating the node in which the word appears, together with the document ID of the XML document,
the text index unit updates the word information, including the node position information, before the path and the search word are input, and
the ranking unit, after the path and the search word are input, refers to the path information, the node information, and the word information including the node position information, and calculates the degree of conformity for the XML documents based on the path and the search word.
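
The word information described here resembles an inverted index whose postings carry a node position in addition to the document ID. The sketch below assumes that the node position is a node number assigned in document order and that words are lowercased; both choices, and the function name `index_words`, are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def index_words(word_info, doc_id, xml_text):
    """Add (doc_id, node_id) postings for every word of one XML document."""
    counter = 0

    def visit(elem):
        nonlocal counter
        node_id = counter          # assumed node position: document-order number
        counter += 1
        for word in (elem.text or "").split():
            word_info[word.lower()].append((doc_id, node_id))
        for child in elem:
            visit(child)
            # Text after a child element still belongs to the current node.
            for word in (child.tail or "").split():
                word_info[word.lower()].append((doc_id, node_id))

    visit(ET.fromstring(xml_text))


word_info = defaultdict(list)
index_words(word_info, 1, "<a>xml <b>search xml</b> index</a>")
```

After the call, `word_info["xml"]` holds one posting for node 0 and one for node 1 of document 1, so a query can tell in which node each occurrence lies without re-parsing the document.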

The information processing apparatus according to claim 8, wherein
the node management unit assigns, as the relationship information between the nodes, a start label value and an end label value to each node such that the start label value of a parent node is smaller than the start label value of its child node and the end label value of the parent node is larger than the end label value of the child node, and stores the label values in the node information,
the text index unit assigns the start label value and the end label value in the node information as the node position information of each piece of word information, and
the ranking unit, when the path and the search word are input, determines the range of the partial document corresponding to the path based on the start label value and the end label value of each node in the node information, and calculates the degree of conformity of the search word for that partial document.
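
The start and end label scheme can be illustrated with a single counter that is incremented on entering and on leaving each node, which yields labels satisfying the stated parent and child inequalities. In the sketch below the raw occurrence count stands in for the degree of conformity, which is an assumption; the function names `label_document` and `count_hits` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def label_document(xml_text):
    """Assign (start, end) labels to nodes and attach them to word postings."""
    nodes = []                     # (path, start, end) per node
    postings = defaultdict(list)   # word -> [(start, end)] of the containing node
    counter = 0

    def visit(elem, parent_path):
        nonlocal counter
        path = f"{parent_path}/{elem.tag}"
        counter += 1
        start = counter            # smaller than every descendant's start label
        texts = [elem.text or ""]
        for child in elem:
            visit(child, path)
            texts.append(child.tail or "")
        counter += 1
        end = counter              # larger than every descendant's end label
        nodes.append((path, start, end))
        for word in " ".join(texts).split():
            postings[word.lower()].append((start, end))

    visit(ET.fromstring(xml_text), "")
    return nodes, postings


def count_hits(nodes, postings, path, word):
    """For each partial document under `path`, count occurrences of `word`."""
    hits = {}
    for node_path, start, end in nodes:
        if node_path == path:
            # A posting lies in the partial document if its labels are enclosed.
            hits[(start, end)] = sum(
                1 for s, e in postings.get(word.lower(), [])
                if start <= s and e <= end
            )
    return hits
```

The containment test needs only two integer comparisons per posting, so the range of a partial document can be decided without walking the tree at query time.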

A program for causing a computer to execute the method for calculating the conformity of an XML document according to any one of claims 1 to 5.