JP2004054431A

JP2004054431A - Structured document generation device and structured document generation program

Info

Publication number: JP2004054431A
Application number: JP2002208712A
Authority: JP
Inventors: Kentaro Oguchi; 尾口　健太郎; Mitsuhiro Araki; 荒木　円博
Original assignee: Toyota Central R&D Labs Inc
Current assignee: Toyota Central R&D Labs Inc
Priority date: 2002-07-17
Filing date: 2002-07-17
Publication date: 2004-02-19
Anticipated expiration: 2022-07-17
Also published as: JP4653375B2

Abstract

【課題】自然文からコンピュータが容易に認識できるような意味的構造によって構成される構造化文書を生成する。
【解決手段】意味抽出部３０は、属性値の抽出パターンと属性値の出力形式とを様々な概念毎に記憶した概念辞書３１と、パターン照合を行って複数の項目概念辞書から最適な項目概念辞書を選択するパターン照合部３２と、選択した項目概念辞書を用いて属性値を抽出する属性値抽出部３３と、抽出した属性値と対応するタグとをＸＭＬ文書に整形して出力する整形部３４とを備えている。
【選択図】図７

A structured document composed of a semantic structure that a computer can easily recognize is generated from natural sentences.
A meaning extraction unit 30 includes: a concept dictionary 31 that stores, for each of various concepts, an attribute value extraction pattern and an attribute value output format; a pattern matching unit 32 that performs pattern matching to select the most suitable item concept dictionary from among a plurality of item concept dictionaries; an attribute value extraction unit 33 that extracts attribute values using the selected item concept dictionary; and a shaping unit 34 that shapes the extracted attribute values and their corresponding tags into an XML document and outputs it.
[Selected drawing] FIG. 7

Description

【０００１】
【発明の属する技術分野】
本発明は、構造化文書生成装置及び構造化文書生成プログラムに係り、例えば、検索に直接使用できるデータを生成するのに用いて好適な構造化文書生成装置及び構造化文書生成プログラムに関する。
【０００２】
【従来の技術及び発明が解決しようとする課題】
特開２００１−２９０８０１号公報では、テキスト形式の文書を入力し、その文書構造を利用して構造化文書を出力する構造文書化システム、構造文書化プログラム、及び、コンピュータ可読格納媒体（以下「従来技術１」という。）が提案されている。
【０００３】
従来技術１は、テキスト形式の文書をＷｅｂページで表示するために再構成などをすることを目的としており、文書の構造を分解するために、文書分解定義ファイルを予め定義しておき、文書定義ファイルに記述されているパターンを用いて文書を構造化している。このようにして生成された文書構造は、標題、作成者、日付といった表層的な形式になっている。
【０００４】
しかし、本文として自然文で書かれた部分は、何ら変換されずにそのまま出力されるので、コンピュータシステムで利用できる形式になっていない。このため、従来技術１から出力された構造化文書は、表層的な情報のみデータベースで使用されるが、自然文で書かれた本文などの具体的な内容についてはデータベースで使用されないという問題があった。
【０００５】
本発明は、上述した課題を解決するために提案されたものであり、自然文からコンピュータが容易に認識できるような意味的構造によって構成される構造化文書を生成する構造化文書生成装置及び構造化文書生成プログラムを提供することを目的とする。
【０００６】
【課題を解決するための手段】
上述した課題を解決するため、請求項１に記載の発明は、自然文を含む文章を入力する文章入力手段と、前記文章入力手段により入力された文章の意味的構造によって構成される構造化文書を生成する構造化文書生成手段と、を備えている。
【０００７】
ここで、請求項１に記載の発明は、コンピュータに請求項４に記載の発明をインストールすることで構成される。
【０００８】
すなわち、請求項４に記載の発明は、コンピュータに、自然文を含む文章を入力する文章入力工程と、前記文章入力工程により入力された文章の意味的構造によって構成される構造化文書を生成する構造化文書生成工程と、を備えた処理を実行させる構造化文書生成プログラムである。
【０００９】
したがって、請求項１および４に記載の発明によれば、自然文を含む文章に基づいて、当該文章の意味的構造によって構成される構造化文書を自動的に生成することができる。
【００１０】
請求項２に記載の発明は、請求項１に記載の発明において、前記構造化文書生成手段は、前記文章入力手段により入力された文章の概念に基づいて、前記概念を表す語句と、前記語句を説明する属性値と、を有する意味的構造によって構成される構造化文章を生成する。
【００１１】
ここで、請求項２に記載の発明は、コンピュータに請求項５に記載の発明をインストールすることで構成される。
【００１２】
すなわち、請求項５に記載の発明は、請求項４に記載の発明において、前記構造化文書生成工程は、前記文章入力工程により入力された文章の概念に基づいて、前記概念を表す語句と、前記語句を説明する属性値と、を有する意味的構造によって構成される構造化文章を生成する。
【００１３】
したがって、請求項２および５に記載の発明によれば、文章の概念に基づいて、その概念を表す主たる語句と前記語句を説明する属性値とを抽出し、主たる語句と属性値とを有する意味的構造によって構成される構造化文章を生成することができる。
【００１４】
請求項３に記載の発明は、請求項２に記載の発明において、概念を表す語句と、前記語句を説明する属性値と、前記語句と前記属性値とが文章中で記述されるパターンと、を表した概念辞書を記憶する概念辞書記憶手段を更に備え、前記構造化文書生成手段は、前記文章入力手段により入力された文章と、前記概念辞書記憶手段に記憶された概念辞書とを照合することによって、前記意味的構造によって構成される構造化文書を生成する。
【００１５】
ここで、請求項３に記載の発明は、コンピュータに請求項６に記載の発明をインストールすることで構成される。
【００１６】
すなわち、請求項６に記載の発明は、請求項５に記載の発明において、前記構造化文書生成工程は、前記文章入力工程により入力された文章と、概念を表す語句と、前記語句を説明する属性値と、前記語句と前記属性値とが文章中で記述されるパターンと、を表した概念辞書と、を照合することによって、前記意味的構造によって構成される構造化文書を生成する。
【００１７】
したがって、請求項３および６に記載の発明によれば、語句と属性値とが文章中で記述されるパターンを予め表した概念辞書を用いることによって、文書から語句と属性値とを抽出して、意味的構造によって構成される構造化文書を生成することができる。
【００１８】
請求項７に記載の発明は、自然文を含む文書に対して語句解析を実行して、前記文書を構成する単語と品詞種類とを対応付けた語句解析済み文書を出力する語句解析手段と、抽出対象となる属性値を含んだ単語列の品詞種類の構成を表す品詞種類構成パターンと、前記属性値及び対応するタグを構造化して出力する出力形式情報と、を有する概念辞書を記憶する概念辞書記憶手段と、前記語句解析済み文書の単語列の品詞種類の構成パターンと、前記概念辞書記憶手段に記憶されている概念辞書の品詞種類構成パターンとを照合して、前記品詞種類構成パターンに対応する前記語句解析済み文書の単語列から属性値を抽出する属性値抽出手段と、前記属性値抽出手段により抽出された属性値と、前記概念辞書記憶手段に記憶されている概念辞書の出力形式情報とに基づいて、構造化文書を生成する文書生成手段と、を備えている。
【００１９】
請求項７に記載の発明は、コンピュータに請求項９に記載の発明をインストールすることで構成される。
【００２０】
請求項９に記載の発明は、コンピュータに、自然文を含む文書に対して語句解析を実行して、前記文書を構成する単語と品詞種類とを対応付けた語句解析済み文書を出力する語句解析工程と、前記語句解析済み文書の単語列の品詞種類の構成パターンと、抽出対象となる属性値を含んだ単語列の品詞種類の構成を表す品詞種類構成パターンとを照合して、前記品詞種類構成パターンに対応する前記語句解析済み文書の単語列から属性値を抽出する属性値抽出工程と、前記抽出された属性値と、前記属性値と対応するタグとを構造化して出力する出力形式情報とに基づいて、構造化文書を生成する文書生成工程と、を備えた処理を実行させる構造化文書生成プログラムである。
【００２１】
語句解析手段は、自然文を含む文書に対して語句解析を実行する。ここで、自然文を含む文書とは、自然文のみからなる文書だけでもよいし、自然文と所定のタグとを有する所定形式の文書であってもよい。そして、語句解析手段は、文書を構成する各単語に品詞種類を対応付けて、各単語と各々の品詞種類とを対応付けた語句解析済み文書を出力する。
【００２２】
概念辞書記憶手段は、品詞種類構成パターンと出力形式情報とを有する概念辞書を記憶している。ここで、品詞種類構成パターンは、文書から属性値を抽出しようとする際に、抽出対象となる属性値を含んだ単語列の品詞種類の構成を表したものである。出力形式情報は、文書の出力形式を示し、属性値及び対応するタグを構造化して出力する情報である。
【００２３】
属性値抽出手段は、語句解析済み文書の単語列の品詞種類の構成パターンと、概念辞書の品詞種類構成パターンとを照合して、この品詞種類構成パターンに対応する単語列を語句解析済み文書から探し出す。そして、探し出した単語列から属性値を抽出する。
文書生成手段は、抽出された属性値と、概念辞書の出力形式情報とに基づいて、所定の形式の構造化文書を生成する。この構造化文書は、当初は自然文であっても、属性値とタグとからなる形式の文書である。
【００２４】
したがって、請求項７および９に記載の発明によれば、語句解析済み文書の単語列の品詞種類の構成パターンと、抽出対象となる属性値を含んだ単語列の品詞種類の構成を表す品詞種類構成パターンとを照合して属性値を抽出し、抽出された属性値を出力形式情報に基づいて構造化して構造化文書を生成することにより、自然文から意味が抽出された構造化文書を得ることができる。
【００２５】
請求項８に記載の発明は、請求項７に記載の発明において、前記概念辞書記憶手段は、概念毎に各々設けられた概念辞書を記憶し、前記属性値抽出手段は、前記概念辞書記憶手段に記憶されている複数の概念辞書の中から、前記語句解析済み文書の品詞種類の構成パターンと一致する品詞種類構成パターンを有する概念辞書を選択し、選択した概念辞書を用いて属性値を抽出する。
【００２６】
請求項８に記載の発明は、コンピュータに請求項１０に記載の発明をインストールすることで構成される。
【００２７】
請求項１０に記載の発明は、請求項９に記載の発明において、前記属性値抽出工程は、概念毎に各々設けられた複数の概念辞書の中から、前記語句解析済み文書の品詞種類の構成パターンと一致する品詞種類構成パターンを有する概念辞書を選択し、選択した概念辞書を用いて属性値を抽出する。
【００２８】
概念辞書記憶手段は、様々な概念毎に各々の概念辞書を記憶している。属性値抽出手段は、様々な概念辞書の中から、語句解析済み文書の品詞種類の構成パターンと一致する品詞種類構成パターンを有する概念辞書を選択することで、語句解析済み文書の概念を特定する。そして、選択した概念辞書を用いて属性値を抽出することで、特定された概念に関する属性値を得る。
【００２９】
したがって、請求項８および１０に記載の発明によれば、様々な概念辞書の中から、語句解析済み文書の品詞種類の構成パターンと一致する品詞種類構成パターンを有する概念辞書を選択し、選択した概念辞書を用いて属性値を抽出することで、文書内容がどのような概念であっても、属性値を確実に抽出することができる。
【００３０】
【発明の実施の形態】
以下、本発明の好ましい実施の形態について図面を参照しながら詳細に説明する。
【００３１】
図１は、本発明の実施の形態に係る構造化文書生成装置の構成を示すブロック図である。
【００３２】
構造化文書生成装置は、キー操作により文字情報を入力するキーボード１と、ポインティングデバイスであるマウス２と、外部との間で情報の入出力を行う入出力ポート３と、構造化文書作成のための演算処理を実行するためのアプリケーションプログラム等を記憶するハードディスクドライブ（Ｈａｒｄ　Ｄｉｓｃ　Ｄｒｉｖｅ）４と、構造化文書作成の演算処理を実行するマイクロコンピュータ５と、マイクロコンピュータ５の演算結果、例えば構造化文書を表示するＬＣＤ（Ｌｉｑｕｉｄ　Ｃｒｙｓｔａｌ　Ｄｉｓｐｌａｙ）６とを備えている。
【００３３】
マイクロコンピュータ５は、データのワークエリアであるＲＡＭ（Ｒａｎｄｏｍ　Ａｃｃｅｓｓ　Ｍｅｍｏｒｙ）、所定の制御プログラムが記憶されているＲＯＭ（Ｒｅａｄ　Ｏｎｌｙ　Ｍｅｍｏｒｙ）、演算処理を実行するＣＰＵ（Ｃｅｎｔｒａｌ　Ｐｒｏｃｅｓｓｉｎｇ　Ｕｎｉｔ）などで構成されている。マイクロコンピュータ５は、キーボード１から入力されたテキスト文書や、入出力ポート３を介して外部から入力されたテキスト文書に基づいて、構造化文書を作成する。なお、図示しない記録媒体に既に記録されているテキスト文書から構造化文書を作成してもよい。
【００３４】
図２は、マイクロコンピュータ５の機能的な構成を示すブロック図である。マイクロコンピュータ５は、テキスト文書からＸＭＬ文書を作成するＸＭＬ文書作成部１０と、語句解析を行う語句解析部２０と、意味を抽出して構造化文書を作成する意味抽出部３０と、を備えている。
【００３５】
ＸＭＬ文書作成部１０は、自然文のテキスト文書と、そのテキスト文書の題名や作成日時等の情報とを用いてＸＭＬ文書を作成し、作成したＸＭＬ文書を語句解析部２０に供給する。なお、テキスト文書は、キーボード１の操作によって生成された文書、入出力ポート３から入力された文書、又はＨＤＤ４やその他の記憶媒体に予め記憶された文書のいずれでもよい。
【００３６】
図３は、語句解析部２０の機能的な構成を示すブロック図である。
【００３７】
語句解析部２０は、単語と品詞種類等との対応関係を示した語句辞書２１と、ＸＭＬ文書の構造タグと自然文である内容文字列とを分離する内容文字列分離部２２と、特定語を分離・置換する特定語分離・置換部２３と、特定語の品詞を決定する特定語品詞決定部２４と、内容文字列の形態素解析を行う形態素解析部２５と、代替語を特定語に置換する代替語置換部２６と、複数の単語から複合語や句を生成する複合語・句生成部２７と、単語や複合語に品詞タグを付与する品詞タグ付与部２８とを備えている。
【００３８】
図４は、語句辞書２１の構成を示す図である。語句辞書２１は、単語に関する辞書と、語句（複合語や句）に関する辞書で構成されている。
【００３９】
語句辞書２１は、単語に関する部分については、単語、品詞種類、代替語で構成されている。なお、代替語は、詳しくは後述するが、特定語の場合に限り使用される。
【００４０】
図４によると、例えば、単語「増幅」について、品詞種類は「名詞−サ変接続」である。これは、「増幅」は名詞であり、サ変動詞と接続することを表している。また、単語「問題」について、品詞種類は「名詞−ナイ形容詞語幹」である。これは、「問題」は名詞であり、「…ない」という形容詞の語幹になることを表している。さらに、単語「時」について、品詞種類は「名詞−接尾−副詞可能」である。これは、「時」は名詞であり、「…時」という接尾語となり、さらに副詞として利用可能であることを表している。単語「が」について、品詞種類は「助詞−格助詞−一般」である。これは、「が」は、一般的な格助詞であることを表している。このように、品詞種類は、対応する単語の品詞の種類だけでなく、対応する単語の属性も表している。
【００４１】
ここで、「単語」は、名詞や助詞等の一般的な単語の他に、語句辞書２１によって新たに定義された単語を示す特定語も含んでいる。
【００４２】
例えば、図４に示す「２Ｓ［Ａ−Ｄ］［０−９］＋」は、２Ｓで始まり、次がＡからＤのいずれか１文字であり、最後が１文字以上の数字で構成された単語を表している。上記条件に該当する単語は、トランジスタの品番を表し、一般的な辞書にない。そこで、語句辞書２１は、上記条件に該当する単語を特定語として定義している。
【００４３】
ここで、単語（特定語）「２Ｓ［Ａ−Ｄ］［０−９］＋」について、品詞種類は「名詞―固有名詞―識別子―品番―トランジスタ」であり、代替語は「代替−トランジスタ品番」である。これは、上記特定語は固有名詞であり、識別子であり、トランジスタの品番であることを表している。また、上記特定語は、それ自体では意味が分からないので、代替語として「トランジスタ品番」があることを表している。
【００４４】
語句辞書２１は、語句（複合語や句）に関する部分については、「語句列」、「品詞種類」で構成され、複合語や句を作成するための辞書としても機能している。
【００４５】
「語句列」は、複合語や句として成立するためのパターンを表している。例えば、図４によると、「（名詞　ａｎｄ　ｎｏｔ（名詞−接尾−副詞可能）ｏｒ記号）＊［２，∞］」は、名詞（ただし、名詞−接尾−副詞可能を除く）、又は記号が、２つ以上連続するパターンであることを表し、このパターンの「品詞種類」は名詞句である。つまり、このような条件に該当する語句列のパターンは、名詞句として取り扱うことを表している。
【００４６】
図５及び図６は、語句解析部２０によって解析されている文書を説明する図である。以下では、図５（Ａ）に示すＸＭＬ文書が語句解析部２０に入力された場合について説明する。
【００４７】
内容文字列分離部２２は、ＸＭＬ文書作成部１０から図５（Ａ）に示すＸＭＬ文書が供給されると、ＸＭＬ文書の構造タグと自然文である内容文字列とを分離する。そして、内容文字列分離部２２は、図５（Ｂ）に示すＸＭＬ文書の構造タグを品詞タグ付与部２８に供給し、図５（Ｃ）に示すＸＭＬ文書の内容文字列を特定語分離・置換部２３に供給する。
【００４８】
特定語分離・置換部２３は、内容文字列分離部２２から供給された内容文字列と語句辞書２１とを照合し、内容文字列に含まれる特定語を代替語に置換して、特定語を分離する。そして、分離した特定語を特定語品詞決定部２４に供給し、特定語から代替語に置換された内容文字列を形態素解析部２５に供給する。
【００４９】
具体的には、特定語分離・置換部２３は、語句辞書２１に基づいて、内容文字列に含まれる特定語「２ＳＣ７７７７」を「代替−トランジスタ品番」に置換する。そして、図５（Ｄ）に示す特定語「２ＳＣ７７７７」を分離して特定語品詞決定部２４に供給し、図５（Ｅ）に示す置換済みの内容文字列「（代替−トランジスタ品番）においてＡ級増幅時に発熱が問題」を形態素解析部２５に供給する。
【００５０】
特定語品詞決定部２４は、語句辞書２１に基づいて特定語の品詞を決定し、特定語及び品詞種類情報を代替語置換部２６に供給する。具体的には、「２ＳＣ７７７７」の品詞を決定し、図５（Ｆ）に示すように、特定語及び品詞種類情報である「２ＳＣ７７７７−名詞―固有名詞―識別子―品番−トランジスタ」を代替語置換部２６に供給する。
【００５１】
形態素解析部２５は、特定語分離・置換部２３から供給された内容文字列に対して、語句辞書２１を参照しながら形態素解析を実行する。具体的には、内容文字列を１つ１つの単語に分解し、分解された各々の単語（特定語も含む。）と品詞種類との対応付けを行う。そして、形態素解析部２５は、図５（Ｇ）に示す各単語及び対応する品詞種類を代替語置換部２６に供給する。なお、図５（Ｇ）において、「時」、「に」、「発熱」、「が」、「問題」の各単語については省略しているが、これらの各単語についても同様に形態素解析を行い、各単語及び対応する品詞種類を代替語置換部２６に供給する。
【００５２】
代替語置換部２６は、特定語品詞決定部２４及び形態素解析部２５から供給された情報に基づいて、形態素解析された各単語の中から代替語を選択し、選択した代替語を元の特定語に置換する。具体的には、「代替−トランジスタ品番」を「２ＳＣ７７７７」に置換する。そして、図６（Ａ）に示す各単語及び対応する品詞種類を複合語・句生成部２７に供給する。なお、図６（Ａ）では、「時」、「に」、「発熱」、「が」、「問題」の各単語については図示を省略している。これらの単語は、後述の図６（Ｂ）でも同様に図示を省略する。
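Paragraphs [0048] to [0052] describe a substitute-and-restore scheme wrapped around morphological analysis. The Python sketch below illustrates only that scheme, under stated assumptions: the part number is written in ASCII (`2SC7777`) rather than full-width characters, the substitute word uses an ASCII hyphen, all function names are hypothetical, and the morphological analysis step itself is not implemented (its output would be supplied by a real analyzer).

```python
import re
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, part-of-speech type)

# Specific-word entry as described for FIG. 4 (ASCII spelling is an assumption).
SPECIFIC_WORD = re.compile(r"2S[A-D][0-9]+")
PLACEHOLDER = "(代替-トランジスタ品番)"
SPECIFIC_POS = "名詞-固有名詞-識別子-品番-トランジスタ"


def separate_and_substitute(text: str) -> Tuple[str, List[str]]:
    """Replace every specific word with the substitute word.

    Returns the substituted text and the removed specific words in order.
    """
    found: List[str] = []

    def _swap(match: "re.Match[str]") -> str:
        found.append(match.group(0))
        return PLACEHOLDER

    return SPECIFIC_WORD.sub(_swap, text), found


def restore_specific_words(tokens: List[Token], found: List[str]) -> List[Token]:
    """Put the original specific words back after morphological analysis."""
    remaining = iter(found)
    restored: List[Token] = []
    for surface, pos in tokens:
        if surface == PLACEHOLDER:
            # The specific word takes the POS type decided from the dictionary.
            restored.append((next(remaining), SPECIFIC_POS))
        else:
            restored.append((surface, pos))
    return restored
```

Calling `separate_and_substitute("2SC7777においてA級増幅時に発熱が問題")` yields the substituted string and the list `["2SC7777"]`; after analysis, `restore_specific_words` swaps the placeholder token back to the part number.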
【００５３】
複合語・句生成部２７は、代替語置換部２６から供給された連続する単語の中に、語句辞書２１の複合語・句に該当する単語の部分列があるかを判定し、該当する単語の部分列があるときは、該当する全部の部分列を複合語・句に置換する。
【００５４】
具体的には、単語の部分列「Ａ」、「級」、「増幅」が、図４に示す語句辞書２１の名詞句の語句列パターンに該当している。そこで、複合語・句生成部２７は、単語の部分列「Ａ」、「級」、「増幅」から、名詞句である「Ａ級増幅」を生成し、図６（Ｂ）に示す各単語及び対応する品詞種類を品詞タグ付与部２８に供給する。なお、複合語・句生成部２７は、語句辞書２１の複合語・句に該当する単語の部分列がないときは、代替語置換部２６から供給された情報をそのまま品詞タグ付与部２８に供給する。
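The phrase pattern of paragraph [0045], applied as in paragraphs [0053] and [0054], can be sketched in Python as follows. This is an illustration, not the patented implementation: the function names are hypothetical, part-of-speech types use an ASCII hyphen, and the types given to 「A」 and 「級」 in the usage note are assumptions because the text does not state them.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, part-of-speech type)


def is_phrase_member(pos: str) -> bool:
    """(noun and not noun-suffix-adverbial) or symbol, per the FIG. 4 pattern."""
    is_noun = pos.startswith("名詞")
    is_excluded = pos.startswith("名詞-接尾-副詞可能")
    return (is_noun and not is_excluded) or pos.startswith("記号")


def merge_noun_phrases(tokens: List[Token]) -> List[Token]:
    """Replace each run of two or more qualifying tokens with one noun phrase."""
    merged: List[Token] = []
    run: List[Token] = []

    def flush() -> None:
        if len(run) >= 2:  # the pattern requires at least two members
            merged.append(("".join(surface for surface, _ in run), "名詞句"))
        else:
            merged.extend(run)
        run.clear()

    for token in tokens:
        if is_phrase_member(token[1]):
            run.append(token)
        else:
            flush()
            merged.append(token)
    flush()
    return merged
```

With the tokens 「A」 (assumed 記号-アルファベット), 「級」 (assumed 名詞-接尾-一般), 「増幅」 (名詞-サ変接続) followed by 「時」 (名詞-接尾-副詞可能), the first three merge into the noun phrase 「A級増幅」 while 「時」 stays separate because its type is excluded by the pattern.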
【００５５】
品詞タグ付与部２８は、複合語・句生成部２７から供給された各単語（複合語・句も含む。）に対して、各々の品詞種類を示す品詞タグを付与する。そして、各単語と各品詞タグとを、内容文字列分離部２２から供給された構造タグ＜内容＞の要素として埋め込む。この結果、品詞タグ付与部２８は、図６（Ｃ）に示すように、語句解析済みのＸＭＬ文書を生成して出力する。
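As a rough illustration of paragraph [0055], the Python sketch below embeds tagged words as children of a structure tag using the standard library's `xml.etree.ElementTree`. The exact markup of FIG. 6(C) is not reproduced in the text, so the element name `語` and the attribute name `品詞` are assumptions chosen for this example, as is the function name.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, part-of-speech type)


def embed_pos_tags(structure_tag: str, tokens: List[Token]) -> ET.Element:
    """Build <structure_tag> whose children carry each word and its POS type."""
    parent = ET.Element(structure_tag)
    for surface, pos in tokens:
        # Hypothetical markup: one <語> element per word, POS as an attribute.
        child = ET.SubElement(parent, "語", {"品詞": pos})
        child.text = surface
    return parent
```

`ET.tostring(embed_pos_tags("内容", tokens), encoding="unicode")` then serializes the phrase-analyzed element for the next stage.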
【００５６】
このように、語句解析部２０は、一般的なＸＭＬ文書から構造タグと内容文字列（自然文）とを分離し、内容文字列に対して形態素解析を行い、内容文字列の各単語に品詞タグを付与して、各単語と品詞タグとを元のＸＭＬ文書に埋め込む処理を行う。すなわち、語句解析部２０は、自然文に形態素解析を行って、自然文を構成する各々の単語に品詞タグを付与することで、自然文の文書構成を明確にして、意味を抽出しやすい文書を出力することができる。
【００５７】
また、語句解析部２０は、語句辞書２１に新たに定義した特定語とその代替語を登録しておくことで、形態素解析の際には特定語の代わりに代替語を使用し、形態素解析後には代替語を元の特定語に置換することで、一般的な辞書にはないような専門用語・技術用語であっても、正確に形態素解析を行うことができる。
【００５８】
なお、語句辞書２１は、図４に示した構成に限定されるものではなく、その他の単語や品詞種類についても記憶することができる。また、語句辞書２１は、複数の名詞から名詞句を作成するパターンを記憶するだけでなく、形容詞句や副詞句等のその他の複合語・句を生成するためのパターンも記憶可能であるのは勿論である。
【００５９】
図７は、意味抽出部３０の機能的な構成を示すブロック図である。意味抽出部３０は、語句解析済みのＸＭＬ文書から意味を抽出して、抽出された意味を構造化したＸＭＬ文書を生成する。
【００６０】
意味抽出部３０は、属性値の抽出パターンと属性値の出力形式とを様々な概念毎に記憶した概念辞書３１と、パターン照合を行って複数の項目概念辞書から最適な項目概念辞書を選択するパターン照合部３２と、選択した項目概念辞書を用いて属性値を抽出する属性値抽出部３３と、抽出した属性値と対応するタグとをＸＭＬ文書に整形して出力する整形部３４とを備えている。
【００６１】
図８は、概念辞書３１の構成を示すブロック図である。概念辞書は、例えば、「現状の問題」、「解決策」などの様々な概念（項目）毎に構成された複数の項目概念辞書で構成されている。ここでは、「現状の問題」を例に挙げながら説明する。
【００６２】
項目概念辞書は、当該項目概念辞書の項目を表す「項目」、当該「項目」の類義語を表す「類義語」、「項目を説明する属性」、抽出した属性値のＸＭＬ出力形式を表す「ＸＭＬ出力形式」で構成されている。
【００６３】
「項目を説明する属性」は、「属性名」、「タグパターン」で構成されている。「属性名」は、どのような属性値を抽出するかを表すものである。「タグパターン」は、属性値を含んだ単語列の各々の品詞タグの構成パターンと、当該構成パターンの中に含まれる属性値の位置を表している。
【００６４】
ここで、「名詞＊」は、任意の名詞又は名詞句であることを表している。また、「＄」とそれに続く属性名の組は、タグパターンのその位置に該当する語句の字面がその属性の値になることを表している。
【００６５】
例えば図８において、「問題」は、語句解析済みＸＭＬ文書の中から、どのような「問題」かを説明する属性値を抽出することを示している。「問題」の「タグパターン」は、「問題」を説明する属性値を含んだ単語列の品詞タグ構成パターンを表している。
【００６６】
なお、「問題」の「タグパターン」は、最初の単語の品詞タグは名詞又は名詞句であり、次の単語の品詞タグは任意でよいが、当該次の単語は「が」であることを示している。ここで、最初の品詞タグの要素に「＄問題」があるので、最初の品詞タグに対応する単語が属性名「問題」の値（属性値）になる。
【００６７】
また、「対象」は、語句解析済みＸＭＬ文書の中から、どのような「対象」かを説明する属性値を抽出することを示している。「対象」の「タグパターン」は、最初の単語の品詞タグは名詞又は名詞句であり、次の単語の品詞タグは任意でよいが、当該次の単語（単語列）は「において」であることを示している。また、最初の品詞タグの要素に「＄対象」があるので、最初の品詞タグに対応する単語が属性名「対象」の値（属性値）になる。
【００６８】
さらに、「問題発生の状況」は、語句解析済みＸＭＬ文書の中から、どのような「問題発生の状況」かを説明する属性値を抽出することを示している。「問題発生の状況」の「タグパターン」は、最初の単語の品詞タグは名詞又は名詞句であり、次の単語の品詞タグは「名詞−接尾−副詞可能」（副詞として利用可能な接尾語となる名詞）であり、最後の単語の品詞タグは任意でよいが、当該最後の単語は「に」であることを示している。また、最初の品詞タグの要素に「＄問題発生の状況」があるので、最初の品詞タグに対応する単語が属性名「問題発生の状況」の値（属性値）になる。
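The three tag patterns of paragraphs [0063] to [0068] can be written down as data and matched against a tagged word sequence. The Python sketch below is one possible encoding, not the format of FIG. 8: the `Slot` type, the function names, the ASCII hyphens, and the treatment of 「において」 as a single token are all assumptions made for illustration.

```python
from typing import List, NamedTuple, Optional, Tuple

Token = Tuple[str, str]  # (surface form, part-of-speech tag)


class Slot(NamedTuple):
    pos: Optional[str] = None      # "名詞*" = any noun or noun phrase; None = any tag
    surface: Optional[str] = None  # required word; None = any word
    capture: bool = False          # True where the pattern has "$attribute name"


TAG_PATTERNS = {
    "問題": [Slot("名詞*", capture=True), Slot(surface="が")],
    "対象": [Slot("名詞*", capture=True), Slot(surface="において")],
    "問題発生の状況": [
        Slot("名詞*", capture=True),
        Slot("名詞-接尾-副詞可能"),
        Slot(surface="に"),
    ],
}


def slot_matches(slot: Slot, token: Token) -> bool:
    surface, pos = token
    if slot.pos is None:
        pos_ok = True
    elif slot.pos == "名詞*":
        pos_ok = pos.startswith("名詞")  # covers 名詞-... and 名詞句
    else:
        pos_ok = pos == slot.pos
    return pos_ok and (slot.surface is None or surface == slot.surface)


def extract_attribute(pattern: List[Slot], tokens: List[Token]) -> Optional[str]:
    """Return the word at the capture slot of the first matching window."""
    size = len(pattern)
    for start in range(len(tokens) - size + 1):
        window = tokens[start:start + size]
        if all(slot_matches(s, t) for s, t in zip(pattern, window)):
            for slot, (surface, _) in zip(pattern, window):
                if slot.capture:
                    return surface
    return None
```

Applied to the tagged form of "2SC7777においてA級増幅時に発熱が問題", the three patterns pick out 「発熱」, 「2SC7777」, and 「A級増幅」, the same attribute values the description gives in paragraph [0076].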
【００６９】
「ＸＭＬ出力形式」は、抽出した各属性値を「＄属性名」の箇所と置き換えたＸＭＬ文書を出力することを表したものである。
【００７０】
例えば図８の「ＸＭＬ出力形式」の１行目は、「問題」の属性値と置き換えて出力することを表している。なお、＜現状の問題＞の要素として組み込まれている＜対象＞、＜問題発生時＞についても同様にして、各々対応する要素を形成して構造タグと共に出力することを表している。
【００７１】
そして、パターン照合部３２、属性値抽出部３３、整形部３４は、以上のように構成された概念辞書３１を用いて、以下のような処理を実行する。
【００７２】
パターン照合部３２は、語句解析部２０から供給された語句解析済みＸＭＬ文書と、概念辞書３１の各々の項目概念辞書との照合を行って、前記ＸＭＬ文書に対応する項目概念辞書を選択する。具体的には、各々の項目概念辞書の中から、すべてのタグパターンが前記文書の品詞タグの構成パターンと完全に一致する項目概念辞書を選択する。そして、パターン照合部３２は、選択した項目概念辞書の「項目」を属性値抽出部３３に供給すると共に、項目概念辞書と照合した部分のＸＭＬ文書を属性値抽出部３３に供給する。
【００７３】
例えば、ここでは図６（Ｃ）に示した文書の品詞タグの構成パターンと、図８に示した項目概念辞書「現状の問題」のすべての「タグパターン」とが一致する。そこで、パターン照合部３２は、項目名「現状の問題」及び図６（Ｃ）の＜内容＞の要素を属性値抽出部３３に供給する。
【００７４】
なお、語句解析部２０から供給される文書の品詞タグの構成パターンと、すべてのタグパターンとが完全に一致する項目概念辞書が複数存在する場合、項目概念辞書のタグパターンの条件を更に制限すればよい。例えば、任意の名詞又は名詞句を表す「名詞＊」の代わりに、固有名詞を表す「名詞−固有名詞」としてもよいし、その他の条件を制限してもよい。
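The selection rule of paragraphs [0072] to [0074], where an item concept dictionary is chosen only if all of its tag patterns match the document, can be sketched in Python as below. The names and the pattern encoding are hypothetical, and the 「解決策」 pattern is invented purely to give the selector a second candidate. The function returns every matching item so that the ambiguous case of paragraph [0074] stays visible to the caller.

```python
from typing import Dict, List, Optional, Tuple

Token = Tuple[str, str]                      # (surface form, part-of-speech tag)
Slot = Tuple[Optional[str], Optional[str]]   # (POS prefix or None, word or None)
Pattern = List[Slot]


def slot_ok(slot: Slot, token: Token) -> bool:
    pos_prefix, word = slot
    surface, pos = token
    return (pos_prefix is None or pos.startswith(pos_prefix)) and (
        word is None or surface == word
    )


def pattern_found(pattern: Pattern, tokens: List[Token]) -> bool:
    """True if some contiguous window of tokens satisfies every slot."""
    size = len(pattern)
    return any(
        all(slot_ok(s, t) for s, t in zip(pattern, tokens[i:i + size]))
        for i in range(len(tokens) - size + 1)
    )


def select_items(dictionaries: Dict[str, List[Pattern]], tokens: List[Token]) -> List[str]:
    """Return the items whose tag patterns ALL match the tagged document."""
    return [
        item
        for item, patterns in dictionaries.items()
        if all(pattern_found(p, tokens) for p in patterns)
    ]


ITEM_DICTIONARIES: Dict[str, List[Pattern]] = {
    "現状の問題": [
        [("名詞", None), (None, "が")],
        [("名詞", None), (None, "において")],
        [("名詞", None), ("名詞-接尾-副詞可能", None), (None, "に")],
    ],
    # Hypothetical second item, not taken from the source figures.
    "解決策": [[("名詞", None), (None, "により")]],
}
```

If `select_items` returns more than one item, the description's remedy is to tighten the patterns, for example by requiring 名詞-固有名詞 instead of any noun.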
【００７５】
属性値抽出部３３は、パターン照合部３２から供給された「項目」が示す項目概念辞書を概念辞書３１から読み出し、パターン照合部３２から供給された文書の中から、項目概念辞書のタグパターンに該当する単語列を探し出す。そして、単語列の中から、タグパターンの「＄属性値」に対応する単語を属性値として抽出する。
【００７６】
具体的には、属性値抽出部３３は、図６（Ｃ）の＜内容＞の要素の中から、図８に示す３つのタグパターンにそれぞれ対応する単語列を探し出す。そして、探し出した各々の単語列から、各々のタグパターンの「＄属性値」に対応する単語「発熱」、「２ＳＣ７７７７」、「Ａ級増幅」　を抽出し、これらを属性値として整形部３４に供給する。
【００７７】
整形部３４は、属性値抽出部３３から供給された各々の属性値を、概念辞書３１の出力形式に従って整形してＸＭＬ文書を生成し、このＸＭＬ文書を外部に出力する。
【００７８】
具体的には、「問題」を説明する属性値「発熱」、「対象」を説明する属性値「２ＳＣ７７７７」、「問題発生の状況」を説明する属性値「Ａ級増幅」を、「ＸＭＬ出力形式」のそれぞれ対応する「＄属性値」に代入して要素を形成し、形成された要素と構造タグとをＸＭＬ文書形式で出力する。
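The shaping step of paragraphs [0069] to [0078] amounts to substituting extracted attribute values for the `$attribute name` placeholders of an output template. The Python sketch below shows one way to do that. The template text is an assumption modelled on the tag names mentioned in paragraph [0070] (FIG. 8 itself is not reproduced here), and the function name is hypothetical.

```python
from typing import Dict
from xml.sax.saxutils import escape

# Assumed template; only the tag names come from the description.
XML_OUTPUT_FORMAT = (
    "<現状の問題>"
    "<問題>$問題</問題>"
    "<対象>$対象</対象>"
    "<問題発生時>$問題発生の状況</問題発生時>"
    "</現状の問題>"
)


def shape(template: str, values: Dict[str, str]) -> str:
    """Replace each $name placeholder with its XML-escaped attribute value."""
    result = template
    # Longest names first: "$問題" is a prefix of "$問題発生の状況".
    for name in sorted(values, key=len, reverse=True):
        result = result.replace("$" + name, escape(values[name]))
    return result
```

Plain `str.replace` is used because the default placeholder syntax of Python's `string.Template` accepts only ASCII identifiers. Sorting by length avoids corrupting the longer placeholder when one attribute name is a prefix of another.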
【００７９】
図９は、意味抽出部３０から出力された意味抽出済みのＸＭＬ文書を示す図である。従来は、所定の文書から構造化文書を生成することができたが、当該所定の文書に自然文が含まれていた場合は、その自然文についてはそのまま出力されていた。つまり、従来の構造化文書は、構造タグの要素の中に自然文を含んでいた。これに対して、図９に示す意味抽出済みＸＭＬ文書は、構造タグの要素の中に含まれていた自然文の意味が抽出されて、抽出された意味が構造化された文書になっている。
【００８０】
以上のように、本実施の形態に係る構造化文書生成装置は、ＸＭＬ文書に含まれる自然文に形態素解析を行って各単語に品詞種類を付与し、これらの品詞種類の構成パターンと、属性値を含む単語列の品詞種類構成を表すタグパターンとを照合して、合致する単語列から属性値を抽出する。そして、抽出した属性値を予め定めたＸＭＬ出力形式に従って出力することで、自然文が構造化されたＸＭＬ文書を生成することができる。
【００８１】
構造化文書生成装置は、特に、様々な概念（項目）毎に各々の属性値を抽出するためのタグパターンを予め用意しているので、タグパターンを用いて最初に自然文の概念、つまり項目概念辞書を特定し、更に、その概念を説明するための属性値を自然文から抽出することができる。
【００８２】
また、構造化文書生成装置は、様々な概念（項目）毎に、その概念を説明するのに必要な属性値と構造タグとのＸＭＬ出力形式を予め用意しているので、抽出した属性値をＸＭＬ出力形式に従って出力するだけで、抽出された属性値を構造化したＸＭＬ文書を容易に生成することができる。
【００８３】
構造化文書生成装置によって生成されたＸＭＬ文書は、図９に示すように、自然文の概念が構造化タグと属性値によって説明された構造化文書であるので、検索に直接利用できたり、内容の問い合わせに対して応答しやすい文書である。以下、ＸＭＬ文書の応用例について説明する。
【００８４】
（応用例１）
多数の文書が記憶されているデータベースから、「製品の重量が減少する」ことを記述した文書を検索する場合について説明する。
【００８５】
図１０（Ａ）はデータベースに記憶されている従来の自然文の文書を示す図、（Ｂ）は従来の検索結果を示す図、（Ｃ）はデータベースに記憶されている本願のＸＭＬ文書を示す図である。本願のＸＭＬ文書とは、（Ａ）に示す従来の自然文の文書から、上述した構造化文書生成装置によって生成されたＸＭＬ文書をいう。つまり、（Ａ）及び（Ｃ）の文書は、同じ内容であり、製品の重量が増加することを示唆している。
【００８６】
ここで、「製品の重量が減少する」ことを記述した文書を一般的な手法で検索する場合、検索のキーワードとして、通常では「製品」、「重量」、「減少」を使用する。データベースに従来の文書が記憶されている場合、（Ｂ）に示すように、「製品」、「重量」、「減少」の単語を含む従来の文書を誤って検索してしまう。
【００８７】
一方、データベースに本願のＸＭＬ文書が記憶されている場合、構造タグを順次追っていけばよい。ここでは、最初に「製品の重量」に関する構造タグを検索し、そしてその構造タグの要素から製品の重量が増加したか減少したかを表す構造タグを検索する。具体的には、最初に＜製品の重量＞を検索し、その要素の中から重量の変化を表す＜方向＞を検索する。本願のＸＭＬ文書は、＜方向＞の要素が「増加」となっているので、誤って検索されることはない。
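The contrast drawn in paragraphs [0086] and [0087] between keyword search and following structure tags can be illustrated with a small Python sketch. The sample sentence and the wrapping `<文書>` element are invented for this example because FIG. 10 is not reproduced in the text; only the tags 〈製品の重量〉 and 〈方向〉 and the value 「増加」 come from the description. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

# Hypothetical natural-sentence document: it contains the word 減少,
# yet it says that the product weight increased.
NATURAL_TEXT = "部品点数の減少を図ったが、製品の重量は増加した。"

# Hypothetical structured counterpart using the tags named in [0087].
STRUCTURED_TEXT = "<文書><製品の重量><方向>増加</方向></製品の重量></文書>"


def keyword_hit(text: str, keywords: List[str]) -> bool:
    """Conventional search: hit when every keyword occurs somewhere."""
    return all(keyword in text for keyword in keywords)


def weight_direction(xml_text: str) -> Optional[str]:
    """Structured search: find <製品の重量>, then read its <方向> element."""
    root = ET.fromstring(xml_text)
    for weight in root.iter("製品の重量"):
        direction = weight.find("方向")
        if direction is not None:
            return direction.text
    return None
```

Searching for "the product weight decreases", `keyword_hit` returns `True` for the natural sentence, which is the false hit the description warns about, whereas `weight_direction` returns 「増加」 and the document is correctly excluded.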
【００８８】
したがって、構造化文書生成装置は、通常のキーワード検索では誤検索してしまうような文書であっても、その文書から誤検索しないようなＸＭＬ文書を生成することができる。
【００８９】
（応用例２）
データベースの１つの文書に対して、「加工不良に対する今回の対策は何か？」という問い合わせをする場合について説明する。
【００９０】
図１１（Ａ）はデータベースに記憶されている従来の自然文の文書を示す図、（Ｂ）はデータベースに記憶されている本願のＸＭＬ文書を示す図である。なお、（Ａ）及び（Ｂ）の文書は、同じ内容である。
【００９１】
データベースに従来の文書が記憶されている場合、「加工不良に対する今回の対策は何か？」という問い合わせをしても、何ら応答することができない。
【００９２】
一方、データベースに本願のＸＭＬ文書が記憶されている場合、上記問い合わせに関連する構造タグを追っていけばよい。ここでは、最初に「対策」に関する構造タグを検索し、そしてその構造タグの要素（下位の構造タグ及びその要素）を抽出する。そして、抽出した構造タグ及び要素を組み合わせると、上記問い合わせに対して、「厚さを２ｍｍ増加させた」と答えることができる。
【００９３】
したがって、構造化文書生成装置は、自然文の意味を抽出して構造化することで、問い合わせに対して容易に応答できるようなＸＭＬ文書を生成することができる。
【００９４】
なお、本発明は、上述した実施の形態に限定されるものではなく、特許請求の範囲に記載された範囲内で種々の設計上の変更を行うことができる。
【００９５】
例えば、概念辞書３１は、図８に示すような構成に限定されるものではない。本実施の形態では、１つの項目概念辞書に３つのタグパターンがある場合を例に挙げて説明したが、抽出すべき属性値の数と同じ数だけタグパターンを設けることができる。
【００９６】
また、本実施の形態では、ＸＭＬ文書を例に挙げて説明したが、例えばＳＧＭＬ文書であってもよい。このとき、項目概念辞書の「ＸＭＬ出力形式」を「ＳＧＭＬ出力形式」にすればよい。なお、自然文から構造化文書を作成することを考慮すれば、文書形式は特に限定されるものではない。
【００９７】
【発明の効果】
本発明に係る構造化文書生成装置及び構造化文書生成プログラムは、自然文を含む文章に基づいて、当該文章の意味的構造によって構成される構造化文書を自動的に生成することができる。
【００９８】
また、本発明に係る構造化文書生成装置及び構造化文書生成プログラムは、語句解析済み文書の単語列の品詞種類の構成パターンと、抽出対象となる属性値を含んだ単語列の品詞種類の構成を表す品詞種類構成パターンとを照合して属性値を抽出し、抽出された属性値を出力形式情報に基づいて構造化して構造化文書を生成することにより、自然文から意味が抽出された構造化文書を得ることができる。
【図面の簡単な説明】
【図１】本発明の実施の形態に係る構造化文書生成装置の構成を示すブロック図である。
【図２】マイクロコンピュータの機能的な構成を示すブロック図である。
【図３】語句解析部の機能的な構成を示すブロック図である。
【図４】語句辞書の構成を示す図である。
【図５】語句解析部によって解析されている文書を説明する図である。
【図６】語句解析部によって解析されている文書を説明する図である。
【図７】意味抽出部の機能的な構成を示すブロック図である。
【図８】概念辞書の構成を示すブロック図である。
【図９】意味抽出部から出力された意味抽出済みのＸＭＬ文書を示す図である。
【図１０】（Ａ）はデータベースに記憶されている従来の自然文の文書を示す図、（Ｂ）は従来の検索結果を示す図、（Ｃ）はデータベースに記憶されている本願のＸＭＬ文書を示す図である。
【図１１】（Ａ）はデータベースに記憶されている従来の自然文の文書を示す図、（Ｂ）はデータベースに記憶されている本願のＸＭＬ文書を示す図である。
【符号の説明】
２０　語句解析部
２１　語句辞書
２２　内容文字列分離部
２３　特定語分離・置換部
２４　特定語品詞決定部
２５　形態素解析部
２６　代替語置換部
２７　複合語・句生成部
２８　品詞タグ付与部
３０　意味抽出部
３１　概念辞書
３２　パターン照合部
３３　属性値抽出部
３４　整形部
[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a structured document generation device and a structured document generation program, and in particular to a device and program suitable for, for example, generating data that can be used directly for search.
[0002]
Prior art and problems to be solved by the invention
Japanese Patent Application Laid-Open No. 2001-290801 proposes a structured-documentation system, a structured-documentation program, and a computer-readable storage medium (hereinafter referred to as “prior art 1”) that take a text-format document as input and output a structured document by using the structure of that document.
[0003]
Prior art 1 aims, for example, to restructure a text-format document so that it can be displayed as a Web page. To decompose the structure of a document, a document decomposition definition file is defined in advance, and the document is structured using the patterns described in that definition file. The document structure generated in this way takes a surface-level form such as title, author, and date.
[0004]
However, the portion written as natural sentences in the body is output as is, without any conversion, so it is not in a form that a computer system can use. As a result, in a structured document output by prior art 1, only the surface-level information is used in a database, while the specific content, such as the body written in natural sentences, is not.
[0005]
The present invention has been proposed to solve the above problem. Its object is to provide a structured document generation device and a structured document generation program that generate, from natural sentences, a structured document composed of a semantic structure that a computer can easily recognize.
[0006]
[Means for Solving the Problems]
To solve the above problem, the invention according to claim 1 comprises: text input means for inputting text that includes natural sentences; and structured document generation means for generating a structured document composed of the semantic structure of the text input by the text input means.
[0007]
Here, the invention described in claim 1 is configured by installing the invention described in claim 4 on a computer.
[0008]
That is, the invention according to claim 4 is a structured document generation program that causes a computer to execute processing comprising: a text input step of inputting text that includes natural sentences; and a structured document generation step of generating a structured document composed of the semantic structure of the text input in the text input step.
[0009]
Therefore, according to the inventions of claims 1 and 4, a structured document composed of the semantic structure of text that includes natural sentences can be generated automatically from that text.
[0010]
According to the invention of claim 2, in the invention of claim 1, the structured document generation means generates, based on the concept of the text input by the text input means, structured text composed of a semantic structure having a phrase that represents the concept and an attribute value that describes the phrase.
[0011]
Here, the invention described in claim 2 is configured by installing the invention described in claim 5 on a computer.
[0012]
That is, according to the invention of claim 5, in the invention of claim 4, the structured document generation step generates, based on the concept of the text input in the text input step, structured text composed of a semantic structure having a phrase that represents the concept and an attribute value that describes the phrase.
[0013]
Therefore, according to the inventions of claims 2 and 5, the main phrase representing the concept of the text and the attribute values describing that phrase are extracted based on the concept, and structured text composed of a semantic structure having the main phrase and the attribute values can be generated.
[0014]
According to the invention of claim 3, the invention of claim 2 further comprises concept dictionary storage means for storing a concept dictionary that represents a phrase expressing a concept, an attribute value describing the phrase, and a pattern in which the phrase and the attribute value are written in text. The structured document generation means generates the structured document composed of the semantic structure by matching the text input by the text input means against the concept dictionary stored in the concept dictionary storage means.
[0015]
Here, the invention described in claim 3 is configured by installing the invention described in claim 6 on a computer.
[0016]
That is, according to the invention of claim 6, in the invention of claim 5, the structured document generation step generates the structured document composed of the semantic structure by matching the text input in the text input step against a concept dictionary that represents a phrase expressing a concept, an attribute value describing the phrase, and a pattern in which the phrase and the attribute value are written in text.
[0017]
Therefore, according to the inventions of claims 3 and 6, by using a concept dictionary that represents in advance the patterns in which phrases and attribute values are written in text, phrases and attribute values can be extracted from a document and a structured document composed of a semantic structure can be generated.
[0018]
The invention according to claim 7 comprises: phrase analysis means for performing phrase analysis on a document that includes natural sentences and outputting a phrase-analyzed document in which the words constituting the document are associated with part-of-speech types; concept dictionary storage means for storing a concept dictionary having a part-of-speech type configuration pattern, which represents the part-of-speech types that make up a word string containing an attribute value to be extracted, and output format information for structuring and outputting the attribute value and its corresponding tag; attribute value extraction means for matching the part-of-speech type configuration pattern of a word string in the phrase-analyzed document against the part-of-speech type configuration pattern of the concept dictionary stored in the concept dictionary storage means, and extracting the attribute value from the word string of the phrase-analyzed document that corresponds to the pattern; and document generation means for generating a structured document based on the attribute value extracted by the attribute value extraction means and the output format information of the concept dictionary stored in the concept dictionary storage means.
[0019]
The invention according to claim 7 is configured by installing the invention according to claim 9 on a computer.
[0020]
The invention according to claim 9 is a structured document generation program that causes a computer to execute processing comprising: a phrase analysis step of performing phrase analysis on a document that includes natural sentences and outputting a phrase-analyzed document in which the words constituting the document are associated with part-of-speech types; an attribute value extraction step of matching the part-of-speech type configuration pattern of a word string in the phrase-analyzed document against a part-of-speech type configuration pattern representing the part-of-speech types that make up a word string containing an attribute value to be extracted, and extracting the attribute value from the word string of the phrase-analyzed document that corresponds to the pattern; and a document generation step of generating a structured document based on the extracted attribute value and on output format information for structuring and outputting the attribute value and its corresponding tag.
[0021]
The phrase analysis means performs phrase analysis on a document that includes natural sentences. Here, a document that includes natural sentences may be a document consisting only of natural sentences, or a document in a predetermined format that has natural sentences and predetermined tags. The phrase analysis means associates each word constituting the document with a part-of-speech type, and outputs a phrase-analyzed document in which each word is associated with its part-of-speech type.
[0022]
The concept dictionary storage unit stores a concept dictionary having a part of speech type configuration pattern and output format information. Here, the part-of-speech type configuration pattern represents the configuration of the part-of-speech type of a word string including an attribute value to be extracted when an attribute value is to be extracted from a document. The output format information indicates the output format of the document, and is information that outputs the attribute values and the corresponding tags in a structured manner.
[0023]
The attribute value extraction means matches the part-of-speech type configuration pattern of word strings in the phrase-analyzed document against the part-of-speech type configuration pattern of the concept dictionary, and finds in the phrase-analyzed document a word string that corresponds to the pattern. It then extracts the attribute value from the word string it has found.
The document generation means generates a structured document in a predetermined format based on the extracted attribute values and the output format information of the concept dictionary. Even though the input was originally natural sentences, this structured document is a document in a format made up of attribute values and tags.
[0024]
Therefore, according to the inventions of claims 7 and 9, attribute values are extracted by matching the part-of-speech type configuration pattern of word strings in the phrase-analyzed document against a part-of-speech type configuration pattern representing the part-of-speech types that make up a word string containing an attribute value to be extracted, and a structured document is generated by structuring the extracted attribute values according to the output format information. A structured document in which the meaning has been extracted from natural sentences can thus be obtained.
[0025]
According to the invention of claim 8, in the invention of claim 7, the concept dictionary storage means stores a concept dictionary provided for each concept, and the attribute value extraction means selects, from among the plurality of concept dictionaries stored in the concept dictionary storage means, a concept dictionary whose part-of-speech type configuration pattern matches the part-of-speech type configuration pattern of the phrase-analyzed document, and extracts attribute values using the selected concept dictionary.
[0026]
The invention described in claim 8 is configured by installing the invention described in claim 10 on a computer.
[0027]
According to the invention of claim 10, in the invention of claim 9, the attribute value extraction step selects, from among a plurality of concept dictionaries provided for each concept, a concept dictionary whose part-of-speech type configuration pattern matches that of the phrase-analyzed document, and extracts attribute values using the selected concept dictionary.
[0028]
The concept dictionary storage means stores a separate concept dictionary for each of various concepts. The attribute value extraction means identifies the concept of the phrase-analyzed document by selecting, from among these concept dictionaries, the one whose part-of-speech type configuration pattern matches that of the phrase-analyzed document. It then extracts attribute values using the selected concept dictionary, and so obtains the attribute values relating to the identified concept.
[0029]
Therefore, according to the eighth and tenth aspects of the present invention, a concept dictionary having a part-of-speech type configuration pattern that matches the part-of-speech type configuration pattern of the word-parsed document is selected and selected from various concept dictionaries. By extracting attribute values using the concept dictionary, attribute values can be reliably extracted regardless of the concept of the document content.
[0030]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the drawings.
[0031]
FIG. 1 is a block diagram illustrating a configuration of a structured document generation device according to an embodiment of the present invention.
[0032]
The structured document generation device includes a keyboard 1 for entering character information by key operation, a mouse 2 serving as a pointing device, an input/output port 3 for exchanging information with the outside, a hard disk drive (HDD) 4 that stores application programs and the like for executing the arithmetic processing for structured document creation, a microcomputer 5 that executes that arithmetic processing, and an LCD (Liquid Crystal Display) 6 that displays the results computed by the microcomputer 5, for example a structured document.
[0033]
The microcomputer 5 is composed of a RAM (Random Access Memory) serving as a data work area, a ROM (Read Only Memory) storing a predetermined control program, a CPU (Central Processing Unit) that executes arithmetic processing, and so on. The microcomputer 5 creates a structured document based on a text document entered from the keyboard 1 or a text document input from the outside via the input/output port 3. A structured document may also be created from a text document already recorded on a recording medium (not shown).
[0034]
FIG. 2 is a block diagram showing the functional configuration of the microcomputer 5. The microcomputer 5 includes an XML document creation unit 10 that creates an XML document from a text document, a phrase analysis unit 20 that performs phrase analysis, and a meaning extraction unit 30 that extracts meaning and creates a structured document.
[0035]
The XML document creation unit 10 creates an XML document using a text document of a natural sentence and information such as the title and creation date and time of the text document, and supplies the created XML document to the phrase analysis unit 20. The text document may be a document generated by operating the keyboard 1, a document input from the input / output port 3, or a document stored in the HDD 4 or another storage medium in advance.
[0036]
FIG. 3 is a block diagram illustrating a functional configuration of the phrase analyzing unit 20.
[0037]
The phrase analysis unit 20 includes: a phrase dictionary 21 that gives the correspondence between words and their part-of-speech types and the like; a content character string separation unit 22 that separates the structure tags of an XML document from the content character strings, which are natural sentences; a specific word separation/replacement unit 23 that separates and replaces specific words; a specific word part-of-speech determination unit 24 that determines the part of speech of a specific word; a morphological analysis unit 25 that performs morphological analysis of the content character string; a substitute word replacement unit 26 that replaces substitute words with the original specific words; a compound word/phrase generation unit 27 that generates compound words and phrases from a plurality of words; and a part-of-speech tag attaching unit 28 that attaches part-of-speech tags to words and compound words.
[0038]
FIG. 4 is a diagram showing the configuration of the phrase dictionary 21. The phrase dictionary 21 includes a dictionary relating to words and a dictionary relating to phrases (compound words and phrases).
[0039]
The part of the phrase dictionary 21 relating to words is composed of “word”, “part-of-speech type”, and “alternative word”. The alternative word will be described later in detail, but is used only in the case of a specific word.
[0040]
According to FIG. 4, for example, the part-of-speech type of the word “amplification” is “noun-sa transformation connection”. This indicates that “amplification” is a noun that can connect to the sa-column irregular verb “suru” to form a verb. The part-of-speech type of the word “problem” is “noun-nai adjective stem”. This indicates that “problem” is a noun that serves as the stem of an adjective of the form “... nai”. Further, the part-of-speech type of the word “time” is “noun-suffix-adverb possible”. This indicates that “time” is a noun, acts as a suffix in the form “... time”, and can be used as an adverb. The part-of-speech type of the word “ga” is “particle-case particle-general”. This indicates that “ga” is a general case particle. As described above, the part-of-speech type indicates not only the part of speech of the corresponding word but also the attributes of that word.
[0041]
Here, the “word” includes not only general words such as nouns and particles, but also specific words indicating words newly defined by the phrase dictionary 21.
[0042]
For example, “2S[A-D][0-9]+” shown in FIG. 4 represents a word that starts with 2S, is followed by one character from A to D, and ends with one or more digits. A word satisfying this condition is the product number of a transistor and is not found in a general dictionary. The phrase dictionary 21 therefore defines words that meet the above condition as specific words.
[0043]
Here, for the word (specific word) “2S[A-D][0-9]+”, the part-of-speech type is “noun-proper noun-identifier-part number-transistor” and the alternative word is “alternative-transistor part number”. This means that the specific word is a proper noun, an identifier, and the product number of a transistor. Further, since the specific word has no meaning in itself, the entry indicates that “transistor part number” is available as its alternative word.
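The specific word entry described above can be sketched as a small lookup table. The following is a minimal illustration only, assuming that the pattern in FIG. 4 is to be read as the regular expression 2S[A-D][0-9]+; the names SPECIFIC_WORDS and lookup_specific_word are hypothetical and do not appear in the specification.

```python
import re

# Hypothetical dictionary entry modelled on the transistor part number
# example: (pattern, part-of-speech type, alternative word).
SPECIFIC_WORDS = [
    (
        re.compile(r"2S[A-D][0-9]+"),
        "noun-proper noun-identifier-part number-transistor",
        "alternative-transistor part number",
    ),
]


def lookup_specific_word(word):
    """Return (part-of-speech type, alternative word) when the whole word
    matches a specific-word pattern, otherwise None."""
    for pattern, pos_type, alternative in SPECIFIC_WORDS:
        if pattern.fullmatch(word):
            return pos_type, alternative
    return None
```

With this table, lookup_specific_word("2SC7777") yields the part-of-speech type and the alternative word, while a string such as "2SE7777" is not treated as a specific word.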
[0044]
The part of the phrase dictionary 21 relating to phrases (compound words and phrases) is composed of “phrase string” and “part-of-speech type”, and also functions as a dictionary for creating compound words and phrases.
[0045]
The “phrase string” represents a pattern to be formed into a compound word or phrase. For example, according to FIG. 4, “(noun and not (noun-suffix-adverb) or symbol) * [2, ∞]” indicates a sequence of two or more consecutive words, each of which is a noun (other than noun-suffix-adverb) or a symbol, and the “part-of-speech type” of this pattern is a noun phrase. That is, a word sequence that satisfies this condition is treated as a noun phrase.
[0046]
FIGS. 5 and 6 are diagrams illustrating documents analyzed by the phrase analyzing unit 20. Hereinafter, a case will be described in which the XML document shown in FIG. 5A is analyzed.
[0047]
When supplied with the XML document shown in FIG. 5A from the XML document creation unit 10, the content character string separating unit 22 separates the structure tags of the XML document from the content character string, which is a natural sentence. Then, the content character string separating unit 22 supplies the structure tags of the XML document shown in FIG. 5B to the part-of-speech tag assigning unit 28, and supplies the content character string of the XML document shown in FIG. 5 to the specific word separation / replacement unit 23.
[0048]
The specific word separation / replacement unit 23 compares the content character string supplied from the content character string separating unit 22 with the phrase dictionary 21, replaces each specific word included in the content character string with its alternative word, and separates out the specific word. Then, the separated specific word is supplied to the specific word part-of-speech determination unit 24, and the content character string in which the specific word has been replaced with the alternative word is supplied to the morphological analysis unit 25.
[0049]
Specifically, the specific word separation / replacement unit 23 replaces the specific word “2SC7777” included in the content character string with “alternative-transistor part number” based on the phrase dictionary 21. Then, the specific word “2SC7777” shown in FIG. 5D is separated and supplied to the specific word part-of-speech determination unit 24, and the replaced content character string shown in FIG. 5, in which “2SC7777” has been replaced with “(alternative-transistor part number)”, is supplied to the morphological analysis unit 25.
[0050]
The specific word part-of-speech determination unit 24 determines the part of speech of the specific word based on the phrase dictionary 21 and supplies the specific word and its part-of-speech type information to the alternative word replacement unit 26. Specifically, the part of speech of “2SC7777” is determined and, as shown in FIG. 5 (F), the specific word together with its part-of-speech type information, “2SC7777-noun-proper noun-identifier-part number-transistor”, is supplied to the alternative word replacement unit 26.
[0051]
The morphological analysis unit 25 performs morphological analysis on the content character string supplied from the specific word separation / replacement unit 23 while referring to the phrase dictionary 21. Specifically, the content character string is decomposed into individual words, and each decomposed word (including the alternative word standing in for a specific word) is associated with a part-of-speech type. Then, the morphological analysis unit 25 supplies each word and the corresponding part-of-speech type shown in FIG. 5 (G) to the alternative word replacement unit 26. In FIG. 5 (G), the words “time”, “ni”, “fever”, “ga”, and “problem” are omitted, but morphological analysis is similarly performed on each of these words, and each word and the corresponding part-of-speech type are likewise supplied to the alternative word replacement unit 26.
[0052]
The alternative word replacement unit 26 selects the alternative word from among the morphologically analyzed words based on the information supplied from the specific word part-of-speech determination unit 24 and the morphological analysis unit 25, and replaces the selected alternative word with the original specific word. Specifically, “alternative-transistor part number” is replaced with “2SC7777”. Then, each word and the corresponding part-of-speech type shown in FIG. 6A are supplied to the compound word / phrase generation unit 27. In FIG. 6A, the words “time”, “ni”, “fever”, “ga”, and “problem” are not shown. These words are also omitted in FIG. 6B described later.
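The round trip performed by the specific word separation / replacement unit 23 and the alternative word replacement unit 26 can be sketched as follows. This is a simplified illustration, not the implementation of the embodiment: the morphological analysis step between the two functions is left out (its output is simply passed in as a list of word and part-of-speech pairs), the pattern is assumed to be the regular expression 2S[A-D][0-9]+, and all function and constant names are hypothetical.

```python
import re

SPECIFIC = re.compile(r"2S[A-D][0-9]+")
PLACEHOLDER = "alternative-transistor part number"
SPECIFIC_POS = "noun-proper noun-identifier-part number-transistor"


def separate_and_replace(text):
    """Replace each specific word with the alternative word and return the
    replaced text together with the separated specific words, in order."""
    found = SPECIFIC.findall(text)
    return SPECIFIC.sub(PLACEHOLDER, text), found


def restore(tagged_words, specific_words):
    """After analysis, put each original specific word (with its own
    part-of-speech type) back in place of the alternative word."""
    remaining = iter(specific_words)
    restored = []
    for word, pos in tagged_words:
        if word == PLACEHOLDER:
            restored.append((next(remaining), SPECIFIC_POS))
        else:
            restored.append((word, pos))
    return restored
```

The specific words are restored in the order in which they were separated, which is sufficient as long as the analysis step preserves word order.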
[0053]
The compound word / phrase generation unit 27 determines whether the sequence of words supplied from the alternative word replacement unit 26 contains a subsequence of words that corresponds to a compound word / phrase pattern in the phrase dictionary 21, and if such a subsequence exists, replaces every matching subsequence with a compound word / phrase.
[0054]
Specifically, the word subsequence “A”, “class”, “amplification” corresponds to the phrase string pattern of the noun phrase in the phrase dictionary 21 shown in FIG. 4. Therefore, the compound word / phrase generation unit 27 generates the noun phrase “A class amplification” from the words “A”, “class”, and “amplification”, and supplies each word and the corresponding part-of-speech type shown in FIG. 6B to the part-of-speech tag assigning unit 28. When no subsequence of words corresponds to a compound word / phrase pattern in the phrase dictionary 21, the compound word / phrase generation unit 27 supplies the information received from the alternative word replacement unit 26 to the part-of-speech tag assigning unit 28 as it is.
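The noun phrase rule, two or more consecutive words that are nouns (other than noun-suffix-adverb) or symbols, can be sketched as a single pass over the tagged words. The sketch below is illustrative only: the part-of-speech labels given to “A” and “class” in the usage example are assumptions, and the function names are hypothetical.

```python
def is_component(pos):
    """True for symbols and for nouns other than adverb-capable suffixes."""
    if pos.startswith("symbol"):
        return True
    return pos.startswith("noun") and not pos.startswith("noun-suffix-adverb")


def merge_noun_phrases(tagged_words, joiner=" "):
    """Merge every maximal run of two or more component words into a single
    (phrase, "noun phrase") entry; leave all other words unchanged."""
    result, run = [], []

    def flush():
        if len(run) >= 2:
            result.append((joiner.join(w for w, _ in run), "noun phrase"))
        else:
            result.extend(run)
        run.clear()

    for word, pos in tagged_words:
        if is_component(pos):
            run.append((word, pos))
        else:
            flush()
            result.append((word, pos))
    flush()
    return result
```

For example, given the words “A” (assumed type symbol-alphabet), “class” (assumed type noun-suffix-general), “amplification”, and “time”, the first three are merged into “A class amplification”, while “time” is left alone because its type is noun-suffix-adverb possible.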
[0055]
The part-of-speech tag assigning unit 28 assigns a part-of-speech tag indicating the part-of-speech type to each word (including compound words / phrases) supplied from the compound word / phrase generation unit 27. Then, each word and its part-of-speech tag are embedded as elements of the structure tag <content> supplied from the content character string separating unit 22. As a result, as shown in FIG. 6C, the part-of-speech tag assigning unit 28 generates and outputs an XML document that has undergone phrase analysis.
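The embedding step can be sketched with the Python standard library. FIG. 6C is not reproduced here, so the element layout below is an assumption made purely for illustration: each word becomes a hypothetical <w> element whose “pos” attribute carries the part-of-speech type, placed inside <content>.

```python
import xml.etree.ElementTree as ET


def embed_tagged_words(tagged_words):
    """Build a <content> element holding one <w> child per word, with the
    part-of-speech type stored in a hypothetical "pos" attribute."""
    content = ET.Element("content")
    for word, pos in tagged_words:
        child = ET.SubElement(content, "w", {"pos": pos})
        child.text = word
    return content
```

Storing the part-of-speech type in an attribute sidesteps the fact that XML element names cannot contain spaces; the actual tag naming used in the embodiment may differ.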
[0056]
As described above, the phrase analyzing unit 20 separates the structure tags and the content character string (natural sentence) of a general XML document, performs a morphological analysis on the content character string, adds a part-of-speech tag to each word of the content character string, and embeds each word and its part-of-speech tag in the original XML document. That is, by performing a morphological analysis on a natural sentence and adding a part-of-speech tag to each word constituting the natural sentence, the phrase analyzing unit 20 can clarify the structure of the natural sentence and output a document from which the meaning can be extracted.
[0057]
In addition, since each newly defined specific word and its alternative word are registered in the phrase dictionary 21, the phrase analyzing unit 20 performs the morphological analysis using the alternative word instead of the specific word and, after the morphological analysis, replaces the alternative word with the original specific word. Morphological analysis can therefore be performed accurately even for specialized and technical terms that are not in a general dictionary.
[0058]
It should be noted that the phrase dictionary 21 is not limited to the configuration shown in FIG. 4, and can store other words and part-of-speech types. Further, the phrase dictionary 21 can of course store not only a pattern for creating a noun phrase from a plurality of nouns, but also patterns for generating other compound words / phrases such as adjective phrases and adverb phrases.
[0059]
FIG. 7 is a block diagram illustrating a functional configuration of the meaning extracting unit 30. The meaning extracting unit 30 extracts a meaning from the XML document that has been subjected to the phrase analysis, and generates an XML document in which the extracted meaning is structured.
[0060]
The meaning extracting unit 30 includes a concept dictionary 31 that stores an attribute value extraction pattern and an attribute value output format for each of various concepts, a pattern matching unit 32 that performs pattern matching to select an optimal item concept dictionary from a plurality of item concept dictionaries, an attribute value extracting unit 33 that extracts attribute values using the selected item concept dictionary, and a shaping unit 34 that shapes the extracted attribute values and corresponding tags into an XML document and outputs the XML document.
[0061]
FIG. 8 is a block diagram showing the configuration of the concept dictionary 31. The concept dictionary is composed of a plurality of item concept dictionaries configured for each of various concepts (items) such as “current problem” and “solution”. Here, a description will be given using the “current problem” as an example.
[0062]
The item concept dictionary includes an “item” representing the item of the item concept dictionary, “synonyms” representing synonyms of the “item”, “attributes describing the item”, and an “XML output format” representing the XML output format of the extracted attribute values.
[0063]
The “attribute that describes the item” includes an “attribute name” and a “tag pattern”. The “attribute name” indicates what attribute value is to be extracted. The “tag pattern” indicates a configuration pattern of each part-of-speech tag of the word string including the attribute value, and a position of the attribute value included in the configuration pattern.
[0064]
Here, “noun*” indicates an arbitrary noun or noun phrase. Further, a pair consisting of the symbol “$” and an attribute name following the symbol indicates that the surface form of the word at that position in the tag pattern becomes the value of the attribute.
[0065]
For example, in FIG. 8, “problem” indicates that an attribute value describing what kind of “problem” exists is extracted from the XML document that has undergone phrase analysis. The “tag pattern” of “problem” indicates the part-of-speech tag configuration pattern of a word string including an attribute value that describes the “problem”.
[0066]
Note that the “tag pattern” of “problem” indicates that the part-of-speech tag of the first word is a noun or a noun phrase, and that the part-of-speech tag of the next word may be arbitrary but the next word itself is “ga”. Here, since the element of the first part-of-speech tag includes “$problem”, the word corresponding to the first part-of-speech tag becomes the value (attribute value) of the attribute name “problem”.
[0067]
“Target” indicates that an attribute value describing what the “target” is is extracted from the XML document that has undergone phrase analysis. The “tag pattern” of “target” indicates that the part-of-speech tag of the first word is a noun or a noun phrase, and that the part-of-speech tag of the next word may be arbitrary but the next word (word string) itself is “in”. Also, since the element of the first part-of-speech tag includes “$target”, the word corresponding to the first part-of-speech tag becomes the value (attribute value) of the attribute name “target”.
[0068]
Further, “problem occurrence situation” indicates that an attribute value describing the situation in which the problem occurs is extracted from the XML document that has undergone phrase analysis. The “tag pattern” of “problem occurrence situation” indicates that the part-of-speech tag of the first word is a noun or a noun phrase, that the part-of-speech tag of the next word is “noun-suffix-adverb possible” (a suffix that can be used as an adverb), and that the part-of-speech tag of the last word may be arbitrary but the last word itself is “ni”. In addition, since the element of the first part-of-speech tag includes “$problem occurrence situation”, the word corresponding to the first part-of-speech tag becomes the value (attribute value) of the attribute name “problem occurrence situation”.
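The three tag patterns described above can be sketched as data plus a small matcher. This is an illustrative encoding, not the format of FIG. 8: each step of a pattern is written as a hypothetical triple (part-of-speech condition, required word, attribute name to capture), with None meaning “arbitrary” or “nothing captured”, and “noun*” standing for any noun or noun phrase.

```python
# Hypothetical encodings of the three tag patterns of "current problem".
PROBLEM = [("noun*", None, "problem"), (None, "ga", None)]
TARGET = [("noun*", None, "target"), (None, "in", None)]
SITUATION = [
    ("noun*", None, "problem occurrence situation"),
    ("noun-suffix-adverb possible", None, None),
    (None, "ni", None),
]


def pos_matches(condition, pos):
    """None accepts any part of speech; "noun*" accepts any noun or noun
    phrase; any other condition must match exactly."""
    if condition is None:
        return True
    if condition == "noun*":
        return pos.startswith("noun")
    return condition == pos


def find_attribute(tag_pattern, tagged_words):
    """Slide the tag pattern over the word list and return the
    (attribute name, word) pair of the first match, or None."""
    n = len(tag_pattern)
    for start in range(len(tagged_words) - n + 1):
        window = tagged_words[start:start + n]
        captured = None
        for (pos_cond, word_cond, attr), (word, pos) in zip(tag_pattern, window):
            if not pos_matches(pos_cond, pos):
                break
            if word_cond is not None and word_cond != word:
                break
            if attr is not None:
                captured = (attr, word)
        else:
            # Every step of the pattern matched this window.
            return captured
    return None
```

Applied to the example sentence, the “problem” pattern matches the word pair “fever”, “ga”, so “fever” is captured as the value of “problem”.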
[0069]
The “XML output format” indicates that an XML document is output in which each “$attribute name” has been replaced with the corresponding extracted attribute value.
[0070]
For example, the first line of the “XML output format” in FIG. 8 indicates that the attribute value of “problem” is substituted into the format and output. The <target> and <when a problem occurs> incorporated as elements of the <current problem> similarly indicate that the corresponding elements are formed from the extracted attribute values and output together with the structure tags.
[0071]
Then, the pattern matching unit 32, the attribute value extraction unit 33, and the shaping unit 34 execute the following processing using the concept dictionary 31 configured as described above.
[0072]
The pattern matching unit 32 compares the phrase-analyzed XML document supplied from the phrase analyzing unit 20 with each item concept dictionary of the concept dictionary 31, and selects the item concept dictionary corresponding to the XML document. Specifically, from among the item concept dictionaries, it selects an item concept dictionary all of whose tag patterns completely match configuration patterns of the part-of-speech tags of the document. Then, the pattern matching unit 32 supplies the “item” of the selected item concept dictionary and the part of the XML document that matched the item concept dictionary to the attribute value extracting unit 33.
[0073]
In this example, the configuration pattern of the part-of-speech tags of the document shown in FIG. 6 (C) matches all the “tag patterns” of the item concept dictionary “current problem” shown in FIG. 8. Therefore, the pattern matching unit 32 supplies the item name “current problem” and the element of <content> in FIG. 6C to the attribute value extracting unit 33.
[0074]
If there are a plurality of item concept dictionaries all of whose tag patterns completely match the configuration patterns of the part-of-speech tags of the document supplied from the phrase analyzing unit 20, the conditions of the tag patterns of the item concept dictionaries may simply be made more restrictive. For example, instead of “noun*”, which represents an arbitrary noun or noun phrase, “noun-proper noun”, which represents a proper noun, may be used, and other conditions may be restricted in the same way.
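The selection rule, choose the item concept dictionaries all of whose tag patterns occur in the document, can be sketched as follows. The encoding is an assumption for illustration: each step of a tag pattern is a hypothetical pair (part-of-speech prefix, required word) with None as a wildcard, and the “solution” entry with the particle “wo” in the usage example is invented solely to show a non-matching dictionary.

```python
def step_matches(step, tagged_word):
    """Check one pattern step against one (word, part-of-speech) pair."""
    pos_prefix, required_word = step
    word, pos = tagged_word
    if pos_prefix is not None and not pos.startswith(pos_prefix):
        return False
    return required_word is None or required_word == word


def occurs(tag_pattern, tagged_words):
    """True when the tag pattern matches some contiguous run of words."""
    n = len(tag_pattern)
    return any(
        all(step_matches(s, w) for s, w in zip(tag_pattern, tagged_words[i:i + n]))
        for i in range(len(tagged_words) - n + 1)
    )


def select_items(concept_dictionary, tagged_words):
    """Return the items whose tag patterns ALL occur in the document."""
    return [
        item
        for item, patterns in concept_dictionary.items()
        if all(occurs(p, tagged_words) for p in patterns)
    ]
```

Because select_items returns a list, a result with more than one item corresponds to the ambiguous case described above, which would be resolved by tightening the part-of-speech prefixes, for example from "noun" to "noun-proper noun".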
[0075]
The attribute value extracting unit 33 reads out from the concept dictionary 31 the item concept dictionary indicated by the “item” supplied from the pattern matching unit 32, and searches the document supplied from the pattern matching unit 32 for the word strings that correspond to the tag patterns of the item concept dictionary. Then, the word at the position marked “$attribute name” in each tag pattern is extracted from the word string as an attribute value.
[0076]
Specifically, the attribute value extracting unit 33 searches the elements of <content> in FIG. 6C for the word strings corresponding to each of the three tag patterns shown in FIG. 8. Then, the words “fever”, “2SC7777”, and “A class amplification”, which correspond to the “$attribute name” position of each tag pattern, are extracted from the word strings found, and are supplied to the shaping unit 34 as attribute values.
[0077]
The shaping section 34 shapes each attribute value supplied from the attribute value extracting section 33 in accordance with the output format of the concept dictionary 31, generates an XML document, and outputs this XML document to the outside.
[0078]
Specifically, the attribute value “fever” describing “problem”, the attribute value “2SC7777” describing “target”, and the attribute value “A class amplification” describing “problem occurrence situation” are substituted for the corresponding “$attribute name” entries of the “XML output format” to form elements, and the formed elements and the structure tags are output in the XML document format.
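The shaping step amounts to substituting attribute values into a template. The sketch below assumes a hypothetical output format: the real layout is given in FIG. 8, which is not reproduced here, and underscores are used in the element names because XML names cannot contain spaces. The function name shape is likewise illustrative.

```python
from xml.sax.saxutils import escape

# Hypothetical XML output format with "$attribute name" placeholders.
OUTPUT_FORMAT = (
    "<current_problem>"
    "<problem>$problem</problem>"
    "<target>$target</target>"
    "<when_a_problem_occurs>$problem occurrence situation</when_a_problem_occurs>"
    "</current_problem>"
)


def shape(output_format, attribute_values):
    """Substitute each "$attribute name" with its XML-escaped value.

    Longer names are replaced first so that "$problem" does not clobber
    the start of "$problem occurrence situation".
    """
    document = output_format
    for name in sorted(attribute_values, key=len, reverse=True):
        document = document.replace("$" + name, escape(attribute_values[name]))
    return document
```

Replacing the longest placeholder first is a design choice forced by the attribute names in this example, two of which share the prefix “problem”.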
[0079]
FIG. 9 is a diagram illustrating a meaning-extracted XML document output from the meaning extracting unit 30. Conventionally, a structured document could be generated from a predetermined document, but when the predetermined document included a natural sentence, the natural sentence was output as it was. That is, the conventional structured document contains a natural sentence as the element of a structure tag. In contrast, the meaning-extracted XML document shown in FIG. 9 is a document in which the meaning of the natural sentence contained in the element of the structure tag has been extracted and the extracted meaning has been structured.
[0080]
As described above, the structured document generation device according to the present embodiment performs a morphological analysis on a natural sentence included in an XML document to assign a part-of-speech type to each word, and extracts an attribute value from the matching word string by comparing the word strings of the document with a tag pattern indicating the part-of-speech type configuration of a word string that includes the attribute value. Then, by outputting the extracted attribute values in accordance with a predetermined XML output format, it can generate an XML document in which the natural sentence is structured.
[0081]
In particular, the structured document generation device prepares in advance, for each of various concepts (items), tag patterns for extracting each attribute value. This makes it possible to identify the concept of a natural sentence and, further, to extract from the natural sentence the attribute values that describe the concept.
[0082]
In addition, the structured document generation device prepares in advance, for each of various concepts (items), an XML output format consisting of the attribute values necessary to describe the concept and the structure tags. An XML document in which the extracted attribute values are structured can therefore be generated easily, simply by outputting the attribute values in accordance with the XML output format.
[0083]
As shown in FIG. 9, the XML document generated by the structured document generation device is a structured document in which the concept of a natural sentence is described by structure tags and attribute values. It is therefore a document that makes it easy to respond to inquiries. Hereinafter, application examples of the XML document will be described.
[0084]
(Application Example 1)
A case will be described in which a document stating that "product weight is reduced" is searched for in a database in which a large number of documents are stored.
[0085]
FIG. 10A is a diagram showing a conventional natural sentence document stored in the database, FIG. 10B is a diagram showing a conventional search result, and FIG. 10C is a diagram showing an XML document of the present application stored in the database. The XML document of the present application refers to an XML document generated by the above-described structured document generation device from the conventional natural sentence document shown in FIG. 10A. That is, the documents of (A) and (C) have the same content, and both indicate that the weight of the product increases.
[0086]
Here, when a document stating that "product weight is reduced" is searched for by a general method, "product", "weight", and "reduction" are usually used as search keywords. When conventional documents are stored in the database, a conventional document that merely contains the words “product”, “weight”, and “reduction” is erroneously retrieved, as shown in FIG. 10B.
[0087]
On the other hand, when the XML document of the present application is stored in the database, the search only needs to follow the structure tags in order. Here, a structure tag related to “product weight” is searched for first, and then a structure tag indicating whether the weight of the product has been increased or decreased is searched for among the elements of that structure tag. Specifically, <product weight> is searched for first, and <direction>, which represents the change in weight, is searched for among its elements. In the XML document of the present application, the element of <direction> is “increased”, so the document is not erroneously retrieved.
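The contrast between keyword search and following the structure tags can be sketched with the Python standard library. The sample documents in the usage example are invented for illustration and are not the contents of FIG. 10; the element name product_weight (with an underscore, since XML names cannot contain spaces) and both function names are assumptions.

```python
import xml.etree.ElementTree as ET


def keyword_hit(text, keywords):
    """Naive keyword search: True when every keyword occurs in the text."""
    return all(keyword in text for keyword in keywords)


def weight_direction(xml_text):
    """Follow <product_weight> and then <direction>; return the direction
    text, or None when the document has no such structure."""
    root = ET.fromstring(xml_text)
    node = root.find(".//product_weight/direction")
    return None if node is None else node.text
```

A sentence such as "The reduction in ribs raised the product weight." satisfies keyword_hit for "product", "weight", and "reduction", whereas weight_direction on the corresponding structured document returns "increased", so a query for reduced weight does not retrieve it.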
[0088]
Therefore, the structured document generation device can generate an XML document that does not cause an erroneous search from a document that is erroneously searched in a normal keyword search.
[0089]
(Application Example 2)
A case will be described in which an inquiry "What is the current countermeasure for processing defects?" is made to the database.
[0090]
FIG. 11A is a diagram showing a document of a conventional natural sentence stored in a database, and FIG. 11B is a diagram showing an XML document of the present application stored in the database. The documents in (A) and (B) have the same contents.
[0091]
If a conventional document is stored in the database, no response can be made even if the inquiry "What is the current countermeasure for processing defects?" is received.
[0092]
On the other hand, when the XML document of the present application is stored in the database, the structure tag related to the inquiry may be followed. Here, first, a structure tag related to “measures” is searched, and elements of the structure tag (lower-order structure tags and their elements) are extracted. Then, by combining the extracted structure tags and elements, it is possible to reply to the above inquiry that “the thickness has been increased by 2 mm”.
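Answering an inquiry by following a structure tag and collecting its lower-order tags and elements can be sketched as below. The sample document in the usage note is invented for illustration and is not the content of FIG. 11; the tag names measures, thickness, direction, and amount, and the function name describe, are all assumptions.

```python
import xml.etree.ElementTree as ET


def describe(xml_text, tag):
    """Find the first element named `tag` and return the (tag, text) pairs
    of all its descendants that carry non-empty text."""
    root = ET.fromstring(xml_text)
    node = root.find(f".//{tag}")
    if node is None:
        return []
    return [
        (child.tag, child.text.strip())
        for child in node.iter()
        if child is not node and child.text and child.text.strip()
    ]
```

For a document whose <measures> element contains <thickness> with children <direction>increased</direction> and <amount>2 mm</amount>, the function returns the pairs needed to compose a reply along the lines of “the thickness has been increased by 2 mm”.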
[0093]
Therefore, the structured document generation device can generate an XML document that can easily respond to an inquiry by extracting and structuring the meaning of a natural sentence.
[0094]
The present invention is not limited to the above-described embodiment, and various design changes can be made within the scope described in the claims.
[0095]
For example, the concept dictionary 31 is not limited to the configuration shown in FIG. 8. In the present embodiment, a case where one item concept dictionary has three tag patterns has been described as an example. However, an item concept dictionary can be provided with as many tag patterns as there are attribute values to be extracted.
[0096]
Further, in the present embodiment, an XML document has been described as an example, but an SGML document may be used instead, for example. In that case, the “XML output format” of the item concept dictionary may be changed to an “SGML output format”. Since the purpose is to create a structured document from natural sentences, the document format is not particularly limited.
[0097]
[Effects of the Invention]
The structured document generation device and the structured document generation program according to the present invention can automatically generate a structured document having a semantic structure of a sentence based on a sentence including a natural sentence.
[0098]
In addition, the structured document generation device and the structured document generation program according to the present invention extract attribute values by comparing the part-of-speech type configuration of the word strings of a phrase-analyzed document with a part-of-speech type configuration pattern that represents the part-of-speech type configuration of a word string including an attribute value to be extracted, and generate a structured document by structuring the extracted attribute values based on output format information. A structured document in which the meaning has been extracted from a natural sentence can thereby be obtained.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a configuration of a structured document generation device according to an embodiment of the present invention.
FIG. 2 is a block diagram showing a functional configuration of a microcomputer.
FIG. 3 is a block diagram showing a functional configuration of a phrase analyzing unit.
FIG. 4 is a diagram showing a configuration of a phrase dictionary.
FIG. 5 is a diagram illustrating a document being analyzed by a phrase analysis unit.
FIG. 6 is a diagram illustrating a document that has been analyzed by a phrase analysis unit.
FIG. 7 is a block diagram illustrating a functional configuration of a meaning extracting unit.
FIG. 8 is a block diagram showing a configuration of a concept dictionary.
FIG. 9 is a diagram illustrating a meaning-extracted XML document output from a meaning extraction unit.
FIG. 10A is a diagram showing a conventional natural sentence document stored in a database, FIG. 10B is a diagram showing a conventional search result, and FIG. 10C is a diagram showing an XML document of the present application stored in the database.
FIG. 11A is a diagram showing a conventional natural sentence document stored in a database, and FIG. 11B is a diagram showing an XML document of the present application stored in the database.
[Explanation of symbols]
20 Phrase analyzing unit
21 Phrase dictionary
22 Content character string separating unit
23 Specific word separation / replacement unit
24 Specific word part-of-speech determination unit
25 Morphological analysis unit
26 Alternative word replacement unit
27 Compound word / phrase generation unit
28 Part-of-speech tag assigning unit
30 Meaning extracting unit
31 Concept dictionary
32 Pattern matching unit
33 Attribute value extracting unit
34 Shaping unit

Claims

A structured document generation device comprising:
sentence input means for inputting a sentence including a natural sentence; and
structured document generating means for generating a structured document constituted by the semantic structure of the sentence input by the sentence input means.

The structured document generation device according to claim 1, wherein the structured document generating means generates, based on the concept of the sentence input by the sentence input means, a structured document constituted by a semantic structure having a phrase representing the concept and an attribute value describing the phrase.

The structured document generation device according to claim 2, further comprising concept dictionary storing means for storing a concept dictionary representing a phrase representing a concept, an attribute value describing the phrase, and a pattern in which the phrase and the attribute value are described in a sentence,
wherein the structured document generating means generates the structured document constituted by the semantic structure by collating the sentence input by the sentence input means with the concept dictionary stored in the concept dictionary storing means.

A structured document generation program for causing a computer to execute processing comprising:
a sentence input step of inputting a sentence including a natural sentence; and
a structured document generating step of generating a structured document constituted by the semantic structure of the sentence input in the sentence input step.

The structured document generation program according to claim 4, wherein the structured document generating step generates, based on the concept of the sentence input in the sentence input step, a structured document constituted by a semantic structure having a phrase representing the concept and an attribute value describing the phrase.

The structured document generation program according to claim 5, wherein the structured document generating step generates the structured document constituted by the semantic structure by collating
the sentence input in the sentence input step with
a concept dictionary representing a phrase representing a concept, an attribute value describing the phrase, and a pattern in which the phrase and the attribute value are described in a sentence.

A structured document generation device comprising:
phrase analysis means for performing phrase analysis on a document including a natural sentence and outputting a phrase-analyzed document in which the words constituting the document are associated with part-of-speech types;
concept dictionary storage means for storing a concept dictionary having a part-of-speech type configuration pattern representing the part-of-speech type configuration of a word string including an attribute value to be extracted, and output format information for structuring and outputting the attribute value and the corresponding tag;
attribute value extracting means for comparing the part-of-speech type configuration of the word strings of the phrase-analyzed document with the part-of-speech type configuration pattern of the concept dictionary stored in the concept dictionary storage means, and extracting an attribute value from the word string of the phrase-analyzed document that corresponds to the part-of-speech type configuration pattern; and
document generation means for generating a structured document based on the attribute value extracted by the attribute value extracting means and the output format information of the concept dictionary stored in the concept dictionary storage means.

The structured document generation device according to claim 7, wherein the concept dictionary storage means stores a concept dictionary provided for each concept, and
the attribute value extracting means selects, from the plurality of concept dictionaries stored in the concept dictionary storage means, a concept dictionary having a part-of-speech type configuration pattern that matches the part-of-speech type configuration of the phrase-analyzed document, and extracts attribute values using the selected concept dictionary.

A structured document generation program for causing a computer to execute processing comprising:
a phrase analysis step of performing phrase analysis on a document including a natural sentence and outputting a phrase-analyzed document in which the words constituting the document are associated with part-of-speech types;
an attribute value extracting step of comparing the part-of-speech type configuration of the word strings of the phrase-analyzed document with a part-of-speech type configuration pattern representing the part-of-speech type configuration of a word string including an attribute value to be extracted, and extracting an attribute value from the word string of the phrase-analyzed document that corresponds to the part-of-speech type configuration pattern; and
a document generation step of generating a structured document based on the extracted attribute value and output format information for structuring and outputting the attribute value and the corresponding tag.

The structured document generation program according to claim 9, wherein the attribute value extracting step selects, from a plurality of concept dictionaries provided for each concept, a concept dictionary having a part-of-speech type configuration pattern that matches the part-of-speech type configuration of the phrase-analyzed document, and extracts attribute values using the selected concept dictionary.