JP2004326461A

JP2004326461A - Apparatus and method for recognizing proper name

Info

Publication number: JP2004326461A
Application number: JP2003120579A
Authority: JP
Inventors: Shoichi Tateno; 昌一舘野
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2003-04-24
Filing date: 2003-04-24
Publication date: 2004-11-18
Anticipated expiration: 2023-04-24
Also published as: JP4023371B2

Abstract

PROBLEM TO BE SOLVED: To provide a proper name recognition technique that identifies proper name expressions precisely.
SOLUTION: A text input part 10 inputs a text. A morpheme analysis part 11 performs morphological analysis of the text with reference to a morpheme analysis dictionary storage part 12. A proper name information analysis part 13 adds features to the result of the morphological analysis with reference to a storage part 14 for a candidate dictionary of proper name components. A proper name identifying part 15 refers to a proper name specifying rule storage part 16 and groups each morpheme string that matches a rule into a single proper name. The text with the proper names identified is output from an output part 17.
COPYRIGHT: (C)2005, JPO&NCIPI

Description

[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a technique for extracting expressions that denote specific things, such as person names, place names, organization names, dates and times, and prices (such expressions are also called proper names or proper name expressions).
[0002]
[Prior art]
To answer questions about the information contained in a vast number of documents, or to summarize those documents, build databases from them, or visualize them, it is necessary to extract proper names such as person names, place names, organization names, and dates and times from the documents. One way to do this by computer is to prepare in advance a dictionary in which proper names are registered and to extract proper names from a document by searching this dictionary. However, actual documents always contain new words that are not in a dictionary prepared in advance, so dictionary search alone cannot give accurate extraction results. To deal with this problem, there is a technique in which a large number of rules are created by hand in advance, each rule formalizing the pattern in which a proper name itself and the word sequences before and after it appear, and a computer applies these rules to extract proper names from the target document.
[0003]
However, in this technique the rules compete and interact with each other, so each rule does not necessarily work as intended. The created rules must therefore be applied to training data prepared in advance, the errors found from the results, and the rules corrected, and this process must be repeated many times.
[0004]
Also, as a result of modifying a certain rule, the rule that worked normally until then is often affected and gives an incorrect answer. Therefore, it takes a great deal of time and effort to make all of the many rules work as intended.
[0005]
Techniques that use a computer to generate such proper name extraction rules automatically face the same competition and interaction among rules. The only way to find out which combination of automatically generated rules gives good results is trial and error: the combined rules are applied to actual documents again, the output is scored against the correct answers, and rules are added or deleted based on the score so as to obtain better results. This requires a great deal of computation time.
[0006]
It has also been proposed to use training documents to select good rules from among such rules (Patent Document 1), and to learn dependency probabilities and the like in sentence analysis using a maximum entropy model.
[Patent Document 1]
JP 2001-318792 A
[Patent Document 2]
JP 2002-334076 A
[0007]
[Problems to be solved by the invention]
The present invention has been made in view of the above circumstances, and has as its object to provide a proper name recognition technique capable of recognizing a proper name expression with high accuracy.
[0008]
[Means for Solving the Problems]
According to the present invention, in order to achieve the above object, a configuration as described in the claims is adopted.
[0009]
First, an outline of the present invention will be described schematically with reference to the example shown in FIG. 1. In this example, a proper name list that collects known proper names is used, and morphological analysis is applied to it to obtain in advance the proper name component candidates: a leftmost morpheme list, an intermediate morpheme list, a rightmost morpheme list, and a list of morphemes that are words in themselves (for proper names consisting of a single morpheme). In principle, a proper name consists of a morpheme at its left end (front end), a morpheme at its right end (rear end), and zero or more intermediate morphemes. As an exception, a proper name may consist of a single morpheme (the morpheme is then the proper name word itself). The text to be processed is then input and morphologically analyzed, and features (attributes relating to proper name component candidates; see, for example, FIG. 6) are assigned to the resulting morphemes by referring to the leftmost morpheme list, the intermediate morpheme list, the rightmost morpheme list, and the list of single-morpheme words. Proper names are extracted by applying proper name specifying rules to this revised morphological analysis result, to which the features have been added. The extracted proper names are, for example, emphasized or concealed and then displayed. Note that the example in FIG. 1 is merely for explanation, and the present invention is not limited to it.
[0010]
Further, the present invention will be described.
[0011]
According to one aspect of the present invention, in order to achieve the above object, a proper name recognition device is provided with: proper name component candidate storage means for storing proper name component candidates; proper name specifying rule storage means for storing proper name specifying rules defined in relation to the proper name component candidates; morphological analysis means for morphologically analyzing a sentence; proper name information analyzing means for analyzing the morphemes output from the morphological analysis means using the proper name component candidates; and proper name specifying means for specifying the proper names included in the sentence by applying the proper name specifying rules to a sentence analysis result obtained by reflecting the result of the analysis using the proper name component item in the analysis result of the syntax analyzing means.
[0012]
In this configuration, a morpheme or a morpheme string can be recognized as a proper name expression with high accuracy by using information obtained using the proper name component candidate together with other information such as grammar information.
[0013]
In this configuration, the proper name component candidate is preferably at least one of a front end, a rear end, and a central portion of the proper name, but is not limited thereto. The leading and trailing ends of the proper name have been found to be particularly useful, but are not limited thereto.
[0014]
The proper name component candidates may include the proper name itself.
[0015]
In addition, for example, when a morpheme output from the morphological analysis means corresponds to a proper name component candidate, the proper name information analyzing means assigns to that morpheme the attribute (feature) specified by the type of the proper name component candidate.
[0016]
In addition, for example, when a part of a morpheme output from the morphological analysis means corresponds to a proper name component candidate, the proper name information analyzing means assigns the attribute (feature) specified by the type of the proper name component candidate to that morpheme or to that part of the morpheme.
[0017]
Further, it is preferable that the proper name specifying rules also determine the attribute of each extracted proper name. The attributes of a proper name include person name (family name, given name), organization name, place, amount of money, date and time, product name, merchandise name, and the like, but are not limited to these.
[0018]
The present invention is optimally applied to, for example, Japanese proper name recognition, but can be applied to other languages as long as proper names are recognized by focusing on morphemes of proper name component candidates.
[0019]
The present invention can be realized not only as a device or a system but also as a method. It goes without saying that part of such an invention can be configured as software, and that a software product used to cause a computer to execute such software also falls within the technical scope of the present invention. The device may also be built by connecting discrete circuit elements and the like.
[0020]
Further, the above-described aspects of the present invention and other aspects of the present invention are described in the claims, and will be described in detail below with reference to examples.
[0021]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described.
[0022]
FIG. 2 shows the proper name recognition device 100 according to the embodiment of the present invention as a whole. In this figure, the proper name recognition device 100 includes a text input unit 10, a morphological analysis unit 11, a morphological analysis dictionary storage unit 12, a proper name information analysis unit 13, a proper name component candidate dictionary storage unit 14, a proper name specifying unit 15, a proper name specifying rule storage unit 16, an output unit 17, and the like.
[0023]
The main part of the proper name recognition device 100 can be realized, for example, as computer software executed on the computer 200. The computer software is installed in the computer 200 using the recording medium 201, for example. As usual, the computer 200 includes a CPU, a main memory, a hard disk, and the like, and is, for example, a personal computer or a workstation, but is not limited thereto.
[0024]
FIG. 3 illustrates the processing (steps S10 to S14) performed by the proper name recognition device 100 of FIG. 2.
[0025]
In FIGS. 2 and 3, the text input unit 10 inputs a Japanese text (S10). The morphological analysis unit 11 morphologically analyzes the text with reference to the morphological analysis dictionary storage unit 12 (S11). The morphological analysis result is, for example, as shown in FIG. 5. In this example, the text 「米カリフォルニアのオレンジ郡が・・・」 ("Orange County, California, USA ...") is morphologically analyzed. The proper name information analysis unit 13 refers to the proper name component candidate dictionary storage unit 14 and assigns features (attributes relating to proper name component candidates) to the morphological analysis result (S12). The proper name component candidate dictionary storage unit 14 stores a proper name component candidate dictionary, as shown for example in FIG. 4, made up of information that associates each morpheme with its position in the proper names it forms. If the proper name is 日本国 ("State of Japan"), 日本 ("Japan") is the leftmost morpheme and 国 ("country") is the rightmost morpheme. If it is 日本国憲法 ("Constitution of Japan"), 日本 is the leftmost morpheme, 国 is an intermediate morpheme, and 憲法 ("Constitution") is the rightmost morpheme. In this case, 国 is both a rightmost morpheme and an intermediate morpheme.
[0026]
The features that the proper name information analysis unit 13 assigns to morphemes are, for example, as shown in FIG. 6. In this example, a feature is determined from the attribute of the proper name itself and an attribute relating to the position of the morpheme within the proper name.
[0027]
When a morpheme in the morphological analysis result is itself a proper name "word", a "leftmost" morpheme, an "intermediate" morpheme, or a "rightmost" morpheme, as indicated by arrow A in FIG. 6, the corresponding feature is assigned to that morpheme. For example, the "rightmost" morpheme of a "place" (such as 国 in 日本国) is assigned the feature "prb".
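The exact-match case described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionary contents, the function name `assign_features`, and the placeholder labels of the form `attribute:position` are hypothetical. The only feature code taken from the text is "prb" (right end of a place); the full code table of FIG. 6 is not reproduced here.

```python
# Hypothetical dictionary: morpheme -> set of (attribute, position) pairs.
COMPONENT_DICT = {
    "オレンジ": {("place", "left")},
    "郡": {("place", "right")},
    "国": {("place", "right")},
}

# Only the code "prb" (right end of a place) is given in the text.
FEATURE_CODES = {("place", "right"): "prb"}


def assign_features(morphemes):
    """Attach feature labels to morphemes that exactly match a dictionary entry."""
    tagged = []
    for surface in morphemes:
        labels = []
        for attribute, position in sorted(COMPONENT_DICT.get(surface, ())):
            # Fall back to a readable placeholder when no code is known.
            code = FEATURE_CODES.get((attribute, position),
                                     f"{attribute}:{position}")
            labels.append(code)
        tagged.append((surface, labels))
    return tagged


# The input is assumed to be already segmented by a morphological analyzer.
tagged = assign_features(["オレンジ", "郡", "が"])
```

Here `tagged` pairs each morpheme with the labels found for it, and a morpheme that is not in the dictionary simply gets an empty list.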
[0028]
When only a part of a morpheme in the morphological analysis result matches a proper name "word", "leftmost" morpheme, "intermediate" morpheme, or "rightmost" morpheme, a feature such as those indicated by arrow B in FIG. 6 is assigned. For example, suppose that 「・・・韓国軍人・・」 ("... Korean military personnel ...") is morphologically analyzed into the morpheme sequence 韓国 ("Korea"), 軍人 ("military personnel"). The 軍 ("military") inside the morpheme 軍人 can also be the "rightmost" morpheme of an "organization", so 軍人 carries the position information "contains a right end on its left" and is assigned the feature "orbl". If a description method is adopted that allows a feature to be assigned individually to a part of a morpheme, such as the 軍 in 軍人, the feature may instead be assigned to that part of the morpheme.
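A minimal sketch of the partial-match case, assuming that only matches at the left or right edge of a morpheme are checked (the text does not say how the containing part is located). The dictionary entry, the function name `partial_features`, and the slash-joined fallback label are hypothetical; "orbl" is the one code given in the text.

```python
# Hypothetical dictionary: component morpheme -> set of (attribute, position).
COMPONENT_DICT = {"軍": {("organization", "right")}}

# (attribute, position, side of the morpheme where the component sits) -> code.
# Only "orbl" is given in the text.
PARTIAL_CODES = {("organization", "right", "left"): "orbl"}


def partial_features(surface):
    """Return features for components that form only a part of `surface`."""
    labels = []
    for component, entries in COMPONENT_DICT.items():
        if component == surface:
            continue  # whole-morpheme matches are the exact-match case
        if surface.startswith(component):
            side = "left"
        elif surface.endswith(component):
            side = "right"
        else:
            continue
        for attribute, position in sorted(entries):
            key = (attribute, position, side)
            # Fall back to a readable placeholder when no code is known.
            labels.append(PARTIAL_CODES.get(key, "/".join(key)))
    return labels


features_gunjin = partial_features("軍人")   # 軍 sits on the left of 軍人
features_kankoku = partial_features("韓国")  # no component inside
```

With this dictionary, 軍人 receives "orbl" because the organization right-end morpheme 軍 occupies its left part, while 韓国 receives nothing.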
[0029]
In this way, each morpheme in the morphological analysis result, or a part of it, is analyzed with reference to the proper name component candidate dictionary, and features are assigned to the morphemes.
[0030]
FIG. 7 shows an example in which features have been assigned to the morphological analysis result of FIG. 5. In this example, the underlined portions are the newly assigned features.
[0031]
The proper name specifying unit 15 refers to the proper name specifying rule storage unit 16 and groups each morpheme string that matches a rule into a single proper name (S13).
[0032]
The proper name specifying rules (also called chunking rules) in the proper name specifying rule storage unit 16 are, for example, as shown in FIG. 8, and ultimately assign an attribute of the proper name (family name, given name, organization, place, etc.). Then, as shown in FIG. 9, morpheme strings that have a proper name attribute are extracted as proper names. In the example of FIG. 8, a string having the "place" attribute is extracted as "location()". In FIG. 9, "*" indicates that the immediately preceding morpheme is repeated zero or more times, and "+" indicates that the immediately preceding morpheme is repeated one or more times. "?" represents an arbitrary morpheme.
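The way such chunking rules group morphemes can be sketched as below. The rule syntax of FIGS. 8 and 9 is not reproduced in the text, so this is an assumed encoding: each rule is a list of `(feature, quantifier)` pairs, where the quantifier is `""`, `"*"` (zero or more) or `"+"` (one or more), and the feature `"?"` matches any morpheme. The feature names such as `place:left`, the segmentation of the sample text, and the function names are hypothetical.

```python
def match_rule(rule, tokens, start):
    """Return the end index of a match of `rule` at `start`, or None."""

    def step(rule_index, token_index):
        if rule_index == len(rule):
            return token_index
        wanted, quantifier = rule[rule_index]

        def matches(i):
            return i < len(tokens) and (wanted == "?" or wanted in tokens[i][1])

        if quantifier == "":
            if matches(token_index):
                return step(rule_index + 1, token_index + 1)
            return None
        # Count how many consecutive morphemes could be consumed.
        count = 0
        while matches(token_index + count):
            count += 1
        minimum = 1 if quantifier == "+" else 0
        # Try the longest repetition first and back off if the rest fails.
        for n in range(count, minimum - 1, -1):
            end = step(rule_index + 1, token_index + n)
            if end is not None:
                return end
        return None

    return step(0, start)


def chunk(tokens, rules):
    """Group morphemes that match a rule into (label, surface string) chunks."""
    chunks, i = [], 0
    while i < len(tokens):
        for label, rule in rules:
            end = match_rule(rule, tokens, i)
            if end is not None and end > i:
                chunks.append((label, "".join(s for s, _ in tokens[i:end])))
                i = end
                break
        else:
            i += 1  # no rule matched at this position
    return chunks


# Hypothetical rules: left end, any number of middles, right end; or a word.
RULES = [
    ("location", [("place:left", ""), ("place:middle", "*"), ("place:right", "")]),
    ("location", [("place:word", "")]),
]

# Assumed segmentation and features for 米カリフォルニア州のオレンジ郡が.
tokens = [
    ("米", {"place:word"}), ("カリフォルニア", {"place:left"}),
    ("州", {"place:right"}), ("の", set()),
    ("オレンジ", {"place:left"}), ("郡", {"place:right"}), ("が", set()),
]
chunks = chunk(tokens, RULES)
```

Keeping the features of each morpheme as a set lets one morpheme, such as 国, satisfy a right-end slot in one rule and a middle slot in another.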
[0033]
FIG. 10 shows an example in which proper names and their attributes have been specified by applying the proper name specifying rules to the morphological analysis result of the earlier text 「米カリフォルニア州のオレンジ郡が・・・」 ("Orange County, California, USA ..."), and FIG. 11 shows an example of extracting the proper names from this result. In this example, each morpheme in the morphological analysis result is itself a proper name component candidate (left end, middle, right end, or word).
[0034]
FIG. 12 shows an example in which a part of a morpheme in the morphological analysis result forms a proper name component candidate. In this example, as described above, the left part of 軍人 in the morpheme sequence 韓国, 軍人 obtained from 「・・韓国軍人・・」 can be a rightmost morpheme, so 軍人 is given "orbl", the "organization" feature meaning "contains a right end on its left".
[0035]
The extracted proper names are emphasized or concealed in the text, which is output from the output unit 17. Various forms of output can be used, such as display, printing, mail transmission, and voice output. The text can also be output, with the proper name information added, to various downstream processing devices.
[0036]
According to the proper name recognition device of this embodiment, information on proper name component candidates obtained from known proper names is used, and proper names are recognized by chunking (grouping the components into a single proper name) based on the relationships among the candidates, which enables fine-grained, highly accurate proper name recognition.
[0037]
In particular, focusing on the "left end" and "right end" proper name component candidates gave extremely accurate recognition.
[0038]
Next, a method of creating the proper name component candidate dictionary will be described.
[0039]
FIG. 13 shows the proper name candidate extracting device 110, and FIG. 14 shows an example of its operation (steps S20 to S21). In these figures, the proper name candidate extracting device 110 includes a proper name input unit 20, a morphological analysis unit 21, a morphological analysis dictionary storage unit 22, and a proper name component candidate storage unit 23. The proper name component candidates stored in the proper name component candidate storage unit 23 are stored and held in the proper name component candidate dictionary storage unit 14 (FIG. 1) of the proper name recognition device 100.
[0040]
Morphological analysis is performed on a group of sample known proper names input through the proper name input unit 20; the leftmost morphemes, intermediate morphemes, rightmost morphemes, and morphemes that are words in themselves are extracted; and the proper name component candidate dictionary is created. The operation shown in FIG. 14 is self-evident from the content of FIG. 14 and is not described further.
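The dictionary creation step can be sketched as follows, assuming the known proper names have already been segmented into morphemes by a morphological analyzer. The sample names, their segmentation, their attributes, and the function name `build_component_dictionary` are all hypothetical illustrations.

```python
from collections import defaultdict


def build_component_dictionary(known_names):
    """Map each morpheme to the (attribute, position) roles it plays.

    known_names: iterable of (attribute, list of morphemes) pairs.
    """
    dictionary = defaultdict(set)
    for attribute, morphemes in known_names:
        if len(morphemes) == 1:
            # A proper name made of a single morpheme is a "word" entry.
            dictionary[morphemes[0]].add((attribute, "word"))
            continue
        dictionary[morphemes[0]].add((attribute, "left"))
        dictionary[morphemes[-1]].add((attribute, "right"))
        for morpheme in morphemes[1:-1]:
            dictionary[morpheme].add((attribute, "middle"))
    return dict(dictionary)


# Hypothetical, pre-segmented sample of known proper names.
SAMPLES = [
    ("place", ["日本", "国"]),
    ("place", ["オレンジ", "郡"]),
    ("place", ["カリフォルニア"]),
    ("organization", ["韓国", "軍"]),
    ("organization", ["東京", "都", "庁"]),
]
component_dict = build_component_dictionary(SAMPLES)
```

Because each entry is a set, a morpheme that appears in several positions or under several attributes accumulates all of its roles.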
[0041]
The main part of the proper name candidate extracting device 110 can also be realized, for example, as computer software executed on the computer 200. The computer software is installed in the computer 200 using the recording medium 201, for example. As usual, the computer 200 includes a CPU, a main memory, a hard disk, and the like, and is, for example, a personal computer or a workstation, but is not limited thereto.
[0042]
FIG. 15 shows an example of a text processing device using the proper name recognition device of the present invention. In this example, the proper names in the text are emphasized or concealed as appropriate.
[0043]
In FIG. 15, the text processing device 130 includes the proper name recognition device 100, a target proper name specifying unit 30, a text part specifying unit 31, a specific proper name processing unit 32, and an output unit 33.
[0044]
The main part of the text processing device 130 can also be realized, for example, as computer software executed on the computer 200. The computer software is installed in the computer 200 using the recording medium 201, for example. As usual, the computer 200 includes a CPU, a main memory, a hard disk, and the like, and is, for example, a personal computer or a workstation, but is not limited thereto.
[0045]
For example, when the user points at displayed text, the text part specifying unit 31 sends the pointing information to the target proper name specifying unit 30. The target proper name specifying unit 30 determines, for example, which sentence was pointed at, and the proper names contained in that sentence are emphasized or concealed. The device can also be arranged so that emphasis is performed when the user points while holding an auxiliary key such as the shift key, and concealment is performed otherwise. Of course, the device is not limited to this, and emphasis or concealment can be specified for various targets. The specific proper name processing unit 32 applies the display attributes and character replacements needed for emphasis or concealment and sends the result to the output unit 33. The output unit 33 performs display output, print output, transmission to a predetermined mail address, and the like.
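The emphasis and concealment step can be illustrated with a short sketch. It assumes the recognized proper names are available as non-overlapping character spans, marks emphasis with square brackets, and conceals by replacing each character with an asterisk. These conventions, the function name `render`, and the `emphasize` flag (standing in for pointing with or without the auxiliary key) are hypothetical.

```python
def render(text, spans, emphasize):
    """Emphasize or conceal the given (start, end) spans of `text`.

    Spans are assumed not to overlap.
    """
    pieces, position = [], 0
    for start, end in sorted(spans):
        pieces.append(text[position:start])
        name = text[start:end]
        if emphasize:
            pieces.append("[" + name + "]")  # stand-in for a display attribute
        else:
            pieces.append("*" * len(name))   # character replacement
        position = end
    pieces.append(text[position:])
    return "".join(pieces)


text = "米カリフォルニア州のオレンジ郡が"
spans = [(1, 9), (10, 15)]  # カリフォルニア州 and オレンジ郡
emphasized = render(text, spans, emphasize=True)
concealed = render(text, spans, emphasize=False)
```

In a real display, the bracket markers would be replaced by whatever display attribute the output device supports, such as bold or color.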
[0046]
FIG. 16 shows another example of the text processing device. The text processing device 130 in FIG. 16 includes a processing rule storage unit 34 in addition to the components of the text processing device in FIG. 15. The processing rule storage unit 34 stores processing rules entered through a user interface for specifying processing conditions and processing contents, as shown in FIG. 17. Default processing conditions and processing rules can of course also be used. In this example, the processing content (emphasize, conceal, or leave as is) and the conditions can be specified with pull-down menus or the like, which allows the text processing to be configured in detail. In the example of FIG. 17 a sentence can be specified, but there may also be cases where the only mode available processes the entire text at once without specifying a sentence.
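One way to picture stored processing rules is a table that maps a condition to a processing content. The sketch below assumes the condition is simply the attribute of the proper name and that an unlisted attribute falls back to a default. The rule format, the action names, the sample sentence, and the function name `process` are hypothetical, since the actual conditions of FIG. 17 are not described in the text.

```python
def process(text, entities, rules, default="as_is"):
    """Apply per-attribute processing rules to recognized proper names.

    entities: non-overlapping (start, end, attribute) spans.
    rules: mapping from attribute to "emphasize", "conceal" or "as_is".
    """
    pieces, position = [], 0
    for start, end, attribute in sorted(entities):
        pieces.append(text[position:start])
        name = text[start:end]
        action = rules.get(attribute, default)
        if action == "emphasize":
            pieces.append("[" + name + "]")
        elif action == "conceal":
            pieces.append("*" * len(name))
        else:
            pieces.append(name)  # leave the proper name unchanged
        position = end
    pieces.append(text[position:])
    return "".join(pieces)


text = "山田さんはオレンジ郡に住む"
entities = [(0, 2, "person"), (5, 10, "place")]
rules = {"person": "conceal", "place": "emphasize"}
processed = process(text, entities, rules)
```

Passing an empty rule table leaves the text unchanged, which corresponds to using the default "as is" setting for every attribute.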
[0047]
It should be noted that the present invention is not limited to the above-described embodiment, and various changes can be made without departing from the gist of the present invention. For example, various proper name component candidates are used in the above example, but variations are possible, such as using only the "left end" and "right end" candidates or only one of them. A plurality of proper name component candidate dictionaries may also be prepared, and dictionaries may be selected and merged adaptively according to the application and the content of the text.
[0048]
[Effects of the invention]
As described above, according to the present invention, proper names can be recognized with high accuracy by focusing on proper name component candidates.
[Brief description of the drawings]
FIG. 1 is a diagram schematically illustrating an example of the basic configuration of the present invention.
FIG. 2 is a block diagram showing the configuration of the proper name recognition device according to the embodiment of the present invention.
FIG. 3 is a flowchart illustrating the operation of the embodiment in FIG. 2.
FIG. 4 is a diagram illustrating the proper name component candidate dictionary of the embodiment in FIG. 2.
FIG. 5 is a diagram illustrating a morphological analysis result of the embodiment in FIG. 2.
FIG. 6 is a diagram illustrating the features used in the embodiment in FIG. 2.
FIG. 7 is a diagram illustrating the result of reflecting features in the morphological analysis result in the embodiment in FIG. 2.
FIG. 8 is a diagram illustrating the chunking rules in the proper name specifying unit of the embodiment in FIG. 2.
FIG. 9 is a diagram illustrating the extraction rules in the proper name specifying unit of the embodiment in FIG. 2.
FIG. 10 is a diagram illustrating an example of an analysis result after the chunking rules are applied in the embodiment in FIG. 2.
FIG. 11 is a diagram illustrating an example of an extraction result obtained by applying the extraction rules of FIG. 9 to the analysis result of FIG. 10.
FIG. 12 is a diagram illustrating another example of an analysis result after the chunking rules are applied in the embodiment in FIG. 2.
FIG. 13 is a block diagram showing the proper name candidate extracting device according to the embodiment of the present invention as a whole.
FIG. 14 is a flowchart illustrating the operation of the embodiment in FIG. 13.
FIG. 15 is a block diagram showing the text processing device according to the embodiment of the present invention as a whole.
FIG. 16 is a block diagram illustrating a modification of the text processing device of FIG. 15.
FIG. 17 is a diagram illustrating the operation of the modification of FIG. 16.
[Explanation of symbols]
10 Text input unit
11 Morphological analysis unit
12 Morphological analysis dictionary storage unit
13 Proper name information analysis unit
14 Proper name component candidate dictionary storage unit
15 Proper name specifying unit
16 Proper name specifying rule storage unit
17 Output unit
20 Proper name input unit
21 Morphological analysis unit
22 Morphological analysis dictionary storage unit
23 Proper name component candidate storage unit
30 Target proper name specifying unit
31 Text part specifying unit
32 Specific proper name processing unit
33 Output unit
34 Processing rule storage unit
100 Proper name recognition device
110 Proper name candidate extracting device
130 Text processing device
200 Computer
201 Recording medium

Claims

1. A proper name recognition device comprising:
proper name component candidate storage means for storing proper name component candidates;
proper name specifying rule storage means for storing a proper name specifying rule defined in relation to the proper name component candidates;
morphological analysis means for morphologically analyzing a sentence;
proper name information analyzing means for analyzing the morphemes output from the morphological analysis means using the proper name component candidates; and
proper name specifying means for specifying a proper name included in the sentence by applying the proper name specifying rule to a sentence analysis result obtained by reflecting, in the analysis result of the morphological analysis means, the result of the analysis using the proper name component candidates.
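
The flow recited in claim 1 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the whitespace tokenizer stands in for a real morphological analyzer, and the `COMPONENTS` dictionary, the feature names `ORG_FRONT` and `ORG_REAR`, and the `(front, rear, label, max_len)` rule format are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Morpheme:
    surface: str
    features: set = field(default_factory=set)


# Hypothetical component-candidate dictionary: surface form -> feature names.
COMPONENTS = {"Acme": {"ORG_FRONT"}, "Co.": {"ORG_REAR"}}

# Hypothetical rule format: (front feature, rear feature, label, max morphemes).
RULES = [("ORG_FRONT", "ORG_REAR", "ORGANIZATION", 4)]


def analyze_morphemes(text):
    """Stand-in for morphological analysis: split on whitespace."""
    return [Morpheme(token) for token in text.split()]


def annotate(morphemes, components):
    """Reflect component-candidate features in the analysis result."""
    for m in morphemes:
        m.features |= components.get(m.surface, set())
    return morphemes


def match_at(morphemes, start, rules):
    """Return (end index, label) of the first rule matching at start, else None."""
    for front, rear, label, max_len in rules:
        if front not in morphemes[start].features:
            continue
        stop = min(start + max_len, len(morphemes))
        for end in range(start + 1, stop):
            if rear in morphemes[end].features:
                return end, label
    return None


def identify(morphemes, rules):
    """Collect each matched morpheme string as one proper name."""
    found, i = [], 0
    while i < len(morphemes):
        hit = match_at(morphemes, i, rules)
        if hit is None:
            i += 1
            continue
        end, label = hit
        found.append((" ".join(m.surface for m in morphemes[i:end + 1]), label))
        i = end + 1
    return found
```

With these assumed inputs, `identify(annotate(analyze_morphemes("Yesterday Acme Trading Co. announced results"), COMPONENTS), RULES)` groups the three morphemes from "Acme" to "Co." into a single organization name, even though "Trading" is not in the dictionary.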

2. The proper name recognition device according to claim 1, wherein the proper name component candidate includes at least one of a front end, a rear end, and a central portion of the proper name.

3. The proper name recognition device according to claim 2, wherein the proper name component candidate includes the proper name itself.

4. The proper name recognition device according to claim 1, 2 or 3, wherein the proper name information analyzing means, when a morpheme output from the morphological analysis means corresponds to a proper name component candidate, assigns to the morpheme an attribute specified using the type of the proper name component candidate.

5. The proper name recognition device according to claim 1, wherein, when a part of a morpheme output from the morphological analysis means corresponds to a proper name component candidate, the proper name information analyzing means assigns an attribute specified using the type of the proper name component candidate to that part of the morpheme.

6. The proper name recognition device according to claim 1, wherein the proper name specifying rule determines an attribute of the extracted proper name.

7. A document processing device that recognizes a proper name in a document using the proper name recognition device according to any one of claims 1 to 6, comprising:
means for designating text, or a part thereof, in the document;
means for specifying a proper name associated with the designated text or part thereof; and
means for emphasizing or hiding the portion corresponding to the specified proper name.

8. A document processing device that recognizes a proper name in a document using the proper name recognition device according to claim 6, comprising:
means for storing, in association with the proper name, a rule for determining a process for a portion corresponding to the proper name; and
means for applying the rule to a proper name recognized in the document by the proper name recognition device and executing, on the portion corresponding to the proper name, a process corresponding to the attribute of the proper name.
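
The attribute-dependent processing of claims 7 and 8 might look like the sketch below. It assumes the recognizer has already produced non-overlapping `(start, end, attribute)` character spans; the `PROCESSING_RULES` table, the action names `hide` and `emphasize`, and the asterisk and bracket renderings are hypothetical and only illustrate the idea of choosing a process per attribute.

```python
# Hypothetical processing rules: proper name attribute -> action.
PROCESSING_RULES = {"PERSON": "hide", "ORGANIZATION": "emphasize"}


def apply_processing(text, recognized, rules):
    """Apply the action selected by each span's attribute.

    recognized: non-overlapping (start, end, attribute) character spans.
    """
    out, cursor = [], 0
    for start, end, attribute in sorted(recognized):
        out.append(text[cursor:start])       # untouched text before the span
        span = text[start:end]
        action = rules.get(attribute)
        if action == "hide":
            out.append("*" * len(span))      # mask the proper name
        elif action == "emphasize":
            out.append("[" + span + "]")     # mark the proper name
        else:
            out.append(span)                 # no rule: leave as is
        cursor = end
    out.append(text[cursor:])
    return "".join(out)
```

Keeping the rules in a table keyed by attribute means a new policy, such as hiding place names, only needs a new table entry rather than a change to the recognizer.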

9. A proper name component candidate extracting device comprising:
means for inputting a known proper name;
means for morphologically analyzing the input known proper name; and
means for storing proper name component candidates, based on the result of the morphological analysis, in association with their positions in the known proper name.
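
A minimal sketch of this candidate extraction is given below. The whitespace split again stands in for morphological analysis, and the position labels `FRONT`, `MIDDLE`, `REAR` and `WHOLE` are hypothetical names for the front end, central portion, rear end, and the proper name itself mentioned in claims 2 and 3.

```python
from collections import defaultdict


def split_morphemes(name):
    """Stand-in for morphological analysis: split on whitespace."""
    return name.split()


def extract_component_candidates(known_names):
    """Map each morpheme of the known names to the positions it occupied."""
    store = defaultdict(set)
    for name in known_names:
        morphemes = split_morphemes(name)
        if len(morphemes) == 1:
            # A single-morpheme name is stored as the proper name itself.
            store[morphemes[0]].add("WHOLE")
            continue
        last = len(morphemes) - 1
        for index, surface in enumerate(morphemes):
            if index == 0:
                position = "FRONT"
            elif index == last:
                position = "REAR"
            else:
                position = "MIDDLE"
            store[surface].add(position)
    return dict(store)
```

The resulting mapping could serve as the component candidate dictionary consulted during recognition, so that a rear component learned from one known name helps delimit names that are not in any dictionary.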

10. A proper name recognition method for recognizing a proper name using proper name component candidate storage means, proper name specifying rule storage means, morphological analysis means, proper name information analyzing means, and proper name specifying means, the method comprising the steps of:
storing proper name component candidates by the proper name component candidate storage means;
storing, by the proper name specifying rule storage means, a proper name specifying rule defined in relation to the proper name component candidates;
morphologically analyzing a sentence by the morphological analysis means;
analyzing, by the proper name information analyzing means, the morphemes output from the morphological analysis means using the proper name component candidates; and
specifying, by the proper name specifying means, a proper name included in the sentence by applying the proper name specifying rule to a sentence analysis result obtained by reflecting, in the analysis result of the morphological analysis means, the result of the analysis using the proper name component candidates.

11. A proper name component candidate extracting method for extracting proper name component candidates using proper name input means, morphological analysis means, and proper name component candidate storage means, the method comprising the steps of:
inputting a known proper name by the proper name input means;
morphologically analyzing the input known proper name by the morphological analysis means; and
storing, by the proper name component candidate storage means, proper name component candidates based on the result of the morphological analysis, in association with their positions in the known proper name.

12. A proper name recognition computer program for recognizing a proper name using proper name component candidate storage means, proper name specifying rule storage means, morphological analysis means, proper name information analyzing means, and proper name specifying means, the program causing a computer to execute the steps of:
storing proper name component candidates by the proper name component candidate storage means;
storing, by the proper name specifying rule storage means, a proper name specifying rule defined in relation to the proper name component candidates;
morphologically analyzing a sentence by the morphological analysis means;
analyzing, by the proper name information analyzing means, the morphemes output from the morphological analysis means using the proper name component candidates; and
specifying, by the proper name specifying means, a proper name included in the sentence by applying the proper name specifying rule to a sentence analysis result obtained by reflecting, in the analysis result of the morphological analysis means, the result of the analysis using the proper name component candidates.

13. A proper name component candidate extracting computer program for extracting proper name component candidates using proper name input means, morphological analysis means, and proper name component candidate storage means, the program causing a computer to execute the steps of:
inputting a known proper name by the proper name input means;
morphologically analyzing the input known proper name by the morphological analysis means; and
storing, by the proper name component candidate storage means, proper name component candidates based on the result of the morphological analysis, in association with their positions in the known proper name.