JP4098873B2

JP4098873B2 - Document summarizer for word processors

Info

Publication number: JP4098873B2
Application number: JP04065098A
Authority: JP
Inventors: Ronald A. Fein; William B. Dolan; Edward J. Fries; John Messerly; Christopher A. Thorpe; Shawn J. Cokus
Original assignee: Microsoft Corp
Current assignee: Microsoft Corp
Priority date: 1998-02-23
Filing date: 1998-02-23
Publication date: 2008-06-11
Anticipated expiration: 2018-02-23
Also published as: JPH11259457A

Description

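[0001]
[Technical Field of the Invention]
The present invention relates to word processors, and more particularly to document summarizers for word processors.
[0002]
[Prior Art]
Many people face the tedious task of reading large amounts of electronic text. In the computer age, people are inundated with papers, memos, e-mail messages, reports, web pages, schedules, reference materials, test results, and the like. Unfortunately, many documents do not begin with a summary. Writing a summary is tedious: the author must re-read the document, identify its main themes, and condense its main points into a concise summary. Many authors simply do not bother.
[0003]
Summarizing a document is even more difficult and time-consuming for a reader. The reader must first read (or at least skim) the entire document and understand its content. The reader must then try to extract the key points of the document from the unimportant details.
[0004]
The problem of handling large volumes of unsummarized documents is particularly acute for MIS (Management Information Systems) personnel. These people face the daily task of organizing, managing, and retrieving documents from large databases. Consider a typical scenario. An MIS staff member receives a cryptic request to find all documents on a topic that is believed to have been discussed in several company memos written some three or four years ago. To satisfy the request, the staff member must first run a word search on the topic and then painstakingly read through each document that is hit in an effort to find the elusive memos. Without summaries, the staff member is forced to read most, if not all, of each document to decide whether it is relevant. Much of the staff member's time is wasted reading unnecessary text.
[0005]
For individual users who browse the Internet or other networks to find documents on a topic of interest, the problem is less critical but still annoying. To find a document, the user must either read it online to determine whether it is relevant (at the cost of additional online charges) or download it for later review (at the risk of retrieving an irrelevant document).
[0006]
To help address these problems, computer-implemented document summarizers have been developed that automatically summarize text documents for the reader. A document summarizer examines an existing document and attempts to create an abstract or summary from the existing text.
[0007]
Early development of document summarizers focused on statistical approaches to summary creation. One statistical approach is described in a paper by H.P. Luhn entitled "The Automatic Creation of Literature Abstracts," published in the IBM Journal, April 1958, pages 159-165. Luhn's approach assigns each sentence a "significance" factor derived from an analysis of the sentence's words. The factor is computed by identifying a cluster of words within the sentence, counting the number of significant words in the cluster, and dividing the square of that number by the total number of words in the cluster. The sentences are then ranked by significance factor, and one or more of the highest-ranked sentences are selected to form the abstract.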
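[0008]
Most, if not all, document summarizers in use today appear to use Luhn's approach. Examples of such summarizers include Text Summariser from BT (formerly British Telecom), Visual Recall from Xsoft Corporation (a Xerox subsidiary), and InText from Island Software.
[0009]
Another approach to summarizing documents is described in a paper by Kenji Ono et al. entitled "Abstract Generation Based on Rhetorical Structure Extraction," published in Proceedings of the 15th International Conference on Computational Linguistics, Vol. 1, pages 344-348, for the conference held August 5-9, 1994, in Kyoto, Japan. Their approach involves a linguistic analysis that constructs a rhetorical structure representing the relations between various chunks of sentences in the body of a section. The rhetorical structure is expressed at two levels: an intra-paragraph level, which decomposes the text by sentence units, and an inter-paragraph level, which decomposes the text by paragraph units. The rhetorical structure is extracted using a detailed and elaborate five-step procedure. Ono's approach is unnecessarily complex for the many situations in which only a basic summary is wanted.
[0010]
Moreover, this approach is highly genre-dependent and produces good summaries only when the text is rich in surface markers of discourse structure. It therefore works relatively well on the academic prose that Ono et al. studied, but not on documents written in less formal prose.
[0011]
Once a summary has been created, conventional document summarizers present the result to the reader in one of two formats. In the first format, the sentences deemed part of the summary are underlined or otherwise highlighted. In the second format, only the abstract sentences are shown, in paragraph or bulleted form, without the accompanying text of the document.
[0012]
One problem common to conventional document summarizers is that they are reader-oriented. They do not consider summary creation and summary presentation from the author's point of view.
[0013]
Accordingly, there remains a need for an author-oriented summarizer for word processors that helps authors automatically create summaries of their own writing and that produces a summary for any given text.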
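[0014]
[Summary of the Invention]
The present invention relates to a document summarizer that is particularly useful in helping authors prepare summaries of their documents and in helping readers examine documents that lack summaries. For a given text, the document summarizer first performs a statistical analysis to generate a ranked list of sentences to be considered for the summary. The summarizer counts how frequently content words appear in the document and generates a table that associates each content word with its frequency count. A sentence score is derived for each sentence by summing the frequency counts of the content words in the sentence and dividing the sum by the number of content words in the sentence. The sentences are then ranked in order of sentence score: higher-ranked sentences have relatively higher scores, and lower-ranked sentences have relatively lower scores.
[0015]
In the same pass through the document, in parallel with the statistical analysis, the document summarizer performs a cue-phrase analysis. The cue-phrase analysis consults a pre-stored list of words and phrases that serve as indicators of discourse relations between neighboring sentences in the document, or as indicators of the overall importance of a particular sentence in the document. It compares the character strings of each sentence against this pre-stored cue-phrase list. Each cue phrase has an associated condition that is used to decide whether a sentence containing the cue phrase is used in the summary.
[0016]
For example, the list may include words or phrases that depend on the surrounding context in the document for the sentence to be properly understood. A sentence beginning "That is why..." or "In contrast to this..." depends on what was said in the preceding sentence. The summarizer establishes a condition that a sentence containing a dependent word or phrase may be included in the summary only if the neighboring context on which the word or phrase depends is also included.
[0017]
The pre-stored list also includes cue phrases whose presence in a sentence excludes that sentence from the summary no matter how high its statistically derived score. For example, a sentence containing the phrase "as shown in Fig. ..." should not be included in the summary, because the referenced figure will not be present.
[0018]
After the statistical analysis and the cue-phrase analysis, the summarizer creates a summary containing the higher-ranked sentences. The summary may include a conditioned sentence (for example, a sentence containing a dependent word or phrase) when the condition established for including it is satisfied. The summary never includes prohibited sentences.
[0019]
Based on the user's choice, the summarizer inserts the sentences either at the beginning of the document, before the text starts, or into a new document. This placement is convenient and useful for the author, who is then free to edit the summary as desired.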
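[0020]
[Embodiments of the Invention]
FIG. 1 shows a computer 20 having a central processing unit (CPU) 22, a monitor or display 24, a keyboard 26, and a mouse 28. Other input devices, such as a trackball, joystick, and the like, may be substituted for, or used in conjunction with, the keyboard and mouse. The CPU 22 is of standard construction, including memory (disk, RAM, graphics) and a processor.
[0021]
The computer 20 runs an operating system that supports multiple applications. The operating system is stored in memory in the CPU 22 and executes on the processor. The operating system is preferably a multitasking operating system that allows multiple applications to run simultaneously. One example is a Windows® brand operating system sold by Microsoft Corporation, such as Windows® 95, Windows NT™, or another derivative version of Windows®. Other operating systems may be used, however, such as the Mac™ OS operating system produced by Apple Computer, Inc. and used in Macintosh computers.
[0022]
The present invention relates to a document summarizer that can be implemented in a word processing system. In the system described, the word processing system is implemented as a software application that is stored in the CPU memory or another loadable storage medium and runs on the operating system of the computer 20. One example of a word processing application is Microsoft® Word from Microsoft Corporation, modified to include the document summarizer described herein.
[0023]
Note that the word processing system may be implemented in other ways. For example, it may consist of a dedicated typewriter machine with limited memory and processing capability (compared with a personal computer) that is used almost exclusively for word processing tasks. Note also that the document summarizer described herein can be implemented in other programs, such as Internet web browsers (e.g., Internet Explorer from Microsoft Corporation), e-mail programs (e.g., WordMail and Exchange from Microsoft Corporation), and the like. For purposes of discussion, however, the document summarizer is described in the context of a computer word processing program such as Microsoft® Word.
[0024]
When an author wishes to summarize a document, the author invokes the document summarizer function of the word processing program. As used herein, the term "document" means any image that contains text in a format for a viewer or other computer program that subsequently renders the text in an intelligible language. Examples of documents include conventional word processing documents, e-mail messages, memos, web pages, and the like. The document summarizer is invoked through a pull-down menu or soft button on the graphical user interface window presented by the word processor. Once invoked, the document summarizer begins processing the document to produce a summary.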
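[0025]
FIG. 2 shows the general steps performed by the computer in a computer-implemented method for summarizing a document. The method is described with additional reference to an example document. The example document contains a four-sentence paragraph, which is condensed into a two-sentence summary. The paragraph reads as follows.
[0026]
The Internet is a great place to shop for a computer. Manufacturers have web sites describing their computers. One computer manufacturer offers a money back guarantee. That is why that manufacturer has so many visits to its Internet web site.
Broadly, the document summarization process involves three phases: a statistical phase, a cue-phrase phase, and a presentation phase. The statistical and cue-phrase phases are preferably performed in parallel during a single pass through the document, although they may also be performed sequentially, in either order. In the statistical phase, the document summarizer reads each word and counts how frequently content words appear in the document (step 40 in FIG. 2). A "content word" is a word that carries the meaning of the text rather than serving a grammatical function. Nouns are good examples of content words. In the paragraph above, the content words include "Internet," "manufacturer," "computer," and so on.
[0027]
In the context of the summarizer, a content word can be defined technically as any word that is not a "stop word." In this context, the set of stop words includes words that serve a grammatical function (e.g., conjunctions, articles, prepositions) and fairly high-frequency verbs and nouns that contribute relatively little semantic content to a sentence (e.g., "get," "have"). The essential properties of stop words are that they do not contribute directly to the theme of a document and that a document is highly unlikely to be about a stop word. Stop words should therefore not be counted. The stop words are preferably kept in a list stored in memory. The processor thus reads every word but counts only the words that do not appear on the stop-word list. In the example paragraph above, the first sentence contains the stop words "The," "is," "a," "great," "to," "for," and "a."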
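The counting rule of step 40 can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the `STOP_WORDS` set and the function name `count_content_words` are hypothetical, and the set holds only the stop words the text itself names.

```python
import re
from collections import Counter

# Hypothetical stop-word list: only the examples named in the text.
STOP_WORDS = {"the", "is", "a", "great", "to", "for", "get", "have"}

def count_content_words(text: str) -> Counter:
    """Read every word, but count only those not on the stop-word list."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

counts = count_content_words(
    "The Internet is a great place to shop for a computer."
)
print(sorted(counts.items()))
# [('computer', 1), ('internet', 1), ('place', 1), ('shop', 1)]
```

A real stop-word list would be far longer; keeping it in a set gives constant-time membership checks while every word is still read once.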
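[0028]
During the pass through the document, the document summarizer examines morphological variants of the content words and converts them to their root form (step 42). For example, the words "walking," "walked," and "walks" are all morphological variants of the root form "walk." In this way, the root form and its variants are all counted as the same word. In the example paragraph above, the words "computer" and "computers" are counted as the same word, as are "manufacturer" and "manufacturers."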
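As a rough illustration of step 42, the sketch below strips a few common English suffixes. It is a deliberately crude stand-in, assumed here only for demonstration: the text does not say how root forms are found, and the hypothetical `root_form` function mishandles many real words (for example, it turns "describing" into "describ").

```python
def root_form(word: str) -> str:
    """Toy normalizer: strip one common suffix, keeping at least 3 letters."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([root_form(w) for w in ("walking", "walked", "walks", "computers")])
# ['walk', 'walk', 'walk', 'computer']
```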
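[0029]
The summarizer also analyzes the words to determine whether phrase compression is possible (step 44). A set of content words that repeatedly appears in the same order is counted as if it were a single content word. For example, the word pair "Microsoft Corporation" may be counted as a single word if it appears a sufficient number of times in exactly that order. Taken separately, the words in such a phrase add no meaning of their own to the sentence. Without phrase compression, "Microsoft" and "Corporation" would each be counted independently, which could undesirably skew the importance of the sentences containing them. In the example paragraph above, the phrase "web site" appears twice in the same form and is therefore a candidate for phrase compression. The phrase "money back guarantee" might likewise be compressed into a single-word phrase and counted as one item.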
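One way to picture phrase compression is the sketch below, which merges adjacent content-word pairs that recur in the same order. Several things here are assumptions rather than details from the text: the input is taken to be lists of content words already reduced to root form, only two-word phrases are handled, and the `min_count` threshold of 2 stands in for the unspecified "sufficient number of times."

```python
from collections import Counter

def compress_phrases(sentences, min_count=2):
    """Merge adjacent word pairs that occur at least min_count times."""
    pair_counts = Counter(
        pair for words in sentences for pair in zip(words, words[1:])
    )
    frequent = {p for p, n in pair_counts.items() if n >= min_count}
    compressed = []
    for words in sentences:
        out, i = [], 0
        while i < len(words):
            if i + 1 < len(words) and (words[i], words[i + 1]) in frequent:
                out.append(words[i] + " " + words[i + 1])  # one unit
                i += 2
            else:
                out.append(words[i])
                i += 1
        compressed.append(out)
    return compressed

example = [
    ["manufacturer", "web", "site", "computer"],
    ["manufacturer", "visit", "internet", "web", "site"],
]
print(compress_phrases(example))
# [['manufacturer', 'web site', 'computer'],
#  ['manufacturer', 'visit', 'internet', 'web site']]
```

After compression, "web site" contributes one count per occurrence instead of separate counts for "web" and "site."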
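[0030]
Once all content words in the document have been counted, the document summarizer generates a table that associates each content word with its frequency count (step 46). The content words can be ranked so that the most frequently occurring words appear at the top of the table. Table 1 shows the ranking of the content words for the example document above.
[0031]
[Table 1]

In step 48, the document summarizer derives a sentence score for each sentence in the document from the content words in that sentence. A sentence with many content words that appear frequently in the document ranks higher than a sentence with few such words, and higher than a sentence whose content words appear infrequently in the document. More specifically, the document summarizer ranks sentences by the average score of their words. This value is obtained by summing the frequency counts of all content words in the sentence and dividing the sum by the number of content words in the sentence. The sentence score is expressed as follows.
[0032]
[Equation 1]
Sentence score = (sum of word frequency counts) ÷ (number of words)
The sentences are then ranked in order of sentence score (step 50 in FIG. 2). Higher-ranked sentences have relatively higher scores, and lower-ranked sentences have relatively lower scores. Using the word counts of Table 1, the score of the first sentence of the example paragraph is 1.75, as follows.
[0033]
[Equation 2]
Sentence #1 = [Internet(2) + Place(1) + Shop(1) + Computer(3)] ÷ 4 words = 1.75
The scores of the remaining three sentences are computed in the same way. Table 2 shows the ranking of the four sentences in the example paragraph.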
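The scoring rule can be checked with a short Python sketch. The per-sentence content-word lists in `SENTENCES` are an assumption: they were chosen so that sentences #1 and #2 reproduce the scores 1.75 and 2.67 quoted in the text. The scores for sentences #3 and #4 simply follow from those assumed lists and are not taken from the tables.

```python
from collections import Counter

# Assumed content words per sentence after stop-word removal,
# root-form conversion and phrase compression.
SENTENCES = [
    ["internet", "place", "shop", "computer"],
    ["manufacturer", "web site", "computer"],
    ["computer", "manufacturer", "money back guarantee"],
    ["manufacturer", "visit", "internet", "web site"],
]

def sentence_score(words, freq):
    """Average document frequency of the sentence's content words."""
    return sum(freq[w] for w in words) / len(words) if words else 0.0

freq = Counter(w for s in SENTENCES for w in s)
scores = [sentence_score(s, freq) for s in SENTENCES]
ranking = sorted(range(len(SENTENCES)), key=lambda i: -scores[i])

print([round(x, 2) for x in scores])  # [1.75, 2.67, 2.33, 2.0]
print([i + 1 for i in ranking])       # [2, 3, 4, 1]
```

The resulting order (#2, #3, #4, #1) agrees with the text's statements that sentences #2 and #3 rank highest and that sentence #4 has the third-highest score.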
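[0034]
[Table 2]

Note that other techniques may be used to derive the sentence scores. For example, the score may be computed by dividing the total frequency count by the total number of all words in the sentence (including stop words). A different approach is to take no average at all and simply sum the content word counts. Other mathematical or statistical techniques can also be used, such as basing the sentence score on the median of the content word counts.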
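For comparison, the three alternatives can be written as small Python functions. The function names are hypothetical, and the inputs are the frequency counts quoted for sentence #1 (2, 1, 1, 3) together with its total length of 11 words.

```python
from statistics import median

def score_over_all_words(content_counts, total_words):
    """Total frequency divided by all words, stop words included."""
    return sum(content_counts) / total_words

def score_sum(content_counts):
    """Plain sum of content-word counts, with no averaging."""
    return sum(content_counts)

def score_median(content_counts):
    """Median of the content-word counts."""
    return median(content_counts)

counts_s1 = [2, 1, 1, 3]  # Internet, place, shop, computer
print(round(score_over_all_words(counts_s1, 11), 2))  # 0.64
print(score_sum(counts_s1))                           # 7
print(score_median(counts_s1))                        # 1.5
```

The un-averaged sum tends to favor long sentences, while dividing by all words penalizes sentences padded with stop words.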
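[0035]
Steps 40-50 constitute the statistical phase of the summarization method. In parallel with the statistical phase, and during the same pass through the document, the document summarizer performs the cue-phrase analysis. The cue-phrase analysis exploits explicit discourse markers present in the text. Broadly, it searches for phrases that could make a sentence confusing or hard to understand if the sentence were included in the summary. In the present implementation, the document summarizer compares the character strings of each sentence against a pre-stored list of words and phrases (step 52).
[0036]
On recognizing a word or phrase from the list, the document summarizer designates the whole sentence as either "prohibited" or "conditioned." If a sentence is prohibited, the document summarizer keeps it out of the summary regardless of its sentence score (steps 54 and 56). If a sentence is conditioned, the document summarizer includes it in the summary only if its condition is satisfied (steps 58 and 60). One example of a conditioned sentence is a sentence that depends on the preceding sentence, or on the surrounding context, to be understood. A sentence beginning "He said..." is clear only if the reader knows who "He" is. Such a sentence therefore depends on the preceding context and is used in the summary only if the preceding sentence that identifies "He" is also used.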
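The string comparison of step 52 might look like the sketch below. The `CUE_PHRASES` list is hypothetical: it contains only the phrases mentioned in the text, since Table 3 is not reproduced here. Checking prohibited phrases first, and using the label "free" for unmarked sentences, are also assumptions.

```python
# Hypothetical cue-phrase list built from phrases quoted in the text.
CUE_PHRASES = [
    ("as shown in fig", "prohibited"),
    ("that is why", "conditioned"),
    ("in contrast to this", "conditioned"),
    ("he said", "conditioned"),
]

def classify(sentence: str) -> str:
    """Return 'prohibited', 'conditioned', or 'free' for a sentence."""
    text = sentence.lower()
    for phrase, label in CUE_PHRASES:
        if phrase in text:  # plain substring match on the sentence string
            return label
    return "free"

print(classify("That is why that manufacturer has so many visits."))
# conditioned
print(classify("The layout is as shown in Fig. 2."))
# prohibited
```

A plain substring match is the simplest possible comparison; a fuller version would probably respect word boundaries and sentence position.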
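[0037]
Table 3 shows examples of words and phrases from the pre-stored cue-phrase list that cause a sentence to be designated "prohibited" or "conditioned."
[0038]
[Table 3]

Applying the cue-phrase analysis to the example paragraph shows that the fourth sentence is conditioned, because it contains the phrase "That is why...". This phrase is listed in the cue-phrase list as a preceding-dependent phrase, meaning a phrase that depends on the previous sentence for its context. Here, the preceding third sentence explains that one computer manufacturer offers a money back guarantee. That is the reason supporting the statement in the fourth sentence that the manufacturer receives many visits to its web site. If the fourth sentence appeared in the summary without the third, the reader would not understand why the manufacturer receives so many visits. The document summarizer therefore sets a condition that the fourth sentence is used in the summary only if the third sentence is also used.
[0039]
In this example, as it turns out, the fourth sentence would appear only when the third sentence is also used even without the cue-phrase list, simply because the third sentence has a higher score than the fourth. This result is an artifact of a short document with few sentences. In larger documents with more sentences, however, the cue-phrase list effectively places conditions on the use of certain sentences. Suppose, for example, that the fourth sentence of the four-sentence paragraph above had a higher sentence score than the third. In that case, the fourth sentence would be used only if the lower-scoring, preceding third sentence were also used.
[0040]
After the statistical and cue-phrase analysis phases, the document summarizer creates a summary containing the higher-ranked sentences that survive the cue-phrase analysis (step 62). The summary may include conditioned sentences whose conditions are satisfied, but it excludes all prohibited sentences. The length of the summary is a parameter that the author can adjust. From Table 2, the two-sentence summary of the example paragraph is as follows.
[0041]
Manufacturers have web sites describing their computers. One computer manufacturer offers a money back guarantee.
The two sentences in the summary are the highest ranked. The sentences are assembled in the summary in their order of appearance in the document, not in order of rank. In this case the two orders coincide, but that is not always so. Suppose, for example, that the third sentence were ranked higher than the second. In the resulting summary, the lower-ranked second sentence would still precede the higher-ranked third sentence, because the second sentence appears before the third in the document. Ordering the summary by rank would rearrange the author's sentence order and could produce a confusing, hard-to-read summary.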
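The difference between picking by rank and writing in document order is easy to show in Python. In this sketch the function name `assemble` is hypothetical, and the score list is partly assumed: 1.75 and 2.67 are quoted in the text, while 2.33 and 2.0 are illustrative values consistent with the stated ranking.

```python
PARAGRAPH = [
    "The Internet is a great place to shop for a computer.",
    "Manufacturers have web sites describing their computers.",
    "One computer manufacturer offers a money back guarantee.",
    "That is why that manufacturer has so many visits to its Internet web site.",
]

def assemble(sentences, scores, length):
    """Pick the top-scoring sentences, then emit them in document order."""
    by_rank = sorted(range(len(sentences)), key=lambda i: -scores[i])
    chosen = sorted(by_rank[:length])  # back to order of appearance
    return " ".join(sentences[i] for i in chosen)

print(assemble(PARAGRAPH, [1.75, 2.67, 2.33, 2.0], 2))
# Same output if sentence #3 outranks sentence #2:
print(assemble(PARAGRAPH, [1.75, 2.33, 2.67, 2.0], 2))
```

Both calls print the second sentence followed by the third, because the final sort restores the author's original order.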
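[0042]
The two-sentence summary contains no sentences involving cue phrases. If the summary is extended to three sentences, however, it reads as follows.
[0043]
Manufacturers have web sites describing their computers. One computer manufacturer offers a money back guarantee. That is why that manufacturer has so many visits to its Internet web site.
In this summary, the last sentence (the fourth sentence of the original) has the third-highest sentence score (see Table 2). It is also a conditioned sentence, because it contains the phrase "That is why...", which is on the pre-stored cue-phrase list. It is therefore used only if its condition is satisfied. Here the condition is a preceding-dependency condition, which specifies that a sentence in this class may be included in the summary only if the preceding sentence is also included. Because the third sentence appears in the summary, the preceding-dependency condition is satisfied and the fourth sentence may be included.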
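A selection loop that honors prohibited sentences and the preceding-dependency condition could be sketched as follows. The policy for a conditioned sentence that outranks its predecessor is an assumption made for this sketch: the sentence is deferred and admitted only once the previous sentence has been chosen. The status labels and the function name `select` are hypothetical, and chains of deferred sentences are not handled.

```python
def select(scores, status, length):
    """Return chosen sentence indices (0-based) in document order.

    status[i] is 'free', 'prohibited', or 'needs_previous'.
    """
    chosen, deferred = set(), set()
    for i in sorted(range(len(scores)), key=lambda i: -scores[i]):
        if len(chosen) >= length:
            break
        if status[i] == "prohibited":
            continue  # never used, whatever its score
        if status[i] == "needs_previous" and i - 1 not in chosen:
            deferred.add(i)  # revisit once the previous sentence is in
            continue
        chosen.add(i)
        if i + 1 in deferred and len(chosen) < length:
            chosen.add(i + 1)
    return sorted(chosen)

status = ["free", "free", "free", "needs_previous"]
print(select([1.75, 2.67, 2.33, 2.0], status, 3))  # [1, 2, 3]
print(select([1.75, 2.67, 2.0, 2.33], status, 2))  # [1, 2]
```

In the second call the fourth sentence outranks the third, yet it stays out of the two-sentence summary because the sentence it depends on was not selected before the length limit was reached.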
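[0044]
After creating the summary, the document summarizer displays it on the computer monitor in one of four author-selected UI (user interface) formats (step 64). The first UI format inserts the summary at the top of the existing document: the document summarizer locates the beginning of the file and inserts the summary text before the first paragraph of the document. FIG. 3(a) shows an existing document 70 with a summary 72 inserted at the top. The second UI format creates or opens a new document and inserts the summary into it. FIG. 3(b) shows a new document 74 that has been opened and overlaid on the existing document 70; the summary 72 is inserted into the new document 74.
[0045]
The third UI format underlines or otherwise highlights the important sentences used in the summary. The fourth UI format shows only the summary sentences, without the accompanying text. The third and fourth formats resemble the conventional presentations described in the Prior Art section.
[0046]
Once the summary has been created and displayed, the author can save the summary, in the existing document or in the new document, to memory (step 66).
[0047]
One modification of the computer-implemented method described above concerns the statistical phase. In the method described above, the content words are counted once and the same frequency counts are used to derive all of the sentence scores. In some cases, certain words in the higher-ranked sentences can unduly dominate and influence the sentence scores.
[0048]
The modification is an iterative scoring approach. In this approach, the summarizer first scores all sentences in a first iteration, as described above. In the next iteration, the summarizer removes the influence of the highest-ranked sentence and rescores the remaining sentences as if that sentence did not exist. In the following iteration, it removes the influence of the highest-ranked sentence found in the previous iteration and rescores the remaining sentences as if the two highest-ranked sentences did not exist. The process continues in this way through all of the sentences.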
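[0049]
To illustrate the modified statistical analysis, it is applied to the four-sentence paragraph used above. The first step counts the content words while handling stop words and phrase compression; the counts are those of Table 1. Sentence scores are derived next. The first iteration yields the same score of 2.67 for sentence #2. Here, however, the modified method begins to differ. To remove the influence of the highest-ranked sentence, the document summarizer recomputes the sentence scores as if the second sentence were not in the document at all, and the frequency counts of the content words are reduced accordingly. Table 4 is a modified version of Table 1 that reflects the absence of the second sentence.
[0050]
[Table 4]

Next, the remaining three sentences are rescored using the modified frequency counts. This yields a score of 1.67 for the third sentence, which is the second highest.
[0051]
[Equation 3]
Sentence #3 = [computer(2) + manufacturer(2) + money(1)] ÷ 3 words = 1.67
The influence of sentence #3 is then removed, and the frequency counts of the content words are reduced accordingly. Table 5 is a modified version of Table 4 that reflects the absence of the second and third sentences.
[0052]
[Table 5]

Continuing the process through the remaining two sentences yields the new sentence ranking given in Table 6.
[0053]
[Table 6]

The iterative rescoring approach produces a slight difference in the sentence ranking: sentence #1 is now ranked higher than sentence #4. The two-sentence summary produced with iterative rescoring is identical to the two-sentence summary produced with the method described above. The three-sentence summary, however, is quite different. Using Table 6, the three-sentence summary is as follows.
[0054]
The Internet is a great place to shop for a computer. Manufacturers have web sites describing their computers. One computer manufacturer offers a money back guarantee.
This three-sentence summary is a good example of writing the summary sentences in their order of appearance in the document rather than in rank order. The opening sentence of the summary is in fact the third-highest-ranked sentence. It is nevertheless written first because it appears earlier in the document than the higher-ranked sentences #2 and #3.
[0055]
In the example above, the counts of the content words appearing in a higher-ranked sentence are each reduced by one full count. In other implementations, the frequency counts can be adjusted by varying amounts, depending on how much of the higher-ranked sentence's influence the manufacturer or author wishes to remove. For example, the summarizer might compensate by subtracting a fractional amount (e.g., 0.3 or 0.5) from each count corresponding to a word that appears in the highest-ranked sentence. Alternatively, the amount of compensation may vary according to whether the content word has a high or low frequency count relative to other content words. The amount by which word counts are compensated during this dynamic scoring process can be determined and set by the manufacturer or author according to various statistical or mathematical techniques that appropriately offset the influence of content words appearing in higher-ranked sentences.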
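The iterative rescoring loop, including the adjustable decrement, can be sketched in Python as below. The content-word lists are assumptions chosen to reproduce the quoted scores of 2.67 for sentence #2 and 1.67 for sentence #3. The `penalty` parameter and the rule that ties go to the earlier sentence are also assumptions; with these assumed lists, sentences #1 and #4 tie at 1.25 in the third iteration, so the final order depends on that tie-break.

```python
from collections import Counter

# Assumed content words per sentence (non-empty lists are required).
SENTENCES = [
    ["internet", "place", "shop", "computer"],
    ["manufacturer", "web site", "computer"],
    ["computer", "manufacturer", "money back guarantee"],
    ["manufacturer", "visit", "internet", "web site"],
]

def iterative_ranking(sentences, penalty=1.0):
    """Repeatedly pick the best sentence, then discount its words."""
    freq = Counter(w for s in sentences for w in s)
    remaining = list(range(len(sentences)))
    picks = []
    while remaining:
        scores = {
            i: sum(freq[w] for w in sentences[i]) / len(sentences[i])
            for i in remaining
        }
        best = max(remaining, key=lambda i: scores[i])  # ties: earliest
        picks.append((best + 1, round(scores[best], 2)))
        for w in sentences[best]:
            freq[w] -= penalty  # remove (part of) this sentence's influence
        remaining.remove(best)
    return picks

print(iterative_ranking(SENTENCES))
# [(2, 2.67), (3, 1.67), (1, 1.25), (4, 1.0)]
print([n for n, _ in iterative_ranking(SENTENCES, penalty=0.5)])
# [2, 3, 4, 1]
```

With a full decrement the order becomes #2, #3, #1, #4, matching the ranking described in the text. With a half-count decrement the assumed data falls back to the non-iterative order, which shows how sensitive the result is to the chosen compensation amount.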
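[0056]
Because the document summarizer is designed from the author's point of view, it is advantageous over prior art summarizers. Using the combined statistical and cue-phrase approach, it enables authors to create summaries of their own writing automatically. Once the summary is created, the summarizer provides a UI that lets the author place the summary at the top of the document or in a new document. This placement is convenient and useful for the author, who is then free to edit the summary as desired.
[0057]
Another advantage of the document summarizer stems from its combined statistical and cue-phrase processing. This dual analysis is beneficial because the statistical component ensures that a summary is produced in every case, while the cue-phrase component improves the quality of the resulting summary.
[0058]
In compliance with the statute, the invention has been described in language more or less specific as to structural and methodical features. It is to be understood, however, that the invention is not limited to the specific features described, since the means disclosed herein are examples of putting the invention into effect. The invention is therefore defined to include any form or modification within the proper scope of the appended claims, appropriately interpreted in accordance with the doctrine of equivalents and other applicable principles.
[Brief Description of the Drawings]
[FIG. 1] A diagram of a computer loaded with a word processing program that has a document summarizer.
[FIG. 2] A flowchart of the steps in a computer-implemented method for summarizing a document.
[FIG. 3] A diagram of documents with an inserted summary, illustrating two different display presentations of the summary.
[Description of Reference Numerals]
20 Computer
22 Central processing unit (CPU)
24 Monitor (display)
26 Keyboard
28 Mouse
70 Existing document
72 Summary
74 New document
[0001]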
BACKGROUND OF THE INVENTION
The present invention relates to word processors, and more specifically to document summarizers for word processors.
[0002]
[Prior art]
Many people face the tedious task of reading large amounts of electronic text. In the computer age, people are inundated with papers, memos, e-mail messages, reports, web pages, schedules, reference materials, test results, and the like. Unfortunately, many documents do not begin with a summary. Writing a summary is tedious: the author must re-read the document, identify its main themes, and condense its main points into a concise summary. Many authors simply do not bother.
[0003]
Summarizing a document is even more difficult and time-consuming for a reader. The reader must first read (or at least skim) the entire document and understand its content. The reader must then try to extract the key points of the document from the unimportant details.
[0004]
The problem of handling large volumes of unsummarized documents is particularly acute for MIS (Management Information Systems) personnel. These people face the daily task of organizing, managing, and retrieving documents from large databases. Consider a typical scenario. An MIS staff member receives a cryptic request to find all documents on a topic that is believed to have been discussed in several company memos written some three or four years ago. To satisfy the request, the staff member must first run a word search on the topic and then painstakingly read through each document that is hit in an effort to find the elusive memos. Without summaries, the staff member is forced to read most, if not all, of each document to decide whether it is relevant. Much of the staff member's time is wasted reading unnecessary text.
[0005]
For individual users who browse the Internet or other networks to find documents on a topic of interest, the problem is less critical but still annoying. To find a document, the user must either read it online to determine whether it is relevant (at the cost of additional online charges) or download it for later review (at the risk of retrieving an irrelevant document).
[0006]
To help address these issues, a computer-implemented document summarizer has been developed that automatically summarizes textual documents for the reader. A document summarizer examines an existing document and attempts to create an abstract or summary from existing text.
[0007]
Early development of document summarizers focused on statistical approaches to summary creation. One statistical approach is described in a paper by H.P. Luhn entitled "The Automatic Creation of Literature Abstracts," published in the IBM Journal, April 1958, pages 159-165. Luhn's approach assigns each sentence a "significance" factor derived from an analysis of the sentence's words. The factor is computed by identifying a cluster of words within the sentence, counting the number of significant words in the cluster, and dividing the square of that number by the total number of words in the cluster. The sentences are then ranked by significance factor, and one or more of the highest-ranked sentences are selected to form the abstract.
[0008]
It's clear that most, if not all, document summarizers used today use Luhn's approach. Examples of such summarizers include Text Summariser from BT (formerly British Telecom), Visual Recall from Xsoft Corporation (a subsidiary of Xerox), and InText from Island Software.
[0009]
Another approach to summarizing documents is described in a paper by Kenji Ono et al. entitled "Abstract Generation Based on Rhetorical Structure Extraction," published in Proceedings of the 15th International Conference on Computational Linguistics, Vol. 1, pages 344-348, for the conference held August 5-9, 1994, in Kyoto, Japan. Their approach involves a linguistic analysis that constructs a rhetorical structure representing the relations between various chunks of sentences in the body of a section. The rhetorical structure is expressed at two levels: an intra-paragraph level, which decomposes the text by sentence units, and an inter-paragraph level, which decomposes the text by paragraph units. The rhetorical structure is extracted using a detailed and elaborate five-step procedure. Ono's approach is unnecessarily complex for the many situations in which only a basic summary is wanted.
[0010]
Moreover, this approach is highly genre-dependent and produces good summaries only when the text is rich in surface markers of discourse structure. It therefore works relatively well on the academic prose that Ono et al. studied, but not on documents written in less formal prose.
[0011]
Once a summary has been created, conventional document summarizers present the result to the reader in one of two formats. In the first format, the sentences deemed part of the summary are underlined or otherwise highlighted. In the second format, only the abstract sentences are shown, in paragraph or bulleted form, without the accompanying text of the document.
[0012]
One problem common to conventional document summarizers is that they are reader-oriented. They do not consider summary creation and summary presentation from the author's point of view.
[0013]
Accordingly, there remains a need for an author-oriented summarizer for word processors that helps authors automatically create summaries of their own writing and that produces a summary for any given text.
[0014]
SUMMARY OF THE INVENTION
The present invention relates to a document summarizer that is particularly useful in helping authors prepare summaries of their documents and in helping readers examine documents that lack summaries. For a given text, the document summarizer first performs a statistical analysis to generate a ranked list of sentences to be considered for the summary. The summarizer counts how frequently content words appear in the document and generates a table that associates each content word with its frequency count. A sentence score is derived for each sentence by summing the frequency counts of the content words in the sentence and dividing the sum by the number of content words in the sentence. The sentences are then ranked in order of sentence score: higher-ranked sentences have relatively higher scores, and lower-ranked sentences have relatively lower scores.
[0015]
In the same pass through the document, in parallel with the statistical analysis, the document summarizer performs a cue-phrase analysis. The cue-phrase analysis consults a pre-stored list of words and phrases that serve as indicators of discourse relations between neighboring sentences in the document, or as indicators of the overall importance of a particular sentence in the document. It compares the character strings of each sentence against this pre-stored cue-phrase list. Each cue phrase has an associated condition that is used to decide whether a sentence containing the cue phrase is used in the summary.
[0016]
For example, the list may include words or phrases that depend on the surrounding context in the document for the sentence to be properly understood. A sentence beginning "That is why..." or "In contrast to this..." depends on what was said in the preceding sentence. The summarizer establishes a condition that a sentence containing a dependent word or phrase may be included in the summary only if the neighboring context on which the word or phrase depends is also included.
[0017]
The pre-stored list also includes cue phrases whose presence in a sentence excludes that sentence from the summary no matter how high its statistically derived score. For example, a sentence containing the phrase "as shown in Fig. ..." should not be included in the summary, because the referenced figure will not be present.
[0018]
After the statistical analysis and the cue-phrase analysis, the summarizer creates a summary containing the higher-ranked sentences. The summary may include a conditioned sentence (for example, a sentence containing a dependent word or phrase) when the condition established for including it is satisfied. The summary never includes prohibited sentences.
[0019]
Based on the user's choice, the summarizer inserts the sentences either at the beginning of the document, before the text starts, or into a new document. This placement is convenient and useful for the author, who is then free to edit the summary as desired.
[0020]
DETAILED DESCRIPTION OF THE INVENTION
FIG. 1 shows a computer 20 having a central processing unit (CPU) 22, a monitor or display 24, a keyboard 26 and a mouse 28. Other input devices such as track balls, joysticks and the like may be substituted for the keyboard and mouse, or may be used with the keyboard and mouse. The CPU 22 has a standard configuration including a memory (disk, RAM, graphics) and a processor.
[0021]
The computer 20 runs an operating system that supports multiple applications. The operating system is stored in memory in the CPU 22 and executes on the processor. The operating system is preferably a multitasking operating system that allows multiple applications to run simultaneously. One example is a Windows® brand operating system sold by Microsoft Corporation, such as Windows® 95, Windows NT™, or another derivative version of Windows®. Other operating systems may be used, however, such as the Mac™ OS operating system produced by Apple Computer, Inc. and used in Macintosh computers.
[0022]
The present invention relates to a document summarizer that can be implemented in a word processing system. In the system described, the word processing system is implemented as a software application, stored in CPU memory or other loadable storage medium, and runs on the operating system of computer 20. An example of a word processing application is Microsoft® Word from Microsoft Corporation that is modified for the document summarizer described herein.
[0023]
Note that the word processing system may be implemented in other ways. For example, it may consist of a dedicated typewriter machine with limited memory and processing capability (compared with a personal computer) that is used almost exclusively for word processing tasks. Note also that the document summarizer described herein can be implemented in other programs, such as Internet web browsers (e.g., Internet Explorer from Microsoft Corporation), e-mail programs (e.g., WordMail and Exchange from Microsoft Corporation), and the like. For purposes of discussion, however, the document summarizer is described in the context of a computer word processing program such as Microsoft® Word.
[0024]
When an author wishes to summarize a document, the author invokes the document summarizer function of the word processing program. As used herein, the term "document" means any image that contains text in a format for a viewer or other computer program that subsequently renders the text in an intelligible language. Examples of documents include conventional word processing documents, e-mail messages, memos, web pages, and the like. The document summarizer is invoked through a pull-down menu or soft button on the graphical user interface window presented by the word processor. Once invoked, the document summarizer begins processing the document to produce a summary.
[0025]
FIG. 2 shows the general steps performed by the computer in a computer-implemented method for summarizing a document. The method is described with additional reference to an example document. The example document contains a four-sentence paragraph, which is condensed into a two-sentence summary. The paragraph reads as follows.
[0026]
The Internet is a great place to shop for a computer. Manufacturers have web sites describing their computers. One computer manufacturer offers a money back guarantee. That is why that manufacturer has so many visits to its Internet web site.
Broadly, the document summarization process involves three phases: a statistical phase, a cue-phrase phase, and a presentation phase. The statistical and cue-phrase phases are preferably performed in parallel during a single pass through the document, although they may also be performed sequentially, in either order. In the statistical phase, the document summarizer reads each word and counts how frequently content words appear in the document (step 40 in FIG. 2). A "content word" is a word that carries the meaning of the text rather than serving a grammatical function. Nouns are good examples of content words. In the paragraph above, the content words include "Internet," "manufacturer," "computer," and so on.
[0027]
In the context of the summarizer, a content word can be defined technically as any word that is not a "stop word." In this context, the set of stop words includes words that serve a grammatical function (e.g., conjunctions, articles, prepositions) and fairly high-frequency verbs and nouns that contribute relatively little semantic content to a sentence (e.g., "get," "have"). The essential properties of stop words are that they do not contribute directly to the theme of a document and that a document is highly unlikely to be about a stop word. Stop words should therefore not be counted. The stop words are preferably kept in a list stored in memory. The processor thus reads every word but counts only the words that do not appear on the stop-word list. In the example paragraph above, the first sentence contains the stop words "The," "is," "a," "great," "to," "for," and "a."
[0028]
During the pass through the document, the document summarizer examines morphological variants of the content words and converts them to their root form (step 42). For example, the words "walking," "walked," and "walks" are all morphological variants of the root form "walk." In this way, the root form and its variants are all counted as the same word. In the example paragraph above, the words "computer" and "computers" are counted as the same word, as are "manufacturer" and "manufacturers."
[0029]
The summarizer also analyzes the words to determine whether phrase compression is possible (step 44). A set of content words that repeatedly appears in the same order is counted as if it were a single content word. For example, the word pair "Microsoft Corporation" may be counted as a single word if it appears a sufficient number of times in exactly that order. Taken separately, the words in such a phrase add no meaning of their own to the sentence. Without phrase compression, "Microsoft" and "Corporation" would each be counted independently, which could undesirably skew the importance of the sentences containing them. In the example paragraph above, the phrase "web site" appears twice in the same form and is therefore a candidate for phrase compression. The phrase "money back guarantee" might likewise be compressed into a single-word phrase and counted as one item.
[0030]
Once all content words in the document have been counted, the document summarizer generates a table that associates the content words with the corresponding frequency numbers (step 46). You can rank the content words so that the most frequently occurring words are at the top of the table. Table 1 shows the ranking of content words in the above document example.
[0031]
[Table 1]

In step 48, the document summarizer obtains a sentence score for each sentence in the document according to each content word of the sentence. A sentence having many content words appearing frequently in a document is ranked higher than either a sentence having few content words appearing frequently or a sentence having content words appearing less frequently in a document. More specifically, the document summarizer ranks sentences according to the average score of words in the sentence. This value is obtained by summing the frequency numbers of all the content words appearing in the sentence and dividing the sum by the number of content words in the sentence. The number of sentences is expressed as follows.
[0032]
[Expression 1]
Number of sentences = total number of word frequencies ÷ number of words
Thereafter, the sentences are ranked according to the order of the number of sentence points (step 50 in FIG. 2). High-ranked sentences have a relatively high score, and low-ranked sentences have a relatively low score. Using the word count in Table 1, the score of the first sentence in the example paragraph is 1.75 as follows:
[0033]
[Expression 2]
Sentence # 1 = [Internet (2) + Place (1) + Shop (1) + Computer (3)] / 4 words = 1.75
The scores of the remaining three sentences are calculated in the same way. Table 2 shows the ranking of the four sentences in the example paragraph.
[0034]
[Table 2]

Note that other techniques may be used to determine the sentence score. For example, the score may be calculated by dividing the total frequency by the total number of all words (including unnecessary words) in the sentence. A different approach is to simply sum the content word counts without any averaging. Other mathematical or statistical techniques, such as determining the sentence score from the median frequency of the content words, can also be used.
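The averaging rule of Expression 1 and the alternatives mentioned above can be compared in a short sketch. The function names are illustrative assumptions; the frequencies are those quoted in Expression 2 for the first example sentence.

```python
from statistics import median

def average_score(content_words, freq):
    """Mean frequency of the sentence's content words (Expression 1)."""
    return sum(freq[w] for w in content_words) / len(content_words)

def sum_score(content_words, freq):
    """Alternative: total frequency with no averaging."""
    return sum(freq[w] for w in content_words)

def all_words_score(content_words, freq, total_words):
    """Alternative: total frequency divided by every word in the sentence."""
    return sum(freq[w] for w in content_words) / total_words

def median_score(content_words, freq):
    """Alternative: median frequency of the sentence's content words."""
    return median(freq[w] for w in content_words)

# Frequencies quoted in Expression 2 for the first example sentence.
freq = {"internet": 2, "place": 1, "shop": 1, "computer": 3}
words = ["internet", "place", "shop", "computer"]

print(average_score(words, freq))        # 1.75
print(sum_score(words, freq))            # 7
print(all_words_score(words, freq, 11))  # 7 / 11; the sentence has 11 words
print(median_score(words, freq))         # 1.5
```

The sketch assumes every sentence has at least one content word; a sentence with none would need separate handling.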
[0035]
Steps 40-50 constitute the statistical stage of the summarization method. In parallel with the statistical stage, the document summarizer performs cue phrase analysis during the same pass through the document. Cue phrase analysis makes use of explicit discourse markers present in the text. Broadly, it looks for phrases that could make a sentence confusing or difficult to understand when that sentence is included in a summary. In this implementation, the document summarizer compares the text of each sentence with a pre-stored list of words and phrases (step 52).
[0036]
Upon recognizing a word or phrase in the list, the document summarizer designates the entire sentence as "prohibited" or "conditioned". If the sentence is "prohibited", the document summarizer excludes the sentence from the summary regardless of its score (steps 54 and 56). If the sentence is "conditioned", the document summarizer includes the sentence in the summary only if the condition is met (steps 58 and 60). An example of a conditioned sentence is one whose meaning depends on the previous sentence or the surrounding context. A sentence beginning with "He said ..." is clear only if the reader knows who "He" is. Such a sentence therefore depends on the preceding context and is used in the summary only if the previous sentence, which identifies "He", is also used in the summary.
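A minimal sketch of the classification in steps 52-60 follows. The cue lists are hypothetical: "he said" and "that is why" come from the examples in the text, while the prohibited cues are invented placeholders. Matching conditioned cues only at the start of the sentence is also an assumption.

```python
# Hypothetical cue phrase lists; the actual pre-stored list is not reproduced here.
PROHIBITED_CUES = ("as shown in", "see figure")
CONDITIONED_CUES = ("he said", "that is why")

def classify(sentence):
    """Label a sentence as 'prohibited', 'conditioned', or 'free'."""
    text = sentence.lower()
    if any(cue in text for cue in PROHIBITED_CUES):
        return "prohibited"   # never placed in the summary
    if any(text.startswith(cue) for cue in CONDITIONED_CUES):
        return "conditioned"  # used only if its condition is met
    return "free"             # eligible on score alone

print(classify("That is why that manufacturer has so many visits to its Internet web site."))
print(classify("Manufacturers have web sites describing their computers."))
```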
[0037]
Table 3 shows examples of words or phrases in the pre-stored cue phrase list that make the sentence “prohibited” or “conditioned”.
[0038]
[Table 3]

Applying the cue phrase analysis to the example paragraph shows that the fourth sentence is conditioned because it includes the phrase "That is why ...". This phrase is listed in the cue phrase list as a pre-dependent phrase, meaning a phrase that depends on the previous sentence for its context. In this case, the preceding third sentence explains that one manufacturer offers a money back guarantee, and the fourth sentence gives that guarantee as the reason the manufacturer receives many visits to its web site. If the fourth sentence appeared in the summary without the third sentence, the reader could not tell why the manufacturer receives many visits to its web site. Thus, the document summarizer sets the condition that the fourth sentence is used in the summary only if the third sentence is also used.
[0039]
In this example, the fourth sentence would appear only when the third sentence is also used even without the cue phrase list, simply because the third sentence has a higher score than the fourth. This outcome is an artifact of a short document with few sentences. In larger documents with more sentences, however, the cue phrase list effectively constrains the use of certain sentences. For example, assume that the fourth sentence in the above four-sentence paragraph has a higher sentence score than the third sentence. In this case, the fourth sentence is used only when the preceding, lower-scoring third sentence is also used.
[0040]
After the statistical and cue phrase analysis stages, the document summarizer creates a summary that includes the high-ranked sentences that have passed the cue phrase analysis (step 62). The summary may include conditioned sentences if the relevant conditions are met, but excludes all prohibited sentences. The length of the summary is a parameter that the author can adjust. From Table 2, a two-sentence summary of the above example paragraph is as follows.
[0041]
Manufacturers have web sites describing their computers. One computer manufacturer offers a money back guarantee.
The two sentences in the summary are the two highest-ranked sentences. The sentences in the summary are assembled in their order of appearance in the document, not in order of rank. In this case, the order of appearance is the same as the order of rank, but this is not always so. For example, assume that the third sentence received a higher rank than the second sentence. In the resulting summary, the lower-ranked second sentence would still precede the higher-ranked third sentence because the second sentence appears before the third sentence in the document. Ordering the summary by rank could rearrange the sentence order chosen by the author, resulting in a confusing and difficult-to-read summary.
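The assembly rule can be sketched as follows: choose the top-scoring sentences, then emit them in document order. The function name `assemble` is an assumption, and the scores are hypothetical values picked to reproduce the scenario just described, in which the third sentence outranks the second.

```python
def assemble(sentences, scores, k):
    """Pick the k highest-scoring sentences, returned in document order."""
    by_rank = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(by_rank[:k])  # restore order of appearance
    return [sentences[i] for i in chosen]

sentences = [
    "The Internet is a great place to shop for a computer.",
    "Manufacturers have web sites describing their computers.",
    "One computer manufacturer offers a money back guarantee.",
    "That is why that manufacturer has so many visits to its Internet web site.",
]
# Hypothetical scores: sentence 3 ranks above sentence 2.
scores = [1.0, 2.0, 3.0, 0.5]

for line in assemble(sentences, scores, 2):
    print(line)
```

Even though the third sentence has the higher score, the second sentence is printed first because it appears earlier in the document.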
[0042]
The two-sentence summary does not include any sentence containing a cue phrase. However, when the summary is expanded to three sentences, it reads as follows.
[0043]
Manufacturers have web sites describing their computers. One computer manufacturer offers a money back guarantee. That is why that manufacturer has so many visits to its Internet web site.
In this summary, the last sentence (i.e., the fourth sentence of the original paragraph) has the third highest sentence score (see Table 2). This sentence is also a conditioned sentence because it contains the phrase "That is why ...", which is in the pre-stored cue phrase list. Therefore, this sentence is used only when its condition is met. In this case, the condition is a pre-dependency condition, which specifies that a sentence of this class may be included in the summary only if the previous sentence is also included. Since the third sentence appears in the summary, the pre-dependency condition is satisfied and the fourth sentence may be included in the summary.
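One simple way to enforce the pre-dependency condition is to filter an already selected set of sentences. This is a sketch under stated assumptions, not the patented procedure: the function name is hypothetical, sentences are identified by zero-based index, and a sentence that fails its condition is simply dropped rather than replaced by the next-ranked sentence.

```python
def apply_pre_dependency(selected, conditioned):
    """Keep a conditioned sentence only if the sentence before it is kept.

    selected:    indices of sentences chosen by score
    conditioned: indices of sentences carrying a pre-dependency condition
    """
    kept = set()
    for i in sorted(selected):  # document order, so predecessors are decided first
        if i not in conditioned or (i - 1) in kept:
            kept.add(i)
    return sorted(kept)

# Sentences 2, 3 and 4 of the example (indices 1, 2, 3); sentence 4 is conditioned.
print(apply_pre_dependency({1, 2, 3}, {3}))  # [1, 2, 3]
# Without sentence 3, sentence 4 must be left out.
print(apply_pre_dependency({1, 3}, {3}))     # [1]
```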
[0044]
After creating the summary, the document summarizer displays the summary on the computer monitor in one of four user-selectable user interface (UI) formats (step 64). In the first UI format, the summary is inserted at the beginning of the existing document. The document summarizer finds the beginning of the file and inserts the summary text before the first paragraph of the document. FIG. 3(a) shows an existing document 70 with the summary 72 inserted at the beginning. In the second UI format, a new document is created or opened and the summary is inserted into the new document. FIG. 3(b) shows a new document 74 that has been opened and overlaid on the existing document 70. The summary 72 is inserted into the new document 74.
[0045]
In the third UI format, important sentences used in summarization are underlined or otherwise highlighted. In the fourth UI format, the accompanying text is not shown, and only the summary sentence is shown. These third and fourth forms are similar to the conventional presentation described in the prior art section.
[0046]
Once the summary is created and displayed to the author, the author can save the summary in the existing document or new document to memory (step 66).
[0047]
One modification to the computer-implemented method described above relates to the statistical stage. In the method described above, the content words are counted once, and all sentence scores are obtained using the same frequency numbers. In some cases, certain words in high-ranked sentences may dominate and skew the sentence scores.
[0048]
One way to correct for this is an iterative scoring approach. In the first iteration, the summarizer scores all sentences in the same way as described above. In the next iteration, the summarizer eliminates the effect of the highest-ranked sentence and rescores the remaining sentences as if the highest-ranked sentence did not exist. In the following iteration, the effect of the highest-ranked sentence found in the previous iteration is likewise eliminated, and the remaining sentences are rescored as if the two highest-ranked sentences did not exist. This process continues for all sentences.
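The iterative approach can be sketched as a loop that recounts frequencies over the sentences still unranked. The function name and the content-word lists are assumptions (the lists are hypothetical ones for the four-sentence example paragraph), and breaking ties in favor of the earlier sentence is also an assumption, since the text does not state a tie-breaking rule.

```python
from collections import Counter

def iterative_rank(sentences):
    """Rank sentences, removing each winner's words before rescoring the rest.

    sentences: list of non-empty content-word lists, in document order.
    Returns sentence indices from highest to lowest rank.
    """
    remaining = list(range(len(sentences)))
    order = []
    while remaining:
        # Recount frequencies as if already-ranked sentences did not exist.
        freq = Counter(w for i in remaining for w in sentences[i])

        def score(i):
            return sum(freq[w] for w in sentences[i]) / len(sentences[i])

        best = max(remaining, key=score)  # ties go to the earlier sentence
        order.append(best)
        remaining.remove(best)
    return order

# Hypothetical content-word lists for the four example sentences.
SENTENCES = [
    ["internet", "place", "shop", "computer"],
    ["manufacturer", "web site", "computer"],
    ["computer", "manufacturer", "money back guarantee"],
    ["manufacturer", "visits", "internet", "web site"],
]
print(iterative_rank(SENTENCES))  # [1, 2, 0, 3]
```

With these lists the second sentence wins the first round, the third sentence wins the second round, and the first and fourth sentences tie in the third round, where document order places the first sentence ahead.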
[0049]
To demonstrate this modified statistical analysis in practice, we apply it to the four-sentence paragraph used above. In the first step, unnecessary words are excluded, phrase compression is applied, and the content words are counted, which yields the result in Table 1. Next, the sentence scores are obtained. In the first iteration, sentence #2 receives the same score of 2.67 as before. From this point, however, the modified method begins to differ. To eliminate the effect of the highest-ranked sentence, the document summarizer recalculates the sentence scores as if the second sentence were not present in the document at all. The frequency numbers of the content words decrease accordingly. Table 4 is a modified version of Table 1 and reflects the absence of the second sentence.
[0050]
[Table 4]

Next, the scores of the remaining three sentences are recalculated using the corrected frequency numbers of the content words. As a result, the third sentence receives a score of 1.67 and takes the second-highest rank.
[0051]
[Expression 3]
Sentence # 3 = [computer (2) + manufacturer (2) + money (1)] / 3 words = 1.67
Thereafter, the effect of sentence # 3 is eliminated, and the frequency of content words decreases accordingly. Table 5 is a modified version of Table 4 and reflects the absence of the second and third sentences.
[0052]
[Table 5]

Continuing this process through the remaining two sentences gives the new sentence ranking given in Table 6.
[0053]
[Table 6]

With the iterative re-scoring method, the sentence ranking differs slightly: sentence #1 is now ranked higher than sentence #4. A two-sentence summary produced with the iterative re-scoring method is identical to the two-sentence summary produced with the method described above. The three-sentence summary, however, is quite different. A three-sentence summary based on Table 6 is as follows:
[0054]
The Internet is a great place to shop for a computer. Manufacturers have web sites describing their computers. One computer manufacturer offers a money back guarantee.
This three-sentence summary is a good example of writing the sentences used in the summary in the order of appearance in the document, not in the order of rank. The sentence at the beginning of the summary is actually the third highest sentence. Nevertheless, this sentence is written as the first sentence in the summary because it appears earlier in the document than sentences # 2 and # 3, which are higher in rank.
[0055]
In the example above, the count of each content word appearing in the higher-ranked sentence is reduced by one full count. In other implementations, the frequency numbers can be adjusted to varying degrees, depending on how much of the influence of the higher-ranked sentences the manufacturer or author wants to remove. For example, the summarizer may compensate by subtracting a smaller quantity (e.g., 0.3 or 0.5) from the count of each word that appears in the highest-ranked sentence. Alternatively, the compensation amount may vary depending on whether the content word has a higher or lower frequency number than other content words. The amount by which word counts are compensated during this dynamic scoring process can be set by the manufacturer or author according to various statistical or mathematical techniques that appropriately counteract the effect of content words appearing in higher-ranked sentences.
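The partial compensation variant might look like the sketch below. The function name `dampen`, the default amount of 0.5, and the floor at zero are assumptions for illustration; the text leaves the exact compensation rule to the manufacturer or author.

```python
def dampen(freq, top_sentence_words, amount=0.5):
    """Subtract `amount` from the count of each word in the top-ranked sentence.

    Counts are floored at zero (an assumption, to avoid negative frequencies).
    """
    return {
        word: max(count - amount, 0) if word in top_sentence_words else count
        for word, count in freq.items()
    }

freq = {"computer": 3, "manufacturer": 3, "web site": 2, "internet": 2}
top = {"manufacturer", "web site", "computer"}  # words of the top-ranked sentence
print(dampen(freq, top, amount=0.5))
```

Setting `amount=1` corresponds to the full one-count reduction used in the worked example, while smaller values remove only part of the top sentence's influence.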
[0056]
Because this document summarizer is designed from the author's point of view, it has advantages over prior art summarizers. The document summarizer uses an approach that combines statistics and cue phrases to allow authors to create summaries of their own documents automatically. When creating a summary, the summarizer provides a UI that allows the author to place the summary at the beginning of the document or in a new document. This arrangement is convenient and effective for the author, who is then free to revise the summary as desired.
[0057]
Another advantage of the document summarizer stems from the combined processing of statistics and cue phrases. This dual analysis is beneficial because the statistical part ensures that a summary can always be produced, while the cue phrase part improves the quality of the resulting summary.
[0058]
In compliance with the statute, the invention has been described in language more or less specific as to structural and methodical features. It should be understood, however, that the invention is not limited to the specific features described, since the means disclosed herein are exemplary forms of carrying out the invention. Accordingly, the invention is claimed in any of its forms or modifications within the proper scope of the appended claims, appropriately interpreted in accordance with the doctrine of equivalents and other applicable principles.
[Brief description of the drawings]
FIG. 1 is a diagram of a computer loaded with a word processing program having a document summarizer.
FIG. 2 is a flowchart of the steps in a computer-implemented method for summarizing a document.
FIG. 3 shows a document with a summary inserted to illustrate two different display presentations of the summary.
[Explanation of symbols]
20 Computer
22 Central processing unit (CPU)
24 Monitor (display)
26 Keyboard
28 Mouse
70 Existing Document
72 Summary
74 New Document

Claims

A computer-implemented method in which a document summarizer automatically summarizes a document stored in storage means, the method comprising:
The document summarizer obtaining frequency numbers of content words by counting how often each content word, being a word other than an unnecessary word, appears in the document stored in the storage means;
The document summarizer determining a sentence score for each sentence based on the frequency numbers of the content words;
The document summarizer ranking the sentences in order of sentence score, wherein a higher-ranked sentence has a higher sentence score than a lower-ranked sentence;
The document summarizer performing a cue phrase analysis that compares the words and phrases in a sentence with a list of words and phrases stored in advance in the storage means and sets a use condition for any sentence that includes a word or phrase in the list; and
The document summarizer creating a summary that includes high-ranked sentences and that includes a sentence for which a use condition is set when the use condition is satisfied,
Wherein the step of performing the cue phrase analysis includes
Comparing the words and phrases in the sentence with a list of words and phrases that depend on the previous sentence in the context of the document and, if a word or phrase in the list is present in the sentence, setting the use condition that the sentence is included in the summary only when the previous sentence is included in the summary.

The computer-implemented method of claim 1, wherein
The step of obtaining the frequency numbers of the content words, the step of determining the sentence scores, and the step of performing the cue phrase analysis are performed in parallel.

The computer-implemented method of claim 1, wherein
The step of obtaining the frequency numbers of the content words includes counting the frequency of the content words while excluding all unnecessary words from the count.

The computer-implemented method of claim 1, wherein
The step of obtaining the frequency number of the content word includes:
Evaluating the content words for morphological variants of an original form; and
Counting the original form together with its morphological variants.

The computer-implemented method of claim 1, wherein
The step of obtaining the frequency number of the content word includes:
Evaluating the content words for an ordered set of content words that recurs in the same order; and
Counting the ordered set of content words as if it were a single content word.

The computer-implemented method of claim 1, wherein
The step of determining the sentence score includes:
Summing the frequency numbers of all content words in the sentence to obtain a total; and
Dividing the total by the number of content words in the sentence to determine the sentence score.

The computer-implemented method of claim 1, wherein
The step of ranking the sentences further comprises identifying the highest-ranked sentence, and
The step of obtaining the frequency numbers of the content words further comprises recalculating the frequency numbers of the content words with the highest-ranked sentence excluded.

The computer-implemented method of claim 1, wherein
The method further comprises the document summarizer comparing the words and phrases in the sentence with a pre-stored list of prohibited words and phrases and, if a word or phrase in the list is present in the sentence, setting the use condition that the sentence is not included in the summary regardless of its ranking.

The computer-implemented method of claim 1, wherein
The step of creating the summary includes assembling the high-ranked sentences in the order in which they appear in the document.

The computer-implemented method of claim 1, wherein
The method further comprises the document summarizer inserting the summary at the beginning of the document.

The computer-implemented method of claim 1, wherein
The method further comprises the document summarizer opening a new document; and
The document summarizer inserting the summary into the new document.

A computer-readable recording medium on which a word processing application program is recorded, wherein the word processing application program instructs a computer to perform the steps of the computer-implemented method of claim 1.

A computer-readable recording medium on which an electronic mail application program is recorded, wherein the electronic mail application program instructs a computer to perform the steps of the computer-implemented method of claim 1.

A computer-readable recording medium on which an Internet web browser application program is recorded, wherein the Internet web browser application program instructs a computer to perform the steps of the computer-implemented method of claim 1.

A computer programmed to perform the steps of the computer-implemented method of claim 1.

A document file formed in memory as a result of the computer-implemented method of claim 1.

A computer-implemented method in which a document summarizer automatically summarizes a document stored in storage means, the method comprising:
The document summarizer obtaining frequency numbers of content words by counting how often each content word, being a word other than an unnecessary word, appears in the document stored in the storage means;
The document summarizer associating each content word with its corresponding frequency number;
The document summarizer determining a sentence score for each sentence based on the frequency numbers of the content words;
The document summarizer ranking the sentences in order of sentence score, wherein a higher-ranked sentence has a higher sentence score than a lower-ranked sentence;
The document summarizer performing a cue phrase analysis that (1) identifies a sentence having a reference phrase as a prohibited sentence and (2) identifies a sentence having a phrase, other than a reference, that depends on the previous sentence for context as a sentence for which a use condition is set; and
The document summarizer creating a summary that includes high-ranked sentences, that includes a sentence for which the use condition is set when the use condition is satisfied, and that excludes the prohibited sentences,
Wherein the step of performing the cue phrase analysis further includes
Comparing the words and phrases in the sentence with a list of words and phrases stored in advance in the storage means, determining that a word or phrase in the list indicates dependence on the previous sentence in the context of the document, and setting, as the use condition of the sentence, that the previous sentence is used, and
Wherein the step of creating the summary further includes including in the summary a sentence for which the use condition is set only if the previous sentence is included in the summary.

The computer-implemented method of claim 17, wherein
The step of obtaining the frequency number of the content word includes:
Evaluating the content words for morphological variants of an original form; and
Counting the original form together with its morphological variants.

The computer-implemented method of claim 17, wherein
The step of obtaining the frequency number of the content word includes:
Evaluating the content words for an ordered set of content words that recurs in the same order; and
Counting the ordered set of content words as if it were a single content word.

The computer-implemented method of claim 17, wherein
The step of ranking the sentences further comprises identifying the highest-ranked sentence, and
The step of obtaining the frequency numbers of the content words further comprises recalculating the frequency numbers of the content words with the highest-ranked sentence excluded.

The computer-implemented method of claim 17, wherein
The step of creating the summary includes assembling the high-ranked sentences in the order in which they appear in the document.

The computer-implemented method of claim 17, wherein
The method further comprises the document summarizer inserting the summary at the beginning of the document.

The computer-implemented method of claim 17, wherein
The method further comprises the document summarizer opening a new document; and
The document summarizer inserting the summary into the new document.

A computer-readable recording medium on which a word processing application program is recorded, wherein the word processing application program instructs a computer to perform the steps of the computer-implemented method of claim 17.

A computer programmed to perform the steps of the computer-implemented method of claim 17.

A document file formed in memory as a result of the computer-implemented method of claim 17.

A computer-implemented method in which a document summarizer automatically summarizes a document stored in storage means, the method comprising the steps of:
(A) the document summarizer obtaining frequency numbers of content words by counting how often each content word, being a word other than an unnecessary word, appears in the document stored in the storage means;
(B) the document summarizer associating each content word with its corresponding frequency number;
(C) the document summarizer determining a sentence score for each sentence based on the frequency numbers of the content words;
(D) the document summarizer ranking the sentences in order of sentence score, wherein a higher-ranked sentence has a higher sentence score than a lower-ranked sentence;
(E) the document summarizer identifying the highest-ranked sentence;
(F) the document summarizer recalculating the frequency numbers of the content words with the highest-ranked sentence excluded;
(G) the document summarizer determining sentence scores for the sentences other than the highest-ranked sentence and ranking those sentences;
(H) the document summarizer performing a cue phrase analysis that compares the words and phrases in a sentence with a list of words and phrases stored in advance in the storage means and sets a use condition for any sentence that includes a word or phrase in the list; and
(I) the document summarizer creating a summary that includes high-ranked sentences and that includes a sentence for which a use condition is set when the use condition is satisfied,
Wherein the step of performing the cue phrase analysis includes
Comparing the words and phrases in the sentence with a list of words and phrases that depend on the previous sentence in the context of the document and, if a word or phrase in the list is present in the sentence, setting the use condition that the sentence is included in the summary only when the previous sentence is included in the summary.

The computer-implemented method of claim 27, wherein steps (C) through (G) are repeated.

The computer-implemented method of claim 27, further comprising the step of creating a summary that includes high-scoring sentences and, when the use condition is satisfied, a sentence for which the use condition is set.

A computer-readable recording medium on which a word processing application program is recorded, wherein the word processing application program instructs a computer to perform the steps of the computer-implemented method of claim 27.

A computer-readable recording medium on which an electronic mail application program is recorded, wherein the electronic mail application program instructs a computer to perform the steps of the computer-implemented method of claim 27.

A computer-readable recording medium on which an Internet web browser application is recorded, wherein the Internet web browser application instructs a computer to perform the steps of the computer-implemented method of claim 27.

A computer programmed to perform the steps of the computer-implemented method of claim 27.