JP2000268034A

JP2000268034A - Automatic text pre-editing apparatus and method, and storage medium used therefor

Info

Publication number: JP2000268034A
Application number: JP11070312A
Authority: JP
Inventors: Takehiko Yoshimi; 毅彦吉見
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1999-03-16
Filing date: 1999-03-16
Publication date: 2000-09-29

Abstract

(57) [Abstract]

[Problem] To automate, in the pre-editing of text written in a natural language, the identification of the text type and the selection of the pre-editing rules and standard notation appropriate to that type.

[Solution] The apparatus comprises: a dictionary table storing words, part-of-speech information, and morpheme information related to text; an identification rule table storing identification rules for identifying the type of a text; a pre-editing rule table storing, for each text type, pre-editing rules and standard notations for pre-editing text; an input unit for inputting text; a morphological analysis unit that refers to the dictionary table and morphologically analyzes each word of the input text to extract part-of-speech information and morpheme information; an identification unit that refers to the identification rule table and identifies the text type from the extracted part-of-speech and morpheme information; and a pre-editing unit that refers to the pre-editing rule table, detects words in the part of the text to be pre-edited using the editing rules for the identified type, and rewrites those words into the standard notation.

Description

DETAILED DESCRIPTION OF THE INVENTION

TECHNICAL FIELD OF THE INVENTION

[0001] The present invention relates to a text pre-editing apparatus and method, and to a storage medium used therefor, which are applied to machine translation devices and other natural language processing systems and which improve the accuracy of natural language processing such as machine translation by pre-editing text written in a natural language to the extent that its meaning is not changed.

PRIOR ART

[0002] Natural language processing systems such as machine translation devices usually handle many kinds of text, including scientific and technical papers, patent specifications, equipment instruction manuals, and news articles.

[0003] In conventional machine translation devices, however, the rules for parsing text are written for standard expressions. Because special expressions are used heavily in some fields and types of text, text in such special forms cannot be parsed properly. A translation device that uses the existing parsing rules therefore has difficulty translating such text appropriately; even when a translation is produced, it takes a very long time and the result is unsatisfactory. The usual countermeasure is to extend the parsing rules so that text in special forms can be handled.

[0004] Meanwhile, as prior art that pre-edits text before it is input to a machine translation device, so as to raise translation accuracy while keeping user intervention to a minimum, Japanese Unexamined Patent Publication No. H6-139274 proposes an automatic text pre-editing apparatus that morphologically analyzes the input text in advance, detects the places where pre-editing rules should be applied, and automatically pre-edits the text into a form that is easy to machine-translate.

PROBLEMS TO BE SOLVED BY THE INVENTION

[0005] However, a conventional machine translation device whose existing parsing rules have been extended to handle text in special forms has, for example, the following problems.

(1) The parsing rule set becomes very large, so adding new rules while keeping them consistent with the existing rules is not easy in a practical system.

(2) Mixing rules for text in special forms with rules for text in standard forms impairs the generality of the existing rules.

[0006] Furthermore, in the text pre-editing apparatus described in Japanese Unexamined Patent Publication No. H6-139274, the selection of the pre-editing rule group appropriate to each of the many possible text types is not automated when text in a special form is pre-edited into text in a standard form. The user must select the rule group according to the text type, which places a burden on the user.

[0007] The present invention has been made in view of the above circumstances. It provides an automatic text pre-editing apparatus and method, and a storage medium used therefor, which, in the pre-editing of text written in a natural language, automate the identification of the text type and the selection of the pre-editing rules and standard notation appropriate to that type, and which rewrite into the standard notation the words in the part of the text to be pre-edited that are detected on the basis of the automatically selected rules.

MEANS FOR SOLVING THE PROBLEMS

[0008] The present invention is an automatic text pre-editing apparatus comprising: a table memory made up of a dictionary table storing words, part-of-speech information, and morpheme information related to text written in a natural language, an identification rule table storing identification rules for identifying the type of a text, and a pre-editing rule table storing, for each text type, pre-editing rules and standard notations for pre-editing text; an input unit for inputting text; a morphological analysis unit that refers to the dictionary table and morphologically analyzes each word of the input text to extract part-of-speech information and morpheme information; an identification unit that refers to the identification rule table and identifies the text type from the extracted part-of-speech and morpheme information; and a pre-editing unit that refers to the pre-editing rule table, detects words in the part of the text to be pre-edited using the editing rules for the identified type, and rewrites those words into the standard notation.

[0009] According to the present invention, the identification of the text type and the selection of the pre-editing rules and standard notation appropriate to that type are automated in the pre-editing of text written in a natural language, so the burden on the user is reduced. The invention also provides an automatic text pre-editing apparatus that rewrites into the standard notation the words in the part of the text to be pre-edited that are detected on the basis of the automatically selected pre-editing rules.

EMBODIMENTS OF THE INVENTION

[0010] In the configuration of the present invention:

(1) The existing rules are left unchanged. A pre-editing rule table holds the rule group for special expressions in a form independent of the existing rule group for standard expressions, and a pre-editing unit rewrites expressions in special forms into standard expressions before processing by the existing rules, so that the existing rules can process them properly.

(2) The text type is identified automatically from the various kinds of information contained in the target text, on the basis of the identification rule table, and an appropriate pre-editing rule group is selected automatically from the pre-editing rule table according to the identification result.

[0011] In the present invention, the dictionary table, the identification rule table, and the pre-editing rule table may be held on a storage medium that carries a program in a fixed manner. Examples are media separable from the main body, such as tapes (magnetic tape, cassette tape), disks (magnetic disks such as a floppy (registered trademark) disk or hard disk; optical disks such as CD-ROM, MO, MD, or DVD), and cards (IC cards, including memory cards; optical cards), as well as semiconductor memories such as mask ROM, EPROM, EEPROM, and flash ROM. The input unit may be an input device such as a keyboard, mouse, pen, tablet, scanner, character recognition device, or storage medium reader. The identification unit and the pre-editing unit may be implemented by, for example, a computer comprising a CPU, ROM, RAM, and I/O ports, or by an ASIC that includes a CPU.

[0012] The pre-editing unit may further comprise a selection unit that refers to the pre-editing rule table and selects the editing rules corresponding to the text type identified by the identification unit, a search unit that searches the text for words in the part to be pre-edited that correspond to the selected editing rules, and a rewriting unit that rewrites the words found into the standard notation. In this configuration, the selection unit, the search unit, and the rewriting unit may be implemented by, for example, a computer comprising a CPU, ROM, RAM, and I/O ports, or by an ASIC that includes a CPU.

[0013] The identification rule table may store, for each field, identification rules for determining whether the text is a scientific or technical paper, a patent specification, an equipment instruction manual, or a news article.

[0014] With this configuration, a machine translation system is provided in which a machine translation device that translates between different languages uses the automatic text pre-editing apparatus of the present invention to pre-edit text before translation. A natural language interface is also provided in which an interface that controls the connection of natural language devices and the transfer of text written in a natural language uses the apparatus to pre-edit the text. A text summarization system is also provided in which a text summarization device that automatically summarizes text written in a natural language uses the apparatus to pre-edit the text.

[0015] The present invention is described in detail below on the basis of the embodiment shown in the drawings. The invention is not limited to this embodiment.

[0016] FIG. 1 is a block diagram showing the configuration of an automatic text pre-editing apparatus according to one embodiment of the present invention. In FIG. 1, reference numeral 1 denotes a control unit consisting of the CPU (central processing unit) of a computer. The control unit 1 controls each unit in accordance with the control programs stored in the program memory.

[0017] Reference numeral 2 denotes an input unit consisting of input devices such as a keyboard, mouse, pen, tablet, scanner, and character recognition device, a communication device connected to a communication line, a storage medium reader, and the like. The input unit 2 is used to input text written in a natural language, to issue pre-editing instructions, to send and receive text, and to install control programs.

[0018] Reference numeral 3 denotes an output unit consisting of a display device 3a such as a CRT (cathode ray tube) display, an LCD (liquid crystal display), or a PD (plasma display); a printing device 3b such as a thermal printer or laser printer; or a communication device 3c connected to a communication line. Under the control of the control unit 1, the output unit 3 displays the input from the input unit 2 and the translation result on the display device 3a, prints them via the printing device 3b, or transmits them via the communication device 3c.

[0019] Reference numeral 4 denotes a table memory consisting of a storage medium such as a semiconductor memory (mask ROM, EPROM, EEPROM, flash ROM), a tape (magnetic tape, cassette tape), a disk (magnetic disks such as a floppy disk or hard disk; optical disks such as CD-ROM, MO, MD, or DVD), or a card (IC card, including memory cards; optical card). The table memory 4 functions as:

- a dictionary table 4a storing words, part-of-speech information, and morpheme information;
- an identification rule table 4b storing identification rules for identifying the text type; and
- a pre-editing rule table 4c storing, for each text type, pre-editing rules (such as pre-editing rules for newspaper headlines, for patent specifications, and for instruction manuals) and standard notations.

[0020] Reference numeral 5 denotes a program memory consisting of a storage medium such as a semiconductor memory (mask ROM, EPROM, EEPROM, flash ROM), a tape (magnetic tape, cassette tape), a disk (magnetic disks such as a floppy disk or hard disk; optical disks such as CD-ROM, MO, MD, or DVD), or a card (IC card, including memory cards; optical card). The program memory 5 stores the control programs that function as:

- a morphological analysis unit 5a that refers to the dictionary table 4a and morphologically analyzes each word of the input text to extract part-of-speech information and morpheme information;
- an identification unit 5b that refers to the identification rule table 4b and identifies the text type from the part-of-speech and morpheme information of the words extracted by the morphological analysis unit 5a; and
- a pre-editing unit 5c that refers to the pre-editing rule table 4c, detects words in the part of the text to be pre-edited using the editing rules for the identified type, and rewrites those words into the standard notation.

[0021] The pre-editing unit 5c functions as a selection unit 5c-1 that selects from the pre-editing rule table the editing rules corresponding to the text type identified by the identification unit 5b, a search unit 5c-2 that searches the text for words in the part to be pre-edited that correspond to the selected editing rules, and a rewriting unit 5c-3 that rewrites the words found into the standard notation.
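The division of the pre-editing unit into selection, search, and rewriting can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the table contents, the type labels, and the function names are hypothetical, and the rules are reduced to plain word-for-word substitutions, whereas the rules in the embodiment also carry contextual conditions.

```python
# Hypothetical pre-editing rule table:
# text type -> {non-standard word: standard notation}.
PRE_EDIT_RULE_TABLE = {
    "newspaper headline": {"to": "will"},
    "general text": {},
}

def select_rules(text_type):
    """Selection unit 5c-1: pick the rule group for the identified type."""
    return PRE_EDIT_RULE_TABLE.get(text_type, {})

def search_targets(words, rules):
    """Search unit 5c-2: positions of the words that a selected rule covers."""
    return [i for i, word in enumerate(words) if word in rules]

def rewrite(words, positions, rules):
    """Rewriting unit 5c-3: replace each target with its standard notation."""
    out = list(words)
    for i in positions:
        out[i] = rules[out[i]]
    return out

def pre_edit(text, text_type):
    words = text.split()
    rules = select_rules(text_type)
    return " ".join(rewrite(words, search_targets(words, rules), rules))

print(pre_edit("Agency to inspect health of 8 banks", "newspaper headline"))
# prints: Agency will inspect health of 8 banks
```

Keeping the rule groups in a table keyed by text type means that a text identified as "general text" passes through unchanged, which mirrors the idea of leaving the existing rules untouched.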

[0022] Reference numeral 6 denotes a buffer memory consisting of a storage medium such as a semiconductor memory (RAM, EEPROM, flash ROM), a tape (magnetic tape, cassette tape), a disk (magnetic disks such as a floppy disk or hard disk; optical disks such as CD-ROM, MO, MD, or DVD), or a card (IC card, including memory cards; optical card). The buffer memory 6 provides areas that function as:

- a text buffer 6a storing the input text;
- a morphological analysis result buffer 6b storing the words, part-of-speech information, and morpheme information produced by the morphological analysis unit 5a;
- an identification result buffer 6c storing the type identified by the identification unit 5b;
- a selection result buffer 6d storing the standard notation selected by the selection unit 5c-1; and
- a rewritten text buffer 6e storing the text rewritten by the rewriting unit 5c-3.

[0023] The control unit 1 stores the data resulting from matching against the identification rules and against the pre-editing rules in the respective buffers. Reference numeral 7 denotes a bus line over which control program data and address data are transferred. The control unit 1 implements the text pre-editing apparatus of the present invention by reading the control programs from the program memory 5 via the bus line 7 and controlling each unit.

[0024] Reference numeral 8 denotes a storage medium that carries a program in a fixed manner and is a medium separable from the main body, such as a semiconductor memory (mask ROM, EPROM, EEPROM, flash ROM), a tape (magnetic tape, cassette tape), a disk (magnetic disks such as a floppy disk or hard disk; optical disks such as CD-ROM, MO, MD, or DVD), or a card (IC card, including memory cards; optical card). The text pre-editing program of the present invention may be stored on the storage medium 8 and installed into a spare area of the buffer memory 6 via the storage medium reader of the input unit 2, thereby implementing the text pre-editing function of the invention.

If the text pre-editing apparatus has a communication device that can connect to an external communication network, including the Internet, the storage medium may instead carry the program in a fluid manner, with the program downloaded from the communication network via that communication device. When the program is downloaded from a communication network in this way, the download program may be stored in the main apparatus in advance or installed from another storage medium. The content stored on the storage medium is not limited to programs and may be data.

[0025] According to another aspect of the present invention, there is provided a storage medium 8 storing an automatic text pre-editing program readable by a computer 1, for use in an automatic text pre-editing apparatus comprising an input unit for inputting text and a table memory 4 made up of a dictionary table 4a storing words, part-of-speech information, and morpheme information related to text written in a natural language, an identification rule table 4b storing identification rules for identifying the type of a text, and a pre-editing rule table 4c storing, for each text type, pre-editing rules and standard notations for pre-editing text. The program causes the computer 1 to refer to the dictionary table 4a and morphologically analyze the input text to extract part-of-speech information and morpheme information; to refer to the identification rule table 4b and identify the text type from the extracted part-of-speech and morpheme information; and to refer to the pre-editing rule table 4c, detect words in the part of the text to be pre-edited using the editing rules for the identified type, and rewrite those words into the standard notation.

[0026] FIG. 2 is a flowchart showing the procedure of the automatic text pre-editing process of this embodiment. The pre-editing of text to be translated by an English-to-Japanese machine translation system is described below with reference to FIGS. 3 to 5. FIG. 3 shows an example of the input text of this embodiment. As shown in FIG. 3, the texts E1 to E3 to be machine-translated from English to Japanese are input through the input unit 2 and stored in the text buffer 6a. An instruction to run the automatic text pre-editing process is then input through the input unit 2.

[0027] In STEP 1, the morphological analysis unit 5a morphologically analyzes the target text and extracts lexical attributes, such as the part of speech, for each word of each target expression contained in the target text. In STEP 2, the value i of the target expression counter, which counts the target expressions contained in one target text, is reset. In STEP 3 to STEP 6, the identification unit 5b matches the target text, one expression at a time from the beginning, against each text identification rule stored in the identification rule table, and identifies the text type according to which conditions are matched.
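As a rough illustration of the morphological analysis in STEP 1, the following Python sketch looks up each word of an expression in a toy dictionary table. The dictionary entries, the part-of-speech labels, and the `Morpheme` and `analyze` names are all assumptions for this example; a real morphological analyzer would also handle inflection, ambiguity, and unknown words far more carefully.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Morpheme:
    surface: str  # the word as it appears in the text
    pos: str      # part-of-speech label taken from the dictionary table
    base: str     # base (dictionary) form

# Hypothetical dictionary table: lower-cased surface form -> (POS, base form).
DICTIONARY_TABLE = {
    "agency": ("NOUN", "agency"),
    "to": ("TO", "to"),
    "inspect": ("VERB", "inspect"),
    "health": ("NOUN", "health"),
    "of": ("PREP", "of"),
    "banks": ("NOUN", "bank"),
}

def analyze(expression: str) -> List[Morpheme]:
    """Look up each whitespace-separated word in the dictionary table."""
    result = []
    for word in expression.split():
        key = word.lower()
        if key.isdigit():
            pos, base = "NUM", key
        else:
            # Words missing from the toy table are labeled UNKNOWN.
            pos, base = DICTIONARY_TABLE.get(key, ("UNKNOWN", key))
        result.append(Morpheme(word, pos, base))
    return result

for m in analyze("Agency to inspect health of 8 banks"):
    print(m.surface, m.pos, m.base)
```

The list of `Morpheme` records plays the role of the contents of the morphological analysis result buffer 6b, which the later identification and pre-editing steps read.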

[0028] FIG. 4 shows an example of the identification rules stored in the identification rule table of this embodiment. When the type of the input text is identified, whether the target expressions of the input text satisfy identification rules 1) and 2) is judged by comparing the target expressions. In the example of FIG. 4, E2 and E3 of the input text shown in FIG. 3 satisfy identification rules 1) and 2), so the input text is identified as a newspaper article. In the prior art, the type of the target text, for example newspaper article, had to be specified in advance. In this embodiment the type of the target text is identified automatically.

[0029] In STEP 7, if the text type cannot be identified by the identification rules stored in advance in the identification rule table, a default value (for example, "general text") is set as the text type. In STEP 8, the value i of the target expression counter, which counts the target expressions contained in one target text, is reset.
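The identification loop with its fallback to a default type can be sketched in Python as follows. This is only an illustration under stated assumptions: the two rules shown are made-up placeholders (the actual identification rules are those of FIG. 4, which are not reproduced here), and the mapping of code lines to step numbers is approximate.

```python
from typing import Callable, List, Sequence, Tuple

GENERAL_TEXT = "general text"  # default value set when no rule matches

# (text type, predicate over one expression); both predicates are placeholders.
IdentificationRule = Tuple[str, Callable[[str], bool]]

IDENTIFICATION_RULES: List[IdentificationRule] = [
    ("patent specification", lambda e: e.lower().startswith("what is claimed")),
    ("instruction manual", lambda e: e.lower().startswith("caution:")),
]

def identify_text_type(
    expressions: Sequence[str],
    rules: Sequence[IdentificationRule] = IDENTIFICATION_RULES,
) -> str:
    i = 0                                # reset the expression counter
    while i < len(expressions):          # one expression at a time, from the top
        for text_type, matches in rules:
            if matches(expressions[i]):
                return text_type         # the first matching rule decides
        i += 1
    return GENERAL_TEXT                  # no rule matched any expression

print(identify_text_type(["Thank you.", "Caution: keep away from water."]))
# prints: instruction manual
print(identify_text_type(["Hello world"]))
# prints: general text
```

Returning a default type instead of failing lets the later stages run unconditionally: a text that no rule recognizes simply selects a rule group that may be empty.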

[0030] In STEP 9 to STEP 11, the pre-editing unit 5c (selection unit, search unit, and rewriting unit) matches the target text, one expression at a time from the beginning, against each pre-editing rule stored in the pre-editing rule table, and rewrites the expressions for which matching succeeds, thereby carrying out the pre-editing. Here n in STEP 9 is the number of expressions contained in the input text.

[0031] FIG. 5 shows an example of the pre-editing rules stored in the pre-editing rule table of this embodiment. As shown in FIG. 5, the pre-editing rule table 4c stores, for example, pre-editing rules for newspaper headlines. In the example of FIG. 5, when pre-editing rules 1), 2), and 3) are all satisfied, "to" is rewritten as "will".
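A rule that fires only when every one of its conditions holds can be modeled as a list of predicates plus a rewrite. The Python sketch below is a hypothetical illustration of that structure: the three conditions are simple stand-ins chosen for the example and are not the conditions 1) to 3) of FIG. 5, and the class and function names are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Condition = Callable[[List[str]], bool]

@dataclass
class PreEditRule:
    conditions: Sequence[Condition]  # every condition must hold
    target: str                      # word to be rewritten
    standard: str                    # standard notation written in its place

    def apply(self, words: List[str]) -> List[str]:
        if self.target in words and all(c(words) for c in self.conditions):
            i = words.index(self.target)  # rewrite the first occurrence only
            return words[:i] + [self.standard] + words[i + 1:]
        return list(words)

# Placeholder conditions, not the ones shown in FIG. 5.
def begins_with_capital(words: List[str]) -> bool:
    return bool(words) and words[0][:1].isupper()

def has_no_final_period(words: List[str]) -> bool:
    return bool(words) and not words[-1].endswith(".")

def is_short(words: List[str]) -> bool:
    return len(words) <= 12

HEADLINE_RULE = PreEditRule(
    [begins_with_capital, has_no_final_period, is_short], "to", "will"
)

print(" ".join(HEADLINE_RULE.apply("Agency to inspect health of 8 banks".split())))
# prints: Agency will inspect health of 8 banks
print(" ".join(HEADLINE_RULE.apply("He wants to go.".split())))
# prints: He wants to go.
```

Expressing the conditions as independent predicates keeps each check testable on its own and makes the all-conditions-satisfied requirement explicit in a single `all(...)` call.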

[0032] When the text shown in FIG. 3 is pre-edited, the text E1, "Agency to inspect health of 8 banks", satisfies all of pre-editing rules 1), 2), and 3) of FIG. 5 and is therefore rewritten as "Agency will inspect health of 8 banks". The expression "Agency to inspect health of 8 banks" may be rewritten as "Agency will inspect health of 8 banks" because it is a newspaper headline; in other kinds of text the rewrite is not necessarily valid.

[0033] The texts E2 and E3, by contrast, do not satisfy these editing rules and are not rewritten. Whether a target expression of the text satisfies pre-editing rule 1) of FIG. 5 is checked by testing whether the target expression matches a noun phrase pattern such as the following.

[0034] Noun phrase pattern: {article} {{adverb} attributive adjective} {noun*} noun

Here, an element enclosed in braces {} (article, adverb, attributive adjective, noun*) is optional, and the symbol "*" means that the element may be repeated any number of times.
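One way to test a sequence of part-of-speech tags against this noun phrase pattern is to join the tags into a string and match it with a regular expression. The sketch below assumes hypothetical tag names (`ART`, `ADV`, `ADJ`, `NOUN`); the patent does not specify how the pattern is implemented.

```python
import re

# {article} {{adverb} attributive adjective} {noun*} noun
# Braces in the pattern become optional groups; "*" becomes regex repetition.
NOUN_PHRASE = re.compile(r"(?:ART )?(?:(?:ADV )?ADJ )?(?:NOUN )*NOUN")

def is_noun_phrase(tags):
    """Return True if the whole tag sequence matches the noun phrase pattern."""
    return NOUN_PHRASE.fullmatch(" ".join(tags)) is not None

print(is_noun_phrase(["NOUN"]))                              # True
print(is_noun_phrase(["ART", "ADV", "ADJ", "NOUN", "NOUN"])) # True
print(is_noun_phrase(["ART", "ADV", "NOUN"]))                # False
```

The third call returns `False` because the adverb is only allowed inside the optional adjective group, exactly as the nested braces in the pattern require.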

[0035] Whether the target expression satisfies pre-editing rule 2) is judged by searching the target expression for the word "for". Whether the target expression satisfies pre-editing rule 3) is checked by, for example, the following procedure.

[0036] To judge whether a target expression of the text satisfies the pre-editing rule, the target expression is searched from the beginning for a finite verb that could serve as the predicate. If a predicate candidate is found, a check is made for a preceding noun phrase whose head noun agrees with the predicate candidate in person and number. The noun phrase pattern above is used for this search. If such a noun phrase exists, it is regarded as the subject, the whole target expression is regarded as a sentence, and pre-editing rule 3) is judged not to be satisfied.
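The procedure for rule 3) can be sketched in Python as below. The `Token` fields, the tag names, and the agreement test are assumptions for illustration. The sketch also simplifies the procedure: it accepts any preceding noun that agrees with the predicate candidate, whereas the description locates a full noun phrase and checks its head noun.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Token:
    word: str
    pos: str                      # e.g. "NOUN", "FINITE_VERB", "VERB", "TO"
    number: Optional[str] = None  # "sg" or "pl" where it applies
    person: int = 3

def agrees(noun: Token, verb: Token) -> bool:
    """Person and number agreement between a noun and a finite verb."""
    return noun.person == verb.person and noun.number == verb.number

def is_sentence(tokens: List[Token]) -> bool:
    for i, tok in enumerate(tokens):
        if tok.pos != "FINITE_VERB":
            continue  # not a predicate candidate
        # Simplification: any agreeing noun before the candidate counts as subject.
        if any(t.pos == "NOUN" and agrees(t, tok) for t in tokens[:i]):
            return True
    return False

def satisfies_rule_3(tokens: List[Token]) -> bool:
    """Rule 3) holds when the expression is not judged to be a sentence."""
    return not is_sentence(tokens)

headline = [Token("Agency", "NOUN", "sg"), Token("to", "TO"),
            Token("inspect", "VERB"), Token("banks", "NOUN", "pl")]
sentence = [Token("Agency", "NOUN", "sg"),
            Token("inspects", "FINITE_VERB", "sg"), Token("banks", "NOUN", "pl")]

print(satisfies_rule_3(headline))  # True
print(satisfies_rule_3(sentence))  # False
```

In the headline, "inspect" follows "to" and is tagged as a non-finite verb, so no predicate candidate is found and the expression is not treated as a sentence.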

[0037] If the text "Agency to inspect health of 8 banks" is input without pre-editing to a conventional machine translation system designed mainly for standard expressions, the system produces a translation that is inappropriate for a newspaper headline, such as 「８つの銀行の健全性を調査するための機関」 ("an agency for inspecting the health of eight banks").

[0038] In contrast, as can be seen from the description of the morphological analysis in STEP 1, the text identification in STEP 2 to STEP 7, and the text pre-editing in STEP 8 to STEP 11, if the text "Agency will inspect health of 8 banks", as rewritten by the text pre-editing apparatus of the present invention, is input into a conventional machine translation system, a translation that is appropriate as a newspaper headline is obtained, for example 「機関は８つの銀行の健全性を調査するであろう」 ("The agency will inspect the health of eight banks").
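The rewrite in this example can be sketched as a lookup in a per-type rule table followed by a substitution. The table contents, the type name `news_headline`, and the function `pre_edit` are hypothetical. The single rule shown (rewriting "to" as "will") is inferred from the example text rather than taken from FIG. 5, and the applicability checks for the pre-editing rules are assumed to have passed already.

```python
import re

# Hypothetical pre-editing rule table: text type -> list of
# (pattern to detect, standard notation to substitute).
PRE_EDIT_RULES = {
    "news_headline": [
        (re.compile(r"\bto\b"), "will"),
    ],
}


def pre_edit(text, text_type):
    """Rewrite the first match of each rule selected for the given text type."""
    for pattern, standard in PRE_EDIT_RULES.get(text_type, []):
        text = pattern.sub(standard, text, count=1)
    return text
```

With this table, `pre_edit("Agency to inspect health of 8 banks", "news_headline")` returns "Agency will inspect health of 8 banks", and text of a type with no registered rules is returned unchanged.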

[0039] FIG. 6 is a diagram showing examples of systems to which the automatic text pre-editing apparatus of the present invention is applied. The automatic text pre-editing apparatus of the present invention is independent of any natural language processing system, and the application systems shown in FIG. 6 are examples of systems that use the pre-edited text output from the apparatus.

[0040] For example, a machine translation system can be provided in which a machine translation apparatus that performs machine translation between different languages uses the automatic text pre-editing apparatus of the present invention to pre-edit text before translation. Likewise, a natural language interface can be provided in which an interface that connects natural language devices and controls the transfer of text written in a natural language uses the automatic text pre-editing apparatus of the present invention to pre-edit the text. Further, a text summarization system can be provided in which a text summarization apparatus that automatically summarizes text written in a natural language uses the automatic text pre-editing apparatus of the present invention to pre-edit the text.

[0041]

[Effects of the Invention] According to the present invention, in pre-editing text written in a natural language, identification of the text type and selection of the pre-editing rules and standard notation for that type are automated, which reduces the burden on the user. In addition, an automatic text pre-editing apparatus can be provided that rewrites, into the standard notation, the words in the pre-editing target portion of the text that are detected on the basis of the automatically selected pre-editing rules.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of an automatic text pre-editing apparatus according to an embodiment of the present invention.

FIG. 2 is a flowchart showing the procedure of the automatic text pre-editing process of the embodiment.

FIG. 3 is a diagram showing an example of input text in the embodiment.

FIG. 4 is a diagram showing an example of the identification rules stored in the identification rule table of the embodiment.

FIG. 5 is a diagram showing an example of the pre-editing rules stored in the pre-editing rule table of the embodiment.

FIG. 6 is a diagram showing examples of systems to which the automatic text pre-editing apparatus of the present invention is applied.

[Explanation of symbols]

1 control unit
2 input unit
3 output unit
4 table memory
4a dictionary table
4b identification rule table
4c pre-editing rule group table
5 program memory
5a morphological analysis unit
5b identification unit
5c pre-editing unit
5c-1 selection unit
5c-2 search unit
5c-3 rewriting unit
6 buffer memory
6a text buffer
6b morphological analysis result buffer
6c identification result buffer
6d selection result buffer
6e rewritten text buffer
7 bus line
8 storage medium

Claims

[Claims]

1. An automatic text pre-editing device comprising: a table memory consisting of a dictionary table that stores words, part-of-speech information, and morpheme information related to text written in a natural language, an identification rule table that stores identification rules for identifying the type of text, and a pre-editing rule table that stores, for each type of text, pre-editing rules and standard notations for pre-editing the text; an input unit for inputting text; a morphological analysis unit that refers to the dictionary table and performs morphological analysis on each word of the input text to extract part-of-speech information and morpheme information; an identification unit that refers to the identification rule table and identifies the type of the text from the extracted part-of-speech information and morpheme information; and a pre-editing unit that refers to the pre-editing rule table, detects a word in the pre-editing target portion of the text using the editing rule corresponding to the identified type, and rewrites the word into the standard notation.

2. The automatic text pre-editing device according to claim 1, wherein the pre-editing unit comprises: a selection unit that refers to the pre-editing rule table and selects the editing rule corresponding to the type of text identified by the identification unit; a search unit that searches for a word in the pre-editing target portion of the text corresponding to the selected editing rule; and a rewriting unit that rewrites the found word into the standard notation.

3. The automatic text pre-editing device according to claim 1, wherein the identification rule table stores identification rules for discriminating whether the type of text is a scientific or technical paper, a patent specification, an instruction manual for a device, or a news article classified by field.

4. An automatic text pre-editing method comprising: storing, in a dictionary table, words, part-of-speech information, and morpheme information related to text written in a natural language; storing, in an identification rule table, identification rules for identifying the type of text; storing, in a pre-editing rule table, pre-editing rules and standard notations for pre-editing text, for each type of text; inputting text using an input unit; using a morphological analysis unit, referring to the dictionary table and performing morphological analysis on each word of the input text to extract part-of-speech information and morpheme information; using an identification unit, referring to the identification rule table and identifying the type of the text from the extracted part-of-speech information and morpheme information; and using a pre-editing unit, referring to the pre-editing rule table, detecting a word in the pre-editing target portion of the text using the editing rule corresponding to the identified type, and rewriting the word into the standard notation.

5. A storage medium storing an automatic text pre-editing program that causes a computer to execute processing of: referring to a dictionary table that stores words, part-of-speech information, and morpheme information related to text written in a natural language, performing morphological analysis on input text, and extracting part-of-speech information and morpheme information; referring to an identification rule table that stores identification rules for identifying the type of text, and identifying the type of the text from the extracted part-of-speech information and morpheme information; and referring to a pre-editing rule table that stores, for each type of text, pre-editing rules and standard notations for pre-editing text, detecting a word in the pre-editing target portion of the text using the editing rule corresponding to the identified type, and rewriting the word into the standard notation.

6. A machine translation system in which a machine translation apparatus that performs machine translation between different languages pre-edits text for translation using the automatic text pre-editing device according to claim 1.

7. A natural language interface in which an interface that connects natural language devices and controls the transfer of text written in a natural language pre-edits the text using the automatic text pre-editing device according to claim 1.

8. A text summarization system for automatically summarizing text described in a natural language, wherein the text summarization system pre-edits the text using the automatic text pre-editing device according to claim 1.