JP4054353B2

JP4054353B2 - Machine translation apparatus and machine translation program

Info

Publication number: JP4054353B2
Application number: JP2006196519A
Authority: JP
Inventors: 陽子小▲高▼
Original assignee: Toshiba Corp; Toshiba Solutions Corp
Current assignee: Toshiba Corp; Toshiba Digital Solutions Corp
Priority date: 2006-07-19
Filing date: 2006-07-19
Publication date: 2008-02-27
Anticipated expiration: 2026-07-19
Also published as: JP2008027027A

Description

The present invention relates to a machine translation apparatus and a machine translation program that use a computer to automatically translate text in a first language (the source language) into text in a second language (the target language).

Machine translation apparatuses and software that use a computer to automatically translate text written in a first natural language into a second natural language are in practical use. For example, many translation software products are sold for personal computers, and machine translation services are also offered over the Internet.

Consider, for example, translating Japanese (the first natural language) into English (the second natural language). The source text is first divided into words by morphological analysis. For each word, a translation dictionary, which is a database prepared in advance that maps source words to translations, is searched, and an appropriate translation is selected to render the Japanese in English.

If the word being looked up is not found among the headwords of the translation dictionary, no translation can be obtained and the source word is treated as an unknown word. As a result, some unknown words may appear in the output English text untranslated, that is, still in Japanese.

One way to avoid this is to treat a word written in katakana as a loanword and output it converted directly into romaji (Roman-letter transliteration). The romaji obtained this way, however, often differs greatly from the spelling of the original English word, to the point that the original English word sometimes cannot even be guessed from it.

One approach to handling katakana unknown words that are not found in the translation dictionary is to split the word into individual katakana characters, regard each character as a word, and analyze the result as a chain of one-character words (see, for example, Patent Document 1).

Another approach is a morphological analysis method that, rather than using one-character words, splits the word into several katakana strings of multiple characters and, focusing on their readings, checks whether another notation with the same reading (for example, one written in hiragana or mixed with kanji) is in the translation dictionary. If it is, the word is handled as a word registered in the translation dictionary rather than as an unknown word (see, for example, Patent Document 2).
Patent Document 1: JP 62-208169 A
Patent Document 2: JP-A-5-20304

In the prior art, however, even when the source word is converted into one-character words or into another notation, whether translation succeeds ultimately depends on whether those forms are registered in the translation dictionary. If they are not, the word still cannot be translated properly.

Moreover, converting a katakana source word into romaji on the assumption that it is a loanword is highly inappropriate for words that are not loanwords in the first place.

An object of the present invention is therefore to provide a machine translation apparatus and a machine translation program that can output a highly accurate translation even for an unknown word that is not registered in the translation dictionary.

A machine translation apparatus according to the present invention comprises: a translation dictionary unit that stores first-language words paired with the corresponding second-language words; an input processing unit that morphologically analyzes first-language source text and decomposes it into words; a translation dictionary search unit that looks up the words decomposed by the input processing unit in the translation dictionary unit and selects second-language words as translations; a spelling correspondence table that stores fragments, each consisting of one or more characters of a first-language word, in association with one or more corresponding spellings in second-language characters; a second-language knowledge database that accumulates attribute information such as the spelling, meaning, part of speech, field, and co-occurrence information of second-language words; an unknown word processing unit that, when a word decomposed by the input processing unit is an unknown word not present in the translation dictionary unit, further decomposes the unknown word into fragments of one or more characters, looks up the fragments in the spelling correspondence table to extract second-language spellings, and, when combining the extracted spellings to obtain a translation of the unknown word, refers to the second-language knowledge database and determines the priority of the candidate translations on the basis of the attribute information; and an output processing unit that receives the translations output from the translation dictionary search unit and the unknown word processing unit, assembles them, and outputs them as translated text.

According to the present invention, accurate translations can be output for unknown words that are not registered in the translation dictionary.

The best mode for carrying out the present invention will now be described. A word in the first-language source text that is not registered in the translation dictionary, that is, an unknown word, is further divided into fragments; a database of second-language spellings corresponding to those fragments is consulted to obtain spellings; and the spellings are combined to obtain a second-language word. In this way, even source text that contains unknown words can be translated accurately.

The embodiment below describes the machine translation apparatus of the present invention for the case where the first language is Japanese and the second language is English. FIG. 1 is a block diagram of the machine translation apparatus according to the embodiment of the present invention.

The machine translation apparatus 101 includes an input processing unit 102, a translation dictionary search unit 103, a translation dictionary unit 104, an unknown word processing unit 105, a spelling correspondence table 106, an output processing unit 107, a second-language vocabulary database 302, and a second-language knowledge database 303. The second-language knowledge database 303 may be provided within the translation dictionary unit 104.

The machine translation apparatus 101 takes as input, in the form of electronic data, a document to be translated written in the first language (the source language), that is, the source text. It translates the document and outputs the result written in the second language, that is, the translated text. The input processing unit 102 reads in the source text, extracts the words that make it up by morphological analysis or a similar technique, and outputs them to the translation dictionary search unit 103.

The translation dictionary search unit 103 checks whether each of these words is present as a headword in the translation dictionary unit 104. The translation dictionary unit 104 is a dictionary for translating from the first language into the second language; it stores first-language words (source words) and the corresponding second-language words (translations). For a source word registered as a headword in the translation dictionary unit 104, the translation is sent to the output processing unit 107.

The unknown word processing unit 105 treats a source word that is not registered as a headword in the translation dictionary unit 104 (for example, a loanword or a new word) as an unknown word and converts it into an appropriate translation. The spelling correspondence table 106 holds, registered in advance, candidate second-language spellings for fragments of first-language source words. The second-language vocabulary database 302 stores in advance second-language words that are candidates for unknown words. The second-language knowledge database 303 accumulates, in addition to the spelling of second-language words, attribute information such as meaning, part of speech, field, and co-occurrence information. The unknown word processing unit 105 refers to these when it processes an unknown word.

The output processing unit 107 receives the translations output from the translation dictionary search unit 103 or the unknown word processing unit 105, assembles them, and outputs the result as translated text.
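The data flow among units 102, 103, 105, and 107 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name `translate`, the pre-segmented word list standing in for morphological analysis, and the space-joined output (no reordering) are all assumptions made for the example.

```python
from typing import Callable, Dict, List


def translate(words: List[str],
              dictionary: Dict[str, str],
              handle_unknown: Callable[[str], str]) -> str:
    """Route each word to the dictionary or to the unknown-word handler.

    `words` stands in for the output of input processing unit 102.
    `dictionary` stands in for translation dictionary unit 104.
    `handle_unknown` stands in for unknown word processing unit 105.
    """
    translated = []
    for word in words:
        if word in dictionary:
            # Role of translation dictionary search unit 103: headword found.
            translated.append(dictionary[word])
        else:
            # Headword missing: the word is treated as an unknown word.
            translated.append(handle_unknown(word))
    # Role of output processing unit 107, simplified to joining in source order.
    return " ".join(translated)
```

For instance, `translate(["犬", "ユビキタス"], {"犬": "dog"}, lambda w: "ubiquitous")` sends the second word through the handler because it has no headword in the toy dictionary.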

FIG. 2 is a hardware configuration diagram of the machine translation apparatus according to the embodiment of the present invention. The machine translation apparatus 101 according to the embodiment is implemented on a computer. A central processing control device 201, a ROM (Read Only Memory) 202, and a RAM (Random Access Memory) 203 are connected to a bus 210. An input device 204, a display device 205, a communication control device 206, a storage device 207, and a removable disk 208 are connected to an input/output interface 209, which is in turn connected to the bus 210.

The central processing control device 201 reads a boot program from the ROM 202 and executes it in response to an input signal from the input device 204, and then reads the operating system stored in the storage device 207. The central processing control device 201 also controls the various devices in response to input signals from the input device 204, the communication control device 206, and so on; reads programs and data stored in the RAM 203, the storage device 207, and so on and loads them into the RAM 203; and, according to the commands of the program read from the RAM 203, carries out the series of processes described below, such as computing or processing data.

The input device 204 consists of input devices such as a keyboard and a mouse with which an operator enters various operations. It generates an input signal based on the operator's operation and sends it to the central processing control device 201 via the input/output interface 209 and the bus 210. The display device 205 is a CRT (Cathode Ray Tube) display, a liquid crystal display, or the like. It receives output signals to be displayed from the central processing control device 201 via the bus 210 and the input/output interface 209 and displays, for example, the processing results of the central processing control device 201. The communication control device 206 is a device such as a LAN card or a modem that connects the machine translation apparatus 101 to a communication network such as the Internet or a LAN. Data exchanged with the communication network through the communication control device 206 is passed to and from the central processing control device 201 as input or output signals via the input/output interface 209 and the bus 210.

The storage device 207 is a magnetic disk device that stores the programs executed by the central processing control device 201 and their data. The removable disk 208 is an optical disk or a flexible disk; signals read or written by the disk drive are passed to and from the central processing control device 201 via the input/output interface 209 and the bus 210.

To make such a computer function as the machine translation apparatus 101 of the embodiment, the machine translation program and the translation dictionary unit 104 are stored in the storage device 207. The input processing unit 102, the translation dictionary search unit 103, the unknown word processing unit 105, and the output processing unit 107 are realized when the machine translation program is read and executed by the central processing control device 201 of the machine translation apparatus 101. The spelling correspondence table 106, the second-language vocabulary database 302, and the second-language knowledge database 303 are held in the RAM 203. When the second-language knowledge database 303 is provided within the translation dictionary unit 104, it is stored in the storage device 207. The function of the input processing unit 102 can be implemented with the input device 204 or the communication control device 206, or data may be input from the storage device 207 or the removable disk 208. The function of the output processing unit 107 can be implemented with the display device 205 or the communication control device 206, or data may be output to the storage device 207 or the removable disk 208.

FIG. 3 is a block diagram of the unknown word processing unit 105 (also called the unknown word conversion unit) in the machine translation apparatus 101. The unknown word processing unit 105 includes a lexical division unit 105a, a translation synthesis unit 105b, and a spelling correspondence table search unit 301, and refers to the spelling correspondence table 106. It also refers to the second-language vocabulary database 302 and the second-language knowledge database 303 as needed.

When the unknown word processing unit 105 receives a first-language unknown word as described above, the lexical division unit 105a divides the unknown word into several fragments. For each fragment, the spelling correspondence table search unit 301 searches the spelling correspondence table 106 and, if the fragment is found, converts it to the spelling registered in the table. The translation synthesis unit 105b combines the spellings of the fragments, referring to the second-language vocabulary database 302 and the second-language knowledge database 303 as needed, and outputs a translation of the original source word.

Each component of the unknown word processing unit 105 is described in detail below. The lexical division unit 105a divides an unknown word into several fragments. It first tries to keep each fragment as long as possible, and therefore begins by dividing the word in two. For example, if "ユビキタス" (yubikitasu, "ubiquitous") is an unknown word, it is divided into the fragments "ユビ" and "キタス". If the subsequent processing by the spelling correspondence table search unit 301 finds no suitable spelling for each fragment, the division point is shifted forward or backward, or the number of division points is increased, and the process is repeated. For example, the division point is shifted to give "ユビキ" and "タス". If the spelling correspondence table search unit 301 still finds no suitable spelling for each fragment, the number of fragments is increased, for example to the three fragments "ユ", "ビキ", and "タス", or to "ユビ", "キ", and "タス".

Because the unknown word is not split down to single characters from the outset, a coarse division is more likely to yield a suitable translation early on, so unknown words can be translated efficiently.
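The coarse-to-fine division order can be sketched as a generator that yields two-way splits before three-way splits, and so on. The description does not fix the exact order within one split count, so the tie-break used here (prefer splits whose shortest fragment is longest) is an assumption chosen so that the example from the description comes out first; the function name `segmentations` is likewise hypothetical.

```python
from itertools import combinations


def segmentations(word, min_parts=2):
    """Yield ways to split `word`, fewest (longest) fragments first."""
    n = len(word)
    for parts in range(min_parts, n + 1):
        candidates = []
        # Choose parts-1 cut positions among the n-1 gaps between characters.
        for cuts in combinations(range(1, n), parts - 1):
            bounds = (0, *cuts, n)
            candidates.append([word[a:b] for a, b in zip(bounds, bounds[1:])])
        # Assumed tie-break: balanced splits first (stable sort keeps cut order).
        candidates.sort(key=lambda pieces: -min(len(p) for p in pieces))
        yield from candidates
```

With this ordering, the first two results for "ユビキタス" are the splits "ユビ" / "キタス" and "ユビキ" / "タス", matching the order in which the description tries them. A caller would stop iterating as soon as one split converts cleanly, which is what makes the coarse-first order pay off.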

The spelling correspondence table search unit 301 searches the spelling correspondence table 106 for each fragment passed from the lexical division unit 105a. If a fragment is not found in the spelling correspondence table 106 under any of the divisions tried by the lexical division unit 105a, the fragment is converted to romaji as the default operation. For example, "ユビ" is converted to "yubi".
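The default romaji conversion might look like the sketch below. The kana table is deliberately partial and handles only single katakana characters (no long-vowel mark, small kana, or geminate consonants); both the table contents and the name `to_romaji` are assumptions for illustration, since the description does not specify a romanization scheme.

```python
# Partial single-character katakana table (illustrative only).
ROMAJI = {
    "ユ": "yu", "ビ": "bi", "キ": "ki", "タ": "ta", "ス": "su",
    "ト": "to", "ル": "ru", "メ": "me", "マ": "ma", "ウ": "u",
}


def to_romaji(fragment):
    """Transliterate a katakana fragment character by character.

    Characters missing from the table are passed through unchanged.
    """
    return "".join(ROMAJI.get(ch, ch) for ch in fragment)


print(to_romaji("ユビ"))  # prints "yubi", the example given in the description
```

Applied to the whole word "ユビキタス", the same function gives "yubikitasu", which illustrates why plain romaji is only a last resort: it is far from the English spelling "ubiquitous".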

The spelling correspondence table 106 holds, registered in advance, candidate second-language spellings for first-language characters or character strings. In particular, when the first language is Japanese and the second language is English, the table does more than simply map Japanese phonetic characters to romaji: it also registers the conventions Japanese speakers tend to follow when writing loanwords. For example, "mouth" is usually written "マウス", so "ス" is mapped to "th"; and "meter" is usually written "メートル", so "トル" is mapped to "ter".

The spelling correspondence table search unit 301 extracts the spelling of each fragment of the divided source word from the spelling correspondence table 106. The translation synthesis unit 105b receives the spellings of the fragments from the spelling correspondence table search unit 301, combines them, and outputs the result as a translation. Because the spelling correspondence table 106 may return several candidate spellings for a fragment, combining them may also produce several candidate translations. The most appropriate translation can be selected by assigning a priority to each candidate translation, taking into account the priority of each candidate spelling. Specifically, one of the following processes (1) to (4) is performed.

(1) Combined translations that cannot occur in the second language, or that occur only rarely, are excluded or given a lower priority.

(2) Translations that contain romaji are given a lower priority.

(3) The second-language vocabulary database 302 is searched, and translations found in it are given a higher priority.

(4) The second-language knowledge database 303 is consulted to determine, from information about the field of the source word, the tendencies of words in that field, and these are used to set priorities when combining translations. For example, many names of substances used in chemical formulas and many plant names have characteristic spellings, so when the source word is such a substance name or plant name, spellings with those characteristics are given a higher priority.
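Processes (1) to (4) can be pictured as adjustments to a numeric priority. The sketch below combines all four in one scoring function, although the description only requires one of them; the weights, the vowel-based plausibility check, the toy vocabulary, and the field suffixes are all assumptions invented for illustration.

```python
# Toy stand-ins for vocabulary database 302 and knowledge database 303.
VOCABULARY = {"ubiquitous", "meter", "mouth"}
FIELD_SUFFIXES = {"chemistry": ("ium", "ide")}


def score(spelling, used_romaji, field=None):
    """Return a priority for one candidate translation (higher is better)."""
    priority = 0
    if not any(v in spelling for v in "aeiouy"):
        priority -= 2  # (1) implausible as a second-language word
    if used_romaji:
        priority -= 1  # (2) contains a romaji fallback
    if spelling in VOCABULARY:
        priority += 2  # (3) attested in the vocabulary database
    if field and spelling.endswith(FIELD_SUFFIXES.get(field, ())):
        priority += 1  # (4) spelling typical of the source word's field
    return priority


def best(candidates, field=None):
    """Pick the spelling with the highest priority.

    `candidates` is a list of (spelling, used_romaji) pairs.
    """
    return max(candidates, key=lambda c: score(c[0], c[1], field))[0]
```

Given the candidates `[("meetoru", True), ("meter", False)]`, `best` returns "meter": the romaji form is penalized under (2) and the attested form is rewarded under (3).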

The second-language vocabulary database 302 is a collection of second-language words that are candidates for unknown words. Such words include new words, coinages, loanwords, abbreviations, and trademark names, and are collected, for example, from general magazines and newspaper articles in the second language. The database may therefore contain words, including inflected forms, that are not registered as headwords in the translation dictionary unit 104 of the machine translation apparatus 101. The vocabulary database 302 need not record the meaning or usage of a word; it is enough that the translation synthesis unit 105b can consult it to check whether a word with a given spelling exists. This effectively improves the accuracy of the translations output by the translation synthesis unit 105b.

The second-language knowledge database 303 accumulates a large amount of information of various kinds: not only the spelling of second-language words but also their meaning, part of speech, field, co-occurrence information, and so on. Machine translation apparatuses are generally equipped with such a database. As described for process (4) above, by consulting it the translation synthesis unit 105b can assign a priority to each of the combined candidate translations, which effectively improves the accuracy of the translations it outputs.

FIG. 4 is an explanatory diagram of the spelling correspondence table 106. The spelling correspondence table 106 uses first-language fragments 401 as headwords and registers in advance the corresponding second-language spellings 402. For example, as noted above, English "meter" is handled in Japanese as the loanword "メートル", so the English spelling "ter" corresponds to the Japanese fragment "トル". Similarly, the English spelling "th" often corresponds to the Japanese fragment "ス".

Registering such correspondences in the spelling correspondence table 106 in advance makes it possible to translate an unknown word into a more realistic and appropriate translation, rather than merely converting it to romaji as in the prior art.
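A table of this kind, with several candidate spellings per fragment, could be held as a mapping from fragment to a list of spellings. In the sketch below only the pairs "トル" to "ter" and "ス" to "th" come from the description; every other entry, and the function name `candidate_translations`, is an assumption added so that the example runs.

```python
from itertools import product

SPELLING_TABLE = {
    "トル": ["ter"],      # from the description (meter / メートル)
    "ス": ["th", "s"],    # "th" from the description; "s" is an assumed extra
    "メー": ["me"],       # assumed entry
    "マウ": ["mou"],      # assumed entry
}


def candidate_translations(fragments):
    """Combine every candidate spelling of every fragment.

    Returns None when some fragment has no entry in the table, so the
    caller can try a different division of the unknown word.
    """
    options = []
    for fragment in fragments:
        if fragment not in SPELLING_TABLE:
            return None
        options.append(SPELLING_TABLE[fragment])
    return ["".join(combo) for combo in product(*options)]
```

Here `candidate_translations(["メー", "トル"])` yields the single candidate "meter", while `["マウ", "ス"]` yields both "mouth" and "mous", which is the situation in which a priority among candidates is needed.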

FIG. 5 is a flowchart showing the processing performed by the unknown word processing unit 105. Processing starts when a source word to be treated as an unknown word is received from the translation dictionary search unit 103 (S501). The lexical division unit 105a divides the source word into several fragments (S502).

For each fragment, the spelling correspondence table search unit 301 searches the spelling correspondence table 106 (S503) and determines whether the fragment is present in the table (S504). If it is, the fragment is converted to the spelling registered in the spelling correspondence table (S505). If it is not, the fragment is converted to romaji (S506).

It is then determined whether the spelling correspondence table 106 has been searched for every fragment (S507); if not, processing returns to step S503. Once every fragment has been searched, it is determined whether the result contains romaji (S508). If it does, it is determined whether every way of dividing the unknown word has been tried (S509); if another division remains, processing returns to step S502 and the unknown word is divided again in that way. If step S509 finds that every division has been tried, or step S508 finds that the result contains no romaji, the translation synthesis unit 105b determines the priority of the spelling combinations (S510), and processing ends (S511).
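The loop of steps S502 to S509 can be sketched as below, with step numbers in the comments. This is one possible reading under stated assumptions: the table and kana entries are toy data, the priority step S510 is omitted (candidates are returned unranked), and when every division still needs romaji the sketch simply falls back to romaji for the whole word, which the flowchart does not specify.

```python
from itertools import combinations, product

SPELLING_TABLE = {"メー": ["me"], "トル": ["ter"]}  # toy entries
ROMAJI = {"ユ": "yu", "ビ": "bi", "メ": "me", "ト": "to", "ル": "ru"}  # partial


def romaji(text):
    """Character-by-character fallback; unmapped characters pass through."""
    return "".join(ROMAJI.get(ch, ch) for ch in text)


def splits(word):
    """Yield divisions of `word`, two fragments first, then three, and so on."""
    n = len(word)
    for parts in range(2, n + 1):
        for cuts in combinations(range(1, n), parts - 1):
            bounds = (0, *cuts, n)
            yield [word[a:b] for a, b in zip(bounds, bounds[1:])]


def convert_unknown(word):
    """Return candidate spellings for an unknown word (S501)."""
    for fragments in splits(word):                      # S502, re-entered via S509
        options = []
        used_romaji = False
        for fragment in fragments:                      # S503, repeated until S507
            if fragment in SPELLING_TABLE:              # S504
                options.append(SPELLING_TABLE[fragment])  # S505
            else:
                options.append([romaji(fragment)])      # S506
                used_romaji = True
        if not used_romaji:                             # S508
            return ["".join(c) for c in product(*options)]
    # S509: every division tried; assumed fallback is whole-word romaji.
    return [romaji(word)]
```

For "メートル" the first division ("メ" / "ートル") needs romaji, so the loop moves on to "メー" / "トル", where both fragments are in the table and the candidate "meter" is returned.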

According to the embodiment of the present invention, when an unknown word is encountered during translation, a translation can be derived, even though the word as a whole is unknown, by obtaining a spelling for each of its fragments and joining the spellings together. Moreover, by building in advance not just simple spelling conversions but also the conversions that Japanese speakers tend to apply when deciding how to write a loanword in Japanese, an even more accurate translation can be derived.

By using the second-language knowledge database, the priority among candidate translations can be determined on the basis of a large amount of information of various kinds, including not only spelling but also meaning, part of speech, field, and co-occurrence. Similarly, the priority among candidate translations can be determined using the information accumulated in the second-language vocabulary database, which stores candidates for unknown words in advance.
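As a rough illustration of how such prioritization might work, the following Python sketch scores synthesized candidates against a vocabulary database and a knowledge database. The database contents, the attribute names (`pos`, `field`), the function names, and the score weights are all hypothetical; the source does not specify a scoring formula.

```python
# Hypothetical second-language vocabulary database: known words that are
# plausible translations of unknown words.
VOCABULARY = {"test", "text", "toast"}

# Hypothetical second-language knowledge database: spelling -> attributes.
KNOWLEDGE = {
    "test": {"pos": "noun", "field": "general"},
    "toast": {"pos": "noun", "field": "food"},
}


def score(candidate, expected_pos=None, field=None):
    """Score a candidate spelling; the weights are arbitrary assumptions."""
    points = 0
    if candidate in VOCABULARY:          # a vocabulary match raises priority
        points += 2
    attrs = KNOWLEDGE.get(candidate)
    if attrs is not None:
        if expected_pos is not None and attrs["pos"] == expected_pos:
            points += 1                  # part of speech fits the context
        if field is not None and attrs["field"] == field:
            points += 1                  # field fits the document
    return points


def rank(candidates, expected_pos=None, field=None):
    """Return candidates ordered from highest to lowest priority."""
    # sorted() is stable, so ties keep their original relative order.
    return sorted(candidates, key=lambda c: -score(c, expected_pos, field))
```

For example, `rank(["tesuto", "tesut", "test"], expected_pos="noun")` places `"test"` first, because it is the only candidate found in either database.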

A block diagram of the machine translation apparatus according to the embodiment of the present invention.
A hardware configuration diagram of the machine translation apparatus according to the embodiment of the present invention.
A block diagram of the unknown word conversion unit of the machine translation apparatus according to the embodiment of the present invention.
An explanatory diagram of the spelling correspondence table of the machine translation apparatus according to the embodiment of the present invention.
A flowchart of the processing performed by the unknown word conversion unit of the machine translation apparatus according to the embodiment of the present invention.

Explanation of symbols

101: machine translation apparatus; 102: input processing unit; 103: translation dictionary search unit; 104: translation dictionary unit; 105: unknown word processing unit; 105a: token division unit; 105b: translated-word synthesis unit; 106: spelling correspondence table; 107: output processing unit

Claims

A machine translation apparatus comprising:
a translation dictionary unit that stores pairs of a first-language word or phrase and a corresponding second-language word or phrase;
an input processing unit that morphologically analyzes an original text in the first language and decomposes it into words;
a translation dictionary search unit that searches the translation dictionary unit for each word decomposed by the input processing unit and selects a second-language word as its translation;
a spelling correspondence table that stores, in association with each other, a token consisting of one or more characters of a first-language word and one or more corresponding tokens spelled in second-language characters;
a second-language knowledge database that stores attribute information of second-language words, such as spelling, meaning, part of speech, field, and co-occurrence information;
an unknown word processing unit that, when a word decomposed by the input processing unit is an unknown word not present in the translation dictionary unit, further decomposes the unknown word into tokens each consisting of one or more characters, searches the spelling correspondence table for the decomposed tokens to extract second-language tokens, and, when synthesizing the extracted second-language tokens to obtain a translation of the unknown word, determines the priority of the translation candidates by referring to the attribute information in the second-language knowledge database and thereby obtains the translation of the unknown word; and
an output processing unit that receives the translations output from the translation dictionary search unit and the unknown word processing unit, assembles them, and outputs a translated sentence.

The machine translation apparatus according to claim 1, further comprising a second-language vocabulary database that stores, in advance, second-language words as candidates for unknown words, wherein the unknown word processing unit, when obtaining the translation of the unknown word by combining second-language tokens, refers to the second-language vocabulary database and raises the priority of a translation candidate that matches an entry in that database.

A machine translation program for causing a computer to function as:
means for decomposing an original sentence in a first language into a plurality of words;
means for searching a translation dictionary unit, which stores pairs of a first-language word or phrase and a corresponding second-language word or phrase, for each decomposed first-language word and selecting a second-language word as its translation;
means for decomposing, when a decomposed first-language word is an unknown word not present in the translation dictionary unit, the unknown word into tokens each consisting of one or more characters;
means for searching, for the decomposed tokens, a spelling correspondence table that stores, in association with each other, a token consisting of one or more characters of a first-language word and one or more corresponding tokens spelled in second-language characters, and extracting second-language tokens;
means for determining, when synthesizing the extracted second-language tokens to obtain a translation of the unknown word, the priority of the translation candidates by referring to attribute information in a second-language knowledge database that stores attribute information of second-language words, such as spelling, meaning, part of speech, field, and co-occurrence information, and thereby determining the translation of the unknown word; and
means for receiving and assembling the translations of the decomposed words and the translation of the unknown word and outputting them as a translated sentence.