JPH11238061A - Japanese text analysis method

Info

Publication number: JPH11238061A
Application number: JP10040718A
Authority: JP
Inventors: Hiroki Kamanaka; 博樹釜中
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1998-02-23
Filing date: 1998-02-23
Publication date: 1999-08-31

Abstract

PROBLEM TO BE SOLVED: To provide a Japanese text analysis method that is not adversely affected by words written with visually similar characters. SOLUTION: The method divides an arbitrary Japanese text into words from the beginning to the end of a sentence based on the longest-match method, in which a dictionary search by word notation matching retrieves the longest word that matches a headword (the notation registered in the word dictionary) and the text is divided from its beginning using the retrieved word. The method uses a table that stores visually similar characters. During word notation matching, the table is consulted in step S28, and characters found there as similar are treated as the same character for matching.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a Japanese text analysis method that divides an arbitrary Japanese text into words and is incorporated in various devices and systems requiring text analysis, such as Japanese text-to-speech conversion devices and machine translation devices.

[0002]

2. Description of the Related Art

Japanese text analysis methods incorporated in Japanese text-to-speech conversion devices and the like generally use the left longest-match method when, for example, the Japanese text is a horizontally written document. Given constraints on response time and memory capacity, this method offers fast processing, low memory use, and simple processing. The left longest-match method divides a text from the left (the beginning of the sentence) using the longest word that matches a headword in the word dictionary. FIG. 2 is a flowchart showing the processing flow of a conventional, general Japanese text analysis method based on the left longest-match method, and FIG. 3 is a flowchart showing the conventional dictionary search by notation matching in FIG. 2. Below, (1) gives an overall description of the conventional method shown in FIG. 2, and (2) describes the dictionary search by notation matching in FIG. 2, as shown in FIG. 3.

Japanese text normally consists of half-width characters (1 byte) and full-width characters (2 bytes). As preprocessing for text analysis, however, every half-width character in the input text is assumed to be converted to its corresponding full-width character, and the text analysis described below is performed on text consisting only of full-width characters. This simplifies the dictionary search by notation matching (character comparison).
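The full-width preprocessing described above can be sketched as follows. This is only an illustration in Python that assumes Unicode input rather than the 1-byte and 2-byte JIS encodings of the original system; the function name `to_fullwidth` is hypothetical and does not come from the patent.

```python
import re
import unicodedata

# Runs of half-width katakana and related marks (U+FF61 to U+FF9F).
_HALFWIDTH_KANA = re.compile("[\uff61-\uff9f]+")


def to_fullwidth(text: str) -> str:
    """Return text with half-width characters replaced by full-width ones."""
    # NFKC turns half-width katakana into full-width katakana and composes
    # voiced marks. It is applied only to half-width kana runs because it
    # would otherwise fold full-width ASCII back to half-width.
    text = _HALFWIDTH_KANA.sub(
        lambda m: unicodedata.normalize("NFKC", m.group()), text
    )
    out = []
    for ch in text:
        if ch == " ":
            out.append("\u3000")                 # ideographic (full-width) space
        elif "!" <= ch <= "~":
            out.append(chr(ord(ch) + 0xFEE0))    # ASCII to its full-width form
        else:
            out.append(ch)                       # already full-width: keep
    return "".join(out)
```

After this step every character occupies one position in the string, so the matching steps can compare characters one by one without caring about their original width.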

[0003] (1) Overall description of the Japanese text analysis method based on the left longest-match method

In the Japanese text analysis method shown in FIG. 2, the following steps S1 to S7 are executed by, for example, a computer having a central processing unit (hereinafter "CPU") with arithmetic and control functions and a memory storing a text pointer, a character counter, a word dictionary, and the like. First, in S1, the text pointer p is set to the beginning of the horizontally written input Japanese text to be analyzed. In S2, the word dictionary is searched for words whose notation (that is, headword) matches the input text starting at pointer p and which satisfy the connection condition; all words found become word candidates (this dictionary search by notation matching is described in detail later). In S3, it is checked whether any word candidate was obtained; if so, the candidate with the longest notation is selected (S4). If only one candidate was obtained, it is selected as is. If no candidate was obtained in S2, the method backtracks. Backtracking means returning pointer p to the beginning of the immediately preceding word and selecting the next candidate for that word (the word with the second-longest notation) (S7). Pointer p is then advanced by the length of the selected word (S5). If pointer p has reached the end of the input text, the analysis ends; otherwise, the process returns to S2 to analyze the next word (S6). Through this procedure, the input text is divided into words from the beginning to the end of the sentence.
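Steps S1 to S7 can be sketched in Python roughly as follows. This is a simplified illustration, not the patented implementation: the connection condition is assumed to always hold, the dictionary is a plain list, raising an error when backtracking is exhausted is an assumption (the text does not say what happens then), and the names `candidates_at` and `segment` are hypothetical.

```python
def candidates_at(text, p, dictionary):
    """S2: dictionary words that match text at position p, longest first."""
    found = [w for w in dictionary if w and text.startswith(w, p)]
    return sorted(found, key=len, reverse=True)


def segment(text, dictionary):
    """Split text into words by left longest match with backtracking."""
    words = []   # words selected so far
    stack = []   # per selected word: (start position, unused shorter candidates)
    p = 0                                            # S1
    while p < len(text):                             # S6
        cands = candidates_at(text, p, dictionary)   # S2
        while not cands:                             # S3: nothing found
            if not stack:
                raise ValueError("text cannot be segmented with this dictionary")
            p, cands = stack.pop()                   # S7: back to previous word
            words.pop()
        word = cands[0]                              # S4: longest candidate
        stack.append((p, cands[1:]))
        words.append(word)
        p += len(word)                               # S5
    return words
```

With the illustrative dictionary `["東京", "東京都", "都庁"]`, segmenting "東京都庁" first tries "東京都", finds nothing for the remaining "庁", backtracks to the shorter "東京", and then matches "都庁".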

[0004] (2) Description of the dictionary search by notation matching (S2) in FIG. 2

In the dictionary search by notation matching S2 of FIG. 2, the following steps S10 to S17 are executed, for example by a computer, as shown in FIG. 3. Here, p[n] denotes the n-th character of the input Japanese text counted from the character pointed to by pointer p, and w[n] denotes the n-th character of the notation (headword) of a word w obtained from the word dictionary. The index n is zero-origin: p[0] is the character pointed to by pointer p, and w[0] is the first character of the notation (headword) of word w. First, one word w whose notation (headword) begins with character p[0] is taken from the word dictionary (S10), and the character counter n is set to 0 (S11). Then, in S12, character w[n] is compared with character p[n]. If they differ, the notation did not match, and the process proceeds to S17. If they are the same, the character counter n is incremented by one (S13), and the value of n is compared with the length (number of characters) of the notation (headword) of word w (S14).

[0005] If the comparison shows that they are not equal, the process returns to S12 to match the next character. If they are equal, notation matching has been completed to the end of word w, so the connection condition of this word is checked next (S15). The connection condition is whether the word can occur at the beginning of a sentence when it is sentence-initial, or whether it can connect grammatically to the immediately preceding word when it is mid-sentence. If the connection condition is satisfied, word w is made one of the word candidates (S16), and the process proceeds to S17. If it is not satisfied, the process proceeds directly to S17. In S17, it is checked whether the word dictionary contains any further word, besides the current word w, whose notation (headword) begins with character p[0]. If so, the process returns to S10, takes that word from the dictionary, and attempts the same notation matching with it as the new word w. If not, the processing of S2 ends. In this way, zero or more word candidates whose notation (headword) matches the input text and which satisfy the connection condition are obtained from the dictionary.
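The conventional search of steps S10 to S17 might look like the following Python sketch. The function name `lookup_exact` and the `connectable` callback are hypothetical; the callback stands in for the connection condition, which the text describes only in general terms, and the bounds check for running off the end of the text is an added safeguard.

```python
def lookup_exact(text, p, dictionary, connectable=lambda w: True):
    """Return candidates whose headword matches text at p exactly (S10 to S17)."""
    candidates = []
    for w in dictionary:                  # S10 and S17: next unexamined entry
        if not w or w[0] != text[p]:      # only headwords starting with p[0]
            continue
        n = 0                             # S11
        while n < len(w):                 # S14: stop once n equals the length
            # S12: compare w[n] with p[n]; running off the text is a mismatch
            if p + n >= len(text) or w[n] != text[p + n]:
                break                     # mismatch: go on to S17
            n += 1                        # S13
        else:
            if connectable(w):            # S15
                candidates.append(w)      # S16
    return candidates
```

The `else` branch of the `while` loop runs only when every character of the headword matched, which corresponds to reaching S15 from S14.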

[0006]

Problems to Be Solved by the Invention

However, conventional Japanese text analysis methods have the following problems (a) to (c), which have been difficult to solve.

(a) In words written in kana (for example, loanwords), a long vowel is normally written with the long-vowel mark "ー" (JIS kanji code: 213C). For example, the third character of "デパート" (department store) and the second character of "ノート" (notebook) are written with the long-vowel mark "ー". However, the minus sign "−" (JIS kanji code: 215D) is sometimes used in place of the long-vowel mark, giving "デパ−ト" or "ノ−ト". One reason is that the long-vowel mark and the minus sign cannot be told apart on the screen of a word processor, personal computer, or the like. A writer may also deliberately use the minus sign rather than the long-vowel mark out of personal visual preference. When a word is written with a minus sign in place of the long-vowel mark, the notation does not match in step S2 of FIG. 2, so no word candidate is obtained from the dictionary and the input Japanese text cannot be analyzed correctly.
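The mismatch in (a) is easy to reproduce. The short Python check below assumes that JIS 213C and JIS 215D correspond to the Unicode code points U+30FC and U+2212 respectively; the variable names are illustrative only.

```python
CHOON = "\u30fc"   # long-vowel mark, JIS 213C
MINUS = "\u2212"   # minus sign, JIS 215D

headword = "ノ" + CHOON + "ト"   # notation registered in the dictionary
typed = "ノ" + MINUS + "ト"      # same word typed with a minus sign

# The two strings look alike on screen but differ in their second character.
for label, s in (("headword", headword), ("typed", typed)):
    print(label, [f"U+{ord(ch):04X}" for ch in s])

print(typed == headword)   # False
```

Because exact comparison sees two different characters, a dictionary search by plain notation matching returns no candidate for the typed form.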

[0007] (b) One conceivable way to solve problem (a) is to additionally register in the dictionary the words written with a minus sign in place of the long-vowel mark, since this would remove the notation matching problem of (a). However, many words contain a long-vowel mark in their notation, so additional registration would bloat the word dictionary; this approach is not practical.

(c) Misuse or substitution of visually similar characters, as in (a), causes other problems as well. For example, as in "霞ヶ浦" and "霞ケ浦" (Kasumigaura), whether a word is written with the small "ヶ" (JIS kanji code: 2576) or the full-size "ケ" (JIS kanji code: 2531) is entirely up to the individual and cannot be predicted. Likewise, hiragana "へ", "べ", "ぺ" (JIS kanji codes: 2458 to 245A) and katakana "ヘ", "ベ", "ペ" (JIS kanji codes: 2558 to 255A) look very much alike, so they may well be misused or substituted for each other. One could register in the dictionary every conceivable word written with such similar characters, but, as noted above, additional registration raises the separate problem of a dramatic increase in dictionary size.

An object of the present invention is to solve these problems of the prior art and to provide a Japanese text analysis method that is not adversely affected by words written with visually similar characters.
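The dictionary growth described in (b) and (c) can be made concrete with a small enumeration. The sketch below is illustrative only: `SUBSTITUTES` and `spelling_variants` are hypothetical names, and the substitution list covers only some of the pairs named in the text.

```python
from itertools import product

# Canonical character -> spellings a writer might use for it.
SUBSTITUTES = {
    "\u30fc": ("\u30fc", "\u2212"),   # long-vowel mark or minus sign
    "\u30f6": ("\u30f6", "\u30b1"),   # small ke or full-size ke
    "\u3078": ("\u3078", "\u30d8"),   # hiragana he or katakana he
    "\u3079": ("\u3079", "\u30d9"),   # hiragana be or katakana be
    "\u307a": ("\u307a", "\u30da"),   # hiragana pe or katakana pe
}


def spelling_variants(headword):
    """All spellings of headword under the substitutions above."""
    choices = [SUBSTITUTES.get(ch, (ch,)) for ch in headword]
    return ["".join(combo) for combo in product(*choices)]
```

A headword with one substitutable character, such as "デパート", needs 2 entries, and one with k such characters needs 2^k, which is why registering every variant scales poorly.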

[0008]

Means for Solving the Problems

To solve the above problems, the invention according to claim 1 is a Japanese text analysis method that divides an arbitrary Japanese text into words from the beginning to the end of a sentence based on the longest-match method, in which a dictionary search by word notation matching retrieves the longest word that matches a headword (the notation in the word dictionary) and the Japanese text is divided from its beginning using the retrieved word. The method is provided with a table storing visually similar characters, and during word notation matching, similar characters detected by referring to the table are treated as the same character. In the invention according to claim 2, in the Japanese text analysis method of claim 1, the long-vowel mark "ー" and the minus sign "−" are treated as the same character during word notation matching. In the invention according to claim 3, in the Japanese text analysis method of claim 1, the small "ヶ" and "ヵ" and the full-size "ケ" and "カ" are respectively treated as the same characters during word notation matching.

[0009] In the invention according to claim 4, in the Japanese text analysis method of claim 1, hiragana "へ", "べ", "ぺ" and katakana "ヘ", "ベ", "ペ" are respectively treated as the same characters during word notation matching. With this configuration, when Japanese text is input, the table storing visually similar characters is consulted, the detected similar characters are treated as the same character during word notation matching, and the input Japanese text is divided into words from the beginning to the end of the sentence.

The invention according to claim 5 is a Japanese text analysis method that divides an arbitrary Japanese text into words from the beginning to the end of a sentence based on the longest-match method, in which a dictionary search by word notation matching retrieves the longest word that matches a headword (the notation in the word dictionary) and the Japanese text is divided from its beginning using the retrieved word. The method is provided with a table storing corresponding horizontal-writing and vertical-writing characters, and during word notation matching, corresponding characters detected by referring to the table are treated as the same character. In the invention according to claim 6, in the Japanese text analysis method of claim 5, the long-vowel mark "ー" and the vertical-line symbol "｜" are treated as the same character during word notation matching. With this configuration, when Japanese text is input, the table storing corresponding horizontal-writing and vertical-writing characters is consulted, the detected corresponding characters are treated as the same character during word notation matching, and the input Japanese text is divided into words from the beginning to the end of the sentence.

[0010]

DESCRIPTION OF THE PREFERRED EMBODIMENTS

FIG. 1 is a flowchart of the dictionary search by notation matching performed in a Japanese text analysis method based on the left longest-match method according to an embodiment of the present invention, and FIG. 4 shows a general example of the similar-character table (visually similar characters) used in the processing of FIG. 1. With reference to FIGS. 1 and 4, (1) gives an overall description of the Japanese text analysis method of this embodiment, and (2) describes specific examples of the method.

[0011] (1) Overall description of the Japanese text analysis method based on the left longest-match method

The Japanese text analysis method of this embodiment is executed according to the flowchart of FIG. 2, as in the conventional method, using for example a computer having a CPU with arithmetic and control functions and a memory storing a character counter, a word dictionary, a similar-character table describing visually similar characters as shown in FIG. 4, and the like. The feature of this embodiment is that, in step S2 of FIG. 2, the dictionary search by notation matching shown in FIG. 1 is performed in place of the conventional processing of FIG. 3. This search refers to a similar-character table prepared in advance, such as that of FIG. 4. A, B, C, D, E, and F in FIG. 4 are table identifiers. Because the feature of this embodiment lies in the dictionary search by notation matching in step S2 of FIG. 2, the processing that is the same as in the conventional method is not described again; the dictionary search by notation matching of FIG. 1 is described below.
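A similar-character table with identifiers A to F could be represented as below. This is a guess at the layout of FIG. 4, which is not reproduced in the text: only the placement of "ー" and "−" in table A is confirmed by the worked example, while the assignment of the other pairs (taken from claims 3 and 4) to B through F is an assumption, as are the names `SIMILAR_TABLES` and `find_table`.

```python
# Table identifier -> set of characters treated as the same character.
SIMILAR_TABLES = {
    "A": {"\u30fc", "\u2212"},   # long-vowel mark / minus sign (per the text)
    "B": {"\u30f6", "\u30b1"},   # small ke / full-size ke      (assumed slot)
    "C": {"\u30f5", "\u30ab"},   # small ka / full-size ka      (assumed slot)
    "D": {"\u3078", "\u30d8"},   # hiragana he / katakana he    (assumed slot)
    "E": {"\u3079", "\u30d9"},   # hiragana be / katakana be    (assumed slot)
    "F": {"\u307a", "\u30da"},   # hiragana pe / katakana pe    (assumed slot)
}


def find_table(a, b):
    """Return the identifier of the table holding both a and b, or None."""
    for ident, chars in SIMILAR_TABLES.items():
        if a in chars and b in chars:
            return ident
    return None
```

Keeping each group as its own table means two characters count as similar only when they share a table, so "−" is never confused with, say, "ケ".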

[0012] In the dictionary search of FIG. 1, steps S20 to S28 are executed. As in FIG. 3, p[n] denotes the n-th character of the horizontally written input Japanese text counted from the character pointed to by pointer p, and w[n] denotes the n-th character of the notation (headword) of a word w obtained from the word dictionary. The index n is zero-origin: p[0] is the character pointed to by pointer p, and w[0] is the first character of the notation (headword) of word w. First, one word w whose notation (headword) begins with character p[0] is taken from the word dictionary (S20), and the character counter n is set to 0 (S21). Then, in S22, character w[n] is compared with character p[n]. If they differ, the process proceeds to S28 to check whether they are similar characters. If they are the same, the character counter n is incremented by one (S23), and the value of n is compared with the length (number of characters) of the notation (headword) of word w (S24). If they are not equal, the process returns to S22 to match the next character. If they are equal, notation matching has been completed to the end of word w, so the connection condition of this word is checked next (S25). The connection condition is whether the word can occur at the beginning of a sentence when it is sentence-initial, or whether it can connect grammatically to the immediately preceding word when it is mid-sentence.

[0013] If the connection condition is satisfied, word w is made one of the word candidates (S26), and the process proceeds to S27. If it is not satisfied, the process proceeds directly to S27. In S27, it is checked whether the word dictionary contains any further word, besides the current word w, whose notation (headword) begins with character p[0]. If so, the process returns to S20, takes that word from the dictionary, and attempts the same notation matching with it as the new word w. If not, the dictionary search ends. In S28, it is checked whether characters w[n] and p[n] are in the similar-character table shown in FIG. 4. If they are, w[n] and p[n] are judged to be the same character, the notation is regarded as matched, and the process proceeds to S23. If they are not, the notation did not match, and the process proceeds to S27 as in the conventional method. Through this processing, when zero or more word candidates whose notation (headword) matches the input Japanese text and which satisfy the connection condition are taken from the dictionary, even a word written with similar characters is correctly obtained as a candidate.
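Steps S20 to S28 can be sketched in Python as follows. It is a sketch under stated assumptions rather than the patented implementation: the table contents beyond "ー" and "−" are taken from the pairs named in the claims, the `connectable` callback is a stand-in for the connection condition, and the names `SIMILAR_GROUPS`, `in_similar_table`, and `lookup_similar` are hypothetical.

```python
CHOON = "\u30fc"   # long-vowel mark
MINUS = "\u2212"   # minus sign

# Each set holds characters to be treated as the same (modeled on FIG. 4).
SIMILAR_GROUPS = [
    {CHOON, MINUS},
    {"\u30f6", "\u30b1"},   # small ke / full-size ke
    {"\u30f5", "\u30ab"},   # small ka / full-size ka
    {"\u3078", "\u30d8"},   # hiragana he / katakana he
    {"\u3079", "\u30d9"},   # hiragana be / katakana be
    {"\u307a", "\u30da"},   # hiragana pe / katakana pe
]


def in_similar_table(a, b):
    """S28: True if some table entry contains both characters."""
    return any(a in group and b in group for group in SIMILAR_GROUPS)


def lookup_similar(text, p, dictionary, connectable=lambda w: True):
    """Return candidates matching text at p, tolerating similar characters."""
    candidates = []
    for w in dictionary:                       # S20 and S27
        if not w or w[0] != text[p]:           # first character matched exactly
            continue
        n = 0                                  # S21
        while n < len(w):                      # S24
            if p + n >= len(text):
                break                          # text ended before the headword
            a, b = w[n], text[p + n]
            if a != b and not in_similar_table(a, b):   # S22, then S28
                break                          # no match: on to S27
            n += 1                             # S23
        else:
            if connectable(w):                 # S25
                candidates.append(w)           # S26
    return candidates
```

The only difference from exact matching is the extra table check on a mismatch, so the dictionary itself stays unchanged. As in S20, entries are still selected by an exact match on the first character.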

[0014] (2) Description of specific examples of the analysis method

A specific example of (1) is described below. Consider the case where the input Japanese text is "ノ−ト", written with the minus sign. In FIG. 2, pointer p first points to "ノ" (S1), and then the dictionary search by notation matching is performed (S2). In S2, the series of steps shown in FIG. 1 is performed. First, in S20, one word w whose notation (headword) begins with character p[0], that is "ノ", is taken from the dictionary. Suppose that "ノルマ" (quota) is taken. In S21 the character counter n is set to 0. Comparing p[n] and w[n] in S22, both are "ノ" and therefore the same, so the character counter n is incremented by one (S23). The value of n is now 1, which is not equal to 3, the length (number of characters) of the notation "ノルマ" of word w (S24), so the process returns to S22.

[0015] Comparing p[n] and w[n] in S22, the former is "−" and the latter is "ル"; they differ, so the process proceeds to S28. In S28, it is checked whether p[n], that is "−", and w[n], that is "ル", are in the similar-character table of FIG. 4; they are not, so the process proceeds to S27. In S27, it is checked whether any word other than "ノルマ" whose notation (headword) begins with p[0] is registered in the dictionary. "ノート" is registered, so the process returns to S20 and "ノート" is taken from the dictionary as the new word w. In S21 the character counter n is set to 0. Comparing p[n] and w[n] in S22, both are "ノ" and therefore the same, so the character counter n is incremented by one (S23). The value of n is now 1, which is not equal to 3, the length (number of characters) of the notation "ノート" of word w (S24), so the process returns to S22.

[0016] Comparing p[n] and w[n] in S22, the former is the minus sign "−" and the latter is the long-vowel mark "ー"; they differ, so the process proceeds to S28. In S28, it is checked whether p[n], the minus sign "−", and w[n], the long-vowel mark "ー", are in the similar-character table of FIG. 4. They are in table A, so the process proceeds to S23 and the character counter n is incremented by one. The value of n is now 2, which is not equal to 3, the length (number of characters) of the notation "ノート" of word w (S24), so the process returns to S22. Comparing p[n] and w[n] in S22, both are "ト" and therefore the same, so the character counter n is incremented by one (S23). The value of n is now 3, which equals the length (number of characters) 3 of the notation "ノート" of word w, so the process proceeds to S25. In S25 the connection condition is checked. Word w is the noun "ノート", which can occur at the beginning of a sentence, so the connection condition is satisfied and the word "ノート" is adopted as one of the word candidates in S26. Then, in S27, it is checked whether any word other than "ノルマ" and "ノート" whose notation (headword) begins with p[0] is registered in the dictionary. "ノー" (no) is registered, so the process returns to S20 and "ノー" is taken from the dictionary as the new word w.

[0017] In S21 the character counter n is set to 0. Comparing p[n] and w[n] in S22, both are "ノ" and therefore the same, so the character counter n is incremented by one (S23). The value of n is now 1, which is not equal to 2, the length (number of characters) of the notation "ノー" of word w (S24), so the process returns to S22. Comparing p[n] and w[n] in S22, the former is the minus sign "−" and the latter is the long-vowel mark "ー"; they differ, so the process proceeds to S28. In S28, it is checked whether p[n], the minus sign "−", and w[n], the long-vowel mark "ー", are in the similar-character table of FIG. 4. They are in table A, so the process proceeds to S23 and the character counter n is incremented by one. The value of n is now 2, which equals the length (number of characters) 2 of the notation "ノー" of word w, so the process proceeds to S25. In S25 the connection condition is checked. Word w is the interjection "ノー", which can occur at the beginning of a sentence, so the connection condition is satisfied and the word "ノー" is adopted as one of the word candidates in S26. Then, in S27, it is checked whether any word other than "ノルマ", "ノート", and "ノー" whose notation (headword) begins with p[0] is registered in the dictionary. No further word is registered, so the dictionary search of FIG. 1 (S2 of FIG. 2) ends.
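The lookup walked through in paragraphs [0014] to [0017] can be replayed with a small tracing function. The Python sketch below is illustrative: it hard-codes only table A, assumes the connection condition always passes, and uses the hypothetical names `TABLE_A` and `trace_lookup`.

```python
CHOON, MINUS = "\u30fc", "\u2212"      # long-vowel mark (JIS 213C), minus sign (JIS 215D)
TABLE_A = {CHOON, MINUS}               # the only table this example needs


def trace_lookup(text, dictionary):
    """Run the FIG. 1 matching at the start of text; return (candidates, log)."""
    candidates, log = [], []
    for w in dictionary:
        if w[0] != text[0]:                          # S20: same first character
            continue
        for n, a in enumerate(w):
            b = text[n] if n < len(text) else ""
            if a == b:
                log.append((w, n, "S22: same"))
            elif a in TABLE_A and b in TABLE_A:
                log.append((w, n, "S28: in table A"))
            else:
                log.append((w, n, "S28: not in table"))
                break
        else:
            candidates.append(w)                     # S26 (S25 assumed to pass)
    return candidates, log


typed = "ノ" + MINUS + "ト"
dictionary = ["ノルマ", "ノ" + CHOON + "ト", "ノ" + CHOON]
candidates, log = trace_lookup(typed, dictionary)
for w, n, outcome in log:
    print(w, n, outcome)
best = max(candidates, key=len)                      # S4 of FIG. 2
```

The log records one entry per character comparison: "ノルマ" is rejected at its second character, while "ノート" and "ノー" both pass through table A and become candidates, the longer of which is then chosen.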

【００１８】[0018]

In this way, two word candidates, "note" (ノート) and "no" (ノー), are obtained from the dictionary, so the process proceeds from S3 to S4 in FIG. 2. In S4, the word with the longest notation, namely "note" (ノート), is selected from the two candidates. In S5, the pointer p is advanced by the length (= 3) of the selected word. Since p has then reached the end of the input text "ノ−ト", the text analysis process of FIG. 2 ends (S6). Thus the input text "ノ−ト", written with the minus sign character, is correctly analyzed using the word "ノート", written with the long-vowel character. As described above, this embodiment retains the advantages of the longest match method, namely fast processing and low memory use, while preventing analysis errors in Japanese text caused by the misuse or substitution of visually similar characters. The present invention is not limited to the above embodiment, and various modifications and modes of use are possible. Examples include the following (i) and (ii).
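
The outer loop described here (S1 to S6 of FIG. 2) might look roughly like the sketch below. It is an illustration under stated assumptions: `segment` and `exact_prefix_search` are hypothetical names, the default search does plain exact matching, and the backtracking of S7 is not modeled, so the sketch simply raises an error when no candidate is found.

```python
from typing import Callable, List

SearchFn = Callable[[str, int, List[str]], List[str]]


def exact_prefix_search(text: str, pos: int, dictionary: List[str]) -> List[str]:
    """Plain notation matching: words that literally occur at text[pos:]."""
    return [w for w in dictionary if w and text.startswith(w, pos)]


def segment(text: str, dictionary: List[str],
            search: SearchFn = exact_prefix_search) -> List[str]:
    """Left longest match: split text into dictionary words from the head."""
    words = []
    p = 0  # S1: initialize the text pointer
    while p < len(text):  # S6: stop once the pointer reaches the end
        candidates = search(text, p, dictionary)  # S2: dictionary search
        if not candidates:  # S3: was any candidate obtained?
            # S7 (backtracking) is not sketched here; this is an assumption.
            raise ValueError(f"no dictionary word matches at position {p}")
        longest = max(candidates, key=len)  # S4: pick the longest notation
        words.append(longest)
        p += len(longest)  # S5: advance the pointer by the word length
    return words


NOTE = "ノ\u30fcト"  # "note", written with the long-vowel character
NO = "ノ\u30fc"      # "no"
result = segment(NOTE + NO, [NO, NOTE])
```

Here `result` is the list of the two words "note" and "no", in that order. Passing a different `search` function, for example one that consults a similar character table, changes only S2 and leaves the selection and pointer update untouched, which is why the method keeps the cost profile of ordinary longest matching.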

【００１９】[0019]

(i) FIG. 5 shows a special example of the similar character table of FIG. 1 (for the case where the input text is written vertically). The above embodiment describes the case where the input Japanese text is written horizontally. When the input text was created as a vertically written document, however, the vertical line symbol character "｜" (JIS kanji code: 2143) is sometimes used in place of the long-vowel character "ー" (JIS kanji code: 213C). In this case too, the analysis method of the above embodiment solves the problem that the text cannot be analyzed correctly because the notations do not match. That is, although the long-vowel character "ー" (JIS kanji code: 213C) and the vertical line symbol character "｜" (JIS kanji code: 2143) are not visually similar, the problem can be solved by adding the similar character table of FIG. 5, as a special example, to the similar character table of FIG. 4. Furthermore, adding new table data to the tables of FIGS. 4 and 5 makes it possible to further improve the analysis accuracy for arbitrary Japanese text.

(ii) The processing of FIGS. 1 and 2 can be changed to processing other than that illustrated. The device that performs this processing is not limited to a computer; the processing can also be executed by a device of another configuration.
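
One way to picture the addition of the vertical-writing table to the general table is the sketch below. The names `GENERAL_TABLE`, `VERTICAL_TABLE`, `build_lookup`, and `same_for_matching` are hypothetical, and the table contents are limited to the two pairs mentioned in the text; the actual contents of FIGS. 4 and 5 are not reproduced.

```python
CHOON = "\u30fc"  # long-vowel character (JIS kanji code 213C)
MINUS = "\u2212"  # minus sign character
VBAR = "\uff5c"   # vertical line symbol character (JIS kanji code 2143)

# Assumed stand-ins for the two tables: each set is one group of characters
# that may replace one another during notation matching.
GENERAL_TABLE = [{CHOON, MINUS}]   # visually similar characters
VERTICAL_TABLE = [{CHOON, VBAR}]   # horizontal/vertical counterparts


def build_lookup(*tables):
    """Merge any number of tables into a character -> partners mapping."""
    lookup = {}
    for table in tables:
        for group in table:
            for ch in group:
                lookup.setdefault(ch, set()).update(group - {ch})
    return lookup


def same_for_matching(a: str, b: str, lookup) -> bool:
    """True if a and b are identical or registered as partners."""
    return a == b or b in lookup.get(a, set())


horizontal_only = build_lookup(GENERAL_TABLE)
with_vertical = build_lookup(GENERAL_TABLE, VERTICAL_TABLE)
```

In this sketch the vertical line matches the long-vowel character only when `VERTICAL_TABLE` is merged in. The relation is deliberately not transitive: the minus sign and the vertical line are each partners of the long-vowel character but not of each other, which is a design choice of the sketch rather than something the text specifies.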

【００２０】[0020]

【発明の効果】[Effects of the Invention]

As described in detail above, according to the inventions of claims 1 to 4, in a Japanese text analysis method based on the longest match method, a table storing visually similar characters is referred to during word notation matching, and detected similar characters are treated as the same character. This retains the advantages of the longest match method, namely fast processing and low memory use, while preventing analysis errors in Japanese text caused by the misuse or substitution of visually similar characters. According to the inventions of claims 5 and 6, in a Japanese text analysis method based on the longest match method, a table storing corresponding horizontal-writing and vertical-writing characters is referred to during word notation matching, and detected corresponding characters are treated as the same character. Therefore, whether the input Japanese text is a horizontally written document or a vertically written document, analysis errors in the Japanese text can be reliably prevented, as in the invention of claim 1. Accordingly, the Japanese text analysis method of the present invention can be incorporated into various devices and systems that require text analysis processing, such as Japanese text-to-speech conversion devices and machine translation devices.

【図面の簡単な説明】[Brief description of the drawings]

FIG. 1 is a flowchart of dictionary search processing by notation matching according to an embodiment of the present invention.

FIG. 2 is a flowchart of a conventional, general Japanese text analysis method based on the left longest match method.

FIG. 3 is a flowchart of the conventional dictionary search processing by notation matching in FIG. 2.

FIG. 4 shows a general example of the similar character table of FIG. 1 (visually similar characters).

FIG. 5 shows a special example of the similar character table of FIG. 1 (when the input text is written vertically).

[Explanation of symbols]

S1 Text pointer initialization
S2 Dictionary search by notation matching
S3 Check of whether any word candidate was obtained
S4 Selection of the longest word
S5 Text pointer update
S6 Determination of whether the text pointer has reached the end of the sentence
S7 Backtracking
S20 Retrieval from the dictionary of a word w whose notation starts with the character p[0]
S21 Initialization of the character counter n
S22 Comparison of whether the characters w[n] and p[n] are the same
S23 Update of the character counter n
S24 Comparison of whether the character counter n equals the length of the word w
S25 Check of whether the word w satisfies the connection condition
S26 Addition of the word w to the word candidates
S27 Search for whether another word whose notation starts with p[0] remains in the dictionary
S28 Determination, with reference to the similar character table, of whether the characters w[n] and p[n] are similar characters

Claims

[Claims]

1. A Japanese text analysis method that divides an arbitrary Japanese text into words from the beginning to the end of a sentence based on a longest match method, in which the longest word matching a headword, which is the notation in a word dictionary, is retrieved by dictionary search processing using word notation matching, and the Japanese text is divided from the beginning using the retrieved word, the method comprising: providing a table in which visually similar characters are stored; and, when performing the notation matching of words, referring to the table and performing the notation matching with detected similar characters treated as the same character.

2. The Japanese text analysis method according to claim 1, wherein, when performing the notation matching of words, the notation matching is performed with the long-vowel character "ー" and the minus sign character "−" treated as the same character.

3. The Japanese text analysis method according to claim 1, wherein, when performing the notation matching of words, the notation matching is performed with the small characters "ヶ" and "ヵ" and the full-size characters "ケ" and "カ" treated as the same characters.

4. The Japanese text analysis method according to claim 1, wherein, when performing the notation matching of words, the notation matching is performed with the hiragana "へ", "べ", and "ぺ" and the katakana "ヘ", "ベ", and "ペ" treated as the same characters.

5. A Japanese text analysis method that divides an arbitrary Japanese text into words from the beginning to the end of a sentence based on a longest match method, in which the longest word matching a headword, which is the notation in a word dictionary, is retrieved by dictionary search processing using word notation matching, and the Japanese text is divided from the beginning using the retrieved word, the method comprising: providing a table in which corresponding horizontal-writing and vertical-writing characters are stored; and, when performing the notation matching of words, referring to the table and performing the notation matching with detected corresponding characters treated as the same character.

6. The Japanese text analysis method according to claim 5, wherein, when performing the notation matching of words, the notation matching is performed with the long-vowel character "ー" and the vertical line symbol character "｜" treated as the same character.