JPH05274351A

JPH05274351A - Keyword extraction system

Info

Publication number: JPH05274351A
Application number: JP4100509A
Authority: JP
Inventors: Reiko Bessho; 礼子別所
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1992-03-25
Filing date: 1992-03-25
Publication date: 1993-10-22

Abstract

(57) [Summary]

[Purpose] To realize keyword extraction that is independent of character type and requires neither a keyword dictionary nor an unnecessary-word dictionary.

[Structure] A Japanese document input through input means 1 is divided into words by morphological analysis means 2, which assigns a part of speech to each word. For words of the target parts of speech, such as general nouns, proper nouns, and prefixes, the system checks whether one of the keyword features (compound word base, proper-noun constituent word, or prefix modification) has been assigned, and performs keyword extraction using this keyword-feature information.

Description

Detailed Description of the Invention

[0001]

TECHNICAL FIELD The present invention relates to a keyword extraction system, and more particularly to a keyword extraction system that automatically extracts keywords from a document.

[0002]

PRIOR ART A known document describing prior art related to the present invention is, for example, the "keyword extraction system" of Japanese Patent Laid-Open No. 3-8070. When extracting keywords from a document, that system considers only emphasized character strings, and among those strings it selects as keywords the ones judged to be independent words by means of a keyword dictionary. Its drawbacks are that only the emphasized character strings in a document are considered as keywords and that a separate keyword dictionary is required.

[0003] Further, the approach described in "Automatic keyword extraction in a newspaper article database" (Information Management, vol. 32, no. 4, July 1989) segments text into words at changes of character type, extracts nouns using a word table and the like, and then determines the keywords by deleting from the extracted nouns any terms listed in an unnecessary-word table. However, because words are delimited by changes of character type, keyword candidates cannot be extracted accurately, and an unnecessary-word dictionary is also required.

[0004] Further, the "automatic indexing system" proposed in Japanese Patent Laid-Open No. 2-148265 extracts index terms from a document by first performing morphological analysis to pick out nouns accompanied by case particles, then deleting unnecessary words from these nouns using an unnecessary-word dictionary, and taking what remains as the index terms. This system, however, still requires an unnecessary-word dictionary.

[0005]

[Purpose] The present invention has been made in view of the circumstances described above, and its object is to provide a keyword extraction system that is independent of character type and requires neither a keyword dictionary nor an unnecessary-word dictionary.

[0006]

[Structure] To achieve the above object, the present invention is characterized in that (1) it comprises input means for inputting a Japanese document, morphological analysis means for dividing the document input by the input means into words and assigning a part of speech to each word, and keyword extraction means for extracting keywords using the parts of speech assigned by the morphological analysis means, and it extracts keywords from the document by using the part-of-speech information obtained from the morphological analysis together with keyword-feature information; further, (2) it reduces unnecessary keyword candidates by also using the compound word base, one of the keyword features; further, (3) it reduces unnecessary keyword candidates by using the proper-noun constituent word, one of the keyword features; and further, (4) it extracts necessary prefixes as part of a keyword by using the prefix modification, one of the keyword features. The invention is described below on the basis of embodiments.

[0007] FIG. 1 is a block diagram illustrating an embodiment of the keyword extraction system according to the present invention, in which 1 is the input means, 2 is the morphological analysis means, and 3 is the keyword extraction means. A Japanese document is input through input means 1; morphological analysis means 2 divides the input document into words and assigns a part of speech to each word. Keyword extraction means 3 then extracts keywords using those parts of speech.

[0008] FIG. 2 is a flowchart illustrating the operation of the keyword extraction system according to the present invention. The steps are described in order below. The text is assumed to have been divided into words beforehand by morphological analysis. First, the leading word is input (step 1), and the counter is set to 0 (step 2). Next, it is determined whether the word is an unregistered word or a proper noun (step 4); if so, 1 is added to the counter (step 5). If not, processing returns to the point where a word is input (step 1). If step 3 finds the word to be a general noun, it is then checked whether the general noun has been assigned either of the keyword features "compound word base" or "proper-noun constituent word" (step 6). If neither has been assigned, 1 is added to the counter (step 5), just as for an unregistered word or proper noun. In short, 1 is added to the counter when the first input word is a general noun with no keyword feature, a proper noun, or an unregistered word.
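The rule for the first word can be restated as a small function. The following is a minimal sketch, not the patented implementation: the part-of-speech and feature labels (`proper_noun`, `compound_word_base`, and so on) and the function name `step5_bonus` are hypothetical names chosen for illustration.

```python
from typing import Optional

# Hypothetical labels for this sketch; the source does not name them in code.
STANDALONE_POS = {"proper_noun", "unregistered"}
NOUN_FEATURES = {"compound_word_base", "proper_noun_constituent"}


def step5_bonus(pos: str, feature: Optional[str]) -> int:
    """Return how much step 5 adds to the counter for the first word."""
    if pos in STANDALONE_POS:
        # Step 4: unregistered words and proper nouns always earn the bonus.
        return 1
    if pos == "general_noun" and feature not in NOUN_FEATURES:
        # Steps 3 and 6: a general noun earns it only without a keyword feature.
        return 1
    return 0
```

A word that earns this bonus starts with a higher count than one that does not, which is what later allows it to pass the "greater than 1" test on its own.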

[0009] For every word that has survived to this point, 1 is added to the counter (step 7), the word is pushed onto the stack (step 8), and the next word is input (step 9). It is then determined whether the newly input word is a proper noun, a general noun, a sahen noun (a noun that forms a verb with する), a prefix (one with the keyword feature "prefix modification"), a suffix, or an unregistered word (step 10). If it is, the process of adding 1 to the counter and pushing the word onto the stack is repeated (steps 7, 8, and 9); if it is not, the part-of-speech checks end here.

[0010] At this point it is decided whether to extract the word or words saved on the stack, based on whether the counter is greater than 1 (step 11). If it is, the contents of the stack are extracted as a compound word (step 12); otherwise the stack is cleared (step 13) and processing returns to the beginning. In short, a general noun with no keyword feature, a proper noun, or an unregistered word (a word for which 1 was added in step 5) is extracted as a compound word even when it stands alone, whereas general nouns with a keyword feature, sahen nouns, prefixes, and suffixes are extracted as a compound word only when two or more words of these parts of speech occur in succession.
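The counter-and-stack loop described above can be sketched as follows. This is an illustrative reading, not the patent's own code: the `Token` type, the label strings, and the function names are hypothetical. It also assumes that any keyword candidate may start a run, which is inferred from the counter values quoted in the worked examples rather than stated in the step description.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    surface: str
    pos: str                       # hypothetical part-of-speech label
    feature: Optional[str] = None  # hypothetical keyword-feature label


CANDIDATE_POS = {"proper_noun", "general_noun", "sahen_noun",
                 "suffix", "unregistered"}


def is_candidate(tok: Token) -> bool:
    """Step 10: may this word join the current run?"""
    if tok.pos == "prefix":
        return tok.feature == "prefix_modification"
    return tok.pos in CANDIDATE_POS


def can_stand_alone(tok: Token) -> bool:
    """Steps 4 and 6: words that earn the extra count of step 5."""
    if tok.pos in ("proper_noun", "unregistered"):
        return True
    return tok.pos == "general_noun" and tok.feature is None


def extract_keywords(tokens: List[Token]) -> List[str]:
    keywords: List[str] = []
    stack: List[str] = []
    counter = 0
    # A sentinel non-candidate token flushes a run that ends the input.
    for tok in tokens + [Token("", "end")]:
        if is_candidate(tok):
            if not stack and can_stand_alone(tok):
                counter += 1                 # step 5
            counter += 1                     # step 7
            stack.append(tok.surface)        # step 8
            continue
        if counter > 1:                      # step 11
            keywords.append("".join(stack))  # step 12
        stack.clear()                        # step 13
        counter = 0                          # step 2 for the next run
    return keywords


example_2 = [
    Token("ここ", "other"), Token("で", "other"),
    Token("レーザー", "general_noun"),
    Token("測量", "sahen_noun"),
    Token("装置", "general_noun", "compound_word_base"),
    Token("の", "other"),
    Token("開発", "sahen_noun"),
    Token("を", "other"), Token("行な", "other"), Token("う", "other"),
    Token("予定", "sahen_noun"),
    Token("だ", "other"),
]
print(extract_keywords(example_2))  # ['レーザー測量装置']
```

Resetting the counter whenever a non-candidate word is met keeps each run independent, so an isolated sahen noun such as 開発 ends with a count of 1 and is dropped.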

[0011] The keyword features used in the explanation above are now described. There are three kinds of keyword feature: compound word base, proper-noun constituent word, and prefix modification. Table 1 below summarizes the parts of speech to which each feature can be assigned, together with its characteristics and role.

[0012]

[Table 1]

[0013] Next, specific embodiments of the present invention are described using examples.

Example 1

この工場で装置を開発する予定だ。 (A device is to be developed at this factory.)

Morphological analysis of this sentence yields:

この／工場／で／装置／を／開発／する／予定／だ／

The words that fall under the parts of speech listed above (that is, proper nouns, general nouns, sahen nouns, prefixes with the keyword feature "prefix modification", suffixes, and unregistered words; these words are hereinafter called keyword candidates) are as follows.

工場 (general noun, with feature "compound word base")
装置 (general noun, with feature "compound word base")
開発 (sahen noun)
予定 (sahen noun)

According to the flowchart above, sahen nouns and general nouns with a feature are both keyword candidates, so the counter becomes n = 1 for each of them; however, each occurs only in isolation. Since n never exceeds 1, no keyword is extracted from this sentence.

[0014] Example 2

ここでレーザー測量装置の開発を行なう予定だ。 (The development of a laser surveying device is planned here.)

Morphological analysis of this sentence yields:

ここ／で／レーザー／測量／装置／の／開発／を／行な／う／予定／だ／

The keyword candidates are as follows.

レーザー (general noun, no keyword feature)
測量 (sahen noun)
装置 (general noun, with feature "compound word base")
開発 (sahen noun)
予定 (sahen noun)

According to the flowchart above, レーザー ("laser") has no keyword feature, so it could become a keyword even if it occurred alone. In this example, however, it is followed by 測量 ("surveying") and 装置 ("device"). Neither of these two words can be a keyword on its own, but because they follow レーザー without a break, the words are pushed onto the stack one after another. The counter ends at n = 4, that is, n > 1, and レーザー測量装置 ("laser surveying device") is extracted from this sentence as a keyword. 開発 ("development") and 予定 ("plan") each occur only in isolation, so neither is extracted as a keyword.

[0015] Example 3

ここで高画質な大画面テレビの開発が始まった。 (Development of a high-picture-quality, large-screen television has started here.)

Morphological analysis of this sentence yields:

ここ／で／高／画質／な／大／画面／テレビ／の／開発／が／始まっ／た／

The keyword candidates are as follows.

高 (prefix, with feature "prefix modification")
画質 (general noun, with feature "compound word base")
大 (prefix, with feature "prefix modification")
画面 (general noun, with feature "compound word base")
テレビ (general noun, with feature "compound word base")

According to the flowchart above, 高 ("high") and 大 ("large") are prefixes, but because they carry the keyword feature "prefix modification", they are extracted together with the following word as a keyword when a keyword candidate follows. As a result, 高画質 ("high picture quality") has n = 2 and 大画面テレビ ("large-screen television") has n = 3. Both satisfy n > 1, so 高画質 and 大画面テレビ are extracted from this sentence as keywords. 開発 ("development") occurs only in isolation, so it is not extracted as a keyword.
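The counter values quoted for Example 3 can be checked with a run-based restatement of the flowchart: split the token sequence into maximal runs of keyword candidates, then count each run. This is a sketch under assumptions, since the label strings and function names are hypothetical and the "+1 for a word that can stand alone" bonus is this sketch's reading of step 5.

```python
from itertools import groupby

# (surface, part of speech, keyword feature) triples for Example 3.
# The label strings are hypothetical names chosen for this sketch.
EXAMPLE_3 = [
    ("ここ", "other", None), ("で", "other", None),
    ("高", "prefix", "prefix_modification"),
    ("画質", "general_noun", "compound_word_base"),
    ("な", "other", None),
    ("大", "prefix", "prefix_modification"),
    ("画面", "general_noun", "compound_word_base"),
    ("テレビ", "general_noun", "compound_word_base"),
    ("の", "other", None),
    ("開発", "sahen_noun", None),
    ("が", "other", None), ("始まっ", "other", None), ("た", "other", None),
]

CANDIDATE_POS = {"proper_noun", "general_noun", "sahen_noun",
                 "suffix", "unregistered"}


def is_candidate(token):
    """A prefix counts only when it carries the prefix-modification feature."""
    _, pos, feature = token
    if pos == "prefix":
        return feature == "prefix_modification"
    return pos in CANDIDATE_POS


def run_counts(tokens):
    """Return (joined surface, n) for each maximal run of keyword candidates."""
    results = []
    for candidate, group in groupby(tokens, key=is_candidate):
        if not candidate:
            continue
        run = list(group)
        _, pos, feature = run[0]
        # Extra count when the run starts with a word that can stand alone.
        bonus = 1 if (pos in {"proper_noun", "unregistered"}
                      or (pos == "general_noun" and feature is None)) else 0
        results.append(("".join(s for s, _, _ in run), len(run) + bonus))
    return results


counts = run_counts(EXAMPLE_3)
keywords = [surface for surface, n in counts if n > 1]
print(counts)    # [('高画質', 2), ('大画面テレビ', 3), ('開発', 1)]
print(keywords)  # ['高画質', '大画面テレビ']
```

Grouping by candidacy makes the role of the particle な visible: it is what separates 高画質 from 大画面テレビ into two runs.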

[0016]

[Effects] As is clear from the above description, the present invention has the following effects.

(1) Effect corresponding to claim 1: Because the Japanese document is morphologically analyzed and keywords are extracted using the resulting part-of-speech information, every character string can be considered as a keyword candidate, rather than only special characters (for example, emphasized characters). Moreover, because keyword features are used in addition to the part-of-speech information, keywords can be extracted accurately, with few unnecessary words and few necessary words dropped.

(2) Effect corresponding to claim 2: Using the compound word base, one of the keyword features, avoids as far as possible the extraction of unnecessary words that are unsuitable as keywords.

(3) Effect corresponding to claim 3: Using the proper-noun constituent word, one of the keyword features, avoids as far as possible the extraction of unnecessary words that are unsuitable as keywords.

(4) Effect corresponding to claim 4: Using the prefix modification, one of the keyword features, allows prefixes that are needed as part of a keyword to be extracted.

[Brief description of drawings]

FIG. 1 is a block diagram illustrating an embodiment of the keyword extraction system according to the present invention.

FIG. 2 is a flowchart illustrating the operation of the keyword extraction system according to the present invention.

[Explanation of symbols]

1: input means, 2: morphological analysis means, 3: keyword extraction means.

Claims

[Claims]

1. A keyword extraction system comprising: input means for inputting a Japanese document; morphological analysis means for dividing the document input by the input means into words and assigning a part of speech to each word; and keyword extraction means for extracting keywords using the parts of speech assigned by the morphological analysis means, wherein keywords are extracted from the document by using the part-of-speech information obtained from the analysis by the morphological analysis means together with keyword-feature information.

2. The keyword extraction system according to claim 1, wherein unnecessary keyword candidates are reduced by also using the compound word base, which is one of the keyword features.

3. The keyword extraction system according to claim 1, wherein unnecessary keyword candidates are reduced by using the proper-noun constituent word, which is one of the keyword features.

4. The keyword extraction system according to claim 1, wherein necessary prefixes are extracted as part of a keyword by using the prefix modification, which is one of the keyword features.