
JPH03131967A - Method for sorting japanese word - Google Patents

Method for sorting japanese word

Info

Publication number
JPH03131967A
JPH03131967A  JP1271156A  JP27115689A
Authority
JP
Japan
Prior art keywords
word
words
dependent
distance
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP1271156A
Other languages
Japanese (ja)
Inventor
Shiyouichi Sasabe
佐々部 昭一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ricoh Co Ltd
Original Assignee
Ricoh Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ricoh Co Ltd filed Critical Ricoh Co Ltd
Priority to JP1271156A priority Critical patent/JPH03131967A/en
Publication of JPH03131967A publication Critical patent/JPH03131967A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

PURPOSE: To classify words efficiently even when the word set is large, by clustering the words according to an inter-word distance that indicates the conceptual closeness of each pair of words.

CONSTITUTION: The dependency structure of the words is obtained by a dependency analysis unit 1, and the distinct dependency combinations together with the frequency of occurrence of each combination are extracted as statistical data by a statistical data extraction unit. After Quantification Type IV analysis of the dependent words or head words in the statistical data, the Euclidean distances between the numerical values assigned to the words are computed and stored. Using the computed results as a distance matrix, processing is carried out by a clustering unit 6 and a hierarchization unit 7, and the classification result is obtained from a word classification result output unit 8.

Description

DETAILED DESCRIPTION OF THE INVENTION

FIELD OF INDUSTRIAL APPLICATION

The present invention relates to a method for classifying Japanese words in the field of language processing.

PRIOR ART

In natural language processing, a word dictionary in which words are classified by semantic concept is essential for semantic analysis.

Conventionally, one way to classify words by concept has been the top-down approach, in which a human sets up the classification scheme on the basis of knowledge and experience. Because this approach depends on manual work, however, building and designing the classification is costly, and it is not easy to adapt the word classification to a new task.

To solve this problem, methods have been proposed that classify words automatically on the basis of data obtained by linguistic analysis of Japanese text. For example, in "Construction of a Dictionary for Dependency Analysis and Its Learning Function" (Transactions of the Information Processing Society of Japan, Vol. 26, No. 4) and in Japanese Patent Application Laid-Open No. 63-172372, the occurrences of the various dependency combinations of words are counted; from these data an inter-word distance (a dissimilarity), defined as below, is computed as a measure of the conceptual closeness of two words, and the words are classified by applying a suitable clustering method (the centroid method, the single-linkage method, and so on) to these distances.

Here the inter-word distance d(A, B) is defined as

d(A, B) = 1 - N(AB) / (N(A) + N(B))

where
N(A): number of occurrences of dependent word A
N(B): number of occurrences of dependent word B
N(AB): total number of times dependent words A and B modify the same head word through the same function word.

Alternatively, N(AB) is replaced by a weighted combination, with weighting coefficients W_j, of the counts
N_0(AB): total number of times A and B, with function words of the same role, modify the same head word
N_1(AB): total number of times A and B, with function words of the same role, modify head words belonging to the same synonym group
N_j(AB): total number of times A and B, with function words of the same role, modify head words belonging to the same similar-word group at level j.
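As a rough illustration of this prior-art distance, the following Python sketch computes d(A, B) from dependency co-occurrence counts. The data layout (a mapping from (dependent word, function word, head word) triples to counts), the function names, and the convention used for the shared count N(AB) are assumptions made for the example, not details fixed by the publications cited above.

```python
from collections import Counter

# Hypothetical corpus statistics: (dependent word, function word, head word) -> count.
triple_counts = Counter({
    ("犬", "が", "走る"): 3,
    ("猫", "が", "走る"): 2,
    ("犬", "が", "吠える"): 4,
})

def occurrences(word):
    """N(A): total occurrences of a dependent word over all triples."""
    return sum(c for (dep, _, _), c in triple_counts.items() if dep == word)

def shared_count(a, b):
    """N(AB): times A and B attach to the same head word through the same function word."""
    total = 0
    contexts = {(func, head) for (dep, func, head) in triple_counts if dep in (a, b)}
    for func, head in contexts:
        ca = triple_counts.get((a, func, head), 0)
        cb = triple_counts.get((b, func, head), 0)
        if ca and cb:
            total += min(ca, cb)  # one simple convention; the cited sources leave this detail open
    return total

def distance(a, b):
    """d(A, B) = 1 - N(AB) / (N(A) + N(B)); smaller means conceptually closer."""
    denom = occurrences(a) + occurrences(b)
    return 1.0 - shared_count(a, b) / denom if denom else 1.0

print(distance("犬", "猫"))  # 1 - 2/9 for the toy counts above
```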

Clustering proceeds by first merging word groups that are a small distance apart and then gradually merging groups that are farther apart. The hierarchization process therefore associates an upper bound on the inter-group distance with each of the presupposed hierarchy levels of synonymy and similarity, and the clusters generated under these bounds are arranged into the corresponding levels.
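A minimal sketch of this threshold-based hierarchization, assuming a precomputed condensed distance matrix and using SciPy's standard agglomerative clustering routines; the particular threshold values and variable names are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical condensed distance matrix for 4 words (pairs in upper-triangular order).
condensed = np.array([0.2, 0.7, 0.9, 0.6, 0.8, 0.3])

# Agglomerative merging, from small inter-group distances toward larger ones.
Z = linkage(condensed, method="single")  # single linkage; the centroid method is another option

# Assumed upper bounds on inter-group distance for each hierarchy level.
levels = {"synonym": 0.3, "similar": 0.7}
for name, bound in levels.items():
    labels = fcluster(Z, t=bound, criterion="distance")
    print(name, labels)  # cluster index assigned to each word at this level
```

Cutting the same merge tree at increasing thresholds is what arranges the clusters into nested synonym and similarity levels.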

Furthermore, this grouping process is executed repeatedly so that the number of grouped words increases gradually.

PROBLEMS TO BE SOLVED BY THE INVENTION

However, while such conventional methods work for a small set of words in a restricted domain, the classification accuracy drops and the proportion of words that end up grouped remains low once the word set becomes large.

MEANS FOR SOLVING THE PROBLEMS

In a Japanese word classification method in which words are classified into groups of semantic concepts using statistical data obtained by analyzing Japanese text with a dependency analysis unit, the statistical data of the words, consisting of the distinct dependency combinations between words and their frequencies, are subjected to Quantification Type IV analysis by a Quantification Type IV analysis unit, which assigns numerical values to the words; the words are then classified by clustering based on an inter-word distance, computed at least from these values, that indicates the conceptual closeness of the words.

OPERATION

Because the inter-word distance is defined using the numerical values given by Quantification Type IV analysis of word statistics consisting of the distinct dependency combinations between words and their frequencies, even a large set of words can be classified efficiently.

EMBODIMENTS

A first embodiment of the present invention will be described with reference to FIG. 1. FIG. 1 is a block diagram showing, in flowchart form, the processing units that carry out the word classification of this embodiment; apart from the Quantification Type IV analysis units, they function in the same way as in the conventional method.

First, example Japanese sentences are subjected to morphological analysis if necessary, and their dependency structure is obtained by the dependency analysis unit 1. Next, the statistical data extraction unit 2 extracts, as statistical data, the distinct dependency combinations, for example triples of the form (dependent word, function word, head word) or (dependent word, case type, head word), together with the frequency of occurrence of each combination.
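A sketch of what the extraction performed by unit 2 might look like, assuming the dependency analyzer returns one list of (dependent word, function word, head word) triples per sentence; that output format and the names used here are assumptions for illustration rather than the patent's specification.

```python
from collections import Counter

# Assumed output of the dependency analysis step: one list of triples per sentence.
analyzed_sentences = [
    [("犬", "が", "走る"), ("公園", "で", "走る")],
    [("猫", "が", "走る")],
]

def extract_statistics(sentences):
    """Count each distinct (dependent word, function word, head word) combination."""
    counts = Counter()
    for triples in sentences:
        counts.update(triples)
    return counts

triple_counts = extract_statistics(analyzed_sentences)
for triple, freq in triple_counts.items():
    print(triple, freq)
```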

Next, the dependent-word Quantification Type IV analysis unit 3 performs Quantification Type IV analysis on the dependent words in these statistical data. Quantification Type IV analysis is a method that assigns numerical values to objects on the basis of pairwise affinities e_ij and positions the objects in a Euclidean space. That is, if the two-dimensional values assigned to object i are written (x_i, y_i), then the (x_i, y_i) are determined so as to maximize the Quantification Type IV criterion computed from the affinities (see, for example, "Multivariate Statistical Analysis", Gendai-Sugaku-sha).

For the dependent words, for example, the affinity e_ij can be defined in terms of the following quantities and the analysis then carried out:

N(i): frequency of occurrence of dependent word i
N(j): frequency of occurrence of dependent word j
N(i, j): total frequency with which dependent words i and j, in the same type of dependency relation, modify the same head word.

The dimensionality of the values to be assigned may be chosen appropriately according to the magnitudes of the eigenvalues obtained from the analysis.

Similarly, for the head words, the affinity e_ij can be defined in terms of the following quantities and the analysis carried out in the same way:

N(i): frequency of occurrence of head word i
N(j): frequency of occurrence of head word j
N(i, j): total frequency with which head words i and j, in the same type of dependency relation, are modified by the same dependent word.

After the Quantification Type IV analysis of the dependent words or head words, the inter-word distance calculation unit 5 computes and stores, for every pair of words, the Euclidean distance between the values assigned to the words. In the two-dimensional case this is

d(i, j) = ((x_i - x_j)^2 + (y_i - y_j)^2)^(1/2).

Using the computed results as a distance matrix, the clustering unit 6 and the hierarchization unit 7 perform their processing as in the conventional method, and the classification result is obtained from the word classification result output unit 8.
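Putting the pieces together, the following sketch computes the pairwise Euclidean distances from the assigned coordinates and feeds them to the same agglomerative clustering shown earlier. The coordinates, thresholds, and choice of linkage method are illustrative assumptions; only the distance formula d(i, j) comes from the text above.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# One 2-D point per word, e.g. as produced by quantification_iv() in the earlier sketch.
coords = np.array([[0.10, 0.20],
                   [0.12, 0.18],
                   [0.90, -0.30],
                   [0.85, -0.28]])

# d(i, j) = ((x_i - x_j)^2 + (y_i - y_j)^2)^(1/2), stored in condensed form.
dist = pdist(coords, metric="euclidean")

# Clustering unit 6 and hierarchization unit 7 (conventional processing).
Z = linkage(dist, method="centroid")
for level, bound in enumerate((0.1, 0.5), start=1):  # assumed per-level distance bounds
    print("level", level, fcluster(Z, t=bound, criterion="distance"))
```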

Next, a second embodiment of the present invention will be described with reference to FIG. 2. In this embodiment the processing of the first embodiment is executed repeatedly. First, the inter-word distance calculation unit 5, the clustering unit 6, and the hierarchization unit 7 are provided separately for the dependent-word side and the head-word side, a first round of classification is performed separately for the dependent words and for the head words, and the respective results are stored in the dependent-word knowledge unit 9 and the head-word knowledge unit 10. Next, using these results, affinities e_ij such as the following are defined for the head words and the dependent words, and Quantification Type IV analysis is performed by the dependent-word Quantification Type IV analysis unit 3 and the head-word Quantification Type IV analysis unit 4.

Affinity for the dependent-word Quantification Type IV analysis:

N(i): frequency of occurrence of dependent word i
N(j): frequency of occurrence of dependent word j
N_0(i, j): total frequency with which dependent words i and j, in the same type of dependency relation, modify the same head word
N_k(i, j): total frequency with which dependent words i and j, in the same type of dependency relation, modify head words belonging to the same group at level k
w_k: weighting coefficients.

Affinity for the head-word Quantification Type IV analysis:

N(i): frequency of occurrence of head word i
N(j): frequency of occurrence of head word j
N_0(i, j): total frequency with which head words i and j, in the same type of dependency relation, are modified by the same dependent word
N_k(i, j): total frequency with which head words i and j, in the same type of dependency relation, are modified by dependent words belonging to the same group at level k
w_k: weighting coefficients.

Using the word values obtained in this way, the inter-word distance calculation, the clustering, and the hierarchization are executed as in the first embodiment.

This processing is repeated until the word groups saturate under a fixed condition.
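A sketch of the iteration loop of this second embodiment, reusing quantification_iv() from the earlier sketch. Treating "saturates under a fixed condition" as "the group assignments no longer change between rounds" is one possible reading, and every function name and parameter here is an assumption made for illustration.

```python
def classify_until_saturated(dep_affinity_fn, head_affinity_fn, cluster_fn, max_rounds=10):
    """Alternately re-quantify and re-cluster dependent words and head words,
    feeding each round's groups back into the affinity definitions (knowledge
    units 9 and 10), until the groupings stop changing between rounds."""
    dep_groups, head_groups = None, None
    for _ in range(max_rounds):
        e_dep = dep_affinity_fn(head_groups)    # affinity for dependent words, using current head-word groups
        e_head = head_affinity_fn(dep_groups)   # affinity for head words, using current dependent-word groups
        new_dep = cluster_fn(quantification_iv(e_dep))
        new_head = cluster_fn(quantification_iv(e_head))
        if new_dep == dep_groups and new_head == head_groups:
            break                               # saturated: no change since the previous round
        dep_groups, head_groups = new_dep, new_head
    return dep_groups, head_groups
```

Here cluster_fn is expected to return a comparable grouping, for example a list of group labels, so that the saturation test is a simple equality check.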

Furthermore, as in the conventional methods, the relation between a dependent word and a head word can also be expressed as the relation between a noun and a verb linked by the same function word, the relation between a modifier and the modified element within a compound word, the dependency relation between a noun and a verb under the same case type, and so on.

EFFECTS OF THE INVENTION

As described above, according to the present invention the inter-word distance is defined using the numerical values given by Quantification Type IV analysis of word statistics consisting of the distinct dependency combinations between words and their frequencies, so that even a large set of words can be classified efficiently.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram showing a first embodiment of the present invention, and FIG. 2 is a block diagram showing a second embodiment of the present invention.

1: dependency analysis unit; 3, 4: Quantification Type IV analysis units.

Claims (1)

1. A Japanese word classification method in which words are classified into groups of semantic concepts using statistical data obtained by analyzing Japanese text with a dependency analysis unit, characterized in that the statistical data of the words, consisting of the distinct dependency combinations between words and their frequencies, are subjected to Quantification Type IV analysis by a Quantification Type IV analysis unit so as to assign numerical values to the words, and the words are classified by clustering based on an inter-word distance that indicates the conceptual closeness of the words and is computed at least from these values.
JP1271156A 1989-10-18 1989-10-18 Method for sorting japanese word Pending JPH03131967A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP1271156A JPH03131967A (en) 1989-10-18 1989-10-18 Method for sorting japanese word

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP1271156A JPH03131967A (en) 1989-10-18 1989-10-18 Method for sorting japanese word

Publications (1)

Publication Number Publication Date
JPH03131967A true JPH03131967A (en) 1991-06-05

Family

ID=17496116

Family Applications (1)

Application Number Title Priority Date Filing Date
JP1271156A Pending JPH03131967A (en) 1989-10-18 1989-10-18 Method for sorting japanese word

Country Status (1)

Country Link
JP (1) JPH03131967A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001075966A (en) * 1999-07-07 2001-03-23 Internatl Business Mach Corp <Ibm> Data analysis system
JP2008217398A (en) * 2007-03-05 2008-09-18 Hidetsugu Nanba Technical term classification device, technical term classification method, and program

Similar Documents

Publication Publication Date Title
Ghanbari-Adivi et al. Text emotion detection in social networks using a novel ensemble classifier based on Parzen Tree Estimator (TPE)
CN111144127B (en) Text semantic recognition method, text semantic recognition model acquisition method and related device
Lewis et al. Heterogeneous uncertainty sampling for supervised learning
CN110825877A (en) A Semantic Similarity Analysis Method Based on Text Clustering
CN112464638A (en) Text clustering method based on improved spectral clustering algorithm
CN108519971B (en) A cross-language news topic similarity comparison method based on parallel corpus
CN106407406A (en) A text processing method and system
US20040122660A1 (en) Creating taxonomies and training data in multiple languages
CN106991171A (en) Topic based on Intelligent campus information service platform finds method
CN118313345A (en) Text data set processing method, system, equipment and storage medium
CN106708926B (en) Implementation method of analysis model supporting massive long text data classification
Imad et al. Automated Arabic News Classification using the Convolutional Neural Network.
Fatwanto et al. A Systematic Literature Review of BERT-based Models for Natural Language Processing Tasks
Lai et al. Government affairs message text classification based on RoBerta and TextCNN
CN112685374A (en) Log classification method and device and electronic equipment
JPH03131967A (en) Method for sorting japanese word
Revenko et al. Learning Ontology Classes from Text by Clustering Lexical Substitutes Derived from Language Models 1
JPH03111972A (en) Method for sorting japanese word
JPH03132870A (en) Concept expressing method for japanese word
Ağduk et al. Classification of news texts from different languages with machine learning algorithms
CN119202260B (en) Power audit text classification method based on large language model
CN119513318B (en) Text classification method based on double-channel semantic enhancement and convolutional neural network
Zhang et al. An intelligent fault diagnosis method using variable weight artificial immune recognizers (V-AIR)
CN110175237A (en) It is a kind of towards multi-class secondary sensibility classification method
Dikovitsky et al. Topic Clustering of Social Media Using Multilayer Text Analysis