TWI515584B

TWI515584B - Computer-assisted professional term dictionary generation system and method therefor

Info

Publication number: TWI515584B
Application number: TW100125187A
Authority: TW
Inventors: 郭建明; 蘭昀
Original assignee: 南臺科技大學
Priority date: 2011-07-15
Filing date: 2011-07-15
Publication date: 2016-01-01
Also published as: TW201303623A

Description

Computer aided professional noun dictionary generation system and method

The invention relates to a computer-aided system and method for generating a dictionary of professional terms, and in particular to a system and method that automatically pick out new professional terms from an input document and explain them so that the public can look them up.

As human civilization evolves and new technologies, theories, and papers appear at an accelerating pace, new specialized vocabulary grows rapidly, and existing professional vocabulary processing systems can no longer keep up. Paper and electronic vocabulary systems all rely on a stored vocabulary. The number of entries in a dictionary can be increased as far as possible, but however large the vocabulary is, it cannot cover every word that may be needed: vocabulary itself grows over time, and every field of knowledge has its own keywords and proper nouns, which a system designer cannot know in advance. The vocabulary in a dictionary of professional terms, or in any other lexicon, therefore cannot remain fixed. It should be updated according to the articles being processed or the related field, with new terms extracted automatically for each field, which would greatly help broaden the application of professional term vocabulary systems and the exploration of new fields.

Accordingly, because existing professional vocabulary processing systems update new vocabulary too slowly to meet demand, the present invention provides a professional term dictionary generation system. The system automatically picks out new vocabulary from the content of an input document, explains it, and places it in a dictionary for lookup. Feeding in documents keeps the vocabulary up to date at any time, meeting the needs of every field.

The above is achieved by a computer-aided professional term dictionary generation system comprising: a database module that includes an old professional term storage area and a new professional term storage area; a receiving module for receiving an input document; a repeating sequence module, coupled to the receiving module, for segmenting the input document into a plurality of language units; a statistical word frequency probability module, coupled to the repeating sequence module and the database module, for analyzing and comparing the language units so as to filter out old professional terms, prepositions, and articles, and to produce new professional terms that are stored in the new professional term storage area; and an explanation module, coupled to the database module, for explaining the new professional terms and reclassifying each explained new professional term as an old professional term stored in the old professional term storage area of the database module.

The computer-aided professional term dictionary generation system further comprises a dictionary query module, coupled to the database module, for looking up old professional terms.

Each language unit is a single word or a phrase.

The explanation module uses the open, programmable library of a web search engine to explain new professional terms.

The invention further provides a computer-aided professional term dictionary generation method comprising: storing old professional terms; receiving an input document; segmenting the input document into a plurality of language units; analyzing and comparing the language units so as to filter out old professional terms, prepositions, and articles and to produce new professional terms; and explaining the new professional terms and reclassifying each explained new professional term as an old professional term.

Effect of the invention: the vocabulary is updated at any time to meet the needs of every field.

The technical features and advantages of the invention are set out clearly in the following preferred embodiment, described with reference to the drawings. Referring first to the first figure, the preferred embodiment of the invention is a computer-aided professional term dictionary generation system comprising:

A database module (10), which includes an old professional term storage area (12) and a new professional term storage area (14).

A receiving module (16) for receiving an input document. Input documents include new journals, new patents, new theories, news, and the like.

A repeating sequence module (18), coupled to the receiving module (16), for segmenting the input document into a plurality of language units (single words or phrases). The repeating sequence module (18) builds basic mathematical models from the patterns of the various types of repeated sequences studied in genetics, implements each model as a program, and consolidates the resulting character and word data into a relational database. Specifically, segmentation uses the repeated-fragment detection method proposed by Matteo Pellegrini, which is based on an N×N matrix $P_{ij}$: when the two strings being compared are equal, the value at position (i, j) of the matrix changes from 0 to 1. The method then looks for the longest repeated strings and uses them to segment the text.
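The matrix idea in the paragraph above can be illustrated with a minimal Python sketch. This is not the cited detection method itself, only an illustration under stated assumptions: the input is already split into atomic units (characters or whitespace-separated words), a repeat is read off as a diagonal run of 1s above the main diagonal, and overlapping occurrences are allowed. The function names `match_matrix` and `longest_repeat` are hypothetical.

```python
from typing import List, Tuple


def match_matrix(units: List[str]) -> List[List[int]]:
    """Build the N x N matrix P with P[i][j] = 1 when units[i] == units[j], else 0."""
    n = len(units)
    return [[1 if units[i] == units[j] else 0 for j in range(n)] for i in range(n)]


def longest_repeat(units: List[str]) -> Tuple[int, int, int]:
    """Return (first_start, second_start, length) of the longest repeated run.

    A repeated run appears in P as consecutive 1s along a diagonal with i < j.
    Returns (0, 0, 0) when nothing repeats.
    """
    n = len(units)
    p = match_matrix(units)
    # run[i + 1][j + 1] holds the length of the diagonal run of 1s ending at (i, j).
    run = [[0] * (n + 1) for _ in range(n + 1)]
    best = (0, 0, 0)
    for i in range(n):
        for j in range(i + 1, n):
            if p[i][j]:
                run[i + 1][j + 1] = run[i][j] + 1
                length = run[i + 1][j + 1]
                if length > best[2]:
                    best = (i - length + 1, j - length + 1, length)
    return best
```

For the word list `"machine learning uses machine learning models".split()`, the longest repeated run starts at positions 0 and 3 and has length 2, so the phrase "machine learning" would be a candidate language unit. The quadratic matrix is fine for a sketch; a real segmenter over long documents would likely need a more compact structure.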

A statistical word frequency probability module (20) (TF-IDF), coupled to the repeating sequence module (18) and the database module (10), for analyzing and comparing the language units so as to filter out old professional terms, prepositions, and articles, and to produce new professional terms that are stored in the new professional term storage area (14). In more detail, the statistical word frequency probability module (20) applies a statistical method that evaluates how important a language unit is relative to the input document and the content stored in the database module (10): the importance of a language unit increases in proportion to the number of times it appears in the input document and decreases in inverse proportion to the number of times it appears in the old professional term storage area (12) of the database module.

Language units tend to recur, so the more often a language unit appears, the more likely it is to be a new professional term. The term frequency is $\mathrm{tf}_{i,j} = n_{i,j} / \sum_k n_{k,j}$, where $n_{i,j}$ is the number of occurrences of a single language unit $t_i$ in input document $d_j$, and the denominator is the sum of the occurrences of all language units in input document $d_j$.

Inverse document frequency (IDF) measures the general importance of a language unit. The IDF of a particular language unit is obtained by dividing the total number of input documents by the number of input documents that contain that language unit, and taking the logarithm of the quotient: $\mathrm{idf}_i = \log \frac{|D|}{|\{d : t_i \in d\}|}$, where:

$|D|$: the total number of input documents in the database module (10).

$|\{d : t_i \in d\}|$: the number of input documents that contain a particular language unit $t_i$ (that is, the number of input documents for which $n_{i,j} \neq 0$).

Then: $\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \cdot \mathrm{idf}_i$

Therefore, a language unit that has a high frequency within a particular input document and a low document frequency across all input documents receives a high weight. Such high-weight language units are identified as new professional terms and submitted to experts for review.
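The weighting and filtering described above can be sketched in Python as follows. This is an illustrative sketch, not the patented implementation. Several points are assumptions: the natural logarithm is used because the text does not fix a base, the corpus is a list of already segmented documents that includes the input document, and the `threshold` cut-off, the `stop_words` set (standing in for prepositions and articles), and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def term_frequencies(units: List[str]) -> Dict[str, float]:
    """tf: occurrences of each unit divided by all unit occurrences in the document."""
    counts = Counter(units)
    total = sum(counts.values())
    return {unit: count / total for unit, count in counts.items()}


def inverse_document_frequency(unit: str, corpus: List[List[str]]) -> float:
    """idf: log of (number of documents / number of documents containing the unit)."""
    containing = sum(1 for doc in corpus if unit in doc)
    if containing == 0:
        raise ValueError(f"{unit!r} does not occur in the corpus")
    return math.log(len(corpus) / containing)


def candidate_terms(
    doc_index: int,
    corpus: List[List[str]],
    old_terms: Set[str],
    stop_words: Set[str],
    threshold: float,
) -> Dict[str, float]:
    """Return units of corpus[doc_index] whose tf-idf weight reaches the threshold.

    Units already stored as old terms, and stop words, are filtered out first.
    """
    weights: Dict[str, float] = {}
    for unit, tf in term_frequencies(corpus[doc_index]).items():
        if unit in old_terms or unit in stop_words:
            continue
        weight = tf * inverse_document_frequency(unit, corpus)
        if weight >= threshold:
            weights[unit] = weight
    return weights
```

With a three-document corpus in which "qubit" occurs twice among the five units of the first document and in no other document, its weight is (2/5) · log(3), while a word that occurs in every document gets weight 0 because log(1) = 0. The surviving candidates would then go to expert review, as the text states.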

An explanation module (22), coupled to the database module (10), for explaining the new professional terms and reclassifying each explained new professional term as an old professional term stored in the old professional term storage area (12) of the database module. In more detail, the explanation module (22) uses the open, programmable library of a web search engine to explain new professional terms.

A dictionary query module (24), coupled to the database module (10), with which scholars, experts, publishers, students, and dictionary compilers can look up old professional terms.

The invention further provides a computer-aided professional term dictionary generation method comprising:

Step 01: The database module (10) stores old professional terms.

Step 02: The receiving module (16) receives an input document.

Step 03: The repeating sequence module (18) segments the input document into a plurality of language units (single words or phrases).

Step 04: The statistical word frequency probability module (20) analyzes and compares the language units so as to filter out old professional terms, prepositions, and articles, and to produce new professional terms.

Step 05: The explanation module (22) explains the new professional terms and reclassifies each explained new professional term as an old professional term. The explanation is carried out using the open, programmable library of a web search engine.

With the above method, new professional terms are picked out automatically, explained, and made available for public lookup, which greatly helps broaden the application of professional term vocabulary systems and the exploration of new fields.
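The data flow of steps 01 to 05 can be sketched in Python as below. This is a rough illustration under stated assumptions rather than the patented implementation: the segmentation, selection, and explanation stages are passed in as plain callables, `simple_segment` and `simple_select` are deliberately naive stand-ins for modules (18) and (20), and the search-engine library used by the explanation module is replaced by whatever `explain` callable the caller supplies. All names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, List, Optional, Set


@dataclass
class TermDatabase:
    """Stand-in for database module (10)."""
    old_terms: Dict[str, str] = field(default_factory=dict)  # area (12): term -> explanation
    new_terms: Set[str] = field(default_factory=set)          # area (14): not yet explained

    def query(self, term: str) -> Optional[str]:
        """Stand-in for dictionary query module (24): look up an old term."""
        return self.old_terms.get(term)


def simple_segment(document: str) -> List[str]:
    """Naive stand-in for step 03: lowercase and split on whitespace."""
    return document.lower().split()


def simple_select(
    units: List[str],
    db: TermDatabase,
    stop_words: FrozenSet[str] = frozenset({"a", "an", "and", "of", "the"}),
) -> Set[str]:
    """Naive stand-in for step 04: drop old terms and stop words, keep the rest."""
    return {u for u in units if u not in db.old_terms and u not in stop_words}


def run_pipeline(
    db: TermDatabase,
    document: str,
    segment: Callable[[str], List[str]],
    select: Callable[[List[str], TermDatabase], Set[str]],
    explain: Callable[[str], Optional[str]],
) -> List[str]:
    """Run steps 03 to 05 on one received document; return the terms promoted to old."""
    units = segment(document)                # step 03
    db.new_terms.update(select(units, db))   # step 04: store candidates in area (14)
    promoted = []
    for term in sorted(db.new_terms):        # step 05 (sorted() copies, so the set can change)
        explanation = explain(term)
        if explanation is not None:
            db.old_terms[term] = explanation  # reclassify as an old term in area (12)
            db.new_terms.discard(term)
            promoted.append(term)
    return promoted
```

A term for which `explain` returns `None` stays in the new term storage area until a later run can explain it, which is one possible reading of how unexplained terms might be handled; the text itself does not specify this case.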

The above describes only a preferred embodiment of the invention and does not limit the scope in which the invention may be practiced. Simple equivalent changes and modifications made in accordance with the claims and the description of the invention all fall within the scope covered by this patent.

(10)‧‧‧Database module

(12)‧‧‧Old professional term storage area

(14)‧‧‧New professional term storage area

(16)‧‧‧Receiving module

(18)‧‧‧Repeating sequence module

(20)‧‧‧Statistical word frequency probability module

(22)‧‧‧Explanation module

(24)‧‧‧Dictionary query module

The first figure is a schematic diagram showing the correspondence among the modules of the system of the preferred embodiment.

The second figure is a schematic diagram showing the flow of the method of the preferred embodiment.

Claims

1. A computer-aided professional term dictionary generation system, comprising: a database module including an old professional term storage area and a new professional term storage area, the old professional term storage area storing a plurality of old professional terms; a receiving module for receiving an input document; a repeating sequence module coupled to the receiving module, wherein the repeating sequence module builds basic mathematical models from the patterns of the various types of repeated sequences in genetics, implements each mathematical model as a program, and consolidates the resulting character and word data into a relational database in order to find the longest repeated strings and thereby segment the text, the repeating sequence module using a repeated-fragment detection method to segment the input document into a plurality of language units, the method using an N×N matrix $P_{ij}$ in which, when the two strings being compared are equal, the value at position (i, j) changes from 0 to 1, and the method seeking the longest repeated strings so as to segment the text; a statistical word frequency probability module coupled to the repeating sequence module and the database module, for analyzing and comparing the language units so as to filter out old professional terms, prepositions, and articles and to produce new professional terms stored in the new professional term storage area, wherein the statistical word frequency probability module evaluates the importance of a language unit relative to the input document and the content stored in the database module, the importance increasing in proportion to the number of times the language unit appears in the input document and decreasing in inverse proportion to the number of times it appears in the old professional term storage area of the database module, and uses an inverse document frequency as a measure of the general importance of the language unit, such that a language unit having a high frequency within the input document and a low document frequency across all input documents is identified as a high-weight language unit forming a new professional term for expert review, according to the formula $\mathrm{idf}_i = \log \frac{|D|}{|\{d : t_i \in d\}|}$, where $|D|$ is the total number of input documents in the database module and $|\{d : t_i \in d\}|$ is the number of input documents containing a particular language unit $t_i$ (that is, the number of input documents for which $n_{i,j} \neq 0$), and then $\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \cdot \mathrm{idf}_i$; and an explanation module coupled to the database module, for explaining the new professional terms and reclassifying each explained new professional term as an old professional term stored in the old professional term storage area of the database module.

2. The computer-aided professional term dictionary generation system according to claim 1, further comprising a dictionary query module coupled to the database module for looking up old professional terms.

3. The computer-aided professional term dictionary generation system according to claim 1, wherein each language unit is a single word or a phrase.

4. The computer-aided professional term dictionary generation system according to claim 1, wherein the explanation module uses the open, programmable library of a web search engine to explain the new professional terms.

5. A computer-aided professional term dictionary generation method, comprising: storing old professional terms; receiving an input document; segmenting the input document into a plurality of language units using a repeated-fragment detection method, wherein the repeating sequence module builds basic mathematical models from the patterns of the various types of repeated sequences in genetics, implements each mathematical model as a program, and consolidates the resulting character and word data into a relational database in order to find the longest repeated strings and thereby segment the text; analyzing and comparing the language units, using an inverse document frequency as a measure of the general importance of a language unit, such that a language unit having a high frequency within the input document and a low document frequency across all input documents is identified as a high-weight language unit, so as to filter out old professional terms, prepositions, and articles and to produce new professional terms; and explaining the new professional terms and reclassifying each explained new professional term as an old professional term.

6. The computer-aided professional term dictionary generation method according to claim 5, wherein each language unit is a single word or a phrase.

7. The computer-aided professional term dictionary generation method according to claim 5, wherein the explanation module uses the open, programmable library of a web search engine to explain the new professional terms.