TWI515584B - Computer-assisted specialized noun produced dictionary system and method therefor - Google Patents
- Publication number: TWI515584B
- Application number: TW100125187A
- Authority: TW (Taiwan)
- Prior art keywords: professional, module, language unit, new, old
- Prior art date
Landscapes
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Description
The present invention relates to a computer-aided system and method for generating a dictionary of technical terms, and in particular to a system and method that automatically selects new technical terms from an input document and explains them so that the public can look them up.
As human civilization evolves and information such as emerging technologies, theories and papers is produced at an accelerating pace, new specialized vocabulary grows rapidly, and existing technical-vocabulary processing systems can no longer keep up with demand. Paper and electronic vocabulary systems of every kind rely on a vocabulary. Although the number of entries in a dictionary can be increased as far as possible, no vocabulary, however large, can cover every term that might be used. Vocabulary itself grows over time as civilization evolves, and every field of knowledge has its own keywords and proper nouns that a system designer cannot know in advance. The vocabulary contained in a dictionary of technical terms, or in any other kind of lexicon, therefore cannot remain fixed. It should be updated as articles are processed or as the related fields change, and new terms should be extracted automatically for each field, which would greatly help broaden the application of technical-vocabulary systems and the exploration of new fields.
Accordingly, because existing technical-vocabulary processing systems update new vocabulary too slowly to meet demand, the present invention provides a technical-term dictionary generation system. The system automatically selects new terms from the content of an input document, explains them, and places them in a dictionary for lookup. Because the vocabulary is updated whenever a document is input, the system can meet the needs of each field.
The above function is achieved by a computer-aided technical-term dictionary generation system comprising: a database module, which includes an old technical term storage area and a new technical term storage area; a receiving module for receiving an input document; a repeated-sequence module, coupled to the receiving module, for segmenting the input document into a plurality of language units; a statistical word-frequency probability module, coupled to the repeated-sequence module and the database module, for analyzing and comparing the language units so as to filter out old technical terms, prepositions and articles, thereby producing new technical terms that are stored in the new technical term storage area; and an explanation module, coupled to the database module, for explaining the new technical terms, reclassifying each explained new technical term as an old technical term, and storing it in the old technical term storage area of the database module.
The computer-aided technical-term dictionary generation system further comprises a dictionary query module, coupled to the database module, for looking up old technical terms.
The language units include single words or phrases.
The explanation module uses an open, programmable library provided by a web search engine to explain the new technical terms.
The present invention further provides a computer-aided method for generating a dictionary of technical terms, comprising: storing old technical terms; receiving an input document; segmenting the input document into a plurality of language units; analyzing and comparing the language units so as to filter out old technical terms, prepositions and articles, thereby producing new technical terms; and explaining the new technical terms and reclassifying each explained new technical term as an old technical term.
In the method, the language units include single words or phrases.
In the method, the explanation module uses an open, programmable library provided by a web search engine to explain the new technical terms.
Effect of the invention: the vocabulary is updated at any time to meet the needs of each field.
The technical features and advantages of the present invention are made clear by the following preferred embodiment, described with reference to the drawings. Referring first to FIG. 1, the preferred embodiment of the present invention is a computer-aided technical-term dictionary generation system comprising:
A database module (10), which includes an old technical term storage area (12) and a new technical term storage area (14).
A receiving module (16) for receiving an input document. The input document may be, for example, a new journal article, a new patent, a new theory or a news item.
A repeated-sequence module (18), coupled to the receiving module (16), for segmenting the input document into a plurality of language units (single words or phrases). The repeated-sequence module (18) builds basic mathematical models based on the patterns of the various types of repeated sequences known from genetics, implements each model as a program, and consolidates the resulting character and word data into a relational database. Specifically, segmentation uses the repeated-fragment detection method proposed by Matteo Pellegrini. The method uses an N×N matrix P_ij: when the two strings being compared are equal, the value at position (i, j) of the matrix changes from 0 to 1. The method then tries to find the longest repeated strings, and these are used to segment the text.
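The matrix comparison described for the repeated-sequence module (18) can be illustrated with a minimal Python sketch. It is not Pellegrini's published method, only a plain dot-matrix search written from the description above: the function names `match_matrix` and `longest_repeat` are hypothetical, and the input is assumed to be already split into tokens (characters would work the same way for unsegmented text).

```python
def match_matrix(tokens):
    """Build the N x N 0/1 matrix P where P[i][j] == 1 iff tokens[i] == tokens[j]."""
    n = len(tokens)
    return [[1 if tokens[i] == tokens[j] else 0 for j in range(n)] for i in range(n)]


def longest_repeat(tokens):
    """Return the longest token run that occurs at least twice (occurrences may overlap).

    A diagonal run of k ones starting at (i, j), with i < j, means that
    tokens[i:i+k] == tokens[j:j+k], so only cells above the main diagonal
    need to be scanned.
    """
    p = match_matrix(tokens)
    n = len(tokens)
    best_len, best_start = 0, 0
    # run[i + 1][j + 1] holds the length of the diagonal run of ones ending at (i, j).
    run = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(i + 1, n):
            if p[i][j]:
                run[i + 1][j + 1] = run[i][j] + 1
                if run[i + 1][j + 1] > best_len:
                    best_len = run[i + 1][j + 1]
                    best_start = i - best_len + 1
    return tokens[best_start:best_start + best_len]


if __name__ == "__main__":
    text = "the gene sequence repeats and the gene sequence mutates"
    print(longest_repeat(text.split()))
```

The sketch keeps the full matrix for clarity, which costs O(N²) memory; a practical segmenter would more likely use a suffix-based index, and would collect all sufficiently long repeats as candidate language units instead of only the longest one.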
A statistical word-frequency probability module (20) (TF-IDF), coupled to the repeated-sequence module (18) and the database module (10), for analyzing and comparing the language units so as to filter out old technical terms, prepositions and articles, thereby producing new technical terms that are stored in the new technical term storage area (14). More specifically, the module (20) applies a statistical method to evaluate how important a language unit is with respect to the input document and to the content stored in the database module (10). The importance of a language unit increases in proportion to the number of times it appears in the input document, and decreases in inverse proportion to the number of times it appears in the old technical term storage area of the database module.
Inverse document frequency (IDF) is a measure of the general importance of a language unit. The IDF of a particular language unit is obtained by dividing the total number of input documents by the number of input documents that contain that language unit, and then taking the logarithm of the quotient:
$$\mathrm{idf}_i = \log \frac{|D|}{|\{d : t_i \in d\}|}$$
|D|: the total number of input documents in the database module (10).
|{d : t_i ∈ d}|: the number of input documents that contain the particular language unit t_i (that is, the number of input documents for which n_i ≠ 0).
Then:
$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \cdot \mathrm{idf}_i$$
Therefore, a language unit that has a high frequency within a particular input document and a low document frequency across all input documents receives a high weight. Such high-weight language units are identified as new technical terms and are submitted to experts for review.
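The weighting above can be sketched in a few lines of Python. The text does not say how the term frequency tf is normalized, so this sketch assumes the common choice of dividing the count by the document length; the function names and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf(unit, document):
    """Term frequency: occurrences of `unit` divided by the document length (assumed normalization)."""
    if not document:
        return 0.0
    return Counter(document)[unit] / len(document)


def idf(unit, documents):
    """Inverse document frequency: log(|D| / number of documents containing `unit`)."""
    containing = sum(1 for doc in documents if unit in doc)
    if containing == 0:
        raise ValueError(f"{unit!r} does not occur in any document")
    return math.log(len(documents) / containing)


def tfidf(unit, document, documents):
    """Weight of `unit` in `document` relative to the whole collection."""
    return tf(unit, document) * idf(unit, documents)


if __name__ == "__main__":
    docs = [
        ["the", "gene", "repeat", "gene"],
        ["the", "gene"],
        ["the", "matrix"],
    ]
    for unit in ("the", "gene", "repeat"):
        print(unit, round(tfidf(unit, docs[0], docs), 4))
```

In this toy collection "the" appears in every document, so its IDF is log(1) = 0 and its weight vanishes, while "repeat", which appears in only one document, outranks the more frequent but more widespread "gene". That is the behaviour the description relies on to surface candidate new terms.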
An explanation module (22), coupled to the database module (10), for explaining the new technical terms, reclassifying each explained new technical term as an old technical term, and storing it in the old technical term storage area of the database module. More specifically, the explanation module (22) uses an open, programmable library provided by a web search engine to explain the new technical terms.
A dictionary query module (24), coupled to the database module (10), with which scholars, experts, publishers, students and dictionary compilers can look up old technical terms.
The present invention further provides a computer-aided method for generating a dictionary of technical terms, comprising:
Step 01: The database module (10) stores old technical terms.
Step 02: The receiving module (16) receives an input document.
Step 03: The repeated-sequence module (18) segments the input document into a plurality of language units (single words or phrases).
Step 04: The statistical word-frequency probability module (20) analyzes and compares the language units so as to filter out old technical terms, prepositions and articles, thereby producing new technical terms.
Step 05: The explanation module (22) explains the new technical terms and reclassifies each explained new technical term as an old technical term. The explanation is produced using an open, programmable library provided by a web search engine.
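Steps 01 to 05 can be tied together in a short Python sketch. Everything named here is an assumption made for illustration: `TermDatabase`, `FUNCTION_WORDS`, `segment`, `select_new_terms` and `process` are hypothetical, whitespace splitting stands in for the repeated-sequence segmentation of step 03, the TF-IDF weighting and expert review are omitted from step 04, and the search-engine library, which the text does not name, is replaced by a caller-supplied `explain` function.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical stop list standing in for the "prepositions and articles" filter.
FUNCTION_WORDS: Set[str] = {"a", "an", "the", "of", "in", "on", "for", "to"}


@dataclass
class TermDatabase:
    """Database module (10): old-term area (12) and new-term area (14)."""
    old_terms: Dict[str, str] = field(default_factory=dict)  # term -> explanation
    new_terms: Set[str] = field(default_factory=set)


def segment(document: str) -> List[str]:
    """Step 03 placeholder: lowercase whitespace split, not repeated-sequence segmentation."""
    return document.lower().split()


def select_new_terms(units: Iterable[str], db: TermDatabase) -> Set[str]:
    """Step 04, filter only: drop units that are old terms or function words."""
    return {u for u in units if u not in db.old_terms and u not in FUNCTION_WORDS}


def process(document: str, db: TermDatabase, explain: Callable[[str], str]) -> None:
    """Run steps 02 to 05 on one input document, updating `db` in place."""
    units = segment(document)                    # steps 02-03
    db.new_terms |= select_new_terms(units, db)  # step 04
    for term in sorted(db.new_terms):            # step 05: explain, then reclassify
        db.old_terms[term] = explain(term)
    db.new_terms.clear()


if __name__ == "__main__":
    db = TermDatabase(old_terms={"matrix": "a rectangular array of values"})
    process("The matrix of a transposon", db, lambda t: f"explanation of {t}")
    print(db.old_terms)
```

Passing `explain` in as a parameter keeps the sketch independent of any particular search service and makes the reclassification step easy to test with a stub.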
With the above method, new technical terms are automatically selected, explained and made available for public lookup, which greatly helps broaden the application of technical-vocabulary systems and the exploration of new fields.
The foregoing is merely a preferred embodiment of the present invention and does not limit the scope of the invention. Simple equivalent changes and modifications made in accordance with the claims and the description of the invention all fall within the scope covered by this patent.
(10) Database module
(12) Old technical term storage area
(14) New technical term storage area
(16) Receiving module
(18) Repeated-sequence module
(20) Statistical word-frequency probability module
(22) Explanation module
(24) Dictionary query module
FIG. 1 is a schematic diagram illustrating the relationships among the modules of the system of the preferred embodiment.
FIG. 2 is a schematic diagram illustrating the flow of the method of the preferred embodiment.
Claims (7)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW100125187A TWI515584B (en) | 2011-07-15 | 2011-07-15 | Computer-assisted specialized noun produced dictionary system and method therefor |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW100125187A TWI515584B (en) | 2011-07-15 | 2011-07-15 | Computer-assisted specialized noun produced dictionary system and method therefor |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| TW201303623A (en) | 2013-01-16 |
| TWI515584B (en) | 2016-01-01 |
Family
ID=48138079
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW100125187A TWI515584B (en) | 2011-07-15 | 2011-07-15 | Computer-assisted specialized noun produced dictionary system and method therefor |
Country Status (1)
| Country | Link |
|---|---|
| TW (1) | TWI515584B (en) |
- 2011-07-15: TW application TW100125187A, patent TWI515584B (en), status not active (IP right cessation)
Also Published As
| Publication number | Publication date |
|---|---|
| TW201303623A (en) | 2013-01-16 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | MM4A | Annulment or lapse of patent due to non-payment of fees | |