TWI452475B - A dictionary generating device, a dictionary generating method, a dictionary generating program product, and a computer readable memory medium storing the program
- Publication number
- TWI452475B (Application TW101133547A)
- Authority
- TW
- Taiwan
- Prior art keywords
- word
- dictionary
- unit
- information
- text
- Prior art date
- 2012-02-28
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
Description
One aspect of the present invention relates to an apparatus, a method, a program, and a computer-readable recording medium for generating a word dictionary.

Techniques that divide a sentence into a plurality of words by using a word dictionary (word segmentation) have been known. Patent Document 1, which relates to such techniques, describes a technique that retrieves from a word dictionary the words matching partial character strings of an input text and generates them as word candidates, selects, from the partial character strings of the input text that match no dictionary entry, those that may be unknown words as unknown-word candidates, estimates the per-part-of-speech occurrence probability of each unknown-word candidate with an unknown-word model, and obtains the word sequence with the maximum joint probability by dynamic programming.

Patent Document 1: Japanese Patent Laid-Open Publication No. 2001-051996

To segment text correctly, it is desirable to prepare a large number of words in the dictionary so as to enrich the lexical knowledge. However, building a large-scale dictionary by hand is not easy. A way to construct a large-scale word dictionary easily is therefore required.
A dictionary generating device according to one aspect of the present invention comprises: a model generating unit that generates a word segmentation model by using a word group and a corpus prepared in advance, boundary information indicating word boundaries having been assigned to each text contained in the corpus; an analyzing unit that executes word segmentation incorporating the word segmentation model on a collection of collected texts and assigns boundary information to each text; a selecting unit that selects words to be registered in a dictionary from the texts to which the boundary information has been assigned by the analyzing unit; and a registering unit that registers the words selected by the selecting unit in the dictionary.

A dictionary generating method according to one aspect of the present invention is executed by a dictionary generating device and comprises: a model generating step of generating a word segmentation model by using a word group and a corpus prepared in advance, boundary information indicating word boundaries having been assigned to each text contained in the corpus; an analyzing step of executing word segmentation incorporating the word segmentation model on a collection of collected texts and assigning boundary information to each text; a selecting step of selecting words to be registered in a dictionary from the texts to which the boundary information has been assigned in the analyzing step; and a registering step of registering the words selected in the selecting step in the dictionary.

A dictionary generating program according to one aspect of the present invention causes a computer to function as: a model generating unit that generates a word segmentation model by using a word group and a corpus prepared in advance, boundary information indicating word boundaries having been assigned to each text contained in the corpus; an analyzing unit that executes word segmentation incorporating the word segmentation model on a collection of collected texts and assigns boundary information to each text; a selecting unit that selects words to be registered in a dictionary from the texts to which the boundary information has been assigned by the analyzing unit; and a registering unit that registers the words selected by the selecting unit in the dictionary.

A computer-readable recording medium according to one embodiment of the present invention stores a dictionary generating program that causes a computer to function as: a model generating unit that generates a word segmentation model by using a word group and a corpus prepared in advance, boundary information indicating word boundaries having been assigned to each text contained in the corpus; an analyzing unit that executes word segmentation incorporating the word segmentation model on a collection of collected texts and assigns boundary information to each text; a selecting unit that selects words to be registered in a dictionary from the texts to which the boundary information has been assigned by the analyzing unit; and a registering unit that registers the words selected by the selecting unit in the dictionary.

According to these aspects, a word segmentation model is generated using the corpus with boundary information and the word group, and word segmentation incorporating the model is applied to the text collection. Words are then selected from the text collection to which this application has assigned boundary information and registered in the dictionary. By assigning boundary information to the text collection through analysis that also uses the annotated corpus and then registering the words extracted from that collection, a large-scale word dictionary can be constructed easily.
In a dictionary generating device of another embodiment, the selecting unit may select the words to be registered in the dictionary on the basis of the occurrence frequency of each word calculated from the boundary information assigned by the analyzing unit. Taking the occurrence frequency calculated in this way into account can improve the precision of the dictionary.

In a dictionary generating device of yet another embodiment, the selecting unit may select words whose occurrence frequency is equal to or higher than a specific threshold. Registering in the dictionary only words that occur at least a certain number of times can improve the precision of the dictionary.

In a dictionary generating device of yet another embodiment, the selecting unit may extract words whose occurrence frequency is equal to or higher than the threshold as registration candidates and select a specific number of words from the registration candidates in descending order of occurrence frequency, and the registering unit may add the words selected by the selecting unit to the dictionary in which the word group is recorded. Registering only words with relatively high occurrence frequencies can improve the precision of the dictionary. Moreover, adding words to the dictionary of the word group prepared in advance keeps the configuration of the dictionary simple.

In a dictionary generating device of yet another embodiment, the selecting unit may extract words whose occurrence frequency is equal to or higher than the threshold as registration candidates and select a specific number of words from the registration candidates in descending order of occurrence frequency, and the registering unit may record the words selected by the selecting unit in another dictionary different from the dictionary in which the word group is recorded. Registering only words with relatively high occurrence frequencies can improve the precision of the dictionary. Moreover, adding words to a dictionary different from the dictionary of the word group prepared in advance (the existing dictionary) makes it possible to generate a dictionary with characteristics different from those of the existing dictionary.

In a dictionary generating device of yet another embodiment, the registering unit may register the words selected by the selecting unit in a dictionary different from the dictionary in which the word group is recorded. Adding words to a dictionary different from the dictionary of the word group prepared in advance (the existing dictionary) makes it possible to generate a dictionary with characteristics different from those of the existing dictionary.

In a dictionary generating device of yet another embodiment, the selecting unit may extract words whose occurrence frequency is equal to or higher than the threshold as registration candidates and group the candidate words according to their occurrence frequency, and the registering unit may register the plurality of groups generated by the selecting unit individually in a plurality of dictionaries different from the dictionary in which the word group is recorded. Grouping the words by occurrence frequency and registering each resulting group in its own dictionary makes it possible to generate a plurality of dictionaries whose characteristics differ according to occurrence frequency.

In a dictionary generating device of yet another embodiment, information indicating the domain of each collected text may be associated with that text, and the registering unit may register the words selected by the selecting unit individually in dictionaries prepared per domain, on the basis of the domain of the text containing each word. Generating a dictionary for each domain makes it possible to generate a plurality of dictionaries with mutually different characteristics.

In a dictionary generating device of yet another embodiment, the boundary information may include first information indicating that no boundary exists at an inter-character position, second information indicating that a boundary exists at an inter-character position, and third information indicating that a boundary exists probabilistically at an inter-character position, and the occurrence frequency of each word may be calculated on the basis of the first, second, and third information. Rather than a simple binary choice between the presence and absence of a boundary, introducing the third information, which represents an intermediate concept, allows a text to be divided into a plurality of words more appropriately.

In a dictionary generating device of yet another embodiment, the analyzing unit may include a first binary classifier and a second binary classifier; the first binary classifier determines, for each inter-character position, whether to assign the first information or information other than the first information, and the second binary classifier determines, for each inter-character position judged by the first binary classifier to receive information other than the first information, whether to assign the second information or the third information. Determining the boundary information in stages with a plurality of binary classifiers makes it possible to assign boundary information to texts rapidly and effectively.

In a dictionary generating device of yet another embodiment, the collection of collected texts may be divided into a plurality of groups; after the analyzing unit, the selecting unit, and the registering unit have executed their processing on one of the groups, the model generating unit generates a word segmentation model using the corpus, the word group, and the words registered by the registering unit, and then the analyzing unit, the selecting unit, and the registering unit may execute their processing on another one of the groups.

According to one aspect of the present invention, a large-scale word dictionary can be constructed easily.
Embodiments of the present invention are described in detail below with reference to the accompanying drawings. In the description of the drawings, identical or equivalent elements are given the same reference numerals, and duplicate descriptions are omitted.

First, the functional configuration of a dictionary generating device 10 according to an embodiment is described with reference to Figs. 1 to 3. The dictionary generating device 10 is a computer that analyzes a collection containing a large amount of collected text (hereinafter also called "large-scale text"), extracts words from the text collection, and adds the extracted words to a dictionary.

As shown in Fig. 1, the dictionary generating device 10 comprises a CPU 101 that executes an operating system, application programs, and the like; a main memory unit 102 composed of ROM and RAM; an auxiliary memory unit 103 composed of a hard disk or the like; a communication control unit 104 composed of a network card or the like; an input device 105 such as a keyboard and a mouse; and an output device 106 such as a display.

Each functional component of the dictionary generating device 10 described below is realized by loading specific software into the CPU 101 or the main memory unit 102, operating the communication control unit 104, the input device 105, and the output device 106 under the control of the CPU 101, and reading and writing data in the main memory unit 102 or the auxiliary memory unit 103. The data and databases required for processing are stored in the main memory unit 102 or the auxiliary memory unit 103. Although Fig. 1 shows the dictionary generating device 10 as a single computer, its functions may also be distributed over a plurality of computers.

As shown in Fig. 2, the dictionary generating device 10 comprises, as functional components, a model generating unit 11, an analyzing unit 12, a selecting unit 13, and a registering unit 14. When executing word extraction processing, the dictionary generating device 10 refers to a learning corpus 20, an existing dictionary 31, and large-scale text 40, all prepared in advance, and stores the extracted words in a word dictionary 30. The word dictionary 30 contains at least the existing dictionary 31 and may further contain one or more additional dictionaries 32. Before describing the dictionary generating device 10 in detail, these data are explained.

The learning corpus 20 is a set of texts to which boundary information (annotations) indicating word boundaries (the division positions when a sentence is divided into words) has been assigned (associated), and it is prepared in advance as a database. A text is a sentence or character string made up of a plurality of words. In this embodiment, a specific number of texts randomly extracted from the titles and descriptions of products stored on the website of a virtual shopping mall are used as the material of the learning corpus 20.

Boundary information is assigned to each extracted text by human annotators. The boundary information is set on the basis of two techniques: word segmentation by pointwise estimation, and a three-stage word segmentation corpus.
[Word segmentation by pointwise estimation]
Word boundary labels b = b1 b2 ... bn are assigned to a text (character string) x = x1 x2 ... xn (where x1, x2, ..., xn are characters). Here, bi is a label indicating whether a word boundary exists between characters xi and xi+1 (an inter-character position); bi = 1 means split and bi = 0 means non-split. The value indicated by the label bi can also be regarded as the strength of the split.

Fig. 3 shows an example of determining the label between "ン (n)" and "を (wo)" in the Japanese sentence "ボールペンを買った。" (bo-rupen wo katta; English: "(I) bought a ballpoint pen."). The value of a word boundary label is determined with reference to features obtained from the surrounding characters. For example, the value of the word boundary label is set using three kinds of features: character features, character-type features, and dictionary features.

Character features are features expressed as the combination of each character n-gram of length n or less that adjoins or contains the boundary bi, together with its position relative to bi. For example, with n = 3 in Fig. 3, the nine features "-1/ン (n)", "1/を (wo)", "-2/ペン (pen)", "-1/ンを (n wo)", "1/を買 (wo ka)", "-3/ルペン (rupen)", "-2/ペンを (pen wo)", "-1/ンを買 (n wo ka)", and "1/を買っ (wo kat)" are obtained for the boundary bi between "ン (n)" and "を (wo)".

Character-type features are the same as the character features above except that character types are processed instead of the characters themselves. Eight character types are considered: hiragana, katakana, kanji, uppercase letters, lowercase letters, Arabic numerals, kanji numerals, and the middle dot (・). There is no particular restriction on the character types used or on their number.

Dictionary features indicate whether a word of length j (1 ≤ j ≤ k) located around the boundary exists in the dictionary. A dictionary feature is expressed as the combination of the word length j with a tag indicating that the boundary bi lies at the end point of the word (L), at its starting point (R), or inside the word (M). If the words "ペン (pen)" and "を (wo)" are registered in the dictionary, the dictionary features L2 and R1 are produced for the boundary bi in Fig. 3. Furthermore, when a plurality of dictionaries are used as described later, a dictionary identifier is attached to each dictionary feature. For example, if "ペン (pen)" is registered in dictionary A with identifier DIC1 and "を (wo)" is registered in dictionary B with identifier DIC2, the dictionary features are expressed as DIC1-L2, DIC2-R1, and so on.

In this embodiment, the maximum n-gram length n of the character features and character-type features is set to 3, and the maximum word length k of the dictionary features is set to 8, but these values may be determined arbitrarily.
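To make the feature scheme concrete, the following Python sketch derives the three feature families for a single inter-character position. The function names, the exact character-class mapping, and the feature string format are illustrative choices, not taken from the patent; the patent fixes only the position/n-gram encoding, the L/R/M dictionary tags, and the limits n = 3 and k = 8.

```python
def boundary_ngrams(s, i, n=3, prefix=""):
    """All n-grams (length 1..n) of s that adjoin or contain the boundary between
    s[i] and s[i+1], encoded as "start-position/ngram" as in the example above
    (position -1 is the character just left of the boundary, +1 just right)."""
    feats = []
    for l in range(1, n + 1):
        for start in list(range(-l, 0)) + [1]:
            pos, p = [], start
            while len(pos) < l:
                if p == 0:
                    p = 1               # no character sits on the boundary itself
                pos.append(p)
                p += 1
            idx = [i + q + 1 if q < 0 else i + q for q in pos]
            if idx[0] < 0 or idx[-1] >= len(s):
                continue                # n-gram would run off the text
            feats.append(f"{prefix}{start}/" + "".join(s[j] for j in idx))
    return feats

def char_type(c):
    """Coarse character classes (the patent names eight; this mapping is ours)."""
    if c in "〇一二三四五六七八九十百千万":
        return "N"                      # kanji numerals (checked before kanji)
    if c == "・":
        return "M"                      # middle dot (checked before katakana)
    if "\u3040" <= c <= "\u309f":
        return "H"                      # hiragana
    if "\u30a0" <= c <= "\u30ff":
        return "K"                      # katakana
    if "\u4e00" <= c <= "\u9fff":
        return "C"                      # kanji
    if c.isupper():
        return "U"
    if c.islower():
        return "L"
    if c.isdigit():
        return "D"                      # Arabic numerals
    return "O"

def dictionary_feats(s, i, dictionary, k=8, dic_id=""):
    """L_j / R_j / M_j flags: a dictionary word of length j ends at, starts at,
    or contains the boundary; dic_id distinguishes multiple dictionaries."""
    feats = set()
    for j in range(1, k + 1):
        for start in range(max(0, i - j + 1), i + 2):
            w = s[start:start + j]
            if len(w) < j or w not in dictionary:
                continue
            if start + j - 1 == i:
                feats.add(f"{dic_id}L{j}")     # word ends at the boundary
            elif start == i + 1:
                feats.add(f"{dic_id}R{j}")     # word starts at the boundary
            elif start <= i < start + j - 1:
                feats.add(f"{dic_id}M{j}")     # boundary falls inside the word
    return sorted(feats)

text = "ボールペンを買った。"
i = 4                                          # boundary between ン and を
print(boundary_ngrams(text, i))                # the 9 features listed above
print(boundary_ngrams("".join(map(char_type, text)), i, prefix="T"))
print(dictionary_feats(text, i, {"ペン", "を"}))  # ['L2', 'R1']
```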
[Three-stage word segmentation corpus]
Japanese contains words whose boundaries are difficult to determine uniquely, and the appropriate segmentation differs depending on the situation. As an example, consider a keyword search over a collection of texts containing the word "ボールペン (bo-rupen)" (English: "ballpoint pen"). If "ボールペン (bo-rupen)" is not split, texts cannot be retrieved even by searching with the keyword "ペン (pen)" (English: "pen"), which lowers recall. On the other hand, if "ボールペン (bo-rupen)" is split into "ボール (bo-ru)" (English: "ball") and "ペン (pen)", then searching with the sporting good "ボール (bo-ru)" as the keyword retrieves texts containing "ボールペン (bo-rupen)", which lowers precision.

Therefore, as mentioned above, a three-stage word segmentation corpus is used that introduces the concept of "half split" in addition to the two values "split" and "non-split". The three-stage word segmentation corpus is a development of probabilistic word segmentation, in which the segmentation state is expressed by a probabilistic value. The strengths of word segmentation that humans can actually distinguish amount to only a few levels, so there is little need to express the segmentation state by continuous probabilistic values; for this reason the three-stage word segmentation corpus is used. For a word containing a half split, both the whole word and its constituent elements are extracted, so a word for which humans find it hard to decide between split and non-split can first be recorded as a half split, which makes the assignment of boundary information easy. A "half split" represents a state in which a boundary exists probabilistically (with a probability greater than 0 and less than 1) at an inter-character position.

The three-stage word segmentation corpus is a corpus generated by discrete probabilistic word segmentation with three-stage splitting, obtained by adding "half split" (bi = 0.5) to "split" (bi = 1) and "non-split" (bi = 0). For example, the splits (denoted by "/" here) inside compound nouns such as "ボール/ペン (bo-ru/pen)", compound verbs such as "折り/畳む (ori/tatamu)" (English: "fold"), and words lexicalized with affixes such as "お/すすめ (o/susume)" (English: "recommendation") are naturally defined as half splits. Likewise, "充電池 (juudenchi)" (English: "rechargeable battery") can be regarded as an "AB+BC→ABC"-type compound of "充電 (juuden)" (English: "recharge") and "電池 (denchi)" (English: "battery"); such a word is half split as "充/電/池 (juu/den/chi)".

The text "ボールペンを買った。" (bo-rupen wo katta) is segmented, for example, as shown in Fig. 3 by the word segmentation with pointwise estimation and the three-stage word segmentation corpus described above. In the example of Fig. 3, "split" (bi = 1) word boundary labels are assigned at the head of the text, between "ン (n)" and "を (wo)", and so on. A "half split" (bi = 0.5) word boundary label is assigned between "ル (ru)" and "ペ (pe)". The "non-split" (bi = 0) word boundary labels are omitted from Fig. 3, but this label is assigned to the inter-character positions where no boundary is shown (for example, between "ペ (pe)" and "ン (n)").

A word boundary label is assigned to each text as boundary information, and the texts are stored in a database as the learning corpus 20. Any method may be used to attach the boundary information to the texts. As one example, "split" may be represented by a space and "half split" by a hyphen, while the representation of "non-split" is omitted, so that the boundary information is embedded in each text. In this case, texts with boundary information can be recorded while remaining plain character strings.
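As an illustration, a minimal Python sketch that decodes such an annotated line back into a raw string and a sequence of boundary strengths might look as follows. The space/hyphen convention is the one given as an example above; the function name and the sample line are illustrative.

```python
def parse_annotated(line):
    """Decode a corpus line in which ' ' marks a split (b=1), '-' a half split
    (b=0.5), and adjacent characters an unmarked non-split (b=0).
    Returns (text, labels) with labels[i] = boundary between text[i] and text[i+1]."""
    text, labels = [], []
    pending = 0.0                     # strength of the boundary before the next char
    for ch in line:
        if ch == " ":
            pending = 1.0
        elif ch == "-":
            pending = 0.5
        else:
            if text:
                labels.append(pending)
            text.append(ch)
            pending = 0.0
    return "".join(text), labels

text, b = parse_annotated("ボール-ペン を 買った 。")
# text == "ボールペンを買った。"
# b    == [0.0, 0.0, 0.5, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0]
```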
The existing dictionary 31 is a set of a specific number of words and is prepared in advance as a database. The existing dictionary 31 may be a commonly used electronic dictionary, for example a morphological-analysis dictionary such as UniDic.

The large-scale text 40 is a collection of collected texts and is prepared in advance as a database. The large-scale text 40 may contain arbitrary sentences or character strings corresponding to the words to be extracted, the domains of those words, and so on. For example, a large number of product titles and descriptions may be collected from the websites of virtual shopping malls, and the large-scale text 40 may be built from such unrefined data. The number of texts prepared as the large-scale text 40 is overwhelmingly larger than the number of texts contained in the learning corpus 20.

With the above as a premise, the functional components of the dictionary generating device 10 are now described.

The model generating unit 11 is a unit that generates a word segmentation model by using the learning corpus 20 and the word dictionary 30. The model generating unit 11 has a support vector machine (SVM); it inputs the learning corpus 20 and the word dictionary 30 into this machine and executes learning processing, thereby generating a word segmentation model. The word segmentation model expresses the rules for how a text should be divided and is output as the parameter set used for word segmentation. The machine learning algorithm used is not limited to an SVM; a decision tree, logistic regression, or the like may also be used.

To analyze the large-scale text 40, the model generating unit 11 first causes the SVM to execute learning based on the learning corpus 20 and the existing dictionary 31, thereby generating an initial word segmentation model (the baseline model). The model generating unit 11 then outputs this word segmentation model to the analyzing unit 12.

Thereafter, when words are added to the word dictionary 30 by the processing of the analyzing unit 12, the selecting unit 13, and the registering unit 14 described below, the model generating unit 11 causes the SVM to execute learning (relearning) based on the learning corpus 20 and the whole word dictionary 30, thereby generating a revised word segmentation model. Here, the whole word dictionary 30 means the words originally stored in the existing dictionary 31 plus all the words obtained from the large-scale text 40.

The analyzing unit 12 is a unit that executes analysis (word segmentation) incorporating the word segmentation model on the large-scale text 40 and assigns (associates) boundary information to each text. As a result, a large number of texts such as the one shown in Fig. 3 are obtained. The analyzing unit 12 executes such word segmentation on each text forming the large-scale text 40, thereby assigning to each text the boundary information indicating "non-split" (first information), "split" (second information), and "half split" (third information) described above, and outputs all the processed texts to the selecting unit 13.

The analyzing unit 12 has two binary classifiers and uses them in sequence to assign the three kinds of boundary information to each text. The first classifier decides whether an inter-character position is "non-split" or something else, and the second classifier decides whether a boundary judged not to be "non-split" is "split" or "half split". Since most inter-character positions are in fact "non-split", first judging whether a position is "non-split" and then deciding the segmentation state only for the positions judged otherwise makes it possible to assign boundary information to a large amount of text efficiently. Combining binary classifiers also simplifies the structure of the analyzing unit 12.
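A minimal sketch of this two-classifier cascade, assuming scikit-learn's LinearSVC (a LIBLINEAR wrapper, consistent with the solver named in the evaluation below) and a reduced feature function standing in for the full character / character-type / dictionary feature set:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def simple_features(text, i, n=3):
    """Character n-grams touching boundary i -- a reduced stand-in for the full
    feature set described above."""
    left, right = text[max(0, i - n + 1):i + 1], text[i + 1:i + 1 + n]
    feats = {f"L:{left[-l:]}": 1.0 for l in range(1, len(left) + 1)}
    feats.update({f"R:{right[:l]}": 1.0 for l in range(1, len(right) + 1)})
    return feats

def train_cascade(corpus):
    """corpus: iterable of (text, labels) pairs with labels[i] in {0, 0.5, 1}.
    Stage 1 separates non-split (0) from everything else; stage 2 is trained
    only on split-like positions and separates split (1) from half split (0.5).
    Assumes the corpus contains both split and half-split positions."""
    X1, y1, X2, y2 = [], [], [], []
    for text, labels in corpus:
        for i, b in enumerate(labels):
            f = simple_features(text, i)
            X1.append(f)
            y1.append(0 if b == 0 else 1)
            if b != 0:
                X2.append(f)
                y2.append(1 if b == 1 else 0)
    vec = DictVectorizer().fit(X1)
    s1 = LinearSVC().fit(vec.transform(X1), y1)   # LIBLINEAR-backed linear SVMs
    s2 = LinearSVC().fit(vec.transform(X2), y2)
    return vec, s1, s2

def predict_boundaries(text, vec, s1, s2):
    """Label every inter-character position of an unsegmented text."""
    labels = []
    for i in range(len(text) - 1):
        x = vec.transform([simple_features(text, i)])
        if s1.predict(x)[0] == 0:
            labels.append(0.0)        # non-split: the common case, decided first
        else:
            labels.append(1.0 if s2.predict(x)[0] == 1 else 0.5)
    return labels
```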
The selecting unit 13 is a unit that selects the words to be registered in the word dictionary 30 from the texts to which the analyzing unit 12 has assigned boundary information.

First, the selecting unit 13 obtains the total occurrence frequency fr(w) of each word w contained in the input texts according to equation (1) below. This computation amounts to deriving the occurrence frequency from the boundary information bi assigned to the inter-character positions:

fr(w) = Σ_{o ∈ O(w)} b_s(o) · Π_{s(o) < i < e(o)} (1 − b_i) · b_e(o)   …(1)

Here, O(w) denotes the set of occurrences of the surface form of the word w, s(o) and e(o) denote the boundaries at the start and end of an occurrence o, and the virtual boundaries at the head and tail of a text are treated as b = 1.

In the single sentence "ボールペンを買った。" (bo-rupen wo katta) shown in Fig. 3, the occurrence frequency of the word "ボールペン (bo-rupen)" is 1.0 × 1.0 × 1.0 × 0.5 × 1.0 × 1.0 = 0.5, and the occurrence frequency of the word "ペン (pen)" in the sentence is 0.5 × 1.0 × 1.0 = 0.5. This means that the words "ボールペン (bo-rupen)" and "ペン (pen)" are each regarded as occurring 0.5 times in this sentence. The selecting unit 13 obtains the occurrence frequency of each word contained in each text and sums the frequencies per word, thereby obtaining the total occurrence frequency of each word.
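The following sketch computes these total frequencies from boundary-labelled texts; the helper name and the max_len cutoff are illustrative additions (the patent itself fixes only equation (1)):

```python
from collections import defaultdict

def word_frequencies(annotated_texts, max_len=8):
    """Accumulate f_r(w) over (text, labels) pairs, where labels[i] is the
    boundary strength b_i between text[i] and text[i+1]. An occurrence
    w = text[s:e] contributes b_s * prod(1 - b_i, s < i < e) * b_e, with the
    virtual boundaries at the ends of each text fixed at 1."""
    freq = defaultdict(float)
    for text, labels in annotated_texts:
        b = [1.0] + list(labels) + [1.0]   # b[i] = boundary just before text[i]
        for s in range(len(text)):
            for e in range(s + 1, min(s + max_len, len(text)) + 1):
                p = b[s] * b[e]
                for i in range(s + 1, e):
                    p *= 1.0 - b[i]
                if p > 0:
                    freq[text[s:e]] += p
    return freq

texts = [("ボールペンを買った。", [0, 0, 0.5, 0, 1, 1, 0, 0, 1])]
f = word_frequencies(texts)
# f["ボールペン"] == 0.5 and f["ペン"] == 0.5, matching the worked example above
```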
Next, from the words in the large-scale text 40, the selecting unit 13 selects as registration candidates V only the words whose total occurrence frequency is equal to or higher than a first threshold THa (frequency-based cutoff of words). The selecting unit 13 then selects from the registration candidates V the words to be finally registered in the word dictionary 30 and, where necessary, decides the dictionary (database) in which each word is stored. The method of deciding the finally registered words and the destination dictionary is not limited to one; various methods such as the following can be used.

The selecting unit 13 may decide to add to the existing dictionary 31 only the words among the registration candidates V whose total occurrence frequency is equal to or higher than a specific threshold. In this case, the selecting unit 13 may select only the words whose total occurrence frequency is equal to or higher than a second threshold THb (where THb > THa), or may select only the top n words by total occurrence frequency. This processing is hereinafter called "APPEND".

Alternatively, the selecting unit 13 may decide to register in the additional dictionary 32 only the words among the registration candidates V whose total occurrence frequency is equal to or higher than a specific threshold. In this case, the selecting unit 13 may select only the words whose total occurrence frequency is equal to or higher than the second threshold THb (where THb > THa), or may select only the top n words by total occurrence frequency. This processing is hereinafter called "TOP".

Alternatively again, the selecting unit 13 may decide to register all the registration candidates V in the additional dictionary 32. This processing is hereinafter called "ALL".

Alternatively, the selecting unit 13 may decide to divide the registration candidates V into a plurality of subsets according to total occurrence frequency and register each subset in a separate additional dictionary 32. Let Vn denote the subset of the registration candidates V consisting of the top n words by total occurrence frequency. In this case, the selecting unit 13 generates, for example, a subset V1000 containing the top 1,000 words, a subset V2000 containing the top 2,000 words, and a subset V3000 containing the top 3,000 words. The selecting unit 13 then decides to register the subsets V1000, V2000, and V3000 in a first additional dictionary 32, a second additional dictionary 32, and a third additional dictionary 32, respectively. The number of subsets generated and the size of each subset may be determined arbitrarily. This processing is hereinafter called "MULTI".
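A sketch of the cutoff and the four registration strategies follows. The function names are illustrative, the THb option of APPEND/TOP is simplified to a top-n selection, and APPEND differs from TOP only in its destination (existing versus additional dictionary), which is outside the scope of this snippet:

```python
def select_candidates(freq, tha=20.0):
    """Frequency cutoff: keep words with total frequency >= THa
    (20 in the evaluation below)."""
    return {w: f for w, f in freq.items() if f >= tha}

def top_n(cand, n):
    """The n candidates with the highest total frequency."""
    return [w for w, _ in sorted(cand.items(), key=lambda kv: -kv[1])[:n]]

def choose(cand, strategy="MULTI", n=1000, sizes=(1000, 2000, 3000)):
    """Returns a list of word lists, one per destination dictionary.
    APPEND: top-n words, destined for the existing dictionary.
    TOP:    top-n words, destined for one additional dictionary.
    ALL:    every candidate, destined for one additional dictionary.
    MULTI:  nested top-n subsets, one additional dictionary per subset."""
    if strategy in ("APPEND", "TOP"):
        return [top_n(cand, n)]
    if strategy == "ALL":
        return [list(cand)]
    if strategy == "MULTI":
        return [top_n(cand, k) for k in sizes]
    raise ValueError(strategy)
```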
Having selected the finally registered words and decided the destination dictionaries, the selecting unit 13 outputs the selection result to the registering unit 14.

The registering unit 14 is a unit that registers the words selected by the selecting unit 13 in the word dictionary 30. In which dictionary within the word dictionary 30 a word is registered depends on the processing in the selecting unit 13, so the registering unit 14 may register words only in the existing dictionary 31 or only in one additional dictionary 32. In the case of the "MULTI" processing described above, the registering unit 14 divides the selected words among a plurality of additional dictionaries 32 for registration.

As described above, the words added to the word dictionary 30 are used to revise the word segmentation model, but the word dictionary 30 may also be used for purposes other than word segmentation. For example, the word dictionary 30 may be used for morphological analysis, for displaying input candidates in an input box with an autocomplete function, or for a knowledge database for extracting proper nouns.

Next, the operation of the dictionary generating device 10 is described with reference to Fig. 4, together with the dictionary generating method of this embodiment.

First, the model generating unit 11 executes learning on the SVM based on the learning corpus 20 and the existing dictionary 31, thereby generating an initial word segmentation model (the baseline model) (step S11, model generating step). Next, the analyzing unit 12 executes analysis (word segmentation) incorporating the baseline model on the large-scale text 40 and assigns (associates) the boundary information indicating "split", "half split", or "non-split" to each text (step S12, analyzing step).

The selecting unit 13 then selects the words to be registered in the dictionary (selecting step). Specifically, the selecting unit 13 calculates the total occurrence frequency of each word based on the texts with boundary information (step S13) and selects the words whose frequency is equal to or higher than the specific threshold as registration candidates (step S14). The selecting unit 13 then selects from the registration candidates the words to be finally registered in the dictionary and decides the dictionaries in which to register them (step S15). The selecting unit 13 can select words and designate dictionaries using the APPEND, TOP, ALL, MULTI, and other methods described above.

The registering unit 14 then registers the selected words in the designated dictionaries based on the processing in the selecting unit 13 (step S16, registering step).

With the above processing, the addition of words to the word dictionary 30 is complete. In this embodiment, the word segmentation model is revised using the expanded word dictionary 30. That is, the model generating unit 11 generates a revised word segmentation model by relearning based on the learning corpus 20 and the whole word dictionary 30 (step S17).
Next, a dictionary generating program P1 for causing a computer to function as the dictionary generating device 10 is described with reference to Fig. 5.

The dictionary generating program P1 comprises a main module P10, a model generating module P11, an analyzing module P12, a selecting module P13, and a registering module P14.

The main module P10 is the part that controls the dictionary generating function as a whole. The functions realized by executing the model generating module P11, the analyzing module P12, the selecting module P13, and the registering module P14 are the same as the functions of the model generating unit 11, the analyzing unit 12, the selecting unit 13, and the registering unit 14 described above, respectively.

The dictionary generating program P1 is provided, for example, fixedly recorded on a tangible recording medium such as a CD-ROM, a DVD-ROM, or a semiconductor memory. The dictionary generating program P1 may also be provided via a network as a data signal superimposed on a carrier wave.

As described above, according to this embodiment, a word segmentation model is generated using the learning corpus 20 with boundary information and the existing dictionary 31, and word segmentation incorporating the model is applied to the large-scale text 40. Words are then selected from the text collection to which this application has assigned boundary information and registered in the word dictionary 30. By assigning boundary information to the text collection through analysis that also uses the learning corpus 20 and then registering the words extracted from that collection, a large-scale word dictionary 30 can be constructed easily.

For example, "スマホケース (sumahoke-su)" (English: "smartphone case") is divided into "スマホ (sumaho)" and "ケース (ke-su)", so the hitherto unknown word "スマホ (sumaho)" can be registered in the dictionary. "スマホ (sumaho)" is an abbreviation of the Japanese "スマートフォン (suma-tofon)". Likewise, the unknown word "うっとりん (uttororin)" (a variant of the Japanese "うっとり (uttori)", roughly equivalent to the English "fascinated") can be registered in the dictionary. Text analysis using the constructed dictionary can then perform word segmentation of sentences containing the registered words (for example, sentences containing "スマホ (sumaho)" or "うっとりん (uttororin)") with higher precision.

Next, an example of the evaluation of the word segmentation performance of the dictionary generating device 10 of this embodiment is presented. As the evaluation indices of word segmentation, with NREF denoting the number of words contained in the reference (correctly segmented) corpus, NSYS the number of words contained in the analysis result, and NCOR the number of words contained in both the analysis result and the reference corpus, precision (Prec), recall (Rec), and the F-measure are defined as follows.
Prec = NCOR / NSYS

Rec = NCOR / NREF

F = 2 · Prec · Rec / (Prec + Rec)
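These indices can be computed directly from two boundary-label sequences over the same text, as in the following sketch (binary labels only; half splits would first have to be resolved to one of the two segmentations; the function names are illustrative):

```python
def spans(labels):
    """Word spans (start, end) implied by binary boundary labels (1 = split)."""
    out, s = [], 0
    for i, b in enumerate(labels, start=1):
        if b == 1:
            out.append((s, i))
            s = i
    out.append((s, len(labels) + 1))   # the final word runs to the end of the text
    return set(out)

def prf(ref_labels, sys_labels):
    """Word-level precision, recall and F-measure; a word counts as correct
    only if both of its boundaries match the reference."""
    ref, sys = spans(ref_labels), spans(sys_labels)
    cor = len(ref & sys)               # N_COR
    if cor == 0:
        return 0.0, 0.0, 0.0
    prec, rec = cor / len(sys), cor / len(ref)
    return prec, rec, 2 * prec * rec / (prec + rec)
```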
The main entry list of UniDic (304,267 distinct words) was used as the existing dictionary, and LIBLINEAR with default parameters was used as the support vector machine. Half-width characters in the learning corpus and the large-scale text were all unified to full-width; no other normalization was performed.

First, the effectiveness in the case where the learning corpus and the large-scale text are in the same domain (in-domain learning) is described. Here, a domain is a concept for grouping sentences and words on the basis of style, content (genre), and so on. For in-domain learning, a three-stage word segmentation learning corpus was created from the titles and descriptions of 590 products randomly extracted without genre bias from the website of virtual shopping mall A and from the descriptions of 50 products randomly extracted from the website of virtual shopping mall B. The learning corpus contains about 110,000 words and about 340,000 characters. Performance was evaluated using this learning corpus.

As the large-scale text, the titles and descriptions of all the product data in virtual shopping mall A were used: about 27 million products and about 16 billion characters.

When this large-scale text was analyzed with the baseline model and two-stage word segmentation was executed, 576,954 distinct words were extracted; when three-stage word segmentation was executed after the analysis, 603,187 distinct words were extracted. Here, the threshold used for the frequency-based cutoff of words was 20. With "MULTI" above, the top 100,000, 200,000, 300,000, and 400,000 words by total occurrence frequency and the whole candidate set were added as separate dictionaries. With "TOP" above, only the top 100,000 words were used.

Table 1 shows the learning result with the baseline model, the relearning result using the word dictionary obtained by two-stage word segmentation, and the relearning result using the word dictionary obtained by three-stage segmentation. All values in Table 1 are percentages (%).

With relearning using two-stage segmentation, the F-measure rose regardless of which method (APPEND/TOP/ALL/MULTI) was used to add words, which shows that the proposed learning from large-scale text is effective. The increase in F-measure grows in the order APPEND < TOP < ALL < MULTI. This result shows that, when adding words, adding them to other dictionaries is more effective than adding them to the existing dictionary, and further that adding them to separate dictionaries according to occurrence frequency is more effective than registering all added words in a single additional dictionary.

From Table 1, the classifier appears to learn automatically the contributions and weights that depend on the occurrence frequency of a word. Furthermore, with relearning using three-stage word segmentation, performance improved beyond the baseline model and two-stage word segmentation in all cases. Specifically, considering half splits yielded improvements such as correctly obtaining words with affixes.
Next, the effectiveness in the case where the learning corpus and the large-scale text are in different domains is described. The learning corpus used is the same as in the in-domain learning above. The large-scale text, on the other hand, consists of user reviews, accommodation facility names, accommodation plan names, and replies from accommodation facilities on travel reservation website C: 348,564 texts with about 126 million characters. From this large-scale text, 150 and 50 reviews were randomly extracted and segmented into words by hand, and were used as a test corpus and an active-learning corpus (an addition to the learning corpus), respectively.

First, the large-scale text in the travel domain was analyzed using the baseline model learned from the learning corpus of the product domain above. This analysis performance is "baseline" in Table 2 below.

Next, the domain-adaptation corpus was added to the learning corpus of the product domain, a word segmentation model was learned, and the large-scale text was analyzed with it. This analysis performance is "domain adaptation" in Table 2 below. Analyzing the large-scale text extracted 41,671 distinct words with two-stage word segmentation and 44,247 distinct words with three-stage segmentation. In both cases, only words with a total occurrence frequency of 5 or more were used.

The words thus obtained were added to the dictionary, and Table 2 shows the results of learning the model using the learning corpus and the domain-adaptation corpus. All values in Table 2 are percentages (%).

As the table shows, when the learning corpus and the large-scale text are in different domains, a performance improvement is observed with three-stage word segmentation.

The present invention has been described above in detail on the basis of this embodiment. However, the present invention is not limited to the above embodiment. Various modifications can be made to the present invention without departing from its gist.

In the above embodiment, the selecting unit 13 selects words based on occurrence frequency, but the selecting unit 13 may also register all words in the existing dictionary 31 or the additional dictionary 32 without referring to their occurrence frequencies. The cutoff of words is thus not an essential process.

In the above embodiment, the analyzing unit 12 analyzes the whole of the large-scale text 40 before the processing by the selecting unit 13 and the registering unit 14, but the analyzing unit 12 may also divide the large amount of collected text and analyze it in a plurality of passes. In this case, the series of processing comprising the model generating step, the analyzing step, the selecting step, and the registering step is repeated a plurality of times. For example, when the large-scale text 40 is divided into groups 1 to 3, group 1 is analyzed and its words are registered in the first pass, group 2 is analyzed and its words are registered in the second pass, and group 3 is analyzed and its words are registered in the third pass. In the processing from the second pass onward, the model generating unit 11 refers to the whole word dictionary 30 to generate a revised word segmentation model, as sketched below.
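The control flow of this grouped variant might look as follows. This driver reuses train_cascade, predict_boundaries, word_frequencies, and select_candidates from the sketches above, reduces registration to a single growing set, and omits the MULTI-style dictionary split; note that in the full system the grown dictionary feeds back into retraining through the dictionary features, which the reduced feature function above does not include.

```python
def iterative_dictionary_build(corpus, seed_dictionary, text_groups):
    """Analyse one group, register its words, retrain on the learning corpus
    plus the whole (grown) word dictionary, then move on to the next group."""
    dictionary = set(seed_dictionary)
    for group in text_groups:
        vec, s1, s2 = train_cascade(corpus)                   # steps S11 / S17
        analysed = [(t, predict_boundaries(t, vec, s1, s2))   # step S12
                    for t in group]
        freq = word_frequencies(analysed)                     # step S13
        dictionary |= set(select_candidates(freq))            # steps S14-S16
    return dictionary
```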
The above embodiment uses the three-stage segmentation method, so there are three kinds of boundary information, but the form of the boundary information is not limited to this example. For example, two-stage word segmentation may be performed using only the two kinds of boundary information "split" and "non-split". Word segmentation of four or more stages may also be performed using "split", "non-split", and a plurality of probabilistic splits; for example, four-stage word segmentation using the probabilistic splits (third information) bi = 0.33 and bi = 0.67 is possible. In any case, the split strength corresponding to the third information is greater than the strength of the "non-split" boundary information (for example, bi = 0) and less than the strength of the "split" boundary information (for example, bi = 1).

According to this embodiment, a large-scale word dictionary can be constructed easily.
10‧‧‧dictionary generating device
11‧‧‧model generating unit
12‧‧‧analyzing unit
13‧‧‧selecting unit
14‧‧‧registering unit
20‧‧‧learning corpus
30‧‧‧word dictionary
31‧‧‧existing dictionary (word group)
32‧‧‧additional dictionary
40‧‧‧large-scale text (collection of collected texts)
P1‧‧‧dictionary generating program
P10‧‧‧main module
P11‧‧‧model generating module
P12‧‧‧analyzing module
P13‧‧‧selecting module
P14‧‧‧registering module
Fig. 1 is a diagram showing the hardware configuration of the dictionary generating device of the embodiment.
Fig. 2 is a block diagram showing the functional configuration of the dictionary generating device shown in Fig. 1.
Fig. 3 is a diagram for explaining the setting of boundary information (word boundary labels).
Fig. 4 is a flowchart showing the operation of the dictionary generating device shown in Fig. 1.
Fig. 5 is a diagram showing the configuration of the dictionary generating program of the embodiment.
Claims (16)
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201261604266P | 2012-02-28 | 2012-02-28 | |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| TW201335776A (en) | 2013-09-01 |
| TWI452475B (en) | 2014-09-11 |
Family
ID=49081915
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| TW101133547A TWI452475B (en) | A dictionary generating device, a dictionary generating method, a dictionary generating program product, and a computer readable memory medium storing the program | 2012-02-28 | 2012-09-13 |
Country Status (5)
| Country | Link |
|---|---|
| JP (1) | JP5373998B1 (en) |
| KR (1) | KR101379128B1 (en) |
| CN (1) | CN103608805B (en) |
| TW (1) | TWI452475B (en) |
| WO (1) | WO2013128684A1 (en) |
Families Citing this family (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105701133B (en) * | 2014-11-28 | 2021-03-30 | 方正国际软件(北京)有限公司 | Address input method and equipment |
| JP6813776B2 (en) * | 2016-10-27 | 2021-01-13 | キヤノンマーケティングジャパン株式会社 | Information processing device, its control method and program |
| JP6746472B2 (en) * | 2016-11-11 | 2020-08-26 | ヤフー株式会社 | Generation device, generation method, and generation program |
| JP6707483B2 (en) * | 2017-03-09 | 2020-06-10 | 株式会社東芝 | Information processing apparatus, information processing method, and information processing program |
| EP3446241A4 (en) * | 2017-06-20 | 2019-11-06 | Accenture Global Solutions Limited | AUTOMATIC EXTRACTION OF A LEARNING CORPUS FOR A DATA CLASSIFIER BASED ON AUTOMATIC LEARNING ALGORITHMS |
| JP2019049873A (en) * | 2017-09-11 | 2019-03-28 | 株式会社Screenホールディングス | Synonym dictionary creation apparatus, synonym dictionary creation program, and synonym dictionary creation method |
| CN109033183B (en) * | 2018-06-27 | 2021-06-25 | 清远墨墨教育科技有限公司 | Editable cloud word stock analysis method |
| KR102543343B1 (en) * | 2023-03-07 | 2023-06-16 | 주식회사 로이드케이 | Method and device for generating search word dictionary and searching based on artificial neural network |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPH09288673A (en) * | 1996-04-23 | 1997-11-04 | Nippon Telegr & Teleph Corp <Ntt> | Japanese morphological analysis method and device and dictionary unregistered word collection method and device |
| JP2002351870A (en) * | 2001-05-29 | 2002-12-06 | Communication Research Laboratory | Method for analyzing morpheme |
| TW200729001A (en) * | 2005-01-31 | 2007-08-01 | Nec China Co Ltd | Dictionary learning method and device using the same, input method and user terminal device using the same |
| JP2008257511A (en) * | 2007-04-05 | 2008-10-23 | Yahoo Japan Corp | Technical term extraction device, method and program |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN1086821C (en) * | 1998-08-13 | 2002-06-26 | 英业达股份有限公司 | The Method and System of Chinese Sentence Segmentation |
- 2012
  - 2012-09-03 JP JP2013515598A patent/JP5373998B1/en active Active
  - 2012-09-03 CN CN201280030052.2A patent/CN103608805B/en active Active
  - 2012-09-03 KR KR1020137030410A patent/KR101379128B1/en active Active
  - 2012-09-03 WO PCT/JP2012/072350 patent/WO2013128684A1/en not_active Ceased
  - 2012-09-13 TW TW101133547A patent/TWI452475B/en active
Also Published As
| Publication number | Publication date |
|---|---|
| TW201335776A (en) | 2013-09-01 |
| WO2013128684A1 (en) | 2013-09-06 |
| KR101379128B1 (en) | 2014-03-27 |
| CN103608805A (en) | 2014-02-26 |
| CN103608805B (en) | 2016-09-07 |
| KR20130137048A (en) | 2013-12-13 |
| JP5373998B1 (en) | 2013-12-18 |
| JPWO2013128684A1 (en) | 2015-07-30 |
Similar Documents
| Publication | Title |
|---|---|
| TWI452475B (en) | A dictionary generating device, a dictionary generating method, a dictionary generating program product, and a computer readable memory medium storing the program |
| CN109791569B (en) | Causality identification device and storage medium |
| CN108363790B (en) | Method, device, equipment and storage medium for evaluating comments |
| CN108287858B (en) | Semantic extraction method and device for natural language |
| CN111444320A (en) | Text retrieval method and device, computer equipment and storage medium |
| CN110851590A (en) | Method for classifying texts through sensitive word detection and illegal content recognition |
| US11170169B2 (en) | System and method for language-independent contextual embedding |
| US20160189057A1 (en) | Computer implemented system and method for categorizing data |
| CN106030568B (en) | Natural language processing system, natural language processing method and natural language processing program |
| CN113076748A (en) | Method, device and equipment for processing bullet screen sensitive words and storage medium |
| CN113420127A (en) | Threat information processing method, device, computing equipment and storage medium |
| JP6186198B2 (en) | Learning model creation device, translation device, learning model creation method, and program |
| CN101114282A (en) | A word segmentation processing method and device |
| CN114548082B (en) | A syntax analysis method, device and readable storage medium |
| JP2011238159A (en) | Computer system |
| US20150019382A1 (en) | Corpus creation device, corpus creation method and corpus creation program |
| JP5169456B2 (en) | Document search system, document search method, and document search program |
| CN109902162B (en) | Text similarity identification method based on digital fingerprints, storage medium and device |
| JP6689466B1 (en) | Sentence structure vectorization device, sentence structure vectorization method, and sentence structure vectorization program |
| CN112364666A (en) | Text representation method and device and computer equipment |
| JP5184195B2 (en) | Language processing apparatus and program |
| JP4088171B2 (en) | Text analysis apparatus, method, program, and recording medium recording the program |
| JP6303508B2 (en) | Document analysis apparatus, document analysis system, document analysis method, and program |
| JP5289032B2 (en) | Document search device |
| JP5506482B2 (en) | Named entity extraction apparatus, string-named expression class pair database creation apparatus, named entity extraction method, string-named expression class pair database creation method, program |