TWI636370B - Establishing chart indexing method and computer program product by text information - Google Patents
- Publication number
- TWI636370B (application TW105140773A)
- Authority
- TW
- Taiwan
- Prior art keywords
- vocabulary
- chart
- information
- text information
- words
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a method for building a chart index from text information, and a computer program product implementing it. The method comprises the following steps: reading the paragraph or sentence information associated with a target chart in a file; analyzing the collected paragraph or sentence information to extract a plurality of key terms; weighting and ranking the key terms to select one or more candidate terms; and generating a virtual chart message directory and figure index information from the candidate terms.
Description
The invention relates to an indexing method and a computer program product thereof, and in particular to a method for building a chart index from text information and a computer program product thereof.
In current file management systems, a user who wants to look up content inside a specific file usually enters a keyword to locate the region where the keyword occurs, and then looks for the required text within that region.
Because the prior art can only search text content, it cannot effectively locate the corresponding chart when a user wants to find a specific chart. The user must instead search the file manually, page by page, which is inconvenient.
How to provide a technical means that solves the above problem is therefore a technical problem that needs to be solved in this field.
To solve the above problem, the purpose of this application is to provide a technical solution for building a chart index from text information.
To achieve this purpose, this application proposes a method for building a chart index from text information, comprising the following steps: reading the paragraph or sentence information associated with a target chart in a file; analyzing the collected paragraph or sentence information to extract a plurality of key terms; weighting and ranking the key terms to select one or more candidate terms; and generating a virtual chart message directory and figure index information from the candidate terms.
To achieve this purpose, this application also proposes a computer program product for building a chart index from text information. When a computer device loads and executes the computer program product, it carries out the steps of the method described above.
In summary, the method and computer program product extract key terms and use them to find the chart content that corresponds to a keyword, which effectively addresses the shortcoming of the prior art.
S11~S42: Steps
FIG. 1 is a flowchart of an embodiment of the invention.
FIG. 2 is a schematic diagram of the structure of an example file.
FIG. 3 is a flowchart of the model training steps.
FIG. 4 is a flowchart of the recognition process applied to paragraphs or sentences.
FIG. 5 is a flowchart of the analysis process applied to paragraphs or sentences.
FIG. 6 is a schematic diagram of the user browsing screen of the embodiment.
Specific embodiments are described below to illustrate how the invention may be implemented; they are not intended to limit the scope of protection sought.
In a first embodiment, the invention provides a method for building a chart index from text information. The method comprises the following steps: reading the paragraph or sentence information associated with a target chart in a file; analyzing the collected paragraph or sentence information to extract a plurality of key terms; weighting and ranking the key terms to select one or more candidate terms; and generating a virtual chart message directory and figure index information from the candidate terms.
In another embodiment, the method analyzes the collected paragraph or sentence information by named entity recognition (NER) to extract the key terms. In another embodiment, the method analyzes the collected paragraph or sentence information by statistical term extraction to extract the key terms.
In another embodiment, the named entity recognition extracts terms that are specific proper nouns. In another embodiment, the proper nouns include at least one of personal names, place names, and organization names.
In another embodiment, the statistical term extraction uses suffix-array term extraction to extract specific terms. In another embodiment, the method runs on a cloud data management system.
In another embodiment, the method performs the weighting according to term frequency and inverse document frequency. In another embodiment, the method multiplies a term's high frequency within a file by its low document frequency across all files to produce the corresponding weight value and to filter out specific terms.
In a second embodiment, the invention further provides a computer program product for building a chart index from text information. When a computer device loads and executes the computer program product, it carries out the steps of the method described above.
The invention is described below in terms of the method of the first embodiment; the computer program product of the second embodiment achieves the same or similar technical effects.
Refer next to FIG. 1, a flowchart of an embodiment of the invention. In this embodiment the method runs on a cloud data management system, although its application is not limited to this. The steps are as follows:
S11: Collect the paragraphs or sentences in the file that relate to the chart.
S12: Use named entity recognition to analyze those paragraphs or sentences and extract the key terms related to the chart.
S13: Use statistical term extraction to analyze those paragraphs or sentences and extract the key terms related to the chart.
S14: Rank the key terms from each technique by the weight formula, combine the two rankings into a second weighted ranking, and assign the top n key terms to the chart.
S15: Generate a virtual chart message directory or figure index for users to search.
The steps are described in detail below. S11 collects the paragraphs or sentences in the file that relate to the chart. The collection consists of (1) the chart's own name, (2) the paragraphs that appear before and after the chart, (3) the content in the body of the file that mentions the chart by keyword, (4) the paragraphs before and after those mentions, (5) the file annotations, and (6) the file tag text.
Refer to FIG. 2, a schematic diagram of the structure of an example file. Suppose the file is a Word file whose content includes paragraphs (paragraph 1 to paragraph 5) and figures with their captions (Figure 2-2-1 with image name A, and Figure 2-2-2 with image name B). At run time, sentences are first parsed automatically. For Figure 2-2-1, the method collects image name A, the first five lines of paragraph 2 and of paragraph 3 (the paragraphs directly above and below the image), the sentences in the body that mention Figure 2-2-1, and the file's existing annotations or tags. Together these paragraphs and sentences form the corpus for the subsequent recognition and term extraction.
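The collection in S11 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the `Block` structure, the `collect_context` function, and the idea of representing a file as a flat list of blocks are all hypothetical, and source (4), the paragraphs around each mention, is omitted for brevity. Mentions are matched at paragraph level rather than sentence level.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One unit of a parsed file (hypothetical structure)."""
    kind: str        # "paragraph" or "figure"
    text: str        # paragraph body, or the figure's own name
    label: str = ""  # figure label such as "Figure 2-2-1" (figures only)


def first_lines(text, limit=5):
    """Return at most `limit` lines of a paragraph."""
    return "\n".join(text.splitlines()[:limit])


def collect_context(blocks, label, annotations=(), tags=()):
    """Gather the text associated with the figure identified by `label`."""
    corpus = []
    for i, block in enumerate(blocks):
        if block.kind == "figure" and block.label == label:
            corpus.append(block.text)  # (1) the chart's own name
            # (2) the paragraphs directly before and after the chart
            for j in (i - 1, i + 1):
                if 0 <= j < len(blocks) and blocks[j].kind == "paragraph":
                    corpus.append(first_lines(blocks[j].text))
        elif block.kind == "paragraph" and label in block.text:
            corpus.append(block.text)  # (3) body text that mentions the chart
    corpus.extend(annotations)  # (5) file annotations
    corpus.extend(tags)         # (6) file tag text
    return corpus
```

The returned list plays the role of the analysis corpus that S12 and S13 consume.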
Step S12 uses named entity recognition to analyze the paragraphs or sentences related to the chart and extract the key terms related to it; it operates on the collection produced by S11. Named entity recognition is a natural language processing technique that extracts common proper nouns such as personal names, place names, and organization names from full-text documents. It requires a supervised model to be built in advance, which can be done with the model training steps of FIG. 3, described as follows:
S21: Define the tag types. Tags are set according to personal names, place names, and organization names. Seven tags are currently defined: person-name beginning (B_PER), person-name inside (I_PER), place-name beginning (B_LOC), place-name inside (I_LOC), organization-name beginning (B_ORG), organization-name inside (I_ORG), and non-proper-noun (O).
S22: Collect the training corpus. Gather a large set of sentences from the relevant domain for model training.
S23: Define the features. For each character or word, define questions and their decision values, for example whether the character or word is one of the Hundred Family Surnames, with 1 for yes and 0 for no. Many such features may exist; together they form a feature vector.
S24: Build the feature dictionary and the rule dictionary.
S25: Train the model. Conditional random fields (CRFs) are trained on the "tag, feature vector" pair of each character in the sentences. A conditional random field is an undirected graph model: the vertices represent random variables, and the edges between vertices represent dependencies between those variables. In a conditional random field, the distribution of the random variable Y is a conditional probability, and the given observation is the random variable X. In principle the graph layout of a conditional random field can be chosen arbitrarily; the layout in common use is the linear chain, for which efficient algorithms exist for training, inference, and decoding of the tags.
S26: Evaluate on real text. Test the model on given sentences: extract their feature vectors and measure the model's accuracy. If the model misrecognizes text, either add entries to the dictionary files as pre-processing, or manually correct the output as post-processing and retrain the model.
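The tag set of S21 and the feature definition of S23 can be illustrated with a short Python sketch. It only builds per-character feature dictionaries of the kind a CRF trainer would consume alongside tag sequences; it does not train a model. The `SURNAMES` set is a tiny hypothetical stand-in for a real feature dictionary, and the features other than the surname check are assumptions added for illustration.

```python
# The seven tags named in S21.
TAGS = ["B_PER", "I_PER", "B_LOC", "I_LOC", "B_ORG", "I_ORG", "O"]

# Hypothetical, deliberately tiny surname list; a real feature
# dictionary (S24) would be far larger.
SURNAMES = {"蔡", "陳", "林", "王"}


def char_features(sentence, i):
    """Build the feature dictionary for the character at position i."""
    ch = sentence[i]
    return {
        "char": ch,
        # The yes/no question from S23: 1 for yes, 0 for no.
        "is_surname": 1 if ch in SURNAMES else 0,
        # Assumed extra features: digit check and neighbouring characters.
        "is_digit": 1 if ch.isdigit() else 0,
        "prev_char": sentence[i - 1] if i > 0 else "<BOS>",
        "next_char": sentence[i + 1] if i < len(sentence) - 1 else "<EOS>",
    }


def sentence_features(sentence):
    """Return one feature dictionary per character of the sentence."""
    return [char_features(sentence, i) for i in range(len(sentence))]
```

Each training sentence would then be represented as a list of such dictionaries paired with a list of tags of the same length.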
Once the model is built, it can be used to analyze the paragraphs or sentences collected in step S11. The recognition process of FIG. 4 extracts the key terms related to the chart, as follows:
S31: Feature extraction.
S32: Lookup in the feature dictionary and the rule dictionary.
S33: Model recognition. The model tags the paragraphs or sentences related to the chart in the file under analysis.
S34: Output formatting. The model predicts a tag for each character in the sentence, and terms are segmented by the B and I tags. For example, the sentence 「第六任董事長蔡力行先生」 can be tagged by the model as "O-O-O-O-O-O-B_PER-I_PER-I_PER-O-O". The characters from a B tag through the last I tag that follows it are taken as a single term.
S35: Tagging result analysis. Segment the sentence by its tags and extract the proper nouns.
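Steps S34 and S35 amount to decoding a B/I/O tag sequence into terms. The following Python sketch shows one way to do this, using the example sentence and tag sequence from S34; the function name `extract_entities` is an assumption made for illustration.

```python
def extract_entities(chars, tags):
    """Group characters into (term, type) pairs using B_/I_ tags."""
    entities = []
    current, current_type = [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B_"):
            # A new entity begins; close any entity still open.
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [ch], tag[2:]
        elif tag.startswith("I_") and current and tag[2:] == current_type:
            # Continuation of the entity that is currently open.
            current.append(ch)
        else:
            # An O tag (or a stray I tag) ends the open entity, if any.
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append(("".join(current), current_type))
    return entities


sentence = "第六任董事長蔡力行先生"
tags = "O-O-O-O-O-O-B_PER-I_PER-I_PER-O-O".split("-")
print(extract_entities(sentence, tags))  # [('蔡力行', 'PER')]
```

The extracted personal name is the proper noun that S35 passes on as a key term.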
Step S13 uses statistical term extraction to analyze the paragraphs or sentences related to the chart and extract the key terms related to it; it operates on the collection produced by S11. Its steps, shown in FIG. 5, are:
S41: Extract terms by suffix-array term extraction.
S42: Filter the terms by rules.
S41: Extract terms by suffix-array term extraction. This method converts the string into a suffix array and takes the longest repeated prefixes among the suffixes as candidate terms. The basic idea is as follows. Given a string S of length n, index its n characters from 0 to n-1, and let S[i] denote the suffix starting at index i. For S = "abracadabra", the indexed suffixes are listed in Table 1.
This string has 11 suffixes in total. Sorting them lexicographically gives the suffix strings of Table 2, where the frequency of a suffix is the number of suffixes that have it as a prefix:
Suffixes with a frequency greater than 1 are possible candidate terms. However, a candidate that is contained in another, longer candidate is filtered out unless its frequency is higher than that of the longer candidate. In the example above, the final candidates are "a" and "abra".
As a second example, take the Chinese string 「自然科學與人文社會科學和新世代社會科學」 (Table 3). Sorting its suffix array yields the partial suffix array in the table; after sorting and frequency counting, the two candidates shown in Table 4 are extracted: 「科學」, which occurs 3 times, and 「社會科學」, which occurs 2 times.
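The candidate selection of S41 can be reproduced with a naive Python sketch. It follows the description literally: it enumerates the suffixes, counts how often each occurs in the string (which equals the number of suffixes that start with it), keeps those with frequency above 1, and drops a candidate when a longer candidate contains it with at least the same frequency. Storing every suffix as a string is quadratic in memory; a practical implementation would store suffix start positions instead. The function names are assumptions.

```python
def count_occurrences(text, pattern):
    """Count possibly overlapping occurrences of `pattern` in `text`."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def extract_candidates(text):
    """Return {candidate: frequency} following the S41 description."""
    # Lexicographically sorted suffixes (the role of Table 2).
    suffixes = sorted(text[i:] for i in range(len(text)))
    freq = {s: count_occurrences(text, s) for s in suffixes}
    repeated = {s: f for s, f in freq.items() if f > 1}
    kept = {}
    for cand, f in repeated.items():
        # Drop a candidate contained in another candidate unless it is
        # strictly more frequent than that longer candidate.
        dominated = any(
            cand != other and cand in other and f <= other_f
            for other, other_f in repeated.items()
        )
        if not dominated:
            kept[cand] = f
    return kept


print(extract_candidates("abracadabra"))
print(extract_candidates("自然科學與人文社會科學和新世代社會科學"))
```

For "abracadabra" the repeated suffixes are "a" (5), "abra" (2), "bra" (2), and "ra" (2); "bra" and "ra" are contained in "abra" with the same frequency and are dropped, which leaves "a" and "abra" as stated in the text.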
S42: Filter the terms by rules. Suffix-array extraction can pull a large number of possible candidate terms from the paragraphs or sentences related to the chart, but these also include many useless noise terms. Noise terms greatly increase computation time, and wrong phrases significantly degrade the resulting chart tags. Useless terms are therefore filtered out by rules, for example:
1. Punctuation rule: any extracted term that contains Chinese or English punctuation is filtered out.
2. Leading-character rule: any extracted term that begins with certain characters is filtered out, such as 「在...」 ("at ...") or 「自...」 ("from ...").
3. Trailing-character rule: any extracted term that ends with certain characters is filtered out; these are mainly specific words, such as 「...先生」 ("Mr. ...") or 「...董事」 ("Director ...").
4. Maximum matching (long-term-first) rule: the extracted terms are compared against the rule dictionary prepared in advance (S32). If the dictionary contains a term, all substrings of that term are removed from the extraction results. For example, if the rule dictionary contains 「人力資源管理系統」 ("human resource management system"), then 「人力資源管理」, 「資源管理系統」, and 「資源管理」 are all filtered out.
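The four rules can be sketched as a single Python filter. The punctuation set, the leading and trailing strings, and the function name are illustrative assumptions; only the example strings themselves come from the text. For rule 4 the sketch assumes that a dictionary term must itself appear among the candidates before its substrings are removed, which is one reading of the description.

```python
import string

# Assumed punctuation set: ASCII punctuation plus common CJK marks.
PUNCTUATION = set(string.punctuation) | set("，。、；：？！「」『』（）《》…")
LEADING = ("在", "自")       # rule 2 examples from the text
TRAILING = ("先生", "董事")  # rule 3 examples from the text


def filter_candidates(candidates, rule_dictionary):
    """Apply the punctuation, leading, trailing, and long-term-first rules."""
    kept = []
    for word in candidates:
        if any(ch in PUNCTUATION for ch in word):
            continue  # rule 1: contains punctuation
        if word.startswith(LEADING):
            continue  # rule 2: begins with a listed character
        if word.endswith(TRAILING):
            continue  # rule 3: ends with a listed word
        kept.append(word)
    # Rule 4: substrings of a dictionary term found among the candidates
    # are removed; the dictionary term itself is kept.
    known = [w for w in kept if w in rule_dictionary]
    return [
        w for w in kept
        if not any(w != longer and w in longer for longer in known)
    ]
```

A call such as `filter_candidates(terms, {"人力資源管理系統"})` keeps the full dictionary term and drops 「人力資源管理」, 「資源管理系統」, and 「資源管理」.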
In step S14, the key terms are ranked by the weight formula, the rankings are then combined into a second weighted ranking, and the top n key terms are assigned to the chart. The weight formula uses term frequency (TF) and inverse document frequency (IDF) to compute the weight of each key term. This formula, also known as the TF-IDF algorithm, is computed as follows:
S14.1: First compute the term frequency (TF). Let $d_j$ be a particular document and $t_i$ one of the words or characters used in it. Then $tf_i$ is the number of times $n_{i,j}$ that $t_i$ occurs in $d_j$, divided by the total number of occurrences of all terms in $d_j$. The original refers to this as the Figure 10 formula; reconstructed from the description it reads $$tf_i = \frac{n_{i,j}}{\sum_k n_{k,j}}$$ It emphasizes that the more often a term occurs, the more important it is.
S14.2: Next compute the inverse document frequency (IDF), a measure of how generally important a term is. The IDF of a term is obtained by dividing the total number of documents $|D|$ by the number of documents that contain the term, then taking the logarithm of the quotient. The original refers to this as the Figure 11 formula; reconstructed from the description it reads $$idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$ It emphasizes that the more documents a term occurs in, the less important it is.
S14.3: Compute $tf_i \times idf_i$. Multiplying a term's high frequency within a particular document by its low document frequency across all documents yields the TF-IDF weight. TF-IDF tends to filter out common words and keep important terms. The original refers to this as the Figure 12 formula:
$$\text{TF-IDF}_i = tf_i \times idf_i$$
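The computation in S14.1 to S14.3 can be written as a short Python sketch. Documents are assumed to be already split into lists of terms, the natural logarithm is assumed because the text does not name a base, and the function names are illustrative.

```python
import math
from collections import Counter


def term_frequency(term, document):
    """Occurrences of `term` divided by the total term count of the document."""
    if not document:
        return 0.0
    return Counter(document)[term] / len(document)


def inverse_document_frequency(term, documents):
    """log(total documents / documents containing the term)."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        return 0.0  # assumed convention for a term that never occurs
    return math.log(len(documents) / containing)


def tf_idf(term, document, documents):
    """TF-IDF weight of `term` in `document` relative to `documents`."""
    return term_frequency(term, document) * inverse_document_frequency(term, documents)


docs = [["圖表", "索引", "圖表"], ["索引", "文字"]]
print(tf_idf("圖表", docs[0], docs))  # (2/3) * ln(2)
print(tf_idf("索引", docs[0], docs))  # 0.0, the term occurs in every document
```

The second call shows the filtering effect described in S14.3: a term present in every document has an IDF of zero and therefore no weight.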
S14.4: The term weight formula consists of six factors:
1. the TF-IDF value in the chart's own name;
2. the TF-IDF value in the paragraphs that appear before and after the chart;
3. the TF-IDF value in the body content that mentions the chart by keyword;
4. the TF-IDF value in the paragraphs before and after those mentions;
5. the TF-IDF value in the file annotations;
6. the TF-IDF value in the file tag text.
The weight of term $i$ is $$w_i = \lambda_{var1} x_{i,1} + \lambda_{var2} x_{i,2} + \lambda_{var3} x_{i,3} + \lambda_{var4} x_{i,4} + \lambda_{var5} x_{i,5} + \lambda_{var6} x_{i,6} + \lambda_{\varepsilon}$$ where $x_{i,k}$ denotes the TF-IDF value of term $i$ in the $k$-th source listed above.
The parameters $\lambda_{var1}$, $\lambda_{var2}$, $\lambda_{var3}$, $\lambda_{var4}$, $\lambda_{var5}$, $\lambda_{var6}$, and $\lambda_{\varepsilon}$ are the unknowns, and each extracted term yields one equation. The ideal value of the equation is the one at which known important terms reach the maximum. n terms yield n equations, which can be solved with a parameter optimization algorithm (for example, linear regression). The lower bound of each parameter λ is set to 0.1; the upper bound is the maximum value in the best solution that the optimization algorithm finds on the initial training data, plus 10.
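The weighted sum of S14.4 and the parameter bounds can be sketched in Python as follows. The sketch only evaluates the formula and derives the bounds; it does not perform the optimization itself. The function names and the sample numbers are assumptions for illustration.

```python
def term_weight(factor_values, lambdas, bias):
    """Weighted sum of the six TF-IDF factors plus the constant term."""
    if len(factor_values) != len(lambdas):
        raise ValueError("one lambda is required per factor")
    return sum(l * x for l, x in zip(lambdas, factor_values)) + bias


def parameter_bounds(initial_solution, lower=0.1, margin=10.0):
    """Lower bound 0.1; upper bound is the largest fitted value plus 10."""
    return lower, max(initial_solution) + margin


# Hypothetical TF-IDF values of one term in the six sources.
factors = [0.5, 0.2, 0.0, 0.0, 0.1, 0.3]
lambdas = [1.0, 2.0, 1.0, 1.0, 1.0, 1.0]
print(term_weight(factors, lambdas, bias=0.05))
print(parameter_bounds([0.4, 2.5, 1.0]))
```

An optimizer would search for the `lambdas` and `bias` values within these bounds so that known important terms score highest.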
Next, the key terms extracted by named entity recognition (S12) and by statistical term extraction (S13) are each scored with the TF-IDF weighting of S14, and the top m terms of each are selected. The terms selected via S12 are given weight α and those selected via S13 are given weight β, where α + β = 1. A combined weight score is then recomputed, and the top n terms are assigned to the chart as its index tags.
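The text does not spell out how the combined score is recomputed, so the following Python sketch makes an explicit assumption: the combined score of a term is α times its S12 score plus β times its S13 score, with a missing score counted as zero. The function names, the sample terms, and the sample scores are hypothetical.

```python
def top_m(scores, m):
    """Keep the m highest-scoring terms (ties broken by term text)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:m])


def combine(ner_scores, stat_scores, alpha=0.5, m=10, n=5):
    """Merge the two top-m lists with weights alpha and beta = 1 - alpha."""
    beta = 1.0 - alpha
    ner_top, stat_top = top_m(ner_scores, m), top_m(stat_scores, m)
    combined = {
        term: alpha * ner_top.get(term, 0.0) + beta * stat_top.get(term, 0.0)
        for term in set(ner_top) | set(stat_top)
    }
    ranked = sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


ner = {"愛因斯坦": 0.9, "普林斯頓": 0.4}
stat = {"相對論": 0.8, "愛因斯坦": 0.6, "公式": 0.2}
print(combine(ner, stat, alpha=0.6, m=2, n=2))
```

Under this assumption a term found by both techniques accumulates score from both, which tends to move it up the final ranking.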
Step S15 generates a virtual chart message directory or figure index for users to search. Following steps S11 to S14, the system builds a virtual chart message directory or figure index that lets users find charts by key term and shows the name and path of the file that contains each chart.
The user browsing screen provided by the cloud data management system is shown in FIG. 6. It contains a chart search field, the virtual chart message directory, chart index names, chart thumbnails, and the name, path, and hyperlink of the file that contains each chart. The chart search field lets the user enter a keyword to query. The virtual chart message directory shows the index tag directory of all current chart messages. For example, when the user enters 「相對論」 ("relativity") in the chart search field, the system finds two files, 「相對論科學家之簡介.pptx」 and 「當代數學公式.doc」, and the images associated with the keyword inside them (an image of Einstein and a figure of the relativity formula). The user can thus quickly see the matching chart content in each file and, after clicking an entry in the index tag directory, follow the hyperlink to the corresponding paragraph of the file.
The above detailed description is a specific description of one feasible embodiment of the invention. The embodiment is not intended to limit the patent scope of the invention; any equivalent implementation or modification that does not depart from the spirit of the invention shall fall within the patent scope of this application.
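The lookup behind the chart search field can be sketched as a small inverted index in Python. The class and field names are assumptions, the file paths are invented placeholders, and matching a keyword as a substring of an index tag is one possible design; only the file names and the keyword come from the example in the text.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ChartEntry:
    """One row of the virtual chart message directory (hypothetical)."""
    index_name: str
    file_name: str
    file_path: str


class ChartIndex:
    """Maps index tags (key terms) to the charts they were assigned to."""

    def __init__(self):
        self._by_term = defaultdict(list)

    def add(self, entry, terms):
        for term in terms:
            self._by_term[term].append(entry)

    def search(self, keyword):
        """Return charts whose index tags contain the keyword."""
        results = []
        for term, entries in self._by_term.items():
            if keyword in term:
                for entry in entries:
                    if entry not in results:
                        results.append(entry)
        return results


index = ChartIndex()
# Placeholder paths; the text does not give real ones.
index.add(ChartEntry("Einstein image", "相對論科學家之簡介.pptx", "/docs/a"),
          ["相對論", "愛因斯坦"])
index.add(ChartEntry("Relativity formula", "當代數學公式.doc", "/docs/b"),
          ["相對論", "公式"])
for hit in index.search("相對論"):
    print(hit.index_name, hit.file_name, hit.file_path)
```

Each hit carries the file name and path, which is the information the browsing screen shows next to the chart thumbnail.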
Claims (8)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW105140773A TWI636370B (en) | 2016-12-09 | 2016-12-09 | Establishing chart indexing method and computer program product by text information |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| TW201822031A TW201822031A (en) | 2018-06-16 |
| TWI636370B true TWI636370B (en) | 2018-09-21 |
Family
ID=63258406
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| TW105140773A TWI636370B (en) | 2016-12-09 | 2016-12-09 | Establishing chart indexing method and computer program product by text information |
Country Status (1)
| Country | Link |
|---|---|
| TW (1) | TWI636370B (en) |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112307265A (en) * | 2019-07-26 | 2021-02-02 | 珠海金山办公软件有限公司 | Method, system, storage medium and terminal for searching chart in document |
| TWI820845B (en) * | 2022-08-03 | 2023-11-01 | 中國信託商業銀行股份有限公司 | Training data labeling method and its computing device, article labeling model establishment method and its computing device, and article labeling method and its computing device |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| TW201126359A (en) * | 2010-01-25 | 2011-08-01 | Ind Tech Res Inst | Keyword evaluation systems and methods |
| TW201331772A (en) * | 2012-01-17 | 2013-08-01 | Alibaba Group Holding Ltd | Image index generation method and apparatus |
- 2016-12-09: TW application TW105140773A, patent TWI636370B (en); status: not active, IP right cessation
Also Published As
| Publication number | Publication date |
|---|---|
| TW201822031A (en) | 2018-06-16 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | MM4A | Annulment or lapse of patent due to non-payment of fees | |