TWI774105B - Official document analysis method

Info

Publication number: TWI774105B
Application number: TW109137724A
Authority: TW
Inventors: 田金山; 劉任哲; 呂威廷
Original assignee: 全友電腦股份有限公司
Priority date: 2020-10-29
Filing date: 2020-10-29
Publication date: 2022-08-11
Also published as: TW202217689A

Abstract

An official document analysis method is provided. The method includes: receiving an electronic official document with word content that includes a plurality of content areas; building at least one feature point for each of the content areas, wherein the feature points form a topology structure; reading an official document database holding a plurality of official document format information, each of which includes area deployment information and a plurality of title information; deciding the format of the electronic official document from a relative position score computed by comparing the topology structure with the area deployment information; building a blank official document in the standard format of the electronic official document, the blank official document including a plurality of blank content areas, each corresponding to at least one title information; and dividing the word content and writing it into the corresponding blank content areas based on the title information.

Description

Analysis method of official documents

This disclosure relates to a method for parsing the document content of official documents.

Official business documents are referred to as official documents for short. An official document may be used to exchange information between units, or to preserve information for later reference. Official documents have different standard formats depending on their purpose. Taking a notification letter and a meeting minutes sheet as examples, the two documents convey or record different content, so the text content, such as attendees, recipients, and explanatory text, is also presented differently.

In official document management, classifying documents efficiently and ensuring that their content is correct is an important but labor-intensive task. Because official documents come in many types, digitizing paper documents mostly relies on personnel proofreading the paper copies one by one and retyping them word by word. This approach is not only inefficient but also highly prone to human error.

In view of this, the applicant proposes an official document parsing method. According to some embodiments, the method includes receiving an electronic official document; the electronic official document includes text content, and the text content includes a plurality of content blocks. At least one feature point is established for each content block, and the feature points constitute a topology. An official document category database is read; the database includes a plurality of official document format information, each of which includes block structure information and a plurality of title information. A relative position score is determined based on the topology and the block structure information. The format of the electronic official document is determined according to the relative position score. A blank official document corresponding to the format of the electronic official document is created; the blank official document includes a plurality of blank content blocks, and each blank content block corresponds to at least one title information. According to the title information of the official document format information corresponding to the electronic official document, the text content is divided and written into the blank content block corresponding to each title information.

According to some embodiments, the step of dividing the text content and writing it into the corresponding blank content blocks further includes examining the text content of the electronic official document line by line. When line X is determined to contain first title information, line X+n is determined to contain second title information, and no other title information exists between line X and line X+n, the text content from line X through line (X+n-1) is written into the blank content block corresponding to the first title information.

According to some embodiments, the text content further includes keywords, and the official document format information further includes keyword information. The official document parsing method determines a keyword score based on the keywords and the keyword information, and determines the format of the electronic official document according to both the relative position score and the keyword score.

According to some embodiments, the block structure information includes block structure weights and the keyword information includes keyword weights. The official document parsing method applies a machine learning algorithm and determines the format of the electronic official document according to the block structure weights and the keyword weights.

According to some embodiments, the official document parsing method further includes determining, according to the official document format information corresponding to the format of the electronic official document, the difference between the topology of the electronic official document and the block structure information, and issuing a warning when a difference exists. The method likewise determines the difference between the keywords of the electronic official document and the keyword information, and issues a warning when a difference exists.

According to some embodiments, the official document parsing method further includes removing from the electronic official document any keyword whose keyword score is lower than a score threshold.

According to some embodiments, the official document parsing method further includes determining the keyword score according to the content block in which the keyword is located.

According to some embodiments, the official document parsing method further includes removing characters that have no grammatical meaning from the text information.

According to some embodiments, the electronic official document is generated by scanning the original official document using optical character recognition.

According to some embodiments, the official document parsing method further includes writing the text content of the electronic official document into the blank content blocks of the blank official document and outputting the result as a file in a physical document format.

In summary, according to some embodiments, the official document parsing method analyzes the distribution of the content blocks of an official document and the keywords contained in each content block in order to determine the type of the official document, and provides a warning or a correction when the text content or the document format is wrong.

The official document parsing method is a method for analyzing official documents. In terms of who uses them, official documents may be, but are not limited to, documents used by government agencies or units, or documents used by private organizations or units. In terms of function, official documents may be, but are not limited to, documents used to exchange information or documents used to record information. According to some embodiments, an electronic device executes the official document parsing method. The electronic device may be, but is not limited to, a personal computer, a mobile phone, a tablet computer, or a server. For example, a mobile phone converts a document image file captured by its camera into an electronic official document 1, and the official document parsing method running on the mobile phone then analyzes the electronic official document 1. As another example, the document image file captured by the mobile phone camera is sent to a server; the server converts the image file into the electronic official document 1, and the official document parsing method running on the server then analyzes it.

FIG. 1 is a schematic diagram of a notification letter according to some embodiments. Referring to FIG. 1, the electronic official document 1 includes text content 11, for example "Address: 33004 No. O, OO Road, OO District, OO City" or "Description: 1. OOOOOO…". The text content 11 can be divided into a plurality of content blocks 12, so that the text of the document is grouped into several blocks. According to some embodiments, a content block 12 is distinguished from other content blocks 12 by a border frame. According to some embodiments, a content block 12 is distinguished from other content blocks 12 by fixed blank spacing. According to some embodiments, a content block 12 is distinguished from other content blocks 12 by title information 14.

According to some embodiments, the official document parsing method uses image recognition to analyze the relative position distribution of the content blocks 12 and thereby determine the type of the official document. FIG. 2 is a schematic diagram of image recognition performed on the notification letter according to some embodiments. Referring to FIG. 2, the dashed regions are the content blocks 12 of the notification letter of FIG. 1. In this embodiment there are seven content blocks 12 in total. The method establishes at least one feature point 121 for each content block 12, and the feature points 121 form a topology 122. A feature point 121 may be a feature that the image recognition algorithm uses to identify a content block 12. According to some embodiments, the feature point 121 is the geometric center of the content block 12. According to some embodiments, the feature point 121 is a point on an edge of the content block 12. According to some embodiments, the feature point 121 is a corner point of the content block 12. According to some embodiments, the feature point 121 is a point on the text content 11 contained in the content block 12. A single content block 12 may have one feature point 121 or several. According to some embodiments, the spacing between multiple feature points 121 is so small that the feature points 121 approximate a contour line. The method identifies the relative positional relationship of the feature points 121, that is, the topology 122 they form. According to some embodiments, the feature points 121 form a topology of multiple vectors between them, and the vectors constitute a feature matrix; the method identifies the feature matrix to determine the type of the official document. FIG. 3 is a schematic diagram of a meeting minutes sheet according to some embodiments. Referring to FIG. 3, unlike the notification letter embodiment of FIG. 1, the meeting minutes sheet embodiment of FIG. 3 includes four content blocks 12. FIG. 4 is a schematic diagram of image recognition performed on the meeting minutes sheet according to some embodiments. Referring to FIG. 2 and FIG. 4 together, the image recognition algorithm uses the difference in the relative position distribution of the feature points 121 of the content blocks 12 of the two document types to distinguish them as a notification letter and a meeting minutes sheet.
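The feature-point embodiment in which each feature point is the geometric center of a content block, and the vectors between feature points form a feature matrix, can be sketched as follows. This is a minimal illustration, not the patented implementation: the bounding-box representation, the normalization by page size, and the function names center and feature_matrix are all assumptions introduced here.

```python
from itertools import combinations

def center(box):
    """Geometric center of a content block given as (left, top, right, bottom)."""
    left, top, right, bottom = box
    return ((left + right) / 2, (top + bottom) / 2)

def feature_matrix(boxes, page_width, page_height):
    """Displacement vectors between every pair of block centers.

    Each row is (dx, dy), divided by the page size so that scans of
    different resolutions give comparable values (an assumption here).
    """
    centers = [center(b) for b in boxes]
    rows = []
    for (x1, y1), (x2, y2) in combinations(centers, 2):
        rows.append(((x2 - x1) / page_width, (y2 - y1) / page_height))
    return rows

# Two hypothetical content blocks stacked vertically on a 100 x 200 page.
blocks = [(0, 0, 100, 50), (0, 50, 100, 150)]
print(feature_matrix(blocks, page_width=100, page_height=200))
```

With seven content blocks, as in the notification letter of FIG. 2, this sketch would yield 21 pairwise vectors; with four blocks, as in FIG. 3, it would yield 6.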

The official document parsing method reads an official document category database. The database includes a plurality of official document format information, each of which includes block structure information and a plurality of title information 14. The official document format information covers many categories of official documents, for example, but not limited to, information about the format or content of notification letters, research record sheets, meeting minutes sheets, sign-in sheets, and other official documents. The block structure information may include the format information of the official document or information that helps identify its format. According to some embodiments, the block structure information stores the topology 122 of the feature points 121 of a template document. According to some embodiments, the block structure information stores the blank content blocks of a template document. For example, the block structure information stores the seven blank content blocks 12 of the notification letter shown in FIG. 2, and these blank content blocks 12 correspond respectively to the text content 11 of the content blocks 12 shown in FIG. 1. The title information 14 may be information used to separate the text content 11 and to determine which content block 12 a piece of text content 11 belongs to. FIG. 5 is a schematic diagram of title information according to some embodiments. Taking the meeting minutes sheet of FIG. 5 as an example, the title information 14 consists of text such as "record sheet", "meeting name", "attendees", and "meeting minutes". According to some embodiments, the title information 14 can be used to separate the text content 11. For example, in FIG. 5 the text content 11 between "meeting name" and "attendees" belongs to one content block 12, and the text content 11 between "attendees" and "meeting minutes" belongs to another content block 12. According to some embodiments, the title information 14 can be used to determine the content block 12 to which the text content 11 corresponds. For example, in FIG. 5 "record sheet" corresponds to the topmost content block 12, and "meeting minutes" corresponds to the bottommost content block 12.

According to some embodiments, the official document parsing method analyzes the relative positional relationship of the feature points 121 of the content blocks 12 of the electronic official document 1 and compares it with the block structure information of the various document templates in the official document category database, thereby determining a relative position score that expresses the similarity between the electronic official document 1 and each template. The relative position score is the basis for determining the format of the electronic official document 1. For example, the method receives the electronic official document 1 of the FIG. 3 embodiment. After image analysis and comparison with the notification letter template, a preset scoring mechanism yields a relative position score of 40 points; after comparison with the meeting minutes sheet template, the preset scoring mechanism yields a relative position score of 90 points. The electronic official document 1 is therefore classified as a meeting minutes sheet.
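The comparison against each template can be sketched as below. The text only refers to a "preset scoring mechanism" and does not define it, so the formula here (mean distance between block centers paired in reading order, scaled by the ratio of block counts) is purely a hypothetical stand-in, as are the names position_score and classify and the template coordinates.

```python
import math

def position_score(doc_centers, template_centers):
    """Hypothetical 0-100 similarity between two lists of normalized block centers."""
    if not doc_centers or not template_centers:
        return 0.0
    # Pair blocks in reading order: top to bottom, then left to right.
    doc = sorted(doc_centers, key=lambda p: (p[1], p[0]))
    tpl = sorted(template_centers, key=lambda p: (p[1], p[0]))
    paired = list(zip(doc, tpl))
    mean_dist = sum(math.dist(a, b) for a, b in paired) / len(paired)
    # Penalize a mismatch in the number of content blocks.
    count_ratio = min(len(doc), len(tpl)) / max(len(doc), len(tpl))
    return 100.0 * max(0.0, 1.0 - mean_dist) * count_ratio

def classify(doc_centers, templates):
    """Return the best-scoring format name and the per-format scores."""
    scores = {name: position_score(doc_centers, c) for name, c in templates.items()}
    return max(scores, key=scores.get), scores

# Illustrative templates: seven blocks for the letter, four for the minutes sheet.
templates = {
    "notification letter": [(0.5, 0.05 + 0.13 * i) for i in range(7)],
    "meeting minutes sheet": [(0.5, 0.1), (0.5, 0.3), (0.5, 0.5), (0.5, 0.8)],
}
scanned = [(0.5, 0.1), (0.5, 0.3), (0.5, 0.5), (0.5, 0.8)]
best, scores = classify(scanned, templates)
print(best, scores)
```

Any scoring function that ranks the matching template highest would serve the same role; the choice of the maximum score as the decision rule follows the 40-versus-90 example in the text.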

After the official document parsing method confirms which official document format information in the official document category database the electronic official document 1 corresponds to, it creates a blank official document in that format, and the blank official document includes a plurality of blank content blocks. The text content 11 includes a plurality of title information 14, and each title information 14 can correspond to a specific content block 12. According to the title information 14, the method divides the text content 11 and writes it into the blank content block corresponding to that title information 14. According to some embodiments, each official document format information includes an algorithm specific to that document format. After confirming the document format, the method uses the algorithm of that format to decide how to divide the text content 11 and how to write the divided text content 11 into each content block 12. According to some embodiments, the algorithm assigns a specific number of characters of text content 11 following the title information 14 to the same content block 12. According to some embodiments, the algorithm assigns the entire line of text content 11 in which the title information 14 appears to the same content block 12. According to some embodiments, the algorithm assigns the text content 11 between any two title information 14 to the same content block 12. In detail, the method examines the text content 11 of the electronic official document 1 line by line. When line X of the text content 11 is determined to contain first title information 14, the text content 11 of line X and of the following lines is written into the same content block 12, until line X+n of the text content 11 is determined to contain second title information 14; from then on, the text content 11 of line X+n and of the following lines is no longer written into that content block 12. X and n are integer values.
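The line-by-line rule above (lines X through X+n-1 go to the block of the title found on line X) can be sketched as follows. The function name split_by_titles, the substring test used to detect a title, and the sample lines are assumptions made for illustration; the source does not specify how a title is matched within a line.

```python
def split_by_titles(lines, titles):
    """Assign each line to the block of the most recent title line at or above it.

    A line "contains" a title when the title string occurs in it (an assumed
    matching rule). Lines before the first title are left unassigned.
    """
    blocks = {title: [] for title in titles}
    current = None
    for line in lines:
        # The first title (in list order) that occurs in this line, if any.
        found = next((t for t in titles if t in line), None)
        if found is not None:
            current = found          # line X or line X+n: a new block starts here
        if current is not None:
            blocks[current].append(line)
    return blocks

titles = ["Record sheet", "Meeting name", "Attendees", "Meeting minutes"]
ocr_lines = [
    "Record sheet",
    "Meeting name: Budget review",
    "Attendees: Chen, Lin",
    "Wang",
    "Meeting minutes:",
    "1. Approved.",
]
for title, content in split_by_titles(ocr_lines, titles).items():
    print(title, "->", content)
```

Because matching uses list order, a title that is a substring of another (as can happen with "record sheet" and "meeting minutes" in the original language) should be listed with care; a real system would likely need a stricter matching rule.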

For example, suppose the official document of FIG. 5 is a paper document, and the official document parsing method determines that it is a meeting minutes sheet and generates a blank official document in the meeting minutes sheet format. The method scans the paper document using optical character recognition to obtain the text content 11. Because the text content 11 produced by scanning is undivided, it must be assigned to the different content blocks 12. The method reads the title information 14 in the official document format information of the meeting minutes sheet and obtains "record sheet", "meeting name", "attendees", and "meeting minutes". The algorithm writes the entire line of text content 11 containing "record sheet" into the first blank content block from the top of the blank official document. It writes the entire line containing "meeting name", together with the text content 11 above the line containing "attendees", into the second blank content block from the top. It writes the entire line containing "attendees", together with the text content 11 above the line containing "meeting minutes", into the third blank content block from the top. Finally, it writes the entire line containing "meeting minutes", together with the text content 11 below that line, into the fourth blank content block from the top.

According to some embodiments, the text content 11 includes keywords 111. A keyword 111 may be, but is not limited to, text with a special meaning, text that indicates the purpose of the official document, or text that indicates the recorded content, for example "original", "letter", "subject", "telephone", or "issue date". A keyword 111 may be, but is not limited to, defined by the vendor, defined by the user, or generated by an algorithm. According to some embodiments, the official document format information may include keyword information 13, and different document formats may correspond to different keyword information 13. FIG. 6 is a schematic diagram of the keyword information of a notification letter according to some embodiments. Referring to FIG. 6, for example, the keyword information 13 of the notification letter includes the keywords 111 "letter", "subject", "description", "original", "copy", "address", and "telephone". The keyword information 13 of the meeting minutes sheet may include the keywords 111 "record sheet", "meeting name", "host", and "meeting minutes" (not shown). When the keyword 111 "address" appears in an electronic official document 1, the method, based on the keyword information 13 of each official document format information and a preset scoring mechanism, determines that the keyword score of this keyword 111 is 60 points for the notification letter and 40 points for the meeting minutes sheet. When another keyword 111, "subject", appears in the electronic official document 1, the method likewise determines that its keyword score is 80 points for the notification letter and 30 points for the meeting minutes sheet. According to some embodiments, the method determines the format of the electronic official document 1 according to the keyword score and the relative position score.
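The keyword scoring example can be sketched as a lookup table keyed by keyword and format. The per-keyword values below are the ones given in the text; summing them per format, and adding the result to the relative position score, are assumptions, since the source does not say how the two scores are combined. The position scores in the usage lines are invented for illustration.

```python
# Keyword scores per format, taken from the example in the text.
KEYWORD_SCORES = {
    "address": {"notification letter": 60, "meeting minutes sheet": 40},
    "subject": {"notification letter": 80, "meeting minutes sheet": 30},
}

def keyword_totals(keywords_found, table=KEYWORD_SCORES):
    """Sum the keyword scores per format (summation is an assumed rule)."""
    totals = {}
    for kw in keywords_found:
        for fmt, score in table.get(kw, {}).items():
            totals[fmt] = totals.get(fmt, 0) + score
    return totals

def decide_format(position_scores, keyword_scores):
    """Combine both scores by simple addition (assumed) and pick the maximum."""
    formats = set(position_scores) | set(keyword_scores)
    combined = {f: position_scores.get(f, 0) + keyword_scores.get(f, 0) for f in formats}
    return max(combined, key=combined.get), combined

kw = keyword_totals(["address", "subject"])
# Hypothetical relative position scores for the same document.
pos = {"notification letter": 85, "meeting minutes sheet": 45}
print(decide_format(pos, kw))
```

A weighted sum, or a trained classifier as described in the machine learning embodiment, could replace the plain addition without changing the overall flow.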

According to some embodiments, the block structure information of each official document format information includes a block structure weight, and the keyword information 13 of each official document format information includes a keyword weight. According to some embodiments, the block structure weight is a weight value obtained by feeding the feature points 121 of the content blocks 12 of template documents, as training data, into an image recognition machine learning algorithm so that it learns the types of the template documents. According to some embodiments, the keyword weight is a weight value obtained by feeding the keywords 111 of the text content 11 of template documents, as training data, into a text recognition machine learning algorithm so that it learns the types of the template documents. According to some embodiments, the trained machine learning algorithm takes the content blocks 12 of the electronic official document 1 as input and outputs the relative position score, and takes the keywords 111 of the electronic official document 1 as input and outputs the keyword score; the method then determines the format of the electronic official document 1 according to the keyword score and the relative position score. Depending on the machine learning algorithm applied, according to some embodiments, the block structure weights and keyword weights may be modified during the determination process.

According to some embodiments, after confirming the format of the electronic official document 1, the official document parsing method uses the block structure information or the keyword information 13 of the official document format information to determine the difference between the topology 122 of the electronic official document 1 and the block structure information, issuing a warning when a difference exists, and to determine the difference between the keywords 111 of the electronic official document 1 and the keyword information 13, issuing a warning when a difference exists. FIG. 7 is a schematic diagram of a notification letter with missing content according to some embodiments. In the example of FIG. 7, the method determines from the relative positional relationship of the content blocks 12 of the electronic official document 1 that the document is closest to the notification letter format; however, the electronic official document 1 contains a missing field 15. Compared with the template document shown in FIG. 1, the missing field 15 should have been filled in with the original or copy information. The topology 122 formed by the feature points 121 of the content blocks 12 of the electronic official document 1 therefore differs from the block structure information, and the method issues a warning that the document content is wrong. FIG. 8 is a schematic diagram of a notification letter with mistyped content according to some embodiments. In the example of FIG. 8, the method determines from the relative positional relationship of the content blocks 12 of the electronic official document 1 that the document is closest to the notification letter format; however, the document title "meeting minutes sheet" contains the wrong keyword 111, "record sheet". According to the keyword information 13 of the notification letter, this content block 12 should contain the keyword 111 "letter". Because the keywords 111 of the text content 11 of the electronic official document 1 differ from the keyword information 13, the method issues a warning that the document content is wrong.
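The two checks in this passage, one on block structure and one on keywords, can be sketched as below. Reducing the topology comparison to a count of filled content blocks is a simplifying assumption, and the function name find_discrepancies and the warning strings are hypothetical.

```python
def find_discrepancies(doc_block_count, doc_keywords,
                       expected_block_count, expected_keywords):
    """Return warning messages for structure and keyword differences.

    The structure check compares block counts only, a simplified stand-in
    for comparing the topology with the stored block structure information.
    """
    warnings = []
    if doc_block_count != expected_block_count:
        warnings.append(
            f"block structure differs: expected {expected_block_count} "
            f"content blocks, found {doc_block_count}"
        )
    missing = sorted(set(expected_keywords) - set(doc_keywords))
    if missing:
        warnings.append("missing keywords: " + ", ".join(missing))
    return warnings

# A letter-shaped document whose title reads "record sheet" instead of "letter".
print(find_discrepancies(
    doc_block_count=7,
    doc_keywords={"record sheet", "subject", "description"},
    expected_block_count=7,
    expected_keywords={"letter", "subject", "description"},
))
```

The first call pattern corresponds to the mistyped title of FIG. 8; passing a smaller block count than expected corresponds to the missing field of FIG. 7.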

According to some embodiments, the official document parsing method removes keywords 111 in the electronic official document 1 whose keyword scores are lower than a score threshold. The score threshold may be, but is not limited to, defined by the vendor, defined by the user, or generated by an algorithm. For example, suppose the official document parsing method uses a score threshold of 50 points. When the keyword 111 "subject" appears in the electronic official document 1, the official document parsing method determines, according to the keyword information 13, that the keyword score of this keyword 111 is 80 points for the notification letter and 30 points for the meeting record sheet. If the official document parsing method ultimately determines, according to the relative position score of the electronic official document 1, that the format of the electronic official document 1 is a meeting record sheet, the keyword 111 "subject" is removed because its keyword score for the meeting record sheet is lower than the score threshold.

According to some embodiments, the official document parsing method determines the keyword score according to the content block 12 in which the keyword 111 is located. FIG. 9 is a schematic diagram of the same keyword located in different content blocks according to some embodiments. Referring to FIG. 9, according to some embodiments, a machine learning algorithm learns from template documents of the meeting record sheet that the keyword 111 "record sheet" mainly appears in the first content block 12 from the top of the electronic official document 1. Therefore, when the keyword 111 "record sheet" is located in the first content block 12 from the top, its keyword score for the meeting record sheet is 80 points. When the keyword 111 "record sheet" is located in the fourth content block 12 from the top, its keyword score for the meeting record sheet is 30 points. If the official document parsing method uses a score threshold of 50 points, the keyword 111 "record sheet" located in the fourth content block 12 from the top is removed. FIG. 10 is a schematic diagram of a meeting record sheet with erroneous content according to some embodiments. Referring to FIG. 10, according to some embodiments, when the keyword 111 "meeting time" is located in the second content block 12 from the top, its keyword score for the meeting record sheet is 90 points. When the keyword 111 "meeting time" is located in the third content block 12 from the top, its keyword score for the meeting record sheet is 10 points. If the official document parsing method uses a score threshold of 50 points, the keyword 111 "meeting time" located in the third content block 12 from the top is removed.

According to some embodiments, the official document parsing method removes characters 112 without grammatical meaning from the text content. The characters 112 without grammatical meaning may be, but are not limited to, defined by the vendor, defined by the user, or generated by an algorithm. For example, in the embodiment of FIG. 10, the last line of the fourth content block 12 from the top contains the characters 112 without grammatical meaning "&%$**/", which the official document parsing method removes after recognizing them.
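The threshold filtering and character removal described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the score table simply restates the example values from the text, the default score of 0 for unknown combinations is an assumption, and the regular expression used to detect characters without grammatical meaning is a hypothetical vendor-defined rule, not the patent's definition.

```python
import re

SCORE_THRESHOLD = 50  # example threshold from the text

# (format, keyword, block position from the top, 1-based) -> keyword score.
# Values restate the examples in the description; a real table would be learned.
KEYWORD_SCORES = {
    ("meeting record sheet", "record sheet", 1): 80,
    ("meeting record sheet", "record sheet", 4): 30,
    ("meeting record sheet", "meeting time", 2): 90,
    ("meeting record sheet", "meeting time", 3): 10,
}


def keyword_score(fmt, keyword, block_position):
    """Look up a score; unknown combinations score 0 (an assumption of this sketch)."""
    return KEYWORD_SCORES.get((fmt, keyword, block_position), 0)


def filter_keywords(fmt, found, threshold=SCORE_THRESHOLD):
    """Keep only (keyword, block_position) pairs whose score is not below the threshold."""
    return [(kw, pos) for kw, pos in found
            if keyword_score(fmt, kw, pos) >= threshold]


# Hypothetical rule: a run of two or more of these symbols carries no grammatical meaning.
JUNK_RUN = re.compile(r"[&%$*/#@^~]{2,}")


def strip_junk(text):
    """Remove symbol runs matched by the hypothetical rule and trailing whitespace."""
    return JUNK_RUN.sub("", text).rstrip()


found = [("record sheet", 1), ("record sheet", 4), ("meeting time", 3)]
kept = filter_keywords("meeting record sheet", found)
cleaned = strip_junk("Resolution: approved &%$**/")
```

Requiring a run of at least two symbols is a deliberate choice in this sketch so that single separators, such as the slashes in a date, are left untouched.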

According to some embodiments, the official document parsing method scans an original official document using optical character recognition to generate the electronic official document 1. The original official document may be, but is not limited to, a paper document, a Portable Document Format (PDF) file, or a file in any of various image file formats.

According to some embodiments, after writing the text content 11 of the electronic official document 1 into the blank content blocks of the blank official document, the official document parsing method obtains a completed official document and outputs the completed official document as a Document Instance (DI) format file.

1: Electronic official document
11: Text content
111: Keyword
112: Character without grammatical meaning
12: Content block
121: Feature point
122: Topological structure
13: Keyword information
14: Title information
15: Missing field

[FIG. 1] is a schematic diagram of a notification letter according to some embodiments;
[FIG. 2] is a schematic diagram of performing image recognition on a notification letter according to some embodiments;
[FIG. 3] is a schematic diagram of a meeting record sheet according to some embodiments;
[FIG. 4] is a schematic diagram of performing image recognition on a meeting record sheet according to some embodiments;
[FIG. 5] is a schematic diagram of title information according to some embodiments;
[FIG. 6] is a schematic diagram of keyword information of a notification letter according to some embodiments;
[FIG. 7] is a schematic diagram of a notification letter with missing content according to some embodiments;
[FIG. 8] is a schematic diagram of a notification letter with erroneous content according to some embodiments;
[FIG. 9] is a schematic diagram of the same keyword located in different content blocks according to some embodiments; and
[FIG. 10] is a schematic diagram of a meeting record sheet with erroneous content according to some embodiments.

1: Electronic official document

11: Text content

111: Keyword

12: Content block

Claims

A method for parsing official documents, comprising: receiving an electronic official document, wherein the electronic official document includes text content, and the text content includes a plurality of content blocks; establishing at least one feature point for each of the content blocks, wherein the feature points constitute a topological structure; reading an official document type database, wherein the official document type database includes a plurality of pieces of official document format information, and each piece of official document format information includes block structure information and a plurality of pieces of title information; determining a relative position score according to the topological structure and the block structure information; determining the format of the electronic official document according to the relative position score; creating a blank official document corresponding to the format of the electronic official document, wherein the blank official document includes a plurality of blank content blocks, and each of the blank content blocks corresponds to at least one piece of the title information; and dividing the text content and writing it into the blank content blocks corresponding to the title information, according to the title information of the official document format information corresponding to the format of the electronic official document.

The method for parsing official documents as described in claim 1, wherein the step of dividing the text content and writing it into the corresponding blank content blocks further comprises: examining the text content of the electronic official document line by line; and when it is determined that the text content of the Xth line contains first title information, the text content of the (X+n)th line contains second title information, and there is no other title information between the Xth line and the (X+n)th line, writing the text content from the Xth line to the (X+n-1)th line into the blank content block corresponding to the first title information.

The method for parsing official documents as described in claim 1, wherein: the text content further includes a keyword; the official document format information further includes keyword information; the official document parsing method determines a keyword score according to the keyword and the keyword information; and the official document parsing method determines the format of the electronic official document according to the relative position score and the keyword score.

The method for parsing official documents as described in claim 3, wherein the block structure information includes a block structure weight, the keyword information includes a keyword weight, and the official document parsing method applies a machine learning algorithm and determines the format of the electronic official document based on the block structure weight and the keyword weight.

The method for parsing official documents as described in claim 3, further comprising: determining, according to the official document format information corresponding to the format of the electronic official document, whether the topological structure of the electronic official document differs from the block structure information, and issuing a warning when there is a difference; and determining whether the keyword of the electronic official document differs from the keyword information, and issuing a warning when there is a difference.

The method for parsing official documents as described in claim 3, further comprising removing keywords in the electronic official document whose keyword scores are lower than a score threshold.

The official document parsing method according to claim 6, further comprising determining the keyword score according to the content block where the keyword is located.

The method for parsing official documents as described in claim 1, further comprising removing a character without grammatical meaning in the text content.

The method for parsing an official document as claimed in claim 1, wherein the electronic official document is generated by scanning an original official document using optical character recognition.

The method for parsing official documents as described in claim 1, further comprising writing the text content of the electronic official document into the blank content blocks of the blank official document and outputting the result as a document instance format file.
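
For illustration only, the line-by-line division recited in the dependent claim on title information (lines X through X+n-1 assigned to the block of the title found on line X) can be sketched in Python. The function name `split_by_titles`, the rule that a title is recognized when a line starts with it, and the choice to discard lines that precede the first title are all assumptions of this sketch and are not part of the claims.

```python
def split_by_titles(lines, titles):
    """Assign each line to the most recently seen title.

    A line is treated as containing a title when it starts with that title
    (an assumed matching rule). Lines before the first title are discarded
    in this sketch.
    """
    sections = {}
    current = None
    for line in lines:
        matched = next((t for t in titles if line.startswith(t)), None)
        if matched is not None:
            # A new title line closes the previous section and opens a new one.
            current = matched
            sections.setdefault(current, [])
        if current is not None:
            sections[current].append(line)
    return sections


lines = [
    "Subject: Budget review",
    "for fiscal year 2021",
    "Explanation: See the attached table.",
    "Attachments: none",
]
sections = split_by_titles(lines, ["Subject", "Explanation", "Attachments"])
```

Here the first two lines fall under "Subject" because the next title does not appear until the third line, which corresponds to X = 1 and n = 2 in the claim language.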