TWI845355B - System for judging input mode of form data - Google Patents
- Publication number: TWI845355B
- Application number: TW112123637A
- Authority: TW (Taiwan)
Description
The present invention relates to a judgment system, and more particularly to a judgment system for determining whether form data was input manually or automatically.
To meet the needs of organizational management, system management, and information integration, more and more form data must be sent to the data storage device (server) of a data management center for centralized management, where it serves as the data basis for analyzing specific matters. Some of this form data is produced by manual input, such as typing, selecting options, or handwriting combined with image recognition. The rest may be produced by automatic input, such as barcode reading, chip sensing, tag sensing, image recognition, or automatic population or import by a system.
Interpreting and analyzing what the form data means requires precise statistical analysis through big-data computation. That approach presupposes that the form data itself is highly accurate; otherwise its true meaning may be misjudged. However, the error rate of manual input (typing, writing, or selecting) is often far higher than that of automatic input (barcode reading, chip sensing, tag sensing, image recognition, or automatic population or import by a system).
Because the volume of form data is enormous, having data management center staff check each form for correctness is impractical. Auxiliary verification software can help, but because form data is highly diverse, it is unlikely that dedicated verification software can be developed for every kind of form.
If form data could be generated automatically, the workload of data management center staff would drop substantially, so promoting automatic input of form data is imperative. In practice, however, given the variety of usage scenarios and requirements, not all forms can be converted to automatic input in a short time, and some form data will inevitably still be produced by manual input.
Because the form data stored in the data storage device (server) of the data management center mixes automatically input and manually input form data, a new judgment technique is needed to determine which form data was input automatically and which manually. More verification resources can then be devoted to checking manually input form data, raising the overall accuracy of the form data.
In the prior art, no technique exists for determining whether form data was input automatically or manually, so verification resources cannot be concentrated on manually input form data, and the overall accuracy of form data is hard to improve. To solve this problem, one necessary technical means adopted by the present invention is a form data input mode judgment system (hereinafter the "judgment system") comprising a data storage device and a judgment device.
The data storage device stores a plurality of ground-truth form data and a plurality of to-be-judged form data. Each ground-truth form data has a corresponding initial ground truth that defines it as automatically input or manually input. The judgment device is communicatively connected to the data storage device to retrieve the ground-truth form data and the to-be-judged form data and, after a judgment program is installed and executed, produces a feature extraction module, a supervised learning module, a judgment module, and a verification alarm module.
The feature extraction module extracts from each ground-truth form data a learning field information quantity and a learning timestamp quantity, both of which reflect data disorder, so that the ground-truth form data has a corresponding plurality of learning field information quantities and learning timestamp quantities. It likewise extracts from each to-be-judged form data a judgment field information quantity and a judgment timestamp quantity reflecting data disorder.
The supervised learning module performs a learning computation based on the learning field information quantities, the learning timestamp quantities, and the initial ground truths of the ground-truth form data, thereby producing a judgment computation model.
The judgment module uses the judgment computation model together with the judgment field information quantity and judgment timestamp quantity of each to-be-judged form data to produce a judgment result predicting whether that form data was input automatically or manually, thereby producing a plurality of judgment results.
The verification alarm module receives a plurality of feedback ground truths that define to-be-judged form data as automatically or manually input. When it verifies that the judgment result for a to-be-judged form data does not match the feedback ground truth, it issues an alert message and defines a judgment-anomaly form data and a confirmed ground truth. The module also stores the judgment-anomaly form data and the confirmed ground truth in the data storage device as ground-truth form data and initial ground truth, so that the supervised learning module can rerun the learning computation and revise the judgment computation model.
Among the subsidiary technical means derived from the above, the judgment system preferably further comprises a plurality of data input terminal devices, from which the ground-truth form data and the to-be-judged form data can be sent to the data storage device for storage. The data storage device may be a data storage server, and the judgment device may be a computing server. Each data input terminal device further includes a feedback operation interface through which its operator can enter the feedback ground truth upon finding that the judgment result for a to-be-judged form data is wrong.
Preferably, the verification alarm module further includes a verification period setting interface for setting a verification period. According to that period, the judgment-anomaly form data and confirmed ground truths are periodically taken as ground-truth form data and initial ground truths, and the supervised learning module periodically performs the learning computation.
The feature extraction module may include a field information quantity extraction unit, which operates according to a field entropy algorithm to obtain the learning field information quantity and the judgment field information quantity. The field entropy algorithm is $H = -\sum_{i=1}^{k} p_i \log_{10} p_i$, where $p_i = n_i / \sum_{m=1}^{k} n_m$, $k$ is the number of data field types (that is, there are $k$ data field types in total), and $n_i$ is the number of fields of the $i$-th of the $k$ data field types; $i$, $k$ and $n_i$ are all natural numbers.
The feature extraction module may further include a timestamp quantity extraction unit, which lets a user designate $q$ timestamp fields among $p$ fields and extracts the learning timestamp quantity and the judgment timestamp quantity according to a timestamp quantity algorithm. The timestamp quantity algorithm is $T = \max_{1 \le j \le q} d_j$, where $d_j$ is the number of distinct data contents in the $r$ rows of data of the $j$-th of the $q$ timestamp fields; $j$, $d_j$, $p$, $q$ and $r$ are all natural numbers, and $p > q$.
The judgment module may further include a marking unit, which assigns an automatic input mark or a manual input mark to each to-be-judged form data according to its judgment result before the data is stored in the data storage device.
As for the learning computation, it preferably includes at least one basic training algorithm, chosen from at least one of the K-nearest neighbors (KNN) algorithm, the support vector machine (SVM) algorithm, the decision tree algorithm, and the regression algorithm. More preferably, it further includes at least one fitting algorithm, chosen from at least one of the random forest algorithm and extreme gradient boosting (XGBoost).
In summary, the form data input mode judgment system provided by the present invention draws on long-term observation of the correlations and regularities that distinguish automatically input from manually input form data. It selects the field information quantity and the timestamp quantity, which relate to time and data disorder, as the key features for learning and judgment. Supervised learning on these features can build a judgment computation model with a higher confidence level in a short time and yield judgment results with higher accuracy.
Furthermore, by periodically performing judgment, verification, alerting, and generation of confirmed ground truths, erroneous judgment results can be corrected and the learning computation rerun to revise the judgment computation model. This not only achieves automatic judgment of the input mode but can also substantially raise judgment accuracy in a relatively short time. Once more accurate judgment results are available, verification resources (personnel, equipment, and/or software tools) can be concentrated on checking manually input form data, further raising the overall accuracy of the form data.
Because the form data input mode judgment system provided by the present invention can be applied widely to determine whether form data was produced by automatic or manual input, its applications are too numerous to list one by one. Only one preferred embodiment is described in detail below, and it serves solely to explain the purpose and effect of the invention conveniently and clearly.
Please refer to Figure 1, a functional block diagram of the form data input mode judgment system provided by a preferred embodiment of the present invention. As shown in Figure 1, a form data input mode judgment system (hereinafter the "judgment system") 100 comprises a data storage device 1, a judgment device 2, and data input terminal devices 3a~3c.
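The data storage device 1 may be a data storage server, and the judgment device 2 may be a computing server. The data input terminal devices 3a~3c may be any terminal devices capable of entering form data, such as computers built into work equipment, industrial computers, desktop computers, notebook computers, tablets, or smartphones. They have data input interfaces 31a~31c and feedback operation interfaces 32a~32c, respectively. The data input interfaces 31a~31c and the feedback operation interfaces 32a~32c may be program pages shown after the data input terminal devices 3a~3c execute a specific program, or web pages presented on the data input terminal devices 3a~3c after they connect to a web server.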
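The data storage device 1 stores a plurality of ground-truth form data GF and a plurality of to-be-judged form data JF. Both can be entered by at least one operator of the data input terminal devices 3a~3c through at least one of the data input interfaces 31a~31c and then sent to the data storage device 1 for storage. Each ground-truth form data GF has a corresponding initial ground truth that defines it as automatically input (form data produced by automatic input) or manually input (form data produced by manual input). An initial ground truth is a fact that was confirmed, and is therefore highly credible, before the judgment system 100 is used to judge the to-be-judged form data JF. The to-be-judged form data JF is form data that the judgment system 100 has yet to judge as automatically or manually input.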
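For example, as shown in Table 1, the data storage device 1 stores 10 ground-truth form data with serial numbers 0001~0010. The initial ground truth of the form data numbered 0001 and 0005 is manual input, meaning input by typing, selecting options, or handwriting combined with text recognition software. To make them easy for people and software to identify, the form data numbered 0001 and 0005 can be given the mark "M" for manual input. The initial ground truth of the other 8 ground-truth form data is automatic input, meaning input by barcode reading, chip sensing, tag sensing, image recognition, or automatic population or import by a system. Similarly, automatically input ground-truth form data can be given the mark "A".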
Table 1: List of ground-truth form data
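The judgment device 2 is communicatively connected to the data storage device 1 to retrieve the ground-truth form data GF and the to-be-judged form data JF. A judgment program JAP is installed on it, and executing the judgment program JAP produces a feature extraction module 21, a supervised learning module 22, a judgment module 23, and a verification alarm module 24.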
After long-term observation, the inventors found that the following correlations and regularities commonly distinguish automatically input form data from manually input form data:
1. Automatic input can enter a large amount of data in a short time, so data entered in large amounts within a short time is mostly automatically input.
2. Data with highly similar or highly repetitive content is mostly automatically input, while data with highly dissimilar or rarely repeated content is mostly manually input.
3. Because form data is entered field by field, the multiple rows of data in the same field mostly share the same type attribute (such as time, text, or number).
4. Characteristics such as the similarity, dissimilarity, and repetitiveness of data can be expressed through parameters or indicators related to data disorder.
Based on the correlations the inventors observed and summarized above, parameters or indicators related to time and data disorder should be extracted first as the basis for learning and for subsequent judgment, so as to improve the judgment capability of the judgment device 2. Accordingly, the feature extraction module 21 includes a field information quantity extraction unit 211 and a timestamp quantity extraction unit 212.
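In the learning stage, the field information quantity extraction unit 211 extracts from each ground-truth form data GF a learning field information quantity reflecting data disorder. In the judgment stage, when each to-be-judged form data JF is judged as automatically or manually input, it extracts from that form data a judgment field information quantity reflecting data disorder.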
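The field information quantity extraction unit 211 may operate according to a field entropy algorithm to obtain the learning field information quantity and the judgment field information quantity. The field entropy algorithm is $H = -\sum_{i=1}^{k} p_i \log_{10} p_i$, where $p_i = n_i / \sum_{m=1}^{k} n_m$, $k$ is the number of data field types (that is, there are $k$ data field types in total), and $n_i$ is the number of fields of the $i$-th of the $k$ data field types; $i$, $k$ and $n_i$ are all natural numbers. Because the field entropy algorithm is a logarithmic function of probabilities, each probability is at most 1, and the logarithm of a value below 1 is negative, a minus sign is added to restore a positive value.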
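Besides the field entropy algorithm, the field information quantity extraction unit 211 may also extract other parameters related to data disorder as the learning and judgment field information quantities, such as a data duplication rate (the number of fields holding identical data, the total count of such data, or their proportion) or a data similarity (the proportion of data content that is identical).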
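The timestamp quantity extraction unit 212 may provide a timestamp designation interface (not shown) through which a user designates $q$ timestamp fields among $p$ fields. In the learning stage it extracts the learning timestamp quantity according to a timestamp quantity algorithm, and in the judgment stage it extracts the judgment timestamp quantity according to the same algorithm. The timestamp quantity algorithm may be $T = \max_{1 \le j \le q} d_j$, where $d_j$ is the number of distinct data contents in the $r$ rows of data of the $j$-th of the $q$ timestamp fields; $j$, $d_j$, $p$, $q$ and $r$ are all natural numbers, and $p > q$.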
The content of a timestamp field need not be a time as such, but it should preferably relate to time, for example a sequence number or serial number that reflects chronological order. In addition, although the timestamp quantity algorithm in this embodiment takes the maximum of the numbers of distinct data contents across all timestamp fields, in practice the arithmetic mean, median, or mode of those numbers may also be used as the timestamp quantity.
As an example of extracting the (learning or judgment) field information quantity, consider the form data shown in Table 2, which has 6 fields: sales date, brand, model, quantity, unit price, and sales amount. The sales date is a time field; brand and model are text fields; quantity, unit price, and sales amount are number fields. There are therefore 3 field types (time, text, and number), so $k = 3$. The number of time fields is 1, so $n_1 = 1$; the number of text fields is 2, so $n_2 = 2$; and the number of number fields is 3, so $n_3 = 3$. The probability of the first field type (time) is $p_1 = 1/6$; that of the second (text) is $p_2 = 2/6$; and that of the third (number) is $p_3 = 3/6$. Substituting into the field entropy algorithm, $H = -\left(\tfrac{1}{6}\log_{10}\tfrac{1}{6} + \tfrac{2}{6}\log_{10}\tfrac{2}{6} + \tfrac{3}{6}\log_{10}\tfrac{3}{6}\right)$, gives a field information quantity of 0.4392473. If this form data is a ground-truth form data GF, the learning field information quantity is 0.4392473; if it is a to-be-judged form data JF, the judgment field information quantity is 0.4392473.
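The field entropy calculation in this example can be reproduced with a short sketch. It assumes a base-10 logarithm, which is the base that yields the stated value 0.4392473; the function name `field_entropy` and the type tags are illustrative and do not come from the patent.

```python
import math
from collections import Counter


def field_entropy(field_types):
    """Return the base-10 entropy of the distribution of field types."""
    counts = Counter(field_types)          # n_i for each field type
    total = sum(counts.values())           # total number of fields
    return -sum((n / total) * math.log10(n / total) for n in counts.values())


# Hypothetical type tags for the six fields of the sales form:
# sales date (time), brand and model (text), quantity, unit price, amount (number).
sales_form_types = ["time", "text", "text", "number", "number", "number"]
entropy = field_entropy(sales_form_types)
print(round(entropy, 7))  # 0.4392473
```

A form whose fields all share one type gives an entropy of zero under this definition, and the value grows as the field types become more evenly mixed.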
As an example of extracting the (learning or judgment) timestamp quantity, consider again Table 2. As noted above, there are 6 fields, so $p = 6$. Of these, only the sales date relates to time, so the user can designate the "sales date" field as the timestamp field through the timestamp designation interface (not shown) provided by the timestamp quantity extraction unit 212, giving $q = 1$. The timestamp field ("sales date") holds 14 rows of data, so $r = 14$. Those 14 rows contain only 3 distinct data contents ("March 26", "March 27", and "March 28"), so the number of distinct data contents is 3, that is, $d_1 = 3$. Because there is only one timestamp field ($q = 1$), $T = \max(d_1) = 3$, so the timestamp quantity is 3. If this form data is a ground-truth form data GF, the learning timestamp quantity is 3; if it is a to-be-judged form data JF, the judgment timestamp quantity is 3.
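The timestamp quantity can be sketched in the same way. The rows below are a hypothetical stand-in for Table 2 (only the count of 14 rows and the three distinct dates are taken from the text), and the function name `timestamp_quantity` is an assumption made for illustration.

```python
def timestamp_quantity(rows, timestamp_fields):
    """Return the largest number of distinct values in any designated timestamp field."""
    if not timestamp_fields:
        raise ValueError("at least one timestamp field must be designated")
    return max(len({row[field] for row in rows}) for field in timestamp_fields)


# Hypothetical data: 14 rows whose sales dates fall on three different days.
dates = ["3/26"] * 5 + ["3/27"] * 5 + ["3/28"] * 4
rows = [{"sales_date": d, "quantity": i + 1} for i, d in enumerate(dates)]

tq = timestamp_quantity(rows, ["sales_date"])
print(tq)  # 3
```

Replacing `max` with a mean, median, or mode over the per-field distinct counts would give the alternative definitions the description mentions.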
Table 2: Form data
Although the six fields in the above example (sales date, brand, model, quantity, unit price, and sales amount) are arranged horizontally, in practice they may also be arranged vertically. In that case the r rows (arranged horizontally) of data are replaced by r columns (arranged vertically) of data, and feature extraction, including the (learning or judgment) field information quantity and timestamp quantity, proceeds as described above with rows and columns interchanged; it is not repeated below.
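The supervised learning module 22 performs a learning computation based on the learning field information quantities, the learning timestamp quantities, and the initial ground truths (which may be represented by the automatic input mark A or the manual input mark M) of the ground-truth form data GF, thereby producing a judgment computation model. The learning computation preferably includes at least one basic training algorithm, chosen from at least one of the K-nearest neighbors (KNN) algorithm, the support vector machine (SVM) algorithm, the decision tree algorithm, and the regression algorithm. More preferably, in addition to the basic training algorithm, it further includes at least one fitting algorithm, chosen from at least one of the random forest algorithm and extreme gradient boosting (XGBoost). The judgment computation model is the mathematical model derived and built automatically by performing this learning computation.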
Because these learning computation techniques (both the basic training algorithms and the fitting algorithms) are already mature, anyone of ordinary skill in the art can use them, alone or in combination, to build the judgment computation model; they are not described further below.
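As one possible illustration of the learning and judgment stages, the sketch below uses a plain k-nearest neighbors classifier, one of the basic training algorithms listed above. The training values, the function names `train_knn` and `judge`, and the choice of k are all invented for the example; the patent does not publish its data or a specific implementation.

```python
import math
from collections import Counter


def train_knn(features, labels, k=3):
    """For k-NN, 'training' simply stores the labelled samples."""
    return {"X": list(features), "y": list(labels), "k": k}


def judge(model, sample):
    """Predict 'A' (automatic) or 'M' (manual) by majority vote of the k nearest samples."""
    nearest = sorted(
        zip(model["X"], model["y"]),
        key=lambda pair: math.dist(pair[0], sample),
    )[: model["k"]]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training set:
# (field information quantity, timestamp quantity) -> initial ground truth.
X = [(0.30, 2), (0.28, 3), (0.31, 1), (0.45, 12), (0.47, 15), (0.44, 11)]
y = ["A", "A", "A", "M", "M", "M"]

model = train_knn(X, y, k=3)
print(judge(model, (0.29, 2)))   # A
print(judge(model, (0.46, 13)))  # M
```

The two features are on very different scales here, so a real model would probably normalize them first; that step is left out to keep the sketch short.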
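The judgment module 23 may include a judgment unit 231 and a marking unit 232. The judgment unit 231 uses the judgment computation model together with the judgment field information quantity and judgment timestamp quantity of each to-be-judged form data to produce a judgment result predicting whether that to-be-judged form data JF was input automatically or manually, thereby producing a plurality of judgment results. The marking unit 232 assigns an automatic input mark A or a manual input mark M to each to-be-judged form data according to its judgment result before the data is stored in the data storage device 1.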
Put simply, the judgment computation model learns in the learning stage how to derive the initial ground truth (which may be represented by the automatic input mark A or the manual input mark M) from the learning field information quantity and the learning timestamp quantity, so that in the judgment stage it can derive and predict the judgment result (likewise representable by the mark A or M) from the judgment field information quantity and the judgment timestamp quantity.
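For example, as shown in Table 3, the data storage device 1 also stores 10 to-be-judged form data JF with serial numbers 1001~1010, so 10 judgment results are produced once judgment is complete. The judgment result for the to-be-judged form data numbered 1001, 1003, and 1005 is manual input, so the marking unit 232 assigns them the manual input mark M. The judgment result for the remaining to-be-judged form data is automatic input, so the marking unit 232 assigns them the automatic input mark A.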
Table 3: Judgment results and marks of the to-be-judged form data
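After the judgment results are stored in the data storage device 1, the judgment device 2 may push a notification to the data input terminal devices 3a~3c asking their operators to verify whether the judgment results are correct. When an operator finds that the judgment result for a to-be-judged form data is wrong, the operator enters the corresponding feedback ground truth through the feedback operation interfaces 32a~32c, and that to-be-judged form data is defined as a judgment-anomaly form data. When the judgment result is found to be correct, the operator may enter a judgment-correct message. Both feedback ground truths and judgment-correct messages are sent to the judgment device 2.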
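The verification alarm module 24 may include a verification unit 241, an alarm unit 242, a verification period setting interface 243, and a judgment accuracy calculation unit 244. When the judgment device 2 receives a feedback ground truth, the verification unit 241 defines the corresponding to-be-judged form data as a judgment-anomaly form data and deems the judgment result inconsistent with the feedback ground truth. The verification unit 241 then records the feedback ground truth as the confirmed ground truth of that judgment-anomaly form data, and the alarm unit 242 issues a judgment-anomaly alert message.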
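Conversely, when the judgment device 2 receives a judgment-correct message, the verification unit 241 defines the corresponding to-be-judged form data as a judgment-correct form data and deems the judgment result consistent with the feedback ground truth. The verification unit 241 then directly records the judgment result as the confirmed ground truth of that judgment-correct form data. Next, the verification unit 241 may send the judgment-anomaly form data and the corresponding confirmed ground truths to the data storage device 1 for storage as newly added ground-truth form data and initial ground truths, so that the supervised learning module 22 can rerun the learning computation and revise the judgment computation model.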
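The verification period setting interface 243 is used to set a verification period. According to that period, the judgment-anomaly form data and confirmed ground truths are periodically taken as ground-truth form data and initial ground truths, so that the supervised learning module performs the learning computation periodically and thereby periodically revises the judgment computation model. The judgment accuracy calculation unit 244 counts how many of the to-be-judged form data JF were defined as judgment-anomaly form data and how many as judgment-correct form data, and from these counts calculates the judgment accuracy for each verification period. The verification period may be chosen according to how often judgment accuracy statistics are needed, the volume of form data, or other requirements, and may be set to once a day, once a week, once a month, or once a quarter, for example.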
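For example, continuing from the judgment results of Table 3 and as shown in Table 4, suppose that after verifying the judgment results an operator of the data input terminal devices 3a~3c finds that the to-be-judged form data numbered 1003 was judged to be manually input but was in fact automatically input. The operator can use one of the feedback operation interfaces 32a~32c to enter the feedback ground truth "automatic input". The verification unit 241 then defines the to-be-judged form data numbered 1003 as a judgment-anomaly form data, records the feedback ground truth (automatic input) as its confirmed ground truth, and stores it in the data storage device 1. The alarm unit 242 issues a judgment-anomaly alert message, indicating that the judgment result produced by the current mathematical model for the to-be-judged form data numbered 1003 is wrong.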
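Conversely, if after verification the operator of the data input terminal devices 3a~3c finds that the judgment results for the remaining to-be-judged form data are all correct, the operator can enter a judgment-correct message through the feedback operation interfaces 32a~32c. The verification unit 241 then defines those to-be-judged form data as judgment-correct form data, directly records their judgment results as confirmed ground truths, and stores them in the data storage device 1 as well.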
During each verification period, multiple verified to-be-judged form data JF and their confirmed ground truths can accumulate. On entering the next verification period, some or all of the accumulated verified to-be-judged form data JF (for example only the judgment-anomaly form data, or both the judgment-anomaly and the judgment-correct form data) and their confirmed ground truths are taken as newly added ground-truth form data and the corresponding initial ground truths.
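The verification step described above can be sketched as follows. The sketch assumes that the ten judgments are those of the running example with serial numbers 1001, 1003 and 1005 judged manual, and that an absent feedback entry means the operator confirmed the judgment; the function name `verify` and the dictionary layout are illustrative only.

```python
def verify(judgments, feedback):
    """Derive confirmed ground truths and list the judgment-anomaly forms.

    judgments: {serial: "A" or "M"} produced by the judgment module.
    feedback:  {serial: "A" or "M"} only for forms an operator corrected.
    """
    confirmed, anomalies = {}, []
    for serial, judged in judgments.items():
        truth = feedback.get(serial, judged)  # no feedback: judgment accepted as is
        confirmed[serial] = truth
        if truth != judged:
            anomalies.append(serial)          # this is where an alert would be raised
    return confirmed, anomalies


judgments = {str(n): "A" for n in range(1001, 1011)}
for serial in ("1001", "1003", "1005"):
    judgments[serial] = "M"
feedback = {"1003": "A"}  # operator reports that form 1003 was actually automatic

confirmed, anomalies = verify(judgments, feedback)
# Anomalous forms and their confirmed truths become new training samples.
new_training_labels = {s: confirmed[s] for s in anomalies}
print(anomalies)            # ['1003']
print(new_training_labels)  # {'1003': 'A'}
```

Feeding `new_training_labels`, together with the features of the corresponding forms, back into the training set is what lets the model be revised in the next verification period.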
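The judgment accuracy calculation unit 244 finds that, of the 10 to-be-judged form data, only 1 (the one numbered 1003) is a judgment-anomaly form data and the other 9 are judgment-correct form data. It therefore calculates that, for this verification period, the accuracy with which the judgment system 100 judged the to-be-judged form data JF as automatically or manually input is 90%. The judgment accuracy calculation unit 244 can also report the form data automation rate for this verification period: as the confirmed ground truths in Table 4 show, there were 8 automatically input form data and 2 manually input form data, so the automation rate for this period is 80%.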
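The two statistics in this paragraph are simple ratios, shown here as a sketch under the same assumed example data (forms 1001, 1003 and 1005 judged manual, form 1003 corrected to automatic). The function names are hypothetical.

```python
def judgment_accuracy(judgments, confirmed):
    """Share of forms whose judgment matches the confirmed ground truth."""
    correct = sum(1 for serial, judged in judgments.items() if confirmed[serial] == judged)
    return correct / len(judgments)


def automation_rate(confirmed):
    """Share of forms whose confirmed ground truth is automatic input."""
    automatic = sum(1 for truth in confirmed.values() if truth == "A")
    return automatic / len(confirmed)


judgments = {str(n): "A" for n in range(1001, 1011)}
for serial in ("1001", "1003", "1005"):
    judgments[serial] = "M"

confirmed = dict(judgments)
confirmed["1003"] = "A"  # the one judgment-anomaly form in the example

accuracy = judgment_accuracy(judgments, confirmed)
rate = automation_rate(confirmed)
print(f"{accuracy:.0%} {rate:.0%}")  # 90% 80%
```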
Table 4: Results of verifying the judgment results
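Over multiple verification periods of judgment, verification, and rerunning the learning computation to revise the judgment computation model, the accuracy with which the judgment system 100 judges the to-be-judged form data JF as automatically or manually input can rise gradually. Once the judgment accuracy exceeds a target accuracy (such as 99.99%), the judgment capability of the judgment system 100 has reached a sufficient confidence level. The verification period can then be lengthened (for example from once a quarter to once a year), or the judgment results can be accepted directly; that is, each judgment result produced by the judgment system 100 is treated as ground truth without further verification.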
Furthermore, verification resources (personnel, equipment, and/or software tools) can be concentrated on checking form data produced by manual input, that is, form data bearing the manual input mark M (in particular ground-truth form data GF verified to be manually input), and on correcting the errors in it, to raise the overall accuracy of the form data. In addition, by raising the sampling inspection rate for manually input form data and lowering it for automatically input form data, overall accuracy can be raised efficiently without increasing the total verification workload (the total verification resources invested).
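Because the feature extraction module 21, the supervised learning module 22, the judgment module 23, and the verification alarm module 24 are all produced by executing the judgment program JAP, they may in essence be (parts of) the main program or subprograms of the judgment program JAP, or program pages or functional interfaces produced by executing it. Anyone of ordinary skill in the art (particularly in artificial intelligence algorithms) can, following the learning and judgment logic above, use a suitable programming language to write a judgment program JAP (including its main program or subprograms) that provides the functions of the feature extraction module 21, the supervised learning module 22, the judgment module 23, and the verification alarm module 24, thereby implementing the techniques of the present invention.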
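In summary, the form data input mode judgment system 100 provided by the present invention draws on long-term observation of the correlations and regularities that distinguish automatically input from manually input form data. It selects the field information quantity and the timestamp quantity, which relate to time and data disorder, as the key features for learning and judgment. Supervised learning on these features can build a judgment computation model with a higher confidence level in a short time and yield judgment results with higher accuracy.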
Furthermore, by using the judgment device 2 to periodically perform judgment, verification, alerting, and generation of confirmed ground truths, erroneous judgment results can be corrected and the learning computation rerun to revise the judgment computation model. This not only achieves automatic judgment of the input mode but can also substantially raise judgment accuracy in a relatively short time. Once more accurate judgment results are available, verification resources (personnel, equipment, and/or software tools) can be concentrated on checking manually input form data, further raising the overall accuracy of the form data.
The detailed description of the preferred embodiment above is intended to describe the features and spirit of the present invention more clearly, not to limit its scope to the preferred embodiment disclosed. On the contrary, the intent is to cover various modifications and equivalent arrangements within the scope of the claims of the present invention.
Figure 1 is a functional block diagram of the form data input mode judgment system provided by a preferred embodiment of the present invention.
100: Judgment system
1: Data storage device
2: Judgment device
21: Feature extraction module
211: Field information quantity extraction unit
212: Timestamp quantity extraction unit
22: Supervised learning module
23: Judgment module
231: Judgment unit
232: Marking unit
24: Verification alarm module
241: Verification unit
242: Alarm unit
243: Verification period setting interface
244: Judgment accuracy calculation unit
3a~3c: Data input terminal devices
31a~31c: Data input interfaces
32a~32c: Feedback operation interfaces
GF: Ground-truth form data
JF: To-be-judged form data
JAP: Judgment program
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW112123637A TWI845355B (en) | 2023-06-26 | 2023-06-26 | System for judging input mode of form data |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| TWI845355B true TWI845355B (en) | 2024-06-11 |
| TW202501334A TW202501334A (en) | 2025-01-01 |
Family ID: 92541729
Country Status (1)
| Country | Link |
|---|---|
| TW (1) | TWI845355B (en) |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| TW200421152A (en) * | 2003-02-01 | 2004-10-16 | Baxter Int | Remote multi-purpose user interface for a healthcare system |
| TW200632801A (en) * | 2004-10-08 | 2006-09-16 | Univ Utah Res Found | System for supervised remote training |
| WO2022133210A2 (en) * | 2020-12-18 | 2022-06-23 | Strong Force TX Portfolio 2018, LLC | Market orchestration system for facilitating electronic marketplace transactions |
| CN116113967A (en) * | 2020-07-16 | 2023-05-12 | Strong Force TX Portfolio 2018, LLC | System and method for controlling rights related to digital knowledge |
2023-06-26: TW application TW112123637A, patent TWI845355B (en), status: active
Also Published As
| Publication number | Publication date |
|---|---|
| TW202501334A (en) | 2025-01-01 |