TWI758725B

TWI758725B - Data analysis system and data analysis method

Info

Publication number: TWI758725B
Application number: TW109115289A
Authority: TW
Inventors: 邵志杰; 劉正邦; 龔如心
Original assignee: 台達電子工業股份有限公司
Priority date: 2020-05-08
Filing date: 2020-05-08
Publication date: 2022-03-21
Also published as: TW202143093A

Abstract

A data analysis method includes the following steps: obtaining a data table including a plurality of fields, each of which stores field data; analyzing a field type based on the field data; determining a field category of each of the fields; calculating a similarity between the fields, and determining a correlation between the fields based on the similarities; and generating a field data description file according to the field categories, field types, and correlations, and using the field data description file to determine whether the data contains abnormal data, so as to evaluate data quality.

Description

Data analysis system and data analysis method

Embodiments of the present invention relate generally to an analysis method, and in particular to a data analysis system and a data analysis method.

As data collection has become more convenient, the amount of available data has grown rapidly, and data analysis technology has flourished with it. Effective big-data analysis depends on good data quality, so data quality is an important issue in data analysis. Current approaches to data quality diagnosis fall into two groups: data analysis experts writing their own analyses in a programming language, or using commercially available analysis software packages.

However, in a data analysis workflow, data quality must be confirmed first, and only then should data preprocessing be carried out. In practice, data quality is often examined during the preprocessing stage itself, so this stage requires a great deal of manpower and incurs large communication and time costs.

Therefore, how to establish an automated assistance mechanism that reduces the manpower and time required in the data preprocessing stage has become one of the problems to be solved in this field.

In view of the above problems of the prior art, embodiments of the present invention provide a data analysis system and method.

According to an embodiment of the present invention, a data analysis system is provided. The data analysis system includes a processor, a storage device, a field type analysis device, a field classification device, and a field association device. The processor obtains at least one data table; the data table includes a plurality of fields, each of which stores field data. The storage device stores the data table. The field type analysis device analyzes a field type according to the field data. The field classification device determines a field category of each of the fields. The field association device calculates a similarity between fields across data tables, and determines a correlation between the fields according to the similarities. The processor generates a field data description file according to the field categories, the field types, and the correlations, and determines whether the field data description file is abnormal.

According to an embodiment of the present invention, a data analysis method is provided. The steps of the data analysis method include: obtaining a data table that includes a plurality of fields, each of which stores field data; analyzing a field type according to the field data; determining a field category of each of the fields; calculating a similarity between fields across data tables, and determining a correlation between the fields according to the similarities; and generating a field data description file according to the field categories, the field types, and the correlations, and then determining whether the field data description file is abnormal.

According to the data analysis method and data analysis system proposed by the present invention, an automated mechanism can be established at the data preprocessing stage by analyzing information such as field category, field type, and correlation, to generate a data description file for the fields. This helps users understand the data quickly, reduces the manpower required in the data preprocessing stage, and improves the efficiency of data analysis in that stage.

The following description presents preferred implementations of the invention. Its purpose is to describe the basic spirit of the invention, not to limit the invention. The actual scope of the invention must be determined by reference to the claims that follow.

It must be understood that words such as "comprise" and "include" used in this specification indicate the presence of specific technical features, values, method steps, operations, elements, and/or components, but do not exclude the addition of further technical features, values, method steps, operations, elements, components, or any combination of the above.

Words such as "first", "second", and "third" are used in the claims to modify claim elements. They do not indicate a priority order, a precedence relationship, that one element precedes another, or a chronological order in which method steps are performed; they are used only to distinguish elements that have the same name.

FIG. 1 is a block diagram of a data analysis system 100 according to an embodiment of the present invention. As shown in FIG. 1, the data analysis system 100 may include a processor 10, a storage device 20, a field type analysis device 30, a field classification device 40, and a field association device 50. Note that the block diagram in FIG. 1 is provided only for convenience in describing the embodiments; the present invention is not limited to FIG. 1, and the data analysis system 100 may also include other components.

In one embodiment, the processor 10 is, for example, a microcontroller, a microprocessor, a digital signal processor, an application-specific integrated circuit (ASIC), or a logic circuit.

In one embodiment, the field type analysis device 30, the field classification device 40, and the field association device 50 may be implemented, individually or in combination, as for example a microcontroller, a microprocessor, a digital signal processor, an application-specific integrated circuit (ASIC), or a logic circuit.

In one embodiment, the field type analysis device 30, the field classification device 40, and the field association device 50 may be software run by an electronic device (for example, one including a circuit, a processor, or a logic circuit).

In one embodiment, the storage device 20 is, for example, a read-only memory, a flash memory, a floppy disk, a hard disk, an optical disc, a USB flash drive, a magnetic tape, a database accessible over a network, or any storage medium with the same function that a person skilled in the art can readily conceive of. The storage device 20 can be used to store one or more data tables.

FIG. 2 is a schematic diagram of a data analysis method 200 according to an embodiment of the present invention. The data analysis method 200 of FIG. 2 can be implemented by the data analysis system 100 of FIG. 1.

In step 210, the processor 10 obtains a data table.

In one embodiment, the data table includes a plurality of fields, each of which stores field data. For example, the data table includes a machine model field, a machine identification (ID) field, a machine multitasking field, a manufacturing time field, a shipping time field, and so on, and these fields store different data: the machine model field stores "NB1" (a string), the machine ID field stores "3" (an integer), the machine multitasking field stores "0" (a Boolean value), the manufacturing time field stores "2020/03/16" (a date), and the shipping time field stores "2020/09/16" (a date). This is only an example; the fields and field data of the present invention are not limited thereto.

In one embodiment, the processor 10 may obtain multiple data tables.

In step 220, the processor 10 triggers the field type analysis device 30, the field classification device 40, and the field association device 50 to generate a field data description file.

In one embodiment, step 220 includes any one or a combination of sub-steps 220(a) to 220(c). In sub-step 220(a), the processor 10 analyzes and obtains the field type. In sub-step 220(b), the processor 10 analyzes and obtains the field identification. In sub-step 220(c), the processor 10 analyzes and obtains the field correlation.

In one embodiment, the field type analysis device 30 analyzes a field type according to the field data. The field type is the data type of the content stored in a field (for example, a column holding 500 records), such as numeric, string, time, or Boolean. Within a field, the data type that accounts for the most records is regarded as the main type of the field. For example, if a field in the data table holds 500 records and 499 of them are numeric values, the field is defined as having the numeric field type.

In one embodiment, the field classification device 40 determines a field category of each of the fields. The field category is the category to which the field name itself belongs, for example person, machine, material, method, measurement, or other. For example, if a field name includes the keyword "機台" (machine), the field is classified into the machine category.

In one embodiment, the field association device 50 calculates a similarity between each pair of fields taken from different data tables (that is, across data tables), and determines from the similarities whether a correlation exists between the fields. Correlation refers to the degree of relatedness between at least two fields across data tables. For example, the manufacturing time field in a product manufacturing table and the shipping time field in a product shipping table come from different data tables but are correlated in time.

In one embodiment, the processor 10 generates a field data description file according to the field categories, the field types, and the correlations, and then determines whether the field data description file is abnormal.

In one embodiment, the field data description file includes information such as the field categories, field types, and field correlations.

The detailed flows of the field type analysis device 30, the field classification device 40, and the field association device 50 are described below with reference to FIGS. 3 to 5.

In step 230, the processor 10 determines whether the field data description file is abnormal. In one embodiment, the processor 10 determines whether the field data description file is complete and correct. If the processor 10 determines that the field data description file is incomplete or contains errors, the flow proceeds to step 240. If the processor 10 determines that the field data description file is complete and correct, the flow ends.

In one embodiment, the field data description file is determined to be abnormal when it is incomplete or when it contains errors.

For example, suppose a field in the data table holds 500 records, of which 499 are numeric values and 1 is a string. This field should be defined as having the numeric field type. If the field type analysis device 30 instead reports another field type (such as string, Boolean, or time), the processor 10 determines that the field data description file is abnormal, and the flow proceeds to step 240.

As another example, suppose a field in the data table holds 500 records, of which 499 are numeric values and 1 is blank. If the blank record prevents the field type analysis device 30 from determining the field type, the processor 10 determines that the field data description file is incomplete or contains errors, and the flow proceeds to step 240.

In step 240, when the processor 10 determines that the field data description file is abnormal, it automatically corrects the content of the field data description file.

In one embodiment, the processor 10 identifies the missing parts of the field data description file and recomputes the missing data from the storage device 20, thereby automatically correcting the content of the field data description file. For example, step 240 includes sub-steps 241 to 243: correcting the field data category (category) 241, correcting the field data type (data type) 242, and/or correcting the related fields in other data tables (related column) 243. In one embodiment, the user may enter new description file content for the missing parts. For example, the user enters new or updated data through an input device (such as a mouse, a touch screen, or a keyboard) based on the missing parts of the description file; after receiving the new or updated data from the input device, the processor 10 uses it to complete the content of the field data description file. Such corrections include, for example: adding or updating the field data description (description), adding or updating the number of field data groups (group), adding or updating whether the field allows null values (nullable), adding or updating the upper and lower bounds of the field data (value range), setting whether abnormal data may be ignored, and/or adding or updating related fields in the same data table (relation column).
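The attributes listed above suggest what one entry of the field data description file might hold. The following is a minimal sketch only: the class name `FieldDescription`, its attribute names, and the `missing_parts` helper are hypothetical and chosen to mirror the parenthesized terms in the text; the patent does not specify a file format.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class FieldDescription:
    """One entry of a hypothetical field data description file."""
    name: str
    category: Optional[str] = None       # e.g. "machine", "measurement" (sub-step 241)
    data_type: Optional[str] = None      # e.g. "numeric", "string", "time" (sub-step 242)
    related_columns: List[str] = field(default_factory=list)   # fields in other tables (sub-step 243)
    description: str = ""                # free-text description of the field
    group: Optional[int] = None          # number of data groups
    nullable: bool = False               # whether null values are allowed
    value_range: Optional[Tuple[float, float]] = None          # lower and upper bounds
    relation_columns: List[str] = field(default_factory=list)  # related fields in the same table

    def missing_parts(self) -> List[str]:
        """Return the core attributes that are still unset (assumed completeness rule)."""
        missing = []
        if self.category is None:
            missing.append("category")
        if self.data_type is None:
            missing.append("data_type")
        return missing
```

An entry whose `missing_parts()` list is non-empty would correspond to an incomplete description file that step 240 tries to repair, either by recomputation or from user input.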

In one embodiment, the processor 10 corrects the missing parts of the data description file according to a preset rule (for example, filling a blank field with "0", or calculating an average from the field data of the two fields adjacent to the blank field and filling the blank field with that average).
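The two preset rules mentioned above can be sketched as follows. This is an illustration under assumptions, not the patented implementation: the function name `fill_missing`, the rule names, the use of `None` for a blank entry, and the fallback to 0 when no neighbor is available are all choices made here.

```python
from typing import List, Optional


def fill_missing(values: List[Optional[float]], rule: str = "zero") -> List[float]:
    """Fill blank (None) entries with 0, or with the mean of the adjacent entries."""
    filled = []
    for i, v in enumerate(values):
        if v is not None:
            filled.append(v)
        elif rule == "neighbor_mean":
            left = values[i - 1] if i > 0 else None
            right = values[i + 1] if i + 1 < len(values) else None
            neighbors = [n for n in (left, right) if n is not None]
            # Assumption: fall back to 0 when neither neighbor holds a value.
            filled.append(sum(neighbors) / len(neighbors) if neighbors else 0.0)
        else:
            filled.append(0.0)
    return filled
```

For instance, `fill_missing([1.0, None, 3.0], "neighbor_mean")` replaces the blank entry with the average of 1.0 and 3.0.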

In one embodiment, if the processor 10 determines according to a preset rule that the field data may be null, the processor 10 sets the field data description to allow null values for this field, and the data analysis system subsequently ignores this abnormal data.

In one embodiment, when the processor 10 determines that the field data description file is abnormal, the processor 10 corrects the field data description file (for example, converts a numeric value into a string), adds to the field data description file (for example, through user input or by having the processor 10 retrieve the missing data from the storage device 20), edits the field data description file (for example, changes a numeric value), ignores the abnormal data, or shows on a display that the field data description file is abnormal.

FIGS. 3A and 3B are flowcharts of a field type analysis method 300 according to an embodiment of the present invention. In step 310, the processor 10 obtains one or more data tables. In step 320, the field type analysis device 30 analyzes the field type.

In one embodiment, the field type analysis device 30 regards the most frequent data type in a single field as the field type of that field. For example, if a field in the data table holds 500 records and 499 of them are numeric values, the field type is defined as the numeric field type. If a field holds 500 records and 480 of them are strings, the field type is defined as the string field type.
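The majority rule of step 320 can be sketched in a few lines. The helper names `value_type` and `dominant_type`, the type labels, and the treatment of `None` or an empty string as an empty cell are assumptions made for this illustration.

```python
from collections import Counter
from datetime import datetime


def value_type(value):
    """Classify one cell as 'empty', 'boolean', 'numeric', 'time' or 'string'."""
    if value is None or value == "":
        return "empty"
    if isinstance(value, bool):          # check bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "numeric"
    if isinstance(value, datetime):
        return "time"
    return "string"


def dominant_type(values):
    """Return the most frequent data type in a field, or None for an empty field."""
    if not values:
        return None
    counts = Counter(value_type(v) for v in values)
    return counts.most_common(1)[0][0]
```

With 499 numeric records and one string, `dominant_type` reports the numeric type, matching the example in the text.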

In step 330, the field type analysis device 30 determines whether the field type is a numeric field type. If so, the flow proceeds to step 340. If not, the flow proceeds to step 350.

In step 340, the field type analysis device 30 determines whether the field data are integers or floating-point numbers. If so, the flow proceeds to step 343. If not, the flow proceeds to step 345.

In one embodiment, integers and floating-point numbers are collectively referred to as numeric values.

In step 343, the field type analysis device 30 confirms that the field type in the field data description file is the numeric field type.

In one embodiment, the numeric field type includes integers and floating-point numbers.

In one embodiment, if the field type analysis device 30 finds an abnormality in the field data, it adds to the field data description file, edits the field data description file, ignores the abnormal field data, or shows the abnormal data on a display. For example, if some of the field data are null, the null data in this field are ignored.

In step 345, the field type analysis device 30 corrects the field type to a non-numeric field type.

In one embodiment, when the field type analysis device 30 further determines that the field data store only 0 or 1, the field is regarded as having the Boolean field type, and the field type is therefore corrected to a non-numeric field type. This is only an example, and the present invention is not limited thereto.

In step 350, the field type analysis device 30 determines whether the field data include numeric values. If so, the flow proceeds to step 353. If the field type analysis device 30 determines that the field data are not numeric values, the flow proceeds to step 355.

In one embodiment, when the field type analysis device 30 further determines that the field data store "12" as a string, the field data are regarded as including a numeric value, and the flow proceeds to step 353. This is only an example, and the present invention is not limited thereto.

In step 353, the field type analysis device 30 corrects the field type in the field data description file to the numeric field type.

In one embodiment, if the field type analysis device 30 finds an abnormality in the field data, it adds to the field data description file, edits the field data description file, ignores the abnormal field data, or shows the abnormal data on a display. For example, if the field data contain many null values (causing the field to be judged as a non-numeric field type in step 320), the null field data may be ignored so that the field data description file can be corrected: if all non-null field data are numeric, the field type in the field data description file is corrected to the numeric field type.

In step 355, the field type analysis device 30 determines whether the field data are of one of the following data types: date, time, or date-and-time. If so, the flow proceeds to step 360. If not, the flow proceeds to step 370.

In one embodiment, the date, time, and date-and-time data types are collectively referred to as time data types.

In step 360, the field type analysis device 30 corrects the field type in the field data description file to the time field type.

In one embodiment, the field type analysis device 30 subdivides the time field type. For example, it subdivides the time field type into time or date. As another example, it subdivides the time field type into date-and-time.

In step 370, the field type analysis device 30 determines whether the field data can be classified into another field type. If so (for example, the field type analysis device 30 can still find that a particular kind of field data accounts for the largest share), the flow proceeds to step 380. If the field data cannot be classified into another field type, the flow ends.

In step 380, the field type analysis device 30 determines whether the field data are text data or Boolean data. When the field type analysis device 30 determines that the field data are text data or Boolean data, it corrects the field type in the field data description file to a text type or a Boolean type accordingly.
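Steps 330 to 380 can be read as a decision flow that refines the field type found in step 320. The sketch below is one possible reading, not the flow of FIGS. 3A and 3B itself: the function names, the type labels, the accepted time formats in `TIME_FORMATS`, the choice to ignore empty cells, and the use of "all non-empty values" as the test at each branch are assumptions.

```python
from datetime import datetime

# Assumed date/time formats; the text only gives "2020/03/16" as an example.
TIME_FORMATS = ("%Y/%m/%d", "%Y-%m-%d", "%H:%M:%S", "%Y/%m/%d %H:%M:%S")


def is_number(v):
    """True for ints/floats and for strings such as "12" that parse as numbers."""
    if isinstance(v, bool):
        return False
    if isinstance(v, (int, float)):
        return True
    try:
        float(v)
        return True
    except (TypeError, ValueError):
        return False


def is_time(v):
    """True for datetime objects and for strings matching an assumed format."""
    if isinstance(v, datetime):
        return True
    if not isinstance(v, str):
        return False
    for fmt in TIME_FORMATS:
        try:
            datetime.strptime(v, fmt)
            return True
        except ValueError:
            continue
    return False


def refine_field_type(values, initial_type):
    """Refine the field type from step 320 along the lines of steps 330 to 380."""
    data = [v for v in values if v not in (None, "")]      # ignore empty cells
    if initial_type == "numeric":                           # step 330 -> step 340
        if all(is_number(v) for v in data):
            # Example given for step 345: a field holding only 0/1 is Boolean.
            if data and all(float(v) in (0.0, 1.0) for v in data):
                return "boolean"
            return "numeric"                                # step 343
        return "non-numeric"                                # step 345
    if data and all(is_number(v) for v in data):            # step 350 -> step 353
        return "numeric"
    if data and all(is_time(v) for v in data):              # step 355 -> step 360
        return "time"
    if data and all(isinstance(v, bool) for v in data):     # step 380
        return "boolean"
    if data and all(isinstance(v, str) for v in data):      # step 380
        return "string"
    return initial_type                                     # step 370: no other type applies
```

A field first judged as string whose non-empty values are all numeric strings, such as "12", is corrected to the numeric type, which matches the example given for step 353.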

FIG. 4 is a flowchart of a field classification method 400 according to an embodiment of the present invention. In step 410, the field classification device 40 performs word segmentation on each field name. For example, the Chinese field name "機台編號" (machine number) is segmented into "機台" (machine) and "編號" (number), and the English field name "functionId" is segmented into "function" and "Id". Chinese field names are usually segmented by mapping the field name against a known corpus and splitting off each matching word; known word segmentation tools such as CKIP, HanLP, Ansj, and Jieba can also be used. English field names can be segmented by detecting capitalization patterns, word roots, underscores, or spaces, or by following the naming convention used for the field names.
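For English field names, the capitalization, underscore, and whitespace rules of step 410 can be sketched with regular expressions from the Python standard library. The function name `split_field_name` and the exact patterns are assumptions for illustration; Chinese names would need a segmenter such as those named above and are not covered here.

```python
import re


def split_field_name(name: str):
    """Split an English field name on underscores, whitespace and case changes."""
    words = []
    for part in re.split(r"[_\s]+", name):
        # Matches an acronym run ("HTTP" before "Server"), a capitalized or
        # lowercase word ("Id", "function"), or a run of digits.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return words
```

Applied to the example in the text, `split_field_name("functionId")` yields the two words "function" and "Id".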

In step 420, the field classification device 40 converts each of the segmented words into a word feature and inputs the word features into a classification model.

In one embodiment, the field classification device 40 compares all the segmented words against a pre-built corpus. For example, if the word "機台" (machine) exists in the pre-built corpus, it is marked as 1; if the word "冰淇淋" (ice cream) does not exist in the pre-built corpus, it is marked as 0. After the comparison, the result is a set of word features made up of 0s and 1s.
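The 0/1 marking described above can be sketched as follows. `mark_words` follows the text directly; `bag_of_words` is an added assumption showing one common way to obtain a fixed-length vector for a classifier, and both function names are hypothetical.

```python
def mark_words(words, corpus):
    """Mark each segmented word with 1 if it appears in the corpus, else 0."""
    return [1 if w in corpus else 0 for w in words]


def bag_of_words(words, vocabulary):
    """Assumed variant: one 0/1 entry per vocabulary word, in vocabulary order."""
    present = set(words)
    return [1 if v in present else 0 for v in vocabulary]
```

With a corpus containing "機台" and "編號", the words "機台" and "冰淇淋" are marked 1 and 0 respectively, as in the example from the text.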

In one embodiment, the word features may be feature vectors, feature matrices, or a sequence of values. The field classification device 40 inputs the word features into a classification model, for example a decision tree model. Decision tree models are often used in decision analysis to help identify the strategy most likely to reach a goal. A decision tree can serve as a descriptive means of calculating conditional probabilities; in other words, it can infer the most likely category of a field from the word features. The decision tree model is a known technique and is not described further here.

In step 430, the classification model outputs the field category according to the word features. In one embodiment, the field category is, for example, person, machine, material, method, measurement, or other. This is only an example, and the present invention is not limited thereto.

For example, when the word feature corresponding to "機台" (machine) is input into the decision tree model, the model maps "機台" to the machine field category.

As another example, when the word feature corresponding to "公分" (centimeter) is input into the decision tree model, the model maps "公分" to the measurement field category.

In one embodiment, the field classification device 40 determines the field category of each of the fields by means of a decision tree algorithm, a Bayes classification algorithm, a k-nearest neighbors algorithm, or a support vector machine algorithm.

In this way, the field classification device 40 can apply the field classification method 400 to determine field categories from table names and field names.

FIG. 5 is a flowchart of a field association method 500 according to an embodiment of the present invention. In one embodiment, the processor 10 obtains a plurality of data tables. In step 510, the field association device 50 picks any two of the data tables as a first data table and a second data table, selects a first field from the first data table, and selects a second field from the second data table. The first field includes first word segmentation data, and the second field includes second word segmentation data.

In one embodiment, the field association device 50 performs word segmentation on the field data in the first field and the second field to obtain the first word segmentation data and the second word segmentation data.

In one embodiment, the first word segmentation data and the second word segmentation data are in the same language. In a Chinese example, the first word segmentation data is "機械" (machinery) and the second word segmentation data is "機台" (machine). In an English example, the first word segmentation data is "wire" and the second word segmentation data is "wireless".

In step 520, the field association device 50 calculates the similarity between the first word segmentation data and the second word segmentation data. In one embodiment, the minimum edit distance is chosen and the similarity is calculated from it. However, the present invention is not limited to this.

In one embodiment, the field association device 50 uses the minimum edit distance to implement the similarity. The minimum edit distance is the number of characters that differ between the first word segmentation data and the second word segmentation data. In the Chinese example, when the first word segmentation data is "機械" and the second word segmentation data is "機台", one character differs, so the minimum edit distance is 1. In the English example, when the first word segmentation data is "wire" and the second word segmentation data is "wireless", four characters (English letters) differ, so the minimum edit distance is 4.

In one embodiment, the field association device 50 calculates the similarity from the minimum edit distance. In the Chinese example above, the longest word has two Chinese characters, so the longest string length is 2. Taking 2 as the denominator and the longest string length minus the minimum edit distance (2-1=1) as the numerator gives a similarity of 1/2 (i.e., 50%).

As another Chinese example, when the first word segmentation data is "編號" (number) and the second word segmentation data is also "編號", the longest word has two Chinese characters, so the longest string length is 2 and the denominator is 2. No characters differ, so the minimum edit distance is 0 and the numerator is the longest string length minus the minimum edit distance (2-0=2). The similarity is therefore 2/2 (i.e., 100%).

In the English example above, the longest word has eight English letters, so the longest string length is 8. Taking 8 as the denominator and the longest string length minus the minimum edit distance (8-4=4) as the numerator gives a similarity of 4/8 (i.e., 50%).
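The three worked examples above can be reproduced with a short sketch. It assumes that the minimum edit distance is the standard Levenshtein distance (the description characterizes it only as the number of differing characters), and the function names `min_edit_distance` and `similarity` are illustrative, not taken from the patent.

```python
def min_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """(longest length - minimum edit distance) / longest length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return (longest - min_edit_distance(a, b)) / longest


print(similarity("機械", "機台"))      # 0.5
print(similarity("wire", "wireless"))  # 0.5
print(similarity("編號", "編號"))      # 1.0
```

Dividing by the longer string's length keeps the score between 0 and 1, so it can be compared directly against a percentage threshold.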

In step 530, the field association device 50 determines whether the similarity is greater than a similarity threshold. When the field association device 50 determines that the similarity is not greater than the similarity threshold, the process proceeds to step 550. When the field association device 50 determines that the similarity is greater than the similarity threshold, the process proceeds to step 540.

For example, the similarity threshold can be preset to 80%, meaning that two fields are considered related when their similarity is greater than 80%. In the example above, when the first word segmentation data is "編號" and the second word segmentation data is "編號", the similarity is 100%. Because 100% is greater than the 80% threshold, the first field and the second field are related.

In one embodiment, the field classification device 40 calculates, from the first word segmentation data and the second word segmentation data, a Euclidean distance, a Manhattan distance, a Hamming distance, a Minkowski distance, a cosine similarity, a Jaccard similarity, an edit distance, or a Pearson correlation coefficient to generate the similarity.
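As a rough illustration of three of the listed measures, the sketch below applies them at the character level. The description does not say how the word segmentation data is turned into sets or vectors, so treating each string as a set of characters (Jaccard) or a vector of character counts (cosine) is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))


def jaccard_similarity(a: str, b: str) -> float:
    """Shared characters divided by all distinct characters."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty strings count as identical
    return len(set_a & set_b) / len(union)


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the two character-count vectors."""
    count_a, count_b = Counter(a), Counter(b)
    dot = sum(count_a[ch] * count_b[ch] for ch in count_a)
    norm_a = math.sqrt(sum(v * v for v in count_a.values()))
    norm_b = math.sqrt(sum(v * v for v in count_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty string matches nothing
    return dot / (norm_a * norm_b)


print(hamming_distance("機械", "機台"))
print(jaccard_similarity("wire", "wireless"))
print(cosine_similarity("wire", "wireless"))
```

Hamming distance only applies to strings of equal length, which is one reason a system might prefer the edit distance for field names of varying length.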

In step 540, the field association device 50 establishes the association between the first field and the second field. In one embodiment, for example, a flag may be added to the first field and the second field, or the association may be recorded in a separate file.

In this way, the first field is associated with the second field for later use. For example, if the first field records the parameters of a specific experiment and the second field records the results of that experiment, establishing the association between the two fields links the parameters to the results. In other words, establishing associations helps gather related fields together within complex and massive data tables and their field data, and also supports other applications based on data characteristics.
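One way to record an association in a separate file, as step 540 allows, is sketched below. The JSON layout, the key names, and the function names are all assumptions made for illustration; the patent does not specify a file format.

```python
import json
from pathlib import Path


def make_association(first_table, first_field, second_table, second_field, similarity):
    """Build one association record; the keys are an assumed schema."""
    return {
        "first": {"table": first_table, "field": first_field},
        "second": {"table": second_table, "field": second_field},
        "similarity": similarity,
    }


def save_associations(records, path):
    """Write all association records to a UTF-8 JSON file."""
    text = json.dumps(records, ensure_ascii=False, indent=2)
    Path(path).write_text(text, encoding="utf-8")


def load_associations(path):
    """Read association records back for later use."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

A caller could build a record for each related pair of fields, for instance a parameter field in one table and a result field in another, pass the list to `save_associations`, and later read it back with `load_associations`.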

In step 550, the field association device 50 determines whether the similarity has been calculated for every combination of fields in the first data table and the second data table. If so, the process ends. If the similarity has not yet been calculated for every combination of fields in the first data table and the second data table, the process returns to step 510.
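Steps 510 to 550 can be sketched as a loop over every pair of fields drawn from the two tables. This is a minimal sketch under stated assumptions: each field is represented by a single segmented word, the similarity is the edit-distance ratio from the worked examples, and `associate_fields` and its parameters are hypothetical names.

```python
from itertools import product


def min_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """(longest length - minimum edit distance) / longest length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else (longest - min_edit_distance(a, b)) / longest


def associate_fields(first_fields, second_fields, threshold=0.8):
    """Return (first, second, score) for every pair above the threshold."""
    associations = []
    # Steps 510 and 550: visit each field combination exactly once.
    for first, second in product(first_fields, second_fields):
        score = similarity(first, second)                 # step 520
        if score > threshold:                             # step 530
            associations.append((first, second, score))  # step 540
    return associations


print(associate_fields(["編號", "機械"], ["編號", "機台"]))
# [('編號', '編號', 1.0)]
```

With the 80% threshold, "機械" and "機台" (50% similar) are not associated, while the two "編號" fields are.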

In one embodiment, the processor 10 or a user selects database data from a department within an enterprise as the data source: 2 different data tables, 30 fields, and nearly 36,000 data records (the data of a single field may include multiple records). The data must be cleaned and merged for subsequent analysis. The experiment has an experimental group and a control group. The experimental group uses the data analysis system 100 of the present application to analyze the data. In the control group, experts in the field manually check the field categories, field types, and field associations. The evaluation criterion is the time required for each item. The results are shown in Table 1:

Table 1

Item: Analyze field type
Control group: Experts in the field manually inspect the content of the field data and determine the data type of each field, taking 198 seconds in total.
Experimental group: The data analysis method and data analysis system proposed by the present invention take 15 seconds.

Item: Analyze field category
Control group: Experts in the field manually label the fields. Determining the field category takes about 10~15 seconds per field; 30 fields are labeled, taking 300~450 seconds in total.
Experimental group: The data analysis method and data analysis system proposed by the present invention take 0.3 seconds, with an accuracy of 95.3%. (To confirm the correctness of the automatic analysis, the automatically determined field categories were compared with the manually determined field categories to obtain this accuracy.)

Item: Analyze field association
Control group: Experts in the field manually determine whether fields in multiple data tables are related, taking 165 seconds in total.
Experimental group: With the data analysis method and data analysis system proposed by the present invention, each pairwise field comparison takes 0.2 seconds.

In all three items, the experimental group takes far less time than the control group. The data analysis method and data analysis system proposed by the present invention therefore improve the efficiency of data analysis for large amounts of data and can analyze massive and complex data in real time.

The data analysis method and data analysis system proposed by the present invention establish an automated mechanism at the data preprocessing stage: by analyzing information such as field categories, field types, and associations, they generate a field data description file that helps users understand the data quickly. This reduces the manpower required at the data preprocessing stage and improves the efficiency of data analysis at that stage.

The steps of the methods and algorithms disclosed in this specification may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module (including executable instructions and related data) and other data may be stored in data memory, such as random access memory (RAM), flash memory, read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, a hard disk, a portable hard disk, a compact disc read-only memory (CD-ROM), a DVD, or any other computer-readable storage medium format known in the art. A storage medium may be coupled to a machine such as a computer/processor (referred to as a processor in this specification for convenience), so that the processor can read information (such as program code) from, and write information to, the storage medium. A storage medium may be integrated with a processor. An application-specific integrated circuit (ASIC) includes the processor and the storage medium. A user equipment includes an application-specific integrated circuit. In other words, the processor and the storage medium are included in the user equipment without being directly connected to the user equipment. Furthermore, in some embodiments, any suitable computer program product includes a readable storage medium, and the readable storage medium includes program code associated with one or more of the disclosed embodiments. In some embodiments, the computer program product may include packaging material.

The paragraphs above describe various aspects. Clearly, the teachings herein can be implemented in many ways, and any specific architecture or function disclosed in the examples is merely representative. Based on the teachings herein, anyone skilled in the art should understand that each aspect disclosed herein may be implemented independently, or that two or more aspects may be implemented in combination.

Although the present disclosure has been described above by way of embodiments, they are not intended to limit the present disclosure. Anyone skilled in the art may make minor changes and refinements without departing from the spirit and scope of the present disclosure. The scope of protection of the invention is therefore defined by the appended claims.

100: Data analysis system
10: Processor
20: Storage device
30: Field type analysis device
40: Field classification device
50: Field association device
200: Data analysis method
300: Field type analysis method
400: Field classification method
500: Field association method
210~243, 310~380, 410~430, 510~550: Steps

FIG. 1 is a block diagram of a data analysis system according to an embodiment of the present invention. FIG. 2 is a schematic diagram of a data analysis method according to an embodiment of the present invention. FIGS. 3A-3B are flowcharts of a field type analysis method according to an embodiment of the present invention. FIG. 4 is a flowchart of a field classification method according to an embodiment of the present invention. FIG. 5 is a flowchart of a field association method according to an embodiment of the present invention.

200: Data analysis method
210~250: Steps

Claims

A data analysis system, comprising: a processor for obtaining at least one data table, the data table including a plurality of fields, each of which stores field data; a storage device for storing the data table; a field type analysis device for analyzing a field type according to the field data; a field classification device for determining a field category of each of the fields; and a field association device for calculating a similarity between the fields across data tables, and determining a correlation between the fields according to the similarity; wherein the processor generates a field data description file according to the field types, the field categories, and the correlations, and the processor determines whether the field data description file is abnormal; wherein the field classification device calculates, according to first word segmentation data and second word segmentation data, a Euclidean distance, a Manhattan distance, a Hamming distance, a Minkowski distance, a cosine similarity, a Jaccard similarity, an edit distance, or a Pearson correlation coefficient to generate the similarity.

The data analysis system of claim 1, wherein when the processor generates the field data description file and determines whether the field data description file is abnormal, a display shows whether the field data description file is abnormal.

The data analysis system of claim 1, wherein the field data description file is determined to be abnormal when the field data description file is incomplete or the field data description file contains errors.

The data analysis system of claim 1, wherein when the processor determines that the field data description file is abnormal, it automatically corrects the content of the field data description file.

The data analysis system of claim 1, wherein the automatic correction includes adding or updating a field data description (description), adding or updating a field data group number (group), adding or updating whether the field allows null values (nullable), adding or updating upper and lower bounds of the field data (value range), setting whether abnormal data may be ignored, and/or adding or updating a relation field in the same data table.

The data analysis system of claim 1, wherein the field type analysis device determines whether the field type is a numeric field type; if the field type analysis device determines that the field type is the numeric field type, the field type analysis device determines whether the field data is a numerical value; if the field type analysis device determines that the field data is a numerical value, the field type analysis device confirms that the field type in the field data description file is the numeric field type; and if the field type analysis device determines that the field data is not a numerical value, the field type analysis device modifies the field type to a non-numeric field type.

The data analysis system of claim 1, wherein the field type analysis device determines whether the field type is a numeric field type; if the field type analysis device determines that the field type is not the numeric field type, the field type analysis device determines whether the field data is a numerical value; and if the field type analysis device determines that the field data is a numerical value, the field type analysis device modifies the field type in the field data description file to the numeric field type.

The data analysis system of claim 5, wherein if the field type analysis device determines that the field data is not a numerical value, the field type analysis device determines whether the field data is a plurality of time data; and if the field type analysis device determines that the field data is the time data, the field type analysis device corrects the field type in the field data description file to a time field type.

The data analysis system of claim 8, wherein if the field type analysis device determines that the field data is not the time data, the field type analysis device determines whether the field data is text data or Boolean value data; and if the field type analysis device determines that the field data is the text data or the Boolean value data, the field type analysis device corrects the field type in the field data description file to a text type or a Boolean value type, corresponding to the field data.

The data analysis system of claim 1, wherein the field classification device performs word segmentation on each of the field data, converts each of the segmented words into a word feature, and inputs the word features into a classification model, and the classification model outputs the field category according to the word features.

The data analysis system of claim 1, wherein the processor obtains a plurality of data tables; the field association device selects any two of the different data tables as a first data table and a second data table, selects a first field from the first data table, and selects a second field from the second data table, the first field including the first word segmentation data and the second field including the second word segmentation data; the field association device generates a similarity between the first word segmentation data and the second word segmentation data; and when the field association device determines that the similarity is greater than a similarity threshold, the correlation between the first field and the second field is established.

The data analysis system of claim 11, wherein the similarity is generated by calculating a minimum edit distance between the first word segmentation data and the second word segmentation data and calculating the similarity from the minimum edit distance.

The data analysis system of claim 9, wherein the field classification device uses a decision tree algorithm, a Bayes classification algorithm, a k-nearest neighbors algorithm, or a support vector machine algorithm to determine the field category of each of the fields.

A data analysis method, comprising: obtaining a data table, the data table including a plurality of fields, each of which stores field data; analyzing a field type according to the field data; determining a field category of each of the fields; calculating a similarity between the fields across data tables, and determining a correlation between the fields according to the similarity; generating a field data description file according to the field types, the field categories, and the correlations, and then determining whether the field data description file is abnormal; and calculating, according to first word segmentation data and second word segmentation data, a Euclidean distance, a Manhattan distance, a Hamming distance, a Minkowski distance, a cosine similarity, a Jaccard similarity, an edit distance, or a Pearson correlation coefficient to generate the similarity.