TWM660529U - System for screening learning sample data and removing abnormal data for machine learning

Info

Publication number: TWM660529U
Application number: TW113206089U
Authority: TW
Inventors: 黃鈺茹; 蘇威智
Original assignee: 神耀科技股份有限公司
Priority date: 2024-06-11
Filing date: 2024-06-11
Publication date: 2024-09-11

Abstract

A system for screening learning sample data and filtering out abnormal data for machine learning. The system comprises a data analysis device and a machine learning model device. The data analysis device comprises a data receiving module, a feature conversion module, a clustering module, a similarity calculation module, a cluster merging module, an abnormal data filtering module, and a sample generation module. The machine learning model device comprises a sample receiving module and a model training module. The data receiving module is connected to the feature conversion module, the feature conversion module to the clustering module, the clustering module to the similarity calculation module, the similarity calculation module to the cluster merging module, the cluster merging module to the abnormal data filtering module, the abnormal data filtering module to the sample generation module, the sample generation module to the sample receiving module, and the sample receiving module to the model training module. The data analysis device converts raw data into feature data and then, based on a feature space, into transformed data; classifies the transformed data into a plurality of data clusters using a cluster analysis algorithm; calculates a similarity index array for the data clusters; and merges data clusters based on a first threshold. When a similarity index is less than a second threshold and the number of data items in the corresponding cluster is less than the total number of transformed data items divided by the total number of clusters, the corresponding cluster is deleted. Transformed data in the remaining clusters are then selected as sample data according to a specified ratio, or according to a sample quantity and each cluster's share of the data quantity, and are provided to the machine learning model for training. This achieves the technical effect of providing the machine learning model with an appropriate quantity of accurate sample data.

Description

System for screening learning sample data and removing abnormal data for machine learning

A system for data screening and abnormal data filtering, and in particular a system for data screening and abnormal data filtering that provides a machine learning model with an appropriate quantity of accurate sample data for training.

Machine learning generally works by building a machine learning model and supplying sample data from which the machine learns. The accuracy and effectiveness of a machine learning model are often affected by the quantity and precision of the sample data. Providing an appropriate quantity of accurate sample data is therefore a development goal for improving the accuracy and effectiveness of machine learning models.

To achieve this goal, the prior art has proposed classifying the data to be processed into a plurality of data clusters, calculating similarity indices between the data centers of different data clusters, setting a threshold and merging data clusters according to the similarity indices, and then selecting matching data in the data clusters as sample data according to a specified ratio, or according to a sample quantity and each cluster's share of the data quantity. However, the prior art overlooks the fact that if a data cluster is itself abnormal, the accuracy of the sample data finally provided will suffer. The present creation therefore proposes abnormal data filtering to solve this technical problem, which the prior art cannot solve.

In summary, the prior art has long had the problem that existing data cluster merging processes do not consider filtering out data clusters whose data are abnormal, which reduces the accuracy of the sample data. Improved technical means are therefore needed to solve this problem.

In view of the problem in the prior art that existing data cluster merging processes do not consider filtering out data clusters whose data are abnormal, which reduces the accuracy of the sample data, the present creation discloses a system for screening learning sample data and filtering out abnormal data for machine learning, wherein:

The system for screening learning sample data and filtering out abnormal data for machine learning disclosed in the present creation comprises a data analysis device and a machine learning model device. The data analysis device further comprises a data receiving module, a feature conversion module, a clustering module, a similarity calculation module, a cluster merging module, an abnormal data filtering module, and a sample generation module. The machine learning model device further comprises a sample receiving module and a model training module.

The data receiving module of the data analysis device receives raw data. The feature conversion module of the data analysis device uses a feature analysis algorithm to convert the raw data into feature data centered on data features. The clustering module of the data analysis device converts the feature data into transformed data based on a feature space, classifies the transformed data into a plurality of data clusters using a cluster analysis algorithm with a set number of clusters, and assigns each data cluster a corresponding cluster label. The similarity calculation module of the data analysis device calculates the similarity index from the data center of each data cluster to the data center of every other data cluster and arranges these indices by cluster label into a similarity index array; it likewise calculates the similarity index from the data center of a merged cluster to the data centers of the other data clusters and arranges these by cluster label into a merged similarity index array. The cluster merging module of the data analysis device repeatedly selects the largest similarity index in the similarity index array, provided that the selected index is greater than a preset first threshold, and merges the transformed data of the two data clusters corresponding to the selected index into a merged cluster, until all similarity indices in the merged similarity index array are less than the first threshold. The cluster label of the merged cluster is the label of whichever of the two data clusters has more transformed data. When a similarity index in the merged similarity index array is less than a preset second threshold and the number of data items in the corresponding merged cluster is less than the total number of transformed data items divided by the total number of clusters, the abnormal data filtering module of the data analysis device deletes the cluster corresponding to that similarity index from the merged similarity index array, yielding an anomaly-filtered similarity index array. The sample generation module of the data analysis device then, for each merged cluster or data cluster in the anomaly-filtered similarity index array, selects the transformed data closest to that cluster's data center as sample data, according to a specified ratio or according to a sample quantity and the cluster's share of the data quantity.

The sample receiving module of the machine learning model device receives the sample data from the data analysis device, and the model training module of the machine learning model device trains the machine learning model on the sample data.

The system disclosed in the present creation is as described above. The data analysis device converts raw data into feature data and then, based on a feature space, into transformed data; classifies the transformed data into a plurality of data clusters using a cluster analysis algorithm; calculates a similarity index array for the data clusters; and merges data clusters based on a first threshold. When a similarity index is less than a second threshold and the number of data items in the corresponding cluster is less than the total number of transformed data items divided by the total number of clusters, the corresponding cluster is deleted. Matching transformed data in the remaining clusters are selected as sample data according to a specified ratio, or according to a sample quantity and each cluster's share of the data quantity, and are provided to the machine learning model for training.

Through the above technical means, the present creation achieves the technical effect of providing a machine learning model with an appropriate quantity of accurate sample data.

The implementation of the present creation is described in detail below with reference to the drawings and embodiments, so that the way the present creation applies technical means to solve the technical problem and achieve the technical effect can be fully understood and put into practice.

The system for screening learning sample data and filtering out abnormal data for machine learning disclosed in the present creation is described first. Please refer to "Figure 1", which is a system block diagram of the system.

The system disclosed in the present creation comprises a data analysis device 10 and a machine learning model device 20. The data analysis device 10 further comprises a data receiving module 11, a feature conversion module 12, a clustering module 13, a similarity calculation module 14, a cluster merging module 15, an abnormal data filtering module 16, and a sample generation module 17. The machine learning model device 20 further comprises a sample receiving module 21 and a model training module 22.

The data analysis device 10 obtains raw data through the data receiving module 11. The data types of the raw data include, but are not limited to, time series data, categorical data, and image data. These are given only as examples and do not limit the scope of application of the present creation.

The feature conversion module 12 extracts data from the raw data using an extraction window of fixed size, and then uses a feature analysis algorithm to convert the raw data into feature data centered on data features. The feature analysis algorithm may be an integral transform algorithm such as the Fast Fourier Transform (FFT) or the wavelet transform. These are given only as examples and do not limit the scope of application of the present creation.
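As a rough illustration of fixed-window feature extraction, the sketch below cuts a series into non-overlapping windows and takes the magnitude spectrum of each window as its feature vector. It is a minimal sketch under assumptions: the names `dft_magnitudes` and `window_features` are hypothetical, the windows are assumed not to overlap, and a naive discrete Fourier transform stands in for an FFT library routine (the magnitudes are the same, only slower to compute).

```python
import cmath
import math


def dft_magnitudes(window):
    """Magnitude of each DFT coefficient of one window (naive O(n^2) transform)."""
    n = len(window)
    magnitudes = []
    for k in range(n):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(window))
        magnitudes.append(abs(coeff))
    return magnitudes


def window_features(series, window_size):
    """Split the series into non-overlapping fixed-size windows (assumed layout)
    and return one spectral feature vector per complete window."""
    features = []
    for start in range(0, len(series) - window_size + 1, window_size):
        features.append(dft_magnitudes(series[start:start + window_size]))
    return features


# Hypothetical toy signal: two identical windows of four samples each.
signal = [1.0, 0.0, -1.0, 0.0] * 2
feats = window_features(signal, 4)
```

Each entry of `feats` is one window's feature vector; a production pipeline would more likely call an FFT routine from a numerical library.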

The clustering module 13 converts the feature data into transformed data based on a feature space and, with a set number of clusters (for example 50 or 100), classifies the transformed data into a plurality of data clusters using a cluster analysis algorithm. Each data cluster is assigned a corresponding cluster label, for example 0, 1, 2, 3, and so on. These values are given only as examples and do not limit the scope of application of the present creation.
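The clustering step can be pictured with a small k-means loop that returns integer cluster labels starting at 0. This is only a sketch under assumptions: the function name `kmeans` is hypothetical, the initial centers are simply the first `k` points (a simplification, not something the text specifies), and the toy points are invented for illustration.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm. Returns (labels, centers); labels are 0..k-1.
    Initial centers are the first k points, which is an assumed simplification."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point takes the label of its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical transformed data: two well-separated groups.
points = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0],
          [10.0, 11.0], [1.0, 0.0], [11.0, 10.0]]
labels, centers = kmeans(points, k=2)
```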

The cluster analysis algorithm may be, for example, k-means clustering or quality threshold clustering. These are given only as examples and do not limit the scope of application of the present creation.

Next, the similarity calculation module 14 calculates the similarity index (which may also be a homogeneity index) from the data center of each data cluster to the data center of every other data cluster, and arranges these indices by cluster label into a similarity index array. The measures that may be used for the similarity index include Euclidean distance, Manhattan distance, Mahalanobis distance, cosine similarity, and Hamming distance. These are given only as examples and do not limit the scope of application of the present creation.
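Three of the listed measures are easy to write down directly, and the sketch below also builds a label-keyed table of center-to-center values. It is illustrative only: the function names and the dictionary layout of `similarity_array` are assumptions, and the example centers are invented. Note that for the distance measures a smaller value means the centers are closer, whereas for cosine similarity a larger value does.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def similarity_array(centers, metric=cosine_similarity):
    """Map each ordered label pair (i, j), i != j, to the metric value
    between the two cluster centers (assumed data layout)."""
    return {(i, j): metric(ci, cj)
            for i, ci in centers.items()
            for j, cj in centers.items() if i != j}


# Hypothetical cluster centers keyed by cluster label.
centers = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [2.0, 0.0]}
sims = similarity_array(centers)
```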

Next, the cluster merging module 15 selects the largest similarity index in the similarity index array, provided that the selected index is greater than a preset first threshold, and merges the transformed data of the two data clusters corresponding to the selected index into a merged cluster. The cluster label of the merged cluster is the label of whichever of the two data clusters has more transformed data. The similarity calculation module 14 then calculates the similarity index from the data center of the merged cluster to the data centers of the other data clusters and arranges these by cluster label into a merged similarity index array. The similarity calculation module 14 and the cluster merging module 15 are used repeatedly to merge clusters and to recalculate and regenerate the merged similarity index array, until all similarity indices in the merged similarity index array are less than the first threshold.

Specifically, suppose the preset first threshold is "0.85", the data cluster with cluster label "2" contains "55" transformed data items, and the data cluster with cluster label "9" contains "70". The cluster merging module 15 selects the largest similarity index in the similarity index array, "0.95", which corresponds to the data clusters labeled "2" and "9". Because "0.95" is greater than the first threshold "0.85", the data cluster labeled "2" is merged into the data cluster labeled "9": the merged cluster takes the label "9" and contains "125" transformed data items. The similarity calculation module 14 then calculates the similarity index from the data center of the merged cluster to the data centers of the other, unselected data clusters and arranges these by cluster label into the merged similarity index array.
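The merge loop described above can be sketched as follows. This is a minimal sketch, not the creation's implementation: the names `center`, `cosine` and `merge_clusters` are hypothetical, cosine similarity and a simple-average center are assumed choices, the tie-break for equally sized clusters is an assumption, and the toy points are invented so that the cluster sizes match the 55 and 70 of the worked example.

```python
import math


def center(points):
    """Simple-average center of a list of equal-length vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def merge_clusters(clusters, first_threshold):
    """Greedily merge the most similar pair of clusters while its similarity
    exceeds first_threshold. clusters maps label -> list of points."""
    clusters = {label: list(pts) for label, pts in clusters.items()}
    while len(clusters) > 1:
        centers = {label: center(pts) for label, pts in clusters.items()}
        labels = sorted(centers)
        best_pair, best_sim = None, -math.inf
        for idx, a in enumerate(labels):
            for b in labels[idx + 1:]:
                sim = cosine(centers[a], centers[b])
                if sim > best_sim:
                    best_pair, best_sim = (a, b), sim
        if best_sim <= first_threshold:
            break  # no remaining pair is similar enough to merge
        a, b = best_pair
        # The merged cluster keeps the label of the larger cluster
        # (ties keep the lower label, which is an assumption).
        keep, drop = (a, b) if len(clusters[a]) >= len(clusters[b]) else (b, a)
        clusters[keep].extend(clusters.pop(drop))
    return clusters


# Hypothetical data: labels 2 and 9 point in nearly the same direction.
data = {2: [[1.0, 0.0]] * 55, 9: [[1.0, 0.1]] * 70, 5: [[0.0, 1.0]] * 10}
merged = merge_clusters(data, first_threshold=0.85)
```

With this toy data, labels "2" and "9" merge into a 125-item cluster labeled "9", while label "5" stays separate because its similarity to the merged center is far below the threshold.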

Next, when a similarity index in the merged similarity index array is less than a preset second threshold and the number of data items in the cluster corresponding to that index is less than the total number of transformed data items divided by the total number of clusters, the abnormal data filtering module 16 deletes the cluster corresponding to that similarity index from the merged similarity index array, yielding the anomaly-filtered similarity index array.

Specifically, suppose that in the merged similarity index array the similarity index from cluster label "11" to cluster label "6" is "0.16" and the similarity index from cluster label "11" to cluster label "12" is "0.12"; the total number of transformed data items is "300"; the total number of clusters is "30"; the cluster labeled "11" contains "3" data items; and the preset second threshold is "0.25". Both "0.16" and "0.12" are less than the second threshold "0.25", and the "3" data items in the cluster labeled "11" are fewer than "10" (that is, 300/30, the total number of transformed data items divided by the total number of clusters). The abnormal data filtering module 16 therefore deletes the cluster labeled "11" from the merged similarity index array, yielding the anomaly-filtered similarity index array.
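The two-part deletion rule can be checked against the worked numbers with a short sketch. Several things here are assumptions: the function name `filter_anomalous` and the pair-keyed dictionary layout are hypothetical; the rule is read as "every similarity index involving the cluster is below the second threshold", which matches the example but is one possible interpretation; and the entry for the pair (6, 12) and the sizes of clusters 6 and 12 are invented to complete the example.

```python
def filter_anomalous(sims, counts, total_records, total_clusters, second_threshold):
    """Return (removed_labels, remaining_sims).

    A cluster is removed when all of its similarity indices are below
    second_threshold (assumed reading) and its size is below
    total_records / total_clusters.
    """
    size_limit = total_records / total_clusters
    removed = set()
    for label, n in counts.items():
        own = [s for pair, s in sims.items() if label in pair]
        if own and all(s < second_threshold for s in own) and n < size_limit:
            removed.add(label)
    remaining = {pair: s for pair, s in sims.items()
                 if not (set(pair) & removed)}
    return removed, remaining


# 0.16 and 0.12 come from the worked example; 0.40 is a hypothetical value.
sims = {(11, 6): 0.16, (11, 12): 0.12, (6, 12): 0.40}
# Cluster 11 has 3 items as in the example; the other sizes are hypothetical.
counts = {11: 3, 6: 150, 12: 147}
removed, remaining = filter_anomalous(
    sims, counts, total_records=300, total_clusters=30, second_threshold=0.25)
```

Here the size limit is 300/30 = 10, so only the cluster labeled "11" satisfies both conditions and is dropped from the array.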

Next, for each merged cluster or data cluster in the anomaly-filtered similarity index array, the sample generation module 17 selects the transformed data closest to that cluster's data center as sample data, according to a specified ratio or according to a sample quantity and the cluster's share of the data quantity.

Specifically, suppose that only cluster labels "6", "7", and "9" remain in the anomaly-filtered similarity index array, that the merged cluster labeled "6" contains "303" transformed data items, that the data cluster labeled "7" contains "26", and that the data cluster labeled "9" contains "314".

With a specified ratio of "70%", the 212 transformed data items closest to the data center of the merged cluster labeled "6" are selected (70% of 303 is 212.1, rounded down to 212), the 18 transformed data items closest to the data center of the data cluster labeled "7" are selected (70% of 26 is 18.2, rounded down to 18), and the 219 transformed data items closest to the data center of the merged cluster labeled "9" are selected (70% of 314 is 219.8, rounded down to 219). All of the selected transformed data items constitute the sample data.
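The ratio-based selection can be reproduced in a few lines. This is a sketch under assumptions: the helper names are hypothetical, the ratio is passed as a whole-number percentage so that integer arithmetic gives the rounded-down counts of the worked example, Euclidean distance is assumed for "closest", and the one-dimensional points are invented purely to give each cluster the right size.

```python
import math


def nearest_to_center(points, center, k):
    """The k points with the smallest Euclidean distance to the center."""
    return sorted(points, key=lambda p: math.dist(p, center))[:k]


def select_by_ratio(points, center, percent):
    """Keep the given percentage of points (rounded down), nearest first."""
    k = len(points) * percent // 100
    return nearest_to_center(points, center, k)


# Cluster sizes from the worked example; the points themselves are hypothetical.
sizes = {6: 303, 7: 26, 9: 314}
picked = {}
for label, n in sizes.items():
    pts = [[float(i)] for i in range(n)]  # toy 1-D points at 0, 1, 2, ...
    picked[label] = select_by_ratio(pts, [0.0], 70)
selected_counts = {label: len(p) for label, p in picked.items()}
```

`selected_counts` comes out as 212, 18 and 219 for labels "6", "7" and "9", matching the figures above.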

With a sample quantity of "200", the share of the data held by the cluster labeled "6" is calculated as "47%" (that is, 303/(303+26+314)), the share held by the cluster labeled "7" as "4%" (that is, 26/(303+26+314)), and the share held by the cluster labeled "9" as "49%" (that is, 314/(303+26+314)).

Based on the "47%" share of the cluster labeled "6", the 94 transformed data items closest to the data center of that merged cluster are selected (47% of 200 is 94). Based on the "4%" share of the cluster labeled "7", the 8 transformed data items closest to the data center of that data cluster are selected (4% of 200 is 8). Based on the "49%" share of the cluster labeled "9", the 98 transformed data items closest to the data center of that merged cluster are selected (49% of 200 is 98). All of the selected transformed data items constitute the sample data.
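The quota arithmetic for a fixed sample quantity can be sketched as below. The function name `allocate_quotas` is hypothetical, and rounding each share to a whole percentage before multiplying is inferred from the worked numbers (47%, 4%, 49%) rather than stated as a rule; with other cluster sizes the rounded percentages need not sum to exactly 100.

```python
def allocate_quotas(cluster_sizes, sample_size):
    """Per-cluster sample quota: round each cluster's share of the data to a
    whole percentage, then take that percentage of the sample size."""
    total = sum(cluster_sizes.values())
    quotas = {}
    for label, n in cluster_sizes.items():
        percent = round(100 * n / total)   # e.g. 303/643 -> 47
        quotas[label] = sample_size * percent // 100
    return quotas


# Cluster sizes and sample quantity from the worked example.
quotas = allocate_quotas({6: 303, 7: 26, 9: 314}, 200)
```

For these inputs the quotas are 94, 8 and 98, which sum to the requested 200 samples.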

The sample receiving module 21 of the machine learning model device 20 receives the sample data from the data analysis device 10, and the model training module 22 of the machine learning model device 20 then trains the machine learning model on the sample data.

In the first and second embodiments below, the data are classified into data clusters with corresponding cluster labels using k-means clustering, and the data center of each data cluster is calculated from the cluster's transformed data by a weighted average or a simple average. This is given only as an example and does not limit the scope of application of the present creation. With k-means clustering, the similarity index from the data center of each data cluster to the data center of every other data cluster can be calculated using the following formula:

Similarity index = [formula not preserved in this text]

where the four symbols of the formula (also not preserved in this text) denote, in order: each transformed data item of one data cluster; each transformed data item of another data cluster; the data center of the one data cluster; and the data center of the other data cluster.
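The two ways of computing a data center mentioned above, a simple average and a weighted average, can be sketched as follows. The function names are hypothetical, and the text does not say how the weights would be chosen, so the weights in the example are invented.

```python
def simple_center(points):
    """Coordinate-wise arithmetic mean of the cluster's transformed data."""
    return [sum(col) / len(points) for col in zip(*points)]


def weighted_center(points, weights):
    """Coordinate-wise weighted mean; weights are assumed to be positive."""
    total = sum(weights)
    return [sum(w * x for w, x in zip(weights, col)) / total
            for col in zip(*points)]


# Hypothetical two-point cluster and hypothetical weights.
cluster = [[0.0, 0.0], [2.0, 4.0]]
c_simple = simple_center(cluster)
c_weighted = weighted_center(cluster, [1.0, 3.0])
```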

The first embodiment is used below to illustrate the accuracy of model prediction. Please refer to "Figure 2A" and "Figure 2B". "Figure 2A" is a cluster scatter plot of the first embodiment of the present creation without abnormal data clusters filtered out; "Figure 2B" is the corresponding confusion matrix.

300 normal raw data items were clustered as described above, 3 abnormal data items were then randomly placed into the clusters, and a decision tree model was built. 75 normal test data items were then clustered in the same way, and the accuracy of the clusters predicted by the decision tree model was calculated to be 92%. In "Figure 2A", points where the original data 31 and the predicted data 32 overlap are correct predictions, and points where they do not overlap are incorrect predictions. The accuracy of the model prediction can be obtained from the confusion matrix 33 (shown in "Figure 2B").

Please refer to "Figure 3A" and "Figure 3B". "Figure 3A" is a cluster scatter plot of the first embodiment of the present creation with abnormal data clusters filtered out; "Figure 3B" is the corresponding confusion matrix.

300 normal raw data items were clustered as described above and a decision tree model was built. 75 normal test data items were then clustered in the same way, and the accuracy of the clusters predicted by the decision tree model was calculated to be 96%. In "Figure 3A", points where the original data 31 and the predicted data 32 overlap are correct predictions, and points where they do not overlap are incorrect predictions. The accuracy of the model prediction can be obtained from the confusion matrix 33 (shown in "Figure 3B"). Filtering out the abnormal data clusters thus raises the prediction accuracy from 92% to 96%.
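Reading an accuracy figure off a confusion matrix amounts to dividing the diagonal (correct predictions) by the total count. The sketch below uses a hypothetical 3-class matrix, invented only so that it holds 75 test items with 72 on the diagonal; it is not the matrix shown in the figures, and the function name `accuracy` is likewise an assumption.

```python
def accuracy(confusion):
    """Fraction of test items on the diagonal of a square confusion matrix
    (rows: actual cluster, columns: predicted cluster)."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Hypothetical matrix: 75 test items, 72 of them predicted correctly.
hypothetical_matrix = [
    [24, 1, 0],
    [0, 25, 0],
    [1, 1, 23],
]
acc = accuracy(hypothetical_matrix)
```

With 72 of 75 items on the diagonal, `acc` is 0.96, the same kind of figure as the 96% reported above.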

The second embodiment is used below to illustrate the accuracy of model prediction. Please refer to "Figure 4A" and "Figure 4B". "Figure 4A" is a cluster scatter plot of the second embodiment of the present creation without abnormal data clusters filtered out; "Figure 4B" is the corresponding confusion matrix.

300 normal raw data items were clustered as described above, 3 abnormal data items were then randomly placed into the clusters, and a decision tree model was built. 75 normal test data items were then clustered in the same way, and the accuracy of the clusters predicted by the decision tree model was calculated to be 48%. In "Figure 4A", points where the original data 31 and the predicted data 32 overlap are correct predictions, and points where they do not overlap are incorrect predictions. The accuracy of the model prediction can be obtained from the confusion matrix 33 (shown in "Figure 4B").

Please refer to "Figure 5A" and "Figure 5B". "Figure 5A" is a cluster scatter plot of the second embodiment of the present creation with abnormal data clusters filtered out; "Figure 5B" is the corresponding confusion matrix.

300 normal raw data items were clustered as described above and a decision tree model was built. 75 normal test data items were then clustered in the same way, and the accuracy of the clusters predicted by the decision tree model was calculated to be 54.7%. In "Figure 5A", points where the original data 31 and the predicted data 32 overlap are correct predictions, and points where they do not overlap are incorrect predictions. The accuracy of the model prediction can be obtained from the confusion matrix 33 (shown in "Figure 5B"). Filtering out the abnormal data clusters thus raises the prediction accuracy from 48% to 54.7%.

The operation of the present creation is described next. Please refer to "Figure 6A" and "Figure 6B" together, which are flowcharts of the screening of learning sample data and the filtering of abnormal data for machine learning according to the present creation.

First, the data analysis device obtains the original data (step 401). Next, the data analysis device uses a feature analysis algorithm to convert the original data into feature data based on data features (step 402). Next, the data analysis device converts the feature data into transformed data based on a feature space, uses a cluster analysis algorithm to classify the transformed data into multiple data clusters according to a set number of clusters, and sets a corresponding cluster label for each data cluster (step 403). Next, the data analysis device calculates the similarity index from the data center of each data cluster to the data centers of the other data clusters and, according to the cluster labels, generates a similarity index array (step 404). Next, the data analysis device selects the largest similarity index in the similarity index array, provided that the selected similarity index is greater than a preset first threshold, and merges the transformed data of the two data clusters corresponding to the selected similarity index into a merged cluster; the cluster label of the merged cluster is the cluster label of whichever of the two data clusters contains more transformed data (step 405). Next, the data analysis device calculates the similarity index from the data center of the merged cluster to the data centers of the other, unselected data clusters and, according to the cluster labels, generates a merged similarity index array (step 406). Next, the data analysis device repeats the selection of the largest similarity index in the merged similarity index array that is greater than the first threshold, the merging into a merged cluster, and the calculation and generation of the merged similarity index array, until all similarity indexes in the merged similarity index array are less than the first threshold (step 407). Next, when a similarity index in the merged similarity index array is less than a preset second threshold and the number of data items in the merged cluster corresponding to that similarity index is less than the total number of transformed data items divided by the total number of data clusters, the data analysis device deletes the cluster corresponding to that similarity index from the merged similarity index array, which yields an anomaly-filtered similarity index array (step 408). Next, for each merged cluster or data cluster in the anomaly-filtered similarity index array, the data analysis device selects, according to a specified ratio or according to the ratio of the sample quantity to the data quantity, the transformed data close to the data center of that cluster as sample data (step 409). Next, the machine learning model device receives the sample data from the data analysis device (step 410). Finally, the machine learning model device trains the machine learning model on the sample data (step 411).
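The merging loop of steps 404 to 407 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation disclosed in the source: the function names are hypothetical, and the similarity index is assumed to be 1 / (1 + Euclidean distance between data centers), because the source does not reproduce its formula. The sketch recomputes the whole similarity array after every merge, which gives the same values for the untouched clusters as updating only the merged cluster's entries.

```python
import numpy as np

def cluster_centers(X, labels):
    """Return the sorted cluster labels and the mean (data center) of each cluster."""
    ids = np.unique(labels)
    return ids, np.stack([X[labels == i].mean(axis=0) for i in ids])

def similarity_matrix(centers):
    """Pairwise similarity of data centers; 1 / (1 + distance) is an assumed form."""
    dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    sim = 1.0 / (1.0 + dist)
    np.fill_diagonal(sim, -np.inf)  # a cluster is never merged with itself
    return sim

def merge_clusters(X, labels, first_threshold):
    """Merge the most similar pair of clusters while its index exceeds the threshold."""
    labels = np.asarray(labels).copy()
    while True:
        ids, centers = cluster_centers(X, labels)
        if len(ids) < 2:
            return labels
        sim = similarity_matrix(centers)
        a, b = np.unravel_index(np.argmax(sim), sim.shape)
        if sim[a, b] <= first_threshold:
            return labels
        # The merged cluster keeps the label of the cluster holding more data.
        size_a, size_b = np.sum(labels == ids[a]), np.sum(labels == ids[b])
        keep, drop = (ids[a], ids[b]) if size_a >= size_b else (ids[b], ids[a])
        labels[labels == drop] = keep

# Usage: clusters 0 and 1 lie close together, cluster 2 lies far away.
X = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.5, 0], [0.6, 0], [10, 10], [10.1, 10]])
print(merge_clusters(X, [0, 0, 0, 1, 1, 2, 2], first_threshold=0.5))  # [0 0 0 0 0 2 2]
```

In this example cluster 1 is absorbed into cluster 0 because cluster 0 holds more data items, while cluster 2 stays separate because its similarity to the merged cluster is below the first threshold.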

The data analysis device 10 and the machine learning model device 20 are different forms of a computing device. They are described here by way of example only and do not limit the scope of application of the present utility model. Refer to FIG. 7, which is a schematic diagram of the components of the computing device of the present utility model.

The computing device of the present utility model includes, but is not limited to, hardware components such as one or more processors 501, one or more memory modules 502, and a bus 503, where the bus 503 can connect the different hardware components. With these hardware components, the computing device can load and run an operating system and can also execute software or programs. The computing device also includes a housing 509, inside which the hardware components described above are arranged.

The bus 503 of the computing device of the present utility model may include one or more types of bus, for example a data bus, an address bus, a control bus, an expansion bus, and/or a local bus. The buses of the computing device include, but are not limited to, parallel buses such as the Industry Standard Architecture (ISA) bus, the Peripheral Component Interconnect (PCI) bus, and the Video Electronics Standards Association (VESA) local bus, as well as serial buses such as the Universal Serial Bus (USB) and the PCI Express (PCI-E) bus.

The processor 501 of the computing device of the present utility model is coupled to the bus 503. The processor 501 includes a register set or register space, which may be located entirely on the processing chip, or entirely or partly outside the processing chip and coupled to the processor through a dedicated electrical connection and/or through the bus. The processor 501 may be a processing unit, a microprocessor, or any suitable processing element. If the computing device is a multiprocessor device, that is, if it includes multiple processors, the processors are identical or similar and are coupled to and communicate with one another through the bus. The processor 501 can interpret a sequence of instructions to perform specific calculations or operations, such as mathematical operations, logical operations, data comparison, and copying or moving data, in order to run an operating system or execute various programs, modules, and/or components.

The processor 501 of the computing device may be coupled to a chipset directly or electrically connected to the chipset through the bus 503. The chipset consists of one or more integrated circuits (ICs) and includes a memory controller and a peripheral input/output (I/O) controller; the two controllers may be contained in a single integrated circuit or implemented with two or more integrated circuits. The chipset typically provides I/O and memory management functions, together with a number of general-purpose and/or special-purpose registers and timers, which can be accessed or used by the one or more processors coupled or electrically connected to the chipset.

The processor 501 of the computing device can also access, through the memory controller, data in the memory module 502 installed in the computing device and data in a mass storage area. The memory module 502 includes any type of volatile memory and/or non-volatile memory (NVRAM), such as static random access memory (SRAM), dynamic random access memory (DRAM), flash memory, and read-only memory (ROM). The mass storage area may include any type of storage device or storage medium, such as a hard disk drive, an optical disc, a USB flash drive (flash memory), a memory card, a solid-state drive (SSD), or any other storage device. In other words, the memory controller can access data in static random access memory, dynamic random access memory, flash memory, hard disk drives, and solid-state drives.

The processor 501 of the computing device can also connect to and communicate with peripheral devices or interfaces, such as peripheral output devices, peripheral input devices, communication interfaces, and a GPS receiver, through the peripheral I/O controller and the bus 503. A peripheral input device may be any type of input device, such as a keyboard, mouse, trackball, touchpad, or joystick. A peripheral output device may be any type of output device, such as a display or printer. A peripheral input device and a peripheral output device may also be the same device, such as a touch screen. The communication interface may include a wireless communication interface and/or a wired communication interface. The wireless communication interface may include an interface that supports wireless local area networks such as Wi-Fi and Zigbee, Bluetooth, infrared, near-field communication (NFC), mobile communication networks such as 3G/4G/5G, or other wireless data transmission protocols. The wired communication interface may be an Ethernet device, an asynchronous transfer mode (ATM) device, a DSL modem, a cable modem, or the like. The processor 501 may periodically poll the peripheral devices and interfaces, so that the computing device can input and output data through them and can also communicate with another computing device that has the components described above.

The modules of the data analysis device 10 and the machine learning model device 20 are typically produced when the processor 501 of the respective computing device executes a specific program loaded into the memory module 502, or they are contained in the processor 501.
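As an illustration of modules realized as programs executed by the processor, the sketch below shows how the feature conversion module 12 and the clustering module 13 might look as plain functions. It is a hedged example only: the function names and the initialization rule are hypothetical, the FFT magnitude spectrum is used directly as the transformed data (the source's feature-space conversion is not specified further), and the clustering is an ordinary k-means loop rather than any particular library routine.

```python
import numpy as np

def feature_conversion(raw):
    """Module 12 (sketch): magnitude spectrum of each raw signal via the FFT."""
    return np.abs(np.fft.rfft(raw, axis=1))

def clustering(features, k, iters=20):
    """Module 13 (sketch): plain k-means; returns one cluster label per row."""
    # Assumed initialization: k rows spread evenly through the data set.
    start = np.linspace(0, len(features) - 1, k).astype(int)
    centers = features[start].copy()
    for _ in range(iters):
        dist = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave an empty cluster's center unchanged
                centers[j] = features[labels == j].mean(axis=0)
    return labels

# Usage: four sine waves at 5 Hz and four at 20 Hz, each with a different phase.
t = np.arange(64) / 64.0
raw = np.stack([np.sin(2 * np.pi * f * t + p)
                for f in (5, 20) for p in (0.0, 0.3, 0.6, 0.9)])
labels = clustering(feature_conversion(raw), k=2)
print(labels)  # [0 0 0 0 1 1 1 1]
```

The magnitude spectrum discards the phase, so signals of the same frequency map to nearly identical feature vectors and fall into the same cluster.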

In summary, the data analysis device converts the original data into feature data and then, based on a feature space, into transformed data. It uses a cluster analysis algorithm to classify the transformed data into multiple data clusters, calculates the similarity index array of the data clusters, and merges data clusters based on a first threshold. When a similarity index is less than a second threshold and the number of data items in the corresponding cluster is less than the total number of transformed data items divided by the total number of data clusters, the corresponding cluster is deleted. The matching transformed data in the remaining data clusters is then selected as sample data according to a specified ratio or according to the ratio of the sample quantity to the data quantity, and the sample data is provided to the machine learning model for training.
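The abnormal-cluster removal and sample selection summarized above can be sketched as follows. The code is an interpretation under explicit assumptions rather than the disclosed implementation: the function names are hypothetical, the similarity index is assumed to be 1 / (1 + Euclidean distance between data centers), a cluster is treated as isolated when its largest similarity to any other cluster is below the second threshold, and the total number of clusters is taken to be the number of clusters present when the filter runs.

```python
import numpy as np

def remove_abnormal_clusters(X, labels, second_threshold):
    """Drop clusters that are both isolated (low similarity) and small."""
    labels = np.asarray(labels)
    ids = np.unique(labels)
    centers = np.stack([X[labels == i].mean(axis=0) for i in ids])
    dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    sim = 1.0 / (1.0 + dist)        # assumed form of the similarity index
    np.fill_diagonal(sim, -np.inf)  # ignore a cluster's similarity to itself
    size_limit = len(X) / len(ids)  # total data items / total number of clusters
    keep = [
        i for k, i in enumerate(ids)
        if not (sim[k].max() < second_threshold and np.sum(labels == i) < size_limit)
    ]
    mask = np.isin(labels, keep)
    return X[mask], labels[mask]

def select_samples(X, labels, ratio):
    """Per cluster, pick the given fraction of items closest to the data center."""
    chosen = []
    for i in np.unique(labels):
        idx = np.flatnonzero(labels == i)
        d = np.linalg.norm(X[idx] - X[idx].mean(axis=0), axis=1)
        n = max(1, int(np.ceil(ratio * len(idx))))
        chosen.extend(idx[np.argsort(d)[:n]])
    return np.sort(np.array(chosen))

# Usage: two clusters of four points and one isolated single-point cluster.
X = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.4, 0.4],
              [1, 0], [1.1, 0], [1, 0.1], [1.4, 0.4],
              [10, 10]])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2])
X_kept, labels_kept = remove_abnormal_clusters(X, labels, second_threshold=0.2)
print(labels_kept)                               # [0 0 0 0 1 1 1 1]
print(select_samples(X_kept, labels_kept, 0.5))  # [1 2 5 6]
```

Cluster 2 is removed because it is both far from every other cluster and smaller than the average cluster size. The sample selection then keeps the half of each remaining cluster that lies nearest its data center.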

This technical means solves the problem in the prior art that existing data cluster merging processes do not consider removing data clusters whose data is abnormal, which reduces the accuracy of the sample data. It thereby achieves the technical effect of providing the machine learning model with an appropriate quantity of accurate sample data.

Although the embodiments of the present utility model are disclosed above, the content described is not intended to directly limit the scope of patent protection of the present utility model. Any person having ordinary knowledge in the technical field to which the present utility model belongs may make minor changes to the form and details of implementation without departing from the spirit and scope disclosed herein. The scope of patent protection of the present utility model remains defined by the appended claims.

10: Data analysis device
11: Data receiving module
12: Feature conversion module
13: Clustering module
14: Similarity calculation module
15: Cluster merging module
16: Abnormal data filtering module
17: Sample generation module
20: Machine learning model device
21: Sample receiving module
22: Model training module
31: Original data
32: Prediction data
33: Confusion matrix
401-411: Steps
501: Processor
502: Memory module
503: Bus
509: Housing

FIG. 1 is a system block diagram of the system for screening learning sample data and removing abnormal data for machine learning of the present utility model.
FIG. 2A is a cluster scatter plot of the first embodiment of the present utility model without abnormal data clusters removed.
FIG. 2B is a confusion matrix of the first embodiment of the present utility model without abnormal data clusters removed.
FIG. 3A is a cluster scatter plot of the first embodiment of the present utility model with abnormal data clusters removed.
FIG. 3B is a confusion matrix of the first embodiment of the present utility model with abnormal data clusters removed.
FIG. 4A is a cluster scatter plot of the second embodiment of the present utility model without abnormal data clusters removed.
FIG. 4B is a confusion matrix of the second embodiment of the present utility model without abnormal data clusters removed.
FIG. 5A is a cluster scatter plot of the second embodiment of the present utility model with abnormal data clusters removed.
FIG. 5B is a confusion matrix of the second embodiment of the present utility model with abnormal data clusters removed.
FIG. 6A and FIG. 6B are flowcharts of the screening of learning sample data and removal of abnormal data for machine learning of the present utility model.
FIG. 7 is a schematic diagram of the components of the computing device of the present utility model.

10: Data analysis device

11: Data receiving module

12: Feature conversion module

13: Clustering module

14: Similarity calculation module

15: Cluster merging module

16: Abnormal data filtering module

17: Sample generation module

20: Machine learning model device

21: Sample receiving module

22: Model training module

Claims

A system for screening learning sample data and removing abnormal data for machine learning, comprising:
  a data analysis device, the data analysis device further comprising:
    a data receiving module, which receives original data;
    a feature conversion module, which uses a feature analysis algorithm to convert the original data into feature data based on data features;
    a clustering module, which converts the feature data into transformed data based on a feature space, uses a cluster analysis algorithm to classify the transformed data into a plurality of data clusters according to a set number of clusters, and sets a corresponding cluster label for each data cluster;
    a similarity calculation module, which calculates the similarity index from the data center of each data cluster to the data centers of the other data clusters and generates a similarity index array according to the cluster labels, and which calculates the similarity index from the data center of a merged cluster to the data centers of the other data clusters and generates a merged similarity index array according to the cluster labels;
    a cluster merging module, which repeatedly selects the largest similarity index in the similarity index array, the selected similarity index being greater than a preset first threshold, and merges the transformed data of the two data clusters corresponding to the selected similarity index into the merged cluster, until all similarity indexes in the merged similarity index array are less than the first threshold, wherein the cluster label of the merged cluster is the cluster label of whichever of the two data clusters has more transformed data;
    an abnormal data filtering module, which, when a similarity index in the merged similarity index array is less than a preset second threshold and the number of data items in the merged cluster corresponding to that similarity index is less than the total number of transformed data items divided by the total number of data clusters, deletes the cluster corresponding to that similarity index from the merged similarity index array to obtain an anomaly-filtered similarity index array; and
    a sample generation module, which, for each merged cluster or data cluster in the anomaly-filtered similarity index array, selects the transformed data close to the data center of that cluster as sample data according to a specified ratio or according to the ratio of the sample quantity to the data quantity; and
  a machine learning model device, the machine learning model device further comprising:
    a sample receiving module, which receives the sample data from the data analysis device; and
    a model training module, which trains a machine learning model based on the sample data.

The system for screening learning sample data and removing abnormal data for machine learning as described in claim 1, wherein the feature analysis algorithm includes the integral transform algorithms of fast Fourier transform (FFT) and wavelet transform.

The system for screening learning sample data and removing abnormal data for machine learning as described in claim 1, wherein the cluster analysis algorithm includes K-means clustering and quality threshold clustering.

The system for screening learning sample data and removing abnormal data for machine learning as described in claim 1, wherein the similarity index is measured using Euclidean distance, Manhattan distance, Mahalanobis distance, cosine similarity, and Hamming distance.

The system for screening learning sample data and removing abnormal data for machine learning as described in claim 1, wherein the similarity index is calculated by a formula (the formula and its symbols are not reproduced in this text) whose terms are: each item of transformed data of one data cluster, each item of transformed data of the other data cluster, the data center of the one data cluster, and the data center of the other data cluster.