
TWI678709B - Disease prediction method through a big database formed by data mining of neural network - Google Patents


Info

Publication number
TWI678709B
TWI678709B
Authority
TW
Taiwan
Prior art keywords
data
model
disease
training
neural network
Prior art date
Application number
TW107140554A
Other languages
Chinese (zh)
Other versions
TW202020885A (en)
Inventor
楊智惠
Chih Hui Yang
洪暐傑
Wei Chieh Hung
黃瑞初
Rey Chue Hwang
林義隆
Yi Lung Lin
蕭介夫
Jei Fu Shaw
暐霖 鄭
Wei Leng Chin
郭泓霆
Hung Ting Kuo
Original Assignee
義守大學
I-Shou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 義守大學, I-Shou University filed Critical 義守大學
Priority to TW107140554A priority Critical patent/TWI678709B/en
Application granted granted Critical
Publication of TWI678709B publication Critical patent/TWI678709B/en
Publication of TW202020885A publication Critical patent/TW202020885A/en


Landscapes

  • Medical Treatment And Welfare Office Work (AREA)
  • Image Analysis (AREA)

Abstract

A method that, with the assistance of a neural network, proceeds from data mining to the construction of a big-data database and then performs disease prediction, comprising the following steps. First, disease-related data are input for data conversion: features are extracted from the disease data and encoded to obtain data to be trained. Next, the similarity between features of the data to be trained is computed with the Euclidean distance and the data are clustered to obtain a preliminary model. Finally, disease-related data that have not undergone data conversion are input and clustered by Euclidean distance with the preliminary model to obtain the trained model.

Description

Method of disease prediction through neural-network-assisted data mining and big-data identification

The present invention relates to a disease prediction method, and in particular to a method that, with the assistance of a neural network, proceeds from data mining to the construction of a big-data database and then performs disease prediction.

In recent years, the application of artificial intelligence (AI) to medical care has been extensively researched and discussed; a growing body of studies indicates that AI will play a key role in the future of human healthcare.

The accuracy of a machine-learning model depends heavily on the amount of data used to train it. Machine learning is commonly divided into supervised learning, unsupervised learning, and reinforcement learning. Whichever strategy is adopted, the goal is to find appropriate parameters for every neuron in the network for later use.

Training a machine-learning model involves several technically important stages: data cleaning, feature extraction and feature selection, and model selection. To train the machine, data samples are collected as training data, and the features extracted from the training data help identify the target.

At present, much medical data exists only as text, which cannot be fed directly into a training model as features, yet this text is crucial for predicting disease and treatment. Manually labelling a large volume of medical text record by record wastes human resources and increases the error rate. How to use a neural network to convert large volumes of text directly into features suitable for a training model, with minimal demand on medical manpower, is the direction of current effort.

Features influence the results obtained. Some features are in fact unimportant, and during clustering there are not enough clues to judge each feature's importance, so it is easy to become biased toward certain feature distributions and produce meaningless clusters; moreover, a model that requires too many features is difficult to apply in practice. Selecting genuine, useful influencing parameters to improve the accuracy of neural-network prediction is a goal well worth pursuing, and how to define and compare the predictive accuracy of models is another important subject awaiting development.

An object of the present invention is to provide a method of disease prediction through neural-network-assisted data mining and big-data identification, comprising the following steps.

First, disease-related data are input for data conversion: features are extracted from the disease data and encoded to obtain data to be trained. Next, the similarity between features of the data to be trained is computed with the Euclidean distance and the data are clustered to obtain a preliminary model. Finally, disease-related data that have not undergone data conversion are input and clustered by Euclidean distance with the preliminary model to obtain the trained model.

In another technical aspect of the invention, word-vector conversion is used to map each word in the disease data to a vector; together, the converted words span a vector space (Vector Space) in which the vector distance between words can be computed.

In a further technical aspect of the invention, the features are assigned different proportional weights before the data are clustered.

In yet another technical aspect of the invention, the data clustering grades the data by disease severity.

In another technical aspect of the invention, the method further validates the trained model in three sub-steps: first, the trained model is input; next, 1/4 of the model's total cases are taken for model training and another 1/4 of the total cases for model validation; finally, 1/2 of the total cases are taken for final model training, and the trained model is tested on the other 1/2 of the total cases to confirm the validation accuracy.

In a further technical aspect of the invention, the next step is carried out only after curve fitting is confirmed to be feasible.

In yet another technical aspect of the invention, the cases of the trained model are drawn at random.

Another technical aspect of the invention concerns the method of parameter adjustment during model training, which improves the model's accuracy and stability.

In a further technical aspect of the invention, the validation accuracy is confirmed through 95% confidence intervals.

In yet another technical aspect of the invention, input-layer feature weights are deleted repeatedly until one of the following three conditions occurs, at which point the procedure stops: the summed input-layer feature weights exceed 95% only when all features are included; after feature deletion, the 95% confidence intervals of the model error rates from five random samplings no longer share a common overlapping range; or the 95% confidence intervals of the error rates of the five repeated models do not contain 0. On stopping, the input-layer features that remain are used as the input layer for the final model training.

Another technical aspect of the invention establishes a measure of model efficacy for comparing accuracy between different models, satisfying the relation

model efficacy = SD / (mean × MAPE)

where SD (standard deviation) is the standard deviation of all true values in the training model's database, mean is the average of all true values in the database, and MAPE is the mean absolute percentage error of the prediction model.

The beneficial effect of the present invention is that supervised learning is applied first, then unsupervised learning, and finally neural-network applications such as reinforcement learning are used to predict disease, improving the precision and validity of the trained model.

11-14: steps

FIG. 1 is a flow chart illustrating a preferred embodiment of the method of the present invention for disease prediction through neural-network-assisted data mining and big-data identification.

The patented features and technical content of the present invention will become clear in the following detailed description of a preferred embodiment with reference to the drawing.

Referring to FIG. 1, in a preferred embodiment of the method of the present invention for disease prediction through neural-network-assisted data mining and big-data identification, supervised learning is used first, then unsupervised learning, and finally neural-network applications such as reinforcement learning are applied to predict disease. The method comprises the following steps.

To perform disease prediction with a neural network, the first stage at a medical institution is to convert textual report data into numbers (X-ray and electrocardiogram reports, for example, are both text) for use in building the subsequent prediction model.

First, in step 11, disease-related data are converted with the aid of neural-network training: features are extracted from the disease data and encoded to obtain the data to be trained.

In step 11, word-vector conversion maps each word in the disease data to a vector; all converted words span a vector space (Vector Space) in which the vector distance between words can be computed. For example, by counting how often each word appears in the documents, the probability that two words co-occur can be computed to determine their similarity: the shorter the vector distance between two words, the more similar they are.

The word-vector technique assigns each word a vector that represents the word's relative characteristics along various dimensions, so that the similarity between any word and the others can be measured by their vectors.
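As a minimal, self-contained sketch of the idea, the snippet below builds crude word vectors from document co-occurrence counts for a few invented report fragments; the fragments and the count-based vector scheme are illustrative stand-ins for the trained 100-dimensional embeddings described later.

```python
import numpy as np

# Toy report snippets (hypothetical stand-ins for X-ray report text).
reports = [
    "no bony collapse seen",
    "compression fracture with collapse",
    "vertebral fracture and compression",
    "no fracture identified",
]

# Each word's vector is its count in each document, so words that
# co-occur in similar documents end up close in the vector space.
vocab = sorted({w for r in reports for w in r.split()})
counts = np.array([[r.split().count(w) for r in reports] for w in vocab],
                  dtype=float)
vec = dict(zip(vocab, counts))

def euclidean(a, b):
    # shorter distance = more similar words
    return float(np.linalg.norm(a - b))

d_close = euclidean(vec["fracture"], vec["compression"])
d_far = euclidean(vec["fracture"], vec["no"])
print(d_close < d_far)  # True
```

Words that share documents ("fracture", "compression") end up closer than words that rarely co-occur, which is the similarity signal the clustering later exploits.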

For example, suppose there are 3,000 X-ray reports. If no report exceeds 200 words, each is padded to a fixed length of 200 words; each word is represented by a 100-dimensional vector, so each report amounts to 20,000 features used to decide whether a fracture is present. Next, the 3,000 X-ray reports are manually labelled, e.g. 1 for fracture and 0 for no fracture; reports whose predicted features fully match the fracture class are classified as 1, and the rest as 0.
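The 200-word padding and the 20,000-feature flattening can be sketched as follows; the embedding values here are random placeholders for a trained word-vector table.

```python
import numpy as np

MAX_LEN, DIM = 200, 100  # 200 words x 100 dimensions = 20,000 features

def report_to_features(words, embeddings):
    """Pad or truncate a tokenized report to MAX_LEN words and flatten."""
    mat = np.zeros((MAX_LEN, DIM))
    for i, w in enumerate(words[:MAX_LEN]):
        mat[i] = embeddings.get(w, np.zeros(DIM))  # unknown word -> zeros
    return mat.ravel()

rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=DIM) for w in ["no", "bony", "collapse"]}
x = report_to_features("no bony collapse".split(), embeddings)
print(x.shape)  # (20000,)
```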

Further, in step 11, the data are clustered after assigning different proportional weights to the features. The 20,000 features can be viewed as 20,000 input-layer neurons. For example, the vectors define 100 important signals, each carrying a certain weight; a hidden layer in the middle performs the computation, and the output layer indicates fracture (1,0) or no fracture (0,1). Adding proportional weights at the input layer improves discrimination accuracy.
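A bare-bones sketch of input-layer weighting in a small network follows; all sizes, weight values, and the choice of 100 emphasized positions are illustrative assumptions, not the patent's trained values.

```python
import numpy as np

rng = np.random.default_rng(1)
n_features = 20000

# Hypothetical proportional weights: 100 "important signal" positions get
# full weight at the input layer, the rest a small baseline weight.
w_in = np.full(n_features, 0.1)
w_in[rng.choice(n_features, 100, replace=False)] = 1.0

W1 = rng.normal(scale=0.01, size=(n_features, 16))  # hidden layer
W2 = rng.normal(scale=0.1, size=(16, 2))  # output: [1,0]=fracture, [0,1]=none

def forward(x):
    h = np.tanh((x * w_in) @ W1)   # proportional weighting applied first
    logits = h @ W2
    e = np.exp(logits - logits.max())
    return e / e.sum()             # softmax class probabilities

p = forward(rng.normal(size=n_features))
print(p.shape)  # (2,)
```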

Basically, each item's space vectors differ in importance, and clustering becomes crucial when some of the data are noisy. For example, one report may mention a slight collapse, or arrange its words somewhat differently, yet still read as similar overall, so weights are assigned. Some reports may contain only the word "fracture", but the fracture could be of the hip or another site; for the task of identifying a vertebral fracture, such mentions are related but not as important.

Next, in step 12, the similarity between features of the data to be trained is computed with the Euclidean distance and the data are clustered to obtain the preliminary model; the Euclidean distance measures the semantic similarity between two word vectors, and the clustering grades the data by disease.

In the X-ray reports, specific features are given different proportional weights. Words commonly associated with fracture are "collapse", "compress", and "fracture". For example, the system treats such a word as the most important in the report, and the words near it also become important, because a report contains a great many words and, in the past, some words judged unimportant were given no anchor point; defining the weights first tells the system which words matter, so that a phrase such as "no bony collapse", which is a critical key to recognition, is not ignored. The next step is supervised learning, which means assigning a label (Label) to each word.

Once the features are trained, the data separate into several blocks that distinguish fracture from no fracture. Note that the number of dimensions depends on the length of the physicians' reports: physicians who type reports of different lengths require correspondingly different word-count settings. As an example of clustering into disease grades, the data may be divided into three blocks: normal bone density, osteoporosis, and the intermediate grade, osteopenia.
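A minimal k-means sketch of such three-way grading by Euclidean distance is shown below; the synthetic score values are illustrative only, not clinical thresholds.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic 1-D bone-density scores for three groups.
scores = np.concatenate([rng.normal(0.0, 0.3, 50),    # ~ normal density
                         rng.normal(-1.8, 0.3, 50),   # ~ osteopenia
                         rng.normal(-3.0, 0.3, 50)])  # ~ osteoporosis
X = scores.reshape(-1, 1)

# Initialize the three centroids from low/middle/high percentiles.
centroids = np.percentile(X, [10, 50, 90]).reshape(3, 1)
for _ in range(20):
    dist = np.abs(X - centroids.T)        # Euclidean distance in 1-D
    labels = dist.argmin(axis=1)
    centroids = np.array([[X[labels == k].mean()] for k in range(3)])

print(np.sort(centroids.ravel()))
```

The converged centroids recover roughly the three group centres, i.e. the three disease grades.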

Then, in step 13, disease-related data that have not undergone data conversion are input and clustered by Euclidean distance with the preliminary model to obtain the trained model.

After the system has finished identifying the vectors, the preliminary model is complete, but with only 3,000 reports it is uncertain whether the model is stable. Unsupervised learning is therefore applied here: 30,000 additional disease-related reports without data conversion are input. Since the feature vectors were extracted in the previous step, the 30,000 reports of important vectors are clustered by Euclidean distance against the 3,000 labelled reports, so that the 30,000 are coded automatically, without manual coding, by grouping vectors with the same cluster characteristics. Once all 30,000 are coded, they are combined with the original 3,000 manually labelled reports, and the 33,000 in total produce a new model (Model), which is the model used for subsequent training. This greatly reduces the time clinicians actually spend building the model and is the central point of the invention. To build the training model, the labels must be correct from the start; unsupervised learning improves the efficiency of labelling, and more models are used for training afterwards.
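The auto-coding step above can be sketched as nearest-labelled-centroid propagation; the synthetic vectors and the reduced counts stand in for the 3,000 labelled and 30,000 unlabelled reports.

```python
import numpy as np

rng = np.random.default_rng(3)
dim = 8  # stand-in dimensionality for the report feature vectors

# Small manually labelled set: 0 = no fracture, 1 = fracture.
labeled_x = np.vstack([rng.normal(0, 0.5, (30, dim)),
                       rng.normal(4, 0.5, (30, dim))])
labeled_y = np.array([0] * 30 + [1] * 30)
centroids = np.array([labeled_x[labeled_y == k].mean(axis=0) for k in (0, 1)])

# The large unlabelled set is coded automatically by Euclidean distance
# to the nearest labelled cluster centroid.
unlabeled_x = np.vstack([rng.normal(0, 0.5, (300, dim)),
                         rng.normal(4, 0.5, (300, dim))])
dists = np.linalg.norm(unlabeled_x[:, None, :] - centroids[None], axis=2)
auto_labels = dists.argmin(axis=1)

# Manual plus auto-coded data are combined for the next training round.
all_x = np.vstack([labeled_x, unlabeled_x])
all_y = np.concatenate([labeled_y, auto_labels])
print(all_x.shape)  # (660, 8)
```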

Afterwards, reinforcement learning is used: in clinical work, users must feed data back to tell the system what was right or wrong, continually returning data so that the model is continually corrected. What follows describes how to improve the method and build the model quickly, so as to raise its precision.

Finally, in step 14, the trained model is validated. Here a disease prediction model is built, taking a bone-density prediction model as the example; validation comprises the following three sub-steps.

First, the training model is input; it comprises dual-energy X-ray absorptiometry (DXA) measurements and the database coded by the method of this patent described above. Second, 1/4 of the model's total cases are drawn at random for model training (train), and another random 1/4 of the total cases for model validation (validation); only after curve fitting is confirmed does the next step proceed. Third, 1/2 of the total cases are drawn at random for final model training, and the trained model is then tested (test) on the other 1/2 of the database to confirm the benefit of the validation. In the third step, the validation accuracy is confirmed through 95% confidence intervals. Before model training begins, the correlations among the input-layer factors (a factor is also called a feature) must be verified by examining both linear and non-linear correlations among all factors: if a correlation exceeds 0.8, the two factors should not coexist in the input layer, and only one of them may be put into the model.
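The 0.8 correlation screen on input factors can be sketched as below; the factors and their relationship are invented for illustration, and the rule simply keeps only one of any pair correlating above 0.8.

```python
import numpy as np

rng = np.random.default_rng(4)
age = rng.uniform(40, 90, 500)
weight = rng.uniform(45, 95, 500)
bmi = weight / 2.9 + rng.normal(0, 0.5, 500)  # nearly collinear with weight

factors = {"age": age, "weight": weight, "bmi": bmi}
keep = []
for name, values in factors.items():
    # keep a factor only if it correlates <= 0.8 with every factor kept so far
    if all(abs(np.corrcoef(values, factors[k])[0, 1]) <= 0.8 for k in keep):
        keep.append(name)
print(keep)  # ['age', 'weight']
```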

In an actual implementation, suppose there are 1,000 cases in total. First, 250 (1/4) are taken to begin training the model. Not every database is suitable for neural-network analysis: its samples may be overly extreme, the selected variables unstable, or the important factors affecting the output layer missing. Model training also involves multiple parameter settings, such as the learning rate, the number of hidden layers, and the training method of the network, so it must first be established that the data are suitable for neural-network training under these conditions. Sampling is random from the total database: 250 cases are drawn first and used to train a model, and then 250 (1/4) of the remaining 750 (3/4) cases are drawn at random to verify the model's reliability and validity. To avoid sampling bias, the sampling is repeated five times to train five models, and each model validation yields a mean absolute percentage error (MAPE). For example, under the first model, if the predicted bone density for a case is 0.9 and the actual DXA result is 1, the absolute error is 0.1, which divided by the actual DXA result of 1 gives 10%; averaging the error rates over all cases gives the MAPE.
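The 1/4-plus-1/4 sampling and the MAPE definition can be written out directly; the bone-density numbers repeat the worked example in the text.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error over all cases."""
    return float(np.mean(np.abs(y_pred - y_true) / np.abs(y_true)))

# Worked example from the text: predicted 0.9 vs. measured DXA 1.0 -> 10%.
print(mape(np.array([1.0]), np.array([0.9])))  # ~0.1

# Sampling sketch: from 1000 cases, draw 250 for training and, from the
# remaining 750, another 250 for validation.
rng = np.random.default_rng(5)
idx = rng.permutation(1000)
train_idx, valid_idx = idx[:250], idx[250:500]
print(len(train_idx), len(valid_idx))  # 250 250
```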

The reference true value is the DXA bone-density measurement; comparing it with the value computed by the model gives an error value, and the error value divided by the true value gives the error rate. In the validation stage, the mean error rate and the standard deviation (Standard Deviation, SD) of the 250 error rates can be computed. The error rates are centred on 0, with positive errors on one side and negative errors on the other, and the span of plus and minus the error-rate standard deviation forms the 95% confidence interval of the error rate. The five models built from the five random samplings each have their own mean error rate and error-rate standard deviation. If the 95% confidence intervals of the five models' error rates share a common overlapping range, the model can be judged stable under the relevant conditions (as described in paragraph 0036); this serves to verify the model's reliability. At the same time, the 95% confidence intervals of the error rates of the five repeated models must contain 0, which serves to verify the model's validity.
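The overlap and contains-zero checks can be sketched as follows. The confidence interval here is computed as mean ± 1.96·SD/√n of the error rates, which is one interpretation; the patent text itself describes the interval via the error-rate standard deviation. The error-rate arrays are deterministic stand-ins for the five samplings.

```python
import numpy as np

def ci95(err):
    # 95% CI of the mean error rate (mean +/- 1.96*SD/sqrt(n); an assumption,
    # see the lead-in note).
    m = err.mean()
    half = 1.96 * err.std(ddof=1) / np.sqrt(len(err))
    return m - half, m + half

# Deterministic stand-in error rates for five random samplings.
base = np.linspace(-0.05, 0.05, 250)
cis = [ci95(base + 0.0005 * k) for k in range(5)]

# Reliability: all five intervals share a common overlapping range.
stable = max(lo for lo, _ in cis) <= min(hi for _, hi in cis)
# Validity: each interval contains 0.
valid = all(lo <= 0 <= hi for lo, hi in cis)
print(stable, valid)  # True True
```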

The model's parameters can be tuned during setup by tracking the change in the MAPE or in (mean error × standard deviation of the error). For example, to adjust the learning rate, compare different learning rates and choose the one that minimizes the MAPE or (mean error × standard deviation of the error). The adjustment can proceed with the following values: first test 10<sup>n</sup> for n = +2, +1, 0, −1, −2 as the baseline and pick the best MAPE or (mean error × standard deviation of the error); if the best n is at an extreme (e.g. +2 or −2), extend the test five more steps upward or downward (e.g. +2, +3, +4, +5, +6); then test A×10<sup>n</sup> for A = 4, 3, 2, 1, 0.9, 0.8, 0.7, 0.6, 0.5 to compare and find the best parameter value.
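The coarse-to-fine search can be sketched with a stand-in scoring function; score() here is a toy proxy for "train five models and return the MAPE", and its optimum is invented.

```python
import numpy as np

def score(lr):
    # Toy stand-in for the validation MAPE, minimized near lr = 10**-1.3.
    return (np.log10(lr) + 1.3) ** 2

# Stage 1: baseline powers of ten, n = +2, +1, 0, -1, -2.
coarse = [10.0 ** n for n in (2, 1, 0, -1, -2)]
best = min(coarse, key=score)
n = round(np.log10(best))
# (If the best n were at an edge, the text extends five more powers.)

# Stage 2: refine the mantissa, A = 4, 3, 2, 1, 0.9, ..., 0.5.
fine = [a * 10.0 ** n for a in (4, 3, 2, 1, 0.9, 0.8, 0.7, 0.6, 0.5)]
best_lr = min(fine, key=score)
print(best_lr)  # 0.05
```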

Afterwards, one begins trying to reduce the factors in the model's input layer. When a prediction model is produced by neural-network training, each factor can simultaneously be given a weight according to its influence on the output-layer result; all input-layer factor weights sum to 100%, representing each input-layer factor's proportional importance to the output layer within the model. After the five randomly sampled models have been produced, the five weights of each factor are averaged. In the next step the factors are summed from the highest weight down to the lowest; because the values commonly used in medicine are the 95% confidence interval and an α error of 5%, the summation stops once the cumulative weight exceeds 95%, and all remaining factors are deleted. After the deletion, the second and third steps are repeated to confirm that the 95% confidence intervals of the error rates of the five models trained with the changed input-layer factors still share a common overlapping range, confirming the stability and reliability of the reduced model. If the model is confirmed suitable, this step is repeated: new weights are computed and the input-layer factors are again reduced by deletion.

The factor-deletion procedure based on input-layer factor weights is repeated until one of the following three conditions occurs, at which point it stops: first, the summed input-layer factor weights exceed 95% only when all factors are included; second, after deletion, the 95% confidence intervals of the model error rates from the five random samplings no longer share a common overlapping range; third, the 95% confidence intervals of the error rates of the five repeated models do not contain 0. On stopping, the input-layer factors that remain are used as the input layer of the final prediction model.
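One pass of the cumulative-95% weight rule can be sketched as follows; the ten factor weights are randomly generated placeholders for the averaged weights of five trained models.

```python
import numpy as np

rng = np.random.default_rng(7)
names = [f"factor_{i}" for i in range(10)]
# Average the input-layer weight of each factor over five sampled models
# (here simulated directly as five weight vectors summing to 1).
w = rng.dirichlet(np.full(10, 0.3), size=5).mean(axis=0)

order = np.argsort(w)[::-1]               # highest weight first
cum = np.cumsum(w[order])
k = int(np.searchsorted(cum, 0.95)) + 1   # factors needed to exceed 95%
kept = [names[i] for i in order[:k]]      # the remaining factors are deleted
print(len(kept))
```

After such a pass, the five-model stability and validity checks are repeated before another round of deletion.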

Finally, the final model is built with this set of input-layer factors and all the related parameter settings. Under these conditions, 500 cases (1/2) of the full database are drawn at random for model training, and the trained model is then tested (test) on the remaining 500 cases (1/2).

The 95% confidence interval of the final model's error rate at test time must contain 0, indicating that the model has sufficient validity. If it does not contain 0, an outright absolute error exists between the prediction model and the actual measurements: in medicine, a 95% confidence interval that does not contain 0 indicates a significant difference, meaning the prediction and the actual measurement behave as two unrelated, independent events, and such a prediction model is unsuitable for clinical use.

The model's validity can be expressed by the MAPE value: a MAPE of 10% means that most results computed by the model will fall within ±10% of the actual measurements. The benefit of building the model, and a means of comparing predictive performance between different models, is given by the model-efficacy formula

model efficacy = SD / (mean × MAPE)

where SD (standard deviation) is the standard deviation of all true values (the DXA measurements) in the database, mean is the average of all true values in the database, and MAPE is the mean absolute percentage error of the prediction model. The larger the value, the better the model's predictive performance. The formula expresses how much better the prediction model is than blindly guessing the database average. For instance, if a group's DXA values average 1 with a standard deviation of 0.2, then for an unexamined case we could simply guess a bone density of 1, and most cases would indeed fall within about 1 ± 0.2; comparing the model's MAPE with this blind-guess figure shows by how much the prediction model outperforms blind guessing, and different models can be ranked by the same formula.
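As a numerical illustration, taking the efficacy formula as SD / (mean × MAPE), an interpretation of the formula image that is consistent with the definitions and the blind-guess example in the text:

```python
def model_efficacy(sd, mean, mape):
    # Larger is better: the model beats a blind guess of the database mean
    # (whose typical error is about SD/mean) by roughly this factor.
    return sd / (mean * mape)

# Blind-guess example from the text: DXA mean 1, SD 0.2, so guessing the
# mean errs by about 20%; a model with MAPE 10% is about twice as good.
print(model_efficacy(sd=0.2, mean=1.0, mape=0.10))  # 2.0
```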

In summary, the method of the present invention predicts disease through neural-network-assisted data mining and big-data identification: disease-related data are input for data conversion, features are extracted from the disease data and encoded to obtain the data to be trained; the similarity between features of the data to be trained is computed with the Euclidean distance and the data are clustered to obtain a preliminary model; finally, disease-related data without data conversion are input and clustered by Euclidean distance with the preliminary model to obtain the trained model. Supervised learning is used first, then unsupervised learning, and finally neural-network applications such as reinforcement learning predict the disease, improving the precision and validity of the trained model, while the parameter-adjustment method used during model training improves the model's accuracy and stability. The object of the invention is thus indeed achieved.

However, the foregoing is merely a preferred embodiment of the present invention and should not limit the scope of its implementation; all simple equivalent changes and modifications made according to the claims and the description of the invention remain within the scope covered by this patent.

Claims (6)

1. A method for disease prediction through neural-network-assisted data mining and big-data identification, comprising the following steps: (A) inputting disease-related data for data conversion, extracting features from the disease data, and encoding the data to obtain data to be trained, wherein word-vector conversion is used to map each word in the disease data to a vector, so that all converted words span a vector space in which the vector distances between words can be computed, and different proportional weights are assigned to the features before data grouping; (B) computing the similarity between features of the data to be trained using the Euclidean distance and grouping the data to obtain a model to be trained, wherein the data grouping grades the data by disease severity; (C) inputting disease-related data that has not undergone data conversion together with the model to be trained, and grouping the data by Euclidean distance to obtain a trained model; and (D) verifying the trained model, comprising the following sub-steps: (d1) inputting the trained model; (d2) taking 1/4 of the total cases of the trained model for model training, then taking another 1/4 of the total cases for model verification; and (d3) taking 1/2 of the total cases for the final model training, and testing the resulting model on the other 1/2 of the total cases to confirm the accuracy of the model verification.
2. The method for disease prediction through neural-network-assisted data mining and big-data identification according to claim 1, wherein in step (d2), the next step is performed only after confirming that curve fitting can be carried out.

3. The method for disease prediction through neural-network-assisted data mining and big-data identification according to claim 2, wherein in steps (d2) and (d3), the cases of the trained model are sampled at random.

4. The method for disease prediction through neural-network-assisted data mining and big-data identification according to claim 3, wherein in step (d3), the accuracy of the model verification is confirmed through a 95% confidence interval.

5. The method for disease prediction through neural-network-assisted data mining and big-data identification according to claim 4, wherein in step (d2), input-layer feature weights are repeatedly deleted until one of the following three conditions occurs, at which point deletion stops: the summed weights of all remaining input-layer features exceed 95% while all such features must be included; after a feature is deleted, the 95% confidence intervals of the model error rates produced by five random samplings no longer share a range where all five overlap; or the 95% confidence interval of the error-rate differences among the five repeated models does not contain 0. After stopping, the input-layer features that remain are used as the input layer for the final model training.

6. The method for disease prediction through neural-network-assisted data mining and big-data identification according to claim 5, wherein in step (d3), the benefit of the model is established, and the accuracy of different models is compared, by satisfying the relation MAPE &lt; SD/mean, where SD (standard deviation) is the standard deviation of all true values in the database of the trained model, mean is the average of all true values in the database, and MAPE is the mean absolute percentage error of the prediction model.
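The case split of sub-steps (d2)–(d3) — a random quarter for initial training, a quarter for verification, then half for final training and the other half for testing — can be sketched as follows. The seed and case identifiers are illustrative assumptions:

```python
import random

def split_cases(cases, seed=0):
    """Split cases per the claimed scheme: 1/4 initial training,
    1/4 verification, then 1/2 final training and the remaining
    1/2 held out for testing, all drawn at random."""
    rng = random.Random(seed)
    shuffled = list(cases)
    rng.shuffle(shuffled)
    n = len(shuffled)
    q = n // 4
    return {
        "initial_train": shuffled[:q],        # 1/4 for model training (d2)
        "validate": shuffled[q:2 * q],        # 1/4 for model verification (d2)
        "final_train": shuffled[:2 * q],      # 1/2 for the final training (d3)
        "final_test": shuffled[2 * q:],       # the other 1/2 for testing (d3)
    }

parts = split_cases(range(100))
print(len(parts["initial_train"]), len(parts["validate"]),
      len(parts["final_train"]), len(parts["final_test"]))  # prints 25 25 50 50
```

Because the final training and test halves are disjoint, the accuracy measured on the test half (and its 95% confidence interval, as in claim 4) reflects performance on cases the model has never seen.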
TW107140554A 2018-11-15 2018-11-15 Disease prediction method through a big database formed by data mining of neural network TWI678709B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW107140554A TWI678709B (en) 2018-11-15 2018-11-15 Disease prediction method through a big database formed by data mining of neural network

Publications (2)

Publication Number Publication Date
TWI678709B true TWI678709B (en) 2019-12-01
TW202020885A TW202020885A (en) 2020-06-01

Family

ID=69582567

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI816078B (en) 2021-01-05 2023-09-21 財團法人工業技術研究院 Mining method for sample grouping

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100111396A1 (en) * 2008-11-06 2010-05-06 Los Alamos National Security Object and spatial level quantitative image analysis
CN105894088A (en) * 2016-03-25 2016-08-24 苏州赫博特医疗信息科技有限公司 Medical information extraction system and method based on depth learning and distributed semantic features
TW201643760A (en) * 2015-04-10 2016-12-16 Social Welfare Organization Saiseikai Imp Gift Foundation Inc Saiseikai Suita Hospital Method for discriminating symptom of hepatic disease
TW201822138A (en) * 2016-12-14 2018-06-16 中華電信股份有限公司 Personal health information system and method using neural network interception and classification
WO2018189549A1 (en) * 2017-04-12 2018-10-18 Kheiron Medical Technologies Ltd Assessment of density in mammography

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees